ai-assistant-discord-bot/docs/LOGGING.md

# Logging and Error Reporting

This document describes the comprehensive logging and error reporting system implemented for Claude Discord Coordinator.

## Overview

The bot uses structured JSON logging with rotation for production-grade observability and debugging. All errors are tracked with unique IDs and user-friendly messages are displayed in Discord without exposing internal implementation details.

## Components

### 1. Structured Logging (`logging_config.py`)

#### JSONFormatter

Outputs log records as JSON objects with consistent fields:

```json
{
    "timestamp": "2026-02-13T19:06:35.545176+00:00",
    "level": "INFO",
    "module": "bot",
    "function": "_handle_claude_request",
    "line": 123,
    "message": "Processing message in channel 12345 for project my-project",
    "channel_id": "12345",
    "session_id": "abc-def-ghi",
    "cost_usd": 0.05
}
```

**Fields:**
- `timestamp`: ISO 8601 UTC timestamp
- `level`: Log level (DEBUG, INFO, WARNING, ERROR, CRITICAL)
- `module`: Python module name
- `function`: Function name
- `line`: Source line number
- `message`: Log message
- `exception`: Full traceback (if exception occurred)
- Custom fields via `extra={}` parameter

#### ErrorTracker

Generates unique error IDs for tracking and support:

```python
from claude_coordinator.logging_config import ErrorTracker

error_id = ErrorTracker.generate_error_id()  # Returns 8-char ID like "abc12345"
```

Use `log_error_with_id()` to log with automatic error ID:

```python
error_id = ErrorTracker.log_error_with_id(
    logger,
    logging.ERROR,
    "Failed to process request",
    exc_info=exception,
    channel_id="12345",
    session_id="abc-def"
)
```

#### setup_logging()

Configures application-wide logging with rotation:

```python
from claude_coordinator.logging_config import setup_logging, get_log_directory

log_dir = get_log_directory()
setup_logging(
    log_level="INFO",
    log_file=str(log_dir / "bot.log"),
    json_logs=True,
    max_bytes=10 * 1024 * 1024,  # 10MB
    backup_count=5
)
```

**Parameters:**
- `log_level`: DEBUG, INFO, WARNING, ERROR, CRITICAL
- `log_file`: Path to log file (creates directory if needed)
- `json_logs`: Use JSON formatter (True) or plain text (False)
- `max_bytes`: Max file size before rotation (default: 10MB)
- `backup_count`: Number of rotated files to keep (default: 5)

### 2. Log File Locations

**Determined by `get_log_directory()`:**

1. **Production**: `/var/log/claude-coordinator/` (requires write permissions)
2. **Fallback**: `~/.claude-coordinator/logs/`

Log files:
- `bot.log` - Current log file
- `bot.log.1` - First rotated backup
- `bot.log.2` - Second rotated backup
- etc. (up to `backup_count`)

### 3. Discord Error Messages

User-friendly error messages without internal details:

```python
from claude_coordinator.logging_config import format_error_for_discord

discord_msg = format_error_for_discord(
    error_type="timeout",
    error_id="abc12345"
)
```

**Supported Error Types:**

| Error Type | Message | Use Case |
|------------|---------|----------|
| `timeout` | ⏱️ Request Timeout | Claude took too long |
| `claude_error` | ❌ Claude Error | Claude command failed |
| `parse_error` | ⚠️ Parse Error | Malformed JSON response |
| `config_error` | ⚙️ Configuration Error | Channel not configured |
| `permission_error` | 🔒 Permission Error | Insufficient permissions |
| `session_error` | 💾 Session Error | Session save/load failure |
| `lock_error` | 🔄 Processing Error | Lock acquisition failed |

**Example Output:**

```
⏱️ **Request Timeout**
Claude took longer than expected. Please try again.
If this continues, try a simpler request.

_Error ID: `abc12345`_
_Please reference this ID if requesting support._
```

### 4. Combined Error Logging and Formatting

Convenience function for common pattern:

```python
from claude_coordinator.logging_config import log_and_format_error

error_id, discord_msg = log_and_format_error(
    logger,
    error_type="timeout",
    message="Request timeout in channel",
    channel_id="12345",
    duration=300
)

# Logs to structured logs with error_id and context
# Returns tuple: (error_id, discord_friendly_message)

await channel.send(discord_msg)
```

## Key Events Logged

### Bot Lifecycle

- **Startup**: Version, config path, loaded projects
- **Ready**: Connected guilds, channels, bot user info
- **Shutdown**: Graceful vs crash, cleanup status

### Message Processing

- **Message Received**: Channel, user, message length, project
- **Project Matched**: Channel → project name mapping
- **Session State**: New vs resumed, session_id
- **Claude Command**: Full command for debugging (via extra fields)
- **Claude Subprocess**: Start, completion, duration, cost
- **Response Sent**: Chunk count, total chars, delivery status

### Errors (with Context)

All errors include:
- Error ID for support reference
- Channel ID and session ID
- Relevant context (duration, exit code, etc.)
- Full exception traceback (in logs only, not Discord)

### Performance Metrics

- **Request Duration**: Total time from message to response
- **Claude CLI Duration**: Subprocess execution time
- **Session Operations**: Database query times
- **Message Queue Depth**: Concurrent requests per channel

## Usage Examples

### Basic Logging

```python
import logging

logger = logging.getLogger(__name__)

# Simple log
logger.info("Processing message")

# Log with context
logger.info(
    "Processing message in channel",
    extra={
        'channel_id': '12345',
        'user_id': '67890',
        'message_length': 150
    }
)
```

### Error Handling with Tracking

```python
try:
    result = await some_operation()
except TimeoutError as e:
    error_id, discord_msg = log_and_format_error(
        logger,
        error_type="timeout",
        message="Operation timeout",
        error=e,
        channel_id=channel_id,
        duration=300
    )
    await channel.send(discord_msg)
```

### Performance Tracking

```python
import time

start_time = time.time()
result = await long_operation()
duration_ms = int((time.time() - start_time) * 1000)

logger.info(
    "Operation completed",
    extra={
        'operation': 'claude_cli',
        'duration_ms': duration_ms,
        'result_size': len(result)
    }
)
```

## Environment Variables

- `LOG_LEVEL`: Set log level (DEBUG, INFO, WARNING, ERROR, CRITICAL)
  - Default: INFO

```bash
# Development with debug logging
export LOG_LEVEL=DEBUG
python -m claude_coordinator.bot

# Production with info logging
export LOG_LEVEL=INFO
python -m claude_coordinator.bot
```

## Log Analysis

### Querying JSON Logs

Use `jq` to query structured logs:

```bash
# Get all errors
cat bot.log | jq 'select(.level == "ERROR")'

# Get errors for specific channel
cat bot.log | jq 'select(.channel_id == "12345" and .level == "ERROR")'

# Calculate average Claude CLI duration
cat bot.log | jq 'select(.claude_duration_ms) | .claude_duration_ms' | awk '{sum+=$1; count++} END {print sum/count}'

# Find all timeouts
cat bot.log | jq 'select(.message | contains("timeout"))'

# Get error IDs from today
cat bot.log | jq 'select(.error_id) | {timestamp, error_id, message}'
```

### Finding Errors by ID

```bash
# Find specific error ID in logs
cat bot.log* | jq 'select(.error_id == "abc12345")'

# Get full context for error
cat bot.log* | jq 'select(.error_id == "abc12345") | .'
```

### Performance Analysis

```bash
# Get slowest Claude CLI calls
cat bot.log | jq 'select(.claude_duration_ms) | {duration_ms: .claude_duration_ms, channel_id, session_id}' | jq -s 'sort_by(.duration_ms) | reverse | .[0:10]'

# Cost tracking
cat bot.log | jq 'select(.cost_usd) | .cost_usd' | awk '{sum+=$1} END {print "Total: $" sum}'
```

## Testing

Comprehensive test suite in `tests/test_logging.py`:

```bash
# Run logging tests
pytest tests/test_logging.py -v

# Manual testing
python test_logging_manual.py
```

**Test Coverage:**
- JSON formatter validation
- Required fields presence
- Exception handling
- Extra fields inclusion
- Error ID generation and uniqueness
- Discord message formatting
- Privacy protection (no internal details exposed)
- Log rotation configuration

## Privacy and Security

### What Gets Logged

**✓ Safe to log:**
- Channel IDs (public Discord identifiers)
- Session IDs (UUIDs)
- Timestamps, durations, costs
- Error messages and stack traces
- Operation types and results

**✗ Never log:**
- User message content (only length)
- Discord tokens
- Authentication credentials
- Personal identifiable information (PII)

### What Gets Shown in Discord

**✓ User-friendly messages:**
- Error type and description
- Error ID for support reference
- Actionable suggestions

**✗ Never exposed:**
- Stack traces
- File paths
- Internal error details
- Session IDs
- Database information

## Monitoring Recommendations

### Real-time Monitoring

```bash
# Follow logs in real-time
tail -f ~/.claude-coordinator/logs/bot.log | jq .

# Follow errors only
tail -f ~/.claude-coordinator/logs/bot.log | jq 'select(.level == "ERROR")'
```

### Alerting

Monitor for:
- High error rates (>5 errors/minute)
- Repeated timeouts (same channel)
- Session save failures
- Database errors
- Cost spikes (>$1/hour)

### Log Retention

- **Active log**: `bot.log` (up to 10MB)
- **Rotated logs**: `bot.log.1` through `bot.log.5`
- **Total retention**: ~50MB (5 rotations × 10MB)
- **Time retention**: Depends on activity (typically 1-7 days)

For longer retention, implement external log aggregation (e.g., rsyslog, Loki, CloudWatch).

## Troubleshooting

### Logs Not Appearing

1. Check log file path:
   ```bash
   python -c "from claude_coordinator.logging_config import get_log_directory; print(get_log_directory())"
   ```

2. Verify write permissions:
   ```bash
   mkdir -p ~/.claude-coordinator/logs
   touch ~/.claude-coordinator/logs/test.log
   ```

3. Check log level:
   ```bash
   export LOG_LEVEL=DEBUG
   ```

### Invalid JSON in Logs

If JSON parsing fails, check for:
- Multi-line log messages (should be escaped)
- Special characters in message strings
- Corrupted log rotation

### Log Rotation Not Working

Verify configuration:
```python
setup_logging(
    max_bytes=10 * 1024 * 1024,  # Must be positive integer
    backup_count=5  # Must be > 0 for rotation
)
```

## Future Enhancements

Potential improvements:
- [ ] Integrate with external logging service (Loki, CloudWatch)
- [ ] Add log aggregation dashboard
- [ ] Implement log compression for rotated files
- [ ] Add performance profiling hooks
- [ ] Create alerting rules for critical errors
- [ ] Add cost tracking and budget alerts

---

**Last Updated**: 2026-02-13
**Version**: MED-002 Implementation