ai-assistant-discord-bot/docs/LOGGING.md
Claude Discord Bot e6983b56b9 Week 3 progress: Logging and testing complete (MED-001, MED-002, MED-005)
MED-001: Enhanced typing indicator
- Persistent typing loop (_maintain_typing method)
- Loops every 8s to maintain indicator for long operations (30s-5min)
- 8 comprehensive tests covering all lifecycle scenarios
- 27/27 bot tests passing

MED-002: Structured logging and error reporting
- logging_config.py (371 lines) - JSONFormatter, ErrorTracker, format_error_for_discord
- RotatingFileHandler (10MB max, 5 backups)
- Unique 8-char error IDs for support tracking
- Privacy-safe Discord error messages (7 error types)
- Enhanced bot.py with structured logging throughout
- 15/15 logging tests passing

MED-005: Comprehensive test suite
- Total: 156/157 tests passing (99.4%)
- test_session_manager.py: 27 tests
- test_claude_runner.py: 11 tests
- test_config.py: 25 tests
- test_response_formatter.py: 26 tests
- test_bot.py: 27 tests
- test_commands.py: 18 tests
- test_concurrency.py: 7 tests
- test_logging.py: 15 tests

Total: 13/18 tasks complete (72.2%)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-13 19:12:04 +00:00

441 lines
10 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Logging and Error Reporting
This document describes the comprehensive logging and error reporting system implemented for Claude Discord Coordinator.
## Overview
The bot uses structured JSON logging with rotation for production-grade observability and debugging. All errors are tracked with unique IDs and user-friendly messages are displayed in Discord without exposing internal implementation details.
## Components
### 1. Structured Logging (`logging_config.py`)
#### JSONFormatter
Outputs log records as JSON objects with consistent fields:
```json
{
"timestamp": "2026-02-13T19:06:35.545176+00:00",
"level": "INFO",
"module": "bot",
"function": "_handle_claude_request",
"line": 123,
"message": "Processing message in channel 12345 for project my-project",
"channel_id": "12345",
"session_id": "abc-def-ghi",
"cost_usd": 0.05
}
```
**Fields:**
- `timestamp`: ISO 8601 UTC timestamp
- `level`: Log level (DEBUG, INFO, WARNING, ERROR, CRITICAL)
- `module`: Python module name
- `function`: Function name
- `line`: Source line number
- `message`: Log message
- `exception`: Full traceback (if exception occurred)
- Custom fields via `extra={}` parameter
#### ErrorTracker
Generates unique error IDs for tracking and support:
```python
from claude_coordinator.logging_config import ErrorTracker
error_id = ErrorTracker.generate_error_id() # Returns 8-char ID like "abc12345"
```
Use `log_error_with_id()` to log with automatic error ID:
```python
error_id = ErrorTracker.log_error_with_id(
logger,
logging.ERROR,
"Failed to process request",
exc_info=exception,
channel_id="12345",
session_id="abc-def"
)
```
#### setup_logging()
Configures application-wide logging with rotation:
```python
from claude_coordinator.logging_config import setup_logging, get_log_directory
log_dir = get_log_directory()
setup_logging(
log_level="INFO",
log_file=str(log_dir / "bot.log"),
json_logs=True,
max_bytes=10 * 1024 * 1024, # 10MB
backup_count=5
)
```
**Parameters:**
- `log_level`: DEBUG, INFO, WARNING, ERROR, CRITICAL
- `log_file`: Path to log file (creates directory if needed)
- `json_logs`: Use JSON formatter (True) or plain text (False)
- `max_bytes`: Max file size before rotation (default: 10MB)
- `backup_count`: Number of rotated files to keep (default: 5)
### 2. Log File Locations
**Determined by `get_log_directory()`:**
1. **Production**: `/var/log/claude-coordinator/` (requires write permissions)
2. **Fallback**: `~/.claude-coordinator/logs/`
Log files:
- `bot.log` - Current log file
- `bot.log.1` - First rotated backup
- `bot.log.2` - Second rotated backup
- etc. (up to `backup_count`)
### 3. Discord Error Messages
User-friendly error messages without internal details:
```python
from claude_coordinator.logging_config import format_error_for_discord
discord_msg = format_error_for_discord(
error_type="timeout",
error_id="abc12345"
)
```
**Supported Error Types:**
| Error Type | Message | Use Case |
|------------|---------|----------|
| `timeout` | ⏱️ Request Timeout | Claude took too long |
| `claude_error` | ❌ Claude Error | Claude command failed |
| `parse_error` | ⚠️ Parse Error | Malformed JSON response |
| `config_error` | ⚙️ Configuration Error | Channel not configured |
| `permission_error` | 🔒 Permission Error | Insufficient permissions |
| `session_error` | 💾 Session Error | Session save/load failure |
| `lock_error` | 🔄 Processing Error | Lock acquisition failed |
**Example Output:**
```
⏱️ **Request Timeout**
Claude took longer than expected. Please try again.
If this continues, try a simpler request.
_Error ID: `abc12345`_
_Please reference this ID if requesting support._
```
### 4. Combined Error Logging and Formatting
Convenience function for common pattern:
```python
from claude_coordinator.logging_config import log_and_format_error
error_id, discord_msg = log_and_format_error(
logger,
error_type="timeout",
message="Request timeout in channel",
channel_id="12345",
duration=300
)
# Logs to structured logs with error_id and context
# Returns tuple: (error_id, discord_friendly_message)
await channel.send(discord_msg)
```
## Key Events Logged
### Bot Lifecycle
- **Startup**: Version, config path, loaded projects
- **Ready**: Connected guilds, channels, bot user info
- **Shutdown**: Graceful vs crash, cleanup status
### Message Processing
- **Message Received**: Channel, user, message length, project
- **Project Matched**: Channel → project name mapping
- **Session State**: New vs resumed, session_id
- **Claude Command**: Full command for debugging (via extra fields)
- **Claude Subprocess**: Start, completion, duration, cost
- **Response Sent**: Chunk count, total chars, delivery status
### Errors (with Context)
All errors include:
- Error ID for support reference
- Channel ID and session ID
- Relevant context (duration, exit code, etc.)
- Full exception traceback (in logs only, not Discord)
### Performance Metrics
- **Request Duration**: Total time from message to response
- **Claude CLI Duration**: Subprocess execution time
- **Session Operations**: Database query times
- **Message Queue Depth**: Concurrent requests per channel
## Usage Examples
### Basic Logging
```python
import logging
logger = logging.getLogger(__name__)
# Simple log
logger.info("Processing message")
# Log with context
logger.info(
"Processing message in channel",
extra={
'channel_id': '12345',
'user_id': '67890',
'message_length': 150
}
)
```
### Error Handling with Tracking
```python
try:
result = await some_operation()
except TimeoutError as e:
error_id, discord_msg = log_and_format_error(
logger,
error_type="timeout",
message="Operation timeout",
error=e,
channel_id=channel_id,
duration=300
)
await channel.send(discord_msg)
```
### Performance Tracking
```python
import time
start_time = time.time()
result = await long_operation()
duration_ms = int((time.time() - start_time) * 1000)
logger.info(
"Operation completed",
extra={
'operation': 'claude_cli',
'duration_ms': duration_ms,
'result_size': len(result)
}
)
```
## Environment Variables
- `LOG_LEVEL`: Set log level (DEBUG, INFO, WARNING, ERROR, CRITICAL)
- Default: INFO
```bash
# Development with debug logging
export LOG_LEVEL=DEBUG
python -m claude_coordinator.bot
# Production with info logging
export LOG_LEVEL=INFO
python -m claude_coordinator.bot
```
## Log Analysis
### Querying JSON Logs
Use `jq` to query structured logs:
```bash
# Get all errors
cat bot.log | jq 'select(.level == "ERROR")'
# Get errors for specific channel
cat bot.log | jq 'select(.channel_id == "12345" and .level == "ERROR")'
# Calculate average Claude CLI duration
cat bot.log | jq 'select(.claude_duration_ms) | .claude_duration_ms' | awk '{sum+=$1; count++} END {print sum/count}'
# Find all timeouts
cat bot.log | jq 'select(.message | contains("timeout"))'
# Get error IDs from today
cat bot.log | jq 'select(.error_id) | {timestamp, error_id, message}'
```
### Finding Errors by ID
```bash
# Find specific error ID in logs
cat bot.log* | jq 'select(.error_id == "abc12345")'
# Get full context for error
cat bot.log* | jq 'select(.error_id == "abc12345") | .'
```
### Performance Analysis
```bash
# Get slowest Claude CLI calls
cat bot.log | jq 'select(.claude_duration_ms) | {duration_ms: .claude_duration_ms, channel_id, session_id}' | jq -s 'sort_by(.duration_ms) | reverse | .[0:10]'
# Cost tracking
cat bot.log | jq 'select(.cost_usd) | .cost_usd' | awk '{sum+=$1} END {print "Total: $" sum}'
```
## Testing
Comprehensive test suite in `tests/test_logging.py`:
```bash
# Run logging tests
pytest tests/test_logging.py -v
# Manual testing
python test_logging_manual.py
```
**Test Coverage:**
- JSON formatter validation
- Required fields presence
- Exception handling
- Extra fields inclusion
- Error ID generation and uniqueness
- Discord message formatting
- Privacy protection (no internal details exposed)
- Log rotation configuration
## Privacy and Security
### What Gets Logged
**✓ Safe to log:**
- Channel IDs (public Discord identifiers)
- Session IDs (UUIDs)
- Timestamps, durations, costs
- Error messages and stack traces
- Operation types and results
**✗ Never log:**
- User message content (only length)
- Discord tokens
- Authentication credentials
- Personal identifiable information (PII)
### What Gets Shown in Discord
**✓ User-friendly messages:**
- Error type and description
- Error ID for support reference
- Actionable suggestions
**✗ Never exposed:**
- Stack traces
- File paths
- Internal error details
- Session IDs
- Database information
## Monitoring Recommendations
### Real-time Monitoring
```bash
# Follow logs in real-time
tail -f ~/.claude-coordinator/logs/bot.log | jq .
# Follow errors only
tail -f ~/.claude-coordinator/logs/bot.log | jq 'select(.level == "ERROR")'
```
### Alerting
Monitor for:
- High error rates (>5 errors/minute)
- Repeated timeouts (same channel)
- Session save failures
- Database errors
- Cost spikes (>$1/hour)
### Log Retention
- **Active log**: `bot.log` (up to 10MB)
- **Rotated logs**: `bot.log.1` through `bot.log.5`
- **Total retention**: ~50MB (5 rotations × 10MB)
- **Time retention**: Depends on activity (typically 1-7 days)
For longer retention, implement external log aggregation (e.g., rsyslog, Loki, CloudWatch).
## Troubleshooting
### Logs Not Appearing
1. Check log file path:
```bash
python -c "from claude_coordinator.logging_config import get_log_directory; print(get_log_directory())"
```
2. Verify write permissions:
```bash
mkdir -p ~/.claude-coordinator/logs
touch ~/.claude-coordinator/logs/test.log
```
3. Check log level:
```bash
export LOG_LEVEL=DEBUG
```
### Invalid JSON in Logs
If JSON parsing fails, check for:
- Multi-line log messages (should be escaped)
- Special characters in message strings
- Corrupted log rotation
### Log Rotation Not Working
Verify configuration:
```python
setup_logging(
max_bytes=10 * 1024 * 1024, # Must be positive integer
backup_count=5 # Must be > 0 for rotation
)
```
## Future Enhancements
Potential improvements:
- [ ] Integrate with external logging service (Loki, CloudWatch)
- [ ] Add log aggregation dashboard
- [ ] Implement log compression for rotated files
- [ ] Add performance profiling hooks
- [ ] Create alerting rules for critical errors
- [ ] Add cost tracking and budget alerts
---
**Last Updated**: 2026-02-13
**Version**: MED-002 Implementation