MED-001: Enhanced typing indicator - Persistent typing loop (_maintain_typing method) - Loops every 8s to maintain indicator for long operations (30s-5min) - 8 comprehensive tests covering all lifecycle scenarios - 27/27 bot tests passing MED-002: Structured logging and error reporting - logging_config.py (371 lines) - JSONFormatter, ErrorTracker, format_error_for_discord - RotatingFileHandler (10MB max, 5 backups) - Unique 8-char error IDs for support tracking - Privacy-safe Discord error messages (7 error types) - Enhanced bot.py with structured logging throughout - 15/15 logging tests passing MED-005: Comprehensive test suite - Total: 156/157 tests passing (99.4%) - test_session_manager.py: 27 tests - test_claude_runner.py: 11 tests - test_config.py: 25 tests - test_response_formatter.py: 26 tests - test_bot.py: 27 tests - test_commands.py: 18 tests - test_concurrency.py: 7 tests - test_logging.py: 15 tests Total: 13/18 tasks complete (72.2%) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
441 lines
10 KiB
Markdown
441 lines
10 KiB
Markdown
# Logging and Error Reporting
|
||
|
||
This document describes the comprehensive logging and error reporting system implemented for Claude Discord Coordinator.
|
||
|
||
## Overview
|
||
|
||
The bot uses structured JSON logging with rotation for production-grade observability and debugging. All errors are tracked with unique IDs and user-friendly messages are displayed in Discord without exposing internal implementation details.
|
||
|
||
## Components
|
||
|
||
### 1. Structured Logging (`logging_config.py`)
|
||
|
||
#### JSONFormatter
|
||
|
||
Outputs log records as JSON objects with consistent fields:
|
||
|
||
```json
|
||
{
|
||
"timestamp": "2026-02-13T19:06:35.545176+00:00",
|
||
"level": "INFO",
|
||
"module": "bot",
|
||
"function": "_handle_claude_request",
|
||
"line": 123,
|
||
"message": "Processing message in channel 12345 for project my-project",
|
||
"channel_id": "12345",
|
||
"session_id": "abc-def-ghi",
|
||
"cost_usd": 0.05
|
||
}
|
||
```
|
||
|
||
**Fields:**
|
||
- `timestamp`: ISO 8601 UTC timestamp
|
||
- `level`: Log level (DEBUG, INFO, WARNING, ERROR, CRITICAL)
|
||
- `module`: Python module name
|
||
- `function`: Function name
|
||
- `line`: Source line number
|
||
- `message`: Log message
|
||
- `exception`: Full traceback (if exception occurred)
|
||
- Custom fields via `extra={}` parameter
|
||
|
||
#### ErrorTracker
|
||
|
||
Generates unique error IDs for tracking and support:
|
||
|
||
```python
|
||
from claude_coordinator.logging_config import ErrorTracker
|
||
|
||
error_id = ErrorTracker.generate_error_id() # Returns 8-char ID like "abc12345"
|
||
```
|
||
|
||
Use `log_error_with_id()` to log with automatic error ID:
|
||
|
||
```python
|
||
error_id = ErrorTracker.log_error_with_id(
|
||
logger,
|
||
logging.ERROR,
|
||
"Failed to process request",
|
||
exc_info=exception,
|
||
channel_id="12345",
|
||
session_id="abc-def"
|
||
)
|
||
```
|
||
|
||
#### setup_logging()
|
||
|
||
Configures application-wide logging with rotation:
|
||
|
||
```python
|
||
from claude_coordinator.logging_config import setup_logging, get_log_directory
|
||
|
||
log_dir = get_log_directory()
|
||
setup_logging(
|
||
log_level="INFO",
|
||
log_file=str(log_dir / "bot.log"),
|
||
json_logs=True,
|
||
max_bytes=10 * 1024 * 1024, # 10MB
|
||
backup_count=5
|
||
)
|
||
```
|
||
|
||
**Parameters:**
|
||
- `log_level`: DEBUG, INFO, WARNING, ERROR, CRITICAL
|
||
- `log_file`: Path to log file (creates directory if needed)
|
||
- `json_logs`: Use JSON formatter (True) or plain text (False)
|
||
- `max_bytes`: Max file size before rotation (default: 10MB)
|
||
- `backup_count`: Number of rotated files to keep (default: 5)
|
||
|
||
### 2. Log File Locations
|
||
|
||
**Determined by `get_log_directory()`:**
|
||
|
||
1. **Production**: `/var/log/claude-coordinator/` (requires write permissions)
|
||
2. **Fallback**: `~/.claude-coordinator/logs/`
|
||
|
||
Log files:
|
||
- `bot.log` - Current log file
|
||
- `bot.log.1` - First rotated backup
|
||
- `bot.log.2` - Second rotated backup
|
||
- etc. (up to `backup_count`)
|
||
|
||
### 3. Discord Error Messages
|
||
|
||
User-friendly error messages without internal details:
|
||
|
||
```python
|
||
from claude_coordinator.logging_config import format_error_for_discord
|
||
|
||
discord_msg = format_error_for_discord(
|
||
error_type="timeout",
|
||
error_id="abc12345"
|
||
)
|
||
```
|
||
|
||
**Supported Error Types:**
|
||
|
||
| Error Type | Message | Use Case |
|
||
|------------|---------|----------|
|
||
| `timeout` | ⏱️ Request Timeout | Claude took too long |
|
||
| `claude_error` | ❌ Claude Error | Claude command failed |
|
||
| `parse_error` | ⚠️ Parse Error | Malformed JSON response |
|
||
| `config_error` | ⚙️ Configuration Error | Channel not configured |
|
||
| `permission_error` | 🔒 Permission Error | Insufficient permissions |
|
||
| `session_error` | 💾 Session Error | Session save/load failure |
|
||
| `lock_error` | 🔄 Processing Error | Lock acquisition failed |
|
||
|
||
**Example Output:**
|
||
|
||
```
|
||
⏱️ **Request Timeout**
|
||
Claude took longer than expected. Please try again.
|
||
If this continues, try a simpler request.
|
||
|
||
_Error ID: `abc12345`_
|
||
_Please reference this ID if requesting support._
|
||
```
|
||
|
||
### 4. Combined Error Logging and Formatting
|
||
|
||
Convenience function for common pattern:
|
||
|
||
```python
|
||
from claude_coordinator.logging_config import log_and_format_error
|
||
|
||
error_id, discord_msg = log_and_format_error(
|
||
logger,
|
||
error_type="timeout",
|
||
message="Request timeout in channel",
|
||
channel_id="12345",
|
||
duration=300
|
||
)
|
||
|
||
# Logs to structured logs with error_id and context
|
||
# Returns tuple: (error_id, discord_friendly_message)
|
||
|
||
await channel.send(discord_msg)
|
||
```
|
||
|
||
## Key Events Logged
|
||
|
||
### Bot Lifecycle
|
||
|
||
- **Startup**: Version, config path, loaded projects
|
||
- **Ready**: Connected guilds, channels, bot user info
|
||
- **Shutdown**: Graceful vs crash, cleanup status
|
||
|
||
### Message Processing
|
||
|
||
- **Message Received**: Channel, user, message length, project
|
||
- **Project Matched**: Channel → project name mapping
|
||
- **Session State**: New vs resumed, session_id
|
||
- **Claude Command**: Full command for debugging (via extra fields)
|
||
- **Claude Subprocess**: Start, completion, duration, cost
|
||
- **Response Sent**: Chunk count, total chars, delivery status
|
||
|
||
### Errors (with Context)
|
||
|
||
All errors include:
|
||
- Error ID for support reference
|
||
- Channel ID and session ID
|
||
- Relevant context (duration, exit code, etc.)
|
||
- Full exception traceback (in logs only, not Discord)
|
||
|
||
### Performance Metrics
|
||
|
||
- **Request Duration**: Total time from message to response
|
||
- **Claude CLI Duration**: Subprocess execution time
|
||
- **Session Operations**: Database query times
|
||
- **Message Queue Depth**: Concurrent requests per channel
|
||
|
||
## Usage Examples
|
||
|
||
### Basic Logging
|
||
|
||
```python
|
||
import logging
|
||
|
||
logger = logging.getLogger(__name__)
|
||
|
||
# Simple log
|
||
logger.info("Processing message")
|
||
|
||
# Log with context
|
||
logger.info(
|
||
"Processing message in channel",
|
||
extra={
|
||
'channel_id': '12345',
|
||
'user_id': '67890',
|
||
'message_length': 150
|
||
}
|
||
)
|
||
```
|
||
|
||
### Error Handling with Tracking
|
||
|
||
```python
|
||
try:
|
||
result = await some_operation()
|
||
except TimeoutError as e:
|
||
error_id, discord_msg = log_and_format_error(
|
||
logger,
|
||
error_type="timeout",
|
||
message="Operation timeout",
|
||
error=e,
|
||
channel_id=channel_id,
|
||
duration=300
|
||
)
|
||
await channel.send(discord_msg)
|
||
```
|
||
|
||
### Performance Tracking
|
||
|
||
```python
|
||
import time
|
||
|
||
start_time = time.time()
|
||
result = await long_operation()
|
||
duration_ms = int((time.time() - start_time) * 1000)
|
||
|
||
logger.info(
|
||
"Operation completed",
|
||
extra={
|
||
'operation': 'claude_cli',
|
||
'duration_ms': duration_ms,
|
||
'result_size': len(result)
|
||
}
|
||
)
|
||
```
|
||
|
||
## Environment Variables
|
||
|
||
- `LOG_LEVEL`: Set log level (DEBUG, INFO, WARNING, ERROR, CRITICAL)
|
||
- Default: INFO
|
||
|
||
```bash
|
||
# Development with debug logging
|
||
export LOG_LEVEL=DEBUG
|
||
python -m claude_coordinator.bot
|
||
|
||
# Production with info logging
|
||
export LOG_LEVEL=INFO
|
||
python -m claude_coordinator.bot
|
||
```
|
||
|
||
## Log Analysis
|
||
|
||
### Querying JSON Logs
|
||
|
||
Use `jq` to query structured logs:
|
||
|
||
```bash
|
||
# Get all errors
|
||
cat bot.log | jq 'select(.level == "ERROR")'
|
||
|
||
# Get errors for specific channel
|
||
cat bot.log | jq 'select(.channel_id == "12345" and .level == "ERROR")'
|
||
|
||
# Calculate average Claude CLI duration
|
||
cat bot.log | jq 'select(.claude_duration_ms) | .claude_duration_ms' | awk '{sum+=$1; count++} END {print sum/count}'
|
||
|
||
# Find all timeouts
|
||
cat bot.log | jq 'select(.message | contains("timeout"))'
|
||
|
||
# Get error IDs from today
|
||
cat bot.log | jq 'select(.error_id) | {timestamp, error_id, message}'
|
||
```
|
||
|
||
### Finding Errors by ID
|
||
|
||
```bash
|
||
# Find specific error ID in logs
|
||
cat bot.log* | jq 'select(.error_id == "abc12345")'
|
||
|
||
# Get full context for error
|
||
cat bot.log* | jq 'select(.error_id == "abc12345") | .'
|
||
```
|
||
|
||
### Performance Analysis
|
||
|
||
```bash
|
||
# Get slowest Claude CLI calls
|
||
cat bot.log | jq 'select(.claude_duration_ms) | {duration_ms: .claude_duration_ms, channel_id, session_id}' | jq -s 'sort_by(.duration_ms) | reverse | .[0:10]'
|
||
|
||
# Cost tracking
|
||
cat bot.log | jq 'select(.cost_usd) | .cost_usd' | awk '{sum+=$1} END {print "Total: $" sum}'
|
||
```
|
||
|
||
## Testing
|
||
|
||
Comprehensive test suite in `tests/test_logging.py`:
|
||
|
||
```bash
|
||
# Run logging tests
|
||
pytest tests/test_logging.py -v
|
||
|
||
# Manual testing
|
||
python test_logging_manual.py
|
||
```
|
||
|
||
**Test Coverage:**
|
||
- JSON formatter validation
|
||
- Required fields presence
|
||
- Exception handling
|
||
- Extra fields inclusion
|
||
- Error ID generation and uniqueness
|
||
- Discord message formatting
|
||
- Privacy protection (no internal details exposed)
|
||
- Log rotation configuration
|
||
|
||
## Privacy and Security
|
||
|
||
### What Gets Logged
|
||
|
||
**✓ Safe to log:**
|
||
- Channel IDs (public Discord identifiers)
|
||
- Session IDs (UUIDs)
|
||
- Timestamps, durations, costs
|
||
- Error messages and stack traces
|
||
- Operation types and results
|
||
|
||
**✗ Never log:**
|
||
- User message content (only length)
|
||
- Discord tokens
|
||
- Authentication credentials
|
||
- Personal identifiable information (PII)
|
||
|
||
### What Gets Shown in Discord
|
||
|
||
**✓ User-friendly messages:**
|
||
- Error type and description
|
||
- Error ID for support reference
|
||
- Actionable suggestions
|
||
|
||
**✗ Never exposed:**
|
||
- Stack traces
|
||
- File paths
|
||
- Internal error details
|
||
- Session IDs
|
||
- Database information
|
||
|
||
## Monitoring Recommendations
|
||
|
||
### Real-time Monitoring
|
||
|
||
```bash
|
||
# Follow logs in real-time
|
||
tail -f ~/.claude-coordinator/logs/bot.log | jq .
|
||
|
||
# Follow errors only
|
||
tail -f ~/.claude-coordinator/logs/bot.log | jq 'select(.level == "ERROR")'
|
||
```
|
||
|
||
### Alerting
|
||
|
||
Monitor for:
|
||
- High error rates (>5 errors/minute)
|
||
- Repeated timeouts (same channel)
|
||
- Session save failures
|
||
- Database errors
|
||
- Cost spikes (>$1/hour)
|
||
|
||
### Log Retention
|
||
|
||
- **Active log**: `bot.log` (up to 10MB)
|
||
- **Rotated logs**: `bot.log.1` through `bot.log.5`
|
||
- **Total retention**: ~50MB (5 rotations × 10MB)
|
||
- **Time retention**: Depends on activity (typically 1-7 days)
|
||
|
||
For longer retention, implement external log aggregation (e.g., rsyslog, Loki, CloudWatch).
|
||
|
||
## Troubleshooting
|
||
|
||
### Logs Not Appearing
|
||
|
||
1. Check log file path:
|
||
```bash
|
||
python -c "from claude_coordinator.logging_config import get_log_directory; print(get_log_directory())"
|
||
```
|
||
|
||
2. Verify write permissions:
|
||
```bash
|
||
mkdir -p ~/.claude-coordinator/logs
|
||
touch ~/.claude-coordinator/logs/test.log
|
||
```
|
||
|
||
3. Check log level:
|
||
```bash
|
||
export LOG_LEVEL=DEBUG
|
||
```
|
||
|
||
### Invalid JSON in Logs
|
||
|
||
If JSON parsing fails, check for:
|
||
- Multi-line log messages (should be escaped)
|
||
- Special characters in message strings
|
||
- Corrupted log rotation
|
||
|
||
### Log Rotation Not Working
|
||
|
||
Verify configuration:
|
||
```python
|
||
setup_logging(
|
||
max_bytes=10 * 1024 * 1024, # Must be positive integer
|
||
backup_count=5 # Must be > 0 for rotation
|
||
)
|
||
```
|
||
|
||
## Future Enhancements
|
||
|
||
Potential improvements:
|
||
- [ ] Integrate with external logging service (Loki, CloudWatch)
|
||
- [ ] Add log aggregation dashboard
|
||
- [ ] Implement log compression for rotated files
|
||
- [ ] Add performance profiling hooks
|
||
- [ ] Create alerting rules for critical errors
|
||
- [ ] Add cost tracking and budget alerts
|
||
|
||
---
|
||
|
||
**Last Updated**: 2026-02-13
|
||
**Version**: MED-002 Implementation
|