# Logging and Error Reporting This document describes the comprehensive logging and error reporting system implemented for Claude Discord Coordinator. ## Overview The bot uses structured JSON logging with rotation for production-grade observability and debugging. All errors are tracked with unique IDs and user-friendly messages are displayed in Discord without exposing internal implementation details. ## Components ### 1. Structured Logging (`logging_config.py`) #### JSONFormatter Outputs log records as JSON objects with consistent fields: ```json { "timestamp": "2026-02-13T19:06:35.545176+00:00", "level": "INFO", "module": "bot", "function": "_handle_claude_request", "line": 123, "message": "Processing message in channel 12345 for project my-project", "channel_id": "12345", "session_id": "abc-def-ghi", "cost_usd": 0.05 } ``` **Fields:** - `timestamp`: ISO 8601 UTC timestamp - `level`: Log level (DEBUG, INFO, WARNING, ERROR, CRITICAL) - `module`: Python module name - `function`: Function name - `line`: Source line number - `message`: Log message - `exception`: Full traceback (if exception occurred) - Custom fields via `extra={}` parameter #### ErrorTracker Generates unique error IDs for tracking and support: ```python from claude_coordinator.logging_config import ErrorTracker error_id = ErrorTracker.generate_error_id() # Returns 8-char ID like "abc12345" ``` Use `log_error_with_id()` to log with automatic error ID: ```python error_id = ErrorTracker.log_error_with_id( logger, logging.ERROR, "Failed to process request", exc_info=exception, channel_id="12345", session_id="abc-def" ) ``` #### setup_logging() Configures application-wide logging with rotation: ```python from claude_coordinator.logging_config import setup_logging, get_log_directory log_dir = get_log_directory() setup_logging( log_level="INFO", log_file=str(log_dir / "bot.log"), json_logs=True, max_bytes=10 * 1024 * 1024, # 10MB backup_count=5 ) ``` **Parameters:** - `log_level`: DEBUG, INFO, WARNING, ERROR, CRITICAL - `log_file`: Path to log file (creates directory if needed) - `json_logs`: Use JSON formatter (True) or plain text (False) - `max_bytes`: Max file size before rotation (default: 10MB) - `backup_count`: Number of rotated files to keep (default: 5) ### 2. Log File Locations **Determined by `get_log_directory()`:** 1. **Production**: `/var/log/claude-coordinator/` (requires write permissions) 2. **Fallback**: `~/.claude-coordinator/logs/` Log files: - `bot.log` - Current log file - `bot.log.1` - First rotated backup - `bot.log.2` - Second rotated backup - etc. (up to `backup_count`) ### 3. Discord Error Messages User-friendly error messages without internal details: ```python from claude_coordinator.logging_config import format_error_for_discord discord_msg = format_error_for_discord( error_type="timeout", error_id="abc12345" ) ``` **Supported Error Types:** | Error Type | Message | Use Case | |------------|---------|----------| | `timeout` | ⏱️ Request Timeout | Claude took too long | | `claude_error` | ❌ Claude Error | Claude command failed | | `parse_error` | ⚠️ Parse Error | Malformed JSON response | | `config_error` | ⚙️ Configuration Error | Channel not configured | | `permission_error` | 🔒 Permission Error | Insufficient permissions | | `session_error` | 💾 Session Error | Session save/load failure | | `lock_error` | 🔄 Processing Error | Lock acquisition failed | **Example Output:** ``` ⏱️ **Request Timeout** Claude took longer than expected. Please try again. If this continues, try a simpler request. _Error ID: `abc12345`_ _Please reference this ID if requesting support._ ``` ### 4. Combined Error Logging and Formatting Convenience function for common pattern: ```python from claude_coordinator.logging_config import log_and_format_error error_id, discord_msg = log_and_format_error( logger, error_type="timeout", message="Request timeout in channel", channel_id="12345", duration=300 ) # Logs to structured logs with error_id and context # Returns tuple: (error_id, discord_friendly_message) await channel.send(discord_msg) ``` ## Key Events Logged ### Bot Lifecycle - **Startup**: Version, config path, loaded projects - **Ready**: Connected guilds, channels, bot user info - **Shutdown**: Graceful vs crash, cleanup status ### Message Processing - **Message Received**: Channel, user, message length, project - **Project Matched**: Channel → project name mapping - **Session State**: New vs resumed, session_id - **Claude Command**: Full command for debugging (via extra fields) - **Claude Subprocess**: Start, completion, duration, cost - **Response Sent**: Chunk count, total chars, delivery status ### Errors (with Context) All errors include: - Error ID for support reference - Channel ID and session ID - Relevant context (duration, exit code, etc.) - Full exception traceback (in logs only, not Discord) ### Performance Metrics - **Request Duration**: Total time from message to response - **Claude CLI Duration**: Subprocess execution time - **Session Operations**: Database query times - **Message Queue Depth**: Concurrent requests per channel ## Usage Examples ### Basic Logging ```python import logging logger = logging.getLogger(__name__) # Simple log logger.info("Processing message") # Log with context logger.info( "Processing message in channel", extra={ 'channel_id': '12345', 'user_id': '67890', 'message_length': 150 } ) ``` ### Error Handling with Tracking ```python try: result = await some_operation() except TimeoutError as e: error_id, discord_msg = log_and_format_error( logger, error_type="timeout", message="Operation timeout", error=e, channel_id=channel_id, duration=300 ) await channel.send(discord_msg) ``` ### Performance Tracking ```python import time start_time = time.time() result = await long_operation() duration_ms = int((time.time() - start_time) * 1000) logger.info( "Operation completed", extra={ 'operation': 'claude_cli', 'duration_ms': duration_ms, 'result_size': len(result) } ) ``` ## Environment Variables - `LOG_LEVEL`: Set log level (DEBUG, INFO, WARNING, ERROR, CRITICAL) - Default: INFO ```bash # Development with debug logging export LOG_LEVEL=DEBUG python -m claude_coordinator.bot # Production with info logging export LOG_LEVEL=INFO python -m claude_coordinator.bot ``` ## Log Analysis ### Querying JSON Logs Use `jq` to query structured logs: ```bash # Get all errors cat bot.log | jq 'select(.level == "ERROR")' # Get errors for specific channel cat bot.log | jq 'select(.channel_id == "12345" and .level == "ERROR")' # Calculate average Claude CLI duration cat bot.log | jq 'select(.claude_duration_ms) | .claude_duration_ms' | awk '{sum+=$1; count++} END {print sum/count}' # Find all timeouts cat bot.log | jq 'select(.message | contains("timeout"))' # Get error IDs from today cat bot.log | jq 'select(.error_id) | {timestamp, error_id, message}' ``` ### Finding Errors by ID ```bash # Find specific error ID in logs cat bot.log* | jq 'select(.error_id == "abc12345")' # Get full context for error cat bot.log* | jq 'select(.error_id == "abc12345") | .' ``` ### Performance Analysis ```bash # Get slowest Claude CLI calls cat bot.log | jq 'select(.claude_duration_ms) | {duration_ms: .claude_duration_ms, channel_id, session_id}' | jq -s 'sort_by(.duration_ms) | reverse | .[0:10]' # Cost tracking cat bot.log | jq 'select(.cost_usd) | .cost_usd' | awk '{sum+=$1} END {print "Total: $" sum}' ``` ## Testing Comprehensive test suite in `tests/test_logging.py`: ```bash # Run logging tests pytest tests/test_logging.py -v # Manual testing python test_logging_manual.py ``` **Test Coverage:** - JSON formatter validation - Required fields presence - Exception handling - Extra fields inclusion - Error ID generation and uniqueness - Discord message formatting - Privacy protection (no internal details exposed) - Log rotation configuration ## Privacy and Security ### What Gets Logged **✓ Safe to log:** - Channel IDs (public Discord identifiers) - Session IDs (UUIDs) - Timestamps, durations, costs - Error messages and stack traces - Operation types and results **✗ Never log:** - User message content (only length) - Discord tokens - Authentication credentials - Personal identifiable information (PII) ### What Gets Shown in Discord **✓ User-friendly messages:** - Error type and description - Error ID for support reference - Actionable suggestions **✗ Never exposed:** - Stack traces - File paths - Internal error details - Session IDs - Database information ## Monitoring Recommendations ### Real-time Monitoring ```bash # Follow logs in real-time tail -f ~/.claude-coordinator/logs/bot.log | jq . # Follow errors only tail -f ~/.claude-coordinator/logs/bot.log | jq 'select(.level == "ERROR")' ``` ### Alerting Monitor for: - High error rates (>5 errors/minute) - Repeated timeouts (same channel) - Session save failures - Database errors - Cost spikes (>$1/hour) ### Log Retention - **Active log**: `bot.log` (up to 10MB) - **Rotated logs**: `bot.log.1` through `bot.log.5` - **Total retention**: ~50MB (5 rotations × 10MB) - **Time retention**: Depends on activity (typically 1-7 days) For longer retention, implement external log aggregation (e.g., rsyslog, Loki, CloudWatch). ## Troubleshooting ### Logs Not Appearing 1. Check log file path: ```bash python -c "from claude_coordinator.logging_config import get_log_directory; print(get_log_directory())" ``` 2. Verify write permissions: ```bash mkdir -p ~/.claude-coordinator/logs touch ~/.claude-coordinator/logs/test.log ``` 3. Check log level: ```bash export LOG_LEVEL=DEBUG ``` ### Invalid JSON in Logs If JSON parsing fails, check for: - Multi-line log messages (should be escaped) - Special characters in message strings - Corrupted log rotation ### Log Rotation Not Working Verify configuration: ```python setup_logging( max_bytes=10 * 1024 * 1024, # Must be positive integer backup_count=5 # Must be > 0 for rotation ) ``` ## Future Enhancements Potential improvements: - [ ] Integrate with external logging service (Loki, CloudWatch) - [ ] Add log aggregation dashboard - [ ] Implement log compression for rotated files - [ ] Add performance profiling hooks - [ ] Create alerting rules for critical errors - [ ] Add cost tracking and budget alerts --- **Last Updated**: 2026-02-13 **Version**: MED-002 Implementation