ai-assistant-discord-bot/docs/LOGGING.md
Claude Discord Bot e6983b56b9 Week 3 progress: Logging and testing complete (MED-001, MED-002, MED-005)
MED-001: Enhanced typing indicator
- Persistent typing loop (_maintain_typing method)
- Loops every 8s to maintain indicator for long operations (30s-5min)
- 8 comprehensive tests covering all lifecycle scenarios
- 27/27 bot tests passing

MED-002: Structured logging and error reporting
- logging_config.py (371 lines) - JSONFormatter, ErrorTracker, format_error_for_discord
- RotatingFileHandler (10MB max, 5 backups)
- Unique 8-char error IDs for support tracking
- Privacy-safe Discord error messages (7 error types)
- Enhanced bot.py with structured logging throughout
- 15/15 logging tests passing

MED-005: Comprehensive test suite
- Total: 156/157 tests passing (99.4%)
- test_session_manager.py: 27 tests
- test_claude_runner.py: 11 tests
- test_config.py: 25 tests
- test_response_formatter.py: 26 tests
- test_bot.py: 27 tests
- test_commands.py: 18 tests
- test_concurrency.py: 7 tests
- test_logging.py: 15 tests

Total: 13/18 tasks complete (72.2%)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-13 19:12:04 +00:00

10 KiB
Raw Permalink Blame History

Logging and Error Reporting

This document describes the comprehensive logging and error reporting system implemented for Claude Discord Coordinator.

Overview

The bot uses structured JSON logging with rotation for production-grade observability and debugging. All errors are tracked with unique IDs and user-friendly messages are displayed in Discord without exposing internal implementation details.

Components

1. Structured Logging (logging_config.py)

JSONFormatter

Outputs log records as JSON objects with consistent fields:

{
    "timestamp": "2026-02-13T19:06:35.545176+00:00",
    "level": "INFO",
    "module": "bot",
    "function": "_handle_claude_request",
    "line": 123,
    "message": "Processing message in channel 12345 for project my-project",
    "channel_id": "12345",
    "session_id": "abc-def-ghi",
    "cost_usd": 0.05
}

Fields:

  • timestamp: ISO 8601 UTC timestamp
  • level: Log level (DEBUG, INFO, WARNING, ERROR, CRITICAL)
  • module: Python module name
  • function: Function name
  • line: Source line number
  • message: Log message
  • exception: Full traceback (if exception occurred)
  • Custom fields via extra={} parameter

ErrorTracker

Generates unique error IDs for tracking and support:

from claude_coordinator.logging_config import ErrorTracker

error_id = ErrorTracker.generate_error_id()  # Returns 8-char ID like "abc12345"

Use log_error_with_id() to log with automatic error ID:

error_id = ErrorTracker.log_error_with_id(
    logger,
    logging.ERROR,
    "Failed to process request",
    exc_info=exception,
    channel_id="12345",
    session_id="abc-def"
)

setup_logging()

Configures application-wide logging with rotation:

from claude_coordinator.logging_config import setup_logging, get_log_directory

log_dir = get_log_directory()
setup_logging(
    log_level="INFO",
    log_file=str(log_dir / "bot.log"),
    json_logs=True,
    max_bytes=10 * 1024 * 1024,  # 10MB
    backup_count=5
)

Parameters:

  • log_level: DEBUG, INFO, WARNING, ERROR, CRITICAL
  • log_file: Path to log file (creates directory if needed)
  • json_logs: Use JSON formatter (True) or plain text (False)
  • max_bytes: Max file size before rotation (default: 10MB)
  • backup_count: Number of rotated files to keep (default: 5)

2. Log File Locations

Determined by get_log_directory():

  1. Production: /var/log/claude-coordinator/ (requires write permissions)
  2. Fallback: ~/.claude-coordinator/logs/

Log files:

  • bot.log - Current log file
  • bot.log.1 - First rotated backup
  • bot.log.2 - Second rotated backup
  • etc. (up to backup_count)

3. Discord Error Messages

User-friendly error messages without internal details:

from claude_coordinator.logging_config import format_error_for_discord

discord_msg = format_error_for_discord(
    error_type="timeout",
    error_id="abc12345"
)

Supported Error Types:

Error Type Message Use Case
timeout ⏱️ Request Timeout Claude took too long
claude_error Claude Error Claude command failed
parse_error ⚠️ Parse Error Malformed JSON response
config_error ⚙️ Configuration Error Channel not configured
permission_error 🔒 Permission Error Insufficient permissions
session_error 💾 Session Error Session save/load failure
lock_error 🔄 Processing Error Lock acquisition failed

Example Output:

⏱️ **Request Timeout**
Claude took longer than expected. Please try again.
If this continues, try a simpler request.

_Error ID: `abc12345`_
_Please reference this ID if requesting support._

4. Combined Error Logging and Formatting

Convenience function for common pattern:

from claude_coordinator.logging_config import log_and_format_error

error_id, discord_msg = log_and_format_error(
    logger,
    error_type="timeout",
    message="Request timeout in channel",
    channel_id="12345",
    duration=300
)

# Logs to structured logs with error_id and context
# Returns tuple: (error_id, discord_friendly_message)

await channel.send(discord_msg)

Key Events Logged

Bot Lifecycle

  • Startup: Version, config path, loaded projects
  • Ready: Connected guilds, channels, bot user info
  • Shutdown: Graceful vs crash, cleanup status

Message Processing

  • Message Received: Channel, user, message length, project
  • Project Matched: Channel → project name mapping
  • Session State: New vs resumed, session_id
  • Claude Command: Full command for debugging (via extra fields)
  • Claude Subprocess: Start, completion, duration, cost
  • Response Sent: Chunk count, total chars, delivery status

Errors (with Context)

All errors include:

  • Error ID for support reference
  • Channel ID and session ID
  • Relevant context (duration, exit code, etc.)
  • Full exception traceback (in logs only, not Discord)

Performance Metrics

  • Request Duration: Total time from message to response
  • Claude CLI Duration: Subprocess execution time
  • Session Operations: Database query times
  • Message Queue Depth: Concurrent requests per channel

Usage Examples

Basic Logging

import logging

logger = logging.getLogger(__name__)

# Simple log
logger.info("Processing message")

# Log with context
logger.info(
    "Processing message in channel",
    extra={
        'channel_id': '12345',
        'user_id': '67890',
        'message_length': 150
    }
)

Error Handling with Tracking

try:
    result = await some_operation()
except TimeoutError as e:
    error_id, discord_msg = log_and_format_error(
        logger,
        error_type="timeout",
        message="Operation timeout",
        error=e,
        channel_id=channel_id,
        duration=300
    )
    await channel.send(discord_msg)

Performance Tracking

import time

start_time = time.time()
result = await long_operation()
duration_ms = int((time.time() - start_time) * 1000)

logger.info(
    "Operation completed",
    extra={
        'operation': 'claude_cli',
        'duration_ms': duration_ms,
        'result_size': len(result)
    }
)

Environment Variables

  • LOG_LEVEL: Set log level (DEBUG, INFO, WARNING, ERROR, CRITICAL)
    • Default: INFO
# Development with debug logging
export LOG_LEVEL=DEBUG
python -m claude_coordinator.bot

# Production with info logging
export LOG_LEVEL=INFO
python -m claude_coordinator.bot

Log Analysis

Querying JSON Logs

Use jq to query structured logs:

# Get all errors
cat bot.log | jq 'select(.level == "ERROR")'

# Get errors for specific channel
cat bot.log | jq 'select(.channel_id == "12345" and .level == "ERROR")'

# Calculate average Claude CLI duration
cat bot.log | jq 'select(.claude_duration_ms) | .claude_duration_ms' | awk '{sum+=$1; count++} END {print sum/count}'

# Find all timeouts
cat bot.log | jq 'select(.message | contains("timeout"))'

# Get error IDs from today
cat bot.log | jq 'select(.error_id) | {timestamp, error_id, message}'

Finding Errors by ID

# Find specific error ID in logs
cat bot.log* | jq 'select(.error_id == "abc12345")'

# Get full context for error
cat bot.log* | jq 'select(.error_id == "abc12345") | .'

Performance Analysis

# Get slowest Claude CLI calls
cat bot.log | jq 'select(.claude_duration_ms) | {duration_ms: .claude_duration_ms, channel_id, session_id}' | jq -s 'sort_by(.duration_ms) | reverse | .[0:10]'

# Cost tracking
cat bot.log | jq 'select(.cost_usd) | .cost_usd' | awk '{sum+=$1} END {print "Total: $" sum}'

Testing

Comprehensive test suite in tests/test_logging.py:

# Run logging tests
pytest tests/test_logging.py -v

# Manual testing
python test_logging_manual.py

Test Coverage:

  • JSON formatter validation
  • Required fields presence
  • Exception handling
  • Extra fields inclusion
  • Error ID generation and uniqueness
  • Discord message formatting
  • Privacy protection (no internal details exposed)
  • Log rotation configuration

Privacy and Security

What Gets Logged

✓ Safe to log:

  • Channel IDs (public Discord identifiers)
  • Session IDs (UUIDs)
  • Timestamps, durations, costs
  • Error messages and stack traces
  • Operation types and results

✗ Never log:

  • User message content (only length)
  • Discord tokens
  • Authentication credentials
  • Personal identifiable information (PII)

What Gets Shown in Discord

✓ User-friendly messages:

  • Error type and description
  • Error ID for support reference
  • Actionable suggestions

✗ Never exposed:

  • Stack traces
  • File paths
  • Internal error details
  • Session IDs
  • Database information

Monitoring Recommendations

Real-time Monitoring

# Follow logs in real-time
tail -f ~/.claude-coordinator/logs/bot.log | jq .

# Follow errors only
tail -f ~/.claude-coordinator/logs/bot.log | jq 'select(.level == "ERROR")'

Alerting

Monitor for:

  • High error rates (>5 errors/minute)
  • Repeated timeouts (same channel)
  • Session save failures
  • Database errors
  • Cost spikes (>$1/hour)

Log Retention

  • Active log: bot.log (up to 10MB)
  • Rotated logs: bot.log.1 through bot.log.5
  • Total retention: ~50MB (5 rotations × 10MB)
  • Time retention: Depends on activity (typically 1-7 days)

For longer retention, implement external log aggregation (e.g., rsyslog, Loki, CloudWatch).

Troubleshooting

Logs Not Appearing

  1. Check log file path:

    python -c "from claude_coordinator.logging_config import get_log_directory; print(get_log_directory())"
    
  2. Verify write permissions:

    mkdir -p ~/.claude-coordinator/logs
    touch ~/.claude-coordinator/logs/test.log
    
  3. Check log level:

    export LOG_LEVEL=DEBUG
    

Invalid JSON in Logs

If JSON parsing fails, check for:

  • Multi-line log messages (should be escaped)
  • Special characters in message strings
  • Corrupted log rotation

Log Rotation Not Working

Verify configuration:

setup_logging(
    max_bytes=10 * 1024 * 1024,  # Must be positive integer
    backup_count=5  # Must be > 0 for rotation
)

Future Enhancements

Potential improvements:

  • Integrate with external logging service (Loki, CloudWatch)
  • Add log aggregation dashboard
  • Implement log compression for rotated files
  • Add performance profiling hooks
  • Create alerting rules for critical errors
  • Add cost tracking and budget alerts

Last Updated: 2026-02-13
Version: MED-002 Implementation