claude-home/monitoring/CONTEXT.md
Cal Corum 3b2e031f45 Update monitoring docs with Uptime Kuma monitors and Discord alerts
Document all 20 active monitors with targets and tags, Discord
notification configuration, and API access details for programmatic
management via uptime-kuma-api.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-07 23:02:49 -06:00

System Monitoring and Alerting - Technology Context

Overview

Comprehensive monitoring and alerting for home lab infrastructure, focused on automated health checks, Discord notifications, and proactive system maintenance.

Architecture Patterns

Distributed Monitoring Strategy

Pattern: Service-specific monitoring with centralized alerting

  • Uptime Kuma: Centralized service uptime and health monitoring (status page)
  • Tdarr Monitoring: API-based transcoding health checks
  • Windows Desktop Monitoring: Reboot detection and system events
  • Network Monitoring: Connectivity and service availability
  • Container Monitoring: Docker/Podman health and resource usage

Alert Management

Pattern: Structured notifications with actionable information

# Discord webhook integration
curl -X POST "$DISCORD_WEBHOOK" \
  -H "Content-Type: application/json" \
  -d '{
    "content": "**System Alert**\n```\nService: Tdarr\nIssue: Staging timeout\nAction: Automatic cleanup performed\n```\n<@user_id>"
  }'
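
The same alert can be built and sent from Python. This is a minimal sketch using only the standard library; the helper names `build_alert_payload` and `send_alert` are hypothetical, and the explicit `User-Agent` header is an assumption added because some endpoints reject the default urllib agent.

```python
import json
import urllib.request

FENCE = "`" * 3  # Discord code-block delimiter, built here to keep the source readable


def build_alert_payload(service: str, issue: str, action: str, user_id: str = "") -> dict:
    """Build a webhook payload in the same layout as the curl example."""
    body = f"Service: {service}\nIssue: {issue}\nAction: {action}"
    content = f"**System Alert**\n{FENCE}\n{body}\n{FENCE}"
    if user_id:
        content += f"\n<@{user_id}>"  # mention a user only when an ID is given
    return {"content": content}


def send_alert(webhook_url: str, payload: dict, timeout: float = 10.0) -> int:
    """POST the payload to the webhook and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "User-Agent": "homelab-monitor"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Building the payload separately from sending it keeps the formatting testable without network access.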

Core Monitoring Components

Tdarr System Monitoring

Purpose: Monitor transcoding pipeline health and performance
Location: scripts/tdarr_monitor.py

Key Features:

  • API-based status monitoring with dataclass structures
  • Staging section timeout detection and cleanup
  • Discord notifications with professional formatting
  • Log rotation and retention management
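
The staging timeout check can be sketched with a dataclass. This is not the code in scripts/tdarr_monitor.py: the `StagedFile` fields, the `find_timed_out` helper, and the one-hour timeout are all assumptions chosen for illustration, and no Tdarr API call is shown.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class StagedFile:
    """One file in the staging section (field names are hypothetical)."""
    path: str
    staged_at: datetime


def find_timed_out(files: list[StagedFile], now: datetime, timeout: timedelta) -> list[StagedFile]:
    """Return the files that have been staged for longer than `timeout`."""
    return [f for f in files if now - f.staged_at > timeout]


now = datetime(2026, 2, 7, 12, 0)
files = [
    StagedFile("/media/a.mkv", now - timedelta(minutes=5)),
    StagedFile("/media/b.mkv", now - timedelta(hours=3)),
]
stuck = find_timed_out(files, now, timedelta(hours=1))
print([f.path for f in stuck])  # ['/media/b.mkv']
```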

Windows Desktop Monitoring

Purpose: Track Windows system reboots and power events
Location: scripts/windows-desktop/

Components:

  • PowerShell monitoring script
  • Scheduled task automation
  • Discord notification integration
  • System event correlation

Uptime Kuma (Centralized Uptime Monitoring)

Purpose: Centralized service uptime, health checks, and status page for all homelab services
Location: LXC 227 (10.10.0.227), Docker container
URL: https://status.manticorum.com (internal: http://10.10.0.227:3001)

Key Features:

  • HTTP/HTTPS, TCP, DNS, Docker, and ping monitoring
  • Discord notification integration (default alert channel for all monitors)
  • Public status page at https://status.manticorum.com
  • Multi-protocol health checks at 60-second intervals with 3 retries
  • Certificate expiration monitoring

Infrastructure:

  • Proxmox LXC 227, Ubuntu 22.04, 2 cores, 2GB RAM, 8GB disk
  • Docker with AppArmor unconfined (required for Docker-in-LXC)
  • Data persisted via Docker named volume (uptime-kuma-data)
  • Compose config: server-configs/uptime-kuma/docker-compose/uptime-kuma/
  • SSH alias: uptime-kuma
  • Admin credentials: username cal, password in ~/.claude/secrets/kuma_web_password

Active Monitors (20):

| Tag | Monitor | Type | Target |
|-----|---------|------|--------|
| Infrastructure | Proxmox VE | HTTP | https://10.10.0.11:8006 |
| Infrastructure | Home Assistant | HTTP | http://10.0.0.28:8123 |
| DNS | Pi-hole Primary DNS | DNS | 10.10.0.16:53 |
| DNS | Pi-hole Secondary DNS | DNS | 10.10.0.226:53 |
| Media | Jellyfin | HTTP | http://10.10.0.226:8096 |
| Media | Tdarr | HTTP | http://10.10.0.226:8265 |
| Media | Sonarr | HTTP | http://10.10.0.221:8989 |
| Media | Radarr | HTTP | http://10.10.0.221:7878 |
| Media | Jellyseerr | HTTP | http://10.10.0.221:5055 |
| DevOps | Gitea | HTTP | http://10.10.0.225:3000 |
| DevOps | n8n | HTTP | http://10.10.0.210:5678 |
| Networking | NPM Local (Admin) | HTTP | http://10.10.0.16:81 |
| Networking | Pi-hole Primary Web | HTTP | http://10.10.0.16:81/admin |
| Networking | Pi-hole Secondary Web | HTTP | http://10.10.0.226:8053/admin |
| Gaming | Foundry VTT | HTTP | http://10.10.0.223:30000 |
| AI | OpenClaw Gateway | HTTP | http://10.10.0.224:18789 |
| Bots | discord-bots VM | Ping | 10.10.0.33 |
| Bots | sba-bots VM | Ping | 10.10.0.88 |
| Database | PostgreSQL (strat-database) | TCP | 10.10.0.42:5432 |
| External | Akamai NPM | HTTP | http://172.237.147.99 |

Notifications:

  • Discord webhook: "Discord - Homelab Alerts" (default, applied to all monitors)
  • Alerts on service down (after 3 retries at 30s intervals) and on recovery
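
The down-after-retries and recovery behaviour can be illustrated with a small state tracker. This is a sketch of the general pattern, not Uptime Kuma's implementation; the `RetryTracker` name is hypothetical, and it assumes the first failed check is followed by three retries before the outage is confirmed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RetryTracker:
    """Tracks consecutive failures for one monitor."""
    max_retries: int = 3
    failures: int = 0
    is_down: bool = False

    def record(self, check_ok: bool) -> Optional[str]:
        """Record one check result; return "down", "recovered", or None."""
        if check_ok:
            self.failures = 0
            if self.is_down:
                self.is_down = False
                return "recovered"
            return None
        self.failures += 1
        # The initial failure plus `max_retries` failed retries confirms the outage.
        if not self.is_down and self.failures > self.max_retries:
            self.is_down = True
            return "down"
        return None
```

Each state change yields exactly one event, so a service that stays down does not produce a new alert on every failed check.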

API Access:

  • Python library: uptime-kuma-api (pip installed)
  • Connection: UptimeKumaApi("http://10.10.0.227:3001")
  • Used for programmatic monitor/notification management
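
A minimal sketch of listing monitors through uptime-kuma-api, using the connection URL, username, and password file documented above. The `summarize_monitors` and `main` helpers are hypothetical, and the `name` and `active` keys are assumed to be present in the dictionaries returned by `get_monitors()`.

```python
from pathlib import Path

KUMA_URL = "http://10.10.0.227:3001"
PASSWORD_FILE = Path.home() / ".claude/secrets/kuma_web_password"


def summarize_monitors(monitors: list[dict]) -> list[str]:
    """Format one sorted line per monitor dictionary."""
    lines = []
    for monitor in monitors:
        state = "active" if monitor.get("active") else "paused"
        lines.append(f"{monitor.get('name', '?')}: {state}")
    return sorted(lines)


def main() -> None:
    # Imported here so the helper above works without the third-party package.
    from uptime_kuma_api import UptimeKumaApi

    password = PASSWORD_FILE.read_text().strip()
    api = UptimeKumaApi(KUMA_URL)
    try:
        api.login("cal", password)
        for line in summarize_monitors(api.get_monitors()):
            print(line)
    finally:
        api.disconnect()  # always close the connection so the process can exit
```

Call `main()` from a host that can reach the internal address; the formatting helper can be exercised on its own with plain dictionaries.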

Network and Service Monitoring

Purpose: Monitor critical infrastructure availability
Implementation:

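# Service health check pattern
# send_alert is assumed to be a helper defined elsewhere that reads the
# alert text on stdin (for example, a wrapper around the Discord webhook).
SERVICES="https://homelab.local http://nas.homelab.local"
for service in $SERVICES; do
    if curl -sSf --max-time 10 "$service" >/dev/null 2>&1; then
        echo "✅ $service: Available"
    else
        echo "❌ $service: Failed" | send_alert
    fi
done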

Automation Patterns

Cron-Based Scheduling

Pattern: Regular health checks with intelligent alerting

# Monitoring schedule examples
*/20 * * * * /path/to/tdarr-timeout-monitor.sh    # Every 20 minutes
0 */6 * * * /path/to/cleanup-temp-dirs.sh         # Every 6 hours
0 2 * * * /path/to/backup-monitor.sh              # Daily at 2 AM

Event-Driven Monitoring

Pattern: Reactive monitoring for critical events

  • System Startup: Windows boot detection
  • Service Failures: Container restart alerts
  • Resource Exhaustion: Disk space warnings
  • Security Events: Failed login attempts
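
As one example of these triggers, a disk space warning can be sketched with the standard library. The `disk_warning` helper and the 90% default threshold are assumptions for illustration, not values taken from the existing scripts.

```python
import shutil
from typing import Optional


def disk_warning(path: str, threshold_percent: float = 90.0) -> Optional[str]:
    """Return a warning string when usage at `path` reaches the threshold."""
    usage = shutil.disk_usage(path)
    used_percent = usage.used / usage.total * 100
    if used_percent >= threshold_percent:
        return f"Disk usage at {path} is {used_percent:.1f}% (threshold {threshold_percent:.0f}%)"
    return None
```

Returning the message instead of sending it lets the caller decide whether to log it, post it to Discord, or both.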

Data Collection and Analysis

Log Management

Pattern: Centralized logging with rotation

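# Log rotation configuration
LOG_FILE="/var/log/homelab-monitor.log"
MAX_SIZE_BYTES=$((10 * 1024 * 1024))   # 10 MiB
RETENTION_DAYS=30

# Rotate the log once it exceeds the size limit (GNU stat syntax)
if [ -f "$LOG_FILE" ] && [ "$(stat -c%s "$LOG_FILE")" -gt "$MAX_SIZE_BYTES" ]; then
    mv "$LOG_FILE" "$LOG_FILE.$(date +%Y%m%d)"
    touch "$LOG_FILE"
fi

# Delete rotated logs older than the retention period
find "$(dirname "$LOG_FILE")" -name "$(basename "$LOG_FILE").*" -mtime +"$RETENTION_DAYS" -delete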
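
For Python-based monitors, size-based rotation is also available from the standard `logging` module. This is a sketch under assumptions: the `build_logger` helper and the backup count of 5 are hypothetical, and `RotatingFileHandler` limits retention by file count, not by age in days.

```python
import logging
from logging.handlers import RotatingFileHandler


def build_logger(log_file: str, max_bytes: int = 10 * 1024 * 1024, backups: int = 5) -> logging.Logger:
    """Create a logger that rotates `log_file` once it reaches `max_bytes`."""
    logger = logging.getLogger("homelab-monitor")
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid adding duplicate handlers on repeated calls
        handler = RotatingFileHandler(log_file, maxBytes=max_bytes, backupCount=backups)
        handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
    return logger
```

Rotated files are named with numeric suffixes (`.1`, `.2`, and so on), and the oldest is dropped once the backup count is exceeded.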

Metrics Collection

Pattern: Time-series data for trend analysis

  • System Metrics: CPU, memory, disk usage
  • Service Metrics: Response times, error rates
  • Application Metrics: Transcoding progress, queue sizes
  • Network Metrics: Bandwidth usage, latency
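
A minimal sketch of collecting one of these metrics as time-series data: response times appended to a JSON Lines file and averaged per service. The file layout, the field names, and both helpers are assumptions for illustration; a dedicated time-series store would replace this for anything beyond a small home lab.

```python
import json
import time
from statistics import mean
from typing import Optional


def record_sample(path: str, service: str, response_ms: float, timestamp: Optional[float] = None) -> None:
    """Append one timestamped response-time sample as a JSON line."""
    sample = {
        "ts": time.time() if timestamp is None else timestamp,
        "service": service,
        "response_ms": response_ms,
    }
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(sample) + "\n")


def average_response_ms(path: str, service: str) -> Optional[float]:
    """Return the mean response time for `service`, or None if no samples exist."""
    with open(path, encoding="utf-8") as handle:
        values = [s["response_ms"] for s in map(json.loads, handle) if s["service"] == service]
    return mean(values) if values else None
```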

Alert Integration

Discord Notification System

Pattern: Rich, actionable notifications

# Professional alert format
**🔧 System Maintenance**
Service: Tdarr Transcoding
Issue: 3 files timed out in staging
Resolution: Automatic cleanup completed
Status: System operational

Manual review recommended <@user_id>

Alert Escalation

Pattern: Tiered alerting based on severity

  1. Info: Routine maintenance completed
  2. Warning: Service degradation detected
  3. Critical: Service failure requiring immediate attention
  4. Emergency: System-wide failure requiring manual intervention
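
The four tiers map naturally onto an ordered enum. In this sketch the `Severity` and `format_alert` names are hypothetical, and the rule of mentioning a user only from Critical upward is an assumption, not a documented policy.

```python
from enum import IntEnum


class Severity(IntEnum):
    INFO = 1
    WARNING = 2
    CRITICAL = 3
    EMERGENCY = 4


def format_alert(severity: Severity, message: str, user_id: str = "") -> str:
    """Prefix the message with its tier; mention the user from CRITICAL upward."""
    text = f"[{severity.name}] {message}"
    if severity >= Severity.CRITICAL and user_id:
        text += f" <@{user_id}>"
    return text


print(format_alert(Severity.INFO, "Cleanup done", "42"))       # [INFO] Cleanup done
print(format_alert(Severity.CRITICAL, "Jellyfin down", "42"))  # [CRITICAL] Jellyfin down <@42>
```

Using `IntEnum` makes tier comparisons such as `severity >= Severity.CRITICAL` read the same way as the escalation order in the list.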

Best Practices Implementation

Monitoring Strategy

  1. Proactive: Monitor trends to predict issues
  2. Reactive: Alert on current failures
  3. Preventive: Automated cleanup and maintenance
  4. Comprehensive: Cover all critical services
  5. Actionable: Provide clear resolution paths

Performance Optimization

  1. Efficient Polling: Balance monitoring frequency with resource usage
  2. Smart Alerting: Avoid alert fatigue with intelligent filtering
  3. Resource Management: Monitor the monitoring system itself
  4. Scalable Architecture: Design for growth and additional services
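
One simple way to limit alert fatigue is a per-alert cooldown. This is a sketch, not an existing component: the `AlertThrottle` class and the 15-minute default are assumptions chosen for illustration.

```python
import time
from typing import Optional


class AlertThrottle:
    """Suppress repeats of the same alert key within a cooldown window."""

    def __init__(self, cooldown_seconds: float = 900.0) -> None:
        self.cooldown_seconds = cooldown_seconds
        self._last_sent: dict[str, float] = {}

    def should_send(self, key: str, now: Optional[float] = None) -> bool:
        """Return True and record the send time unless `key` is still cooling down."""
        now = time.monotonic() if now is None else now
        last = self._last_sent.get(key)
        if last is not None and now - last < self.cooldown_seconds:
            return False
        self._last_sent[key] = now
        return True
```

Keying the cooldown on the alert (for example, the service name) means a noisy service is throttled without hiding the first alert from a different one.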

This technology context provides the foundation for implementing comprehensive monitoring and alerting in home lab environments.