claude-home/monitoring/CONTEXT.md
Cal Corum a35891b565 Add Uptime Kuma service monitoring on LXC 227
Deploy Uptime Kuma for centralized service uptime monitoring at
https://status.manticorum.com. Proxmox LXC 227 (10.10.0.227) running
Ubuntu 22.04 with Docker. Updated monitoring documentation, CLAUDE.md
context loading rules, and server-configs host inventory.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-07 22:18:51 -06:00

System Monitoring and Alerting - Technology Context

Overview

Monitoring and alerting for home lab infrastructure, focused on automated health checks, Discord notifications, and proactive system maintenance.

Architecture Patterns

Distributed Monitoring Strategy

Pattern: Service-specific monitoring with centralized alerting

  • Uptime Kuma: Centralized service uptime and health monitoring (status page)
  • Tdarr Monitoring: API-based transcoding health checks
  • Windows Desktop Monitoring: Reboot detection and system events
  • Network Monitoring: Connectivity and service availability
  • Container Monitoring: Docker/Podman health and resource usage

Alert Management

Pattern: Structured notifications with actionable information

# Discord webhook integration
curl -X POST "$DISCORD_WEBHOOK" \
  -H "Content-Type: application/json" \
  -d '{
    "content": "**System Alert**\n```\nService: Tdarr\nIssue: Staging timeout\nAction: Automatic cleanup performed\n```\n<@user_id>"
  }'
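Hand-writing the JSON body becomes fragile once the message text comes from variables that may contain quotes or newlines. The following is a minimal sketch, assuming bash and the same `DISCORD_WEBHOOK` environment variable; the helper names `json_escape`, `build_alert_payload`, and `send_discord_alert` are illustrative and are not taken from the existing scripts.

```bash
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not taken from the repository.

# Escape a string for use inside a JSON string literal
# (handles backslash, double quote, newline, and tab).
json_escape() {
    local s=$1
    s=${s//\\/\\\\}
    s=${s//\"/\\\"}
    s=${s//$'\n'/\\n}
    s=${s//$'\t'/\\t}
    printf '%s' "$s"
}

# Build the webhook body from three fields: service, issue, action.
build_alert_payload() {
    local body
    body=$(printf '**System Alert**\nService: %s\nIssue: %s\nAction: %s' "$1" "$2" "$3")
    printf '{"content": "%s"}' "$(json_escape "$body")"
}

# Post the alert; does nothing (returns 0) when DISCORD_WEBHOOK is unset.
send_discord_alert() {
    [ -n "${DISCORD_WEBHOOK:-}" ] || return 0
    curl -sS -X POST "$DISCORD_WEBHOOK" \
        -H "Content-Type: application/json" \
        -d "$(build_alert_payload "$@")"
}
```

A call such as `send_discord_alert "Tdarr" "Staging timeout" "Automatic cleanup performed"` would then produce the same kind of message as the raw `curl` command, with the escaping handled in one place.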

Core Monitoring Components

Tdarr System Monitoring

Purpose: Monitor transcoding pipeline health and performance
Location: scripts/tdarr_monitor.py

Key Features:

  • API-based status monitoring with dataclass structures
  • Staging section timeout detection and cleanup
  • Discord notifications with professional formatting
  • Log rotation and retention management

Windows Desktop Monitoring

Purpose: Track Windows system reboots and power events
Location: scripts/windows-desktop/

Components:

  • PowerShell monitoring script
  • Scheduled task automation
  • Discord notification integration
  • System event correlation

Uptime Kuma (Centralized Uptime Monitoring)

Purpose: Centralized service uptime, health checks, and status page for all homelab services
Location: LXC 227 (10.10.0.227), Docker container
URL: https://status.manticorum.com (internal: http://10.10.0.227:3001)

Key Features:

  • HTTP/HTTPS, TCP, DNS, Docker, and ping monitoring
  • Built-in Discord notification support
  • Public/private status pages
  • Multi-protocol health checks with configurable intervals
  • Certificate expiration monitoring

Infrastructure:

  • Proxmox LXC 227, Ubuntu 22.04, 2 cores, 2GB RAM, 8GB disk
  • Docker with AppArmor unconfined (required for Docker-in-LXC)
  • Data persisted via Docker named volume (uptime-kuma-data)
  • Compose config: server-configs/uptime-kuma/docker-compose/uptime-kuma/
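For orientation, the container described above could be started with a plain `docker run` along the following lines. This is a sketch, not the contents of the compose file in server-configs: the image tag `louislam/uptime-kuma:1`, the container name, and the restart policy are assumptions, and only port 3001 and the `uptime-kuma-data` volume come from this document. The command is printed by default and executed only when `APPLY=1` is set.

```bash
#!/usr/bin/env bash
# Assumed values (not stated in this document): image tag, container name, restart policy.
IMAGE="louislam/uptime-kuma:1"
NAME="uptime-kuma"

# Port 3001 and the uptime-kuma-data named volume match the setup described above;
# /app/data is where the upstream image keeps its data.
cmd=(docker run -d --name "$NAME" --restart unless-stopped
     -p 3001:3001
     -v uptime-kuma-data:/app/data
     "$IMAGE")

# Dry run by default: print the command; execute it only when APPLY=1.
printf '%s ' "${cmd[@]}"; echo
if [ "${APPLY:-0}" = "1" ]; then
    "${cmd[@]}"
fi
```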

Recommended Monitors:

  • All Docker hosts: Jellyfin, Tdarr, n8n, Gitea, Foundry, Pi-holes, NPM, Discord bots
  • Databases: strat-database PostgreSQL instances
  • External: Akamai services, SBA website
  • Infrastructure: Proxmox API, Home Assistant

Network and Service Monitoring

Purpose: Monitor critical infrastructure availability
Implementation:

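# Service health check pattern
# send_alert is a placeholder that reads the message on stdin;
# replace it with a real notifier such as a Discord webhook call.
send_alert() { cat >&2; }

SERVICES="https://homelab.local http://nas.homelab.local"
for service in $SERVICES; do
    # -f treats HTTP errors (>= 400) as failures; --max-time bounds a hung connection
    if curl -sSf --max-time 10 "$service" >/dev/null 2>&1; then
        echo "✅ $service: Available"
    else
        echo "❌ $service: Failed" | send_alert
    fi
done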

Automation Patterns

Cron-Based Scheduling

Pattern: Regular health checks with intelligent alerting

# Monitoring schedule examples
*/20 * * * * /path/to/tdarr-timeout-monitor.sh    # Every 20 minutes
0 */6 * * * /path/to/cleanup-temp-dirs.sh         # Every 6 hours
0 2 * * * /path/to/backup-monitor.sh              # Daily at 2 AM

Event-Driven Monitoring

Pattern: Reactive monitoring for critical events

  • System Startup: Windows boot detection
  • Service Failures: Container restart alerts
  • Resource Exhaustion: Disk space warnings
  • Security Events: Failed login attempts
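As one concrete case from the list above, a disk space warning can be sketched as follows. The 90% default threshold and the function names `disk_usage_percent` and `check_disk` are assumptions chosen for illustration.

```bash
#!/usr/bin/env bash
# Sketch only: threshold and function names are illustrative.

# Print the used-space percentage (integer, no % sign) for the filesystem holding $1.
disk_usage_percent() {
    df -P "$1" | awk 'NR == 2 { gsub(/%/, "", $5); print $5 }'
}

# Print a warning and return 1 when usage is at or above the threshold (default 90).
# Returns 2 if the usage could not be determined.
check_disk() {
    local path=$1 threshold=${2:-90} used
    used=$(disk_usage_percent "$path")
    [ -n "$used" ] || return 2
    if [ "$used" -ge "$threshold" ]; then
        echo "WARNING: $path is ${used}% full (threshold ${threshold}%)"
        return 1
    fi
    echo "OK: $path is ${used}% full"
}

check_disk / 90 || true
```

The non-zero return code lets a caller decide whether to notify, for example by piping the warning line into whatever alert helper is in use.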

Data Collection and Analysis

Log Management

Pattern: Centralized logging with rotation

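# Log rotation configuration
LOG_FILE="/var/log/homelab-monitor.log"
MAX_SIZE_BYTES=$((10 * 1024 * 1024))   # 10 MiB
RETENTION_DAYS=30

# Rotate the log once it exceeds the size limit (stat -c%s is GNU coreutils syntax)
if [ -f "$LOG_FILE" ] && [ "$(stat -c%s "$LOG_FILE")" -gt "$MAX_SIZE_BYTES" ]; then
    # Include the time so a second rotation on the same day does not overwrite the first
    mv "$LOG_FILE" "$LOG_FILE.$(date +%Y%m%d-%H%M%S)"
    touch "$LOG_FILE"
fi

# Delete rotated logs older than the retention period
find "$(dirname "$LOG_FILE")" -maxdepth 1 -name "$(basename "$LOG_FILE").*" \
    -mtime +"$RETENTION_DAYS" -delete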

Metrics Collection

Pattern: Time-series data for trend analysis

  • System Metrics: CPU, memory, disk usage
  • Service Metrics: Response times, error rates
  • Application Metrics: Transcoding progress, queue sizes
  • Network Metrics: Bandwidth usage, latency
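A simple way to build up time-series data for the system metrics above is to append one timestamped sample per run to a CSV file. The sketch below is Linux-specific because it reads `/proc`; the output path `/tmp/homelab-metrics.csv`, the column set, and the function name `collect_sample` are assumptions.

```bash
#!/usr/bin/env bash
# Sketch only: Linux-specific (/proc); file path and column set are assumptions.

# Emit one CSV sample:
# epoch seconds, 1-minute load average, memory used %, root disk used %.
collect_sample() {
    local ts load mem disk
    ts=$(date +%s)
    load=$(cut -d' ' -f1 /proc/loadavg)
    mem=$(awk '/^MemTotal:/ { t = $2 } /^MemAvailable:/ { a = $2 }
               END { if (t > 0) printf "%d", (t - a) * 100 / t }' /proc/meminfo)
    disk=$(df -P / | awk 'NR == 2 { gsub(/%/, "", $5); print $5 }')
    printf '%s,%s,%s,%s\n' "$ts" "$load" "$mem" "$disk"
}

METRICS_FILE="${METRICS_FILE:-/tmp/homelab-metrics.csv}"
collect_sample >> "$METRICS_FILE"
```

Run from cron at a fixed interval, this yields a file that can be plotted or scanned for trends such as steadily rising disk usage.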

Alert Integration

Discord Notification System

Pattern: Rich, actionable notifications

# Professional alert format
**🔧 System Maintenance**
Service: Tdarr Transcoding
Issue: 3 files timed out in staging
Resolution: Automatic cleanup completed
Status: System operational

Manual review recommended <@user_id>

Alert Escalation

Pattern: Tiered alerting based on severity

  1. Info: Routine maintenance completed
  2. Warning: Service degradation detected
  3. Critical: Service failure requiring immediate attention
  4. Emergency: System-wide failure requiring manual intervention
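The four tiers above can be turned into a small filter so that scripts decide consistently whether to notify and whether to ping someone. This is a sketch under stated assumptions: the `MIN_SEVERITY` variable, the function names, and the choice to mention a user only at Critical and above are illustrative, not existing behavior.

```bash
#!/usr/bin/env bash
# Sketch only: function names and the MIN_SEVERITY variable are illustrative.

# Map a severity name (case-insensitive) to its tier number (1 = Info ... 4 = Emergency).
severity_rank() {
    case "$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')" in
        info)      echo 1 ;;
        warning)   echo 2 ;;
        critical)  echo 3 ;;
        emergency) echo 4 ;;
        *)         return 1 ;;
    esac
}

# Succeed when the alert's severity meets the configured minimum (default: warning).
should_notify() {
    local level min
    level=$(severity_rank "$1") || return 1
    min=$(severity_rank "${MIN_SEVERITY:-warning}") || return 1
    [ "$level" -ge "$min" ]
}

# Assumed policy: mention a user only for Critical and above.
mention_for() {
    local level
    level=$(severity_rank "$1") || return 1
    if [ "$level" -ge 3 ]; then
        echo "<@user_id>"
    fi
}
```

With `MIN_SEVERITY=warning`, routine Info messages are dropped, Warning alerts are posted without a mention, and Critical or Emergency alerts carry the `<@user_id>` ping.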

Best Practices Implementation

Monitoring Strategy

  1. Proactive: Monitor trends to predict issues
  2. Reactive: Alert on current failures
  3. Preventive: Automated cleanup and maintenance
  4. Comprehensive: Cover all critical services
  5. Actionable: Provide clear resolution paths

Performance Optimization

  1. Efficient Polling: Balance monitoring frequency with resource usage
  2. Smart Alerting: Avoid alert fatigue with intelligent filtering
  3. Resource Management: Monitor the monitoring system itself
  4. Scalable Architecture: Design for growth and additional services
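One common way to reduce alert fatigue, as mentioned under Smart Alerting, is a per-alert cooldown: the same alert is suppressed if it already fired recently. A minimal sketch follows; the state directory, the one-hour default, and the function name `should_alert` are assumptions.

```bash
#!/usr/bin/env bash
# Sketch only: state directory, key format, and default cooldown are assumptions.
STATE_DIR="${STATE_DIR:-/tmp/homelab-alert-state}"
COOLDOWN_SECONDS="${COOLDOWN_SECONDS:-3600}"

# Succeed (and record the time) if no alert for this key was sent within the cooldown.
# Keys are used as file names, so keep them to letters, digits, dashes, and underscores.
should_alert() {
    local key=$1 now last=0 stamp
    stamp="$STATE_DIR/$key.last"
    now=$(date +%s)
    mkdir -p "$STATE_DIR"
    if [ -f "$stamp" ]; then
        last=$(cat "$stamp")
    fi
    if [ $((now - last)) -lt "$COOLDOWN_SECONDS" ]; then
        return 1    # still inside the cooldown window: suppress
    fi
    echo "$now" > "$stamp"
}
```

A check script would guard its notification with something like `should_alert "tdarr-staging" && echo "Staging timeout" | send_alert`, where `send_alert` stands for whichever notifier is in use.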

This technology context provides the foundation for implementing comprehensive monitoring and alerting in home lab environments.