Cal Corum 3b2e031f45 Update monitoring docs with Uptime Kuma monitors and Discord alerts

Document all 20 active monitors with targets and tags, Discord
notification configuration, and API access details for programmatic
management via uptime-kuma-api.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-02-07 23:02:49 -06:00


System Monitoring and Alerting - Technology Context

Overview

Monitoring and alerting system for the home lab infrastructure, focused on automated health checks, Discord notifications, and proactive system maintenance.

Architecture Patterns

Distributed Monitoring Strategy

Pattern: Service-specific monitoring with centralized alerting

Uptime Kuma: Centralized service uptime and health monitoring (status page)
Tdarr Monitoring: API-based transcoding health checks
Windows Desktop Monitoring: Reboot detection and system events
Network Monitoring: Connectivity and service availability
Container Monitoring: Docker/Podman health and resource usage

Alert Management

Pattern: Structured notifications with actionable information

# Discord webhook integration
curl -X POST "$DISCORD_WEBHOOK" \
  -H "Content-Type: application/json" \
  -d '{
    "content": "**System Alert**\n```\nService: Tdarr\nIssue: Staging timeout\nAction: Automatic cleanup performed\n```\n<@user_id>"
  }'
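
For scripts written in Python, the same webhook call can be sketched with the standard library alone. The function names, the `User-Agent` value, and the message fields below are illustrative assumptions, not taken from the existing scripts.

```python
import json
import urllib.request

FENCE = "`" * 3  # Markdown code fence, built here to keep this listing readable


def build_alert_payload(service: str, issue: str, action: str, mention_id: str = "") -> dict:
    """Build the JSON body for a Discord webhook message."""
    details = f"Service: {service}\nIssue: {issue}\nAction: {action}"
    content = f"**System Alert**\n{FENCE}\n{details}\n{FENCE}"
    if mention_id:
        content += f"\n<@{mention_id}>"  # mention a user by numeric ID
    return {"content": content}


def send_alert(webhook_url: str, payload: dict, timeout: float = 10.0) -> int:
    """POST the payload to the webhook and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumption: an explicit User-Agent, since some endpoints
            # reject the default urllib one.
            "User-Agent": "homelab-monitor/1.0",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Keeping payload construction separate from the network call means the message format can be checked without sending anything to Discord.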

Core Monitoring Components

Tdarr System Monitoring

Purpose: Monitor transcoding pipeline health and performance
Location: scripts/tdarr_monitor.py

Key Features:

API-based status monitoring with dataclass structures
Staging section timeout detection and cleanup
Discord notifications with professional formatting
Log rotation and retention management
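
As a rough illustration of the dataclass and staging-timeout features listed above, the sketch below models staged files and filters out the ones that have waited too long. It does not reproduce scripts/tdarr_monitor.py: the `StagedFile` fields, the function name, and the timeout value in the usage note are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass(frozen=True)
class StagedFile:
    """A file waiting in the staging section (hypothetical fields)."""
    path: str
    staged_at: datetime  # timezone-aware time the file entered staging


def find_timed_out(
    files: List[StagedFile],
    timeout: timedelta,
    now: Optional[datetime] = None,
) -> List[StagedFile]:
    """Return the files that have been staged for longer than `timeout`."""
    current = now if now is not None else datetime.now(timezone.utc)
    return [f for f in files if current - f.staged_at > timeout]
```

For example, `find_timed_out(files, timedelta(minutes=30))` would return the candidates for cleanup under an assumed 30-minute limit; passing `now` explicitly makes the check repeatable.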

Windows Desktop Monitoring

Purpose: Track Windows system reboots and power events
Location: scripts/windows-desktop/

Components:

PowerShell monitoring script
Scheduled task automation
Discord notification integration
System event correlation

Uptime Kuma (Centralized Uptime Monitoring)

Purpose: Centralized service uptime, health checks, and status page for all homelab services
Location: LXC 227 (10.10.0.227), Docker container
URL: https://status.manticorum.com (internal: http://10.10.0.227:3001)

Key Features:

HTTP/HTTPS, TCP, DNS, Docker, and ping monitoring
Discord notification integration (default alert channel for all monitors)
Public status page at https://status.manticorum.com
Multi-protocol health checks every 60 seconds, with 3 retries before a monitor is reported down
Certificate expiration monitoring

Infrastructure:

Proxmox LXC 227, Ubuntu 22.04, 2 cores, 2GB RAM, 8GB disk
Docker with AppArmor unconfined (required for Docker-in-LXC)
Data persisted via Docker named volume (uptime-kuma-data)
Compose config: server-configs/uptime-kuma/docker-compose/uptime-kuma/
SSH alias: uptime-kuma
Admin credentials: username cal, password in ~/.claude/secrets/kuma_web_password

Active Monitors (20):

Tag	Monitor	Type	Target
Infrastructure	Proxmox VE	HTTP	https://10.10.0.11:8006
Infrastructure	Home Assistant	HTTP	http://10.0.0.28:8123
DNS	Pi-hole Primary DNS	DNS	10.10.0.16:53
DNS	Pi-hole Secondary DNS	DNS	10.10.0.226:53
Media	Jellyfin	HTTP	http://10.10.0.226:8096
Media	Tdarr	HTTP	http://10.10.0.226:8265
Media	Sonarr	HTTP	http://10.10.0.221:8989
Media	Radarr	HTTP	http://10.10.0.221:7878
Media	Jellyseerr	HTTP	http://10.10.0.221:5055
DevOps	Gitea	HTTP	http://10.10.0.225:3000
DevOps	n8n	HTTP	http://10.10.0.210:5678
Networking	NPM Local (Admin)	HTTP	http://10.10.0.16:81
Networking	Pi-hole Primary Web	HTTP	http://10.10.0.16:81/admin
Networking	Pi-hole Secondary Web	HTTP	http://10.10.0.226:8053/admin
Gaming	Foundry VTT	HTTP	http://10.10.0.223:30000
AI	OpenClaw Gateway	HTTP	http://10.10.0.224:18789
Bots	discord-bots VM	Ping	10.10.0.33
Bots	sba-bots VM	Ping	10.10.0.88
Database	PostgreSQL (strat-database)	TCP	10.10.0.42:5432
External	Akamai NPM	HTTP	http://172.237.147.99

Notifications:

Discord webhook: "Discord - Homelab Alerts" (default, applied to all monitors)
Alerts when a service goes down (after 3 failed retries, 30 seconds apart) and again when it recovers
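
The retry rule above (a monitor is only reported down after 3 retries spaced 30 seconds apart) can be illustrated with a small sketch. This is not Uptime Kuma's implementation; `tcp_check` and `check_with_retries` are hypothetical helpers that show the same idea for a TCP target.

```python
import socket
import time
from typing import Callable


def tcp_check(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_with_retries(
    check: Callable[[], bool],
    retries: int = 3,
    retry_interval: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Run `check` once, then up to `retries` more times before giving up."""
    for attempt in range(retries + 1):
        if check():
            return True
        if attempt < retries:
            sleep(retry_interval)  # wait before the next retry
    return False
```

With the defaults, a failing target is declared down after four attempts and roughly 90 seconds of waiting, for example `check_with_retries(lambda: tcp_check("10.10.0.42", 5432))` for the PostgreSQL monitor.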

API Access:

Python library: uptime-kuma-api (pip installed)
Connection: UptimeKumaApi("http://10.10.0.227:3001")
Used for programmatic monitor/notification management
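
A minimal sketch of programmatic access, assuming the `login`, `get_monitors`, and `disconnect` calls of uptime-kuma-api and that each returned monitor dict carries `name` and `active` keys. The helper names are hypothetical, and the library is imported lazily so the pure helper can be used without it installed.

```python
from typing import Dict, List


def paused_monitor_names(monitors: List[Dict]) -> List[str]:
    """Return the sorted names of monitors whose `active` flag is false."""
    return sorted(m["name"] for m in monitors if not m.get("active", True))


def fetch_paused_monitors(url: str, username: str, password: str) -> List[str]:
    """Log in to Uptime Kuma and list the monitors that are paused."""
    # Third-party dependency, imported lazily: pip install uptime-kuma-api
    from uptime_kuma_api import UptimeKumaApi

    api = UptimeKumaApi(url)
    try:
        api.login(username, password)
        return paused_monitor_names(api.get_monitors())
    finally:
        api.disconnect()  # always close the socket.io connection


# Example call (not executed here); the password would be read from
# ~/.claude/secrets/kuma_web_password rather than hard-coded:
# fetch_paused_monitors("http://10.10.0.227:3001", "cal", password)
```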

Network and Service Monitoring

Purpose: Monitor critical infrastructure availability
Implementation:

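# Service health check pattern
# send_alert is assumed to be a helper defined elsewhere that reads a
# message on stdin and forwards it to the Discord webhook.
SERVICES="https://homelab.local http://nas.homelab.local"
for service in $SERVICES; do
    # -f makes curl fail on HTTP 4xx/5xx; --max-time bounds the whole request
    if curl -sSf --max-time 10 "$service" >/dev/null 2>&1; then
        echo "✅ $service: Available"
    else
        echo "❌ $service: Failed" | send_alert
    fi
done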

Automation Patterns

Cron-Based Scheduling

Pattern: Regular health checks with intelligent alerting

# Monitoring schedule examples
*/20 * * * * /path/to/tdarr-timeout-monitor.sh    # Every 20 minutes
0 */6 * * * /path/to/cleanup-temp-dirs.sh         # Every 6 hours
0 2 * * * /path/to/backup-monitor.sh              # Daily at 2 AM

Event-Driven Monitoring

Pattern: Reactive monitoring for critical events

System Startup: Windows boot detection
Service Failures: Container restart alerts
Resource Exhaustion: Disk space warnings
Security Events: Failed login attempts
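
As one example of the events listed above, a disk space warning can be sketched with the standard library. The 90% threshold, the function names, and the message wording are assumptions chosen for illustration.

```python
import shutil
from typing import Optional


def disk_usage_percent(path: str = "/") -> float:
    """Return the used share of the filesystem containing `path`, in percent."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:  # guard against pseudo-filesystems reporting no size
        return 0.0
    return usage.used / usage.total * 100.0


def disk_warning(
    path: str,
    threshold_percent: float = 90.0,
    usage_percent: Optional[float] = None,
) -> Optional[str]:
    """Return a warning message if usage is at or above the threshold."""
    percent = disk_usage_percent(path) if usage_percent is None else usage_percent
    if percent >= threshold_percent:
        return f"Disk usage on {path} is {percent:.1f}% (threshold {threshold_percent:.0f}%)"
    return None
```

A caller would forward any non-`None` result to the alerting channel; the optional `usage_percent` argument exists only so the threshold logic can be exercised without touching a real disk.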

Data Collection and Analysis

Log Management

Pattern: Centralized logging with rotation

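# Log rotation configuration
LOG_FILE="/var/log/homelab-monitor.log"
MAX_SIZE_BYTES=$((10 * 1024 * 1024))   # 10 MiB
RETENTION_DAYS=30

# Rotate the log once it exceeds MAX_SIZE_BYTES (skip if it does not exist yet)
if [ -f "$LOG_FILE" ] && [ "$(stat -c%s "$LOG_FILE")" -gt "$MAX_SIZE_BYTES" ]; then
    # Include the time so a second rotation on the same day does not overwrite the first
    mv "$LOG_FILE" "$LOG_FILE.$(date +%Y%m%d-%H%M%S)"
    touch "$LOG_FILE"
fi

# Delete rotated logs older than RETENTION_DAYS
find "$(dirname "$LOG_FILE")" -maxdepth 1 -name "$(basename "$LOG_FILE").*" \
    -mtime "+$RETENTION_DAYS" -delete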

Metrics Collection

Pattern: Time-series data for trend analysis

System Metrics: CPU, memory, disk usage
Service Metrics: Response times, error rates
Application Metrics: Transcoding progress, queue sizes
Network Metrics: Bandwidth usage, latency
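
To make "trend analysis" concrete, the sketch below fits a straight line to timestamped samples and estimates when a metric would reach a threshold. It is a plain least-squares fit over `(timestamp_seconds, value)` pairs; the function names and the sample data in the usage note are hypothetical.

```python
from typing import List, Optional, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, metric value)


def slope_per_second(samples: List[Sample]) -> float:
    """Least-squares slope of value over time."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    denom = sum((t - mean_t) ** 2 for t, _ in samples)
    if denom == 0:
        raise ValueError("samples must span more than one timestamp")
    return sum((t - mean_t) * (v - mean_v) for t, v in samples) / denom


def seconds_until(samples: List[Sample], threshold: float) -> Optional[float]:
    """Estimate seconds until the latest value reaches `threshold`.

    Returns None if the metric is flat or moving away from the threshold.
    """
    slope = slope_per_second(samples)
    _, last_value = max(samples)  # sample with the latest timestamp
    remaining = threshold - last_value
    if slope == 0 or remaining / slope < 0:
        return None
    return remaining / slope
```

For disk usage sampled hourly at 50%, 51%, and 52%, `seconds_until(samples, 90.0)` gives about 38 hours, which is the kind of lead time a proactive warning could use.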

Alert Integration

Discord Notification System

Pattern: Rich, actionable notifications

# Professional alert format
**🔧 System Maintenance**
Service: Tdarr Transcoding
Issue: 3 files timed out in staging
Resolution: Automatic cleanup completed
Status: System operational

Manual review recommended <@user_id>

Alert Escalation

Pattern: Tiered alerting based on severity

Info: Routine maintenance completed
Warning: Service degradation detected
Critical: Service failure requiring immediate attention
Emergency: System-wide failure requiring manual intervention
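
The four tiers above map naturally onto an ordered enum. In this sketch the rule "mention a user only at Critical and above" is an assumption used to show how severity can drive routing; it is not a documented policy of this setup.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Alert tiers, ordered so they can be compared."""
    INFO = 1
    WARNING = 2
    CRITICAL = 3
    EMERGENCY = 4


def format_alert(severity: Severity, message: str, mention_id: str = "") -> str:
    """Prefix the message with its tier and mention a user for urgent tiers."""
    text = f"[{severity.name}] {message}"
    # Assumed policy: only CRITICAL and EMERGENCY alerts ping a person.
    if severity >= Severity.CRITICAL and mention_id:
        text += f" <@{mention_id}>"
    return text
```

Because `IntEnum` members compare as integers, thresholds such as "Warning or higher" stay one-line checks as more routing rules are added.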

Best Practices Implementation

Monitoring Strategy

Proactive: Monitor trends to predict issues
Reactive: Alert on current failures
Preventive: Automated cleanup and maintenance
Comprehensive: Cover all critical services
Actionable: Provide clear resolution paths

Performance Optimization

Efficient Polling: Balance monitoring frequency with resource usage
Smart Alerting: Avoid alert fatigue with intelligent filtering
Resource Management: Monitor the monitoring system itself
Scalable Architecture: Design for growth and additional services
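
One simple way to reduce alert fatigue is a per-alert cooldown: the same alert key is sent at most once per window. The class below is a sketch of that idea only; the name `AlertThrottle` and the 15-minute default are assumptions.

```python
import time
from typing import Callable, Dict


class AlertThrottle:
    """Suppress repeats of the same alert key within a cooldown window."""

    def __init__(
        self,
        cooldown_seconds: float = 900.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._cooldown = cooldown_seconds
        self._clock = clock
        self._last_sent: Dict[str, float] = {}

    def should_send(self, key: str) -> bool:
        """Return True and record the send time if `key` is not in cooldown."""
        now = self._clock()
        last = self._last_sent.get(key)
        if last is not None and now - last < self._cooldown:
            return False
        self._last_sent[key] = now
        return True
```

A monitoring script would call `should_send("tdarr:staging-timeout")` before posting to Discord, so a condition that persists across several cron runs produces one message per window instead of one per run.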

This technology context provides the foundation for implementing comprehensive monitoring and alerting in home lab environments.
