---
title: "Monitoring and Alerting Overview"
description: "Architecture overview of the homelab monitoring system including Uptime Kuma, Claude Runner (CT 302) two-tier health checks, Tdarr monitoring, Windows desktop monitoring, and n8n master/sub-workflow orchestration."
type: context
domain: monitoring
tags: [uptime-kuma, claude-runner, n8n, discord, healthcheck, tdarr, windows, infrastructure]
---

# System Monitoring and Alerting - Technology Context

## Overview

Comprehensive monitoring and alerting system for home lab infrastructure, with a focus on automated health checks, Discord notifications, and proactive system maintenance.

## Architecture Patterns

### Distributed Monitoring Strategy

**Pattern**: Service-specific monitoring with centralized alerting

- **Uptime Kuma**: Centralized service uptime and health monitoring (status page)
- **Claude Runner (CT 302)**: SSH-based server diagnostics with two-tier auto-remediation
- **Tdarr Monitoring**: API-based transcoding health checks
- **Windows Desktop Monitoring**: Reboot detection and system events
- **Network Monitoring**: Connectivity and service availability
- **Container Monitoring**: Docker/Podman health and resource usage

### AI Infrastructure LXCs (301–302)

#### Claude Discord Coordinator — CT 301 (10.10.0.147)

**Purpose**: Discord bot coordination with read-only KB search MCP access
**SSH alias**: `claude-discord-coordinator`

#### Claude Runner — CT 302 (10.10.0.148)

**Purpose**: Automated server health monitoring with AI-escalated remediation
**Repo**: `cal/claude-runner-monitoring` on Gitea (cloned to `/root/.claude` on CT 302)
**Docs**: `monitoring/server-diagnostics/CONTEXT.md`

**Two-tier system:**

- **Tier 1** (`health_check.py`): Pure Python, runs every 5 min via n8n. Checks containers, systemd services, disk/memory/load. Auto-restarts containers when allowed. Exit 0 = healthy, 1 = auto-fixed, 2 = needs Claude.
- **Tier 2** (`client.py`): Full diagnostic toolkit used by Claude during escalation sessions.

**Monitored servers** (dynamic from `config.yaml`):

| Server Key | IP | SSH User | Services | Critical |
|---|---|---|---|---|
| arr-stack | 10.10.0.221 | root | sonarr, radarr, readarr, lidarr, jellyseerr, sabnzbd | Yes |
| gitea | 10.10.0.225 | root | gitea (systemd), gitea-runner (docker) | Yes |
| uptime-kuma | 10.10.0.227 | root | uptime-kuma | Yes |
| n8n | 10.10.0.210 | root | n8n (no restart), n8n-postgres, omni-tools, termix | Yes |
| ubuntu-manticore | 10.10.0.226 | cal | jellyfin, tdarr-server, tdarr-node, pihole, watchstate, orbital-sync | Yes |
| strat-database | 10.10.0.42 | cal | sba_postgres, sba_redis, sba_db_api, dev_pd_database, sba_adminer | No (dev) |
| pihole1 | 10.10.0.16 | cal | pihole (primary DNS), nginx-proxy-manager, portainer | Yes |
| sba-bots | 10.10.0.88 | cal | paper-dynasty bot, paper-dynasty DB (prod), PD adminer, sba-website, ghost | Yes |
| foundry | 10.10.0.223 | root | foundry-foundry-1 (Foundry VTT) | No |

**Per-server SSH user:** `health_check.py` supports a per-server `ssh_user` override in `config.yaml` (default: root). Used by ubuntu-manticore, strat-database, pihole1, and sba-bots, which require the `cal` user.

**SSH keys:** n8n uses `n8n_runner_key` → CT 302; CT 302 uses `homelab_rsa` → target servers

**Helper script:** `/root/.claude/skills/server-diagnostics/list_servers.sh` — extracts server keys from `config.yaml` as a JSON array

#### n8n Workflow Architecture (Master + Sub-workflow)

The monitoring uses a master/sub-workflow pattern in n8n. Adding or removing servers only requires editing `config.yaml` on CT 302 — no n8n changes needed.
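The helper's behavior can be sketched in Python (a minimal illustration; the real `list_servers.sh` is a shell script, and the exact `config.yaml` layout, a top-level `servers:` mapping, is an assumption here):

```python
# Minimal Python sketch of what list_servers.sh does: pull the top-level
# server keys out of config.yaml and emit them as a JSON array.
# Assumes a layout like:
#   servers:
#     arr-stack:
#       ip: 10.10.0.221
# (The real config.yaml may differ; this only illustrates the contract.)
import json


def list_servers(config_text: str) -> str:
    """Return the server keys under 'servers:' as a JSON array string."""
    keys, in_servers = [], False
    for line in config_text.splitlines():
        if line.startswith("servers:"):
            in_servers = True
            continue
        if in_servers:
            if line and not line.startswith(" "):
                break  # next top-level section ends the servers block
            # A two-space-indented "key:" line names one server entry.
            if (line.startswith("  ") and not line.startswith("   ")
                    and line.rstrip().endswith(":")):
                keys.append(line.strip().rstrip(":"))
    return json.dumps(keys)
```

Because n8n only consumes this JSON array, the per-server details (IPs, `ssh_user`, service lists) stay entirely inside `config.yaml` on CT 302.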
**Master: "Server Health Monitor - Claude Code"** (`p7XmW23SgCs3hEkY`, active)

```
Schedule (every 5 min)
→ SSH to CT 302: list_servers.sh
  → ["arr-stack", "gitea", "uptime-kuma", "n8n", "ubuntu-manticore", "strat-database"]
→ Code: split JSON array into one item per server_key
→ Execute Sub-workflow (mode: "each") → "Server Health Check"
→ Code: aggregate results (healthy/remediated/escalated counts)
→ If any escalations → Discord summary embed
```

**Sub-workflow: "Server Health Check"** (`BhzYmWr6NcIDoioy`, active)

```
Execute Workflow Trigger (receives { server_key: "arr-stack" })
→ SSH to CT 302: health_check.py --server {server_key}
→ Code: parse JSON output (status, exit_code, issues, escalations)
→ If exit_code == 2 → SSH: remediate.sh (escalation JSON)
→ Return results to parent (server_key, status, issues, remediation_output)
```

**Exit code behavior:**

- `0` (healthy): No action, aggregated in summary
- `1` (auto-remediated): Script already handled it and sent Discord via notifier.py — n8n takes no action
- `2` (needs escalation): Sub-workflow runs `remediate.sh`, master sends Discord summary

**Pre-escalation notification:** `remediate.sh` sends a Discord warning embed ("Claude API Escalation Triggered") via `notifier.py` *before* invoking the Claude CLI, so Cal gets a heads-up that API charges are about to be incurred.
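The master workflow's aggregation step can be sketched as follows (an illustrative Python version of the n8n Code node; field names mirror the sub-workflow outputs above, and the actual node code is not reproduced here):

```python
# Illustrative sketch of the master workflow's "aggregate results" step:
# tally per-server exit codes and flag whether a Discord summary is due.
from collections import Counter


def aggregate(results):
    """results: list of {server_key, exit_code, ...} items from the sub-workflow."""
    counts = Counter()
    escalated = []
    for r in results:
        if r["exit_code"] == 0:
            counts["healthy"] += 1      # no action needed
        elif r["exit_code"] == 1:
            counts["remediated"] += 1   # Tier 1 already fixed it and alerted Discord
        else:
            counts["escalated"] += 1    # remediate.sh / Claude was invoked
            escalated.append(r["server_key"])
    # The master only sends a Discord summary embed when something escalated.
    return {
        "counts": dict(counts),
        "escalated": escalated,
        "send_summary": bool(escalated),
    }
```

This matches the exit-code contract above: healthy and auto-remediated servers are only counted, while any escalation triggers the summary embed.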
**SSH credential:** `SSH Private Key account` (id: `QkbHQ8JmYimUoTcM`)
**Discord webhook:** Homelab Alerts channel

### Alert Management

**Pattern**: Structured notifications with actionable information

```bash
# Discord webhook integration
curl -X POST "$DISCORD_WEBHOOK" \
  -H "Content-Type: application/json" \
  -d '{ "content": "**System Alert**\n```\nService: Tdarr\nIssue: Staging timeout\nAction: Automatic cleanup performed\n```\n<@user_id>" }'
```

## Core Monitoring Components

### Tdarr System Monitoring

**Purpose**: Monitor transcoding pipeline health and performance
**Location**: `scripts/tdarr_monitor.py`

**Key Features**:
- API-based status monitoring with dataclass structures
- Staging section timeout detection and cleanup
- Discord notifications with professional formatting
- Log rotation and retention management

### Windows Desktop Monitoring

**Purpose**: Track Windows system reboots and power events
**Location**: `scripts/windows-desktop/`

**Components**:
- PowerShell monitoring script
- Scheduled task automation
- Discord notification integration
- System event correlation

### Uptime Kuma (Centralized Uptime Monitoring)

**Purpose**: Centralized service uptime, health checks, and status page for all homelab services
**Location**: LXC 227 (10.10.0.227), Docker container
**URL**: https://status.manticorum.com (internal: http://10.10.0.227:3001)

**Key Features**:
- HTTP/HTTPS, TCP, DNS, Docker, and ping monitoring
- Discord notification integration (default alert channel for all monitors)
- Public status page at https://status.manticorum.com
- Multi-protocol health checks at 60-second intervals with 3 retries
- Certificate expiration monitoring

**Infrastructure**:
- Proxmox LXC 227, Ubuntu 22.04, 2 cores, 2GB RAM, 8GB disk
- Docker with AppArmor unconfined (required for Docker-in-LXC)
- Data persisted via Docker named volume (`uptime-kuma-data`)
- Compose config: `server-configs/uptime-kuma/docker-compose/uptime-kuma/`
- SSH alias: `uptime-kuma`
- Admin credentials: username `cal`, password in `~/.claude/secrets/kuma_web_password`

**Active Monitors (20)**:

| Tag | Monitor | Type | Target |
|-----|---------|------|--------|
| Infrastructure | Proxmox VE | HTTP | https://10.10.0.11:8006 |
| Infrastructure | Home Assistant | HTTP | http://10.0.0.28:8123 |
| DNS | Pi-hole Primary DNS | DNS | 10.10.0.16:53 |
| DNS | Pi-hole Secondary DNS | DNS | 10.10.0.226:53 |
| Media | Jellyfin | HTTP | http://10.10.0.226:8096 |
| Media | Tdarr | HTTP | http://10.10.0.226:8265 |
| Media | Sonarr | HTTP | http://10.10.0.221:8989 |
| Media | Radarr | HTTP | http://10.10.0.221:7878 |
| Media | Jellyseerr | HTTP | http://10.10.0.221:5055 |
| DevOps | Gitea | HTTP | http://10.10.0.225:3000 |
| DevOps | n8n | HTTP | http://10.10.0.210:5678 |
| Networking | NPM Local (Admin) | HTTP | http://10.10.0.16:81 |
| Networking | Pi-hole Primary Web | HTTP | http://10.10.0.16:81/admin |
| Networking | Pi-hole Secondary Web | HTTP | http://10.10.0.226:8053/admin |
| Gaming | Foundry VTT | HTTP | http://10.10.0.223:30000 |
| AI | OpenClaw Gateway | HTTP | http://10.10.0.224:18789 |
| Bots | discord-bots VM | Ping | 10.10.0.33 |
| Bots | sba-bots VM | Ping | 10.10.0.88 |
| Database | PostgreSQL (strat-database) | TCP | 10.10.0.42:5432 |
| External | Akamai NPM | HTTP | http://172.237.147.99 |

**Notifications**:
- Discord webhook: "Discord - Homelab Alerts" (default, applied to all monitors)
- Alerts on service down (after 3 retries at 30s intervals) and on recovery

**API Access**:
- Python library: `uptime-kuma-api` (pip installed)
- Connection: `UptimeKumaApi("http://10.10.0.227:3001")`
- Used for programmatic monitor/notification management

### Network and Service Monitoring

**Purpose**: Monitor critical infrastructure availability

**Implementation**:

```bash
# Service health check pattern
SERVICES="https://homelab.local http://nas.homelab.local"
for service in $SERVICES; do
  if curl -sSf --max-time 10 "$service" >/dev/null 2>&1; then
    echo "✅ $service: Available"
  else
    echo "❌ $service: Failed" | send_alert
  fi
done
```

## Automation Patterns

### Cron-Based Scheduling

**Pattern**: Regular health checks with intelligent alerting

```bash
# Monitoring schedule examples
*/20 * * * * /path/to/tdarr-timeout-monitor.sh   # Every 20 minutes
0 */6 * * * /path/to/cleanup-temp-dirs.sh        # Every 6 hours
0 2 * * * /path/to/backup-monitor.sh             # Daily at 2 AM
```

### Event-Driven Monitoring

**Pattern**: Reactive monitoring for critical events

- **System Startup**: Windows boot detection
- **Service Failures**: Container restart alerts
- **Resource Exhaustion**: Disk space warnings
- **Security Events**: Failed login attempts

## Data Collection and Analysis

### Log Management

**Pattern**: Centralized logging with rotation

```bash
# Log rotation configuration
LOG_FILE="/var/log/homelab-monitor.log"
MAX_SIZE=$((10*1024*1024))   # 10 MB
RETENTION_DAYS=30

# Rotate the log when it exceeds the size limit
if [ "$(stat -c%s "$LOG_FILE")" -gt "$MAX_SIZE" ]; then
  mv "$LOG_FILE" "$LOG_FILE.$(date +%Y%m%d)"
  touch "$LOG_FILE"
fi
```

### Metrics Collection

**Pattern**: Time-series data for trend analysis

- **System Metrics**: CPU, memory, disk usage
- **Service Metrics**: Response times, error rates
- **Application Metrics**: Transcoding progress, queue sizes
- **Network Metrics**: Bandwidth usage, latency

## Alert Integration

### Discord Notification System

**Pattern**: Rich, actionable notifications

```markdown
# Professional alert format
**🔧 System Maintenance**
Service: Tdarr Transcoding
Issue: 3 files timed out in staging
Resolution: Automatic cleanup completed
Status: System operational

Manual review recommended <@user_id>
```

### Alert Escalation

**Pattern**: Tiered alerting based on severity

1. **Info**: Routine maintenance completed
2. **Warning**: Service degradation detected
3. **Critical**: Service failure requiring immediate attention
4. **Emergency**: System-wide failure requiring manual intervention

## Best Practices Implementation

### Monitoring Strategy

1. **Proactive**: Monitor trends to predict issues
2. **Reactive**: Alert on current failures
3. **Preventive**: Automated cleanup and maintenance
4. **Comprehensive**: Cover all critical services
5. **Actionable**: Provide clear resolution paths

### Performance Optimization

1. **Efficient Polling**: Balance monitoring frequency with resource usage
2. **Smart Alerting**: Avoid alert fatigue with intelligent filtering
3. **Resource Management**: Monitor the monitoring system itself
4. **Scalable Architecture**: Design for growth and additional services

This technology context provides the foundation for implementing comprehensive monitoring and alerting in home lab environments.
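The "Smart Alerting" point can be sketched as a simple cooldown filter (an illustrative pattern, not part of the existing scripts; names like `AlertThrottle` are invented for this example):

```python
# Sketch of intelligent alert filtering: suppress repeat alerts for the
# same (service, issue) pair within a cooldown window, so a flapping
# service doesn't flood the Discord channel.
import time


class AlertThrottle:
    def __init__(self, cooldown_seconds=1800):
        self.cooldown = cooldown_seconds
        self._last_sent = {}  # (service, issue) -> timestamp of last alert

    def should_send(self, service, issue, now=None):
        """Return True if this (service, issue) has not alerted within the cooldown."""
        now = time.time() if now is None else now
        key = (service, issue)
        last = self._last_sent.get(key)
        if last is not None and now - last < self.cooldown:
            return False  # still inside the cooldown window: suppress
        self._last_sent[key] = now
        return True
```

A filter like this would sit between the health checks and the Discord webhook call, letting recovery notices and new issues through while deduplicating repeats of a known outage.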