
---
title: Monitoring and Alerting Overview
description: Architecture overview of the homelab monitoring system including Uptime Kuma, Claude Runner (CT 302) two-tier health checks, Tdarr monitoring, Windows desktop monitoring, and n8n master/sub-workflow orchestration.
type: context
domain: monitoring
tags:
  - uptime-kuma
  - claude-runner
  - n8n
  - discord
  - healthcheck
  - tdarr
  - windows
  - infrastructure
---

# System Monitoring and Alerting - Technology Context

## Overview

Comprehensive monitoring and alerting system for home lab infrastructure, focused on automated health checks, Discord notifications, and proactive system maintenance.

## Architecture Patterns

### Distributed Monitoring Strategy

**Pattern:** Service-specific monitoring with centralized alerting

- **Uptime Kuma:** Centralized service uptime and health monitoring (status page)
- **Claude Runner (CT 302):** SSH-based server diagnostics with two-tier auto-remediation
- **Tdarr Monitoring:** API-based transcoding health checks
- **Windows Desktop Monitoring:** Reboot detection and system events
- **Network Monitoring:** Connectivity and service availability
- **Container Monitoring:** Docker/Podman health and resource usage

## AI Infrastructure LXCs (301–302)

### Claude Discord Coordinator — CT 301 (10.10.0.147)

- **Purpose:** Discord bot coordination with read-only cognitive memory MCP access
- **SSH alias:** claude-discord-coordinator

### Claude Runner — CT 302 (10.10.0.148)

- **Purpose:** Automated server health monitoring with AI-escalated remediation
- **Repo:** cal/claude-runner-monitoring on Gitea (cloned to /root/.claude on CT 302)
- **Docs:** monitoring/server-diagnostics/CONTEXT.md

**Two-tier system:**

- **Tier 1 (`health_check.py`):** Pure Python, runs every 5 min via n8n. Checks containers, systemd services, and disk/memory/load. Auto-restarts containers when allowed. Exit 0 = healthy, 1 = auto-fixed, 2 = needs Claude.
- **Tier 2 (`client.py`):** Full diagnostic toolkit used by Claude during escalation sessions.

Monitored servers (dynamic from `config.yaml`):

| Server Key | IP | SSH User | Services | Critical |
|---|---|---|---|---|
| arr-stack | 10.10.0.221 | root | sonarr, radarr, readarr, lidarr, jellyseerr, sabnzbd | Yes |
| gitea | 10.10.0.225 | root | gitea (systemd), gitea-runner (docker) | Yes |
| uptime-kuma | 10.10.0.227 | root | uptime-kuma | Yes |
| n8n | 10.10.0.210 | root | n8n (no restart), n8n-postgres, omni-tools, termix | Yes |
| ubuntu-manticore | 10.10.0.226 | cal | jellyfin, tdarr-server, tdarr-node, pihole, watchstate, orbital-sync | Yes |
| strat-database | 10.10.0.42 | cal | sba_postgres, sba_redis, sba_db_api, dev_pd_database, sba_adminer | No (dev) |
| pihole1 | 10.10.0.16 | cal | pihole (primary DNS), nginx-proxy-manager, portainer | Yes |
| sba-bots | 10.10.0.88 | cal | paper-dynasty bot, paper-dynasty DB (prod), PD adminer, sba-website, ghost | Yes |
| foundry | 10.10.0.223 | root | foundry-foundry-1 (Foundry VTT) | No |

- **Per-server SSH user:** `health_check.py` supports a per-server `ssh_user` override in `config.yaml` (default: root). Used by ubuntu-manticore, strat-database, pihole1, and sba-bots, which require the `cal` user.
- **SSH keys:** n8n uses `n8n_runner_key` → CT 302; CT 302 uses `homelab_rsa` → target servers
- **Helper script:** `/root/.claude/skills/server-diagnostics/list_servers.sh` — extracts server keys from `config.yaml` as a JSON array
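The key-extraction idea behind the helper can be approximated with a short sketch. The sample `config.yaml` below is hypothetical (the real file lives on CT 302), and the assumption is that server keys sit as two-space-indented entries under a top-level `servers:` mapping:

```bash
# Hypothetical sample config mirroring the assumed config.yaml layout
cat > /tmp/sample-config.yaml <<'EOF'
servers:
  arr-stack:
    ip: 10.10.0.221
  gitea:
    ip: 10.10.0.225
EOF

# Extract the two-space-indented keys under `servers:` and emit a JSON
# array, the shape the n8n master workflow consumes.
keys=$(awk '/^servers:/{in_s=1; next}
            /^[^ ]/{in_s=0}
            in_s && /^  [A-Za-z0-9_-]+:/{gsub(/[: ]/,""); print}' /tmp/sample-config.yaml)
json="[$(printf '"%s",' $keys | sed 's/,$//')]"
echo "$json"   # → ["arr-stack","gitea"]
```

The production script may well use a proper YAML parser instead; the point is only the config-to-JSON-array contract.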

## n8n Workflow Architecture (Master + Sub-workflow)

The monitoring uses a master/sub-workflow pattern in n8n. Adding or removing servers only requires editing `config.yaml` on CT 302 — no n8n changes needed.

### Master: "Server Health Monitor - Claude Code" (p7XmW23SgCs3hEkY, active)

```text
Schedule (every 5 min)
  → SSH to CT 302: list_servers.sh → ["arr-stack", "gitea", "uptime-kuma", "n8n", "ubuntu-manticore", "strat-database"]
  → Code: split JSON array into one item per server_key
  → Execute Sub-workflow (mode: "each") → "Server Health Check"
  → Code: aggregate results (healthy/remediated/escalated counts)
  → If any escalations → Discord summary embed
```

### Sub-workflow: "Server Health Check" (BhzYmWr6NcIDoioy, active)

```text
Execute Workflow Trigger (receives { server_key: "arr-stack" })
  → SSH to CT 302: health_check.py --server {server_key}
  → Code: parse JSON output (status, exit_code, issues, escalations)
  → If exit_code == 2 → SSH: remediate.sh (escalation JSON)
  → Return results to parent (server_key, status, issues, remediation_output)
```
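The master's split/aggregate steps can be mimicked outside n8n. This sketch uses illustrative status words (one per server) rather than the real JSON payloads the sub-workflow returns:

```bash
# Simulated per-server results in the shape "server:status"
# (assumption: real payloads are JSON objects, simplified here)
results="arr-stack:healthy gitea:healthy n8n:escalated"

healthy=0; remediated=0; escalated=0
for entry in $results; do
  case "${entry#*:}" in            # strip "server:" prefix, keep status
    healthy)    healthy=$((healthy+1)) ;;
    remediated) remediated=$((remediated+1)) ;;
    escalated)  escalated=$((escalated+1)) ;;
  esac
done

summary="healthy=$healthy remediated=$remediated escalated=$escalated"
echo "$summary"
# Only escalations would trigger the Discord summary embed:
[ "$escalated" -gt 0 ] && echo "would send Discord summary"
```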

**Exit code behavior:**

- **0 (healthy):** No action; aggregated in summary
- **1 (auto-remediated):** Script already handled it and sent Discord via notifier.py — n8n takes no action
- **2 (needs escalation):** Sub-workflow runs remediate.sh; master sends Discord summary
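The branching that n8n performs on these exit codes can be sketched in shell. The health check is stubbed with a fake function here, and `remediate.sh` is only echoed, never invoked:

```bash
# Stub standing in for: ssh CT302 'health_check.py --server <key>'
# (assumption: the real script exits 0/1/2 per the contract above)
fake_health_check() { return 2; }

handle_server() {
  local server="$1"
  fake_health_check "$server"
  case $? in
    0) echo "$server: healthy" ;;
    1) echo "$server: auto-remediated (Discord already notified by notifier.py)" ;;
    2) echo "$server: escalating -> would run remediate.sh" ;;
    *) echo "$server: unknown exit code" ;;
  esac
}

result=$(handle_server arr-stack)
echo "$result"   # → arr-stack: escalating -> would run remediate.sh
```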

Pre-escalation notification: remediate.sh sends a Discord warning embed ("Claude API Escalation Triggered") via notifier.py before invoking the Claude CLI, so Cal gets a heads-up that API charges are about to be incurred.

- **SSH credential:** SSH Private Key account (id: QkbHQ8JmYimUoTcM)
- **Discord webhook:** Homelab Alerts channel

## Alert Management

**Pattern:** Structured notifications with actionable information

```bash
# Discord webhook integration
curl -X POST "$DISCORD_WEBHOOK" \
  -H "Content-Type: application/json" \
  -d '{
    "content": "**System Alert**\n```\nService: Tdarr\nIssue: Staging timeout\nAction: Automatic cleanup performed\n```\n<@user_id>"
  }'
```
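Hand-written JSON payloads break as soon as a message contains quotes or newlines. A safer variant escapes the message first; this is a minimal sketch (the `json_escape` helper is illustrative, `$DISCORD_WEBHOOK` must be set, and the curl call is left commented out):

```bash
json_escape() {
  # Escape backslashes and double quotes, then fold real newlines to \n
  printf '%s' "$1" \
    | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g' \
    | awk '{printf "%s\\n", $0}' \
    | sed 's/\\n$//'
}

msg='Service: Tdarr
Issue: "staging" timeout'

payload="{\"content\": \"$(json_escape "$msg")\"}"
echo "$payload"
# curl -X POST "$DISCORD_WEBHOOK" -H "Content-Type: application/json" -d "$payload"
```

Tools like `jq -n --arg` do the same job more robustly where jq is available.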

## Core Monitoring Components

### Tdarr System Monitoring

- **Purpose:** Monitor transcoding pipeline health and performance
- **Location:** `scripts/tdarr_monitor.py`

**Key Features:**

- API-based status monitoring with dataclass structures
- Staging section timeout detection and cleanup
- Discord notifications with professional formatting
- Log rotation and retention management
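The monitor itself is Python and reads the Tdarr API, but the core timeout idea (flag items stuck past a threshold) can be sketched generically against the filesystem. The staging directory and 60-minute threshold below are illustrative:

```bash
# Demo staging directory (hypothetical; the real monitor queries Tdarr's API)
staging=/tmp/tdarr-staging-demo
rm -rf "$staging" && mkdir -p "$staging"

touch "$staging/fresh.mkv"
touch -d '2 hours ago' "$staging/stuck.mkv"   # simulate a timed-out item

# Anything untouched for more than 60 minutes counts as timed out
stuck=$(find "$staging" -type f -mmin +60)
count=$(printf '%s\n' "$stuck" | grep -c .)
echo "timed out: $count"   # → timed out: 1
```

The real cleanup step would then remove or requeue each flagged item and send the Discord notification.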

### Windows Desktop Monitoring

- **Purpose:** Track Windows system reboots and power events
- **Location:** `scripts/windows-desktop/`

**Components:**

- PowerShell monitoring script
- Scheduled task automation
- Discord notification integration
- System event correlation

### Uptime Kuma (Centralized Uptime Monitoring)

- **Purpose:** Centralized service uptime, health checks, and status page for all homelab services
- **Location:** LXC 227 (10.10.0.227), Docker container
- **URL:** https://status.manticorum.com (internal: http://10.10.0.227:3001)

**Key Features:**

- HTTP/HTTPS, TCP, DNS, Docker, and ping monitoring
- Discord notification integration (default alert channel for all monitors)
- Public status page at https://status.manticorum.com
- Multi-protocol health checks at 60-second intervals with 3 retries
- Certificate expiration monitoring

**Infrastructure:**

- Proxmox LXC 227, Ubuntu 22.04, 2 cores, 2 GB RAM, 8 GB disk
- Docker with AppArmor unconfined (required for Docker-in-LXC)
- Data persisted via Docker named volume (`uptime-kuma-data`)
- Compose config: `server-configs/uptime-kuma/docker-compose/uptime-kuma/`
- SSH alias: uptime-kuma
- Admin credentials: username `cal`, password in `~/.claude/secrets/kuma_web_password`

**Active Monitors (20):**

| Tag | Monitor | Type | Target |
|---|---|---|---|
| Infrastructure | Proxmox VE | HTTP | https://10.10.0.11:8006 |
| Infrastructure | Home Assistant | HTTP | http://10.0.0.28:8123 |
| DNS | Pi-hole Primary DNS | DNS | 10.10.0.16:53 |
| DNS | Pi-hole Secondary DNS | DNS | 10.10.0.226:53 |
| Media | Jellyfin | HTTP | http://10.10.0.226:8096 |
| Media | Tdarr | HTTP | http://10.10.0.226:8265 |
| Media | Sonarr | HTTP | http://10.10.0.221:8989 |
| Media | Radarr | HTTP | http://10.10.0.221:7878 |
| Media | Jellyseerr | HTTP | http://10.10.0.221:5055 |
| DevOps | Gitea | HTTP | http://10.10.0.225:3000 |
| DevOps | n8n | HTTP | http://10.10.0.210:5678 |
| Networking | NPM Local (Admin) | HTTP | http://10.10.0.16:81 |
| Networking | Pi-hole Primary Web | HTTP | http://10.10.0.16:81/admin |
| Networking | Pi-hole Secondary Web | HTTP | http://10.10.0.226:8053/admin |
| Gaming | Foundry VTT | HTTP | http://10.10.0.223:30000 |
| AI | OpenClaw Gateway | HTTP | http://10.10.0.224:18789 |
| Bots | discord-bots VM | Ping | 10.10.0.33 |
| Bots | sba-bots VM | Ping | 10.10.0.88 |
| Database | PostgreSQL (strat-database) | TCP | 10.10.0.42:5432 |
| External | Akamai NPM | HTTP | http://172.237.147.99 |

**Notifications:**

- Discord webhook: "Discord - Homelab Alerts" (default, applied to all monitors)
- Alerts on service down (after 3 retries at 30 s intervals) and on recovery

**API Access:**

- Python library: `uptime-kuma-api` (pip-installed)
- Connection: `UptimeKumaApi("http://10.10.0.227:3001")`
- Used for programmatic monitor/notification management

### Network and Service Monitoring

**Purpose:** Monitor critical infrastructure availability

**Implementation:**

```bash
# Service health check pattern
SERVICES="https://homelab.local http://nas.homelab.local"
for service in $SERVICES; do
    if curl -sSf --max-time 10 "$service" >/dev/null 2>&1; then
        echo "✅ $service: Available"
    else
        echo "❌ $service: Failed" | send_alert
    fi
done
```

## Automation Patterns

### Cron-Based Scheduling

**Pattern:** Regular health checks with intelligent alerting

```cron
# Monitoring schedule examples
*/20 * * * * /path/to/tdarr-timeout-monitor.sh    # Every 20 minutes
0 */6 * * * /path/to/cleanup-temp-dirs.sh         # Every 6 hours
0 2 * * * /path/to/backup-monitor.sh              # Daily at 2 AM
```

### Event-Driven Monitoring

**Pattern:** Reactive monitoring for critical events

- **System Startup:** Windows boot detection
- **Service Failures:** Container restart alerts
- **Resource Exhaustion:** Disk space warnings
- **Security Events:** Failed login attempts
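A minimal version of the resource-exhaustion check (disk space) looks like the following. The 90% threshold is illustrative, and a real deployment would route the warning through the Discord webhook instead of `echo`:

```bash
threshold=90   # percent used that triggers a warning (illustrative)

# Parse "Use%" for the root filesystem from POSIX df output
used=$(df -P / | awk 'NR==2 {gsub(/%/,""); print $5}')

if [ "$used" -ge "$threshold" ]; then
  msg="disk warning: / at ${used}% (>= ${threshold}%)"
else
  msg="disk ok: / at ${used}%"
fi
echo "$msg"
```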

## Data Collection and Analysis

### Log Management

**Pattern:** Centralized logging with rotation

```bash
# Log rotation configuration
LOG_FILE="/var/log/homelab-monitor.log"
MAX_SIZE_BYTES=$((10*1024*1024))   # 10M
RETENTION_DAYS=30

# Rotate the log when the size threshold is exceeded
if [ "$(stat -c%s "$LOG_FILE")" -gt "$MAX_SIZE_BYTES" ]; then
    mv "$LOG_FILE" "$LOG_FILE.$(date +%Y%m%d)"
    touch "$LOG_FILE"
fi

# Drop rotated logs older than the retention window
find "$(dirname "$LOG_FILE")" -name "$(basename "$LOG_FILE").*" \
    -mtime +"$RETENTION_DAYS" -delete
```

### Metrics Collection

**Pattern:** Time-series data for trend analysis

- **System Metrics:** CPU, memory, disk usage
- **Service Metrics:** Response times, error rates
- **Application Metrics:** Transcoding progress, queue sizes
- **Network Metrics:** Bandwidth usage, latency
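At homelab scale, a flat CSV is often enough time-series storage for trend analysis. This sketch appends one sample of load and memory use per run (the path and field names are illustrative, and `/proc` assumes a Linux host):

```bash
metrics=/tmp/metrics-demo.csv
rm -f "$metrics"
[ -f "$metrics" ] || echo "timestamp,load1,mem_used_pct" > "$metrics"

ts=$(date -u +%Y-%m-%dT%H:%M:%SZ)
load1=$(awk '{print $1}' /proc/loadavg)
# Percent of RAM in use, derived from MemTotal and MemAvailable
mem=$(awk '/MemTotal/{t=$2} /MemAvailable/{a=$2} END{printf "%.0f", (t-a)*100/t}' /proc/meminfo)

echo "$ts,$load1,$mem" >> "$metrics"
tail -n 1 "$metrics"
```

Run from cron, this accumulates rows that spreadsheet tools or a quick awk/gnuplot pass can graph for trends.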

## Alert Integration

### Discord Notification System

**Pattern:** Rich, actionable notifications

```text
# Professional alert format
**🔧 System Maintenance**
Service: Tdarr Transcoding
Issue: 3 files timed out in staging
Resolution: Automatic cleanup completed
Status: System operational

Manual review recommended <@user_id>
```

### Alert Escalation

**Pattern:** Tiered alerting based on severity

1. **Info:** Routine maintenance completed
2. **Warning:** Service degradation detected
3. **Critical:** Service failure requiring immediate attention
4. **Emergency:** System-wide failure requiring manual intervention
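The tiers map naturally onto notification behavior, e.g. pinging a user only on the top tiers. This routing sketch uses the severity names from the list above; the mention logic and output format are illustrative:

```bash
route_alert() {
  local severity="$1" message="$2" mention=""
  case "$severity" in
    critical|emergency) mention=" <@user_id>" ;;   # ping only on top tiers
  esac
  echo "[$severity] $message$mention"
}

out=$(route_alert warning "Service degradation detected")
echo "$out"    # → [warning] Service degradation detected
out2=$(route_alert critical "Service failure")
echo "$out2"   # → [critical] Service failure <@user_id>
```

A fuller version would also pick the Discord embed color per tier.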

## Best Practices Implementation

### Monitoring Strategy

1. **Proactive:** Monitor trends to predict issues
2. **Reactive:** Alert on current failures
3. **Preventive:** Automated cleanup and maintenance
4. **Comprehensive:** Cover all critical services
5. **Actionable:** Provide clear resolution paths

### Performance Optimization

1. **Efficient Polling:** Balance monitoring frequency with resource usage
2. **Smart Alerting:** Avoid alert fatigue with intelligent filtering
3. **Resource Management:** Monitor the monitoring system itself
4. **Scalable Architecture:** Design for growth and additional services
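Smart alerting usually comes down to suppressing repeats. The simplest form is a state file keyed by alert identity with a cooldown window; this sketch uses an illustrative one-hour cooldown and a throwaway state directory:

```bash
state=/tmp/alert-state-demo
rm -rf "$state" && mkdir -p "$state"

should_alert() {
  local key="$1" cooldown=3600 stamp now
  stamp="$state/$key"
  now=$(date +%s)
  if [ -f "$stamp" ] && [ $((now - $(cat "$stamp"))) -lt "$cooldown" ]; then
    return 1   # identical alert fired within the cooldown; stay quiet
  fi
  echo "$now" > "$stamp"
  return 0
}

first=$(should_alert tdarr-staging && echo sent || echo suppressed)
second=$(should_alert tdarr-staging && echo sent || echo suppressed)
echo "$first $second"   # → sent suppressed
```

Clearing the stamp on recovery re-arms the alert, so a flapping service still notifies once per episode rather than once per check.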

This technology context provides the foundation for implementing comprehensive monitoring and alerting in home lab environments.