claude-home/monitoring/CONTEXT.md
Cal Corum df553e5142 docs: add AI infrastructure LXCs (301-303) to monitoring server inventory
Groups Claude Discord Coordinator, Claude Runner, and MCP Gateway
under a shared section. Documents new CT 303 MCP Gateway with n8n
and Gitea MCP server configuration details.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-23 19:58:19 -06:00


System Monitoring and Alerting - Technology Context

Overview

Comprehensive monitoring and alerting system for home lab infrastructure with focus on automated health checks, Discord notifications, and proactive system maintenance.

Architecture Patterns

Distributed Monitoring Strategy

Pattern: Service-specific monitoring with centralized alerting

  • Uptime Kuma: Centralized service uptime and health monitoring (status page)
  • Claude Runner (CT 302): SSH-based server diagnostics with two-tier auto-remediation
  • Tdarr Monitoring: API-based transcoding health checks
  • Windows Desktop Monitoring: Reboot detection and system events
  • Network Monitoring: Connectivity and service availability
  • Container Monitoring: Docker/Podman health and resource usage

AI Infrastructure LXCs (301-303)

Claude Discord Coordinator — CT 301 (10.10.0.147)

Purpose: Discord bot coordination with read-only cognitive memory MCP access
SSH alias: claude-discord-coordinator

Claude Runner — CT 302 (10.10.0.148)

Purpose: Automated server health monitoring with AI-escalated remediation
Repo: cal/claude-runner-monitoring on Gitea (cloned to /root/.claude on CT 302)
Docs: monitoring/server-diagnostics/CONTEXT.md

MCP Gateway — CT 303 (10.10.0.231)

Purpose: Docker MCP Gateway — centralized MCP server proxy for Claude Code
SSH alias: mcp-gateway
Service: docker/mcp-gateway v2.0.1 on port 8811 (http://10.10.0.231:8811/mcp)
MCP servers: n8n (23 tools, connects to LXC 210), Gitea (109 tools, connects to git.manticorum.com)
Config: /home/cal/mcp-gateway/secrets.env, config/config.yaml, config/registry.yaml, gitea-catalog.yaml
Note: Uses the --servers flag for static server activation (the Docker Desktop secret store is unavailable on a headless Engine). A custom catalog adds the Gitea MCP server via the docker.gitea.com/gitea-mcp-server image.

Two-tier system:

  • Tier 1 (health_check.py): Pure Python, runs every 5 min via n8n. Checks containers, systemd services, disk/memory/load. Auto-restarts containers when allowed. Exit 0=healthy, 1=auto-fixed, 2=needs Claude.
  • Tier 2 (client.py): Full diagnostic toolkit used by Claude during escalation sessions.
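A minimal sketch of the Tier-1 contract (illustrative only, not the actual health_check.py internals): inspect container states, restart where the config allows it, and return the three-way exit code.

```python
# Illustrative sketch of the Tier-1 exit-code contract:
# 0 = healthy, 1 = auto-fixed, 2 = needs Claude.

def run_tier1(containers: dict, restart_allowed: set) -> int:
    """containers maps name -> state ('running' / 'exited')."""
    fixed, unresolved = [], []
    for name, state in containers.items():
        if state == "running":
            continue
        if name in restart_allowed:
            fixed.append(name)      # stand-in for `docker restart <name>`
        else:
            unresolved.append(name)
    if unresolved:
        return 2                    # at least one issue remains: escalate
    return 1 if fixed else 0        # 1 = auto-remediated, 0 = healthy
```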

Monitored servers (dynamic from config.yaml):

| Server Key | IP | SSH User | Services | Critical |
|---|---|---|---|---|
| arr-stack | 10.10.0.221 | root | sonarr, radarr, readarr, lidarr, jellyseerr, sabnzbd | Yes |
| gitea | 10.10.0.225 | root | gitea (systemd), gitea-runner (docker) | Yes |
| uptime-kuma | 10.10.0.227 | root | uptime-kuma | Yes |
| n8n | 10.10.0.210 | root | n8n (no restart), n8n-postgres, omni-tools, termix | Yes |
| ubuntu-manticore | 10.10.0.226 | cal | jellyfin, tdarr-server, tdarr-node, pihole, watchstate, orbital-sync | Yes |
| strat-database | 10.10.0.42 | cal | sba_postgres, sba_redis, sba_db_api, dev_pd_database, sba_adminer | No (dev) |
| pihole1 | 10.10.0.16 | cal | pihole (primary DNS), nginx-proxy-manager, portainer | Yes |
| sba-bots | 10.10.0.88 | cal | paper-dynasty bot, paper-dynasty DB (prod), PD adminer, sba-website, ghost | Yes |
| foundry | 10.10.0.223 | root | foundry-foundry-1 (Foundry VTT) | No |

Per-server SSH user: health_check.py supports a per-server ssh_user override in config.yaml (default: root). Used by ubuntu-manticore, strat-database, pihole1, and sba-bots, which require the cal user.
SSH keys: n8n uses n8n_runner_key → CT 302; CT 302 uses homelab_rsa → target servers
Helper script: /root/.claude/skills/server-diagnostics/list_servers.sh — extracts server keys from config.yaml as a JSON array
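The ssh_user fallback can be sketched as follows; the dict here stands in for the parsed config.yaml on CT 302 and is illustrative, not the file's actual contents.

```python
# Stand-in for the parsed config.yaml; entries are illustrative.
CONFIG = {
    "servers": {
        "gitea": {"ip": "10.10.0.225"},                      # no override -> root
        "pihole1": {"ip": "10.10.0.16", "ssh_user": "cal"},  # explicit override
    }
}

def ssh_target(server_key: str) -> str:
    """Resolve user@ip for a server, defaulting ssh_user to root."""
    server = CONFIG["servers"][server_key]
    user = server.get("ssh_user", "root")
    return f"{user}@{server['ip']}"
```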

n8n Workflow Architecture (Master + Sub-workflow)

The monitoring uses a master/sub-workflow pattern in n8n. Adding or removing servers only requires editing config.yaml on CT 302 — no n8n changes needed.

Master: "Server Health Monitor - Claude Code" (p7XmW23SgCs3hEkY, active)

Schedule (every 5 min)
  → SSH to CT 302: list_servers.sh → ["arr-stack", "gitea", "uptime-kuma", "n8n", "ubuntu-manticore", "strat-database"]
  → Code: split JSON array into one item per server_key
  → Execute Sub-workflow (mode: "each") → "Server Health Check"
  → Code: aggregate results (healthy/remediated/escalated counts)
  → If any escalations → Discord summary embed

Sub-workflow: "Server Health Check" (BhzYmWr6NcIDoioy, active)

Execute Workflow Trigger (receives { server_key: "arr-stack" })
  → SSH to CT 302: health_check.py --server {server_key}
  → Code: parse JSON output (status, exit_code, issues, escalations)
  → If exit_code == 2 → SSH: remediate.sh (escalation JSON)
  → Return results to parent (server_key, status, issues, remediation_output)

Exit code behavior:

  • 0 (healthy): No action, aggregated in summary
  • 1 (auto-remediated): Script already handled it + sent Discord via notifier.py — n8n takes no action
  • 2 (needs escalation): Sub-workflow runs remediate.sh, master sends Discord summary
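The branch the workflow takes on each exit code can be summarized as a small lookup (plain Python for illustration; the real logic lives in the n8n Code nodes):

```python
def workflow_action(exit_code: int) -> str:
    """Map health_check.py's exit code to the n8n-side action."""
    actions = {
        0: "aggregate",        # healthy: just counted in the master summary
        1: "aggregate",        # script already restarted + notified Discord
        2: "run_remediate",    # sub-workflow invokes remediate.sh
    }
    # Treat any unexpected code as an escalation rather than silence.
    return actions.get(exit_code, "run_remediate")
```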

Pre-escalation notification: remediate.sh sends a Discord warning embed ("Claude API Escalation Triggered") via notifier.py before invoking the Claude CLI, so Cal gets a heads-up that API charges are about to be incurred.

SSH credential: SSH Private Key account (id: QkbHQ8JmYimUoTcM)
Discord webhook: Homelab Alerts channel

Alert Management

Pattern: Structured notifications with actionable information

```bash
# Discord webhook integration
curl -X POST "$DISCORD_WEBHOOK" \
  -H "Content-Type: application/json" \
  -d '{
    "content": "**System Alert**\n```\nService: Tdarr\nIssue: Staging timeout\nAction: Automatic cleanup performed\n```\n<@user_id>"
  }'
```

Core Monitoring Components

Tdarr System Monitoring

Purpose: Monitor transcoding pipeline health and performance
Location: scripts/tdarr_monitor.py

Key Features:

  • API-based status monitoring with dataclass structures
  • Staging section timeout detection and cleanup
  • Discord notifications with professional formatting
  • Log rotation and retention management

Windows Desktop Monitoring

Purpose: Track Windows system reboots and power events
Location: scripts/windows-desktop/

Components:

  • PowerShell monitoring script
  • Scheduled task automation
  • Discord notification integration
  • System event correlation

Uptime Kuma (Centralized Uptime Monitoring)

Purpose: Centralized service uptime, health checks, and status page for all homelab services
Location: LXC 227 (10.10.0.227), Docker container
URL: https://status.manticorum.com (internal: http://10.10.0.227:3001)

Key Features:

  • HTTP/HTTPS, TCP, DNS, Docker, and ping monitoring
  • Discord notification integration (default alert channel for all monitors)
  • Public status page at https://status.manticorum.com
  • Multi-protocol health checks at 60-second intervals with 3 retries
  • Certificate expiration monitoring

Infrastructure:

  • Proxmox LXC 227, Ubuntu 22.04, 2 cores, 2GB RAM, 8GB disk
  • Docker with AppArmor unconfined (required for Docker-in-LXC)
  • Data persisted via Docker named volume (uptime-kuma-data)
  • Compose config: server-configs/uptime-kuma/docker-compose/uptime-kuma/
  • SSH alias: uptime-kuma
  • Admin credentials: username cal, password in ~/.claude/secrets/kuma_web_password

Active Monitors (20):

| Tag | Monitor | Type | Target |
|---|---|---|---|
| Infrastructure | Proxmox VE | HTTP | https://10.10.0.11:8006 |
| Infrastructure | Home Assistant | HTTP | http://10.0.0.28:8123 |
| DNS | Pi-hole Primary DNS | DNS | 10.10.0.16:53 |
| DNS | Pi-hole Secondary DNS | DNS | 10.10.0.226:53 |
| Media | Jellyfin | HTTP | http://10.10.0.226:8096 |
| Media | Tdarr | HTTP | http://10.10.0.226:8265 |
| Media | Sonarr | HTTP | http://10.10.0.221:8989 |
| Media | Radarr | HTTP | http://10.10.0.221:7878 |
| Media | Jellyseerr | HTTP | http://10.10.0.221:5055 |
| DevOps | Gitea | HTTP | http://10.10.0.225:3000 |
| DevOps | n8n | HTTP | http://10.10.0.210:5678 |
| Networking | NPM Local (Admin) | HTTP | http://10.10.0.16:81 |
| Networking | Pi-hole Primary Web | HTTP | http://10.10.0.16:81/admin |
| Networking | Pi-hole Secondary Web | HTTP | http://10.10.0.226:8053/admin |
| Gaming | Foundry VTT | HTTP | http://10.10.0.223:30000 |
| AI | OpenClaw Gateway | HTTP | http://10.10.0.224:18789 |
| Bots | discord-bots VM | Ping | 10.10.0.33 |
| Bots | sba-bots VM | Ping | 10.10.0.88 |
| Database | PostgreSQL (strat-database) | TCP | 10.10.0.42:5432 |
| External | Akamai NPM | HTTP | http://172.237.147.99 |

Notifications:

  • Discord webhook: "Discord - Homelab Alerts" (default, applied to all monitors)
  • Alerts on service down (after 3 retries at 30s intervals) and on recovery

API Access:

  • Python library: uptime-kuma-api (pip installed)
  • Connection: UptimeKumaApi("http://10.10.0.227:3001")
  • Used for programmatic monitor/notification management
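As a hedged sketch of programmatic monitor creation: the helper below builds the keyword arguments a script might pass to the uptime-kuma-api client's add_monitor() call, mirroring the 60-second interval and 3 retries above. Parameter names such as maxretries follow that library's conventions but should be verified against its documentation; the add_http_monitor wrapper and its api argument are assumptions for illustration.

```python
def monitor_kwargs(name: str, url: str, interval: int = 60, retries: int = 3) -> dict:
    """Build arguments for an HTTP monitor matching the defaults above."""
    return {
        "type": "http",          # uptime-kuma-api also accepts MonitorType.HTTP
        "name": name,
        "url": url,
        "interval": interval,    # seconds between checks
        "maxretries": retries,   # retries before the monitor goes down
    }

def add_http_monitor(api, name: str, url: str):
    """api: a logged-in UptimeKumaApi('http://10.10.0.227:3001') instance."""
    return api.add_monitor(**monitor_kwargs(name, url))
```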

Network and Service Monitoring

Purpose: Monitor critical infrastructure availability
Implementation:

```bash
# Service health check pattern
SERVICES="https://homelab.local http://nas.homelab.local"
for service in $SERVICES; do
    if curl -sSf --max-time 10 "$service" >/dev/null 2>&1; then
        echo "✅ $service: Available"
    else
        # send_alert is a site-specific notifier helper (e.g. Discord webhook wrapper)
        echo "❌ $service: Failed" | send_alert
    fi
done
```

Automation Patterns

Cron-Based Scheduling

Pattern: Regular health checks with intelligent alerting

```bash
# Monitoring schedule examples
*/20 * * * * /path/to/tdarr-timeout-monitor.sh    # Every 20 minutes
0 */6 * * * /path/to/cleanup-temp-dirs.sh         # Every 6 hours
0 2 * * * /path/to/backup-monitor.sh              # Daily at 2 AM
```

Event-Driven Monitoring

Pattern: Reactive monitoring for critical events

  • System Startup: Windows boot detection
  • Service Failures: Container restart alerts
  • Resource Exhaustion: Disk space warnings
  • Security Events: Failed login attempts

Data Collection and Analysis

Log Management

Pattern: Centralized logging with rotation

```bash
# Log rotation configuration
LOG_FILE="/var/log/homelab-monitor.log"
MAX_SIZE_BYTES=$((10 * 1024 * 1024))   # 10M
RETENTION_DAYS=30

# Rotate the log when it exceeds the size limit
if [ -f "$LOG_FILE" ] && [ "$(stat -c%s "$LOG_FILE")" -gt "$MAX_SIZE_BYTES" ]; then
    mv "$LOG_FILE" "$LOG_FILE.$(date +%Y%m%d)"
    touch "$LOG_FILE"
fi

# Prune rotated copies older than the retention window
find "$(dirname "$LOG_FILE")" -name "$(basename "$LOG_FILE").*" -mtime +"$RETENTION_DAYS" -delete
```

Metrics Collection

Pattern: Time-series data for trend analysis

  • System Metrics: CPU, memory, disk usage
  • Service Metrics: Response times, error rates
  • Application Metrics: Transcoding progress, queue sizes
  • Network Metrics: Bandwidth usage, latency
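A minimal stdlib-only snapshot of the system metrics listed above (a richer collector would use a library such as psutil; the field names here are illustrative):

```python
import os
import shutil
import time

def system_metrics() -> dict:
    """Point-in-time system metrics using only the standard library (POSIX)."""
    disk = shutil.disk_usage("/")
    load1, load5, load15 = os.getloadavg()   # 1/5/15-minute load averages
    return {
        "ts": time.time(),
        "disk_used_pct": round(100 * disk.used / disk.total, 1),
        "load_1m": load1,
        "load_5m": load5,
        "load_15m": load15,
    }
```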

Alert Integration

Discord Notification System

Pattern: Rich, actionable notifications

```text
# Professional alert format
**🔧 System Maintenance**
Service: Tdarr Transcoding
Issue: 3 files timed out in staging
Resolution: Automatic cleanup completed
Status: System operational

Manual review recommended <@user_id>
```

Alert Escalation

Pattern: Tiered alerting based on severity

  1. Info: Routine maintenance completed
  2. Warning: Service degradation detected
  3. Critical: Service failure requiring immediate attention
  4. Emergency: System-wide failure requiring manual intervention
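The four tiers can be mapped to Discord webhook payloads along these lines; the embed colors and the choice to @-mention only at critical and above are illustrative assumptions, not values from the source.

```python
# Illustrative tier table: embed color + whether to ping (assumed values).
TIERS = {
    "info":      {"color": 0x2ECC71, "mention": False},
    "warning":   {"color": 0xF1C40F, "mention": False},
    "critical":  {"color": 0xE74C3C, "mention": True},
    "emergency": {"color": 0x992D22, "mention": True},
}

def embed_for(tier: str, message: str, user_id: str = "user_id") -> dict:
    """Build a Discord webhook payload for the given severity tier."""
    t = TIERS[tier]
    payload = {"embeds": [{"title": tier.title(),
                           "description": message,
                           "color": t["color"]}]}
    if t["mention"]:
        payload["content"] = f"<@{user_id}>"   # ping only at critical+
    return payload
```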

Best Practices Implementation

Monitoring Strategy

  1. Proactive: Monitor trends to predict issues
  2. Reactive: Alert on current failures
  3. Preventive: Automated cleanup and maintenance
  4. Comprehensive: Cover all critical services
  5. Actionable: Provide clear resolution paths

Performance Optimization

  1. Efficient Polling: Balance monitoring frequency with resource usage
  2. Smart Alerting: Avoid alert fatigue with intelligent filtering
  3. Resource Management: Monitor the monitoring system itself
  4. Scalable Architecture: Design for growth and additional services

This technology context provides the foundation for implementing comprehensive monitoring and alerting in home lab environments.