System Monitoring and Alerting - Technology Context
Overview
Comprehensive monitoring and alerting system for home lab infrastructure with focus on automated health checks, Discord notifications, and proactive system maintenance.
Architecture Patterns
Distributed Monitoring Strategy
Pattern: Service-specific monitoring with centralized alerting
- Uptime Kuma: Centralized service uptime and health monitoring (status page)
- Claude Runner (CT 302): SSH-based server diagnostics with two-tier auto-remediation
- Tdarr Monitoring: API-based transcoding health checks
- Windows Desktop Monitoring: Reboot detection and system events
- Network Monitoring: Connectivity and service availability
- Container Monitoring: Docker/Podman health and resource usage
AI Infrastructure LXCs (301–303)
Claude Discord Coordinator — CT 301 (10.10.0.147)
Purpose: Discord bot coordination with read-only cognitive memory MCP access
SSH alias: claude-discord-coordinator
Claude Runner — CT 302 (10.10.0.148)
Purpose: Automated server health monitoring with AI-escalated remediation
Repo: cal/claude-runner-monitoring on Gitea (cloned to /root/.claude on CT 302)
Docs: monitoring/server-diagnostics/CONTEXT.md
MCP Gateway — CT 303 (10.10.0.231)
Purpose: Docker MCP Gateway — centralized MCP server proxy for Claude Code
SSH alias: mcp-gateway
Service: docker/mcp-gateway v2.0.1 on port 8811 (http://10.10.0.231:8811/mcp)
MCP servers: n8n (23 tools, connects to LXC 210), Gitea (109 tools, connects to git.manticorum.com)
Config: /home/cal/mcp-gateway/ — secrets.env, config/config.yaml, config/registry.yaml, gitea-catalog.yaml
Note: Uses --servers flag for static server activation (Docker Desktop secret store unavailable on headless Engine). Custom catalog adds Gitea MCP via docker.gitea.com/gitea-mcp-server image.
Two-tier system:
- Tier 1 (health_check.py): Pure Python, runs every 5 min via n8n. Checks containers, systemd services, disk/memory/load. Auto-restarts containers when allowed. Exit 0=healthy, 1=auto-fixed, 2=needs Claude.
- Tier 2 (client.py): Full diagnostic toolkit used by Claude during escalation sessions.
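The Tier 1 exit-code contract can be sketched as follows. This is an illustrative stand-in, not the real health_check.py: the allow-list of restartable services actually lives in config.yaml, and the restart itself would shell out to the container runtime.

```python
RESTARTABLE = {"sonarr", "radarr"}  # hypothetical allow-list; the real one lives in config.yaml

def check_server(failed_services):
    """Return the Tier 1 exit code: 0 healthy, 1 auto-fixed, 2 needs Claude."""
    fixed, unresolved = [], []
    for svc in failed_services:
        if svc in RESTARTABLE:
            # Tier 1 would restart the container here (e.g. `docker restart <svc>`)
            fixed.append(svc)
        else:
            unresolved.append(svc)
    if unresolved:
        return 2  # escalate to Tier 2 (Claude session)
    return 1 if fixed else 0

print(check_server([]))          # → 0
print(check_server(["sonarr"]))  # → 1
print(check_server(["n8n"]))     # → 2
```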
Monitored servers (dynamic from config.yaml):
| Server Key | IP | SSH User | Services | Critical |
|---|---|---|---|---|
| arr-stack | 10.10.0.221 | root | sonarr, radarr, readarr, lidarr, jellyseerr, sabnzbd | Yes |
| gitea | 10.10.0.225 | root | gitea (systemd), gitea-runner (docker) | Yes |
| uptime-kuma | 10.10.0.227 | root | uptime-kuma | Yes |
| n8n | 10.10.0.210 | root | n8n (no restart), n8n-postgres, omni-tools, termix | Yes |
| ubuntu-manticore | 10.10.0.226 | cal | jellyfin, tdarr-server, tdarr-node, pihole, watchstate, orbital-sync | Yes |
| strat-database | 10.10.0.42 | cal | sba_postgres, sba_redis, sba_db_api, dev_pd_database, sba_adminer | No (dev) |
| pihole1 | 10.10.0.16 | cal | pihole (primary DNS), nginx-proxy-manager, portainer | Yes |
| sba-bots | 10.10.0.88 | cal | paper-dynasty bot, paper-dynasty DB (prod), PD adminer, sba-website, ghost | Yes |
| foundry | 10.10.0.223 | root | foundry-foundry-1 (Foundry VTT) | No |
Per-server SSH user: health_check.py supports a per-server ssh_user override in config.yaml (default: root). Used by ubuntu-manticore, strat-database, pihole1, and sba-bots, which require the cal user.
SSH keys: n8n uses n8n_runner_key → CT 302, CT 302 uses homelab_rsa → target servers
Helper script: /root/.claude/skills/server-diagnostics/list_servers.sh — extracts server keys from config.yaml as JSON array
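A hedged Python equivalent of that helper (the real list_servers.sh is a shell script; this sketch assumes config.yaml keeps servers as a top-level `servers:` mapping and parses it naively by indentation rather than with a YAML library):

```python
import json
import re

def server_keys(config_text):
    """Extract second-level keys under 'servers:' from a YAML snippet (naive, indentation-based)."""
    keys, in_servers = [], False
    for line in config_text.splitlines():
        if re.match(r"^servers:\s*$", line):
            in_servers = True
            continue
        if in_servers:
            m = re.match(r"^  (\w[\w-]*):", line)
            if m:
                keys.append(m.group(1))
            elif line and not line.startswith(" "):
                in_servers = False  # left the servers: block
    return keys

sample = """servers:
  arr-stack:
    ip: 10.10.0.221
  gitea:
    ip: 10.10.0.225
"""
print(json.dumps(server_keys(sample)))  # → ["arr-stack", "gitea"]
```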
n8n Workflow Architecture (Master + Sub-workflow)
The monitoring uses a master/sub-workflow pattern in n8n. Adding or removing servers only requires editing config.yaml on CT 302 — no n8n changes needed.
Master: "Server Health Monitor - Claude Code" (p7XmW23SgCs3hEkY, active)
Schedule (every 5 min)
→ SSH to CT 302: list_servers.sh → ["arr-stack", "gitea", "uptime-kuma", "n8n", "ubuntu-manticore", "strat-database"]
→ Code: split JSON array into one item per server_key
→ Execute Sub-workflow (mode: "each") → "Server Health Check"
→ Code: aggregate results (healthy/remediated/escalated counts)
→ If any escalations → Discord summary embed
Sub-workflow: "Server Health Check" (BhzYmWr6NcIDoioy, active)
Execute Workflow Trigger (receives { server_key: "arr-stack" })
→ SSH to CT 302: health_check.py --server {server_key}
→ Code: parse JSON output (status, exit_code, issues, escalations)
→ If exit_code == 2 → SSH: remediate.sh (escalation JSON)
→ Return results to parent (server_key, status, issues, remediation_output)
Exit code behavior:
- 0 (healthy): No action, aggregated in summary
- 1 (auto-remediated): Script already handled it + sent Discord via notifier.py — n8n takes no action
- 2 (needs escalation): Sub-workflow runs remediate.sh, master sends Discord summary
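The master workflow's aggregation step can be sketched like this. It is a Python stand-in (the actual n8n Code node runs JavaScript); the status names match the healthy/remediated/escalated counts described above.

```python
from collections import Counter

def summarize(results):
    """Tally per-server results and flag whether the master should post a Discord summary."""
    counts = Counter(r["status"] for r in results)
    return {
        "healthy": counts.get("healthy", 0),
        "remediated": counts.get("remediated", 0),
        "escalated": counts.get("escalated", 0),
        "notify": counts.get("escalated", 0) > 0,  # Discord embed only on escalations
    }

print(summarize([
    {"server_key": "arr-stack", "status": "healthy"},
    {"server_key": "gitea", "status": "escalated"},
]))
```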
Pre-escalation notification: remediate.sh sends a Discord warning embed ("Claude API Escalation Triggered") via notifier.py before invoking the Claude CLI, so Cal gets a heads-up that API charges are about to be incurred.
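A sketch of that warning in the style of notifier.py. Only the embed title comes from the setup above; the field layout, colour, and webhook handling are assumptions, and the POST is left commented out since it needs a live webhook URL.

```python
import json
import urllib.request

def escalation_embed(server_key, issues):
    """Build a Discord warning embed announcing an imminent Claude API escalation."""
    return {
        "embeds": [{
            "title": "Claude API Escalation Triggered",
            "description": f"health_check.py exited 2 for `{server_key}`; Claude CLI is about to run.",
            "color": 0xFFA500,  # orange = warning (assumed convention)
            "fields": [{"name": "Issues", "value": "\n".join(issues) or "unknown"}],
        }]
    }

def post(webhook_url, payload):
    """POST a JSON payload to a Discord webhook."""
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

# post(WEBHOOK_URL, escalation_embed("arr-stack", ["sonarr down"]))  # not run here
```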
SSH credential: SSH Private Key account (id: QkbHQ8JmYimUoTcM)
Discord webhook: Homelab Alerts channel
Alert Management
Pattern: Structured notifications with actionable information
# Discord webhook integration
curl -X POST "$DISCORD_WEBHOOK" \
-H "Content-Type: application/json" \
-d '{
"content": "**System Alert**\n```\nService: Tdarr\nIssue: Staging timeout\nAction: Automatic cleanup performed\n```\n<@user_id>"
}'
Core Monitoring Components
Tdarr System Monitoring
Purpose: Monitor transcoding pipeline health and performance
Location: scripts/tdarr_monitor.py
Key Features:
- API-based status monitoring with dataclass structures
- Staging section timeout detection and cleanup
- Discord notifications with professional formatting
- Log rotation and retention management
Windows Desktop Monitoring
Purpose: Track Windows system reboots and power events
Location: scripts/windows-desktop/
Components:
- PowerShell monitoring script
- Scheduled task automation
- Discord notification integration
- System event correlation
Uptime Kuma (Centralized Uptime Monitoring)
Purpose: Centralized service uptime, health checks, and status page for all homelab services
Location: LXC 227 (10.10.0.227), Docker container
URL: https://status.manticorum.com (internal: http://10.10.0.227:3001)
Key Features:
- HTTP/HTTPS, TCP, DNS, Docker, and ping monitoring
- Discord notification integration (default alert channel for all monitors)
- Public status page at https://status.manticorum.com
- Multi-protocol health checks at 60-second intervals with 3 retries
- Certificate expiration monitoring
Infrastructure:
- Proxmox LXC 227, Ubuntu 22.04, 2 cores, 2GB RAM, 8GB disk
- Docker with AppArmor unconfined (required for Docker-in-LXC)
- Data persisted via Docker named volume (uptime-kuma-data)
- Compose config: server-configs/uptime-kuma/docker-compose/uptime-kuma/
- SSH alias: uptime-kuma
- Admin credentials: username cal, password in ~/.claude/secrets/kuma_web_password
Active Monitors (20):
| Tag | Monitor | Type | Target |
|---|---|---|---|
| Infrastructure | Proxmox VE | HTTP | https://10.10.0.11:8006 |
| Infrastructure | Home Assistant | HTTP | http://10.0.0.28:8123 |
| DNS | Pi-hole Primary DNS | DNS | 10.10.0.16:53 |
| DNS | Pi-hole Secondary DNS | DNS | 10.10.0.226:53 |
| Media | Jellyfin | HTTP | http://10.10.0.226:8096 |
| Media | Tdarr | HTTP | http://10.10.0.226:8265 |
| Media | Sonarr | HTTP | http://10.10.0.221:8989 |
| Media | Radarr | HTTP | http://10.10.0.221:7878 |
| Media | Jellyseerr | HTTP | http://10.10.0.221:5055 |
| DevOps | Gitea | HTTP | http://10.10.0.225:3000 |
| DevOps | n8n | HTTP | http://10.10.0.210:5678 |
| Networking | NPM Local (Admin) | HTTP | http://10.10.0.16:81 |
| Networking | Pi-hole Primary Web | HTTP | http://10.10.0.16:81/admin |
| Networking | Pi-hole Secondary Web | HTTP | http://10.10.0.226:8053/admin |
| Gaming | Foundry VTT | HTTP | http://10.10.0.223:30000 |
| AI | OpenClaw Gateway | HTTP | http://10.10.0.224:18789 |
| Bots | discord-bots VM | Ping | 10.10.0.33 |
| Bots | sba-bots VM | Ping | 10.10.0.88 |
| Database | PostgreSQL (strat-database) | TCP | 10.10.0.42:5432 |
| External | Akamai NPM | HTTP | http://172.237.147.99 |
Notifications:
- Discord webhook: "Discord - Homelab Alerts" (default, applied to all monitors)
- Alerts on service down (after 3 retries at 30s intervals) and on recovery
API Access:
- Python library: uptime-kuma-api (pip installed)
- Connection: UptimeKumaApi("http://10.10.0.227:3001")
- Used for programmatic monitor/notification management
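A minimal sketch of that programmatic access. The connection URL and secrets path come from the sections above; the filtering helper and its assumption that each monitor dict carries `name` and `active` keys (as returned by get_monitors()) are illustrative. The live calls are kept inside main() and not invoked here.

```python
def enabled_monitor_names(monitors):
    # Pure helper so the filtering logic is testable offline; assumes each
    # monitor dict has a 'name' and an 'active' (enabled) flag.
    return sorted(m["name"] for m in monitors if m.get("active"))

def main():
    # Requires `pip install uptime-kuma-api` and a reachable instance.
    import os
    from uptime_kuma_api import UptimeKumaApi

    password = open(os.path.expanduser("~/.claude/secrets/kuma_web_password")).read().strip()
    api = UptimeKumaApi("http://10.10.0.227:3001")
    api.login("cal", password)
    print(enabled_monitor_names(api.get_monitors()))
    api.disconnect()

# main()  # not invoked here: needs live credentials and network access
```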
Network and Service Monitoring
Purpose: Monitor critical infrastructure availability
Implementation:
# Service health check pattern
# (send_alert is a user-supplied notifier reading stdin, e.g. a Discord webhook POST)
SERVICES="https://homelab.local http://nas.homelab.local"
for service in $SERVICES; do
  if curl -sSf --max-time 10 "$service" >/dev/null 2>&1; then
    echo "✅ $service: Available"
  else
    echo "❌ $service: Failed" | send_alert
  fi
done
Automation Patterns
Cron-Based Scheduling
Pattern: Regular health checks with intelligent alerting
# Monitoring schedule examples
*/20 * * * * /path/to/tdarr-timeout-monitor.sh # Every 20 minutes
0 */6 * * * /path/to/cleanup-temp-dirs.sh # Every 6 hours
0 2 * * * /path/to/backup-monitor.sh # Daily at 2 AM
Event-Driven Monitoring
Pattern: Reactive monitoring for critical events
- System Startup: Windows boot detection
- Service Failures: Container restart alerts
- Resource Exhaustion: Disk space warnings
- Security Events: Failed login attempts
Data Collection and Analysis
Log Management
Pattern: Centralized logging with rotation
# Log rotation configuration
LOG_FILE="/var/log/homelab-monitor.log"
MAX_BYTES=$((10*1024*1024))   # 10M
RETENTION_DAYS=30
# Rotate the log when the size threshold is exceeded
if [ "$(stat -c%s "$LOG_FILE")" -gt "$MAX_BYTES" ]; then
    mv "$LOG_FILE" "$LOG_FILE.$(date +%Y%m%d)"
    touch "$LOG_FILE"
fi
# Prune rotated logs older than the retention window
find "$(dirname "$LOG_FILE")" -name "$(basename "$LOG_FILE").*" -mtime +"$RETENTION_DAYS" -delete
Metrics Collection
Pattern: Time-series data for trend analysis
- System Metrics: CPU, memory, disk usage
- Service Metrics: Response times, error rates
- Application Metrics: Transcoding progress, queue sizes
- Network Metrics: Bandwidth usage, latency
Alert Integration
Discord Notification System
Pattern: Rich, actionable notifications
# Professional alert format
**🔧 System Maintenance**
Service: Tdarr Transcoding
Issue: 3 files timed out in staging
Resolution: Automatic cleanup completed
Status: System operational
Manual review recommended <@user_id>
Alert Escalation
Pattern: Tiered alerting based on severity
- Info: Routine maintenance completed
- Warning: Service degradation detected
- Critical: Service failure requiring immediate attention
- Emergency: System-wide failure requiring manual intervention
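The tiers above can be mapped to notification behaviour with a small routing table. The mention rules and retry counts here are illustrative defaults, not values from the deployed system:

```python
SEVERITY = {
    "info":      {"mention": False, "retries_before_alert": 3},
    "warning":   {"mention": False, "retries_before_alert": 3},
    "critical":  {"mention": True,  "retries_before_alert": 1},
    "emergency": {"mention": True,  "retries_before_alert": 0},
}

def format_alert(level, message, user_id="user_id"):
    """Prefix the message with its severity and append a Discord mention when warranted."""
    rule = SEVERITY[level]
    suffix = f" <@{user_id}>" if rule["mention"] else ""
    return f"[{level.upper()}] {message}{suffix}"

print(format_alert("critical", "Gitea unreachable"))  # → [CRITICAL] Gitea unreachable <@user_id>
print(format_alert("info", "Backup completed"))       # → [INFO] Backup completed
```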
Best Practices Implementation
Monitoring Strategy
- Proactive: Monitor trends to predict issues
- Reactive: Alert on current failures
- Preventive: Automated cleanup and maintenance
- Comprehensive: Cover all critical services
- Actionable: Provide clear resolution paths
Performance Optimization
- Efficient Polling: Balance monitoring frequency with resource usage
- Smart Alerting: Avoid alert fatigue with intelligent filtering
- Resource Management: Monitor the monitoring system itself
- Scalable Architecture: Design for growth and additional services
This technology context provides the foundation for implementing comprehensive monitoring and alerting in home lab environments.