---
title: "Monitoring and Alerting Overview"
description: "Architecture overview of the homelab monitoring system including Uptime Kuma, Claude Runner (CT 302) two-tier health checks, Tdarr monitoring, Windows desktop monitoring, and n8n master/sub-workflow orchestration."
type: context
domain: monitoring
tags: [uptime-kuma, claude-runner, n8n, discord, healthcheck, tdarr, windows, infrastructure]
---
# System Monitoring and Alerting - Technology Context
## Overview
Comprehensive monitoring and alerting system for home lab infrastructure with focus on automated health checks, Discord notifications, and proactive system maintenance.
## Architecture Patterns
### Distributed Monitoring Strategy
**Pattern**: Service-specific monitoring with centralized alerting
- **Uptime Kuma**: Centralized service uptime and health monitoring (status page)
- **Claude Runner (CT 302)**: SSH-based server diagnostics with two-tier auto-remediation
- **Tdarr Monitoring**: API-based transcoding health checks
- **Windows Desktop Monitoring**: Reboot detection and system events
- **Network Monitoring**: Connectivity and service availability
- **Container Monitoring**: Docker/Podman health and resource usage
### AI Infrastructure LXCs (301–302)
#### Claude Discord Coordinator — CT 301 (10.10.0.147)
**Purpose**: Discord bot coordination with read-only KB search MCP access
**SSH alias**: `claude-discord-coordinator`
#### Claude Runner — CT 302 (10.10.0.148)
**Purpose**: Automated server health monitoring with AI-escalated remediation
**Repo**: `cal/claude-runner-monitoring` on Gitea (cloned to `/root/.claude` on CT 302)
**Docs**: `monitoring/server-diagnostics/CONTEXT.md`
**Two-tier system:**
- **Tier 1** (`health_check.py`): Pure Python, runs every 5 min via n8n. Checks containers, systemd services, disk/memory/load. Auto-restarts containers when allowed. Exit 0=healthy, 1=auto-fixed, 2=needs Claude.
- **Tier 2** (`client.py`): Full diagnostic toolkit used by Claude during escalation sessions.
**Monitored servers** (dynamic from `config.yaml`):
| Server Key | IP | SSH User | Services | Critical |
|---|---|---|---|---|
| arr-stack | 10.10.0.221 | root | sonarr, radarr, readarr, lidarr, jellyseerr, sabnzbd | Yes |
| gitea | 10.10.0.225 | root | gitea (systemd), gitea-runner (docker) | Yes |
| uptime-kuma | 10.10.0.227 | root | uptime-kuma | Yes |
| n8n | 10.10.0.210 | root | n8n (no restart), n8n-postgres, omni-tools, termix | Yes |
| ubuntu-manticore | 10.10.0.226 | cal | jellyfin, tdarr-server, tdarr-node, pihole, watchstate, orbital-sync | Yes |
| strat-database | 10.10.0.42 | cal | sba_postgres, sba_redis, sba_db_api, dev_pd_database, sba_adminer | No (dev) |
| pihole1 | 10.10.0.16 | cal | pihole (primary DNS), nginx-proxy-manager, portainer | Yes |
| sba-bots | 10.10.0.88 | cal | paper-dynasty bot, paper-dynasty DB (prod), PD adminer, sba-website, ghost | Yes |
| foundry | 10.10.0.223 | root | foundry-foundry-1 (Foundry VTT) | No |
**Per-server SSH user:** `health_check.py` supports a per-server `ssh_user` override in config.yaml (default: root). Used by ubuntu-manticore, strat-database, pihole1, and sba-bots, which require the `cal` user.
**SSH keys:** n8n uses `n8n_runner_key` → CT 302, CT 302 uses `homelab_rsa` → target servers
**Helper script:** `/root/.claude/skills/server-diagnostics/list_servers.sh` — extracts server keys from config.yaml as JSON array
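The helper's behavior can be sketched in Python. The config.yaml layout below (a top-level `servers:` mapping with two-space-indented server keys) is an assumption for illustration; the real file on CT 302 may differ.

```python
import json
import re

def list_server_keys(config_text: str) -> str:
    """Minimal stand-in for list_servers.sh: return the top-level keys
    under `servers:` as a JSON array string (assumes two-space indent)."""
    keys, in_servers = [], False
    for line in config_text.splitlines():
        if re.match(r"^servers:\s*$", line):
            in_servers = True
            continue
        if in_servers:
            m = re.match(r"^  (\w[\w-]*):", line)
            if m:
                keys.append(m.group(1))          # a server entry
            elif line and not line.startswith(" "):
                break                            # left the servers: block
    return json.dumps(keys)

sample = """servers:
  arr-stack:
    ip: 10.10.0.221
  gitea:
    ip: 10.10.0.225
other: true
"""
print(list_server_keys(sample))  # → ["arr-stack", "gitea"]
```

The shell script presumably does the equivalent with `yq` or grep; the point is that n8n only ever sees the JSON array, so config.yaml stays the single source of truth.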
#### n8n Workflow Architecture (Master + Sub-workflow)
The monitoring uses a master/sub-workflow pattern in n8n. Adding or removing servers only requires editing `config.yaml` on CT 302 — no n8n changes needed.
**Master: "Server Health Monitor - Claude Code"** (`p7XmW23SgCs3hEkY`, active)
```
Schedule (every 5 min)
→ SSH to CT 302: list_servers.sh → JSON array of all server keys in config.yaml (e.g. ["arr-stack", "gitea", "uptime-kuma", "n8n", ...])
→ Code: split JSON array into one item per server_key
→ Execute Sub-workflow (mode: "each") → "Server Health Check"
→ Code: aggregate results (healthy/remediated/escalated counts)
→ If any escalations → Discord summary embed
```
**Sub-workflow: "Server Health Check"** (`BhzYmWr6NcIDoioy`, active)
```
Execute Workflow Trigger (receives { server_key: "arr-stack" })
→ SSH to CT 302: health_check.py --server {server_key}
→ Code: parse JSON output (status, exit_code, issues, escalations)
→ If exit_code == 2 → SSH: remediate.sh (escalation JSON)
→ Return results to parent (server_key, status, issues, remediation_output)
```
**Exit code behavior:**
- `0` (healthy): No action, aggregated in summary
- `1` (auto-remediated): Script already handled it + sent Discord via notifier.py — n8n takes no action
- `2` (needs escalation): Sub-workflow runs `remediate.sh`, master sends Discord summary
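The master's aggregation step maps those exit codes to statuses and decides whether a summary embed is warranted. This is a hypothetical Python re-implementation for clarity; the real logic lives in an n8n Code node.

```python
from collections import Counter

LABELS = {0: "healthy", 1: "remediated", 2: "escalated"}

def summarize(results):
    """results: one dict per server, e.g. {"server_key": "gitea", "exit_code": 0}.
    Returns per-status counts plus whether a Discord summary should be sent."""
    counts = Counter(LABELS[r["exit_code"]] for r in results)
    summary = {label: counts.get(label, 0) for label in LABELS.values()}
    # Only escalations (exit 2) trigger the master's Discord summary;
    # exit 1 already notified via notifier.py, exit 0 needs no action.
    summary["send_discord_summary"] = summary["escalated"] > 0
    return summary

print(summarize([
    {"server_key": "arr-stack", "exit_code": 0},
    {"server_key": "gitea", "exit_code": 1},
    {"server_key": "n8n", "exit_code": 2},
]))
# → {'healthy': 1, 'remediated': 1, 'escalated': 1, 'send_discord_summary': True}
```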
**Pre-escalation notification:** `remediate.sh` sends a Discord warning embed ("Claude API Escalation Triggered") via `notifier.py` *before* invoking the Claude CLI, so Cal gets a heads-up that API charges are about to be incurred.
**SSH credential:** `SSH Private Key account` (id: `QkbHQ8JmYimUoTcM`)
**Discord webhook:** Homelab Alerts channel
### Alert Management
**Pattern**: Structured notifications with actionable information
```bash
# Discord webhook integration
curl -X POST "$DISCORD_WEBHOOK" \
  -H "Content-Type: application/json" \
  -d '{
    "content": "**System Alert**\n```\nService: Tdarr\nIssue: Staging timeout\nAction: Automatic cleanup performed\n```\n<@user_id>"
  }'
```
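The same payload can be built programmatically. This is an illustrative helper, not notifier.py's actual interface (which is not documented here); the function name and fields are assumptions.

```python
import json

def build_alert_payload(service, issue, action, mention_id=None):
    """Build the Discord webhook JSON body for a structured alert."""
    content = (
        f"**System Alert**\n```\n"
        f"Service: {service}\nIssue: {issue}\nAction: {action}\n```"
    )
    if mention_id:
        content += f"\n<@{mention_id}>"   # ping only when a user ID is given
    return json.dumps({"content": content})

print(build_alert_payload("Tdarr", "Staging timeout", "Automatic cleanup performed"))
```

Keeping payload construction in one place makes it easy to enforce the code-block formatting and mention policy across every alerter.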
## Core Monitoring Components
### Tdarr System Monitoring
**Purpose**: Monitor transcoding pipeline health and performance
**Location**: `scripts/tdarr_monitor.py`
**Key Features**:
- API-based status monitoring with dataclass structures
- Staging section timeout detection and cleanup
- Discord notifications with professional formatting
- Log rotation and retention management
### Windows Desktop Monitoring
**Purpose**: Track Windows system reboots and power events
**Location**: `scripts/windows-desktop/`
**Components**:
- PowerShell monitoring script
- Scheduled task automation
- Discord notification integration
- System event correlation
### Uptime Kuma (Centralized Uptime Monitoring)
**Purpose**: Centralized service uptime, health checks, and status page for all homelab services
**Location**: LXC 227 (10.10.0.227), Docker container
**URL**: https://status.manticorum.com (internal: http://10.10.0.227:3001)
**Key Features**:
- HTTP/HTTPS, TCP, DNS, Docker, and ping monitoring
- Discord notification integration (default alert channel for all monitors)
- Public status page at https://status.manticorum.com
- Multi-protocol health checks at 60-second intervals with 3 retries
- Certificate expiration monitoring
**Infrastructure**:
- Proxmox LXC 227, Ubuntu 22.04, 2 cores, 2GB RAM, 8GB disk
- Docker with AppArmor unconfined (required for Docker-in-LXC)
- Data persisted via Docker named volume (`uptime-kuma-data`)
- Compose config: `server-configs/uptime-kuma/docker-compose/uptime-kuma/`
- SSH alias: `uptime-kuma`
- Admin credentials: username `cal`, password in `~/.claude/secrets/kuma_web_password`
**Active Monitors (20)**:
| Tag | Monitor | Type | Target |
|-----|---------|------|--------|
| Infrastructure | Proxmox VE | HTTP | https://10.10.0.11:8006 |
| Infrastructure | Home Assistant | HTTP | http://10.0.0.28:8123 |
| DNS | Pi-hole Primary DNS | DNS | 10.10.0.16:53 |
| DNS | Pi-hole Secondary DNS | DNS | 10.10.0.226:53 |
| Media | Jellyfin | HTTP | http://10.10.0.226:8096 |
| Media | Tdarr | HTTP | http://10.10.0.226:8265 |
| Media | Sonarr | HTTP | http://10.10.0.221:8989 |
| Media | Radarr | HTTP | http://10.10.0.221:7878 |
| Media | Jellyseerr | HTTP | http://10.10.0.221:5055 |
| DevOps | Gitea | HTTP | http://10.10.0.225:3000 |
| DevOps | n8n | HTTP | http://10.10.0.210:5678 |
| Networking | NPM Local (Admin) | HTTP | http://10.10.0.16:81 |
| Networking | Pi-hole Primary Web | HTTP | http://10.10.0.16:81/admin |
| Networking | Pi-hole Secondary Web | HTTP | http://10.10.0.226:8053/admin |
| Gaming | Foundry VTT | HTTP | http://10.10.0.223:30000 |
| AI | OpenClaw Gateway | HTTP | http://10.10.0.224:18789 |
| Bots | discord-bots VM | Ping | 10.10.0.33 |
| Bots | sba-bots VM | Ping | 10.10.0.88 |
| Database | PostgreSQL (strat-database) | TCP | 10.10.0.42:5432 |
| External | Akamai NPM | HTTP | http://172.237.147.99 |
**Notifications**:
- Discord webhook: "Discord - Homelab Alerts" (default, applied to all monitors)
- Alerts on service down (after 3 retries at 30s intervals) and on recovery
**API Access**:
- Python library: `uptime-kuma-api` (pip installed)
- Connection: `UptimeKumaApi("http://10.10.0.227:3001")`
- Used for programmatic monitor/notification management
### Network and Service Monitoring
**Purpose**: Monitor critical infrastructure availability
**Implementation**:
```bash
# Service health check pattern
SERVICES="https://homelab.local http://nas.homelab.local"
for service in $SERVICES; do
  if curl -sSf --max-time 10 "$service" >/dev/null 2>&1; then
    echo "✅ $service: Available"
  else
    # send_alert: local helper that forwards the message to Discord
    echo "❌ $service: Failed" | send_alert
  fi
done
```
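The same pattern in Python, with the HTTP opener made injectable so the check can be unit-tested without a live network. This is a sketch of the pattern, not an existing script in the repo.

```python
import urllib.request

def check_service(url: str, timeout: int = 10,
                  opener=urllib.request.urlopen) -> bool:
    """Return True if the service answers an HTTP request within the timeout."""
    try:
        opener(url, timeout=timeout)
        return True
    except OSError:   # urllib.error.URLError subclasses OSError, so this covers both
        return False
```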
## Automation Patterns
### Cron-Based Scheduling
**Pattern**: Regular health checks with intelligent alerting
```bash
# Monitoring schedule examples
*/20 * * * * /path/to/tdarr-timeout-monitor.sh # Every 20 minutes
0 */6 * * * /path/to/cleanup-temp-dirs.sh # Every 6 hours
0 2 * * * /path/to/backup-monitor.sh # Daily at 2 AM
```
### Event-Driven Monitoring
**Pattern**: Reactive monitoring for critical events
- **System Startup**: Windows boot detection
- **Service Failures**: Container restart alerts
- **Resource Exhaustion**: Disk space warnings
- **Security Events**: Failed login attempts
## Data Collection and Analysis
### Log Management
**Pattern**: Centralized logging with rotation
```bash
# Log rotation configuration
LOG_FILE="/var/log/homelab-monitor.log"
MAX_SIZE_BYTES=$((10 * 1024 * 1024))   # 10M
RETENTION_DAYS=30

# Rotate the log when it exceeds the size limit
if [ "$(stat -c%s "$LOG_FILE")" -gt "$MAX_SIZE_BYTES" ]; then
  mv "$LOG_FILE" "$LOG_FILE.$(date +%Y%m%d)"
  touch "$LOG_FILE"
fi

# Prune rotated copies older than the retention window
find "$(dirname "$LOG_FILE")" -name "$(basename "$LOG_FILE").*" \
  -mtime +"$RETENTION_DAYS" -delete
```
### Metrics Collection
**Pattern**: Time-series data for trend analysis
- **System Metrics**: CPU, memory, disk usage
- **Service Metrics**: Response times, error rates
- **Application Metrics**: Transcoding progress, queue sizes
- **Network Metrics**: Bandwidth usage, latency
## Alert Integration
### Discord Notification System
**Pattern**: Rich, actionable notifications
```markdown
# Professional alert format
**🔧 System Maintenance**
Service: Tdarr Transcoding
Issue: 3 files timed out in staging
Resolution: Automatic cleanup completed
Status: System operational
Manual review recommended <@user_id>
```
### Alert Escalation
**Pattern**: Tiered alerting based on severity
1. **Info**: Routine maintenance completed
2. **Warning**: Service degradation detected
3. **Critical**: Service failure requiring immediate attention
4. **Emergency**: System-wide failure requiring manual intervention
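The four tiers can be encoded as a severity enum driving embed color and mention policy. The colors and the "only ping on critical/emergency" rule are illustrative choices, not values taken from the real notifier.

```python
from enum import IntEnum

class Severity(IntEnum):
    INFO = 1
    WARNING = 2
    CRITICAL = 3
    EMERGENCY = 4

# Hypothetical per-tier embed color and mention policy
TIER_POLICY = {
    Severity.INFO:      {"color": 0x2ECC71, "mention": False},
    Severity.WARNING:   {"color": 0xF1C40F, "mention": False},
    Severity.CRITICAL:  {"color": 0xE74C3C, "mention": True},
    Severity.EMERGENCY: {"color": 0x992D22, "mention": True},
}

def should_ping(sev: Severity) -> bool:
    """Only the top two tiers mention a user, to limit alert fatigue."""
    return TIER_POLICY[sev]["mention"]
```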
## Best Practices Implementation
### Monitoring Strategy
1. **Proactive**: Monitor trends to predict issues
2. **Reactive**: Alert on current failures
3. **Preventive**: Automated cleanup and maintenance
4. **Comprehensive**: Cover all critical services
5. **Actionable**: Provide clear resolution paths
### Performance Optimization
1. **Efficient Polling**: Balance monitoring frequency with resource usage
2. **Smart Alerting**: Avoid alert fatigue with intelligent filtering
3. **Resource Management**: Monitor the monitoring system itself
4. **Scalable Architecture**: Design for growth and additional services
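The "smart alerting" point above can be implemented with a simple cooldown-based deduper: repeat alerts for the same (service, issue) pair are suppressed until a window elapses. This is a sketch with an injectable clock for testability, not code from the repo.

```python
import time

class AlertDeduper:
    """Suppress repeat alerts for the same (service, issue) within a cooldown."""

    def __init__(self, cooldown_s: float = 1800, clock=time.monotonic):
        self.cooldown_s = cooldown_s
        self.clock = clock
        self._last_sent: dict[tuple[str, str], float] = {}

    def should_send(self, service: str, issue: str) -> bool:
        key = (service, issue)
        now = self.clock()
        last = self._last_sent.get(key)
        if last is not None and now - last < self.cooldown_s:
            return False            # still inside the cooldown window
        self._last_sent[key] = now  # record send time and allow the alert
        return True
```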
This technology context provides the foundation for implementing comprehensive monitoring and alerting in home lab environments.