---
title: "Monitoring and Alerting Overview"
description: "Architecture overview of the homelab monitoring system including Uptime Kuma, Claude Runner (CT 302) two-tier health checks, Tdarr monitoring, Windows desktop monitoring, and n8n master/sub-workflow orchestration."
type: context
domain: monitoring
tags: [uptime-kuma, claude-runner, n8n, discord, healthcheck, tdarr, windows, infrastructure]
---

# System Monitoring and Alerting - Technology Context

## Overview
Comprehensive monitoring and alerting system for home lab infrastructure, with a focus on automated health checks, Discord notifications, and proactive system maintenance.

## Architecture Patterns

### Distributed Monitoring Strategy
**Pattern**: Service-specific monitoring with centralized alerting
- **Uptime Kuma**: Centralized service uptime and health monitoring (status page)
- **Claude Runner (CT 302)**: SSH-based server diagnostics with two-tier auto-remediation
- **Tdarr Monitoring**: API-based transcoding health checks
- **Windows Desktop Monitoring**: Reboot detection and system events
- **Network Monitoring**: Connectivity and service availability
- **Container Monitoring**: Docker/Podman health and resource usage

### AI Infrastructure LXCs (301–302)

#### Claude Discord Coordinator — CT 301 (10.10.0.147)
**Purpose**: Discord bot coordination with read-only KB search MCP access
**SSH alias**: `claude-discord-coordinator`

#### Claude Runner — CT 302 (10.10.0.148)
**Purpose**: Automated server health monitoring with AI-escalated remediation
**Repo**: `cal/claude-runner-monitoring` on Gitea (cloned to `/root/.claude` on CT 302)
**Docs**: `monitoring/server-diagnostics/CONTEXT.md`

**Two-tier system:**
- **Tier 1** (`health_check.py`): Pure Python, runs every 5 min via n8n. Checks containers, systemd services, disk/memory/load. Auto-restarts containers when allowed. Exit 0 = healthy, 1 = auto-fixed, 2 = needs Claude.
- **Tier 2** (`client.py`): Full diagnostic toolkit used by Claude during escalation sessions.
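
The Tier 1 contract can be sketched in a few lines (a minimal illustration of the exit-code convention only; the check names and restart policy below are hypothetical, not taken from `health_check.py`):

```python
# Exit-code convention reported to n8n by the Tier 1 check:
# 0 = healthy, 1 = auto-fixed, 2 = needs Claude (escalate)
HEALTHY, AUTO_FIXED, NEEDS_CLAUDE = 0, 1, 2

def tier1_exit_code(results, allow_restart):
    """results: {check_name: passed}. Failed container checks are
    auto-restarted when allowed; anything else escalates."""
    failed = [name for name, ok in results.items() if not ok]
    if not failed:
        return HEALTHY
    if allow_restart and all(name.startswith("container:") for name in failed):
        # a restart_container(name) call would go here
        return AUTO_FIXED
    return NEEDS_CLAUDE

print(tier1_exit_code({"container:sonarr": False, "disk": True}, allow_restart=True))  # 1
```

n8n only branches on the three codes; the detailed findings travel as JSON on stdout.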

**Monitored servers** (dynamic from `config.yaml`):

| Server Key | IP | SSH User | Services | Critical |
|---|---|---|---|---|
| arr-stack | 10.10.0.221 | root | sonarr, radarr, readarr, lidarr, jellyseerr, sabnzbd | Yes |
| gitea | 10.10.0.225 | root | gitea (systemd), gitea-runner (docker) | Yes |
| uptime-kuma | 10.10.0.227 | root | uptime-kuma | Yes |
| n8n | 10.10.0.210 | root | n8n (no restart), n8n-postgres, omni-tools, termix | Yes |
| ubuntu-manticore | 10.10.0.226 | cal | jellyfin, tdarr-server, tdarr-node, pihole, watchstate, orbital-sync | Yes |
| strat-database | 10.10.0.42 | cal | sba_postgres, sba_redis, sba_db_api, dev_pd_database, sba_adminer | No (dev) |
| pihole1 | 10.10.0.16 | cal | pihole (primary DNS), nginx-proxy-manager, portainer | Yes |
| sba-bots | 10.10.0.88 | cal | paper-dynasty bot, paper-dynasty DB (prod), PD adminer, sba-website, ghost | Yes |
| foundry | 10.10.0.223 | root | foundry-foundry-1 (Foundry VTT) | No |

**Per-server SSH user:** `health_check.py` supports per-server `ssh_user` override in config.yaml (default: root). Used by ubuntu-manticore, strat-database, pihole1, and sba-bots which require `cal` user.
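
For illustration, a `config.yaml` entry consistent with the table above might look like this (field names are assumptions; the actual schema lives in the `cal/claude-runner-monitoring` repo):

```yaml
servers:
  ubuntu-manticore:
    ip: 10.10.0.226
    ssh_user: cal        # overrides the default root user
    critical: true
    services:
      - jellyfin
      - tdarr-server
```
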
**SSH keys:** n8n uses `n8n_runner_key` → CT 302, CT 302 uses `homelab_rsa` → target servers
**Helper script:** `/root/.claude/skills/server-diagnostics/list_servers.sh` — extracts server keys from config.yaml as JSON array
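
The helper's output shape can be reproduced in a few lines of Python (the inline dict stands in for the parsed `config.yaml`; the real script's implementation may differ):

```python
import json

# Stand-in for the parsed config.yaml (values elided)
config = {"servers": {"arr-stack": {}, "gitea": {}, "uptime-kuma": {}}}

def list_servers(cfg):
    """Emit the server keys as a JSON array, the shape the n8n
    master workflow splits into per-server items."""
    return json.dumps(list(cfg["servers"].keys()))

print(list_servers(config))  # ["arr-stack", "gitea", "uptime-kuma"]
```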

#### n8n Workflow Architecture (Master + Sub-workflow)

The monitoring uses a master/sub-workflow pattern in n8n. Adding or removing servers only requires editing `config.yaml` on CT 302; no n8n changes are needed.

**Master: "Server Health Monitor - Claude Code"** (`p7XmW23SgCs3hEkY`, active)
```
Schedule (every 5 min)
→ SSH to CT 302: list_servers.sh → JSON array of server keys (["arr-stack", "gitea", …])
→ Code: split JSON array into one item per server_key
→ Execute Sub-workflow (mode: "each") → "Server Health Check"
→ Code: aggregate results (healthy/remediated/escalated counts)
→ If any escalations → Discord summary embed
```
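
The aggregation step can be sketched as follows (Python for illustration only; n8n Code nodes run JavaScript, and the field names here are assumptions):

```python
from collections import Counter

def aggregate(results):
    """results: one dict per server from the sub-workflow.
    Returns summary counts plus which servers escalated."""
    counts = Counter(r["status"] for r in results)
    return {
        "healthy": counts.get("healthy", 0),
        "remediated": counts.get("remediated", 0),
        # a non-empty list here triggers the Discord summary embed
        "escalated": [r["server_key"] for r in results if r["status"] == "escalated"],
    }

summary = aggregate([
    {"server_key": "arr-stack", "status": "healthy"},
    {"server_key": "gitea", "status": "escalated"},
])
print(summary["escalated"])  # ['gitea']
```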

**Sub-workflow: "Server Health Check"** (`BhzYmWr6NcIDoioy`, active)
```
Execute Workflow Trigger (receives { server_key: "arr-stack" })
→ SSH to CT 302: health_check.py --server {server_key}
→ Code: parse JSON output (status, exit_code, issues, escalations)
→ If exit_code == 2 → SSH: remediate.sh (escalation JSON)
→ Return results to parent (server_key, status, issues, remediation_output)
```

**Exit code behavior:**
- `0` (healthy): No action, aggregated in summary
- `1` (auto-remediated): Script already handled it and sent Discord via `notifier.py`; n8n takes no action
- `2` (needs escalation): Sub-workflow runs `remediate.sh`, master sends Discord summary
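
The branching reduces to a small dispatch table (illustrative only; the real decisions live in the workflows above):

```python
def action_for(exit_code):
    """Map a health_check.py exit code to the orchestrator's next step."""
    actions = {
        0: "none",              # healthy: counted in the summary only
        1: "none",              # auto-remediated: script already sent Discord
        2: "run remediate.sh",  # escalate to Claude, then summarize in Discord
    }
    return actions.get(exit_code, "investigate")  # unknown codes need a human

print(action_for(2))  # run remediate.sh
```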

**Pre-escalation notification:** `remediate.sh` sends a Discord warning embed ("Claude API Escalation Triggered") via `notifier.py` *before* invoking the Claude CLI, so Cal gets a heads-up that API charges are about to be incurred.

**SSH credential:** `SSH Private Key account` (id: `QkbHQ8JmYimUoTcM`)
**Discord webhook:** Homelab Alerts channel

### Alert Management
**Pattern**: Structured notifications with actionable information
```bash
# Discord webhook integration
curl -X POST "$DISCORD_WEBHOOK" \
  -H "Content-Type: application/json" \
  -d '{
    "content": "**System Alert**\n```\nService: Tdarr\nIssue: Staging timeout\nAction: Automatic cleanup performed\n```\n<@user_id>"
  }'
```
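
The same payload can be assembled in Python; this sketch only builds the JSON body (posting it is a single `urllib.request` call, omitted here) and mirrors the curl example's layout:

```python
import json

def build_alert(service, issue, action, mention_id=None):
    """Build a Discord webhook payload matching the curl example."""
    body = (
        "**System Alert**\n"
        f"```\nService: {service}\nIssue: {issue}\nAction: {action}\n```"
    )
    if mention_id:
        body += f"\n<@{mention_id}>"
    return json.dumps({"content": body})

payload = build_alert("Tdarr", "Staging timeout", "Automatic cleanup performed")
```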

## Core Monitoring Components

### Tdarr System Monitoring
**Purpose**: Monitor transcoding pipeline health and performance
**Location**: `scripts/tdarr_monitor.py`

**Key Features**:
- API-based status monitoring with dataclass structures
- Staging section timeout detection and cleanup
- Discord notifications with professional formatting
- Log rotation and retention management
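
The staging-timeout check can be sketched with a dataclass (field and function names are assumptions, not `tdarr_monitor.py`'s actual structures):

```python
from dataclasses import dataclass

@dataclass
class StagedFile:
    path: str
    entered_staging: float  # epoch seconds

def timed_out(files, now, timeout_s=1800):
    """Return files stuck in the staging section past the timeout."""
    return [f for f in files if now - f.entered_staging > timeout_s]

stuck = timed_out([StagedFile("a.mkv", 0), StagedFile("b.mkv", 5000)], now=5400)
print([f.path for f in stuck])  # ['a.mkv']
```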

### Windows Desktop Monitoring
**Purpose**: Track Windows system reboots and power events
**Location**: `scripts/windows-desktop/`

**Components**:
- PowerShell monitoring script
- Scheduled task automation
- Discord notification integration
- System event correlation

### Uptime Kuma (Centralized Uptime Monitoring)
**Purpose**: Centralized service uptime, health checks, and status page for all homelab services
**Location**: LXC 227 (10.10.0.227), Docker container
**URL**: https://status.manticorum.com (internal: http://10.10.0.227:3001)

**Key Features**:
- HTTP/HTTPS, TCP, DNS, Docker, and ping monitoring
- Discord notification integration (default alert channel for all monitors)
- Public status page at https://status.manticorum.com
- Multi-protocol health checks at 60-second intervals with 3 retries
- Certificate expiration monitoring

**Infrastructure**:
- Proxmox LXC 227, Ubuntu 22.04, 2 cores, 2GB RAM, 8GB disk
- Docker with AppArmor unconfined (required for Docker-in-LXC)
- Data persisted via Docker named volume (`uptime-kuma-data`)
- Compose config: `server-configs/uptime-kuma/docker-compose/uptime-kuma/`
- SSH alias: `uptime-kuma`
- Admin credentials: username `cal`, password in `~/.claude/secrets/kuma_web_password`

**Active Monitors (20)**:

| Tag | Monitor | Type | Target |
|-----|---------|------|--------|
| Infrastructure | Proxmox VE | HTTP | https://10.10.0.11:8006 |
| Infrastructure | Home Assistant | HTTP | http://10.0.0.28:8123 |
| DNS | Pi-hole Primary DNS | DNS | 10.10.0.16:53 |
| DNS | Pi-hole Secondary DNS | DNS | 10.10.0.226:53 |
| Media | Jellyfin | HTTP | http://10.10.0.226:8096 |
| Media | Tdarr | HTTP | http://10.10.0.226:8265 |
| Media | Sonarr | HTTP | http://10.10.0.221:8989 |
| Media | Radarr | HTTP | http://10.10.0.221:7878 |
| Media | Jellyseerr | HTTP | http://10.10.0.221:5055 |
| DevOps | Gitea | HTTP | http://10.10.0.225:3000 |
| DevOps | n8n | HTTP | http://10.10.0.210:5678 |
| Networking | NPM Local (Admin) | HTTP | http://10.10.0.16:81 |
| Networking | Pi-hole Primary Web | HTTP | http://10.10.0.16:81/admin |
| Networking | Pi-hole Secondary Web | HTTP | http://10.10.0.226:8053/admin |
| Gaming | Foundry VTT | HTTP | http://10.10.0.223:30000 |
| AI | OpenClaw Gateway | HTTP | http://10.10.0.224:18789 |
| Bots | discord-bots VM | Ping | 10.10.0.33 |
| Bots | sba-bots VM | Ping | 10.10.0.88 |
| Database | PostgreSQL (strat-database) | TCP | 10.10.0.42:5432 |
| External | Akamai NPM | HTTP | http://172.237.147.99 |

**Notifications**:
- Discord webhook: "Discord - Homelab Alerts" (default, applied to all monitors)
- Alerts on service down (after 3 retries at 30s intervals) and on recovery

**API Access**:
- Python library: `uptime-kuma-api` (pip installed)
- Connection: `UptimeKumaApi("http://10.10.0.227:3001")`
- Used for programmatic monitor/notification management

### Network and Service Monitoring
**Purpose**: Monitor critical infrastructure availability
**Implementation**:
```bash
# Service health check pattern (send_alert is a local notification helper)
SERVICES="https://homelab.local http://nas.homelab.local"
for service in $SERVICES; do
    if curl -sSf --max-time 10 "$service" >/dev/null 2>&1; then
        echo "✅ $service: Available"
    else
        echo "❌ $service: Failed" | send_alert
    fi
done
```

## Automation Patterns

### Cron-Based Scheduling
**Pattern**: Regular health checks with intelligent alerting
```bash
# Monitoring schedule examples
*/20 * * * * /path/to/tdarr-timeout-monitor.sh   # Every 20 minutes
0 */6 * * * /path/to/cleanup-temp-dirs.sh        # Every 6 hours
0 2 * * * /path/to/backup-monitor.sh             # Daily at 2 AM
```

### Event-Driven Monitoring
**Pattern**: Reactive monitoring for critical events
- **System Startup**: Windows boot detection
- **Service Failures**: Container restart alerts
- **Resource Exhaustion**: Disk space warnings
- **Security Events**: Failed login attempts
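
The disk-space case, for example, reduces to a threshold check (path and threshold here are illustrative):

```python
import shutil

def disk_warning(path="/", min_free_pct=10.0):
    """Return a warning string when free space on path drops below
    the threshold, else None."""
    usage = shutil.disk_usage(path)
    free_pct = usage.free / usage.total * 100
    if free_pct < min_free_pct:
        return f"Disk space low on {path}: {free_pct:.1f}% free"
    return None

print(disk_warning("/", min_free_pct=0.0))  # None (a 0% threshold never triggers)
```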

## Data Collection and Analysis

### Log Management
**Pattern**: Centralized logging with rotation
```bash
# Log rotation configuration
LOG_FILE="/var/log/homelab-monitor.log"
MAX_BYTES=$((10 * 1024 * 1024))   # 10M rotation threshold
RETENTION_DAYS=30

# Rotate the log when the size threshold is exceeded
if [ -f "$LOG_FILE" ] && [ "$(stat -c%s "$LOG_FILE")" -gt "$MAX_BYTES" ]; then
    mv "$LOG_FILE" "$LOG_FILE.$(date +%Y%m%d)"
    touch "$LOG_FILE"
fi

# Prune rotated logs older than the retention window
find "$(dirname "$LOG_FILE")" -name "$(basename "$LOG_FILE").*" -mtime +"$RETENTION_DAYS" -delete
```

### Metrics Collection
**Pattern**: Time-series data for trend analysis
- **System Metrics**: CPU, memory, disk usage
- **Service Metrics**: Response times, error rates
- **Application Metrics**: Transcoding progress, queue sizes
- **Network Metrics**: Bandwidth usage, latency
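
The proactive side of this is trend detection; a minimal version keeps a sliding window per metric (illustrative only, with a deliberately naive trend test):

```python
from collections import deque

class Metric:
    """Fixed window of samples with a naive rising-trend check."""
    def __init__(self, window=5):
        self.samples = deque(maxlen=window)

    def add(self, value):
        self.samples.append(value)

    def rising(self):
        s = list(self.samples)
        # strictly increasing across the whole window
        return len(s) >= 2 and all(b > a for a, b in zip(s, s[1:]))

disk = Metric()
for pct in [70, 72, 75, 80, 86]:
    disk.add(pct)
print(disk.rising())  # True
```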

## Alert Integration

### Discord Notification System
**Pattern**: Rich, actionable notifications
```markdown
# Professional alert format
**🔧 System Maintenance**
Service: Tdarr Transcoding
Issue: 3 files timed out in staging
Resolution: Automatic cleanup completed
Status: System operational

Manual review recommended <@user_id>
```

### Alert Escalation
**Pattern**: Tiered alerting based on severity
1. **Info**: Routine maintenance completed
2. **Warning**: Service degradation detected
3. **Critical**: Service failure requiring immediate attention
4. **Emergency**: System-wide failure requiring manual intervention
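
The four tiers map naturally onto notification behavior (a sketch; the routing fields are placeholders, not an existing API):

```python
def route(severity):
    """Decide how loudly to alert for each tier."""
    if severity == "info":
        return {"notify": False, "mention": False}
    if severity == "warning":
        return {"notify": True, "mention": False}
    # critical and emergency both ping; emergency also implies
    # manual intervention, so it could use a louder mention
    return {"notify": True, "mention": True}

print(route("critical"))  # {'notify': True, 'mention': True}
```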

## Best Practices Implementation

### Monitoring Strategy
1. **Proactive**: Monitor trends to predict issues
2. **Reactive**: Alert on current failures
3. **Preventive**: Automated cleanup and maintenance
4. **Comprehensive**: Cover all critical services
5. **Actionable**: Provide clear resolution paths

### Performance Optimization
1. **Efficient Polling**: Balance monitoring frequency with resource usage
2. **Smart Alerting**: Avoid alert fatigue with intelligent filtering
3. **Resource Management**: Monitor the monitoring system itself
4. **Scalable Architecture**: Design for growth and additional services
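
Smart alerting often comes down to a per-alert cooldown (a sketch; the 15-minute window is arbitrary):

```python
import time

class Cooldown:
    """Suppress repeats of the same alert key within a window."""
    def __init__(self, window_s=900):
        self.window_s = window_s
        self.last_sent = {}

    def should_send(self, key, now=None):
        now = time.time() if now is None else now
        last = self.last_sent.get(key)
        if last is not None and now - last < self.window_s:
            return False  # duplicate within the window: drop it
        self.last_sent[key] = now
        return True

cd = Cooldown(window_s=900)
print(cd.should_send("tdarr-timeout", now=0))    # True
print(cd.should_send("tdarr-timeout", now=100))  # False
```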

This technology context provides the foundation for implementing comprehensive monitoring and alerting in home lab environments.