docs: update monitoring CONTEXT.md with expanded server inventory
Add server table with all 6 monitored hosts, per-server SSH user docs, updated workflow server list, and pre-escalation Discord notification documentation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
parent ed16fee9f7
commit f20e221090
@@ -8,11 +8,69 @@ Comprehensive monitoring and alerting system for home lab infrastructure with fo
### Distributed Monitoring Strategy
**Pattern**: Service-specific monitoring with centralized alerting
- **Uptime Kuma**: Centralized service uptime and health monitoring (status page)
- **Claude Runner (CT 302)**: SSH-based server diagnostics with two-tier auto-remediation
- **Tdarr Monitoring**: API-based transcoding health checks
- **Windows Desktop Monitoring**: Reboot detection and system events
- **Network Monitoring**: Connectivity and service availability
- **Container Monitoring**: Docker/Podman health and resource usage

### Claude Runner — CT 302 (10.10.0.148)
**Purpose**: Automated server health monitoring with AI-escalated remediation
**Repo**: `cal/claude-runner-monitoring` on Gitea (cloned to `/root/.claude` on CT 302)
**Docs**: `monitoring/server-diagnostics/CONTEXT.md`

**Two-tier system:**
- **Tier 1** (`health_check.py`): Pure Python, runs every 5 min via n8n. Checks containers, systemd services, and disk/memory/load. Auto-restarts containers when allowed. Exit codes: 0 = healthy, 1 = auto-fixed, 2 = needs Claude.
- **Tier 2** (`client.py`): Full diagnostic toolkit used by Claude during escalation sessions.

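The Tier 1 exit codes above can be read as a simple reduction over the issues found in one run. The following is a minimal sketch of that idea, not the actual `health_check.py` logic: the `Issue` record and the `exit_code_for` helper are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass
from enum import IntEnum


class ExitCode(IntEnum):
    HEALTHY = 0       # no issues found
    AUTO_FIXED = 1    # every issue was repaired by Tier 1 itself
    NEEDS_CLAUDE = 2  # at least one issue is left for Tier 2


@dataclass
class Issue:
    # Hypothetical record of one failed check; not the real schema.
    name: str
    auto_fixed: bool


def exit_code_for(issues: list[Issue]) -> ExitCode:
    """Collapse the issues found in one run into a single exit code."""
    if not issues:
        return ExitCode.HEALTHY
    if all(issue.auto_fixed for issue in issues):
        return ExitCode.AUTO_FIXED
    return ExitCode.NEEDS_CLAUDE
```

Under this reading, a single unresolved issue is enough to escalate, even if other issues in the same run were fixed automatically.
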
**Monitored servers** (dynamic from `config.yaml`):

| Server Key | IP | SSH User | Services | Critical |
|---|---|---|---|---|
| arr-stack | 10.10.0.221 | root | sonarr, radarr, readarr, lidarr, jellyseerr, sabnzbd | Yes |
| gitea | 10.10.0.225 | root | gitea (systemd), gitea-runner (docker) | Yes |
| uptime-kuma | 10.10.0.227 | root | uptime-kuma | Yes |
| n8n | 10.10.0.210 | root | n8n (no restart), n8n-postgres, omni-tools, termix | Yes |
| ubuntu-manticore | 10.10.0.226 | cal | jellyfin, tdarr-server, tdarr-node, pihole, watchstate, orbital-sync | Yes |
| strat-database | 10.10.0.42 | cal | sba_postgres, sba_redis, sba_db_api, dev_pd_database, sba_adminer | No (dev) |

**Per-server SSH user:** `health_check.py` supports a per-server `ssh_user` override in `config.yaml` (default: `root`). It is used by ubuntu-manticore and strat-database, which require the `cal` user.
**SSH keys:** n8n uses `n8n_runner_key` → CT 302; CT 302 uses `homelab_rsa` → target servers
**Helper script:** `/root/.claude/skills/server-diagnostics/list_servers.sh` (extracts server keys from `config.yaml` as a JSON array)

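The per-server `ssh_user` override and the JSON key list can be sketched as follows. This is an illustration only: the real layout of `config.yaml` is not shown here, so the `servers`, `host`, and `ssh_user` keys in the stand-in dictionary, along with the `ssh_target` and `list_server_keys` helpers, are assumptions.

```python
import json

# Stand-in for the parsed config.yaml. The "servers", "host" and "ssh_user"
# keys are assumed for illustration; the real file may be laid out differently.
CONFIG = {
    "servers": {
        "arr-stack": {"host": "10.10.0.221"},
        "ubuntu-manticore": {"host": "10.10.0.226", "ssh_user": "cal"},
        "strat-database": {"host": "10.10.0.42", "ssh_user": "cal"},
    }
}

DEFAULT_SSH_USER = "root"


def ssh_target(config: dict, server_key: str) -> str:
    """Return user@host, falling back to root when ssh_user is absent."""
    server = config["servers"][server_key]
    user = server.get("ssh_user", DEFAULT_SSH_USER)
    return f"{user}@{server['host']}"


def list_server_keys(config: dict) -> str:
    """Emit the server keys as a JSON array, in config order."""
    return json.dumps(list(config["servers"]))


print(ssh_target(CONFIG, "arr-stack"))
print(ssh_target(CONFIG, "ubuntu-manticore"))
print(list_server_keys(CONFIG))
```

Keeping the default in one place means only hosts that differ from `root` need an explicit entry.
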
#### n8n Workflow Architecture (Master + Sub-workflow)

Monitoring uses a master/sub-workflow pattern in n8n. Adding or removing a server only requires editing `config.yaml` on CT 302; no n8n changes are needed.

**Master: "Server Health Monitor - Claude Code"** (`p7XmW23SgCs3hEkY`, active)
```
Schedule (every 5 min)
  → SSH to CT 302: list_servers.sh → ["arr-stack", "gitea", "uptime-kuma", "n8n", "ubuntu-manticore", "strat-database"]
  → Code: split JSON array into one item per server_key
  → Execute Sub-workflow (mode: "each") → "Server Health Check"
  → Code: aggregate results (healthy/remediated/escalated counts)
  → If any escalations → Discord summary embed
```

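The two Code steps in the master workflow (split and aggregate) can be sketched as below. The actual n8n Code nodes are not shown in this document, so this is a Python illustration under assumptions: the `split_server_keys` and `aggregate` helpers, the `notify` field, and the shape of each result item are hypothetical.

```python
import json
from collections import Counter

# Map Tier 1 exit codes to the summary buckets named in the workflow.
BUCKETS = {0: "healthy", 1: "remediated", 2: "escalated"}


def split_server_keys(stdout: str) -> list[dict]:
    """Turn the JSON array printed by list_servers.sh into one item per key."""
    return [{"server_key": key} for key in json.loads(stdout)]


def aggregate(results: list[dict]) -> dict:
    """Count results per bucket and flag whether a summary should be sent."""
    counts = Counter(BUCKETS[result["exit_code"]] for result in results)
    summary = {name: counts.get(name, 0) for name in BUCKETS.values()}
    # A Discord summary is only sent when at least one server escalated.
    summary["notify"] = summary["escalated"] > 0
    return summary
```

Because the server list is read from CT 302 on every run, the split step is what lets `config.yaml` changes take effect without touching n8n.
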
**Sub-workflow: "Server Health Check"** (`BhzYmWr6NcIDoioy`, active)
```
Execute Workflow Trigger (receives { server_key: "arr-stack" })
  → SSH to CT 302: health_check.py --server {server_key}
  → Code: parse JSON output (status, exit_code, issues, escalations)
  → If exit_code == 2 → SSH: remediate.sh (escalation JSON)
  → Return results to parent (server_key, status, issues, remediation_output)
```

**Exit code behavior:**
- `0` (healthy): No action; the result is aggregated in the summary
- `1` (auto-remediated): The script already fixed the issue and sent a Discord notification via `notifier.py`; n8n takes no action
- `2` (needs escalation): Sub-workflow runs `remediate.sh`, master sends Discord summary

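The sub-workflow's parse-and-branch step can be sketched like this. It assumes that `health_check.py` prints a JSON object carrying the four documented fields; the `parse_health_output` and `next_action` helpers and the default values for missing fields are hypothetical.

```python
import json


def parse_health_output(stdout: str) -> dict:
    """Parse the JSON printed by health_check.py, keeping the documented fields."""
    data = json.loads(stdout)
    return {
        "status": data.get("status", "unknown"),
        "exit_code": int(data["exit_code"]),  # required: it drives the branch
        "issues": data.get("issues", []),
        "escalations": data.get("escalations", []),
    }


def next_action(parsed: dict) -> str:
    """Name the step n8n takes for a parsed result."""
    if parsed["exit_code"] == 2:
        return "run remediate.sh"
    # 0: nothing to do. 1: the script already fixed it and notified Discord.
    return "none"
```

Only exit code `2` triggers work on the n8n side, which keeps the common healthy and auto-fixed paths free of Claude API usage.
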
**Pre-escalation notification:** `remediate.sh` sends a Discord warning embed ("Claude API Escalation Triggered") via `notifier.py` *before* invoking the Claude CLI, so Cal gets a heads-up that API charges are about to be incurred.

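For reference, a Discord webhook accepts a JSON body with an `embeds` array, so the warning could be built roughly as follows. This is a sketch, not the contents of `notifier.py`: the `escalation_warning_payload` helper, the description wording, and the color value are assumptions; only the embed title comes from the text above.

```python
import json


def escalation_warning_payload(server_key: str, issues: list[str]) -> dict:
    """Build a Discord webhook body containing a single warning embed."""
    return {
        "embeds": [
            {
                "title": "Claude API Escalation Triggered",
                # Hypothetical description format, for illustration only.
                "description": f"Escalating {server_key}: " + "; ".join(issues),
                "color": 0xFFA500,  # orange; an arbitrary choice for warnings
            }
        ]
    }


payload = escalation_warning_payload("arr-stack", ["sonarr container down"])
print(json.dumps(payload, indent=2))
```

The body would be POSTed to the webhook URL before the Claude CLI is invoked, so the warning arrives even if the escalation session itself fails.
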
**SSH credential:** `SSH Private Key account` (id: `QkbHQ8JmYimUoTcM`)
**Discord webhook:** Homelab Alerts channel

### Alert Management
**Pattern**: Structured notifications with actionable information
```bash