diff --git a/monitoring/server-diagnostics/CONTEXT.md b/monitoring/server-diagnostics/CONTEXT.md
new file mode 100644
index 0000000..37507fb
--- /dev/null
+++ b/monitoring/server-diagnostics/CONTEXT.md
@@ -0,0 +1,91 @@
+# Server Diagnostics — Deployment & Architecture
+
+## Overview
+
+Automated server health monitoring running on CT 302 (claude-runner, 10.10.0.148).
+Two-tier system: Python health checks handle 99% of issues autonomously; Claude
+is only invoked for complex failures that scripts can't resolve.
+
+## Architecture
+
+```
+┌──────────────────────┐     ┌──────────────────────────────────┐
+│ N8N (LXC 210)        │     │ CT 302 — claude-runner           │
+│ 10.10.0.210          │     │ 10.10.0.148                      │
+│                      │     │                                  │
+│  ┌─────────────────┐ │ SSH │  ┌──────────────────────────┐    │
+│  │ Cron: */15 min  │─┼─────┼─→│ health_check.py          │    │
+│  │                 │ │     │  │ (exit 0/1/2)             │    │
+│  │ Branch on exit: │ │     │  └──────────────────────────┘    │
+│  │   0 → stop      │ │     │                                  │
+│  │   1 → stop      │ │     │  ┌──────────────────────────┐    │
+│  │   2 → invoke    │─┼─────┼─→│ claude --print           │    │
+│  │       Claude    │ │     │  │ + client.py              │    │
+│  └─────────────────┘ │     │  └──────────────────────────┘    │
+│                      │     │                                  │
+│  ┌─────────────────┐ │     │  SSH keys:                       │
+│  │ Uptime Kuma     │ │     │  - homelab_rsa (→ target servers)│
+│  │ webhook trigger │ │     │  - n8n_runner_key (← N8N)        │
+│  └─────────────────┘ │     └──────────────────────────────────┘
+└──────────────────────┘
+                                        │ SSH to target servers
+                                        ▼
+┌────────────────┐ ┌────────────────┐ ┌────────────────┐
+│ arr-stack      │ │ gitea          │ │ uptime-kuma    │
+│ 10.10.0.221    │ │ 10.10.0.225    │ │ 10.10.0.227    │
+│ Docker: sonarr │ │ systemd: gitea │ │ Docker: kuma   │
+│ radarr, etc.   │ │ Docker: runner │ │                │
+└────────────────┘ └────────────────┘ └────────────────┘
+```
+
+## Cost Model
+
+- **Exit 0** (healthy): $0 — pure Python, no API call
+- **Exit 1** (auto-remediated): $0 — Python restarts container + Discord webhook
+- **Exit 2** (escalation): ~$0.10-0.15 — Claude Sonnet invoked via `claude --print`
+
+At 96 checks/day (every 15 min), typical cost is near $0 unless something
+actually breaks and can't be auto-fixed.
+
+## Repository
+
+**Gitea:** `cal/claude-runner-monitoring`
+**Deployed to:** `/root/.claude` on CT 302
+**SSH alias:** `claude-runner` (root@10.10.0.148, defined in `~/.ssh/config`)
+**Update method:** `ssh claude-runner "cd /root/.claude && git pull"`
+
+### Git Auth on CT 302
+
+CT 302 pushes to Gitea via HTTPS with a token auth header (embedded-credential URLs are rejected by Gitea). The token is stored locally in `~/.claude/secrets/claude_runner_monitoring_gitea_token` and configured on CT 302 via:
+
+```
+git config http.https://git.manticorum.com/.extraHeader 'Authorization: token '
+```
+
+CT 302 does **not** have an SSH key registered with Gitea, so SSH git remotes won't work.
+
+## Files
+
+| File | Purpose |
+|------|---------|
+| `CLAUDE.md` | Runner-specific instructions for Claude |
+| `settings.json` | Locked-down permissions (read-only + restart only) |
+| `skills/server-diagnostics/health_check.py` | Tier 1: automated health checks |
+| `skills/server-diagnostics/client.py` | Tier 2: Claude's diagnostic toolkit |
+| `skills/server-diagnostics/notifier.py` | Discord webhook notifications |
+| `skills/server-diagnostics/config.yaml` | Server inventory + security rules |
+| `skills/server-diagnostics/SKILL.md` | Skill reference |
+| `skills/server-diagnostics/CLAUDE.md` | Remediation methodology |
+
+## Adding a New Server
+
+1. Add entry to `config.yaml` under `servers:` with hostname, containers, etc.
+2. Ensure CT 302 can SSH: `ssh -i /root/.ssh/homelab_rsa root@ hostname`
+3. Commit to Gitea, pull on CT 302
+4. Add Uptime Kuma monitors if desired
+
+## Related
+
+- [monitoring/CONTEXT.md](../CONTEXT.md) — Overall monitoring architecture
+- [productivity/n8n/CONTEXT.md](../../productivity/n8n/CONTEXT.md) — N8N deployment
+- Uptime Kuma status page: https://status.manticorum.com
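
The exit-code contract in the Architecture and Cost Model sections (0 and 1 stop, 2 escalates to Claude, 96 checks per day at a 15-minute interval) can be sketched in Python as below. This is a minimal illustration, not the actual N8N workflow or `health_check.py`. The names `HealthExit`, `should_invoke_claude`, `checks_per_day`, and `daily_cost_upper_bound` are hypothetical, and the per-escalation cost is taken from the upper end of the "~$0.10-0.15" estimate.

```python
from enum import IntEnum


class HealthExit(IntEnum):
    """Exit codes as described in the Cost Model section (names are hypothetical)."""
    HEALTHY = 0          # nothing to do, no API call
    AUTO_REMEDIATED = 1  # Python already restarted the container and notified Discord
    ESCALATE = 2         # scripts could not resolve the issue


# Assumption: upper end of the "~$0.10-0.15" per-escalation estimate.
ESCALATION_COST_USD = 0.15


def should_invoke_claude(exit_code: int) -> bool:
    """Return True only for exit code 2; codes 0 and 1 end the run."""
    try:
        status = HealthExit(exit_code)
    except ValueError:
        # The document defines only 0, 1 and 2, so this sketch refuses to guess.
        raise ValueError(f"unexpected health check exit code: {exit_code}") from None
    return status is HealthExit.ESCALATE


def checks_per_day(interval_minutes: int = 15) -> int:
    """Number of scheduled checks in 24 hours for a given cron interval."""
    return 24 * 60 // interval_minutes


def daily_cost_upper_bound(exit_codes) -> float:
    """Rough upper bound on daily spend: only escalations cost anything."""
    escalations = sum(1 for code in exit_codes if should_invoke_claude(code))
    return escalations * ESCALATION_COST_USD


if __name__ == "__main__":
    # One escalation in an otherwise healthy day of checks.
    day = [0] * (checks_per_day() - 1) + [2]
    print(checks_per_day(), daily_cost_upper_bound(day))
```

Rejecting unknown exit codes instead of treating them as healthy is a design choice of this sketch. The original document does not say how the workflow handles codes other than 0, 1, and 2.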