---
title: Server Diagnostics Architecture
description: Deployment and architecture docs for the CT 302 claude-runner two-tier health monitoring system. Covers n8n integration, cost model, repository layout, SSH auth, and adding new servers to monitoring.
type: reference
domain: monitoring
tags:
---
# Server Diagnostics — Deployment & Architecture

## Overview
Automated server health monitoring running on CT 302 (claude-runner, 10.10.0.148). Two-tier system: Python health checks handle 99% of issues autonomously; Claude is only invoked for complex failures that scripts can't resolve.
## Architecture
```
┌──────────────────────┐        ┌──────────────────────────────────┐
│ N8N (LXC 210)        │        │ CT 302 — claude-runner           │
│ 10.10.0.210          │        │ 10.10.0.148                      │
│                      │        │                                  │
│ ┌─────────────────┐  │  SSH   │ ┌──────────────────────────┐     │
│ │ Cron: */15 min  │──┼────────┼→│ health_check.py          │     │
│ │                 │  │        │ │ (exit 0/1/2)             │     │
│ │ Branch on exit: │  │        │ └──────────────────────────┘     │
│ │ 0 → stop        │  │        │                                  │
│ │ 1 → stop        │  │        │ ┌──────────────────────────┐     │
│ │ 2 → invoke      │──┼────────┼→│ claude --print           │     │
│ │     Claude      │  │        │ │ + client.py              │     │
│ └─────────────────┘  │        │ └──────────────────────────┘     │
│                      │        │                                  │
│ ┌─────────────────┐  │        │ SSH keys:                        │
│ │ Uptime Kuma     │  │        │ - homelab_rsa (→ target servers) │
│ │ webhook trigger │  │        │ - n8n_runner_key (← N8N)         │
│ └─────────────────┘  │        └──────────────────────────────────┘
└──────────────────────┘                 │ SSH to target servers
                                         ▼
  ┌────────────────┐  ┌────────────────┐  ┌────────────────┐
  │ arr-stack      │  │ gitea          │  │ uptime-kuma    │
  │ 10.10.0.221    │  │ 10.10.0.225    │  │ 10.10.0.227    │
  │ Docker: sonarr │  │ systemd: gitea │  │ Docker: kuma   │
  │ radarr, etc.   │  │ Docker: runner │  │                │
  └────────────────┘  └────────────────┘  └────────────────┘
```
## Cost Model
- Exit 0 (healthy): $0 — pure Python, no API call
- Exit 1 (auto-remediated): $0 — Python restarts the container and posts a Discord webhook
- Exit 2 (escalation): ~$0.10-0.15 — Claude Sonnet invoked via `claude --print`

At 96 checks/day (every 15 minutes), typical cost is near $0 unless something actually breaks and can't be auto-fixed.
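The exit-code branching on the n8n side can be sketched as a small shell helper. This is illustrative only: the real branching lives in an n8n workflow node, and `handle_exit` is a hypothetical name.

```shell
# Maps a health_check.py exit code to the n8n-side action (illustrative helper).
handle_exit() {
  case "$1" in
    0) echo "stop: healthy, no API call" ;;
    1) echo "stop: auto-remediated, Discord webhook already sent" ;;
    2) echo "invoke: claude --print on CT 302" ;;
    *) echo "unknown exit code: $1" ;;
  esac
}
```

Only exit 2 costs anything; exits 0 and 1 terminate the workflow without touching the API.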
## Repository

- Gitea: `cal/claude-runner-monitoring`
- Deployed to: `/root/.claude` on CT 302
- SSH alias: `claude-runner` (root@10.10.0.148, defined in `~/.ssh/config`)
- Update method: `ssh claude-runner "cd /root/.claude && git pull"`
## Git Auth on CT 302

CT 302 pushes to Gitea via HTTPS with a token auth header (Gitea rejects embedded-credential URLs). The token is stored locally in `~/.claude/secrets/claude_runner_monitoring_gitea_token` and configured on CT 302 via:

```
git config http.https://git.manticorum.com/.extraHeader 'Authorization: token <token>'
```
CT 302 does not have an SSH key registered with Gitea, so SSH git remotes won't work.
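The one-time setup could be wrapped like this. The function wrapper and its name are illustrative; only the secrets path and the `git config` command come from this doc.

```shell
# Configure the Gitea auth header from a token file (illustrative wrapper).
set_gitea_token_header() {
  token=$(cat "$1") || return 1
  git config http.https://git.manticorum.com/.extraHeader "Authorization: token $token"
}

# On CT 302:
# set_gitea_token_header ~/.claude/secrets/claude_runner_monitoring_gitea_token
```

Keeping the token in `.extraHeader` rather than the remote URL means it never appears in `git remote -v` output or shell history.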
## Files

| File | Purpose |
|---|---|
| `CLAUDE.md` | Runner-specific instructions for Claude |
| `settings.json` | Locked-down permissions (read-only + restart only) |
| `skills/server-diagnostics/health_check.py` | Tier 1: automated health checks |
| `skills/server-diagnostics/client.py` | Tier 2: Claude's diagnostic toolkit |
| `skills/server-diagnostics/notifier.py` | Discord webhook notifications |
| `skills/server-diagnostics/config.yaml` | Server inventory + security rules |
| `skills/server-diagnostics/SKILL.md` | Skill reference |
| `skills/server-diagnostics/CLAUDE.md` | Remediation methodology |
## Adding a New Server

1. Add an entry to `config.yaml` under `servers:` with hostname, containers, etc.
2. Ensure CT 302 can SSH: `ssh -i /root/.ssh/homelab_rsa root@<ip> hostname`
3. Commit to Gitea, pull on CT 302
4. Add Uptime Kuma monitors if desired
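Step 1 might look like the following sketch. The field names, server name, and IP are illustrative assumptions, not the real schema; mirror an existing entry in `config.yaml` for the actual field names.

```yaml
# Hypothetical entry — copy the shape of an existing server in config.yaml.
servers:
  new-server:
    host: 10.10.0.230      # target IP reachable from CT 302 via homelab_rsa
    containers:            # Docker containers to health-check on this host
      - myapp
```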
## Health Check Thresholds
Thresholds are evaluated in `health_check.py`. All load thresholds use per-core metrics to avoid false positives in LXC containers (which see the Proxmox host's aggregate load).
### Load Average
| Metric | Value | Rationale |
|---|---|---|
| `LOAD_WARN_PER_CORE` | 0.7 | Elevated — investigate if sustained |
| `LOAD_CRIT_PER_CORE` | 1.0 | Saturated — CPU is a bottleneck |
| Sample window | 5-minute | Filters transient spikes (not 1-minute) |
Formula: `load_per_core = load_5m / nproc`
Why per-core? Proxmox LXC containers see the host's aggregate load average via the shared kernel. A 32-core Proxmox host at load 9 is at 0.28/core (healthy), but a naive absolute threshold of 2× the core count (8 for a 4-core LXC) would trigger at load 9. Using `load_5m / nproc`, where `nproc` returns the host's visible core count, gives the correct ratio.
Validation examples:
- Proxmox host: load 9 / 32 cores = 0.28/core → no alert ✓
- VM 116 at 0.75/core → warning ✓ (above 0.7 threshold)
- VM at 1.1/core → critical ✓
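The per-core classification above can be sketched in Python. This is a minimal sketch: the constants match the documented thresholds, but `classify_load` is a hypothetical name, not necessarily what `health_check.py` uses.

```python
# Per-core load classification, as described above (illustrative function name).
LOAD_WARN_PER_CORE = 0.7
LOAD_CRIT_PER_CORE = 1.0

def classify_load(load_5m: float, cores: int) -> str:
    """Classify a 5-minute load average relative to the visible core count."""
    per_core = load_5m / cores
    if per_core >= LOAD_CRIT_PER_CORE:
        return "critical"
    if per_core >= LOAD_WARN_PER_CORE:
        return "warning"
    return "ok"

# Validation examples from above:
# classify_load(9.0, 32) -> "ok"        (0.28/core)
# classify_load(3.0, 4)  -> "warning"   (0.75/core)
# classify_load(4.4, 4)  -> "critical"  (1.1/core)
```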
### Other Thresholds
| Check | Threshold | Notes |
|---|---|---|
| Zombie processes | 5 | Single zombies are transient noise; alert only if ≥ 5 |
| Swap usage | 30% of total swap | Percentage-based to handle varied swap sizes across hosts |
| Disk warning | 85% | |
| Disk critical | 95% | |
| Memory | 90% | |
| Uptime alert | n/a | Posts a non-urgent Discord message; not a page-level alert |
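The percentage-based swap check can be sketched the same way. The function name and constant are hypothetical; see `health_check.py` for the real logic.

```python
# Percentage-based swap check, so varied swap sizes compare fairly across hosts.
SWAP_ALERT_PCT = 30.0  # documented threshold: 30% of total swap

def swap_alert(used_kb: float, total_kb: float) -> bool:
    """True when swap usage crosses the percentage threshold."""
    if total_kb == 0:
        return False  # host has no swap configured
    return (used_kb / total_kb) * 100 >= SWAP_ALERT_PCT
```

A percentage threshold means a host with 2 GiB of swap and one with 16 GiB alert at the same relative pressure, not the same absolute usage.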
## Related
- `monitoring/CONTEXT.md` — overall monitoring architecture
- `productivity/n8n/CONTEXT.md` — N8N deployment
- Uptime Kuma status page: https://status.manticorum.com