---
title: "Server Diagnostics Architecture"
description: "Deployment and architecture docs for the CT 302 claude-runner two-tier health monitoring system. Covers n8n integration, cost model, repository layout, SSH auth, and adding new servers to monitoring."
type: reference
domain: monitoring
tags: [claude-runner, server-diagnostics, n8n, ssh, health-check, ct-302, gitea]
---

# Server Diagnostics — Deployment & Architecture

## Overview

Automated server health monitoring running on CT 302 (claude-runner, 10.10.0.148). It is a two-tier system: Python health checks handle 99% of issues autonomously; Claude is only invoked for complex failures that scripts can't resolve.

## Architecture

```
┌──────────────────────┐       ┌──────────────────────────────────┐
│ N8N (LXC 210)        │       │ CT 302 — claude-runner           │
│ 10.10.0.210          │       │ 10.10.0.148                      │
│                      │       │                                  │
│ ┌─────────────────┐  │  SSH  │ ┌──────────────────────────┐     │
│ │ Cron: */15 min  │──┼───────┼→│ health_check.py          │     │
│ │                 │  │       │ │ (exit 0/1/2)             │     │
│ │ Branch on exit: │  │       │ └──────────────────────────┘     │
│ │ 0 → stop        │  │       │              │                   │
│ │ 1 → stop        │  │       │ ┌──────────────────────────┐     │
│ │ 2 → invoke      │──┼───────┼→│ claude --print           │     │
│ │ Claude          │  │       │ │ + client.py              │     │
│ └─────────────────┘  │       │ └──────────────────────────┘     │
│                      │       │                                  │
│ ┌─────────────────┐  │       │ SSH keys:                        │
│ │ Uptime Kuma     │  │       │ - homelab_rsa (→ target servers) │
│ │ webhook trigger │  │       │ - n8n_runner_key (← N8N)         │
│ └─────────────────┘  │       └──────────────────────────────────┘
└──────────────────────┘                      │
                              SSH to target servers
                                              ▼
┌────────────────┐ ┌────────────────┐ ┌────────────────┐
│ arr-stack      │ │ gitea          │ │ uptime-kuma    │
│ 10.10.0.221    │ │ 10.10.0.225    │ │ 10.10.0.227    │
│ Docker: sonarr │ │ systemd: gitea │ │ Docker: kuma   │
│ radarr, etc.   │
│                │ │ Docker: runner │ │                │
└────────────────┘ └────────────────┘ └────────────────┘
```

## Cost Model

- **Exit 0** (healthy): $0 — pure Python, no API call
- **Exit 1** (auto-remediated): $0 — Python restarts the container and posts a Discord webhook
- **Exit 2** (escalation): ~$0.10–0.15 — Claude Sonnet invoked via `claude --print`

At 96 checks/day (every 15 minutes), typical cost is near $0 unless something actually breaks and can't be auto-fixed.

## Repository

**Gitea:** `cal/claude-runner-monitoring`
**Deployed to:** `/root/.claude` on CT 302
**SSH alias:** `claude-runner` (root@10.10.0.148, defined in `~/.ssh/config`)
**Update method:** `ssh claude-runner "cd /root/.claude && git pull"`

### Git Auth on CT 302

CT 302 pushes to Gitea via HTTPS with a token auth header (Gitea rejects URLs with embedded credentials). The token is stored locally in `~/.claude/secrets/claude_runner_monitoring_gitea_token` and configured on CT 302 via:

```
git config http.https://git.manticorum.com/.extraHeader 'Authorization: token '
```

CT 302 does **not** have an SSH key registered with Gitea, so SSH git remotes won't work.

## Files

| File | Purpose |
|------|---------|
| `CLAUDE.md` | Runner-specific instructions for Claude |
| `settings.json` | Locked-down permissions (read-only + restart only) |
| `skills/server-diagnostics/health_check.py` | Tier 1: automated health checks |
| `skills/server-diagnostics/client.py` | Tier 2: Claude's diagnostic toolkit |
| `skills/server-diagnostics/notifier.py` | Discord webhook notifications |
| `skills/server-diagnostics/config.yaml` | Server inventory + security rules |
| `skills/server-diagnostics/SKILL.md` | Skill reference |
| `skills/server-diagnostics/CLAUDE.md` | Remediation methodology |

## Adding a New Server

1. Add an entry to `config.yaml` under `servers:` with hostname, containers, etc.
2. Ensure CT 302 can SSH: `ssh -i /root/.ssh/homelab_rsa root@ hostname`
3. Commit to Gitea, pull on CT 302
4.
   Add Uptime Kuma monitors if desired

## Health Check Thresholds

Thresholds are evaluated in `health_check.py`. All load thresholds use **per-core** metrics to avoid false positives from LXC containers (which see the Proxmox host's aggregate load).

### Load Average

| Metric | Value | Rationale |
|--------|-------|-----------|
| `LOAD_WARN_PER_CORE` | `0.7` | Elevated — investigate if sustained |
| `LOAD_CRIT_PER_CORE` | `1.0` | Saturated — CPU is a bottleneck |
| Sample window | 5-minute | Filters transient spikes (not 1-minute) |

**Formula**: `load_per_core = load_5m / nproc`

**Why per-core?** Proxmox LXC containers see the host's aggregate load average via the shared kernel. A 32-core Proxmox host at load 9 is at 0.28/core (healthy), but a naive absolute threshold of 2× the container's core count would trigger at load 9 for a 4-core LXC (threshold 8). Using `load_5m / nproc`, where `nproc` returns the host's visible core count, gives the correct ratio.

**Validation examples**:

- Proxmox host: load 9 / 32 cores = 0.28/core → no alert ✓
- VM 116 at 0.75/core → warning ✓ (above the 0.7 threshold)
- VM at 1.1/core → critical ✓

### Other Thresholds

| Check | Threshold | Notes |
|-------|-----------|-------|
| Zombie processes | 5 | Single zombies are transient noise; alert only if ≥ 5 |
| Swap usage | 30% of total swap | Percentage-based to handle varied swap sizes across hosts |
| Disk warning | 85% | |
| Disk critical | 95% | |
| Memory | 90% | |
| Uptime alert | Non-urgent Discord post | Not a page-level alert |

## Related

- [monitoring/CONTEXT.md](../CONTEXT.md) — Overall monitoring architecture
- [productivity/n8n/CONTEXT.md](../../productivity/n8n/CONTEXT.md) — N8N deployment
- Uptime Kuma status page: https://status.manticorum.com
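For reference, the per-core load classification described in the thresholds section can be sketched as below. This is a minimal illustration of the Tier 1 logic, not the actual `health_check.py`; the function name and the mapping of warning/critical onto exit codes 1/2 are assumptions.

```python
#!/usr/bin/env python3
"""Illustrative sketch of the per-core load check (not the real health_check.py)."""

# Thresholds from the Load Average table.
LOAD_WARN_PER_CORE = 0.7   # elevated: investigate if sustained
LOAD_CRIT_PER_CORE = 1.0   # saturated: CPU is a bottleneck


def evaluate_load(load_5m: float, nproc: int) -> str:
    """Classify a 5-minute load average normalized per core.

    Uses load_5m / nproc so an LXC container, which sees the Proxmox
    host's aggregate load via the shared kernel, is judged against the
    host's visible core count rather than an absolute number.
    """
    per_core = load_5m / nproc
    if per_core >= LOAD_CRIT_PER_CORE:
        return "critical"   # assumed to feed exit 2 (escalate to Claude)
    if per_core >= LOAD_WARN_PER_CORE:
        return "warning"    # assumed to feed exit 1 (auto-remediate/notify)
    return "ok"             # exit 0: no API call, $0


if __name__ == "__main__":
    # Validation examples from the section above:
    print(evaluate_load(9.0, 32))  # Proxmox host, 0.28/core -> ok
    print(evaluate_load(3.0, 4))   # 0.75/core -> warning
    print(evaluate_load(4.4, 4))   # 1.1/core -> critical
```

The three calls at the bottom reproduce the validation examples in the Load Average section: 9/32 = 0.28/core stays quiet, 0.75/core crosses the 0.7 warning line, and 1.1/core crosses the 1.0 critical line.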