claude-home/monitoring/server-diagnostics/CONTEXT.md
Cal Corum 193ae68f96
docs: document per-core load threshold policy for server health monitoring (#22)
Closes #22

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 13:35:23 -05:00


---
title: Server Diagnostics Architecture
description: >-
  Deployment and architecture docs for the CT 302 claude-runner two-tier health
  monitoring system. Covers n8n integration, cost model, repository layout, SSH
  auth, and adding new servers to monitoring.
type: reference
domain: monitoring
tags:
  - claude-runner
  - server-diagnostics
  - n8n
  - ssh
  - health-check
  - ct-302
  - gitea
---
# Server Diagnostics — Deployment & Architecture

## Overview

Automated server health monitoring running on CT 302 (claude-runner, 10.10.0.148). Two-tier system: Python health checks handle 99% of issues autonomously; Claude is only invoked for complex failures that scripts can't resolve.

## Architecture

```
┌──────────────────────┐     ┌──────────────────────────────────┐
│  N8N (LXC 210)       │     │  CT 302 — claude-runner          │
│  10.10.0.210         │     │  10.10.0.148                     │
│                      │     │                                  │
│  ┌─────────────────┐ │ SSH │  ┌──────────────────────────┐   │
│  │ Cron: */15 min  │─┼─────┼─→│ health_check.py          │   │
│  │                 │ │     │  │ (exit 0/1/2)             │   │
│  │ Branch on exit: │ │     │  └──────────────────────────┘   │
│  │  0 → stop       │ │     │                                  │
│  │  1 → stop       │ │     │  ┌──────────────────────────┐   │
│  │  2 → invoke     │─┼─────┼─→│ claude --print           │   │
│  │     Claude      │ │     │  │ + client.py              │   │
│  └─────────────────┘ │     │  └──────────────────────────┘   │
│                      │     │                                  │
│  ┌─────────────────┐ │     │  SSH keys:                       │
│  │ Uptime Kuma     │ │     │  - homelab_rsa (→ target servers)│
│  │ webhook trigger │ │     │  - n8n_runner_key (← N8N)        │
│  └─────────────────┘ │     └──────────────────────────────────┘
└──────────────────────┘
         │ SSH to target servers
         ▼
┌────────────────┐  ┌────────────────┐  ┌────────────────┐
│ arr-stack      │  │ gitea          │  │ uptime-kuma    │
│ 10.10.0.221    │  │ 10.10.0.225    │  │ 10.10.0.227    │
│ Docker: sonarr │  │ systemd: gitea │  │ Docker: kuma   │
│ radarr, etc.   │  │ Docker: runner │  │                │
└────────────────┘  └────────────────┘  └────────────────┘
```

## Cost Model

  • Exit 0 (healthy): $0 — pure Python, no API call
  • Exit 1 (auto-remediated): $0 — Python restarts the container and posts a Discord webhook
  • Exit 2 (escalation): ~$0.10-0.15 — Claude Sonnet invoked via `claude --print`

At 96 checks/day (every 15 min), typical cost is near $0 unless something actually breaks and can't be auto-fixed.
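As a sanity check on these numbers, here is a sketch of the expected spend under the stated rates. The function name and the assumption that only exit-2 runs cost anything are mine, derived from the model above:

```python
def expected_daily_cost(escalations: int, low: float = 0.10, high: float = 0.15) -> tuple[float, float]:
    """Daily cost range in dollars: exit 0/1 runs are free; only exit-2 escalations hit the API."""
    return (escalations * low, escalations * high)
```

At zero escalations the cost is exactly $0, matching the "near $0" claim; even one escalation per day stays within $0.10-0.15.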

## Repository

  • Gitea: `cal/claude-runner-monitoring`
  • Deployed to: `/root/.claude` on CT 302
  • SSH alias: `claude-runner` (root@10.10.0.148, defined in `~/.ssh/config`)
  • Update method: `ssh claude-runner "cd /root/.claude && git pull"`

## Git Auth on CT 302

CT 302 pushes to Gitea via HTTPS with a token auth header (embedded-credential URLs are rejected by Gitea). The token is stored locally in `~/.claude/secrets/claude_runner_monitoring_gitea_token` and configured on CT 302 via:

`git config http.https://git.manticorum.com/.extraHeader 'Authorization: token <token>'`

CT 302 does not have an SSH key registered with Gitea, so SSH git remotes won't work.
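The setup above can be expressed as a small helper that reads the token from the secrets file and builds the `git config` invocation. The helper name and signature are hypothetical, but the config key and header format match the command documented here:

```python
from pathlib import Path

def build_git_auth_cmd(token_path: str) -> list[str]:
    """Build the `git config` command that attaches the Gitea token as an extraHeader
    (Gitea rejects embedded-credential URLs, so the header approach is required)."""
    token = Path(token_path).read_text().strip()
    return [
        "git", "config",
        "http.https://git.manticorum.com/.extraHeader",
        f"Authorization: token {token}",
    ]
```

Stripping the token avoids a trailing newline leaking into the header value.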

## Files

| File | Purpose |
|------|---------|
| `CLAUDE.md` | Runner-specific instructions for Claude |
| `settings.json` | Locked-down permissions (read-only + restart only) |
| `skills/server-diagnostics/health_check.py` | Tier 1: automated health checks |
| `skills/server-diagnostics/client.py` | Tier 2: Claude's diagnostic toolkit |
| `skills/server-diagnostics/notifier.py` | Discord webhook notifications |
| `skills/server-diagnostics/config.yaml` | Server inventory + security rules |
| `skills/server-diagnostics/SKILL.md` | Skill reference |
| `skills/server-diagnostics/CLAUDE.md` | Remediation methodology |

## Adding a New Server

  1. Add an entry to `config.yaml` under `servers:` with hostname, containers, etc.
  2. Ensure CT 302 can SSH to it: `ssh -i /root/.ssh/homelab_rsa root@<ip> hostname`
  3. Commit to Gitea, pull on CT 302
  4. Add Uptime Kuma monitors if desired
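Step 1's entry might look like the following sketch. The exact schema is an assumption on my part (the keys `host`, `ssh_user`, and `containers` are illustrative); check the existing entries in `config.yaml` for the real format:

```yaml
servers:
  new-server:              # hypothetical entry name
    host: 10.10.0.230      # illustrative IP
    ssh_user: root
    containers:
      - myapp              # Docker containers to health-check
```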

## Health Check Thresholds

Thresholds are evaluated in `health_check.py`. All load thresholds use per-core metrics to avoid false positives from LXC containers (which see the Proxmox host's aggregate load).

### Load Average

| Metric | Value | Rationale |
|--------|-------|-----------|
| `LOAD_WARN_PER_CORE` | 0.7 | Elevated — investigate if sustained |
| `LOAD_CRIT_PER_CORE` | 1.0 | Saturated — CPU is a bottleneck |
| Sample window | 5-minute | Filters transient spikes (not 1-minute) |

Formula: `load_per_core = load_5m / nproc`

Why per-core? Proxmox LXC containers see the host's aggregate load average via the shared kernel. A 32-core Proxmox host at load 9 is at 0.28/core (healthy), but a naive absolute threshold of 2× the core count (8 for a 4-core LXC) would falsely trigger on that same host load of 9. Using `load_5m / nproc`, where `nproc` returns the host's visible core count, gives the correct ratio.

Validation examples:

  • Proxmox host: load 9 / 32 cores = 0.28/core → no alert ✓
  • VM 116 at 0.75/core → warning ✓ (above 0.7 threshold)
  • VM at 1.1/core → critical ✓
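The validation examples above can be reproduced with a minimal classifier. The constant names match the table, but the function itself is a sketch, not the code in `health_check.py`:

```python
LOAD_WARN_PER_CORE = 0.7   # elevated: investigate if sustained
LOAD_CRIT_PER_CORE = 1.0   # saturated: CPU is a bottleneck

def classify_load(load_5m: float, nproc: int) -> str:
    """Classify the 5-minute load average per visible core."""
    per_core = load_5m / nproc
    if per_core >= LOAD_CRIT_PER_CORE:
        return "critical"
    if per_core >= LOAD_WARN_PER_CORE:
        return "warning"
    return "ok"
```

Running it against the three examples: load 9 on 32 cores classifies "ok", 0.75/core classifies "warning", and 1.1/core classifies "critical".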

### Other Thresholds

| Check | Threshold | Notes |
|-------|-----------|-------|
| Zombie processes | ≥ 5 | Single zombies are transient noise; alert only if ≥ 5 |
| Swap usage | 30% of total swap | Percentage-based to handle varied swap sizes across hosts |
| Disk warning | 85% | |
| Disk critical | 95% | |
| Memory | 90% | |
| Uptime alert | Non-urgent Discord post | Not a page-level alert |
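Sketches of a few of the remaining checks from the table. Function names and signatures are illustrative, not taken from `health_check.py`:

```python
def check_zombies(count: int) -> bool:
    """Alert only at 5+ zombies; single zombies are transient noise."""
    return count >= 5

def check_swap(used_kib: int, total_kib: int) -> bool:
    """Alert above 30% of total swap; percentage-based so hosts with
    different swap sizes are compared fairly."""
    return total_kib > 0 and used_kib / total_kib > 0.30

def check_disk(percent_used: float) -> str:
    """85% warning, 95% critical."""
    if percent_used >= 95:
        return "critical"
    if percent_used >= 85:
        return "warning"
    return "ok"
```

The zero-total guard in `check_swap` keeps swapless hosts from dividing by zero.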