claude-home/monitoring/server-diagnostics/CONTEXT.md
Cal Corum 193ae68f96
docs: document per-core load threshold policy for server health monitoring (#22)
Closes #22

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 13:35:23 -05:00


---
title: Server Diagnostics Architecture
description: >-
  Deployment and architecture docs for the CT 302 claude-runner two-tier health
  monitoring system. Covers n8n integration, cost model, repository layout, SSH
  auth, and adding new servers to monitoring.
type: reference
domain: monitoring
tags:
  - claude-runner
  - server-diagnostics
  - n8n
  - ssh
  - health-check
  - ct-302
  - gitea
---
# Server Diagnostics — Deployment & Architecture

## Overview

Automated server health monitoring running on CT 302 (claude-runner, 10.10.0.148). Two-tier system: Python health checks handle 99% of issues autonomously; Claude is only invoked for complex failures that scripts can't resolve.

## Architecture

```
┌──────────────────────┐     ┌──────────────────────────────────┐
│  N8N (LXC 210)       │     │  CT 302 — claude-runner          │
│  10.10.0.210         │     │  10.10.0.148                     │
│                      │     │                                  │
│  ┌─────────────────┐ │ SSH │  ┌──────────────────────────┐   │
│  │ Cron: */15 min  │─┼─────┼─→│ health_check.py          │   │
│  │                 │ │     │  │ (exit 0/1/2)             │   │
│  │ Branch on exit: │ │     │  └──────────────────────────┘   │
│  │  0 → stop       │ │     │                                  │
│  │  1 → stop       │ │     │  ┌──────────────────────────┐   │
│  │  2 → invoke     │─┼─────┼─→│ claude --print           │   │
│  │     Claude      │ │     │  │ + client.py              │   │
│  └─────────────────┘ │     │  └──────────────────────────┘   │
│                      │     │                                  │
│  ┌─────────────────┐ │     │  SSH keys:                       │
│  │ Uptime Kuma     │ │     │  - homelab_rsa (→ target servers)│
│  │ webhook trigger │ │     │  - n8n_runner_key (← N8N)        │
│  └─────────────────┘ │     └──────────────────────────────────┘
└──────────────────────┘
         │ SSH to target servers
         ▼
┌────────────────┐  ┌────────────────┐  ┌────────────────┐
│ arr-stack      │  │ gitea          │  │ uptime-kuma    │
│ 10.10.0.221    │  │ 10.10.0.225    │  │ 10.10.0.227    │
│ Docker: sonarr │  │ systemd: gitea │  │ Docker: kuma   │
│ radarr, etc.   │  │ Docker: runner │  │                │
└────────────────┘  └────────────────┘  └────────────────┘
```

## Cost Model

  • Exit 0 (healthy): $0 — pure Python, no API call
  • Exit 1 (auto-remediated): $0 — Python restarts the container and posts a Discord webhook
  • Exit 2 (escalation): ~$0.10-0.15 — Claude Sonnet invoked via `claude --print`

At 96 checks/day (every 15 min), typical cost is near $0 unless something actually breaks and can't be auto-fixed.
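As a sanity check on these numbers, here is a sketch of the expected spend under the stated rates. The function name and the assumption that only exit-2 runs cost anything are mine, derived from the model above:

```python
def expected_daily_cost(escalations: int, low: float = 0.10, high: float = 0.15) -> tuple[float, float]:
    """Daily cost range in dollars: exit 0/1 runs are free; only exit-2 escalations hit the API."""
    return (escalations * low, escalations * high)
```

At zero escalations the cost is exactly $0, matching the "near $0" claim; even one escalation per day stays within $0.10-0.15.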

## Repository

  • Gitea: `cal/claude-runner-monitoring`
  • Deployed to: `/root/.claude` on CT 302
  • SSH alias: `claude-runner` (root@10.10.0.148, defined in `~/.ssh/config`)
  • Update method: `ssh claude-runner "cd /root/.claude && git pull"`

## Git Auth on CT 302

CT 302 pushes to Gitea via HTTPS with a token auth header (embedded-credential URLs are rejected by Gitea). The token is stored locally in `~/.claude/secrets/claude_runner_monitoring_gitea_token` and configured on CT 302 via:

`git config http.https://git.manticorum.com/.extraHeader 'Authorization: token <token>'`

CT 302 does not have an SSH key registered with Gitea, so SSH git remotes won't work.
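The setup above can be expressed as a small helper that reads the token from the secrets file and builds the `git config` invocation. The helper name and signature are hypothetical, but the config key and header format match the command documented here:

```python
from pathlib import Path

def build_git_auth_cmd(token_path: str) -> list[str]:
    """Build the `git config` command that attaches the Gitea token as an extraHeader
    (Gitea rejects embedded-credential URLs, so the header approach is required)."""
    token = Path(token_path).read_text().strip()
    return [
        "git", "config",
        "http.https://git.manticorum.com/.extraHeader",
        f"Authorization: token {token}",
    ]
```

Stripping the token avoids a trailing newline leaking into the header value.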

## Files

| File | Purpose |
|------|---------|
| `CLAUDE.md` | Runner-specific instructions for Claude |
| `settings.json` | Locked-down permissions (read-only + restart only) |
| `skills/server-diagnostics/health_check.py` | Tier 1: automated health checks |
| `skills/server-diagnostics/client.py` | Tier 2: Claude's diagnostic toolkit |
| `skills/server-diagnostics/notifier.py` | Discord webhook notifications |
| `skills/server-diagnostics/config.yaml` | Server inventory + security rules |
| `skills/server-diagnostics/SKILL.md` | Skill reference |
| `skills/server-diagnostics/CLAUDE.md` | Remediation methodology |

## Adding a New Server

  1. Add an entry to `config.yaml` under `servers:` with hostname, containers, etc.
  2. Ensure CT 302 can SSH to it: `ssh -i /root/.ssh/homelab_rsa root@<ip> hostname`
  3. Commit to Gitea, pull on CT 302
  4. Add Uptime Kuma monitors if desired
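Step 1's entry might look like the following sketch. The exact schema is an assumption on my part (the keys `host`, `ssh_user`, and `containers` are illustrative); check the existing entries in `config.yaml` for the real format:

```yaml
servers:
  new-server:              # hypothetical entry name
    host: 10.10.0.230      # illustrative IP
    ssh_user: root
    containers:
      - myapp              # Docker containers to health-check
```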

## Health Check Thresholds

Thresholds are evaluated in `health_check.py`. All load thresholds use per-core metrics to avoid false positives from LXC containers (which see the Proxmox host's aggregate load).

### Load Average

| Metric | Value | Rationale |
|--------|-------|-----------|
| `LOAD_WARN_PER_CORE` | 0.7 | Elevated — investigate if sustained |
| `LOAD_CRIT_PER_CORE` | 1.0 | Saturated — CPU is a bottleneck |
| Sample window | 5-minute | Filters transient spikes (not 1-minute) |

Formula: `load_per_core = load_5m / nproc`

Why per-core? Proxmox LXC containers see the host's aggregate load average via the shared kernel. A 32-core Proxmox host at load 9 is at 0.28/core (healthy), but a naive absolute threshold of 2× the core count (8 for a 4-core LXC) would falsely trigger on that same host load of 9. Using `load_5m / nproc`, where `nproc` returns the host's visible core count, gives the correct ratio.

Validation examples:

  • Proxmox host: load 9 / 32 cores = 0.28/core → no alert ✓
  • VM 116 at 0.75/core → warning ✓ (above 0.7 threshold)
  • VM at 1.1/core → critical ✓
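The validation examples above can be reproduced with a minimal classifier. The constant names match the table, but the function itself is a sketch, not the code in `health_check.py`:

```python
LOAD_WARN_PER_CORE = 0.7   # elevated: investigate if sustained
LOAD_CRIT_PER_CORE = 1.0   # saturated: CPU is a bottleneck

def classify_load(load_5m: float, nproc: int) -> str:
    """Classify the 5-minute load average per visible core."""
    per_core = load_5m / nproc
    if per_core >= LOAD_CRIT_PER_CORE:
        return "critical"
    if per_core >= LOAD_WARN_PER_CORE:
        return "warning"
    return "ok"
```

Running it against the three examples: load 9 on 32 cores classifies "ok", 0.75/core classifies "warning", and 1.1/core classifies "critical".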

### Other Thresholds

| Check | Threshold | Notes |
|-------|-----------|-------|
| Zombie processes | ≥ 5 | Single zombies are transient noise; alert only if ≥ 5 |
| Swap usage | 30% of total swap | Percentage-based to handle varied swap sizes across hosts |
| Disk warning | 85% | |
| Disk critical | 95% | |
| Memory | 90% | |
| Uptime alert | Non-urgent Discord post | Not a page-level alert |
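Sketches of a few of the remaining checks from the table. Function names and signatures are illustrative, not taken from `health_check.py`:

```python
def check_zombies(count: int) -> bool:
    """Alert only at 5+ zombies; single zombies are transient noise."""
    return count >= 5

def check_swap(used_kib: int, total_kib: int) -> bool:
    """Alert above 30% of total swap; percentage-based so hosts with
    different swap sizes are compared fairly."""
    return total_kib > 0 and used_kib / total_kib > 0.30

def check_disk(percent_used: float) -> str:
    """85% warning, 95% critical."""
    if percent_used >= 95:
        return "critical"
    if percent_used >= 85:
        return "warning"
    return "ok"
```

The zero-total guard in `check_swap` keeps swapless hosts from dividing by zero.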