claude-home/monitoring/server-diagnostics/CONTEXT.md
Cal Corum 193ae68f96
docs: document per-core load threshold policy for server health monitoring (#22)
Closes #22

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 13:35:23 -05:00


---
title: "Server Diagnostics Architecture"
description: "Deployment and architecture docs for the CT 302 claude-runner two-tier health monitoring system. Covers n8n integration, cost model, repository layout, SSH auth, and adding new servers to monitoring."
type: reference
domain: monitoring
tags: [claude-runner, server-diagnostics, n8n, ssh, health-check, ct-302, gitea]
---
# Server Diagnostics — Deployment & Architecture
## Overview
Automated server health monitoring running on CT 302 (claude-runner, 10.10.0.148).
Two-tier system: Python health checks handle 99% of issues autonomously; Claude
is only invoked for complex failures that scripts can't resolve.
## Architecture
```
┌──────────────────────┐        ┌──────────────────────────────────┐
│  N8N (LXC 210)       │        │  CT 302 — claude-runner          │
│  10.10.0.210         │        │  10.10.0.148                     │
│                      │        │                                  │
│ ┌──────────────────┐ │  SSH   │ ┌──────────────────────────┐     │
│ │ Cron: */15 min   │─┼────────┼→│ health_check.py          │     │
│ │                  │ │        │ │ (exit 0/1/2)             │     │
│ │ Branch on exit:  │ │        │ └──────────────────────────┘     │
│ │  0 → stop        │ │        │                                  │
│ │  1 → stop        │ │        │ ┌──────────────────────────┐     │
│ │  2 → invoke      │─┼────────┼→│ claude --print           │     │
│ │      Claude      │ │        │ │ + client.py              │     │
│ └──────────────────┘ │        │ └──────────────────────────┘     │
│                      │        │                                  │
│ ┌──────────────────┐ │        │ SSH keys:                        │
│ │ Uptime Kuma      │ │        │  - homelab_rsa (→ target servers)│
│ │ webhook trigger  │ │        │  - n8n_runner_key (← N8N)        │
│ └──────────────────┘ │        └──────────────────────────────────┘
└──────────────────────┘                 │ SSH to target servers
          ┌────────────────┐   ┌─────────┴──────┐   ┌────────────────┐
          │ arr-stack      │   │ gitea          │   │ uptime-kuma    │
          │ 10.10.0.221    │   │ 10.10.0.225    │   │ 10.10.0.227    │
          │ Docker: sonarr │   │ systemd: gitea │   │ Docker: kuma   │
          │ radarr, etc.   │   │ Docker: runner │   │                │
          └────────────────┘   └────────────────┘   └────────────────┘
```
## Cost Model
- **Exit 0** (healthy): $0 — pure Python, no API call
- **Exit 1** (auto-remediated): $0 — Python restarts container + Discord webhook
- **Exit 2** (escalation): ~$0.10-0.15 — Claude Sonnet invoked via `claude --print`

At 96 checks/day (every 15 min), typical cost is near $0 unless something actually breaks and can't be auto-fixed.
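The exit-code branching above can be sketched in Python. This is a hypothetical stand-in for the n8n workflow node, not its actual implementation; only the SSH alias and script path come from this doc, and the `route` helper name is illustrative:

```python
import subprocess

def run_health_check() -> int:
    """SSH to CT 302 and run the tier-1 checks; returns the script's exit code."""
    return subprocess.run(
        ["ssh", "claude-runner",
         "python3 /root/.claude/skills/server-diagnostics/health_check.py"],
    ).returncode

def route(exit_code: int) -> str:
    """Mirror the n8n branch: 0 and 1 stop the workflow, 2 escalates to Claude."""
    if exit_code in (0, 1):
        return "stop"           # healthy or already auto-remediated; no API cost
    if exit_code == 2:
        return "invoke-claude"  # run `claude --print` on CT 302 (~$0.10-0.15)
    return "unexpected"
```

Exit codes 0 and 1 take the same branch because in both cases the Python tier has already done everything needed (including the Discord webhook on 1); only 2 spends money.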
## Repository
- **Gitea:** `cal/claude-runner-monitoring`
- **Deployed to:** `/root/.claude` on CT 302
- **SSH alias:** `claude-runner` (root@10.10.0.148, defined in `~/.ssh/config`)
- **Update method:** `ssh claude-runner "cd /root/.claude && git pull"`
### Git Auth on CT 302
CT 302 pushes to Gitea via HTTPS with a token auth header (embedded-credential URLs are rejected by Gitea). The token is stored locally in `~/.claude/secrets/claude_runner_monitoring_gitea_token` and configured on CT 302 via:
```
git config http.https://git.manticorum.com/.extraHeader 'Authorization: token <token>'
```
CT 302 does **not** have an SSH key registered with Gitea, so SSH git remotes won't work.
## Files
| File | Purpose |
|------|---------|
| `CLAUDE.md` | Runner-specific instructions for Claude |
| `settings.json` | Locked-down permissions (read-only + restart only) |
| `skills/server-diagnostics/health_check.py` | Tier 1: automated health checks |
| `skills/server-diagnostics/client.py` | Tier 2: Claude's diagnostic toolkit |
| `skills/server-diagnostics/notifier.py` | Discord webhook notifications |
| `skills/server-diagnostics/config.yaml` | Server inventory + security rules |
| `skills/server-diagnostics/SKILL.md` | Skill reference |
| `skills/server-diagnostics/CLAUDE.md` | Remediation methodology |
## Adding a New Server
1. Add entry to `config.yaml` under `servers:` with hostname, containers, etc.
2. Ensure CT 302 can SSH: `ssh -i /root/.ssh/homelab_rsa root@<ip> hostname`
3. Commit to Gitea, pull on CT 302
4. Add Uptime Kuma monitors if desired
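The `config.yaml` schema isn't reproduced in this doc, so as a rough sketch, a new entry might look like the following. Field names are illustrative; mirror an existing entry for the real schema. The IP, key path, and container names are taken from the arr-stack example above:

```yaml
servers:
  arr-stack:                        # hypothetical entry shape
    host: 10.10.0.221
    ssh_key: /root/.ssh/homelab_rsa
    containers:
      - sonarr
      - radarr
```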
## Health Check Thresholds
Thresholds are evaluated in `health_check.py`. All load thresholds use **per-core** metrics
to avoid false positives from LXC containers (which see the Proxmox host's aggregate load).
### Load Average
| Metric | Value | Rationale |
|--------|-------|-----------|
| `LOAD_WARN_PER_CORE` | `0.7` | Elevated — investigate if sustained |
| `LOAD_CRIT_PER_CORE` | `1.0` | Saturated — CPU is a bottleneck |
| Sample window | 5-minute | Filters transient spikes (not 1-minute) |
**Formula**: `load_per_core = load_5m / nproc`

**Why per-core?** Proxmox LXC containers see the host's aggregate load average through the shared kernel. A 32-core Proxmox host at load 9 is at 0.28/core (healthy), but a naive absolute threshold of 2× the container's core count (8 for a 4-core LXC) would fire on that same load of 9. Dividing `load_5m` by `nproc`, which reports the host's visible core count, yields the correct ratio in both cases.

**Validation examples**:
- Proxmox host: load 9 / 32 cores = 0.28/core → no alert ✓
- VM 116 at 0.75/core → warning ✓ (above 0.7 threshold)
- VM at 1.1/core → critical ✓
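The per-core rule can be sketched as follows. This is a minimal illustration, not the actual `health_check.py` code; the constant and function names are hypothetical:

```python
import os

LOAD_WARN_PER_CORE = 0.7   # elevated: investigate if sustained
LOAD_CRIT_PER_CORE = 1.0   # saturated: CPU is a bottleneck

def classify_load(load_5m: float, cores: int) -> str:
    """Classify a 5-minute load average as 'ok', 'warn', or 'crit' per core."""
    per_core = load_5m / cores
    if per_core >= LOAD_CRIT_PER_CORE:
        return "crit"
    if per_core >= LOAD_WARN_PER_CORE:
        return "warn"
    return "ok"

# Validation examples from the table above:
print(classify_load(9.0, 32))  # Proxmox host at 0.28/core
print(classify_load(3.0, 4))   # 0.75/core
print(classify_load(4.4, 4))   # 1.1/core
```

In the real script the inputs would come from the target host (e.g. `/proc/loadavg` and `nproc` over SSH), so the hard-coded arguments here are for illustration only.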
### Other Thresholds
| Check | Threshold | Notes |
|-------|-----------|-------|
| Zombie processes | 5 | Single zombies are transient noise; alert only if ≥ 5 |
| Swap usage | 30% of total swap | Percentage-based to handle varied swap sizes across hosts |
| Disk warning | 85% | |
| Disk critical | 95% | |
| Memory | 90% | |
| Uptime | n/a (informational) | Non-urgent Discord post, not a page-level alert |
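The remaining thresholds are simple comparisons. A hedged sketch, with illustrative names (the real constants and metric collection live in `health_check.py`):

```python
# Threshold constants from the table above; names are hypothetical.
DISK_WARN_PCT = 85
DISK_CRIT_PCT = 95
MEM_CRIT_PCT = 90
SWAP_CRIT_PCT = 30
ZOMBIE_ALERT_MIN = 5

def disk_status(used_pct: float) -> str:
    """Classify disk usage percentage as 'ok', 'warn', or 'crit'."""
    if used_pct >= DISK_CRIT_PCT:
        return "crit"
    if used_pct >= DISK_WARN_PCT:
        return "warn"
    return "ok"

def zombies_alert(count: int) -> bool:
    # Single zombies are transient noise; alert only at 5 or more.
    return count >= ZOMBIE_ALERT_MIN

def swap_alert(used_kib: int, total_kib: int) -> bool:
    # Percentage-based so varied swap sizes across hosts compare fairly.
    return total_kib > 0 and used_kib / total_kib * 100 >= SWAP_CRIT_PCT
```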
## Related
- [monitoring/CONTEXT.md](../CONTEXT.md) — Overall monitoring architecture
- [productivity/n8n/CONTEXT.md](../../productivity/n8n/CONTEXT.md) — N8N deployment
- Uptime Kuma status page: https://status.manticorum.com