docs: document per-core load threshold policy for server health monitoring (#22)

Closes #22

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 13:35:23 -05:00 · 193ae68f96
commit 193ae68f96
parent 7c9c96eb52
1 changed file with 36 additions and 0 deletions
--- a/monitoring/server-diagnostics/CONTEXT.md
+++ b/monitoring/server-diagnostics/CONTEXT.md
@@ -92,6 +92,42 @@ CT 302 does **not** have an SSH key registered with Gitea, so SSH git remotes wo
 3. Commit to Gitea, pull on CT 302
 4. Add Uptime Kuma monitors if desired

+## Health Check Thresholds
+
+Thresholds are evaluated in `health_check.py`. All load thresholds use **per-core** metrics
+to avoid false positives from LXC containers (which see the Proxmox host's aggregate load).
+
+### Load Average
+
+| Metric | Value | Rationale |
+|--------|-------|-----------|
+| `LOAD_WARN_PER_CORE` | `0.7` | Elevated — investigate if sustained |
+| `LOAD_CRIT_PER_CORE` | `1.0` | Saturated — CPU is a bottleneck |
+| Sample window | 5-minute | Filters transient spikes (not 1-minute) |
+
+**Formula**: `load_per_core = load_5m / nproc`
+
+**Why per-core?** Proxmox LXC containers see the host's aggregate load average via the
+shared kernel. A 32-core Proxmox host at load 9 is at 0.28/core (healthy), but a naive
+absolute threshold of 2× the core count (8 for a 4-core LXC) would fire at load 9. Using
+`load_5m / nproc`, where `nproc` returns the host's visible core count, gives the correct ratio.
+
+**Validation examples**:
+- Proxmox host: load 9 / 32 cores = 0.28/core → no alert ✓
+- VM 116 at 0.75/core → warning ✓ (above 0.7 threshold)
+- VM at 1.1/core → critical ✓
+
+### Other Thresholds
+
+| Check | Threshold | Notes |
+|-------|-----------|-------|
+| Zombie processes | 5 | Single zombies are transient noise; alert only if ≥ 5 |
+| Swap usage | 30% of total swap | Percentage-based to handle varied swap sizes across hosts |
+| Disk warning | 85% | |
+| Disk critical | 95% | |
+| Memory | 90% | |
+| Uptime alert | Non-urgent Discord post | Not a page-level alert |
+
 ## Related

 - [monitoring/CONTEXT.md](../CONTEXT.md) — Overall monitoring architecture
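
The per-core formula and validation examples added in this change can be sketched in Python as follows. This is a minimal illustration, not the actual contents of `health_check.py`: the function names `visible_cores` and `classify_load` are hypothetical, and treating the thresholds as inclusive (`>=`) is an assumption, since the documentation only shows values strictly above each threshold.

```python
import os

# Threshold values copied from the Load Average table in this change.
LOAD_WARN_PER_CORE = 0.7
LOAD_CRIT_PER_CORE = 1.0


def visible_cores() -> int:
    """Approximate what `nproc` reports: the CPUs this process may run on."""
    try:
        # Linux-only; respects CPU affinity, similar to `nproc`.
        return len(os.sched_getaffinity(0))
    except AttributeError:
        # Fallback for platforms without sched_getaffinity.
        return os.cpu_count() or 1


def classify_load(load_5m: float, cores: int) -> str:
    """Return 'ok', 'warning', or 'critical' for a 5-minute load average.

    Assumption: thresholds are inclusive (>=).
    """
    if cores < 1:
        raise ValueError("cores must be at least 1")
    load_per_core = load_5m / cores
    if load_per_core >= LOAD_CRIT_PER_CORE:
        return "critical"
    if load_per_core >= LOAD_WARN_PER_CORE:
        return "warning"
    return "ok"


def current_status() -> str:
    """Classify the current machine using the 5-minute load average."""
    _, load_5m, _ = os.getloadavg()  # (1-minute, 5-minute, 15-minute)
    return classify_load(load_5m, visible_cores())
```

With these definitions, `classify_load(9, 32)` returns `"ok"` because 9 / 32 is about 0.28 per core, which matches the first validation example in the documentation.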