Merge pull request 'fix: document per-core load threshold policy for health monitoring (#22)' (#42) from issue/22-tune-n8n-alert-thresholds-to-per-core-load-metrics into main

2026-04-03 18:36:14 +00:00 · 4e33e1cae3
commit 4e33e1cae3
parent 7c9c96eb52 193ae68f96
1 changed files with 36 additions and 0 deletions
--- a/monitoring/server-diagnostics/CONTEXT.md
+++ b/monitoring/server-diagnostics/CONTEXT.md
@@ -92,6 +92,42 @@ CT 302 does **not** have an SSH key registered with Gitea, so SSH git remotes wo
 3. Commit to Gitea, pull on CT 302
 4. Add Uptime Kuma monitors if desired
 ## Health Check Thresholds
 Thresholds are evaluated in `health_check.py`. All load thresholds use **per-core** metrics
 to avoid false positives from LXC containers (which see the Proxmox host's aggregate load).
 ### Load Average
 | Metric | Value | Rationale |
 |--------|-------|-----------|
 | `LOAD_WARN_PER_CORE` | `0.7` | Elevated — investigate if sustained |
 | `LOAD_CRIT_PER_CORE` | `1.0` | Saturated — CPU is a bottleneck |
 | Sample window | 5-minute | Smooths out transient spikes that the 1-minute average would flag |
 **Formula**: `load_per_core = load_5m / nproc`
 **Why per-core?** Proxmox LXC containers share the host kernel, so they report the host's
 aggregate load average. A 32-core Proxmox host at load 9 is at 0.28/core (healthy), but a naive
 absolute threshold of 2× the core count (8 for a 4-core LXC) would trigger at load 9. Dividing
 `load_5m` by `nproc`, where `nproc` returns the host's visible core count, gives the correct ratio.
 **Validation examples**:
 - Proxmox host: load 9 / 32 cores = 0.28/core → no alert ✓
 - VM 116 at 0.75/core → warning ✓ (above 0.7 threshold)
 - VM at 1.1/core → critical ✓
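The formula and validation examples above can be sketched in Python as follows. This is a minimal illustration, not the contents of `health_check.py`: the function names `load_per_core`, `classify_load`, and `check_local_load` are hypothetical, and treating the thresholds as inclusive (`>=`) is an assumption, since the table does not say whether the comparison is strict.

```python
import os

# Threshold values taken from the Load Average table.
LOAD_WARN_PER_CORE = 0.7
LOAD_CRIT_PER_CORE = 1.0


def load_per_core(load_5m: float, nproc: int) -> float:
    """Return the 5-minute load average divided by the core count."""
    if nproc <= 0:
        raise ValueError("nproc must be positive")
    return load_5m / nproc


def classify_load(per_core: float) -> str:
    """Map a per-core load to 'ok', 'warning', or 'critical'.

    Assumption: thresholds are inclusive (>=); the document does not specify.
    """
    if per_core >= LOAD_CRIT_PER_CORE:
        return "critical"
    if per_core >= LOAD_WARN_PER_CORE:
        return "warning"
    return "ok"


def check_local_load() -> str:
    """Classify this machine's load (hypothetical helper, Unix only)."""
    # os.getloadavg() returns the 1-, 5-, and 15-minute load averages.
    _, load_5m, _ = os.getloadavg()
    return classify_load(load_per_core(load_5m, os.cpu_count() or 1))


if __name__ == "__main__":
    # The three validation examples from the document.
    print(classify_load(load_per_core(9, 32)))  # Proxmox host, 0.28/core
    print(classify_load(0.75))                  # VM 116
    print(classify_load(1.1))                   # saturated VM
```

Keeping the classification separate from the measurement makes the three validation cases checkable without touching a live host.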
 ### Other Thresholds
 | Check | Threshold | Notes |
 |-------|-----------|-------|
 | Zombie processes | 5 | Single zombies are transient noise; alert only if ≥ 5 |
 | Swap usage | 30% of total swap | Percentage-based to handle varied swap sizes across hosts |
 | Disk warning | 85% | |
 | Disk critical | 95% | |
 | Memory | 90% | |
 | Uptime alert | Non-urgent Discord post | Not a page-level alert |
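As a rough sketch of how the numeric thresholds in this table might be evaluated together, assuming the caller has already collected the raw figures: the constant names, the `evaluate` function, and its return labels are all hypothetical, and the inclusive comparisons for swap, disk, and memory are assumptions (only the zombie check is stated as `≥ 5`). The uptime row is left out because the table gives no numeric threshold for it.

```python
from typing import List

# Values from the Other Thresholds table; the names are illustrative.
ZOMBIE_THRESHOLD = 5
SWAP_PCT_THRESHOLD = 30.0
DISK_WARN_PCT = 85.0
DISK_CRIT_PCT = 95.0
MEMORY_PCT_THRESHOLD = 90.0


def swap_pct(used_bytes: int, total_bytes: int) -> float:
    """Swap usage as a percentage of total swap; 0.0 if no swap is configured."""
    if total_bytes <= 0:
        return 0.0
    return 100.0 * used_bytes / total_bytes


def evaluate(zombies: int, swap_used: int, swap_total: int,
             disk_pct: float, mem_pct: float) -> List[str]:
    """Return a list of labels for every threshold that is breached."""
    alerts: List[str] = []
    if zombies >= ZOMBIE_THRESHOLD:
        alerts.append("zombies")
    if swap_pct(swap_used, swap_total) >= SWAP_PCT_THRESHOLD:
        alerts.append("swap")
    # Check the critical disk level first so only one disk label is emitted.
    if disk_pct >= DISK_CRIT_PCT:
        alerts.append("disk:critical")
    elif disk_pct >= DISK_WARN_PCT:
        alerts.append("disk:warning")
    if mem_pct >= MEMORY_PCT_THRESHOLD:
        alerts.append("memory")
    return alerts


if __name__ == "__main__":
    print(evaluate(zombies=1, swap_used=1, swap_total=10,
                   disk_pct=50.0, mem_pct=50.0))
    print(evaluate(zombies=5, swap_used=4, swap_total=10,
                   disk_pct=90.0, mem_pct=95.0))
```

Expressing swap as a percentage, as the table notes, lets the same constant apply to hosts with different swap sizes, and hosts with no swap never trip the check.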
 ## Related
 - [monitoring/CONTEXT.md](../CONTEXT.md) — Overall monitoring architecture