diff --git a/monitoring/server-diagnostics/CONTEXT.md b/monitoring/server-diagnostics/CONTEXT.md
index f74318d..43d9541 100644
--- a/monitoring/server-diagnostics/CONTEXT.md
+++ b/monitoring/server-diagnostics/CONTEXT.md
@@ -92,6 +92,42 @@ CT 302 does **not** have an SSH key registered with Gitea, so SSH git remotes wo
 3. Commit to Gitea, pull on CT 302
 4. Add Uptime Kuma monitors if desired
 
+## Health Check Thresholds
+
+Thresholds are evaluated in `health_check.py`. All load thresholds use **per-core** metrics
+to avoid false positives from LXC containers (which see the Proxmox host's aggregate load).
+
+### Load Average
+
+| Metric | Value | Rationale |
+|--------|-------|-----------|
+| `LOAD_WARN_PER_CORE` | `0.7` | Elevated — investigate if sustained |
+| `LOAD_CRIT_PER_CORE` | `1.0` | Saturated — CPU is a bottleneck |
+| Sample window | 5-minute | Filters transient spikes (not 1-minute) |
+
+**Formula**: `load_per_core = load_5m / nproc`
+
+**Why per-core?** Proxmox LXC containers see the host's aggregate load average via the
+shared kernel. A 32-core Proxmox host at load 9 is at 0.28/core (healthy), but a naive
+absolute threshold of 2× the core count would trigger at load 9 for a 4-core LXC. Dividing
+`load_5m` by `nproc`, where `nproc` returns the host's visible core count, gives the correct ratio.
+
+**Validation examples**:
+- Proxmox host: load 9 / 32 cores = 0.28/core → no alert ✓
+- VM 116 at 0.75/core → warning ✓ (above 0.7 threshold)
+- VM at 1.1/core → critical ✓
+
+### Other Thresholds
+
+| Check | Threshold | Notes |
+|-------|-----------|-------|
+| Zombie processes | 5 | Single zombies are transient noise; alert only if ≥ 5 |
+| Swap usage | 30% of total swap | Percentage-based to handle varied swap sizes across hosts |
+| Disk warning | 85% | |
+| Disk critical | 95% | |
+| Memory | 90% | |
+| Uptime alert | Non-urgent Discord post | Not a page-level alert |
+
 ## Related
 
 - [monitoring/CONTEXT.md](../CONTEXT.md) — Overall monitoring architecture
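
Reviewer note: the per-core load check documented under "Load Average" in this diff could be sketched roughly as follows. This is a minimal illustration, not the contents of `health_check.py`. The function names `classify_load` and `current_load_status` are hypothetical, `os.cpu_count()` is used as a stand-in for `nproc`, and treating a value exactly at a threshold as an alert (`>=`) is an assumption, since the document only says "above".

```python
import os

# Threshold names and values are taken from the Load Average table.
LOAD_WARN_PER_CORE = 0.7
LOAD_CRIT_PER_CORE = 1.0


def classify_load(load_5m: float, cores: int) -> str:
    """Classify a 5-minute load average as 'ok', 'warning', or 'critical'.

    Hypothetical helper: applies load_per_core = load_5m / nproc and
    compares the result against the two per-core thresholds.
    """
    if cores < 1:
        raise ValueError("cores must be at least 1")
    load_per_core = load_5m / cores
    # Assumption: a value exactly at a threshold counts as reaching it.
    if load_per_core >= LOAD_CRIT_PER_CORE:
        return "critical"
    if load_per_core >= LOAD_WARN_PER_CORE:
        return "warning"
    return "ok"


def current_load_status() -> str:
    """Classify the load of the machine this runs on (Unix only)."""
    # os.getloadavg() returns the 1-, 5-, and 15-minute load averages.
    _, load_5m, _ = os.getloadavg()
    # Assumption: os.cpu_count() approximates what `nproc` would report.
    cores = os.cpu_count() or 1
    return classify_load(load_5m, cores)
```

With the validation examples from the diff, `classify_load(9, 32)` returns `"ok"`, `classify_load(3.0, 4)` (0.75/core) returns `"warning"`, and `classify_load(2.2, 2)` (1.1/core) returns `"critical"`.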