Merge pull request 'fix: document per-core load threshold policy for health monitoring (#22)' (#42) from issue/22-tune-n8n-alert-thresholds-to-per-core-load-metrics into main

cal 2026-04-03 18:36:14 +00:00
commit 4e33e1cae3


@@ -92,6 +92,42 @@ CT 302 does **not** have an SSH key registered with Gitea, so SSH git remotes wo
3. Commit to Gitea, pull on CT 302
4. Add Uptime Kuma monitors if desired
## Health Check Thresholds

Thresholds are evaluated in `health_check.py`. All load thresholds use **per-core** metrics
to avoid false positives from LXC containers (which see the Proxmox host's aggregate load).

### Load Average
| Metric | Value | Rationale |
|--------|-------|-----------|
| `LOAD_WARN_PER_CORE` | `0.7` | Elevated — investigate if sustained |
| `LOAD_CRIT_PER_CORE` | `1.0` | Saturated — CPU is a bottleneck |
| Sample window | 5-minute | Filters transient spikes (not 1-minute) |

**Formula**: `load_per_core = load_5m / nproc`

**Why per-core?** Proxmox LXC containers see the host's aggregate load average via the
shared kernel. A 32-core Proxmox host at load 9 is at 0.28/core (healthy), but a naive
absolute threshold of 2× the container's core count (8 for a 4-core LXC) would fire at
load 9. Using `load_5m / nproc`, where `nproc` returns the host's visible core count,
gives the correct ratio.

**Validation examples**:
- Proxmox host: load 9 / 32 cores = 0.28/core → no alert ✓
- VM 116 at 0.75/core → warning ✓ (above 0.7 threshold)
- VM at 1.1/core → critical ✓
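
The formula and validation examples above can be sketched in Python as follows. This is a minimal illustration, not the actual contents of `health_check.py`: the function names `visible_cores`, `classify_load`, and `current_load_status` are hypothetical, the use of `os.sched_getaffinity` as a stand-in for `nproc` is an assumption, and treating a value exactly at a threshold as triggering (`>=`) is also an assumption.

```python
import os

# Thresholds taken from the Load Average table above.
LOAD_WARN_PER_CORE = 0.7
LOAD_CRIT_PER_CORE = 1.0


def visible_cores() -> int:
    """Approximate `nproc`: count the CPUs this process may run on."""
    try:
        return len(os.sched_getaffinity(0))  # Linux only
    except AttributeError:
        return os.cpu_count() or 1  # fallback on other platforms


def classify_load(load_5m: float, cores: int) -> str:
    """Return 'ok', 'warning', or 'critical' for a 5-minute load average."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    load_per_core = load_5m / cores
    # Assumption: a value exactly on a threshold counts as exceeding it.
    if load_per_core >= LOAD_CRIT_PER_CORE:
        return "critical"
    if load_per_core >= LOAD_WARN_PER_CORE:
        return "warning"
    return "ok"


def current_load_status() -> str:
    """Classify the live system using the 5-minute load average."""
    _, load_5m, _ = os.getloadavg()
    return classify_load(load_5m, visible_cores())
```

Keeping `classify_load` free of system calls makes the three validation cases easy to check in isolation, for example `classify_load(9, 32)` for the Proxmox host case.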
### Other Thresholds
| Check | Threshold | Notes |
|-------|-----------|-------|
| Zombie processes | 5 | Single zombies are transient noise; alert only if ≥ 5 |
| Swap usage | 30% of total swap | Percentage-based to handle varied swap sizes across hosts |
| Disk warning | 85% | |
| Disk critical | 95% | |
| Memory | 90% | |
| Uptime alert | Non-urgent Discord post | Not a page-level alert |
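
As a rough sketch of how the non-load thresholds in this table might be applied, the snippet below evaluates a set of already-collected readings. The constant names, the `evaluate` function, and its parameters are hypothetical and are not taken from `health_check.py`. Only the zombie check is documented as inclusive (≥ 5); inclusive comparisons for the other checks are an assumption.

```python
from typing import List

# Values from the Other Thresholds table; the names are illustrative only.
ZOMBIE_THRESHOLD = 5
SWAP_PCT_THRESHOLD = 30.0
DISK_WARN_PCT = 85.0
DISK_CRIT_PCT = 95.0
MEMORY_PCT_THRESHOLD = 90.0


def percent(used: float, total: float) -> float:
    """Percentage of total in use; 0.0 when total is zero (e.g. no swap)."""
    return 100.0 * used / total if total > 0 else 0.0


def evaluate(zombies: int, swap_used: float, swap_total: float,
             disk_pct: float, mem_pct: float) -> List[str]:
    """Return the names of the checks that exceed their thresholds."""
    alerts = []
    if zombies >= ZOMBIE_THRESHOLD:
        alerts.append("zombies")
    # Swap is compared as a percentage so hosts with different swap sizes
    # are treated consistently.
    if percent(swap_used, swap_total) >= SWAP_PCT_THRESHOLD:
        alerts.append("swap")
    if disk_pct >= DISK_CRIT_PCT:
        alerts.append("disk_critical")
    elif disk_pct >= DISK_WARN_PCT:
        alerts.append("disk_warning")
    if mem_pct >= MEMORY_PCT_THRESHOLD:
        alerts.append("memory")
    return alerts
```

A host with no swap configured reports 0% usage here rather than raising a division error, which seems the safer default for a percentage-based check.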
## Related
- [monitoring/CONTEXT.md](../CONTEXT.md) — Overall monitoring architecture