Merge pull request 'fix: document per-core load threshold policy for health monitoring (#22)' (#42) from issue/22-tune-n8n-alert-thresholds-to-per-core-load-metrics into main
All checks were successful
Reindex Knowledge Base / reindex (push) Successful in 2s
commit 4e33e1cae3
@@ -92,6 +92,42 @@ CT 302 does **not** have an SSH key registered with Gitea, so SSH git remotes wo
3. Commit to Gitea, pull on CT 302
4. Add Uptime Kuma monitors if desired

## Health Check Thresholds

Thresholds are evaluated in `health_check.py`. All load thresholds use **per-core** metrics
to avoid false positives from LXC containers (which see the Proxmox host's aggregate load).

### Load Average

| Metric | Value | Rationale |
|--------|-------|-----------|
| `LOAD_WARN_PER_CORE` | `0.7` | Elevated — investigate if sustained |
| `LOAD_CRIT_PER_CORE` | `1.0` | Saturated — CPU is a bottleneck |
| Sample window | 5-minute | Filters transient spikes (not 1-minute) |

**Formula**: `load_per_core = load_5m / nproc`

**Why per-core?** Proxmox LXC containers see the host's aggregate load average via the
shared kernel. A 32-core Proxmox host at load 9 is at 0.28/core (healthy), but a naive
absolute threshold of 2× the container's core count (8 for a 4-core LXC) would fire at
that same load of 9. Using `load_5m / nproc`, where `nproc` returns the host's visible
core count, gives the correct ratio.

**Validation examples**:
- Proxmox host: load 9 / 32 cores = 0.28/core → no alert ✓
- VM 116 at 0.75/core → warning ✓ (above 0.7 threshold)
- VM at 1.1/core → critical ✓

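The per-core rule and the validation examples above can be sketched in Python as follows. This is a minimal illustration, not the contents of `health_check.py`: the function names `classify_load` and `current_load_status` are hypothetical, the inclusive (`>=`) comparison is an assumption, and `os.cpu_count()` is only a stand-in for `nproc` (the two can differ when CPU affinity is restricted).

```python
import os

# Values from the Load Average table.
LOAD_WARN_PER_CORE = 0.7
LOAD_CRIT_PER_CORE = 1.0


def classify_load(load_5m: float, cores: int) -> str:
    """Return "ok", "warning", or "critical" for a 5-minute load average."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    load_per_core = load_5m / cores
    # Assumption: thresholds are inclusive; the source does not specify.
    if load_per_core >= LOAD_CRIT_PER_CORE:
        return "critical"
    if load_per_core >= LOAD_WARN_PER_CORE:
        return "warning"
    return "ok"


def current_load_status() -> str:
    """Classify the load of the machine this runs on (Unix only)."""
    _, load_5m, _ = os.getloadavg()  # (1-minute, 5-minute, 15-minute)
    # os.cpu_count() can return None, so fall back to 1 core.
    return classify_load(load_5m, os.cpu_count() or 1)
```

Keeping the classification separate from the system calls makes the thresholds easy to check against known values, such as `classify_load(9, 32)` for the 32-core host example.
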
### Other Thresholds

| Check | Threshold | Notes |
|-------|-----------|-------|
| Zombie processes | 5 | Single zombies are transient noise; alert only if ≥ 5 |
| Swap usage | 30% of total swap | Percentage-based to handle varied swap sizes across hosts |
| Disk warning | 85% | |
| Disk critical | 95% | |
| Memory | 90% | |
| Uptime alert | Non-urgent Discord post | Not a page-level alert |

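As a rough sketch of how the table above might translate into checks, the helpers below classify values that have already been collected. All function and constant names here are hypothetical, and the inclusive comparisons for swap, disk, and memory are assumptions; only the zombie rule (alert at ≥ 5) is stated in the table.

```python
# Values from the Other Thresholds table.
ZOMBIE_THRESHOLD = 5
SWAP_THRESHOLD_PCT = 30.0
DISK_WARN_PCT = 85.0
DISK_CRIT_PCT = 95.0
MEMORY_THRESHOLD_PCT = 90.0


def percent_used(used: float, total: float) -> float:
    """Return used/total as a percentage; 0.0 when total is 0 (e.g. no swap)."""
    return 0.0 if total <= 0 else 100.0 * used / total


def zombie_alert(count: int) -> bool:
    """Alert only when at least ZOMBIE_THRESHOLD zombies exist."""
    return count >= ZOMBIE_THRESHOLD


def swap_alert(used_bytes: float, total_bytes: float) -> bool:
    """Percentage-based, so hosts with different swap sizes share one rule."""
    return percent_used(used_bytes, total_bytes) >= SWAP_THRESHOLD_PCT


def disk_level(used_pct: float) -> str:
    """Return "ok", "warning", or "critical" for a disk usage percentage."""
    if used_pct >= DISK_CRIT_PCT:
        return "critical"
    if used_pct >= DISK_WARN_PCT:
        return "warning"
    return "ok"


def memory_alert(used_pct: float) -> bool:
    """Alert when memory usage reaches MEMORY_THRESHOLD_PCT."""
    return used_pct >= MEMORY_THRESHOLD_PCT
```

Guarding against a zero total in `percent_used` keeps hosts without swap from raising a division error.
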
## Related
- [monitoring/CONTEXT.md](../CONTEXT.md) — Overall monitoring architecture