docs: document per-core load threshold policy for server health monitoring (#22)
All checks were successful
Auto-merge docs-only PRs / auto-merge-docs (pull_request) Successful in 5s
Closes #22 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This commit is contained in:
parent
7c9c96eb52
commit
193ae68f96
@@ -92,6 +92,42 @@ CT 302 does **not** have an SSH key registered with Gitea, so SSH git remotes wo
3. Commit to Gitea, pull on CT 302
4. Add Uptime Kuma monitors if desired

## Health Check Thresholds

Thresholds are evaluated in `health_check.py`. All load thresholds use **per-core** metrics
to avoid false positives from LXC containers (which see the Proxmox host's aggregate load).

### Load Average

| Metric | Value | Rationale |
|--------|-------|-----------|
| `LOAD_WARN_PER_CORE` | `0.7` | Elevated: investigate if sustained |
| `LOAD_CRIT_PER_CORE` | `1.0` | Saturated: CPU is a bottleneck |
| Sample window | 5-minute | Filters transient spikes (not 1-minute) |

**Formula**: `load_per_core = load_5m / nproc`

**Why per-core?** Proxmox LXC containers share the host kernel, so they report the host's
aggregate load average. A 32-core Proxmox host at load 9 is at 0.28/core (healthy), but a naive
absolute threshold of 2× the container's core count (8 for a 4-core LXC) would fire at that
same load of 9. Dividing `load_5m` by `nproc`, where `nproc` returns the host's visible core
count, gives the correct ratio.

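The formula can be sketched in Python as follows. This is a minimal illustration, not the actual contents of `health_check.py`: the function names are hypothetical, and `os.cpu_count()` is used here as a stand-in for `nproc` (on Linux, `len(os.sched_getaffinity(0))` may be closer to what `nproc` reports).

```python
import os


def load_per_core(load_5m: float, nproc: int) -> float:
    """Normalize a 5-minute load average by the visible core count."""
    if nproc <= 0:
        raise ValueError("nproc must be positive")
    return load_5m / nproc


def current_load_per_core() -> float:
    """Read the live 5-minute load average and core count (Unix only)."""
    _, load_5m, _ = os.getloadavg()  # (1-min, 5-min, 15-min)
    # Assumption: os.cpu_count() approximates `nproc`; it can return None,
    # in which case we fall back to a single core.
    nproc = os.cpu_count() or 1
    return load_per_core(load_5m, nproc)


# The example from the text: load 9 on 32 visible cores.
print(round(load_per_core(9.0, 32), 2))  # 0.28
```

Keeping the division in a pure function separates the arithmetic from the system calls, so the ratio can be checked against known values without a live host.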
**Validation examples**:
- Proxmox host: load 9 / 32 cores = 0.28/core → no alert ✓
- VM 116 at 0.75/core → warning ✓ (above 0.7 threshold)
- VM at 1.1/core → critical ✓

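The validation examples can be reproduced with a small classifier. This is a sketch under assumptions: `classify_load` is a hypothetical name, and treating a value exactly equal to a threshold as triggering (`>=`) is a guess, since the source does not say whether the comparison is inclusive.

```python
LOAD_WARN_PER_CORE = 0.7
LOAD_CRIT_PER_CORE = 1.0


def classify_load(load_per_core: float) -> str:
    """Map a per-core load value to 'ok', 'warning', or 'critical'."""
    # Assumption: thresholds are inclusive (>=).
    if load_per_core >= LOAD_CRIT_PER_CORE:
        return "critical"
    if load_per_core >= LOAD_WARN_PER_CORE:
        return "warning"
    return "ok"


print(classify_load(9 / 32))  # ok
print(classify_load(0.75))    # warning
print(classify_load(1.1))     # critical
```

Checking the critical threshold first means a saturated host is reported once, as critical, rather than also as a warning.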
### Other Thresholds

| Check | Threshold | Notes |
|-------|-----------|-------|
| Zombie processes | 5 | Single zombies are transient noise; alert only if ≥ 5 |
| Swap usage | 30% of total swap | Percentage-based to handle varied swap sizes across hosts |
| Disk warning | 85% | |
| Disk critical | 95% | |
| Memory | 90% | |
| Uptime alert | Non-urgent Discord post | Not a page-level alert |

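One way these thresholds might be evaluated is sketched below. The constant and function names are hypothetical and not taken from `health_check.py`. Inclusive comparisons are an assumption for every check except zombies, where the table states ≥ 5. The uptime alert is omitted because the table gives no numeric threshold for it.

```python
import shutil

# Hypothetical constant names; values come from the table above.
ZOMBIE_MIN = 5
SWAP_MAX_PCT = 30.0
DISK_WARN_PCT = 85.0
DISK_CRIT_PCT = 95.0
MEM_MAX_PCT = 90.0


def percent(used: float, total: float) -> float:
    """Return used/total as a percentage; 0.0 when total is zero (e.g. no swap)."""
    return 100.0 * used / total if total > 0 else 0.0


def disk_used_percent(path: str = "/") -> float:
    """Disk usage of the filesystem containing `path`, as a percentage."""
    usage = shutil.disk_usage(path)
    return percent(usage.used, usage.total)


def evaluate(zombies, swap_used, swap_total, disk_pct, mem_pct) -> list:
    """Return the names of the checks that exceed their thresholds."""
    alerts = []
    if zombies >= ZOMBIE_MIN:
        alerts.append("zombies")
    if percent(swap_used, swap_total) >= SWAP_MAX_PCT:
        alerts.append("swap")
    if disk_pct >= DISK_CRIT_PCT:
        alerts.append("disk critical")
    elif disk_pct >= DISK_WARN_PCT:
        alerts.append("disk warning")
    if mem_pct >= MEM_MAX_PCT:
        alerts.append("memory")
    return alerts


print(evaluate(zombies=1, swap_used=1, swap_total=8, disk_pct=86.0, mem_pct=50.0))
# ['disk warning']
```

Expressing swap as a fraction of total swap, rather than an absolute size, is what lets the same 30% limit apply to hosts with different swap allocations; a host with no swap evaluates to 0% and never alerts.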
## Related

- [monitoring/CONTEXT.md](../CONTEXT.md): Overall monitoring architecture