Tune n8n alert thresholds to per-core load metrics #22

Closed
opened 2026-04-03 01:08:35 +00:00 by cal · 1 comment
Owner

Context

n8n's server health monitoring alerts on absolute load averages, causing false positives. LXC containers see the Proxmox host's load average (~9), which looks alarming but is only ~0.28/core on a 32-core machine.

The monitoring system runs on CT 302 (claude-runner) via health_check.py in the cal/claude-runner-monitoring repo.

Current Problem

  • Load average of 9 triggers alerts
  • But 9 / 32 cores = 0.28/core (completely healthy)
  • Every LXC on the Proxmox host sees the same host load, amplifying the false positives

Required Changes

health_check.py

  • Ensure cpu_count (via nproc) is collected and included in the health check output JSON
  • Change load evaluation to use load_5m / cpu_count instead of raw load_5m
  • Apply thresholds:
    LOAD_WARN_PER_CORE = 0.7   # elevated
    LOAD_CRIT_PER_CORE = 1.0   # saturated
    
  • Use 5-minute load average (not 1-minute) — filters transient spikes
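The health_check.py changes above can be sketched as follows. This is a minimal illustration, assuming the thresholds proposed in this issue; the function names and JSON keys are hypothetical, and the real health_check.py may structure its output differently:

```python
import os

# Per-core thresholds proposed in this issue
LOAD_WARN_PER_CORE = 0.7   # elevated
LOAD_CRIT_PER_CORE = 1.0   # saturated

def evaluate_load(load_5m: float, cpu_count: int) -> str:
    """Classify the 5-minute load average normalized per core."""
    per_core = load_5m / cpu_count
    if per_core >= LOAD_CRIT_PER_CORE:
        return "critical"
    if per_core >= LOAD_WARN_PER_CORE:
        return "warning"
    return "ok"

def load_metrics() -> dict:
    """Collect load figures for the health check output JSON."""
    _load_1m, load_5m, _load_15m = os.getloadavg()
    cpu_count = os.cpu_count() or 1  # same value `nproc` reports
    return {
        "load_5m": round(load_5m, 2),
        "cpu_count": cpu_count,
        "load_per_core": round(load_5m / cpu_count, 2),
        "load_status": evaluate_load(load_5m, cpu_count),
    }
```

Under this rule a host load of 9 on 32 cores classifies as ok (0.28/core), while 0.75/core classifies as a warning.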

Additional threshold tuning

  • Zombie threshold: raise from 1 to 5 (single zombies are transient noise)
  • Swap threshold: change from absolute MB (500 MB) to percentage-based (swap_used / swap_total > 30%) — manticore has 32 GB RAM, so 978 MB of swap is a different baseline than on a 4 GB LXC
  • Uptime alert: should post a non-urgent Discord message, not a page-level alert
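The zombie and swap rules above could be expressed like this (a sketch; the helper names are hypothetical and the actual checks may differ):

```python
ZOMBIE_WARN_COUNT = 5      # raised from 1; single zombies are transient noise
SWAP_WARN_FRACTION = 0.30  # alert on swap_used / swap_total, not absolute MB

def evaluate_zombies(zombie_count: int) -> str:
    """Warn only on a sustained zombie pile-up, not a lone stray process."""
    return "warning" if zombie_count >= ZOMBIE_WARN_COUNT else "ok"

def evaluate_swap(swap_used_mb: float, swap_total_mb: float) -> str:
    """Percentage-based swap check; a host with no swap configured is ok."""
    if swap_total_mb <= 0:
        return "ok"
    fraction = swap_used_mb / swap_total_mb
    return "warning" if fraction > SWAP_WARN_FRACTION else "ok"
```

With a percentage rule, manticore's 978 MB of swap only warns if it exceeds 30% of its swap device, rather than tripping a flat 500 MB cutoff.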

n8n workflow

  • Update the Code node that evaluates health check results to use the per-core formula
  • Test with known-good data to verify no false positives

Validation

  • Run health checks against all servers and confirm no false positives
  • Verify VM 116 (which was at 0.75/core) would correctly trigger a warning
  • Verify Proxmox host (0.28/core) does NOT trigger
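The validation cases above can be replayed offline against the per-core rule before touching the live workflow (a sketch, using the thresholds proposed in this issue):

```python
LOAD_WARN_PER_CORE = 0.7
LOAD_CRIT_PER_CORE = 1.0

def classify(per_core: float) -> str:
    """Map a per-core load figure to an alert level."""
    if per_core >= LOAD_CRIT_PER_CORE:
        return "critical"
    if per_core >= LOAD_WARN_PER_CORE:
        return "warning"
    return "ok"

# Known data points from this issue
cases = {
    "proxmox-host": 9 / 32,  # 0.28/core — must stay quiet
    "vm-116": 0.75,          # must raise a warning
}
for name, per_core in cases.items():
    print(f"{name}: {classify(per_core)}")
```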

Labels

infra-audit, monitoring

cal added the
infra-audit
monitoring
labels 2026-04-03 01:10:17 +00:00
Claude added the
ai-working
label 2026-04-03 18:30:50 +00:00
cal closed this issue 2026-04-03 18:36:15 +00:00
Claude added the
ai-pr-opened
label 2026-04-03 18:36:17 +00:00
Collaborator

PR #42 opened: https://git.manticorum.com/cal/claude-home/pulls/42

The PR adds a Health Check Thresholds section to monitoring/server-diagnostics/CONTEXT.md documenting the per-core load policy and all threshold values.

The actual code changes for cal/claude-runner-monitoring are provided as ready-to-apply snippets in the PR body, covering:

  • Load check: switch from load_1m × multiplier → load_5m / nproc with per-core thresholds (warn: 0.7, crit: 1.0)
  • Zombie threshold: raise trigger to 5
  • Swap check: percentage-based (30%) instead of absolute MB
  • Remove load_multiplier from config.yaml
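The config change might look roughly like this. The key names below are hypothetical — only load_multiplier is confirmed by the PR, and the real config.yaml in cal/claude-runner-monitoring may name things differently:

```yaml
# load_multiplier: removed — superseded by per-core thresholds

load_warn_per_core: 0.7
load_crit_per_core: 1.0
zombie_warn_count: 5
swap_warn_fraction: 0.30
```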

After merging, deploy with: ssh claude-runner "cd /root/.claude && git pull"

Claude removed the
ai-working
label 2026-04-03 18:36:27 +00:00
Reference: cal/claude-home#22