Tune n8n alert thresholds to per-core load metrics #22

Closed
opened 2026-04-03 01:08:35 +00:00 by cal · 1 comment
Owner

Context

n8n's server health monitoring alerts on absolute load averages, causing false positives. LXC containers see the Proxmox host's load average (~9), which looks alarming but is only ~0.28/core on a 32-core machine.

The monitoring system runs on CT 302 (claude-runner) via health_check.py in the cal/claude-runner-monitoring repo.

Current Problem

  • Load average of 9 triggers alerts
  • But 9 / 32 cores = 0.28/core (completely healthy)
  • Every LXC on the Proxmox host sees the same host load, amplifying the false positives

Required Changes

health_check.py

  • Ensure cpu_count (via nproc) is collected and included in the health check output JSON
  • Change load evaluation to use load_5m / cpu_count instead of raw load_5m
  • Apply thresholds:
    LOAD_WARN_PER_CORE = 0.7   # elevated
    LOAD_CRIT_PER_CORE = 1.0   # saturated
    
  • Use 5-minute load average (not 1-minute) — filters transient spikes
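The health_check.py changes above can be sketched as follows. This is a minimal illustration, assuming the thresholds proposed in this issue; the function names and JSON keys are hypothetical, and the real health_check.py may structure its output differently:

```python
import os

# Per-core thresholds proposed in this issue
LOAD_WARN_PER_CORE = 0.7   # elevated
LOAD_CRIT_PER_CORE = 1.0   # saturated

def evaluate_load(load_5m: float, cpu_count: int) -> str:
    """Classify the 5-minute load average normalized per core."""
    per_core = load_5m / cpu_count
    if per_core >= LOAD_CRIT_PER_CORE:
        return "critical"
    if per_core >= LOAD_WARN_PER_CORE:
        return "warning"
    return "ok"

def load_metrics() -> dict:
    """Collect load figures for the health check output JSON."""
    _load_1m, load_5m, _load_15m = os.getloadavg()
    cpu_count = os.cpu_count() or 1  # same value `nproc` reports
    return {
        "load_5m": round(load_5m, 2),
        "cpu_count": cpu_count,
        "load_per_core": round(load_5m / cpu_count, 2),
        "load_status": evaluate_load(load_5m, cpu_count),
    }
```

Under this rule a host load of 9 on 32 cores classifies as ok (0.28/core), while 0.75/core classifies as a warning.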

Additional threshold tuning

  • Zombie threshold: raise from 1 to 5 (single zombies are transient noise)
  • Swap threshold: change from absolute MB (500 MB) to percentage-based (swap_used / swap_total > 30%) — manticore has 32 GB RAM, so 978 MB of swap is a different baseline than on a 4 GB LXC
  • Uptime alert: should post a non-urgent Discord message, not a page-level alert
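The zombie and swap rules above could be expressed like this (a sketch; the helper names are hypothetical and the actual checks may differ):

```python
ZOMBIE_WARN_COUNT = 5      # raised from 1; single zombies are transient noise
SWAP_WARN_FRACTION = 0.30  # alert on swap_used / swap_total, not absolute MB

def evaluate_zombies(zombie_count: int) -> str:
    """Warn only on a sustained zombie pile-up, not a lone stray process."""
    return "warning" if zombie_count >= ZOMBIE_WARN_COUNT else "ok"

def evaluate_swap(swap_used_mb: float, swap_total_mb: float) -> str:
    """Percentage-based swap check; a host with no swap configured is ok."""
    if swap_total_mb <= 0:
        return "ok"
    fraction = swap_used_mb / swap_total_mb
    return "warning" if fraction > SWAP_WARN_FRACTION else "ok"
```

With a percentage rule, manticore's 978 MB of swap only warns if it exceeds 30% of its swap device, rather than tripping a flat 500 MB cutoff.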

n8n workflow

  • Update the Code node that evaluates health check results to use the per-core formula
  • Test with known-good data to verify no false positives

Validation

  • Run health checks against all servers and confirm no false positives
  • Verify VM 116 (which was at 0.75/core) would correctly trigger a warning
  • Verify Proxmox host (0.28/core) does NOT trigger
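The validation cases above can be replayed offline against the per-core rule before touching the live workflow (a sketch, using the thresholds proposed in this issue):

```python
LOAD_WARN_PER_CORE = 0.7
LOAD_CRIT_PER_CORE = 1.0

def classify(per_core: float) -> str:
    """Map a per-core load figure to an alert level."""
    if per_core >= LOAD_CRIT_PER_CORE:
        return "critical"
    if per_core >= LOAD_WARN_PER_CORE:
        return "warning"
    return "ok"

# Known data points from this issue
cases = {
    "proxmox-host": 9 / 32,  # 0.28/core — must stay quiet
    "vm-116": 0.75,          # must raise a warning
}
for name, per_core in cases.items():
    print(f"{name}: {classify(per_core)}")
```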

Labels

infra-audit, monitoring

cal added the
infra-audit
monitoring
labels 2026-04-03 01:10:17 +00:00
Claude added the
ai-working
label 2026-04-03 18:30:50 +00:00
cal closed this issue 2026-04-03 18:36:15 +00:00
Claude added the
ai-pr-opened
label 2026-04-03 18:36:17 +00:00
Collaborator

PR #42 opened: https://git.manticorum.com/cal/claude-home/pulls/42

The PR adds a Health Check Thresholds section to monitoring/server-diagnostics/CONTEXT.md documenting the per-core load policy and all threshold values.

The actual code changes for cal/claude-runner-monitoring are provided as ready-to-apply snippets in the PR body, covering:

  • Load check: switch from load_1m × multiplier → load_5m / nproc with per-core thresholds (warn: 0.7, crit: 1.0)
  • Zombie threshold: raise trigger to 5
  • Swap check: percentage-based (30%) instead of absolute MB
  • Remove load_multiplier from config.yaml
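The config change might look roughly like this. The key names below are hypothetical — only load_multiplier is confirmed by the PR, and the real config.yaml in cal/claude-runner-monitoring may name things differently:

```yaml
# load_multiplier: removed — superseded by per-core thresholds

load_warn_per_core: 0.7
load_crit_per_core: 1.0
zombie_warn_count: 5
swap_warn_fraction: 0.30
```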

After merging, deploy with: ssh claude-runner "cd /root/.claude && git pull"

Claude removed the
ai-working
label 2026-04-03 18:36:27 +00:00
Reference: cal/claude-home#22