Tune n8n alert thresholds to per-core load metrics #22
Reference: cal/claude-home#22
Context
n8n's server health monitoring is alerting on absolute load averages, causing false positives. LXC containers see the Proxmox host's load average (~9), which looks alarming but is only 0.27/core on a 32-core machine.
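The per-core arithmetic above can be sketched as follows. This is a minimal illustration, not the code in health_check.py: the function names are hypothetical, and `os.cpu_count()` is used here as a stand-in for the `nproc` call the issue mentions.

```python
import os


def per_core_load(load_avg: float, cpu_count: int) -> float:
    """Normalize an absolute load average by the number of cores."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return load_avg / cpu_count


def current_per_core_load_5m() -> float:
    """Read the live 5-minute load average and normalize it.

    Assumption: os.cpu_count() stands in for `nproc`; it can return
    None, in which case we fall back to a single core.
    """
    _, load_5m, _ = os.getloadavg()
    cpu_count = os.cpu_count() or 1
    return per_core_load(load_5m, cpu_count)


if __name__ == "__main__":
    # Figures from the issue: host load of about 9 on a 32-core machine.
    print(round(per_core_load(9.0, 32), 2))
```

For a load of exactly 9 on 32 cores this gives about 0.28 per core, in line with the rough 0.27/core figure quoted for a load of ~9.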
The monitoring system runs on CT 302 (claude-runner) via health_check.py in the cal/claude-runner-monitoring repo.
Current Problem
Required Changes
health_check.py
cpu_count (via nproc) is collected and included in the health check output JSON
load_5m / cpu_count instead of raw load_5m
Additional threshold tuning
500 MB) to percentage-based (swap_used / swap_total > 30%); manticore has 32 GB RAM, so 978 MB swap is a different baseline than a 4 GB LXC
n8n workflow
Validation
Labels
infra-audit, monitoring
PR #42 opened: #42
The PR adds a Health Check Thresholds section to monitoring/server-diagnostics/CONTEXT.md documenting the per-core load policy and all threshold values.
The actual code changes for cal/claude-runner-monitoring are provided as ready-to-apply snippets in the PR body, covering:
load_1m × multiplier → load_5m / nproc with per-core thresholds (warn: 0.7, crit: 1.0)
load_multiplier from config.yaml
After merging, deploy with:
ssh claude-runner "cd /root/.claude && git pull"
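The threshold policy summarized in this thread (per-core load with warn: 0.7 and crit: 1.0, and swap usage above 30% of total) might be checked along these lines. This is a sketch under stated assumptions, not the snippets from the PR body: the function names, the status strings, and the choice of inclusive comparisons for load are all hypothetical.

```python
def classify_load(load_5m: float, cpu_count: int,
                  warn: float = 0.7, crit: float = 1.0) -> str:
    """Classify the 5-minute load average on a per-core basis.

    Assumptions: thresholds are inclusive (>=) and the status strings
    are illustrative; the real health check may differ on both.
    """
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    per_core = load_5m / cpu_count
    if per_core >= crit:
        return "critical"
    if per_core >= warn:
        return "warning"
    return "ok"


def swap_exceeds(swap_used_mb: float, swap_total_mb: float,
                 max_fraction: float = 0.30) -> bool:
    """Return True when swap usage exceeds the given fraction of total swap.

    A host with no swap configured is treated as not exceeding the limit.
    """
    if swap_total_mb <= 0:
        return False
    return swap_used_mb / swap_total_mb > max_fraction


if __name__ == "__main__":
    # Host load of about 9 on 32 cores, as described in the issue.
    print(classify_load(9.0, 32))
    # 978 MB used is from the issue; the 8192 MB swap total is a made-up value.
    print(swap_exceeds(978, 8192))
```

With these defaults, the load of about 9 on a 32-core host from the issue falls well under the warning threshold, whereas the same absolute load on a 4-core machine would be classified as critical.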