fix: document per-core load threshold policy for health monitoring (#22) #42
No reviewers
Labels
No Label
ai-changes-requested
ai-failed
ai-pr-opened
ai-reviewed
ai-reviewing
ai-working
infra-audit
monitoring
operations
proxmox
script
No Milestone
No project
No Assignees
1 Participants
Notifications
Due Date
No due date set.
Dependencies
No dependencies set.
Reference: cal/claude-home#42
Loading…
Reference in New Issue
Block a user
No description provided.
Delete Branch "issue/22-tune-n8n-alert-thresholds-to-per-core-load-metrics"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Closes #22
Summary
monitoring/server-diagnostics/CONTEXT.mddocumenting the per-core load policy, zombie/swap thresholds, and the rationale (LXC containers see host aggregate load)Why the KB change lives here
The actual code is in
cal/claude-runner-monitoring. This PR documents the policy so it's findable in the homelab KB and serves as a review checkpoint before applying the code changes.Code changes needed in
cal/claude-runner-monitoringApply these after merging this PR. Deploy via
ssh claude-runner "cd /root/.claude && git pull".skills/server-diagnostics/health_check.py1. Add module-level constants (after imports, before
CONFIG_PATH):2. Replace the load average block in
check_system_metrics():Old:
New:
3. Zombie check — if a zombie check exists, raise its trigger count to 5. If it doesn't exist yet, add this block in
check_system_metrics()after the load check:4. Swap check — replace any absolute-MB swap threshold with a percentage check:
skills/server-diagnostics/config.yamlRemove the now-unused
load_multiplierkey fromthresholds::Validation
After deploying, verify with dry-run mode:
Expected: Proxmox host (load ~9, 32 cores = 0.28/core) produces no load alert.