fix: document per-core load threshold policy for health monitoring (#22) #42

Merged
cal merged 1 commit from issue/22-tune-n8n-alert-thresholds-to-per-core-load-metrics into main 2026-04-03 18:36:15 +00:00
Collaborator

Closes #22

Summary

  • Added a Health Check Thresholds section to monitoring/server-diagnostics/CONTEXT.md documenting the per-core load policy, zombie/swap thresholds, and the rationale (LXC containers see host aggregate load)

Why the KB change lives here

The actual code is in cal/claude-runner-monitoring. This PR documents the policy so it's findable in the homelab KB and serves as a review checkpoint before applying the code changes.

Code changes needed in cal/claude-runner-monitoring

Apply these after merging this PR. Deploy via ssh claude-runner "cd /root/.claude && git pull".

skills/server-diagnostics/health_check.py

1. Add module-level constants (after imports, before CONFIG_PATH):

LOAD_WARN_PER_CORE = 0.7   # elevated
LOAD_CRIT_PER_CORE = 1.0   # saturated
ZOMBIE_THRESHOLD = 5
SWAP_WARN_PERCENT = 30

2. Replace the load average block in check_system_metrics():

Old:

result = ssh_exec(host, "nproc && cat /proc/loadavg", config, server_cfg)
if result["success"]:
    lines = result["stdout"].strip().split("\n")
    if len(lines) >= 2:
        try:
            cores = int(lines[0].strip())
            load_1m = float(lines[1].split()[0])
            multiplier = thresholds.get("load_multiplier", 2)
            threshold = cores * multiplier
            if load_1m > threshold:
                issues.append({
                    "server": server_key,
                    "type": "load_high",
                    "load_1m": load_1m,
                    "cores": cores,
                    "severity": "warning",
                    "message": f"Load {load_1m} exceeds {threshold} ({cores} cores x{multiplier}) on {server_key}",
                    "auto_remediate": False,
                })
        except (ValueError, IndexError):
            pass

New:

result = ssh_exec(host, "nproc && cat /proc/loadavg", config, server_cfg)
if result["success"]:
    lines = result["stdout"].strip().split("\n")
    if len(lines) >= 2:
        try:
            cores = int(lines[0].strip())
            load_5m = float(lines[1].split()[1])  # index 1 = 5-minute average
            load_per_core = load_5m / cores
            if load_per_core >= LOAD_CRIT_PER_CORE:
                issues.append({
                    "server": server_key,
                    "type": "load_high",
                    "load_5m": load_5m,
                    "cpu_count": cores,
                    "load_per_core": round(load_per_core, 2),
                    "severity": "critical",
                    "message": f"Load {load_5m:.1f} ({load_per_core:.2f}/core) saturated on {server_key}",
                    "auto_remediate": False,
                })
            elif load_per_core >= LOAD_WARN_PER_CORE:
                issues.append({
                    "server": server_key,
                    "type": "load_high",
                    "load_5m": load_5m,
                    "cpu_count": cores,
                    "load_per_core": round(load_per_core, 2),
                    "severity": "warning",
                    "message": f"Load {load_5m:.1f} ({load_per_core:.2f}/core) elevated on {server_key}",
                    "auto_remediate": False,
                })
        except (ValueError, IndexError):
            pass
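
The two branches above differ only in severity and wording, so the decision can be isolated in a pure function that is testable without SSH. This is a minimal sketch, not the code in cal/claude-runner-monitoring: `classify_load` is a hypothetical helper name, and the guard against a non-positive core count is an added assumption.

```python
from typing import Optional

LOAD_WARN_PER_CORE = 0.7   # elevated
LOAD_CRIT_PER_CORE = 1.0   # saturated


def classify_load(stdout: str) -> Optional[dict]:
    """Classify the output of `nproc && cat /proc/loadavg`.

    Returns a dict of issue fields, or None when the load is below the
    warning threshold or the output cannot be parsed.
    (Hypothetical helper, for illustration only.)
    """
    lines = stdout.strip().split("\n")
    if len(lines) < 2:
        return None
    try:
        cores = int(lines[0].strip())
        load_5m = float(lines[1].split()[1])  # index 1 = 5-minute average
    except (ValueError, IndexError):
        return None
    if cores <= 0:  # assumption: treat a nonsensical core count as unparseable
        return None

    load_per_core = load_5m / cores
    if load_per_core >= LOAD_CRIT_PER_CORE:
        severity = "critical"
    elif load_per_core >= LOAD_WARN_PER_CORE:
        severity = "warning"
    else:
        return None
    return {
        "load_5m": load_5m,
        "cpu_count": cores,
        "load_per_core": round(load_per_core, 2),
        "severity": severity,
    }
```

The caller would then add `server`, `type`, `message`, and `auto_remediate` before appending to `issues`.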

3. Zombie check — if a zombie check exists, raise its trigger count to 5. If it doesn't exist yet, add this block in check_system_metrics() after the load check:

# Zombie processes (STAT can carry modifiers such as "Z+" or "Zs", so match the prefix)
result = ssh_exec(host, "ps aux | awk '$8 ~ /^Z/' | wc -l", config, server_cfg)
if result["success"]:
    try:
        zombie_count = int(result["stdout"].strip())
        if zombie_count >= ZOMBIE_THRESHOLD:
            issues.append({
                "server": server_key,
                "type": "zombie_processes",
                "count": zombie_count,
                "severity": "warning",
                "message": f"{zombie_count} zombie processes on {server_key}",
                "auto_remediate": False,
            })
    except ValueError:
        pass
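
For reference, the same count can be computed in Python from raw `ps aux` output, which makes the matching rule easy to test locally. This is a sketch under stated assumptions: `count_zombies` is a hypothetical helper, and the sample output is made up for illustration. In `ps aux` the STAT field is the eighth column and may carry modifier characters after the state letter (for example `Z+` or `Zs`), so the sketch matches on the leading `Z` and not on exact equality.

```python
ZOMBIE_THRESHOLD = 5


def count_zombies(ps_output: str) -> int:
    """Count rows of `ps aux` output whose STAT field starts with 'Z'.

    (Hypothetical helper, for illustration only.)
    """
    count = 0
    for line in ps_output.splitlines():
        fields = line.split()
        # STAT is the 8th column; the header row ("STAT") never starts with Z.
        if len(fields) >= 8 and fields[7].startswith("Z"):
            count += 1
    return count


# Made-up sample output: two zombies, one of them with a "+" modifier.
SAMPLE = """\
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.1 168000 11000 ?        Ss   10:00   0:01 /sbin/init
root       200  0.0  0.0      0     0 ?        Z    10:01   0:00 [sh] <defunct>
www        201  0.0  0.0      0     0 pts/0    Z+   10:02   0:00 [php] <defunct>
"""

zombies = count_zombies(SAMPLE)
should_alert = zombies >= ZOMBIE_THRESHOLD
```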

4. Swap check — replace any absolute-MB swap threshold with a percentage check:

# Swap usage (percentage-based)
result = ssh_exec(host, "free | grep Swap", config, server_cfg)
if result["success"]:
    try:
        parts = result["stdout"].split()
        swap_total = int(parts[1])
        swap_used = int(parts[2])
        if swap_total > 0:
            swap_pct = swap_used / swap_total * 100
            if swap_pct > SWAP_WARN_PERCENT:
                issues.append({
                    "server": server_key,
                    "type": "swap_high",
                    "swap_used_pct": round(swap_pct, 1),
                    "severity": "warning",
                    "message": f"Swap at {swap_pct:.1f}% on {server_key} (threshold: {SWAP_WARN_PERCENT}%)",
                    "auto_remediate": False,
                })
    except (ValueError, IndexError):
        pass
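
The percentage calculation can likewise be pulled into a small pure function. This is a minimal sketch, assuming the usual `free` layout in which the `Swap:` row lists total and used as its second and third fields. `swap_used_percent` is a hypothetical helper name, and returning `None` for hosts without swap is an illustrative choice.

```python
from typing import Optional

SWAP_WARN_PERCENT = 30


def swap_used_percent(free_output: str) -> Optional[float]:
    """Return swap usage as a percentage, or None if there is no swap
    or the `free` output cannot be parsed.

    (Hypothetical helper, for illustration only.)
    """
    for line in free_output.splitlines():
        if not line.startswith("Swap:"):
            continue
        parts = line.split()
        try:
            swap_total = int(parts[1])
            swap_used = int(parts[2])
        except (ValueError, IndexError):
            return None
        if swap_total == 0:  # no swap configured: nothing to report
            return None
        return swap_used / swap_total * 100
    return None


# Made-up sample: 350 of 1000 KiB used.
pct = swap_used_percent("Swap:  1000  350  650\n")
should_warn = pct is not None and pct > SWAP_WARN_PERCENT
```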

skills/server-diagnostics/config.yaml

Remove the now-unused load_multiplier key from the thresholds: section:

# Remove this line:
  load_multiplier: 4

Validation

After deploying, verify with dry-run mode:

ssh claude-runner "cd /root/.claude && python3 skills/server-diagnostics/health_check.py --dry-run 2>&1 | python3 -m json.tool"

Expected: the Proxmox host (load ~9 on 32 cores ≈ 0.28/core, below the 0.7 warning threshold) produces no load alert.
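
As a quick sanity check of the expected result, the per-core thresholds can be converted back into absolute load values for a 32-core host. The figures come from the constants and the example above; the variable names are illustrative only.

```python
LOAD_WARN_PER_CORE = 0.7
LOAD_CRIT_PER_CORE = 1.0

cores = 32
load_5m = 9.0  # approximate load quoted for the Proxmox host

load_per_core = load_5m / cores          # 0.28125, reported as 0.28/core
warn_at = LOAD_WARN_PER_CORE * cores     # warning once the 5-minute load reaches 22.4
crit_at = LOAD_CRIT_PER_CORE * cores     # critical once it reaches 32.0

alerts = load_per_core >= LOAD_WARN_PER_CORE  # False: no load alert expected
```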

Claude added 1 commit 2026-04-03 18:36:09 +00:00
docs: document per-core load threshold policy for server health monitoring (#22)
All checks were successful
Auto-merge docs-only PRs / auto-merge-docs (pull_request) Successful in 5s
193ae68f96
Closes #22

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
cal merged commit 4e33e1cae3 into main 2026-04-03 18:36:15 +00:00