claude-memory/graph/procedures/self-managing-n8n-server-health-monitor-with-sub-workflows-226839.md


id: 2268393f-b90f-4ae8-9995-4b942ed2b2f7
type: procedure
title: Self-managing n8n server health monitor with sub-workflows
tags: n8n, homelab, monitoring, claude-runner, architecture, procedure
importance: 0.9
confidence: 0.8
created: 2026-02-20T04:31:58.029736+00:00
updated: 2026-03-01T15:58:58.516927+00:00
relations:
  target                                type        direction  strength  edge_id
  7fdc5ceb-4b8c-426d-8492-948d106f92bb  BUILDS_ON   incoming   0.9       a9109127-1691-4cea-a957-8d55320281d7
  aab3d007-0cdf-4a4f-9b55-096ea4bdc168  RELATED_TO  incoming   0.85      effaafab-948b-488f-b388-4bd92f4ec6c2
  06101183-a78b-4852-86eb-cae5557ace8c  BUILDS_ON   incoming   0.85      02352349-eb84-4c09-8f2b-2e2feafd4f9a
  67898e52-470a-470e-b149-43fef0047ae9  RELATED_TO  incoming   0.81      f0735351-129b-4bf6-9919-e84eaffa9bcd

Architecture

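A master + sub-workflow pattern in n8n that monitors server health through CT 302 (claude-runner, 10.10.0.148). The master workflow fans out one sub-workflow run per server; each run performs its health check over SSH on CT 302.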

Master Workflow: "Server Health Monitor" (id: p7XmW23SgCs3hEkY)

  1. Schedule trigger (every 5 min)
  2. SSH to CT 302 → run list_servers.sh to get server keys from config.yaml as JSON array
  3. Code node: split array into items [{server_key: "arr-stack"}, ...]
  4. Execute Sub-workflow (mode: "each item") → calls "Server Health Check"
  5. Aggregate results (count healthy/remediated/escalated)
  6. If any escalations → Discord summary embed
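
The split (step 3) and aggregate (step 5) logic can be sketched in Python. This is an illustration of the data flow only, not the actual n8n Code node contents. The function names, the `unknown` bucket, and the `notify_discord` flag are hypothetical, and each sub-workflow result is assumed to carry an `exit_code` field with the 0/1/2 meaning described under Key Design Decisions.

```python
import json
from collections import Counter

# Assumed mapping from the exit codes in this note to summary labels.
STATUS_BY_EXIT_CODE = {0: "healthy", 1: "remediated", 2: "escalated"}


def split_into_items(stdout: str) -> list[dict]:
    """Turn the JSON array printed by list_servers.sh into one item per server."""
    return [{"server_key": key} for key in json.loads(stdout)]


def aggregate(results: list[dict]) -> dict:
    """Count sub-workflow results by outcome.

    Results with a missing or unexpected exit code land in "unknown"
    (a hypothetical bucket, not something the note defines).
    """
    counts = Counter(
        STATUS_BY_EXIT_CODE.get(r.get("exit_code"), "unknown") for r in results
    )
    labels = ("healthy", "remediated", "escalated", "unknown")
    summary = {label: counts.get(label, 0) for label in labels}
    # Step 6: the Discord summary is only sent when something escalated.
    summary["notify_discord"] = summary["escalated"] > 0
    return summary
```

Keeping the Discord decision inside the aggregate result means the following If node only has to test a single boolean.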

Sub-workflow: "Server Health Check"

  1. Execute Workflow Trigger — receives { server_key } input
  2. SSH to CT 302: health_check.py --server {server_key}
  3. Parse JSON output (status/exit_code/issues/escalations)
  4. If exit_code == 2 → SSH: remediate.sh with escalation data
  5. Return results to parent
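
Steps 3 and 4 of the sub-workflow amount to parsing one JSON document and branching on a single field. A minimal Python sketch follows, assuming the JSON has the four fields named above and that the process exit code is available as a fallback. The function names and the `unparseable` status are hypothetical.

```python
import json


def parse_health_output(stdout: str, process_exit_code: int) -> dict:
    """Parse the JSON printed by health_check.py.

    If the output is not valid JSON, return a placeholder record instead of
    raising, so the parent workflow still receives a result for this server.
    """
    try:
        data = json.loads(stdout)
    except json.JSONDecodeError:
        data = {"status": "unparseable", "issues": [], "escalations": []}
    # Prefer the exit_code reported in the JSON; fall back to the process one.
    data.setdefault("exit_code", process_exit_code)
    return data


def needs_remediation(result: dict) -> bool:
    """Exit code 2 is the only case that triggers remediate.sh."""
    return result.get("exit_code") == 2
```

With this shape, the record returned to the parent in step 5 is the same dictionary whether or not remediation ran.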

Key Design Decisions

  • Server list from config.yaml — single source of truth on CT 302. Adding a server = edit config.yaml + git pull. No n8n changes needed.
  • Exit code semantics: 0=healthy, 1=auto-remediated (the script has already sent its own Discord alert), 2=needs Claude escalation
  • Discord: Tier 1 alerts are sent by health_check.py through notifier.py. The master workflow only sends a summary embed when there are escalations.
  • SSH chain: n8n (10.10.0.210) → n8n_runner_key → CT 302 (10.10.0.148) → homelab_rsa → target servers
  • SSH credential: "SSH Private Key account" (id: QkbHQ8JmYimUoTcM) — host 10.10.0.148, user root, n8n_runner_key (ed25519)
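
For debugging the first hop of the SSH chain outside n8n, the equivalent command line can be built as below. This is a sketch under assumptions: the key path `~/.ssh/n8n_runner_key` and the `python3` interpreter name are hypothetical, and the function only constructs the argument list, it does not run anything.

```python
import shlex

CT302_HOST = "10.10.0.148"
SKILL_DIR = "/root/.claude/skills/server-diagnostics"


def build_health_check_command(
    server_key: str,
    key_path: str = "~/.ssh/n8n_runner_key",  # hypothetical location of the key
) -> list[str]:
    """Build the ssh argv for running health_check.py on CT 302."""
    # shlex.join quotes server_key so it cannot inject extra shell commands
    # when the remote shell parses the command string.
    remote = shlex.join(
        ["python3", f"{SKILL_DIR}/health_check.py", "--server", server_key]
    )
    return [
        "ssh",
        "-i", key_path,
        "-o", "BatchMode=yes",  # fail instead of prompting for a password
        f"root@{CT302_HOST}",
        remote,
    ]
```

The returned list can be passed to `subprocess.run` as-is. Quoting the server key matters here because it originates from config.yaml and ends up in a shell on CT 302.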

Files on CT 302

  • /root/.claude/skills/server-diagnostics/config.yaml — server inventory
  • /root/.claude/skills/server-diagnostics/health_check.py — health checker (Python, exit codes 0/1/2)
  • /root/.claude/skills/server-diagnostics/remediate.sh — Claude CLI headless wrapper for escalation
  • /root/.claude/skills/server-diagnostics/list_servers.sh — extracts server keys as JSON from config.yaml (to be created)
  • /root/.claude/skills/server-diagnostics/client.py — SSH diagnostic toolkit for Claude during escalation
  • /root/.claude/skills/server-diagnostics/notifier.py — Discord webhook notifications
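
Since list_servers.sh does not exist yet, the core of what it has to do can be sketched in Python. The layout of config.yaml is not shown in this note, so the top-level `servers` mapping is an assumption, as is the function name. The function takes an already-parsed config so the sketch stays stdlib-only; loading the file (for example with PyYAML's `yaml.safe_load`) is left out.

```python
import json


def server_keys_json(config: dict) -> str:
    """Return the server keys of a parsed config as a JSON array string.

    Assumes a layout like {"servers": {"arr-stack": {...}, ...}}; adjust the
    lookup if config.yaml is structured differently.
    """
    servers = config.get("servers") or {}
    # Sort for stable output so consecutive runs are easy to compare.
    return json.dumps(sorted(servers))
```

For a config with the keys `arr-stack` and one other server, this prints a two-element JSON array, which is the shape the master workflow's split step expects.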