From f035fd93af2a13f6b331836c941ff90478a40e0c Mon Sep 17 00:00:00 2001
From: Cal Corum <calcorum@users.noreply.github.com>
Date: Thu, 19 Feb 2026 22:31:58 -0600
Subject: [PATCH] store: Self-managing n8n server health monitor with
 sub-workflows

---
 ...ealth-monitor-with-sub-workflows-226839.md | 48 +++++++++++++++++++
 1 file changed, 48 insertions(+)
 create mode 100644 graph/procedures/self-managing-n8n-server-health-monitor-with-sub-workflows-226839.md

diff --git a/graph/procedures/self-managing-n8n-server-health-monitor-with-sub-workflows-226839.md b/graph/procedures/self-managing-n8n-server-health-monitor-with-sub-workflows-226839.md
new file mode 100644
index 00000000000..5d4a861209b
--- /dev/null
+++ b/graph/procedures/self-managing-n8n-server-health-monitor-with-sub-workflows-226839.md
@@ -0,0 +1,48 @@
+---
+id: 2268393f-b90f-4ae8-9995-4b942ed2b2f7
+type: procedure
+title: "Self-managing n8n server health monitor with sub-workflows"
+tags: [n8n, homelab, monitoring, claude-runner, architecture, procedure]
+importance: 0.9
+confidence: 0.8
+created: "2026-02-20T04:31:58.029736+00:00"
+updated: "2026-02-20T04:31:58.029736+00:00"
+---
+
+## Architecture
+
+Master + sub-workflow pattern in n8n for server health monitoring via CT 302 (claude-runner at 10.10.0.148).
+
+### Master Workflow: "Server Health Monitor" (id: p7XmW23SgCs3hEkY)
+
+1. Schedule trigger (every 5 min)
+2. SSH to CT 302 → run `list_servers.sh` to get server keys from config.yaml as JSON array
+3. Code node: split array into items `[{server_key: "arr-stack"}, ...]`
+4. Execute Sub-workflow (mode: "each item") → calls "Server Health Check"
+5. Aggregate results (count healthy/remediated/escalated)
+6. If any escalations → Discord summary embed
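+
+Steps 3 and 5 can be sketched as below. This is a minimal illustration, not the workflow's actual node code: n8n Code nodes are commonly written in JavaScript, and the function names, the `unknown` bucket, and the `notify_discord` flag are assumptions made for the example. It also assumes each sub-workflow result carries the `exit_code` field described under Key Design Decisions.
+
+```python
+import json
+from collections import Counter
+
+# Mapping taken from the exit code semantics in this note (0/1/2).
+STATUS_BY_EXIT_CODE = {0: "healthy", 1: "remediated", 2: "escalated"}
+
+
+def split_server_keys(stdout: str) -> list[dict]:
+    """Step 3: turn the JSON array printed by list_servers.sh into one item per server."""
+    keys = json.loads(stdout)
+    return [{"server_key": key} for key in keys]
+
+
+def aggregate(results: list[dict]) -> dict:
+    """Step 5: count results per bucket; unrecognised exit codes are counted as 'unknown'."""
+    counts = Counter(
+        STATUS_BY_EXIT_CODE.get(r.get("exit_code"), "unknown") for r in results
+    )
+    return {
+        "healthy": counts["healthy"],
+        "remediated": counts["remediated"],
+        "escalated": counts["escalated"],
+        "unknown": counts["unknown"],
+        # Step 6: the summary embed is only needed when something escalated.
+        "notify_discord": counts["escalated"] > 0,
+    }
+```
+
+Counting unrecognised exit codes separately, rather than folding them into `healthy`, keeps a crashed or misbehaving check visible in the summary.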
+
+### Sub-workflow: "Server Health Check"
+
+1. Execute Workflow Trigger — receives `{ server_key }` input
+2. SSH to CT 302: `health_check.py --server {server_key}`
+3. Parse JSON output (status/exit_code/issues/escalations)
+4. If exit_code == 2 → SSH to CT 302: run `remediate.sh` with the escalation data
+5. Return results to parent
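+
+The sub-workflow's command building, parsing, and branching can be sketched as follows. The helper names are hypothetical, and the exact shape of the JSON beyond the four fields named in step 3 is an assumption. Quoting the server key with `shlex.quote` is a suggested precaution, not something the workflow is known to do.
+
+```python
+import json
+import shlex
+
+
+def build_health_check_command(server_key: str) -> str:
+    """Step 2: build the remote command, quoting the key so it stays a single shell argument."""
+    return f"health_check.py --server {shlex.quote(server_key)}"
+
+
+def parse_health_output(stdout: str) -> dict:
+    """Step 3: parse the JSON printed by health_check.py.
+
+    A missing exit_code raises KeyError rather than silently defaulting to healthy.
+    """
+    data = json.loads(stdout)
+    return {
+        "status": data.get("status"),
+        "exit_code": int(data["exit_code"]),
+        "issues": data.get("issues", []),
+        "escalations": data.get("escalations", []),
+    }
+
+
+def needs_remediation(result: dict) -> bool:
+    """Step 4: only exit code 2 is handed to remediate.sh."""
+    return result["exit_code"] == 2
+```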
+
+### Key Design Decisions
+
+- **Server list from config.yaml** — single source of truth on CT 302. Adding a server = edit config.yaml + git pull. No n8n changes needed.
+- **Exit code semantics:** 0=healthy, 1=auto-remediated (script already sent Discord), 2=needs Claude escalation
+- **Discord:** Tier 1 alerts are sent by notifier.py, called from health_check.py. The master workflow sends only a summary embed, and only when there are escalations.
+- **SSH chain:** n8n (10.10.0.210) → n8n_runner_key → CT 302 (10.10.0.148) → homelab_rsa → target servers
+- **SSH credential:** "SSH Private Key account" (id: QkbHQ8JmYimUoTcM) — host 10.10.0.148, user root, n8n_runner_key (ed25519)
+
+### Files on CT 302
+
+- `/root/.claude/skills/server-diagnostics/config.yaml` — server inventory
+- `/root/.claude/skills/server-diagnostics/health_check.py` — health checker (Python, exit codes 0/1/2)
+- `/root/.claude/skills/server-diagnostics/remediate.sh` — Claude CLI headless wrapper for escalation
+- `/root/.claude/skills/server-diagnostics/list_servers.sh` — extracts server keys as JSON from config.yaml (to be created)
+- `/root/.claude/skills/server-diagnostics/client.py` — SSH diagnostic toolkit for Claude during escalation
+- `/root/.claude/skills/server-diagnostics/notifier.py` — Discord webhook notifications