store: Architecture: N8N Server Health Monitor uses Master Loop + Sub-workflow pattern

2026-03-01 09:58:58 -06:00 · 2026-03-01 09:58:58 -06:00 · 3b3487a43f
commit 3b3487a43f
parent c0a62f3669
1 changed files with 38 additions and 0 deletions
--- a/graph/decisions/architecture-n8n-server-health-monitor-uses-master-loop-sub-67898e.md
+++ b/graph/decisions/architecture-n8n-server-health-monitor-uses-master-loop-sub-67898e.md
@@ -0,0 +1,38 @@
+---
+id: 67898e52-470a-470e-b149-43fef0047ae9
+type: decision
+title: "Architecture: N8N Server Health Monitor uses Master Loop + Sub-workflow pattern"
+tags: [n8n, server-diagnostics, workflow-architecture, decision, claude-home]
+importance: 0.65
+confidence: 0.8
+created: "2026-03-01T15:58:58.161120+00:00"
+updated: "2026-03-01T15:58:58.161120+00:00"
+---
+
+# N8N Server Health Monitor Architecture
+
+## Workflow IDs
+- **Master Loop**: `p7XmW23SgCs3hEkY` — runs every 5 minutes
+- **Sub-workflow**: `BhzYmWr6NcIDoioy` — per-server health check
+
+## Flow
+1. Master Loop SSHes to CT 300, fetches server list
+2. Splits list into items, calls sub-workflow per server
+3. Sub-workflow runs `health_check.py --server {key}`, parses exit code:
+   - `0` = healthy
+   - `1` = remediated
+   - `2` = escalation → runs `remediate.sh`
+4. Master Loop aggregates all results
+5. Discord webhook fires **only if escalations found** (`has_escalations=true`)
+
+## Rationale for Split Design
+- Sub-workflow cleanly isolates per-server logic (10/10 successes, 0 errors)
+- Master Loop handles orchestration and notification separately
+- Conditional Discord notification avoids alert spam on healthy runs
+
+## Notes
+- `onError` updates to the sub-workflow's SSH nodes were applied without corruption (safe change)
+- The escalation-only Discord path caused the silent Master Loop failure bug (missing URL)
+
+## Tags
+n8n, server-diagnostics, workflow-architecture