store: Self-managing n8n server health monitor with sub-workflows

Cal Corum 2026-02-19 22:31:58 -06:00
parent 59b1fedd29
commit f035fd93af


@@ -0,0 +1,48 @@
---
id: 2268393f-b90f-4ae8-9995-4b942ed2b2f7
type: procedure
title: "Self-managing n8n server health monitor with sub-workflows"
tags: [n8n, homelab, monitoring, claude-runner, architecture, procedure]
importance: 0.9
confidence: 0.8
created: "2026-02-20T04:31:58.029736+00:00"
updated: "2026-02-20T04:31:58.029736+00:00"
---
## Architecture
Master + sub-workflow pattern in n8n for server health monitoring. The master workflow fans out to one sub-workflow run per server, and every check runs over SSH on CT 302 (claude-runner at 10.10.0.148).
### Master Workflow: "Server Health Monitor" (id: p7XmW23SgCs3hEkY)
1. Schedule trigger (every 5 min)
2. SSH to CT 302 → run `list_servers.sh` to get server keys from config.yaml as JSON array
3. Code node: split array into items `[{server_key: "arr-stack"}, ...]`
4. Execute Sub-workflow (mode: "Run once for each item") → calls "Server Health Check" once per server
5. Aggregate results (count healthy/remediated/escalated)
6. If any escalations → Discord summary embed
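
The split (step 3) and aggregate (step 5) logic can be sketched in plain Python as below. This is a minimal illustration outside n8n, not the workflow's actual Code node: the function names and the `STATUS_BY_EXIT_CODE` labels are hypothetical, and it assumes each sub-workflow result carries an `exit_code` field.

```python
import json
from collections import Counter
from typing import Dict, List

# Hypothetical labels for the three exit codes used by health_check.py.
STATUS_BY_EXIT_CODE = {0: "healthy", 1: "remediated", 2: "escalated"}


def split_server_keys(stdout: str) -> List[Dict[str, str]]:
    """Turn the JSON array printed by list_servers.sh into one item per server."""
    return [{"server_key": key} for key in json.loads(stdout)]


def aggregate(results: List[Dict]) -> Dict:
    """Count outcomes and flag whether a Discord summary is needed."""
    counts = Counter(
        STATUS_BY_EXIT_CODE.get(result["exit_code"], "unknown") for result in results
    )
    return {
        "healthy": counts["healthy"],
        "remediated": counts["remediated"],
        "escalated": counts["escalated"],
        # Step 6: the master only posts to Discord when something escalated.
        "notify_discord": counts["escalated"] > 0,
    }
```

Inside n8n the same shape applies, but the Code node would return items in n8n's own item format rather than bare dictionaries.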
### Sub-workflow: "Server Health Check"
1. Execute Workflow Trigger — receives `{ server_key }` input
2. SSH to CT 302: `health_check.py --server {server_key}`
3. Parse JSON output (status/exit_code/issues/escalations)
4. If exit_code == 2 → SSH to CT 302: run `remediate.sh` with the escalation data
5. Return results to parent
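
Steps 3 and 4 of the sub-workflow amount to a parse followed by a conditional. The sketch below shows one way to express that in Python. The note does not say how `remediate.sh` receives the escalation data, so passing it as a single shell-quoted JSON argument is an assumption, as are the function names.

```python
import json
import shlex
from typing import Dict, Optional


def parse_health_output(stdout: str) -> Dict:
    """Extract the four fields the workflow reads from health_check.py output."""
    report = json.loads(stdout)
    return {
        "status": report.get("status"),
        "exit_code": report.get("exit_code"),
        "issues": report.get("issues", []),
        "escalations": report.get("escalations", []),
    }


def remediation_command(server_key: str, report: Dict) -> Optional[str]:
    """Return the command to run for an escalation, or None if none is needed."""
    if report["exit_code"] != 2:
        return None
    payload = json.dumps(
        {"server_key": server_key, "escalations": report["escalations"]}
    )
    # Assumption: remediate.sh accepts the escalation JSON as one argument.
    return "remediate.sh " + shlex.quote(payload)
```

Quoting the payload with `shlex.quote` keeps the JSON intact when the command string is handed to a remote shell over SSH.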
### Key Design Decisions
- **Server list from config.yaml** — single source of truth on CT 302. Adding a server = edit config.yaml + git pull. No n8n changes needed.
- **Exit code semantics:** 0=healthy, 1=auto-remediated (the script has already sent its Discord alert), 2=needs Claude escalation
- **Discord:** Tier 1 alerts are sent by notifier.py, called from health_check.py. The master workflow only sends a summary when there are escalations.
- **SSH chain:** n8n (10.10.0.210) → n8n_runner_key → CT 302 (10.10.0.148) → homelab_rsa → target servers
- **SSH credential:** "SSH Private Key account" (id: QkbHQ8JmYimUoTcM) — host 10.10.0.148, user root, n8n_runner_key (ed25519)
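
On the producing side, the exit code contract could look like the sketch below. It is an illustration only, not the contents of `health_check.py`: the `HealthExit` enum, the `choose_exit_code` helper, and the per-issue `remediated` flag are all hypothetical.

```python
from enum import IntEnum
from typing import Dict, List


class HealthExit(IntEnum):
    HEALTHY = 0      # nothing wrong
    REMEDIATED = 1   # problems found and fixed by the script itself
    ESCALATE = 2     # at least one problem needs Claude


def choose_exit_code(issues: List[Dict]) -> HealthExit:
    """Pick the exit code from a list of issues.

    Assumption: each issue is a dict with a boolean 'remediated' flag.
    """
    if not issues:
        return HealthExit.HEALTHY
    if all(issue.get("remediated") for issue in issues):
        return HealthExit.REMEDIATED
    return HealthExit.ESCALATE
```

A script following this contract would end with something like `sys.exit(choose_exit_code(issues))`, which lets n8n branch on the numeric code alone.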
### Files on CT 302
- `/root/.claude/skills/server-diagnostics/config.yaml` — server inventory
- `/root/.claude/skills/server-diagnostics/health_check.py` — health checker (Python, exit codes 0/1/2)
- `/root/.claude/skills/server-diagnostics/remediate.sh` — Claude CLI headless wrapper for escalation
- `/root/.claude/skills/server-diagnostics/list_servers.sh` — extracts server keys as JSON from config.yaml (to be created)
- `/root/.claude/skills/server-diagnostics/client.py` — SSH diagnostic toolkit for Claude during escalation
- `/root/.claude/skills/server-diagnostics/notifier.py` — Discord webhook notifications
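
Since `list_servers.sh` is still to be created, here is a rough Python sketch of the extraction it needs to perform. The layout of config.yaml is not shown in this note, so a top-level `servers` mapping keyed by server key is an assumption, and the function names are hypothetical. Loading the file relies on PyYAML, a third-party package.

```python
import json
from typing import Optional

CONFIG_PATH = "/root/.claude/skills/server-diagnostics/config.yaml"


def server_keys_json(config: Optional[dict]) -> str:
    """Return the server keys as a JSON array string, in file order."""
    # Assumption: servers live under a top-level 'servers' mapping.
    servers = (config or {}).get("servers") or {}
    return json.dumps(list(servers))


def main(path: str = CONFIG_PATH) -> None:
    import yaml  # PyYAML; imported here so the helper above stays stdlib-only

    with open(path) as handle:
        print(server_keys_json(yaml.safe_load(handle)))


if __name__ == "__main__":
    main()
```

Printing a single JSON array on stdout keeps the SSH node's output trivial to parse in the master workflow's Code node.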