# N8N Workflow Design - Server Troubleshooting

This document describes the N8N workflow that orchestrates the automated troubleshooting pipeline.

## Workflow Overview

```
┌─────────────┐     ┌──────────────┐     ┌───────────┐     ┌──────────────┐
│  Schedule   │────▶│ Health Check │────▶│ Has Error?│────▶│   Gather     │
│  Trigger    │     │ (docker ps)  │     │           │     │   Context    │
└─────────────┘     └──────────────┘     └─────┬─────┘     └──────┬───────┘
                                               │                  │
                                            No │                  │
                                               ▼                  ▼
                                             [End]         ┌──────────────┐
                                                           │    Invoke    │
                                                           │ Claude Code  │
                                                           └──────┬───────┘
                                                                  │
                                            ┌─────────────────────┼─────────────────────┐
                                            │                     │                     │
                                            ▼                     ▼                     ▼
                                     ┌────────────┐       ┌──────────────┐      ┌──────────────┐
                                     │ Parse JSON │       │ Save Report  │      │ Send Discord │
                                     │ Response   │       │ to NAS       │      │ Notification │
                                     └────────────┘       └──────────────┘      └──────────────┘
```

## Workflow JSON Export

This can be imported directly into N8N.

```json
{
  "name": "Server Troubleshooting Pipeline",
  "nodes": [
    {
      "parameters": {
        "rule": {
          "interval": [{ "field": "minutes", "minutesInterval": 1 }]
        }
      },
      "id": "schedule-trigger",
      "name": "Every Minute",
      "type": "n8n-nodes-base.scheduleTrigger",
      "typeVersion": 1.2,
      "position": [0, 0]
    },
    {
      "parameters": {
        "command": "ssh -i /root/.ssh/claude_lxc_key root@claude-lxc \"ssh -i ~/.ssh/claude_diagnostics_key root@10.10.0.11 'docker ps --format json'\""
      },
      "id": "docker-health-check",
      "name": "Check Docker Status",
      "type": "n8n-nodes-base.executeCommand",
      "typeVersion": 1,
      "position": [220, 0]
    },
    {
      "parameters": {
        "jsCode": "// Parse docker ps output and check for issues\nconst stdout = $input.first().json.stdout || '';\nconst stderr = $input.first().json.stderr || '';\n\n// Handle empty or error cases\nif (stderr && !stdout) {\n return [{\n json: {\n hasError: true,\n errorType: 'docker_unreachable',\n errorMessage: stderr,\n containers: []\n }\n }];\n}\n\n// Parse container statuses\nconst lines = stdout.trim().split('\\n').filter(l => l);\nconst containers = lines.map(line => {\n try {\n return JSON.parse(line);\n } catch {\n return null;\n }\n}).filter(c => c);\n\n// Check
for unhealthy or stopped containers\nconst issues = containers.filter(c => {\n const status = (c.Status || c.State || '').toLowerCase();\n return !status.includes('up') && !status.includes('running');\n});\n\nreturn [{\n json: {\n hasError: issues.length > 0,\n errorType: issues.length > 0 ? 'container_stopped' : null,\n errorMessage: issues.length > 0\n ? `Containers not running: ${issues.map(i => i.Names || i.Name).join(', ')}`\n : null,\n affectedContainers: issues.map(i => i.Names || i.Name),\n allContainers: containers,\n timestamp: new Date().toISOString()\n }\n}];"
      },
      "id": "parse-docker-status",
      "name": "Parse Container Status",
      "type": "n8n-nodes-base.code",
      "typeVersion": 2,
      "position": [440, 0]
    },
    {
      "parameters": {
        "conditions": {
          "boolean": [{ "value1": "={{ $json.hasError }}", "value2": true }]
        }
      },
      "id": "check-has-error",
      "name": "Has Error?",
      "type": "n8n-nodes-base.if",
      "typeVersion": 2,
      "position": [660, 0]
    },
    {
      "parameters": {
        "command": "=ssh -i /root/.ssh/claude_lxc_key root@claude-lxc \"ssh -i ~/.ssh/claude_diagnostics_key root@10.10.0.11 'docker logs --tail 100 {{ $json.affectedContainers[0] }} 2>&1 | tail -50'\""
      },
      "id": "gather-logs",
      "name": "Gather Recent Logs",
      "type": "n8n-nodes-base.executeCommand",
      "typeVersion": 1,
      "position": [880, -100]
    },
    {
      "parameters": {
        "jsCode": "// Prepare context for Claude Code\nconst prevData = $('Parse Container Status').first().json;\nconst logs = $input.first().json.stdout || 'No logs available';\n\nconst context = {\n server: 'proxmox-host',\n errorType: prevData.errorType,\n errorMessage: prevData.errorMessage,\n affectedContainers: prevData.affectedContainers,\n timestamp: prevData.timestamp,\n recentLogs: logs.substring(0, 2000), // Limit log size\n allContainerStatus: prevData.allContainers\n};\n\nreturn [{ json: context }];"
      },
      "id": "prepare-context",
      "name": "Prepare Claude Context",
      "type": "n8n-nodes-base.code",
      "typeVersion": 2,
      "position": [1100, -100]
    },
    {
      "parameters": {
        "command": "=ssh -i /root/.ssh/claude_lxc_key root@claude-lxc 'claude
-p \"$(cat <<'\\''PROMPT'\\'''\nYou are troubleshooting a server issue. Use the server-diagnostics skill.\n\nServer: {{ $json.server }}\nError Type: {{ $json.errorType }}\nTimestamp: {{ $json.timestamp }}\nError Message: {{ $json.errorMessage }}\nAffected Containers: {{ $json.affectedContainers.join(\", \") }}\n\nRecent Logs:\n{{ $json.recentLogs }}\n\nInstructions:\n1. Use the Python diagnostic client to investigate\n2. Check MemoryGraph for similar past issues\n3. Analyze the root cause\n4. If appropriate, execute low-risk remediation (docker restart)\n5. Store learnings in MemoryGraph\nPROMPT\n)\" --output-format json --json-schema '\\''{ \"type\": \"object\", \"properties\": { \"root_cause\": { \"type\": \"string\" }, \"severity\": { \"type\": \"string\", \"enum\": [\"low\", \"medium\", \"high\", \"critical\"] }, \"affected_services\": { \"type\": \"array\", \"items\": { \"type\": \"string\" } }, \"diagnosis_steps\": { \"type\": \"array\", \"items\": { \"type\": \"string\" } }, \"recommended_actions\": { \"type\": \"array\", \"items\": { \"type\": \"object\", \"properties\": { \"action\": { \"type\": \"string\" }, \"command\": { \"type\": \"string\" }, \"risk_level\": { \"type\": \"string\" }, \"executed\": { \"type\": \"boolean\" } } } }, \"remediation_performed\": { \"type\": \"string\" }, \"additional_context\": { \"type\": \"string\" } }, \"required\": [\"root_cause\", \"severity\", \"affected_services\"] }'\\'' --allowedTools \"Read,Grep,Glob,Bash(python3 ~/.claude/skills/server-diagnostics/client.py *)\" --max-turns 10'",
        "timeout": 180000
      },
      "id": "invoke-claude",
      "name": "Invoke Claude Code",
      "type": "n8n-nodes-base.executeCommand",
      "typeVersion": 1,
      "position": [1320, -100]
    },
    {
      "parameters": {
        "jsCode": "// Parse Claude Code JSON output\nconst stdout = $input.first().json.stdout || '';\nconst stderr = $input.first().json.stderr || '';\nconst exitCode = $input.first().json.exitCode || 0;\n\n// Try to parse JSON response\nlet claudeResponse = null;\nlet
parseError = null;\n\ntry {\n // Claude outputs JSON with result field\n const parsed = JSON.parse(stdout);\n claudeResponse = parsed.result ? JSON.parse(parsed.result) : parsed;\n} catch (e) {\n // Try to find JSON in the output\n const jsonMatch = stdout.match(/\\{[\\s\\S]*\\}/);\n if (jsonMatch) {\n try {\n claudeResponse = JSON.parse(jsonMatch[0]);\n } catch {\n parseError = e.message;\n }\n } else {\n parseError = e.message;\n }\n}\n\nconst context = $('Prepare Claude Context').first().json;\n\nreturn [{\n json: {\n success: claudeResponse !== null,\n parseError: parseError,\n rawOutput: stdout.substring(0, 5000),\n stderr: stderr,\n exitCode: exitCode,\n originalContext: context,\n troubleshooting: claudeResponse || {\n root_cause: 'Failed to parse Claude response',\n severity: 'high',\n affected_services: context.affectedContainers,\n recommended_actions: [{ action: 'Manual investigation required', command: '', risk_level: 'none', executed: false }],\n additional_context: stdout.substring(0, 1000)\n },\n timestamp: new Date().toISOString()\n }\n}];"
      },
      "id": "parse-claude-response",
      "name": "Parse Claude Response",
      "type": "n8n-nodes-base.code",
      "typeVersion": 2,
      "position": [1540, -100]
    },
    {
      "parameters": {
        "command": "=mkdir -p /mnt/nas/claude-reports/$(date +%Y-%m) && echo '{{ JSON.stringify($json) }}' > /mnt/nas/claude-reports/$(date +%Y-%m)/{{ $json.timestamp.replace(/:/g, '-') }}.json"
      },
      "id": "save-report",
      "name": "Save Report to NAS",
      "type": "n8n-nodes-base.executeCommand",
      "typeVersion": 1,
      "position": [1760, -200]
    },
    {
      "parameters": {
        "method": "POST",
        "url": "discord-webhook-url-here",
        "options": {}
      },
      "id": "discord-webhook",
      "name": "Discord Notification",
      "type": "n8n-nodes-base.httpRequest",
      "typeVersion": 4.2,
      "position": [1760, 0]
    }
  ],
  "connections": {
    "Every Minute": {
      "main": [[{ "node": "Check Docker Status", "type": "main", "index": 0 }]]
    },
    "Check Docker Status": {
      "main": [[{ "node": "Parse Container Status", "type": "main",
"index": 0 }]]
    },
    "Parse Container Status": {
      "main": [[{ "node": "Has Error?", "type": "main", "index": 0 }]]
    },
    "Has Error?": {
      "main": [
        [{ "node": "Gather Recent Logs", "type": "main", "index": 0 }],
        []
      ]
    },
    "Gather Recent Logs": {
      "main": [[{ "node": "Prepare Claude Context", "type": "main", "index": 0 }]]
    },
    "Prepare Claude Context": {
      "main": [[{ "node": "Invoke Claude Code", "type": "main", "index": 0 }]]
    },
    "Invoke Claude Code": {
      "main": [[{ "node": "Parse Claude Response", "type": "main", "index": 0 }]]
    },
    "Parse Claude Response": {
      "main": [
        [
          { "node": "Save Report to NAS", "type": "main", "index": 0 },
          { "node": "Discord Notification", "type": "main", "index": 0 }
        ]
      ]
    }
  }
}
```

## Node Details

### 1. Schedule Trigger

- **Interval:** Every 1 minute
- **Purpose:** Regular health checks

### 2. Check Docker Status

```bash
ssh -i /root/.ssh/claude_lxc_key root@claude-lxc \
  "ssh -i ~/.ssh/claude_diagnostics_key root@10.10.0.11 'docker ps --format json'"
```

This double-SSH pattern:

1. N8N → Claude LXC (where Claude Code and SSH keys live)
2. Claude LXC → Proxmox Host (where Docker runs)

### 3. Parse Container Status (JavaScript)

Parses the `docker ps` JSON output and checks for containers that are not "Up" or "running".

**Input:** Raw stdout from `docker ps`

**Output:**

```json
{
  "hasError": true,
  "errorType": "container_stopped",
  "errorMessage": "Containers not running: tdarr",
  "affectedContainers": ["tdarr"],
  "allContainers": [...],
  "timestamp": "2025-12-19T14:30:00.000Z"
}
```

### 4. Has Error? (IF Node)

Branches the workflow:

- **True path:** Continue to troubleshooting
- **False path:** End (no action needed)

### 5. Gather Recent Logs

Fetches the last 100 log lines from the first affected container, trimmed to the final 50 lines before they are passed on.

### 6. Prepare Claude Context

Combines:

- Error information from the health check
- Recent logs
- Timestamp
- Server identifier

### 7. Invoke Claude Code
The core troubleshooting invocation:

```bash
claude -p "" \
  --output-format json \
  --json-schema '' \
  --allowedTools "Read,Grep,Glob,Bash(python3 ~/.claude/skills/server-diagnostics/client.py *)" \
  --max-turns 10
```

**Timeout:** 180 seconds (3 minutes)

### 8. Parse Claude Response

Extracts the structured troubleshooting result from Claude's JSON output. Handles:

- Clean JSON parse
- JSON embedded in larger output
- Parse failures (falls back to raw output)

### 9. Save Report to NAS

Saves the full troubleshooting report to:

```
/mnt/nas/claude-reports/YYYY-MM/TIMESTAMP.json
```

### 10. Discord Notification

Sends a formatted message to a Discord channel.

## Discord Webhook Configuration

### Message Format

```javascript
// Discord webhook body (in HTTP Request node)
{
  "content": null,
  "embeds": [
    {
      "title": "{{ $json.troubleshooting.severity === 'critical' ? '🔴' : $json.troubleshooting.severity === 'high' ? '🟠' : $json.troubleshooting.severity === 'medium' ? '🟡' : '🟢' }} Server Alert",
      "description": "Automated troubleshooting completed",
      "color": {{ $json.troubleshooting.severity === 'critical' ? 15158332 : $json.troubleshooting.severity === 'high' ? 15105570 : 16776960 }},
      "fields": [
        { "name": "Root Cause", "value": "{{ $json.troubleshooting.root_cause }}", "inline": false },
        { "name": "Affected Services", "value": "{{ $json.troubleshooting.affected_services.join(', ') }}", "inline": true },
        { "name": "Severity", "value": "{{ $json.troubleshooting.severity.toUpperCase() }}", "inline": true },
        { "name": "Auto-Remediation", "value": "{{ $json.troubleshooting.remediation_performed || 'None performed' }}", "inline": false },
        { "name": "Recommended Actions", "value": "{{ $json.troubleshooting.recommended_actions.map(a => '• ' + a.action + (a.executed ? ' ✅' : '')).join('\\n') }}", "inline": false }
      ],
      "timestamp": "{{ $json.timestamp }}"
    }
  ]
}
```

### Webhook Setup

1. In Discord, go to Server Settings → Integrations → Webhooks
2. Create a new webhook in the `#server-alerts` channel
3. Copy the webhook URL
4. Add it to the N8N HTTP Request node

## Error Handling

### Claude Code Timeout

Add an error handling branch after "Invoke Claude Code":

```javascript
// In error handler
if ($input.first().json.exitCode !== 0) {
  return [{
    json: {
      success: false,
      error: 'Claude Code execution failed',
      stderr: $input.first().json.stderr,
      fallback_action: 'Manual investigation required'
    }
  }];
}
```

### Retry Logic

For transient failures, enable retry in the node's settings (these map to the node-level `retryOnFail` options in the workflow JSON):

```json
{
  "retryOnFail": true,
  "maxTries": 3,
  "waitBetweenTries": 5000
}
```

### Alert Deduplication (Phase 2)

Add cooldown logic to prevent alert fatigue. N8N exposes persistent state through `$getWorkflowStaticData()`, which returns an object that survives between executions:

```javascript
// Before triggering Claude
const staticData = $getWorkflowStaticData('global');
const lastAlertKey = `last_alert_${container}`;
const lastAlert = staticData[lastAlertKey];
const cooldownMs = 5 * 60 * 1000; // 5 minutes

if (lastAlert && (Date.now() - lastAlert) < cooldownMs) {
  return []; // Skip, within cooldown
}

// After sending alert
staticData[lastAlertKey] = Date.now();
```

## Credentials Required

### N8N Environment

1. **SSH Key to Claude LXC:** `/root/.ssh/claude_lxc_key`
2. **Discord Webhook URL:** Stored in N8N credentials

### Claude LXC Environment

1. **Max Subscription Auth:** Authenticated via device code flow (credentials in `~/.claude/`)
2. **SSH Key to Proxmox:** `~/.ssh/claude_diagnostics_key`

### Handling Auth Expiry

Add error detection for authentication failures:

```javascript
// In Parse Claude Response node, detect auth errors
if (stderr.includes('authenticate') || stderr.includes('unauthorized')) {
  return [{
    json: {
      success: false,
      error: 'Claude Code authentication expired',
      action_required: 'SSH to Claude LXC and run: claude (to re-authenticate)',
      severity: 'critical'
    }
  }];
}
```

## Testing the Workflow

### Manual Trigger Test

1. In N8N, click "Execute Workflow" on the workflow
2. Check each node's output
3. Verify the Discord message was received
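After a test run, the saved NAS report can be spot-checked against the fields the JSON schema marks as required. A minimal sketch — the `validateReport` helper and the sample values are illustrative, not part of the workflow or any N8N API:

```javascript
// Verify a saved report contains the schema's required fields
// (root_cause, severity, affected_services). Illustrative helper.
function validateReport(report) {
  const t = report.troubleshooting || {};
  const missing = ['root_cause', 'severity', 'affected_services']
    .filter((key) => !(key in t));
  return { ok: missing.length === 0, missing };
}

// Example: a minimal well-formed report (sample values)
const sample = {
  troubleshooting: {
    root_cause: 'Container exited after config change',
    severity: 'medium',
    affected_services: ['tdarr'],
  },
};
console.log(validateReport(sample)); // { ok: true, missing: [] }
```

Running this against a `JSON.parse`d file from `/mnt/nas/claude-reports/` catches schema drift before it reaches the Discord formatting step.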
### Simulated Failure Test

```bash
# Stop a container on the Proxmox host
docker stop tdarr

# Wait for the next health check (1 minute)

# Verify:
# 1. Claude Code invoked
# 2. Discord notification sent
# 3. Report saved to NAS
# 4. Container restarted (if remediation enabled)
```

### End-to-End Verification

```bash
# Check N8N execution history
# - Go to N8N UI → Executions
# - Review successful and failed runs

# Check NAS for reports
ls -la /mnt/nas/claude-reports/$(date +%Y-%m)/

# Check Discord channel for notifications
```

## Performance Considerations

1. **Schedule Interval:** 1 minute may be aggressive. Consider 5 minutes for production.
2. **Claude Code Timeout:** 180 seconds is generous. Most diagnoses should complete in 30-60 seconds.
3. **Log Size:** Limit logs passed to Claude to avoid token limits (2,000 characters in the template).
4. **Parallel Execution:** A diagnosis can outlive the 1-minute schedule interval, so guard against overlapping runs — for example with the alert-deduplication cooldown, or by lengthening the interval — otherwise two executions may troubleshoot the same failure at once.

## Security Notes

1. **SSH Key Permissions:** Ensure private keys have `600` permissions
2. **Webhook URL:** Don't expose it in logs or error messages
3. **NAS Mount:** Ensure proper permissions on the report directory
4. **API Key:** Never log or expose `ANTHROPIC_API_KEY`

## Monitoring the Workflow

### N8N Execution Metrics

- Successful executions per hour
- Failed executions per hour
- Average execution time
- Claude Code invocation count

### Alert on Workflow Failure

Create a second workflow that monitors the primary workflow's health:

```
Trigger: Every 15 minutes
Check:   Last successful execution within 20 minutes
Alert:   If no recent success, send Discord alert about monitoring failure
```
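The staleness check in that watchdog can be sketched as a small Code-node function. This is a sketch under assumptions: `lastSuccessIso` would come from a node that queries N8N's execution history, the 20-minute threshold mirrors the description above, and `isStale` is an illustrative name:

```javascript
// Decide whether the primary workflow looks unhealthy: no successful
// execution recorded within the threshold. Illustrative helper; in
// practice the timestamp comes from N8N's execution history.
function isStale(lastSuccessIso, nowMs, thresholdMinutes = 20) {
  if (!lastSuccessIso) return true; // never succeeded -> alert
  const ageMs = nowMs - Date.parse(lastSuccessIso);
  return ageMs > thresholdMinutes * 60 * 1000;
}

// Example: last success 25 minutes before "now" -> stale, alert fires
const now = Date.parse('2025-12-19T15:00:00.000Z');
console.log(isStale('2025-12-19T14:35:00.000Z', now)); // true
console.log(isStale('2025-12-19T14:50:00.000Z', now)); // false
```

In the monitoring workflow this result would feed the same Discord webhook used above — ideally with its own cooldown, so a stuck primary workflow does not page every 15 minutes.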