Original planning folder (no git repo) for the server diagnostics system that runs on CT 300. Live deployment is on claude-runner; this preserves the Agent SDK reference, PRD with Phase 2/3 roadmap, and N8N workflow designs.
N8N Workflow Design - Server Troubleshooting
This document describes the N8N workflow that orchestrates the automated troubleshooting pipeline.
Workflow Overview
┌─────────────┐     ┌──────────────┐     ┌───────────┐     ┌──────────────┐
│  Schedule   │────▶│ Health Check │────▶│ Has Error?│────▶│   Gather     │
│  Trigger    │     │ (docker ps)  │     │           │     │   Context    │
└─────────────┘     └──────────────┘     └─────┬─────┘     └──────┬───────┘
                                               │                  │
                                               │ No               │
                                               ▼                  ▼
                                             [End]         ┌──────────────┐
                                                           │   Invoke     │
                                                           │ Claude Code  │
                                                           └──────┬───────┘
                                                                  │
                                            ┌─────────────────────┼─────────────────────┐
                                            │                     │                     │
                                            ▼                     ▼                     ▼
                                      ┌────────────┐       ┌──────────────┐       ┌──────────────┐
                                      │ Parse JSON │       │ Save Report  │       │ Send Discord │
                                      │  Response  │       │  to NAS      │       │ Notification │
                                      └────────────┘       └──────────────┘       └──────────────┘
Workflow JSON Export
This can be imported directly into N8N.
{
"name": "Server Troubleshooting Pipeline",
"nodes": [
{
"parameters": {
"rule": {
"interval": [
{
"field": "minutes",
"minutesInterval": 1
}
]
}
},
"id": "schedule-trigger",
"name": "Every Minute",
"type": "n8n-nodes-base.scheduleTrigger",
"typeVersion": 1.2,
"position": [0, 0]
},
{
"parameters": {
"command": "ssh -i /root/.ssh/claude_lxc_key root@claude-lxc \"ssh -i ~/.ssh/claude_diagnostics_key root@10.10.0.11 'docker ps --format json'\""
},
"id": "docker-health-check",
"name": "Check Docker Status",
"type": "n8n-nodes-base.executeCommand",
"typeVersion": 1,
"position": [220, 0]
},
{
"parameters": {
"jsCode": "// Parse docker ps output and check for issues\nconst stdout = $input.first().json.stdout || '';\nconst stderr = $input.first().json.stderr || '';\n\n// Handle empty or error cases\nif (stderr && !stdout) {\n return [{\n json: {\n hasError: true,\n errorType: 'docker_unreachable',\n errorMessage: stderr,\n containers: []\n }\n }];\n}\n\n// Parse container statuses\nconst lines = stdout.trim().split('\\n').filter(l => l);\nconst containers = lines.map(line => {\n try {\n return JSON.parse(line);\n } catch {\n return null;\n }\n}).filter(c => c);\n\n// Check for unhealthy or stopped containers\nconst issues = containers.filter(c => {\n const status = (c.Status || c.State || '').toLowerCase();\n return !status.includes('up') && !status.includes('running');\n});\n\nreturn [{\n json: {\n hasError: issues.length > 0,\n errorType: issues.length > 0 ? 'container_stopped' : null,\n errorMessage: issues.length > 0 \n ? `Containers not running: ${issues.map(i => i.Names || i.Name).join(', ')}`\n : null,\n affectedContainers: issues.map(i => i.Names || i.Name),\n allContainers: containers,\n timestamp: new Date().toISOString()\n }\n}];"
},
"id": "parse-docker-status",
"name": "Parse Container Status",
"type": "n8n-nodes-base.code",
"typeVersion": 2,
"position": [440, 0]
},
{
"parameters": {
"conditions": {
"boolean": [
{
"value1": "={{ $json.hasError }}",
"value2": true
}
]
}
},
"id": "check-has-error",
"name": "Has Error?",
"type": "n8n-nodes-base.if",
"typeVersion": 2,
"position": [660, 0]
},
{
"parameters": {
"command": "ssh -i /root/.ssh/claude_lxc_key root@claude-lxc \"docker logs --tail 100 {{ $json.affectedContainers[0] }} 2>&1 | tail -50\""
},
"id": "gather-logs",
"name": "Gather Recent Logs",
"type": "n8n-nodes-base.executeCommand",
"typeVersion": 1,
"position": [880, -100]
},
{
"parameters": {
"jsCode": "// Prepare context for Claude Code\nconst prevData = $('Parse Container Status').first().json;\nconst logs = $input.first().json.stdout || 'No logs available';\n\nconst context = {\n server: 'proxmox-host',\n errorType: prevData.errorType,\n errorMessage: prevData.errorMessage,\n affectedContainers: prevData.affectedContainers,\n timestamp: prevData.timestamp,\n recentLogs: logs.substring(0, 2000), // Limit log size\n allContainerStatus: prevData.allContainers\n};\n\nreturn [{ json: context }];"
},
"id": "prepare-context",
"name": "Prepare Claude Context",
"type": "n8n-nodes-base.code",
"typeVersion": 2,
"position": [1100, -100]
},
{
"parameters": {
"command": "=ssh -i /root/.ssh/claude_lxc_key root@claude-lxc 'claude -p \"$(cat <<'\\''PROMPT'\\'''\nYou are troubleshooting a server issue. Use the server-diagnostics skill.\n\nServer: {{ $json.server }}\nError Type: {{ $json.errorType }}\nTimestamp: {{ $json.timestamp }}\nError Message: {{ $json.errorMessage }}\nAffected Containers: {{ $json.affectedContainers.join(\", \") }}\n\nRecent Logs:\n{{ $json.recentLogs }}\n\nInstructions:\n1. Use the Python diagnostic client to investigate\n2. Check MemoryGraph for similar past issues\n3. Analyze the root cause\n4. If appropriate, execute low-risk remediation (docker restart)\n5. Store learnings in MemoryGraph\nPROMPT\n)\" --output-format json --json-schema '\\''{ \"type\": \"object\", \"properties\": { \"root_cause\": { \"type\": \"string\" }, \"severity\": { \"type\": \"string\", \"enum\": [\"low\", \"medium\", \"high\", \"critical\"] }, \"affected_services\": { \"type\": \"array\", \"items\": { \"type\": \"string\" } }, \"diagnosis_steps\": { \"type\": \"array\", \"items\": { \"type\": \"string\" } }, \"recommended_actions\": { \"type\": \"array\", \"items\": { \"type\": \"object\", \"properties\": { \"action\": { \"type\": \"string\" }, \"command\": { \"type\": \"string\" }, \"risk_level\": { \"type\": \"string\" }, \"executed\": { \"type\": \"boolean\" } } } }, \"remediation_performed\": { \"type\": \"string\" }, \"additional_context\": { \"type\": \"string\" } }, \"required\": [\"root_cause\", \"severity\", \"affected_services\"] }'\\'' --allowedTools \"Read,Grep,Glob,Bash(python3 ~/.claude/skills/server-diagnostics/client.py *)\" --max-turns 10'",
"timeout": 180000
},
"id": "invoke-claude",
"name": "Invoke Claude Code",
"type": "n8n-nodes-base.executeCommand",
"typeVersion": 1,
"position": [1320, -100]
},
{
"parameters": {
"jsCode": "// Parse Claude Code JSON output\nconst stdout = $input.first().json.stdout || '';\nconst stderr = $input.first().json.stderr || '';\nconst exitCode = $input.first().json.exitCode || 0;\n\n// Try to parse JSON response\nlet claudeResponse = null;\nlet parseError = null;\n\ntry {\n // Claude outputs JSON with result field\n const parsed = JSON.parse(stdout);\n claudeResponse = parsed.result ? JSON.parse(parsed.result) : parsed;\n} catch (e) {\n // Try to find JSON in the output\n const jsonMatch = stdout.match(/\\{[\\s\\S]*\\}/);\n if (jsonMatch) {\n try {\n claudeResponse = JSON.parse(jsonMatch[0]);\n } catch {\n parseError = e.message;\n }\n } else {\n parseError = e.message;\n }\n}\n\nconst context = $('Prepare Claude Context').first().json;\n\nreturn [{\n json: {\n success: claudeResponse !== null,\n parseError: parseError,\n rawOutput: stdout.substring(0, 5000),\n stderr: stderr,\n exitCode: exitCode,\n originalContext: context,\n troubleshooting: claudeResponse || {\n root_cause: 'Failed to parse Claude response',\n severity: 'high',\n affected_services: context.affectedContainers,\n recommended_actions: [{ action: 'Manual investigation required', command: '', risk_level: 'none', executed: false }],\n additional_context: stdout.substring(0, 1000)\n },\n timestamp: new Date().toISOString()\n }\n}];"
},
"id": "parse-claude-response",
"name": "Parse Claude Response",
"type": "n8n-nodes-base.code",
"typeVersion": 2,
"position": [1540, -100]
},
{
"parameters": {
"command": "=mkdir -p /mnt/nas/claude-reports/$(date +%Y-%m) && echo '{{ JSON.stringify($json) }}' > /mnt/nas/claude-reports/$(date +%Y-%m)/{{ $json.timestamp.replace(/:/g, '-') }}.json"
},
"id": "save-report",
"name": "Save Report to NAS",
"type": "n8n-nodes-base.executeCommand",
"typeVersion": 1,
"position": [1760, -200]
},
{
"parameters": {
"httpMethod": "POST",
"path": "discord-webhook-url-here",
"options": {}
},
"id": "discord-webhook",
"name": "Discord Notification",
"type": "n8n-nodes-base.httpRequest",
"typeVersion": 4.2,
"position": [1760, 0]
}
],
"connections": {
"Every Minute": {
"main": [
[{ "node": "Check Docker Status", "type": "main", "index": 0 }]
]
},
"Check Docker Status": {
"main": [
[{ "node": "Parse Container Status", "type": "main", "index": 0 }]
]
},
"Parse Container Status": {
"main": [
[{ "node": "Has Error?", "type": "main", "index": 0 }]
]
},
"Has Error?": {
"main": [
[{ "node": "Gather Recent Logs", "type": "main", "index": 0 }],
[]
]
},
"Gather Recent Logs": {
"main": [
[{ "node": "Prepare Claude Context", "type": "main", "index": 0 }]
]
},
"Prepare Claude Context": {
"main": [
[{ "node": "Invoke Claude Code", "type": "main", "index": 0 }]
]
},
"Invoke Claude Code": {
"main": [
[{ "node": "Parse Claude Response", "type": "main", "index": 0 }]
]
},
"Parse Claude Response": {
"main": [
[
{ "node": "Save Report to NAS", "type": "main", "index": 0 },
{ "node": "Discord Notification", "type": "main", "index": 0 }
]
]
}
}
}
Node Details
1. Schedule Trigger
- Interval: Every 1 minute
- Purpose: Regular health checks
2. Check Docker Status
ssh -i /root/.ssh/claude_lxc_key root@claude-lxc \
"ssh -i ~/.ssh/claude_diagnostics_key root@10.10.0.11 'docker ps --format json'"
This double-SSH pattern:
- N8N → Claude LXC (where Claude Code and SSH keys live)
- Claude LXC → Proxmox Host (where Docker runs)
3. Parse Container Status (JavaScript)
Parses the docker ps JSON output and checks for containers that are not "Up" or "running".
Input: raw stdout from docker ps
Output:
{
"hasError": true,
"errorType": "container_stopped",
"errorMessage": "Containers not running: tdarr",
"affectedContainers": ["tdarr"],
"allContainers": [...],
"timestamp": "2025-12-19T14:30:00.000Z"
}
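The core of this node is a predicate over each container's Status/State string. A standalone sketch of that check, with made-up sample lines mimicking `docker ps --format json` output (one JSON object per line):

```javascript
// Sample `docker ps --format json` output: one JSON object per line.
const stdout = [
  '{"Names":"jellyfin","State":"running","Status":"Up 3 days"}',
  '{"Names":"tdarr","State":"exited","Status":"Exited (1) 2 minutes ago"}',
].join('\n');

// Parse each line defensively; skip anything that is not valid JSON.
const containers = stdout
  .trim()
  .split('\n')
  .filter(Boolean)
  .map((line) => {
    try { return JSON.parse(line); } catch { return null; }
  })
  .filter(Boolean);

// A container counts as an issue unless its status mentions "up" or "running".
const issues = containers.filter((c) => {
  const status = (c.Status || c.State || '').toLowerCase();
  return !status.includes('up') && !status.includes('running');
});

console.log(issues.map((i) => i.Names || i.Name).join(', ')); // → tdarr
```

The same try/catch-per-line pattern is what keeps the node from failing when Docker emits a warning line amid the JSON.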
4. Has Error? (IF Node)
Branches workflow:
- True path: Continue to troubleshooting
- False path: End (no action needed)
5. Gather Recent Logs
Fetches last 100 lines of logs from the first affected container.
6. Prepare Claude Context
Combines:
- Error information from health check
- Recent logs
- Timestamp
- Server identifier
7. Invoke Claude Code
The core troubleshooting invocation:
claude -p "<prompt>" \
--output-format json \
--json-schema '<schema>' \
--allowedTools "Read,Grep,Glob,Bash(python3 ~/.claude/skills/server-diagnostics/client.py *)" \
--max-turns 10
Timeout: 180 seconds (3 minutes)
8. Parse Claude Response
Extracts the structured troubleshooting result from Claude's JSON output.
Handles:
- Clean JSON parse
- JSON embedded in larger output
- Parse failures (falls back to raw output)
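The embedded-JSON fallback relies on a greedy brace match. A minimal sketch (the sample output string is invented for illustration):

```javascript
// Greedy match from the first '{' to the last '}' pulls the JSON object out of
// surrounding prose; this assumes the object is the only braced region in the text.
const raw = 'Some preamble from the model\n{"root_cause":"OOM kill","severity":"high"}\ndone';
const jsonMatch = raw.match(/\{[\s\S]*\}/);
const parsed = jsonMatch ? JSON.parse(jsonMatch[0]) : null;
console.log(parsed.severity); // → high
```

If the output contains multiple separate braced regions, the greedy match spans all of them and the parse fails, which is exactly when the node falls back to the raw output.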
9. Save Report to NAS
Saves full troubleshooting report to:
/mnt/nas/claude-reports/YYYY-MM/TIMESTAMP.json
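The TIMESTAMP part of the filename is the ISO timestamp with colons replaced (colons are awkward in filenames), matching the `.replace(/:/g, '-')` expression in the node:

```javascript
// Derive the report filename from the ISO timestamp, as the Save Report node does.
const timestamp = '2025-12-19T14:30:00.000Z';
const filename = timestamp.replace(/:/g, '-') + '.json';
console.log(filename); // → 2025-12-19T14-30-00.000Z.json
```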
10. Discord Notification
Sends formatted message to Discord channel.
Discord Webhook Configuration
Message Format
// Discord webhook body (in HTTP Request node)
{
"content": null,
"embeds": [
{
"title": "{{ $json.troubleshooting.severity === 'critical' ? '🔴' : $json.troubleshooting.severity === 'high' ? '🟠' : $json.troubleshooting.severity === 'medium' ? '🟡' : '🟢' }} Server Alert",
"description": "Automated troubleshooting completed",
"color": {{ $json.troubleshooting.severity === 'critical' ? 15158332 : $json.troubleshooting.severity === 'high' ? 15105570 : 16776960 }},
"fields": [
{
"name": "Root Cause",
"value": "{{ $json.troubleshooting.root_cause }}",
"inline": false
},
{
"name": "Affected Services",
"value": "{{ $json.troubleshooting.affected_services.join(', ') }}",
"inline": true
},
{
"name": "Severity",
"value": "{{ $json.troubleshooting.severity.toUpperCase() }}",
"inline": true
},
{
"name": "Auto-Remediation",
"value": "{{ $json.troubleshooting.remediation_performed || 'None performed' }}",
"inline": false
},
{
"name": "Recommended Actions",
"value": "{{ $json.troubleshooting.recommended_actions.map(a => '• ' + a.action + (a.executed ? ' ✅' : '')).join('\\n') }}",
"inline": false
}
],
"timestamp": "{{ $json.timestamp }}"
}
]
}
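The nested ternaries above map severity to a Discord embed color integer. The same mapping as a lookup table (color values taken from the template; note the color ternary sends medium and low both to yellow, even though the title emoji distinguishes them):

```javascript
// Decimal color codes used in the embed template.
const SEVERITY_COLORS = {
  critical: 15158332, // red
  high: 15105570,     // orange
  medium: 16776960,   // yellow
};

// Anything unmapped (e.g. "low") falls back to yellow, as in the ternary chain.
const colorFor = (severity) => SEVERITY_COLORS[severity] ?? 16776960;

console.log(colorFor('high')); // → 15105570
console.log(colorFor('low'));  // → 16776960
```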
Webhook Setup
- In Discord, go to Server Settings → Integrations → Webhooks
- Create a new webhook in the #server-alerts channel
- Copy the webhook URL
- Add the URL to the N8N HTTP Request node
placeholder
Error Handling
Claude Code Timeout
Add error handling branch after "Invoke Claude Code":
// In error handler
if ($input.first().json.exitCode !== 0) {
return [{
json: {
success: false,
error: 'Claude Code execution failed',
stderr: $input.first().json.stderr,
fallback_action: 'Manual investigation required'
}
}];
}
Retry Logic
For transient failures, add retry configuration:
{
"retry": {
"enabled": true,
"maxRetries": 3,
"waitBetweenRetries": 5000
}
}
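Equivalent retry behaviour can also be sketched inside a Code node. `fn` here is a stand-in for the flaky operation (a hypothetical helper, not part of the workflow above):

```javascript
// Retry an async operation up to maxRetries times, waiting waitMs between attempts.
async function withRetries(fn, maxRetries = 3, waitMs = 5000) {
  let lastError;
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt < maxRetries) {
        // Wait before the next attempt.
        await new Promise((resolve) => setTimeout(resolve, waitMs));
      }
    }
  }
  throw lastError; // All attempts failed
}
```

The node-level retry settings are preferable when available, since they show up in N8N's execution history; the helper is useful when only part of a Code node should be retried.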
Alert Deduplication (Phase 2)
Add cooldown logic to prevent alert fatigue:
// Before triggering Claude (in a Code node)
const staticData = $getWorkflowStaticData('global');
const container = $json.affectedContainers[0];
const lastAlertKey = `last_alert_${container}`;
const cooldownMs = 5 * 60 * 1000; // 5 minutes
if (staticData[lastAlertKey] && (Date.now() - staticData[lastAlertKey]) < cooldownMs) {
  return []; // Skip, still within the cooldown window
}
// After sending the alert
staticData[lastAlertKey] = Date.now();
Credentials Required
N8N Environment
- SSH Key to Claude LXC: /root/.ssh/claude_lxc_key
- Discord Webhook URL: stored in N8N credentials
Claude LXC Environment
- Max Subscription Auth: authenticated via device code flow (credentials in ~/.claude/)
- SSH Key to Proxmox: ~/.ssh/claude_diagnostics_key
Handling Auth Expiry
Add error detection for authentication failures:
// In Parse Claude Response node, detect auth errors
if (stderr.includes('authenticate') || stderr.includes('unauthorized')) {
return [{
json: {
success: false,
error: 'Claude Code authentication expired',
action_required: 'SSH to Claude LXC and run: claude (to re-authenticate)',
severity: 'critical'
}
}];
}
Testing the Workflow
Manual Trigger Test
- In N8N, click "Execute Workflow" on the workflow
- Check each node's output
- Verify Discord message received
Simulated Failure Test
# Stop a container on Proxmox host
docker stop tdarr
# Wait for next health check (1 minute)
# Verify:
# 1. Claude Code invoked
# 2. Discord notification sent
# 3. Report saved to NAS
# 4. Container restarted (if remediation enabled)
End-to-End Verification
# Check N8N execution history
# - Go to N8N UI → Executions
# - Review successful and failed runs
# Check NAS for reports
ls -la /mnt/nas/claude-reports/$(date +%Y-%m)/
# Check Discord channel for notifications
Performance Considerations
- Schedule Interval: 1 minute may be aggressive. Consider 5 minutes for production.
- Claude Code Timeout: 180 seconds is generous. Most diagnoses should complete in 30-60 seconds.
- Log Size: Limit logs passed to Claude to avoid token limits (2,000 characters in the template).
- Parallel Execution: Verify how N8N behaves when a new run starts while the previous execution is still in flight; if overlapping health checks cause duplicate alerts, disable concurrent executions for this workflow.
Security Notes
- SSH Key Permissions: Ensure keys are 600 permissions
- Webhook URL: Don't expose in logs or error messages
- NAS Mount: Ensure proper permissions on report directory
- API Key: Never log or expose ANTHROPIC_API_KEY
Monitoring the Workflow
N8N Execution Metrics
- Successful executions per hour
- Failed executions per hour
- Average execution time
- Claude Code invocation count
Alert on Workflow Failure
Create a second workflow that monitors the primary workflow's health:
- Trigger: Every 15 minutes
- Check: Last successful execution within 20 minutes
- Alert: If no recent success, send a Discord alert about the monitoring failure
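The staleness check in that monitoring workflow reduces to a timestamp comparison. A sketch with placeholder timestamps (the real values would come from N8N's execution data):

```javascript
// Flag the primary workflow as stale if its last success is older than 20 minutes.
const STALE_AFTER_MS = 20 * 60 * 1000;
const lastSuccess = Date.parse('2025-12-19T14:00:00.000Z'); // placeholder
const now = Date.parse('2025-12-19T14:30:00.000Z');         // placeholder
const stale = (now - lastSuccess) > STALE_AFTER_MS;
console.log(stale); // → true (30 minutes since last success)
```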