# N8N Workflow Design - Server Troubleshooting

This document describes the N8N workflow that orchestrates the automated troubleshooting pipeline.

## Workflow Overview

```
┌─────────────┐     ┌──────────────┐     ┌───────────┐     ┌──────────────┐
│  Schedule   │────▶│ Health Check │────▶│ Has Error?│────▶│    Gather    │
│  Trigger    │     │ (docker ps)  │     │           │     │   Context    │
└─────────────┘     └──────────────┘     └─────┬─────┘     └──────┬───────┘
                                               │                  │
                                            No │                  │
                                               ▼                  ▼
                                             [End]         ┌──────────────┐
                                                           │    Invoke    │
                                                           │ Claude Code  │
                                                           └──────┬───────┘
                                                                  │
                                    ┌─────────────────────────────┼─────────────────────────────┐
                                    │                             │                             │
                                    ▼                             ▼                             ▼
                             ┌────────────┐              ┌──────────────┐              ┌──────────────┐
                             │ Parse JSON │              │ Save Report  │              │ Send Discord │
                             │  Response  │              │   to NAS     │              │ Notification │
                             └────────────┘              └──────────────┘              └──────────────┘
```

## Workflow JSON Export

This can be imported directly into N8N.

```json
{
  "name": "Server Troubleshooting Pipeline",
  "nodes": [
    {
      "parameters": {
        "rule": {
          "interval": [
            {
              "field": "minutes",
              "minutesInterval": 1
            }
          ]
        }
      },
      "id": "schedule-trigger",
      "name": "Every Minute",
      "type": "n8n-nodes-base.scheduleTrigger",
      "typeVersion": 1.2,
      "position": [0, 0]
    },
    {
      "parameters": {
        "command": "ssh -i /root/.ssh/claude_lxc_key root@claude-lxc \"ssh -i ~/.ssh/claude_diagnostics_key root@10.10.0.11 'docker ps --format json'\""
      },
      "id": "docker-health-check",
      "name": "Check Docker Status",
      "type": "n8n-nodes-base.executeCommand",
      "typeVersion": 1,
      "position": [220, 0]
    },
    {
      "parameters": {
        "jsCode": "// Parse docker ps output and check for issues\nconst stdout = $input.first().json.stdout || '';\nconst stderr = $input.first().json.stderr || '';\n\n// Handle empty or error cases\nif (stderr && !stdout) {\n  return [{\n    json: {\n      hasError: true,\n      errorType: 'docker_unreachable',\n      errorMessage: stderr,\n      containers: []\n    }\n  }];\n}\n\n// Parse container statuses\nconst lines = stdout.trim().split('\\n').filter(l => l);\nconst containers = lines.map(line => {\n  try {\n    return JSON.parse(line);\n  } catch {\n    return null;\n  }\n}).filter(c => c);\n\n// Check for unhealthy or stopped containers\nconst issues = containers.filter(c => {\n  const status = (c.Status || c.State || '').toLowerCase();\n  return !status.includes('up') && !status.includes('running');\n});\n\nreturn [{\n  json: {\n    hasError: issues.length > 0,\n    errorType: issues.length > 0 ? 'container_stopped' : null,\n    errorMessage: issues.length > 0\n      ? `Containers not running: ${issues.map(i => i.Names || i.Name).join(', ')}`\n      : null,\n    affectedContainers: issues.map(i => i.Names || i.Name),\n    allContainers: containers,\n    timestamp: new Date().toISOString()\n  }\n}];"
      },
      "id": "parse-docker-status",
      "name": "Parse Container Status",
      "type": "n8n-nodes-base.code",
      "typeVersion": 2,
      "position": [440, 0]
    },
    {
      "parameters": {
        "conditions": {
          "boolean": [
            {
              "value1": "={{ $json.hasError }}",
              "value2": true
            }
          ]
        }
      },
      "id": "check-has-error",
      "name": "Has Error?",
      "type": "n8n-nodes-base.if",
      "typeVersion": 2,
      "position": [660, 0]
    },
    {
      "parameters": {
        "command": "=ssh -i /root/.ssh/claude_lxc_key root@claude-lxc \"ssh -i ~/.ssh/claude_diagnostics_key root@10.10.0.11 'docker logs --tail 100 {{ $json.affectedContainers[0] }} 2>&1 | tail -50'\""
      },
      "id": "gather-logs",
      "name": "Gather Recent Logs",
      "type": "n8n-nodes-base.executeCommand",
      "typeVersion": 1,
      "position": [880, -100]
    },
    {
      "parameters": {
        "jsCode": "// Prepare context for Claude Code\nconst prevData = $('Parse Container Status').first().json;\nconst logs = $input.first().json.stdout || 'No logs available';\n\nconst context = {\n  server: 'proxmox-host',\n  errorType: prevData.errorType,\n  errorMessage: prevData.errorMessage,\n  affectedContainers: prevData.affectedContainers,\n  timestamp: prevData.timestamp,\n  recentLogs: logs.substring(0, 2000), // Limit log size\n  allContainerStatus: prevData.allContainers\n};\n\nreturn [{ json: context }];"
      },
      "id": "prepare-context",
      "name": "Prepare Claude Context",
      "type": "n8n-nodes-base.code",
      "typeVersion": 2,
      "position": [1100, -100]
    },
    {
      "parameters": {
        "command": "=ssh -i /root/.ssh/claude_lxc_key root@claude-lxc 'claude -p \"$(cat <<'\\''PROMPT'\\''\nYou are troubleshooting a server issue. Use the server-diagnostics skill.\n\nServer: {{ $json.server }}\nError Type: {{ $json.errorType }}\nTimestamp: {{ $json.timestamp }}\nError Message: {{ $json.errorMessage }}\nAffected Containers: {{ $json.affectedContainers.join(\", \") }}\n\nRecent Logs:\n{{ $json.recentLogs }}\n\nInstructions:\n1. Use the Python diagnostic client to investigate\n2. Check MemoryGraph for similar past issues\n3. Analyze the root cause\n4. If appropriate, execute low-risk remediation (docker restart)\n5. Store learnings in MemoryGraph\nPROMPT\n)\" --output-format json --json-schema '\\''{ \"type\": \"object\", \"properties\": { \"root_cause\": { \"type\": \"string\" }, \"severity\": { \"type\": \"string\", \"enum\": [\"low\", \"medium\", \"high\", \"critical\"] }, \"affected_services\": { \"type\": \"array\", \"items\": { \"type\": \"string\" } }, \"diagnosis_steps\": { \"type\": \"array\", \"items\": { \"type\": \"string\" } }, \"recommended_actions\": { \"type\": \"array\", \"items\": { \"type\": \"object\", \"properties\": { \"action\": { \"type\": \"string\" }, \"command\": { \"type\": \"string\" }, \"risk_level\": { \"type\": \"string\" }, \"executed\": { \"type\": \"boolean\" } } } }, \"remediation_performed\": { \"type\": \"string\" }, \"additional_context\": { \"type\": \"string\" } }, \"required\": [\"root_cause\", \"severity\", \"affected_services\"] }'\\'' --allowedTools \"Read,Grep,Glob,Bash(python3 ~/.claude/skills/server-diagnostics/client.py *)\" --max-turns 10'",
        "timeout": 180000
      },
      "id": "invoke-claude",
      "name": "Invoke Claude Code",
      "type": "n8n-nodes-base.executeCommand",
      "typeVersion": 1,
      "position": [1320, -100]
    },
    {
      "parameters": {
        "jsCode": "// Parse Claude Code JSON output\nconst stdout = $input.first().json.stdout || '';\nconst stderr = $input.first().json.stderr || '';\nconst exitCode = $input.first().json.exitCode || 0;\n\n// Try to parse JSON response\nlet claudeResponse = null;\nlet parseError = null;\n\ntry {\n  // Claude outputs JSON with result field\n  const parsed = JSON.parse(stdout);\n  claudeResponse = parsed.result ? JSON.parse(parsed.result) : parsed;\n} catch (e) {\n  // Try to find JSON in the output\n  const jsonMatch = stdout.match(/\\{[\\s\\S]*\\}/);\n  if (jsonMatch) {\n    try {\n      claudeResponse = JSON.parse(jsonMatch[0]);\n    } catch {\n      parseError = e.message;\n    }\n  } else {\n    parseError = e.message;\n  }\n}\n\nconst context = $('Prepare Claude Context').first().json;\n\nreturn [{\n  json: {\n    success: claudeResponse !== null,\n    parseError: parseError,\n    rawOutput: stdout.substring(0, 5000),\n    stderr: stderr,\n    exitCode: exitCode,\n    originalContext: context,\n    troubleshooting: claudeResponse || {\n      root_cause: 'Failed to parse Claude response',\n      severity: 'high',\n      affected_services: context.affectedContainers,\n      recommended_actions: [{ action: 'Manual investigation required', command: '', risk_level: 'none', executed: false }],\n      additional_context: stdout.substring(0, 1000)\n    },\n    timestamp: new Date().toISOString()\n  }\n}];"
      },
      "id": "parse-claude-response",
      "name": "Parse Claude Response",
      "type": "n8n-nodes-base.code",
      "typeVersion": 2,
      "position": [1540, -100]
    },
    {
      "parameters": {
        "command": "=mkdir -p /mnt/nas/claude-reports/$(date +%Y-%m) && echo '{{ JSON.stringify($json) }}' > /mnt/nas/claude-reports/$(date +%Y-%m)/{{ $json.timestamp.replace(/:/g, '-') }}.json"
      },
      "id": "save-report",
      "name": "Save Report to NAS",
      "type": "n8n-nodes-base.executeCommand",
      "typeVersion": 1,
      "position": [1760, -200]
    },
    {
      "parameters": {
        "method": "POST",
        "url": "discord-webhook-url-here",
        "options": {}
      },
      "id": "discord-webhook",
      "name": "Discord Notification",
      "type": "n8n-nodes-base.httpRequest",
      "typeVersion": 4.2,
      "position": [1760, 0]
    }
  ],
  "connections": {
    "Every Minute": {
      "main": [
        [{ "node": "Check Docker Status", "type": "main", "index": 0 }]
      ]
    },
    "Check Docker Status": {
      "main": [
        [{ "node": "Parse Container Status", "type": "main", "index": 0 }]
      ]
    },
    "Parse Container Status": {
      "main": [
        [{ "node": "Has Error?", "type": "main", "index": 0 }]
      ]
    },
    "Has Error?": {
      "main": [
        [{ "node": "Gather Recent Logs", "type": "main", "index": 0 }],
        []
      ]
    },
    "Gather Recent Logs": {
      "main": [
        [{ "node": "Prepare Claude Context", "type": "main", "index": 0 }]
      ]
    },
    "Prepare Claude Context": {
      "main": [
        [{ "node": "Invoke Claude Code", "type": "main", "index": 0 }]
      ]
    },
    "Invoke Claude Code": {
      "main": [
        [{ "node": "Parse Claude Response", "type": "main", "index": 0 }]
      ]
    },
    "Parse Claude Response": {
      "main": [
        [
          { "node": "Save Report to NAS", "type": "main", "index": 0 },
          { "node": "Discord Notification", "type": "main", "index": 0 }
        ]
      ]
    }
  }
}
```

## Node Details

### 1. Schedule Trigger

- **Interval:** Every 1 minute
- **Purpose:** Regular health checks

### 2. Check Docker Status

```bash
ssh -i /root/.ssh/claude_lxc_key root@claude-lxc \
  "ssh -i ~/.ssh/claude_diagnostics_key root@10.10.0.11 'docker ps --format json'"
```

This double-SSH pattern:

1. N8N → Claude LXC (where Claude Code and the SSH keys live)
2. Claude LXC → Proxmox host (where Docker runs)

### 3. Parse Container Status (JavaScript)

Parses the `docker ps` JSON output and checks for containers that are not "Up" or "running".

**Input:** Raw stdout from `docker ps`

**Output:**
```json
{
  "hasError": true,
  "errorType": "container_stopped",
  "errorMessage": "Containers not running: tdarr",
  "affectedContainers": ["tdarr"],
  "allContainers": [...],
  "timestamp": "2025-12-19T14:30:00.000Z"
}
```
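
The node's detection logic can be exercised outside N8N by lifting it into a plain function. A minimal standalone restatement (the sample container records below are illustrative, not real output):

```javascript
// Standalone restatement of the Parse Container Status logic,
// runnable with plain Node.js for testing.
function parseContainerStatus(stdout, stderr) {
  // Docker host unreachable: stderr only, no stdout
  if (stderr && !stdout) {
    return { hasError: true, errorType: 'docker_unreachable', errorMessage: stderr, containers: [] };
  }
  // `docker ps --format json` emits one JSON object per line
  const containers = stdout.trim().split('\n').filter(l => l).map(line => {
    try { return JSON.parse(line); } catch { return null; }
  }).filter(c => c);
  // Anything whose status is not "Up"/"running" counts as an issue
  const issues = containers.filter(c => {
    const status = (c.Status || c.State || '').toLowerCase();
    return !status.includes('up') && !status.includes('running');
  });
  return {
    hasError: issues.length > 0,
    errorType: issues.length > 0 ? 'container_stopped' : null,
    affectedContainers: issues.map(i => i.Names || i.Name)
  };
}

// Example: one healthy container, one exited container
const sample = [
  JSON.stringify({ Names: 'plex', Status: 'Up 3 days' }),
  JSON.stringify({ Names: 'tdarr', Status: 'Exited (1) 2 minutes ago' })
].join('\n');
console.log(parseContainerStatus(sample, ''));
```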

### 4. Has Error? (IF Node)

Branches the workflow:

- **True path:** Continue to troubleshooting
- **False path:** End (no action needed)

### 5. Gather Recent Logs

Fetches the last 100 lines of logs from the first affected container, trimmed to the final 50 before being passed on.

### 6. Prepare Claude Context

Combines:

- Error information from the health check
- Recent logs (truncated to 2,000 characters to limit token usage)
- Timestamp
- Server identifier

### 7. Invoke Claude Code

The core troubleshooting invocation:

```bash
claude -p "<prompt>" \
  --output-format json \
  --json-schema '<schema>' \
  --allowedTools "Read,Grep,Glob,Bash(python3 ~/.claude/skills/server-diagnostics/client.py *)" \
  --max-turns 10
```

**Timeout:** 180 seconds (3 minutes)

### 8. Parse Claude Response

Extracts the structured troubleshooting result from Claude's JSON output.

Handles:

- Clean JSON parse
- JSON embedded in larger output
- Parse failures (falls back to raw output)

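
The extraction portion of that fallback chain can be restated as a small standalone function for testing (same order of attempts as the node's `jsCode`; it returns `null` where the node builds its fallback object):

```javascript
// Tolerant JSON extraction: clean parse first, then a regex hunt for an
// embedded object, else null.
function extractJson(stdout) {
  try {
    // Claude's JSON output wraps the structured result in a `result` field
    const parsed = JSON.parse(stdout);
    return parsed.result ? JSON.parse(parsed.result) : parsed;
  } catch {
    // Fall back to the first {...} span found in noisy output
    const match = stdout.match(/\{[\s\S]*\}/);
    if (match) {
      try { return JSON.parse(match[0]); } catch { return null; }
    }
    return null;
  }
}

console.log(extractJson('log noise {"root_cause":"OOM kill"} trailing text'));
```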
### 9. Save Report to NAS

Saves the full troubleshooting report to:

```
/mnt/nas/claude-reports/YYYY-MM/TIMESTAMP.json
```

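
The filename comes from the report's ISO timestamp with colons replaced (colons are awkward in filenames). A sketch of the derivation — note the node itself uses the server's `date +%Y-%m` for the month folder, which matches this only when the server clock and the report timestamp agree:

```javascript
// Derive the NAS report path from the ISO timestamp set by the parse node
const timestamp = '2025-12-19T14:30:00.000Z';      // example value from the doc
const month = timestamp.slice(0, 7);               // YYYY-MM folder
const file = timestamp.replace(/:/g, '-') + '.json'; // filesystem-safe name
const reportPath = `/mnt/nas/claude-reports/${month}/${file}`;
console.log(reportPath);
// /mnt/nas/claude-reports/2025-12/2025-12-19T14-30-00.000Z.json
```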
### 10. Discord Notification

Sends a formatted embed message to the Discord channel.

## Discord Webhook Configuration

### Message Format

```javascript
// Discord webhook body (in HTTP Request node)
{
  "content": null,
  "embeds": [
    {
      "title": "{{ $json.troubleshooting.severity === 'critical' ? '🔴' : $json.troubleshooting.severity === 'high' ? '🟠' : $json.troubleshooting.severity === 'medium' ? '🟡' : '🟢' }} Server Alert",
      "description": "Automated troubleshooting completed",
      "color": {{ $json.troubleshooting.severity === 'critical' ? 15158332 : $json.troubleshooting.severity === 'high' ? 15105570 : 16776960 }},
      "fields": [
        {
          "name": "Root Cause",
          "value": "{{ $json.troubleshooting.root_cause }}",
          "inline": false
        },
        {
          "name": "Affected Services",
          "value": "{{ $json.troubleshooting.affected_services.join(', ') }}",
          "inline": true
        },
        {
          "name": "Severity",
          "value": "{{ $json.troubleshooting.severity.toUpperCase() }}",
          "inline": true
        },
        {
          "name": "Auto-Remediation",
          "value": "{{ $json.troubleshooting.remediation_performed || 'None performed' }}",
          "inline": false
        },
        {
          "name": "Recommended Actions",
          "value": "{{ $json.troubleshooting.recommended_actions.map(a => '• ' + a.action + (a.executed ? ' ✅' : '')).join('\\n') }}",
          "inline": false
        }
      ],
      "timestamp": "{{ $json.timestamp }}"
    }
  ]
}
```
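
The severity-to-appearance mapping embedded in those ternaries, pulled out as a helper for readability (the decimal values mirror the template above; medium and low share a color there, so they do here too):

```javascript
// Map a severity string to the emoji and Discord embed color integer
// used in the webhook body above. Unknown severities fall back to low.
function severityStyle(severity) {
  const styles = {
    critical: { emoji: '🔴', color: 15158332 },
    high:     { emoji: '🟠', color: 15105570 },
    medium:   { emoji: '🟡', color: 16776960 },
    low:      { emoji: '🟢', color: 16776960 }
  };
  return styles[severity] || styles.low;
}

console.log(severityStyle('critical')); // { emoji: '🔴', color: 15158332 }
```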

### Webhook Setup

1. In Discord, go to Server Settings → Integrations → Webhooks
2. Create a new webhook in the `#server-alerts` channel
3. Copy the webhook URL
4. Add it to the N8N HTTP Request node

## Error Handling

### Claude Code Timeout

Add an error handling branch after "Invoke Claude Code":

```javascript
// In error handler
if ($input.first().json.exitCode !== 0) {
  return [{
    json: {
      success: false,
      error: 'Claude Code execution failed',
      stderr: $input.first().json.stderr,
      fallback_action: 'Manual investigation required'
    }
  }];
}
```

### Retry Logic

For transient failures, add retry configuration:

```json
{
  "retryOnFail": true,
  "maxTries": 3,
  "waitBetweenTries": 5000
}
```

These are per-node properties, also exposed in each node's Settings tab as "Retry On Fail", "Max Tries", and "Wait Between Tries".

### Alert Deduplication (Phase 2)

Add cooldown logic to prevent alert fatigue:

```javascript
// Before triggering Claude (in a Code node).
// Workflow static data is a mutable object persisted between executions.
const staticData = $getWorkflowStaticData('global');
const lastAlertKey = `last_alert_${container}`;
const cooldownMs = 5 * 60 * 1000; // 5 minutes

if (staticData[lastAlertKey] && (Date.now() - staticData[lastAlertKey]) < cooldownMs) {
  return []; // Skip, within cooldown
}

// After sending alert
staticData[lastAlertKey] = Date.now();
```
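
The cooldown comparison factors out into a pure function, which makes the logic easy to unit-test outside N8N (a sketch, not part of the workflow export):

```javascript
// Pure cooldown check: true when a previous alert exists and is newer
// than the cooldown window, meaning a new alert should be suppressed.
function withinCooldown(lastAlertMs, nowMs, cooldownMs = 5 * 60 * 1000) {
  return lastAlertMs != null && (nowMs - lastAlertMs) < cooldownMs;
}

// An alert 2 minutes ago suppresses a new one; 10 minutes ago does not
console.log(withinCooldown(Date.now() - 2 * 60 * 1000, Date.now()));  // true
console.log(withinCooldown(Date.now() - 10 * 60 * 1000, Date.now())); // false
```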

## Credentials Required

### N8N Environment

1. **SSH key to Claude LXC:** `/root/.ssh/claude_lxc_key`
2. **Discord webhook URL:** Stored in N8N credentials

### Claude LXC Environment

1. **Max Subscription auth:** Authenticated via the device code flow (credentials in `~/.claude/`)
2. **SSH key to Proxmox:** `~/.ssh/claude_diagnostics_key`

### Handling Auth Expiry

Add error detection for authentication failures:

```javascript
// In Parse Claude Response node, detect auth errors
if (stderr.includes('authenticate') || stderr.includes('unauthorized')) {
  return [{
    json: {
      success: false,
      error: 'Claude Code authentication expired',
      action_required: 'SSH to Claude LXC and run: claude (to re-authenticate)',
      severity: 'critical'
    }
  }];
}
```

## Testing the Workflow

### Manual Trigger Test

1. In N8N, click "Execute Workflow" on the workflow
2. Check each node's output
3. Verify the Discord message was received

### Simulated Failure Test

```bash
# Stop a container on the Proxmox host
docker stop tdarr

# Wait for the next health check (1 minute), then verify:
# 1. Claude Code invoked
# 2. Discord notification sent
# 3. Report saved to NAS
# 4. Container restarted (if remediation enabled)
```

### End-to-End Verification

```bash
# Check N8N execution history:
# - Go to N8N UI → Executions
# - Review successful and failed runs

# Check NAS for reports
ls -la /mnt/nas/claude-reports/$(date +%Y-%m)/

# Check the Discord channel for notifications
```

## Performance Considerations

1. **Schedule interval:** Every 1 minute may be aggressive; consider 5 minutes for production.

2. **Claude Code timeout:** 180 seconds is generous; most diagnoses should complete in 30-60 seconds.

3. **Log size:** Limit the logs passed to Claude to avoid token limits (2,000 characters in the template).

4. **Parallel execution:** Verify how N8N behaves when a new trigger fires while the previous execution is still running, rather than assuming overlapping runs are handled.

## Security Notes

1. **SSH key permissions:** Ensure keys are mode 600
2. **Webhook URL:** Don't expose it in logs or error messages
3. **NAS mount:** Ensure proper permissions on the report directory
4. **API key:** Never log or expose `ANTHROPIC_API_KEY`

## Monitoring the Workflow

### N8N Execution Metrics

- Successful executions per hour
- Failed executions per hour
- Average execution time
- Claude Code invocation count

### Alert on Workflow Failure

Create a second workflow that monitors the primary workflow's health:

```
Trigger: Every 15 minutes
Check: Last successful execution within 20 minutes
Alert: If no recent success, send Discord alert about monitoring failure
```
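
The staleness check at the heart of that watchdog can be sketched as a pure function (assuming the primary workflow records an ISO timestamp of its last success, e.g. in workflow static data — `lastSuccessIso` is an illustrative name, not something the export above defines):

```javascript
// True when the last recorded success is missing or older than the
// allowed window (20 minutes by default, matching the sketch above).
function isStale(lastSuccessIso, now = Date.now(), maxAgeMs = 20 * 60 * 1000) {
  if (!lastSuccessIso) return true; // never succeeded → alert
  return now - Date.parse(lastSuccessIso) > maxAgeMs;
}

// A success one minute ago is fresh; thirty minutes ago is stale
console.log(isStale(new Date(Date.now() - 60 * 1000).toISOString()));      // false
console.log(isStale(new Date(Date.now() - 30 * 60 * 1000).toISOString())); // true
```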