# N8N Workflow Design - Server Troubleshooting

This document describes the N8N workflow that orchestrates the automated troubleshooting pipeline.

## Workflow Overview

```
┌─────────────┐     ┌──────────────┐     ┌───────────┐     ┌──────────────┐
│  Schedule   │────▶│ Health Check │────▶│ Has Error?│────▶│    Gather    │
│  Trigger    │     │ (docker ps)  │     │           │     │   Context    │
└─────────────┘     └──────────────┘     └─────┬─────┘     └──────┬───────┘
                                               │                  │
                                            No │                  │
                                               ▼                  ▼
                                             [End]         ┌──────────────┐
                                                           │    Invoke    │
                                                           │ Claude Code  │
                                                           └──────┬───────┘
                                                                  │
                                    ┌─────────────────────────────┼─────────────────────────────┐
                                    │                             │                             │
                                    ▼                             ▼                             ▼
                             ┌────────────┐              ┌──────────────┐              ┌──────────────┐
                             │ Parse JSON │              │ Save Report  │              │ Send Discord │
                             │  Response  │              │   to NAS     │              │ Notification │
                             └────────────┘              └──────────────┘              └──────────────┘
```

## Workflow JSON Export

This can be imported directly into N8N.

```json
{
  "name": "Server Troubleshooting Pipeline",
  "nodes": [
    {
      "parameters": {
        "rule": {
          "interval": [
            {
              "field": "minutes",
              "minutesInterval": 1
            }
          ]
        }
      },
      "id": "schedule-trigger",
      "name": "Every Minute",
      "type": "n8n-nodes-base.scheduleTrigger",
      "typeVersion": 1.2,
      "position": [0, 0]
    },
    {
      "parameters": {
        "command": "ssh -i /root/.ssh/claude_lxc_key root@claude-lxc \"ssh -i ~/.ssh/claude_diagnostics_key root@10.10.0.11 'docker ps --format json'\""
      },
      "id": "docker-health-check",
      "name": "Check Docker Status",
      "type": "n8n-nodes-base.executeCommand",
      "typeVersion": 1,
      "position": [220, 0]
    },
    {
      "parameters": {
        "jsCode": "// Parse docker ps output and check for issues\nconst stdout = $input.first().json.stdout || '';\nconst stderr = $input.first().json.stderr || '';\n\n// Handle empty or error cases\nif (stderr && !stdout) {\n  return [{\n    json: {\n      hasError: true,\n      errorType: 'docker_unreachable',\n      errorMessage: stderr,\n      containers: []\n    }\n  }];\n}\n\n// Parse container statuses\nconst lines = stdout.trim().split('\\n').filter(l => l);\nconst containers = lines.map(line => {\n  try {\n    return JSON.parse(line);\n  } catch {\n    return null;\n  }\n}).filter(c => c);\n\n// Check for unhealthy or stopped containers\nconst issues = containers.filter(c => {\n  const status = (c.Status || c.State || '').toLowerCase();\n  return !status.includes('up') && !status.includes('running');\n});\n\nreturn [{\n  json: {\n    hasError: issues.length > 0,\n    errorType: issues.length > 0 ? 'container_stopped' : null,\n    errorMessage: issues.length > 0\n      ? `Containers not running: ${issues.map(i => i.Names || i.Name).join(', ')}`\n      : null,\n    affectedContainers: issues.map(i => i.Names || i.Name),\n    allContainers: containers,\n    timestamp: new Date().toISOString()\n  }\n}];"
      },
      "id": "parse-docker-status",
      "name": "Parse Container Status",
      "type": "n8n-nodes-base.code",
      "typeVersion": 2,
      "position": [440, 0]
    },
    {
      "parameters": {
        "conditions": {
          "boolean": [
            {
              "value1": "={{ $json.hasError }}",
              "value2": true
            }
          ]
        }
      },
      "id": "check-has-error",
      "name": "Has Error?",
      "type": "n8n-nodes-base.if",
      "typeVersion": 2,
      "position": [660, 0]
    },
    {
      "parameters": {
        "command": "=ssh -i /root/.ssh/claude_lxc_key root@claude-lxc \"ssh -i ~/.ssh/claude_diagnostics_key root@10.10.0.11 'docker logs --tail 100 {{ $json.affectedContainers[0] }} 2>&1 | tail -50'\""
      },
      "id": "gather-logs",
      "name": "Gather Recent Logs",
      "type": "n8n-nodes-base.executeCommand",
      "typeVersion": 1,
      "position": [880, -100]
    },
    {
      "parameters": {
        "jsCode": "// Prepare context for Claude Code\nconst prevData = $('Parse Container Status').first().json;\nconst logs = $input.first().json.stdout || 'No logs available';\n\nconst context = {\n  server: 'proxmox-host',\n  errorType: prevData.errorType,\n  errorMessage: prevData.errorMessage,\n  affectedContainers: prevData.affectedContainers,\n  timestamp: prevData.timestamp,\n  recentLogs: logs.substring(0, 2000), // Limit log size\n  allContainerStatus: prevData.allContainers\n};\n\nreturn [{ json: context }];"
      },
      "id": "prepare-context",
      "name": "Prepare Claude Context",
      "type": "n8n-nodes-base.code",
      "typeVersion": 2,
      "position": [1100, -100]
    },
    {
      "parameters": {
        "command": "=ssh -i /root/.ssh/claude_lxc_key root@claude-lxc 'claude -p \"$(cat <<'\\''PROMPT'\\''\nYou are troubleshooting a server issue. Use the server-diagnostics skill.\n\nServer: {{ $json.server }}\nError Type: {{ $json.errorType }}\nTimestamp: {{ $json.timestamp }}\nError Message: {{ $json.errorMessage }}\nAffected Containers: {{ $json.affectedContainers.join(\", \") }}\n\nRecent Logs:\n{{ $json.recentLogs }}\n\nInstructions:\n1. Use the Python diagnostic client to investigate\n2. Check MemoryGraph for similar past issues\n3. Analyze the root cause\n4. If appropriate, execute low-risk remediation (docker restart)\n5. Store learnings in MemoryGraph\nPROMPT\n)\" --output-format json --json-schema '\\''{ \"type\": \"object\", \"properties\": { \"root_cause\": { \"type\": \"string\" }, \"severity\": { \"type\": \"string\", \"enum\": [\"low\", \"medium\", \"high\", \"critical\"] }, \"affected_services\": { \"type\": \"array\", \"items\": { \"type\": \"string\" } }, \"diagnosis_steps\": { \"type\": \"array\", \"items\": { \"type\": \"string\" } }, \"recommended_actions\": { \"type\": \"array\", \"items\": { \"type\": \"object\", \"properties\": { \"action\": { \"type\": \"string\" }, \"command\": { \"type\": \"string\" }, \"risk_level\": { \"type\": \"string\" }, \"executed\": { \"type\": \"boolean\" } } } }, \"remediation_performed\": { \"type\": \"string\" }, \"additional_context\": { \"type\": \"string\" } }, \"required\": [\"root_cause\", \"severity\", \"affected_services\"] }'\\'' --allowedTools \"Read,Grep,Glob,Bash(python3 ~/.claude/skills/server-diagnostics/client.py *)\" --max-turns 10'",
        "timeout": 180000
      },
      "id": "invoke-claude",
      "name": "Invoke Claude Code",
      "type": "n8n-nodes-base.executeCommand",
      "typeVersion": 1,
      "position": [1320, -100]
    },
    {
      "parameters": {
        "jsCode": "// Parse Claude Code JSON output\nconst stdout = $input.first().json.stdout || '';\nconst stderr = $input.first().json.stderr || '';\nconst exitCode = $input.first().json.exitCode || 0;\n\n// Try to parse JSON response\nlet claudeResponse = null;\nlet parseError = null;\n\ntry {\n  // Claude outputs JSON with result field\n  const parsed = JSON.parse(stdout);\n  claudeResponse = parsed.result ? JSON.parse(parsed.result) : parsed;\n} catch (e) {\n  // Try to find JSON in the output\n  const jsonMatch = stdout.match(/\\{[\\s\\S]*\\}/);\n  if (jsonMatch) {\n    try {\n      claudeResponse = JSON.parse(jsonMatch[0]);\n    } catch {\n      parseError = e.message;\n    }\n  } else {\n    parseError = e.message;\n  }\n}\n\nconst context = $('Prepare Claude Context').first().json;\n\nreturn [{\n  json: {\n    success: claudeResponse !== null,\n    parseError: parseError,\n    rawOutput: stdout.substring(0, 5000),\n    stderr: stderr,\n    exitCode: exitCode,\n    originalContext: context,\n    troubleshooting: claudeResponse || {\n      root_cause: 'Failed to parse Claude response',\n      severity: 'high',\n      affected_services: context.affectedContainers,\n      recommended_actions: [{ action: 'Manual investigation required', command: '', risk_level: 'none', executed: false }],\n      additional_context: stdout.substring(0, 1000)\n    },\n    timestamp: new Date().toISOString()\n  }\n}];"
      },
      "id": "parse-claude-response",
      "name": "Parse Claude Response",
      "type": "n8n-nodes-base.code",
      "typeVersion": 2,
      "position": [1540, -100]
    },
    {
      "parameters": {
        "command": "=mkdir -p /mnt/nas/claude-reports/$(date +%Y-%m) && echo '{{ JSON.stringify($json) }}' > /mnt/nas/claude-reports/$(date +%Y-%m)/{{ $json.timestamp.replace(/:/g, '-') }}.json"
      },
      "id": "save-report",
      "name": "Save Report to NAS",
      "type": "n8n-nodes-base.executeCommand",
      "typeVersion": 1,
      "position": [1760, -200]
    },
    {
      "parameters": {
        "method": "POST",
        "url": "discord-webhook-url-here",
        "options": {}
      },
      "id": "discord-webhook",
      "name": "Discord Notification",
      "type": "n8n-nodes-base.httpRequest",
      "typeVersion": 4.2,
      "position": [1760, 0]
    }
  ],
  "connections": {
    "Every Minute": {
      "main": [
        [{ "node": "Check Docker Status", "type": "main", "index": 0 }]
      ]
    },
    "Check Docker Status": {
      "main": [
        [{ "node": "Parse Container Status", "type": "main", "index": 0 }]
      ]
    },
    "Parse Container Status": {
      "main": [
        [{ "node": "Has Error?", "type": "main", "index": 0 }]
      ]
    },
    "Has Error?": {
      "main": [
        [{ "node": "Gather Recent Logs", "type": "main", "index": 0 }],
        []
      ]
    },
    "Gather Recent Logs": {
      "main": [
        [{ "node": "Prepare Claude Context", "type": "main", "index": 0 }]
      ]
    },
    "Prepare Claude Context": {
      "main": [
        [{ "node": "Invoke Claude Code", "type": "main", "index": 0 }]
      ]
    },
    "Invoke Claude Code": {
      "main": [
        [{ "node": "Parse Claude Response", "type": "main", "index": 0 }]
      ]
    },
    "Parse Claude Response": {
      "main": [
        [
          { "node": "Save Report to NAS", "type": "main", "index": 0 },
          { "node": "Discord Notification", "type": "main", "index": 0 }
        ]
      ]
    }
  }
}
```

## Node Details

### 1. Schedule Trigger

- **Interval:** Every 1 minute
- **Purpose:** Regular health checks

### 2. Check Docker Status

```bash
ssh -i /root/.ssh/claude_lxc_key root@claude-lxc \
  "ssh -i ~/.ssh/claude_diagnostics_key root@10.10.0.11 'docker ps --format json'"
```

This double-SSH pattern:

1. N8N → Claude LXC (where Claude Code and the SSH keys live)
2. Claude LXC → Proxmox host (where Docker runs)

### 3. Parse Container Status (JavaScript)

Parses the `docker ps` JSON output and checks for containers that are not "Up" or "running".

**Input:** Raw stdout from `docker ps`

**Output:**
```json
{
  "hasError": true,
  "errorType": "container_stopped",
  "errorMessage": "Containers not running: tdarr",
  "affectedContainers": ["tdarr"],
  "allContainers": [...],
  "timestamp": "2025-12-19T14:30:00.000Z"
}
```
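
The node's detection logic can be exercised outside N8N by lifting it into a plain function. A minimal standalone restatement (the sample container records below are illustrative, not real output):

```javascript
// Standalone restatement of the Parse Container Status logic,
// runnable with plain Node.js for testing.
function parseContainerStatus(stdout, stderr) {
  // Docker host unreachable: stderr only, no stdout
  if (stderr && !stdout) {
    return { hasError: true, errorType: 'docker_unreachable', errorMessage: stderr, containers: [] };
  }
  // `docker ps --format json` emits one JSON object per line
  const containers = stdout.trim().split('\n').filter(l => l).map(line => {
    try { return JSON.parse(line); } catch { return null; }
  }).filter(c => c);
  // Anything whose status is not "Up"/"running" counts as an issue
  const issues = containers.filter(c => {
    const status = (c.Status || c.State || '').toLowerCase();
    return !status.includes('up') && !status.includes('running');
  });
  return {
    hasError: issues.length > 0,
    errorType: issues.length > 0 ? 'container_stopped' : null,
    affectedContainers: issues.map(i => i.Names || i.Name)
  };
}

// Example: one healthy container, one exited container
const sample = [
  JSON.stringify({ Names: 'plex', Status: 'Up 3 days' }),
  JSON.stringify({ Names: 'tdarr', Status: 'Exited (1) 2 minutes ago' })
].join('\n');
console.log(parseContainerStatus(sample, ''));
```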

### 4. Has Error? (IF Node)

Branches the workflow:

- **True path:** Continue to troubleshooting
- **False path:** End (no action needed)

### 5. Gather Recent Logs

Fetches the last 100 lines of logs from the first affected container, trimmed to the final 50 before being passed on.

### 6. Prepare Claude Context

Combines:

- Error information from the health check
- Recent logs (truncated to 2,000 characters to limit token usage)
- Timestamp
- Server identifier

### 7. Invoke Claude Code

The core troubleshooting invocation:

```bash
claude -p "<prompt>" \
  --output-format json \
  --json-schema '<schema>' \
  --allowedTools "Read,Grep,Glob,Bash(python3 ~/.claude/skills/server-diagnostics/client.py *)" \
  --max-turns 10
```

**Timeout:** 180 seconds (3 minutes)

### 8. Parse Claude Response

Extracts the structured troubleshooting result from Claude's JSON output.

Handles:

- Clean JSON parse
- JSON embedded in larger output
- Parse failures (falls back to raw output)

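
The extraction portion of that fallback chain can be restated as a small standalone function for testing (same order of attempts as the node's `jsCode`; it returns `null` where the node builds its fallback object):

```javascript
// Tolerant JSON extraction: clean parse first, then a regex hunt for an
// embedded object, else null.
function extractJson(stdout) {
  try {
    // Claude's JSON output wraps the structured result in a `result` field
    const parsed = JSON.parse(stdout);
    return parsed.result ? JSON.parse(parsed.result) : parsed;
  } catch {
    // Fall back to the first {...} span found in noisy output
    const match = stdout.match(/\{[\s\S]*\}/);
    if (match) {
      try { return JSON.parse(match[0]); } catch { return null; }
    }
    return null;
  }
}

console.log(extractJson('log noise {"root_cause":"OOM kill"} trailing text'));
```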
### 9. Save Report to NAS

Saves the full troubleshooting report to:

```
/mnt/nas/claude-reports/YYYY-MM/TIMESTAMP.json
```

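
The filename comes from the report's ISO timestamp with colons replaced (colons are awkward in filenames). A sketch of the derivation — note the node itself uses the server's `date +%Y-%m` for the month folder, which matches this only when the server clock and the report timestamp agree:

```javascript
// Derive the NAS report path from the ISO timestamp set by the parse node
const timestamp = '2025-12-19T14:30:00.000Z';      // example value from the doc
const month = timestamp.slice(0, 7);               // YYYY-MM folder
const file = timestamp.replace(/:/g, '-') + '.json'; // filesystem-safe name
const reportPath = `/mnt/nas/claude-reports/${month}/${file}`;
console.log(reportPath);
// /mnt/nas/claude-reports/2025-12/2025-12-19T14-30-00.000Z.json
```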
### 10. Discord Notification

Sends a formatted embed message to the Discord channel.

## Discord Webhook Configuration

### Message Format

```javascript
// Discord webhook body (in HTTP Request node)
{
  "content": null,
  "embeds": [
    {
      "title": "{{ $json.troubleshooting.severity === 'critical' ? '🔴' : $json.troubleshooting.severity === 'high' ? '🟠' : $json.troubleshooting.severity === 'medium' ? '🟡' : '🟢' }} Server Alert",
      "description": "Automated troubleshooting completed",
      "color": {{ $json.troubleshooting.severity === 'critical' ? 15158332 : $json.troubleshooting.severity === 'high' ? 15105570 : 16776960 }},
      "fields": [
        {
          "name": "Root Cause",
          "value": "{{ $json.troubleshooting.root_cause }}",
          "inline": false
        },
        {
          "name": "Affected Services",
          "value": "{{ $json.troubleshooting.affected_services.join(', ') }}",
          "inline": true
        },
        {
          "name": "Severity",
          "value": "{{ $json.troubleshooting.severity.toUpperCase() }}",
          "inline": true
        },
        {
          "name": "Auto-Remediation",
          "value": "{{ $json.troubleshooting.remediation_performed || 'None performed' }}",
          "inline": false
        },
        {
          "name": "Recommended Actions",
          "value": "{{ $json.troubleshooting.recommended_actions.map(a => '• ' + a.action + (a.executed ? ' ✅' : '')).join('\\n') }}",
          "inline": false
        }
      ],
      "timestamp": "{{ $json.timestamp }}"
    }
  ]
}
```
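
The severity-to-appearance mapping embedded in those ternaries, pulled out as a helper for readability (the decimal values mirror the template above; medium and low share a color there, so they do here too):

```javascript
// Map a severity string to the emoji and Discord embed color integer
// used in the webhook body above. Unknown severities fall back to low.
function severityStyle(severity) {
  const styles = {
    critical: { emoji: '🔴', color: 15158332 },
    high:     { emoji: '🟠', color: 15105570 },
    medium:   { emoji: '🟡', color: 16776960 },
    low:      { emoji: '🟢', color: 16776960 }
  };
  return styles[severity] || styles.low;
}

console.log(severityStyle('critical')); // { emoji: '🔴', color: 15158332 }
```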

### Webhook Setup

1. In Discord, go to Server Settings → Integrations → Webhooks
2. Create a new webhook in the `#server-alerts` channel
3. Copy the webhook URL
4. Add it to the N8N HTTP Request node

## Error Handling

### Claude Code Timeout

Add an error handling branch after "Invoke Claude Code":

```javascript
// In error handler
if ($input.first().json.exitCode !== 0) {
  return [{
    json: {
      success: false,
      error: 'Claude Code execution failed',
      stderr: $input.first().json.stderr,
      fallback_action: 'Manual investigation required'
    }
  }];
}
```

### Retry Logic

For transient failures, add retry configuration:

```json
{
  "retryOnFail": true,
  "maxTries": 3,
  "waitBetweenTries": 5000
}
```

These are per-node properties, also exposed in each node's Settings tab as "Retry On Fail", "Max Tries", and "Wait Between Tries".

### Alert Deduplication (Phase 2)

Add cooldown logic to prevent alert fatigue:

```javascript
// Before triggering Claude (in a Code node).
// Workflow static data is a mutable object persisted between executions.
const staticData = $getWorkflowStaticData('global');
const lastAlertKey = `last_alert_${container}`;
const cooldownMs = 5 * 60 * 1000; // 5 minutes

if (staticData[lastAlertKey] && (Date.now() - staticData[lastAlertKey]) < cooldownMs) {
  return []; // Skip, within cooldown
}

// After sending alert
staticData[lastAlertKey] = Date.now();
```
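
The cooldown comparison factors out into a pure function, which makes the logic easy to unit-test outside N8N (a sketch, not part of the workflow export):

```javascript
// Pure cooldown check: true when a previous alert exists and is newer
// than the cooldown window, meaning a new alert should be suppressed.
function withinCooldown(lastAlertMs, nowMs, cooldownMs = 5 * 60 * 1000) {
  return lastAlertMs != null && (nowMs - lastAlertMs) < cooldownMs;
}

// An alert 2 minutes ago suppresses a new one; 10 minutes ago does not
console.log(withinCooldown(Date.now() - 2 * 60 * 1000, Date.now()));  // true
console.log(withinCooldown(Date.now() - 10 * 60 * 1000, Date.now())); // false
```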

## Credentials Required

### N8N Environment

1. **SSH key to Claude LXC:** `/root/.ssh/claude_lxc_key`
2. **Discord webhook URL:** Stored in N8N credentials

### Claude LXC Environment

1. **Max Subscription auth:** Authenticated via the device code flow (credentials in `~/.claude/`)
2. **SSH key to Proxmox:** `~/.ssh/claude_diagnostics_key`

### Handling Auth Expiry

Add error detection for authentication failures:

```javascript
// In Parse Claude Response node, detect auth errors
if (stderr.includes('authenticate') || stderr.includes('unauthorized')) {
  return [{
    json: {
      success: false,
      error: 'Claude Code authentication expired',
      action_required: 'SSH to Claude LXC and run: claude (to re-authenticate)',
      severity: 'critical'
    }
  }];
}
```

## Testing the Workflow

### Manual Trigger Test

1. In N8N, click "Execute Workflow" on the workflow
2. Check each node's output
3. Verify the Discord message was received

### Simulated Failure Test

```bash
# Stop a container on the Proxmox host
docker stop tdarr

# Wait for the next health check (1 minute), then verify:
# 1. Claude Code invoked
# 2. Discord notification sent
# 3. Report saved to NAS
# 4. Container restarted (if remediation enabled)
```

### End-to-End Verification

```bash
# Check N8N execution history:
# - Go to N8N UI → Executions
# - Review successful and failed runs

# Check NAS for reports
ls -la /mnt/nas/claude-reports/$(date +%Y-%m)/

# Check the Discord channel for notifications
```

## Performance Considerations

1. **Schedule interval:** Every 1 minute may be aggressive; consider 5 minutes for production.

2. **Claude Code timeout:** 180 seconds is generous; most diagnoses should complete in 30-60 seconds.

3. **Log size:** Limit the logs passed to Claude to avoid token limits (2,000 characters in the template).

4. **Parallel execution:** Verify how N8N behaves when a new trigger fires while the previous execution is still running, rather than assuming overlapping runs are handled.

## Security Notes

1. **SSH key permissions:** Ensure keys are mode 600
2. **Webhook URL:** Don't expose it in logs or error messages
3. **NAS mount:** Ensure proper permissions on the report directory
4. **API key:** Never log or expose `ANTHROPIC_API_KEY`

## Monitoring the Workflow

### N8N Execution Metrics

- Successful executions per hour
- Failed executions per hour
- Average execution time
- Claude Code invocation count

### Alert on Workflow Failure

Create a second workflow that monitors the primary workflow's health:

```
Trigger: Every 15 minutes
Check: Last successful execution within 20 minutes
Alert: If no recent success, send Discord alert about monitoring failure
```
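
The staleness check at the heart of that watchdog can be sketched as a pure function (assuming the primary workflow records an ISO timestamp of its last success, e.g. in workflow static data — `lastSuccessIso` is an illustrative name, not something the export above defines):

```javascript
// True when the last recorded success is missing or older than the
// allowed window (20 minutes by default, matching the sketch above).
function isStale(lastSuccessIso, now = Date.now(), maxAgeMs = 20 * 60 * 1000) {
  if (!lastSuccessIso) return true; // never succeeded → alert
  return now - Date.parse(lastSuccessIso) > maxAgeMs;
}

// A success one minute ago is fresh; thirty minutes ago is stale
console.log(isStale(new Date(Date.now() - 60 * 1000).toISOString()));      // false
console.log(isStale(new Date(Date.now() - 30 * 60 * 1000).toISOString())); // true
```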