claude-home/legacy/headless-claude/PRD.md
Cal Corum babf062d6a docs: archive headless-claude design docs to legacy/
Original planning folder (no git repo) for the server diagnostics system
that runs on CT 300. Live deployment is on claude-runner; this preserves
the Agent SDK reference, PRD with Phase 2/3 roadmap, and N8N workflow designs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-01 08:15:13 -06:00

Product Requirements Document: N8N-to-Claude Code Automated Server Troubleshooting

Document Information

  • Version: 2.0
  • Last Updated: December 19, 2025
  • Author: System Architect
  • Status: Architecture Finalized

1. Executive Summary

This PRD defines the implementation of an automated server troubleshooting system that integrates N8N workflow automation with Claude Code's headless mode to detect, diagnose, and resolve home server issues with minimal human intervention.

1.1 Problem Statement

Currently, home server errors require manual detection, log analysis, and troubleshooting. This reactive approach leads to:

  • Extended downtime between error occurrence and resolution
  • Manual context-gathering across multiple systems
  • Repetitive troubleshooting of common issues
  • Lack of automated diagnostic workflows

1.2 Solution Overview

An automated pipeline in which N8N health checks trigger Claude Code in headless mode, enabling AI-powered autonomous troubleshooting. Claude Code uses a custom Skill with an embedded Python client (not MCP) for server diagnostics and stores results in MemoryGraph for pattern learning.

1.3 Key Architectural Decisions

| Decision | Choice | Rationale |
|---|---|---|
| Tool Execution | Claude Code Skill with Python client | Simpler than MCP, integrated with existing Proxmox skill |
| N8N Integration | Direct CLI invocation | No Bridge API needed, simpler architecture |
| Deployment | Dedicated LXC on Proxmox | Isolated, snapshotable, always-on |
| Authentication | Max subscription (OAuth) | No API key management, generous rate limits included |
| Server Access | SSH from Claude Code LXC | Key-based auth, secure |
| Command Control | Allow/Deny lists in settings.json | Native Claude Code integration |
| Phase 1 Scope | Docker containers | Most common failure point |
| Remediation | Low-risk commands enabled | e.g., docker restart |
| Memory | MemoryGraph integration | Pattern learning across sessions |
| Notifications | Discord | Reliable, existing infrastructure |

2. Goals and Success Metrics

2.1 Goals

  • Primary: Enable autonomous troubleshooting of home server errors without human intervention
  • Secondary: Reduce mean time to diagnosis (MTTD) for server issues
  • Tertiary: Create a reusable framework for AI-assisted infrastructure management

2.2 Success Metrics

| Metric | Target | Measurement Method |
|---|---|---|
| Automated issue detection | 100% | N8N health check coverage |
| Successful autonomous diagnosis | >70% | Issues diagnosed without human input |
| Mean time to diagnosis | <5 minutes | Time from error to root cause identification |
| False positive rate | <10% | Incorrect diagnoses / total diagnoses |
| System uptime improvement | +15% | Compare pre/post implementation |

2.3 Non-Goals

  • Real-time monitoring dashboard (use existing N8N UI)
  • Multi-user access control (single-user home server environment)
  • Integration with commercial monitoring platforms (Datadog, New Relic, etc.)
  • High-risk autonomous fixes (shutdown, delete, format) without approval

3. User Stories

3.1 Primary User: Home Server Administrator

Story 1: Autonomous Error Response

AS A home server administrator
I WANT N8N to automatically trigger Claude Code when errors occur
SO THAT troubleshooting begins immediately without my intervention

ACCEPTANCE CRITERIA:
- N8N health check detects error within 60 seconds
- Claude Code receives error context within 5 seconds of detection
- Troubleshooting begins autonomously without manual trigger

Story 2: Comprehensive Diagnostic Access

AS A home server administrator
I WANT Claude Code to have access to server logs, metrics, and diagnostic commands
SO THAT it can perform thorough root cause analysis

ACCEPTANCE CRITERIA:
- Claude Code can read system logs via Skill's Python client
- Claude Code can execute whitelisted diagnostic commands via SSH
- All diagnostic actions are logged for audit

Story 3: Actionable Troubleshooting Output

AS A home server administrator
I WANT Claude Code to provide structured troubleshooting results
SO THAT I can quickly understand the issue and recommended actions

ACCEPTANCE CRITERIA:
- Output includes: root cause, severity, recommended actions
- Output is stored on NAS for historical reference
- Discord notifications sent for critical issues
- Results stored in MemoryGraph for pattern learning

Story 4: Safe Autonomous Operations

AS A home server administrator
I WANT controls on what Claude Code can execute autonomously
SO THAT critical systems are protected from unintended changes

ACCEPTANCE CRITERIA:
- Read-only operations execute without approval
- Low-risk write operations (docker restart) allowed via whitelist
- Critical commands blocked via deny list in settings.json

4. System Architecture

4.1 Component Overview

┌─────────────────────────────────────────────────────────────────────────┐
│                        Proxmox Host Network                              │
│                                                                          │
│  ┌──────────────────┐                                                   │
│  │  Docker Services │  ← Phase 1 Target                                 │
│  │  - Tdarr         │                                                   │
│  │  - Portainer     │                                                   │
│  │  - Other containers                                                  │
│  └────────┬─────────┘                                                   │
│           │                                                              │
│           │ SSH (key-based)                                             │
│           │                                                              │
│  ┌────────▼─────────┐         ┌─────────────────────────────────┐      │
│  │  N8N Container   │         │  Claude Code LXC                │      │
│  │  (Health Checks) │────────▶│  - Headless mode                │      │
│  │                  │  CLI    │  - Server Diagnostics Skill     │      │
│  │  - Cron triggers │  invoke │  - SSH keys installed           │      │
│  │  - Docker API    │         │  - Max subscription OAuth       │      │
│  │  - Discord notify│◀────────│  - MemoryGraph client           │      │
│  └──────────────────┘  JSON   └────────┬────────────────────────┘      │
│                        output          │                                │
│                                        │ Python client                  │
│                                        ▼                                │
│                        ┌───────────────────────────────────────┐       │
│                        │  Server Diagnostics Skill             │       │
│                        │  ~/.claude/skills/server-diagnostics/ │       │
│                        │                                       │       │
│                        │  ├── SKILL.md (context & workflows)   │       │
│                        │  ├── client.py (Python diagnostic lib)│       │
│                        │  └── config.yaml (server inventory)   │       │
│                        └───────────────────────────────────────┘       │
│                                                                          │
│  ┌──────────────────┐         ┌─────────────────────────────────┐      │
│  │  NAS Storage     │         │  MemoryGraph                    │      │
│  │  - Report output │         │  - Pattern storage              │      │
│  │  - Historical    │         │  - Solution recall              │      │
│  │    troubleshooting│        │  - Cross-session learning       │      │
│  └──────────────────┘         └─────────────────────────────────┘      │
└─────────────────────────────────────────────────────────────────────────┘

4.2 Component Details

4.2.1 N8N Workflow Container

  • Technology: N8N (new deployment)
  • Responsibilities:
    • Execute periodic health checks on Docker containers
    • Detect error conditions based on defined thresholds
    • Aggregate error context (logs, timestamps, affected services)
    • Invoke Claude Code directly via Execute Command node
    • Parse JSON response from Claude Code
    • Send Discord notifications with recommendations
    • (Optional) Auto-execute low-risk remediation actions

4.2.2 Claude Code LXC Container

  • Technology: Dedicated LXC on Proxmox
  • Resources: 2 vCPU, 2GB RAM, 16GB disk
  • Authentication: Claude Max subscription (device code OAuth flow, credentials persist in ~/.claude/)
  • Responsibilities:
    • Execute in headless mode when triggered by N8N
    • Load Server Diagnostics Skill automatically
    • Use Python client to gather diagnostic information via SSH
    • Analyze error context and logs
    • Generate structured JSON troubleshooting output
    • Store learnings in MemoryGraph

4.2.3 Server Diagnostics Skill

  • Location: ~/.claude/skills/server-diagnostics/
  • Technology: Python client library with CLI wrapper
  • Extends: Existing Proxmox skill patterns
  • Responsibilities:
    • Provide context for troubleshooting workflows
    • Expose Python functions for diagnostics:
      • read_logs(server, log_type, lines, filter)
      • run_diagnostic(server, command)
      • get_metrics(server, metric_type)
      • get_docker_status(server, container)
      • docker_restart(server, container) (low-risk remediation)
    • Enforce command whitelisting via skill configuration
    • Log all operations for audit

4.2.4 MemoryGraph Integration

  • Technology: Existing MemoryGraph skill
  • Responsibilities:
    • Store successful troubleshooting patterns
    • Recall similar past issues during diagnosis
    • Track which solutions worked/failed
    • Build knowledge graph of infrastructure issues

5. Technical Requirements

5.1 Claude Code Headless Mode Configuration

5.1.1 Invocation Pattern (from N8N)

claude -p "<prompt>" \
  --output-format json \
  --json-schema '<schema>' \
  --allowedTools "Read,Grep,Glob,Bash(python3 ~/.claude/skills/server-diagnostics/client.py *)"

Parameters:

  • -p : Prompt containing error context and instructions
  • --output-format json : Structured JSON output for parsing
  • --json-schema : Enforces structured output shape (result in structured_output field)
  • --allowedTools : Pre-approve safe tools with prefix-matched Bash scoping (space before * is significant)
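
For a sense of how a caller might consume this output, the sketch below parses the JSON printed by the command and pulls out the `structured_output` field named above. The helper name `extract_diagnosis` and the sample envelope are hypothetical, and the real envelope carries other fields that this sketch ignores.

```python
import json

# Keys that the schema in this PRD marks as required.
REQUIRED_KEYS = ("root_cause", "severity", "affected_services")


def extract_diagnosis(stdout: str) -> dict:
    """Return the schema-shaped diagnosis from headless-mode JSON output.

    Assumes the output is a JSON object whose `structured_output` field
    holds the result, as described in the parameter list above.
    """
    envelope = json.loads(stdout)
    diagnosis = envelope.get("structured_output") if isinstance(envelope, dict) else None
    if not isinstance(diagnosis, dict):
        raise ValueError("no structured_output object in Claude Code output")
    missing = [key for key in REQUIRED_KEYS if key not in diagnosis]
    if missing:
        raise ValueError(f"diagnosis is missing required keys: {missing}")
    return diagnosis


# Hypothetical sample envelope, trimmed to the one field this sketch reads.
sample = json.dumps({
    "structured_output": {
        "root_cause": "tdarr exited after running out of memory",
        "severity": "high",
        "affected_services": ["tdarr"],
    }
})
print(extract_diagnosis(sample)["severity"])
```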

5.1.2 Prompt Template

You are troubleshooting a server issue. Use the server-diagnostics skill.

Server: {server_name}
Error Type: {error_type}
Timestamp: {timestamp}
Error Message: {error_message}

Initial Context from N8N:
{context_from_n8n}

Instructions:
1. Use the Python diagnostic client to gather additional information
2. Check MemoryGraph for similar past issues
3. Analyze the root cause
4. If appropriate, execute low-risk remediation (e.g., docker restart)
5. Store learnings in MemoryGraph

Output a JSON response with:
{
  "root_cause": "string describing the root cause",
  "severity": "low" | "medium" | "high" | "critical",
  "affected_services": ["list", "of", "services"],
  "diagnosis_steps": ["steps", "taken", "to", "diagnose"],
  "recommended_actions": [
    {
      "action": "description",
      "command": "actual command",
      "risk_level": "none" | "low" | "medium" | "high",
      "executed": true | false
    }
  ],
  "remediation_performed": "description of any auto-remediation done",
  "memory_graph_entries": ["list of memories stored"],
  "additional_context": "any other relevant information"
}

5.2 Server Diagnostics Skill Specification

5.2.1 Skill Structure

~/.claude/skills/server-diagnostics/
├── SKILL.md              # Skill context and workflows
├── client.py             # Python diagnostic library
├── config.yaml           # Server inventory and settings
└── commands/
    ├── docker.py         # Docker-specific diagnostics
    ├── system.py         # System-level diagnostics
    └── network.py        # Network diagnostics

5.2.2 Python Client Interface

# client.py - Core diagnostic functions
from typing import Literal

class ServerDiagnostics:
    def __init__(self, config_path: str = "config.yaml"):
        """Initialize with server inventory from config."""

    def read_logs(
        self,
        server: str,
        log_type: Literal["system", "docker", "application", "custom"],
        lines: int = 100,
        filter: str | None = None
    ) -> str:
        """Read logs from specified server via SSH."""

    def run_diagnostic(
        self,
        server: str,
        command: Literal[
            "disk_usage", "memory_usage", "cpu_usage",
            "process_list", "network_status", "docker_ps",
            "service_status", "port_check"
        ],
        params: dict | None = None
    ) -> dict:
        """Execute whitelisted diagnostic command."""

    def get_docker_status(
        self,
        server: str,
        container: str | None = None
    ) -> dict:
        """Get Docker container status and health."""

    def docker_restart(
        self,
        server: str,
        container: str
    ) -> dict:
        """Restart a Docker container (low-risk remediation)."""

    def get_metrics(
        self,
        server: str,
        metric_type: Literal["cpu", "memory", "disk", "network", "all"] = "all"
    ) -> dict:
        """Get current system metrics."""

5.2.3 Configuration File (config.yaml)

# Server inventory and skill settings
servers:
  proxmox-host:
    hostname: 192.168.1.100
    ssh_user: root
    ssh_key: ~/.ssh/claude_diagnostics_key
    docker_socket: /var/run/docker.sock

# Containers to monitor (Phase 1)
docker_containers:
  - name: tdarr
    critical: true
    restart_allowed: true
  - name: portainer
    critical: true
    restart_allowed: true
  - name: n8n
    critical: true
    restart_allowed: false  # Don't restart yourself!

# Command whitelist (maps to actual commands)
diagnostic_commands:
  disk_usage: "df -h"
  memory_usage: "free -h"
  cpu_usage: "top -bn1 | head -20"
  process_list: "ps aux --sort=-%mem | head -20"
  network_status: "ss -tuln"
  docker_ps: "docker ps -a --format 'table {{.Names}}\t{{.Status}}\t{{.Ports}}'"
  service_status: "systemctl status {service}"
  port_check: "nc -zv {host} {port}"

# Remediation commands (low-risk only)
remediation_commands:
  docker_restart: "docker restart {container}"
  docker_logs: "docker logs --tail 500 {container}"

# Denied commands (never execute)
denied_patterns:
  - "rm -rf"
  - "dd if="
  - "mkfs"
  - ":(){:|:&};:"
  - "shutdown"
  - "reboot"
  - "init 0"
  - "systemctl stop"
  - "> /dev/sd"

5.3 Command Control in settings.json

Claude Code's ~/.claude/settings.json will include:

{
  "permissions": {
    "allow": [
      "Bash(python3 ~/.claude/skills/server-diagnostics/client.py *)",
      "Bash(ssh proxmox-host docker *)",
      "Bash(ssh proxmox-host systemctl status *)",
      "Bash(ssh proxmox-host journalctl *)",
      "Bash(ssh proxmox-host df *)",
      "Bash(ssh proxmox-host free *)",
      "Bash(ssh proxmox-host ps *)",
      "Bash(ssh proxmox-host top *)",
      "Bash(ssh proxmox-host ss *)",
      "Bash(ssh proxmox-host nc -zv *)"
    ],
    "deny": [
      "Bash(rm -rf *)",
      "Bash(dd *)",
      "Bash(mkfs *)",
      "Bash(shutdown *)",
      "Bash(reboot *)",
      "Bash(init *)",
      "Bash(*> /dev/sd*)"
    ]
  }
}

Note: The --allowedTools flag uses permission rule syntax. The space before * is significant: Bash(python3 *) matches commands starting with "python3 " (trailing space included), while Bash(python3*) would also match python3something.
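
The prefix behavior described in the note can be modeled in a few lines. This is only an approximation written for illustration: `bash_rule_matches` is a hypothetical helper, it handles a single trailing `*` and nothing else (not the leading wildcard used in the deny list), and it is not Claude Code's actual matcher.

```python
def bash_rule_matches(rule: str, command: str) -> bool:
    """Approximate trailing-wildcard matching for a Bash(...) permission rule."""
    if not (rule.startswith("Bash(") and rule.endswith(")")):
        raise ValueError(f"not a Bash rule: {rule!r}")
    pattern = rule[len("Bash("):-1]
    if pattern.endswith("*"):
        # Everything before the * must match literally, including any space.
        return command.startswith(pattern[:-1])
    return command == pattern


print(bash_rule_matches("Bash(python3 *)", "python3 client.py"))  # True
print(bash_rule_matches("Bash(python3 *)", "python3something"))   # False
print(bash_rule_matches("Bash(python3*)", "python3something"))    # True
```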

5.4 N8N Execute Command Configuration

# N8N Execute Command Node
claude -p "$(cat <<'EOF'
You are troubleshooting a server issue. Use the server-diagnostics skill.

Server: {{ $json.server }}
Error Type: {{ $json.error_type }}
Timestamp: {{ $json.timestamp }}
Error Message: {{ $json.error_message }}

Initial Context:
{{ $json.context }}

[... rest of prompt template ...]
EOF
)" --output-format json \
  --json-schema '{ "type": "object", "properties": { "root_cause": { "type": "string" }, "severity": { "type": "string", "enum": ["low","medium","high","critical"] }, "affected_services": { "type": "array", "items": { "type": "string" } }, "diagnosis_steps": { "type": "array", "items": { "type": "string" } }, "recommended_actions": { "type": "array", "items": { "type": "object" } }, "remediation_performed": { "type": "string" }, "additional_context": { "type": "string" } }, "required": ["root_cause","severity","affected_services"] }' \
  --allowedTools "Read,Grep,Glob,Bash(python3 ~/.claude/skills/server-diagnostics/client.py *)"

5.5 Security Requirements

5.5.1 SSH Security

  • Dedicated SSH key pair for Claude Code LXC
  • Key installed only on target servers (Proxmox host initially)
  • No password authentication
  • Key has restricted permissions (read-only where possible)

5.5.2 Command Control

  • Allow list in settings.json for pre-approved commands
  • Deny list blocks dangerous operations
  • Skill config.yaml provides additional validation layer
  • Denied patterns checked before command execution

5.5.3 Network Security

  • Claude Code LXC on internal network only
  • N8N accesses Claude Code via local exec (same Proxmox host)
  • No external API exposure

5.5.4 Audit Logging

  • All diagnostic commands logged with:
    • Timestamp
    • Command executed
    • Server targeted
    • Result summary
    • Session context
  • Logs stored on NAS alongside reports
  • MemoryGraph entries provide additional audit trail

6. Data Flow

6.1 Happy Path: Error Detection to Resolution

1. N8N Schedule Trigger (every 60s)
   ↓
2. N8N Health Check (Execute Command: docker ps, check container health)
   ↓
3. Error Detected (container unhealthy or stopped)
   ↓
4. N8N aggregates context:
   - Container name and status
   - Recent docker logs (last 50 lines)
   - Current timestamp
   ↓
5. N8N Execute Command Node:
   claude -p "{prompt with context}" --output-format json --allowedTools "..."
   ↓
6. Claude Code (headless):
   - Loads server-diagnostics skill
   - Recalls similar issues from MemoryGraph
   - Runs Python diagnostic client:
     * get_docker_status(proxmox-host, container)
     * read_logs(proxmox-host, docker, 200, "error")
     * get_metrics(proxmox-host, all)
   - Analyzes gathered data
   - If low-risk: executes docker_restart()
   - Stores learnings in MemoryGraph
   - Outputs structured JSON
   ↓
7. N8N parses JSON response
   ↓
8. N8N saves report to NAS
   ↓
9. N8N sends Discord notification:
   - Severity emoji
   - Root cause summary
   - Actions taken/recommended
   - Link to full report
   ↓
10. Administrator reviews (if needed)
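
Steps 2 and 3 above come down to classifying lines of `docker ps` output. The sketch below assumes the `name,status` line format used by the health check command in this document and treats a container as a problem when it is not running or is flagged unhealthy. The function name `find_problem_containers` is hypothetical.

```python
def find_problem_containers(docker_ps_output: str) -> list[dict]:
    """Return containers that are stopped or report an unhealthy status."""
    problems = []
    for line in docker_ps_output.splitlines():
        if not line.strip():
            continue
        # Split on the first comma only; the status text may contain commas.
        name, _, status = line.partition(",")
        running = status.startswith("Up")
        unhealthy = "(unhealthy)" in status
        if not running or unhealthy:
            problems.append({"container": name, "status": status})
    return problems


sample_output = (
    "tdarr,Exited (137) 2 minutes ago\n"
    "portainer,Up 3 days\n"
    "n8n,Up 2 hours (unhealthy)\n"
)
for problem in find_problem_containers(sample_output):
    print(problem["container"], "->", problem["status"])
```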

6.2 Error Handling

Scenario: Claude Code invocation fails

  • N8N captures stderr and exit code
  • N8N retries up to 3 times with exponential backoff
  • If all retries fail, send critical Discord alert
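
The retry policy in this scenario can be sketched as follows. The 2-second base delay is an assumption (the document does not give one), and `run_with_retries` is a hypothetical helper; in the deployed design this logic would live in the N8N workflow rather than in Python.

```python
import subprocess
import time
from typing import Callable


def run_with_retries(
    invoke: Callable[[], subprocess.CompletedProcess],
    max_retries: int = 3,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> subprocess.CompletedProcess:
    """Call invoke(); after a non-zero exit, retry with doubling delays.

    Returns the first successful result, or the last failed one once
    max_retries retries are used up, so the caller can raise the alert.
    """
    attempt = 0
    while True:
        result = invoke()
        if result.returncode == 0 or attempt == max_retries:
            return result
        sleep(base_delay * (2 ** attempt))  # 2s, 4s, 8s with the defaults
        attempt += 1
```

With the defaults, a persistently failing command runs four times in total (one initial attempt plus three retries) before the failed result is returned.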

Scenario: SSH connection fails

  • Python client returns error in response
  • Claude Code incorporates into analysis
  • Recommendation: "Unable to connect to {server}, check network/SSH"

Scenario: Malformed output from Claude Code

  • N8N attempts JSON parse
  • On failure, save raw output to NAS
  • Send Discord alert with raw output attached

Scenario: Remediation fails

  • Capture error from docker restart
  • Do not retry automatically
  • Report failure and suggest manual intervention

7. Implementation Phases

Phase 1: Foundation

Deliverables:

  • Claude Code LXC container provisioned and configured
  • SSH key pair generated and installed on Proxmox host
  • Server Diagnostics Skill with basic Docker tools
  • N8N workflow for Docker container health checks
  • Discord webhook integration
  • Basic MemoryGraph integration

Success Criteria:

  • N8N can trigger Claude Code headless
  • Claude Code can diagnose Docker container issues
  • Discord notifications received for test alerts
  • At least one troubleshooting result stored in MemoryGraph

Phase 2: Enhancement

Deliverables:

  • Complete diagnostic command library
  • Expand to system-level diagnostics (disk, memory, network)
  • Add more Docker containers to monitoring
  • Implement alert deduplication/fatigue detection
  • Enhanced prompt engineering for better diagnoses
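
Alert deduplication could take the form of a per-issue cooldown. The sketch below is one possible shape under stated assumptions: the `AlertCooldown` class, the 15-minute default, and keying on container plus error type are illustrative and not part of the current design.

```python
import time
from typing import Callable


class AlertCooldown:
    """Suppress repeat alerts for the same issue within a cooldown window."""

    def __init__(self, cooldown_seconds: float = 900.0,
                 clock: Callable[[], float] = time.monotonic):
        self._cooldown = cooldown_seconds
        self._clock = clock
        self._last_sent: dict[tuple[str, str], float] = {}

    def should_alert(self, container: str, error_type: str) -> bool:
        """Return True and record the time if this alert should be sent."""
        key = (container, error_type)
        now = self._clock()
        last = self._last_sent.get(key)
        if last is not None and now - last < self._cooldown:
            return False  # suppressed; the window is not extended
        self._last_sent[key] = now
        return True
```

Suppressed alerts do not reset the timer, so an issue that persists is reported again once the window has elapsed.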

Success Criteria:

  • System handles 10+ different error scenarios
  • MemoryGraph successfully recalls relevant past issues
  • <5% false positive rate achieved

Phase 3: Expansion

Deliverables:

  • Extend to additional Proxmox VMs/LXCs
  • Integration with existing Proxmox skill
  • NAS report archival and cleanup
  • SMS notifications (optional)
  • Documentation and runbooks

Success Criteria:

  • System achieves >70% autonomous diagnosis rate
  • All documentation complete
  • Full homelab coverage

8. Configuration Examples

8.1 Claude Code LXC Environment

# Authentication: Use Max subscription (no API key needed)
# Run 'claude' interactively once to authenticate via device code flow
# Credentials persist in ~/.claude/

# Skills are auto-loaded from:
# ~/.claude/skills/

8.2 N8N Workflow Nodes

# Trigger Node
type: Schedule
interval: 1
unit: minutes

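# Health Check Node
# -a lists stopped containers too; the filter keeps anything not Up, plus Up containers whose health status is not healthy
type: Execute Command
command: ssh proxmox-host "docker ps -a --format '{{.Names}},{{.Status}}' | grep -Ev ',Up [^(]*(\(healthy\))?$'"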

# Branch Node
condition: "{{ $json.stdout !== '' }}"

# Claude Code Node
type: Execute Command
command: |
  claude -p "..." --output-format json --json-schema '<schema>' --allowedTools "Read,Grep,Glob,Bash(python3 ~/.claude/skills/server-diagnostics/client.py *)"
timeout: 180000

# Parse JSON Node
type: Code
code: return JSON.parse($json.stdout).structured_output

# Discord Node
type: Discord Webhook
content: |
  **{{ $json.severity === 'critical' ? '🔴' : $json.severity === 'high' ? '🟠' : '🟡' }} Server Alert**

  **Root Cause:** {{ $json.root_cause }}
  **Affected:** {{ $json.affected_services.join(', ') }}
  **Auto-remediation:** {{ $json.remediation_performed || 'None' }}

  **Recommended Actions:**
  {{ $json.recommended_actions.map(a => '• ' + a.action).join('\n') }}  

8.3 Discord Webhook Setup

webhook_url: https://discord.com/api/webhooks/{id}/{token}
channel: #server-alerts
mentions:
  critical: "@here"
  high: ""
  medium: ""
  low: ""
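
To show how the mention mapping above could combine with the message layout from the workflow's Discord node, here is a small sketch that builds the webhook payload. It only constructs the JSON body (a `content` string); the function name `build_discord_payload` is hypothetical, and sending the payload is left to the N8N node.

```python
# Mention per severity, mirroring the mapping above.
MENTIONS = {"critical": "@here", "high": "", "medium": "", "low": ""}
# Emoji per severity, mirroring the Discord node; anything else gets yellow.
SEVERITY_EMOJI = {"critical": "🔴", "high": "🟠"}


def build_discord_payload(diagnosis: dict) -> dict:
    """Format a diagnosis dict as a Discord webhook message body."""
    severity = diagnosis["severity"]
    emoji = SEVERITY_EMOJI.get(severity, "🟡")
    actions = "\n".join(
        "• " + item["action"] for item in diagnosis.get("recommended_actions", [])
    )
    lines = [
        f"**{emoji} Server Alert**",
        "",
        f"**Root Cause:** {diagnosis['root_cause']}",
        f"**Affected:** {', '.join(diagnosis['affected_services'])}",
        f"**Auto-remediation:** {diagnosis.get('remediation_performed') or 'None'}",
        "",
        "**Recommended Actions:**",
        actions,
    ]
    mention = MENTIONS.get(severity, "")
    if mention:
        lines.insert(0, mention)  # put the mention on its own first line
    return {"content": "\n".join(lines)}
```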

9. Testing Strategy

9.1 Unit Tests

  • Python diagnostic client functions
  • SSH connection handling
  • Command whitelist/deny validation
  • JSON output parsing

9.2 Integration Tests

  • End-to-end: N8N → Claude Code → SSH → Response
  • MemoryGraph storage and recall
  • Discord notification delivery
  • NAS report storage

9.3 Simulated Failures

| Test Case | Simulation Method | Expected Behavior |
|---|---|---|
| Container crash | docker stop tdarr | Detect, diagnose, restart |
| High memory | stress-ng in container | Detect, identify process |
| Disk full | Create large temp file | Detect, recommend cleanup |
| Network issue | Block container port | Detect, diagnose connectivity |
| SSH failure | Temporarily revoke key | Graceful error, alert sent |

10. Risks and Mitigations

10.1 Technical Risks

| Risk | Impact | Probability | Mitigation |
|---|---|---|---|
| OAuth token expiry | Medium | Medium | Monitor for auth failures, re-authenticate when needed |
| SSH key compromise | Critical | Low | Rotate keys, minimal permissions |
| Claude hallucinations | Medium | Medium | Require approval for high-risk actions |
| N8N container failure | High | Low | Health check N8N itself |
| MemoryGraph corruption | Low | Low | Regular backups |

10.2 Operational Risks

| Risk | Impact | Probability | Mitigation |
|---|---|---|---|
| False positives | Medium | High | Tune thresholds, deduplication |
| Alert fatigue | Medium | Medium | Cooldown periods, severity filtering |
| Over-reliance | Medium | Low | Maintain manual runbooks |

11. Open Questions - RESOLVED

| Question | Resolution |
|---|---|
| MCP vs Skills? | Skills - simpler, extends existing Proxmox skill |
| Bridge API needed? | No - N8N calls Claude Code directly |
| Where to run Claude Code? | Dedicated LXC on Proxmox |
| Read-only vs remediation? | Low-risk remediation allowed (docker restart) |
| Command control method? | settings.json allow/deny lists |
| Phase 1 scope? | Docker containers on Proxmox host |
| Notification channel? | Discord (SMS future enhancement) |
| MemoryGraph integration? | Yes - pattern learning enabled |

12. Appendices

Appendix A: Glossary

  • N8N: Workflow automation platform
  • Claude Code: AI-powered coding assistant with headless mode
  • Headless Mode: Non-interactive execution mode for automation
  • Skill: Claude Code context/capability extension via markdown + code
  • MemoryGraph: Graph-based memory system for pattern storage
  • LXC: Linux Container on Proxmox
  • MTTD: Mean Time To Diagnosis

Appendix B: References

Appendix C: Change Log

| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2025-12-19 | System Architect | Initial draft (MCP-based) |
| 2.0 | 2025-12-19 | Cal + Jarvis | Architecture revision: Skills instead of MCP, direct N8N invocation, LXC deployment, MemoryGraph integration |

End of Document