# Product Requirements Document: N8N-to-Claude Code Automated Server Troubleshooting

## Document Information

- **Version:** 2.0
- **Last Updated:** December 19, 2025
- **Author:** System Architect
- **Status:** Architecture Finalized

-----

## 1. Executive Summary

This PRD defines the implementation of an automated server troubleshooting system that integrates N8N workflow automation with Claude Code's headless mode to detect, diagnose, and resolve home server issues with minimal human intervention.

### 1.1 Problem Statement

Currently, home server errors require manual detection, log analysis, and troubleshooting. This reactive approach leads to:

- Extended downtime between error occurrence and resolution
- Manual context-gathering across multiple systems
- Repetitive troubleshooting of common issues
- Lack of automated diagnostic workflows

### 1.2 Solution Overview

An automated pipeline where N8N health checks trigger Claude Code in headless mode, enabling AI-powered autonomous troubleshooting. Claude Code uses a custom **Skill with embedded Python client** (not MCP) for server diagnostics, with results stored in MemoryGraph for pattern learning.
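
As a rough illustration of the pipeline described above, the following Python sketch shows how a detected error might be turned into a headless Claude Code invocation. The `ErrorContext` fields, the `build_claude_command` and `run_troubleshooting` helpers, and the prompt wording are hypothetical. Only the `-p`, `--output-format json`, and `--allowedTools` flags and the tool scope come from this document.

```python
import json
import subprocess
from dataclasses import dataclass

# Tool scope copied from section 5.1.1 of this PRD.
ALLOWED_TOOLS = (
    "Read,Grep,Glob,"
    "Bash(python3 ~/.claude/skills/server-diagnostics/client.py *)"
)


@dataclass
class ErrorContext:
    """Hypothetical container for what the N8N health check detected."""
    server: str
    error_type: str
    error_message: str


def build_claude_command(ctx: ErrorContext) -> list[str]:
    """Build the argv list for a headless Claude Code run (illustrative)."""
    prompt = (
        "You are troubleshooting a server issue. "
        "Use the server-diagnostics skill.\n"
        f"Server: {ctx.server}\n"
        f"Error Type: {ctx.error_type}\n"
        f"Error Message: {ctx.error_message}"
    )
    return [
        "claude", "-p", prompt,
        "--output-format", "json",
        "--allowedTools", ALLOWED_TOOLS,
    ]


def run_troubleshooting(ctx: ErrorContext, timeout_s: int = 180) -> dict:
    """Invoke Claude Code and parse its JSON output; raises on failure."""
    proc = subprocess.run(
        build_claude_command(ctx),
        capture_output=True, text=True, timeout=timeout_s, check=True,
    )
    return json.loads(proc.stdout)
```

Passing the command as an argument list rather than a shell string sidesteps quoting problems when log excerpts in the error message contain quotes or newlines.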

### 1.3 Key Architectural Decisions

| Decision | Choice | Rationale |
|----------|--------|-----------|
| Tool Execution | Claude Code Skill with Python client | Simpler than MCP, integrated with existing Proxmox skill |
| N8N Integration | Direct CLI invocation | No Bridge API needed, simpler architecture |
| Deployment | Dedicated LXC on Proxmox | Isolated, snapshotable, always-on |
| Authentication | Max subscription (OAuth) | No API key management, generous rate limits included |
| Server Access | SSH from Claude Code LXC | Key-based auth, secure |
| Command Control | Allow/Deny lists in settings.json | Native Claude Code integration |
| Phase 1 Scope | Docker containers | Most common failure point |
| Remediation | Low-risk commands enabled | e.g., `docker restart` |
| Memory | MemoryGraph integration | Pattern learning across sessions |
| Notifications | Discord | Reliable, existing infrastructure |

-----

## 2. Goals and Success Metrics

### 2.1 Goals

- **Primary:** Enable autonomous troubleshooting of home server errors without human intervention
- **Secondary:** Reduce mean time to diagnosis (MTTD) for server issues
- **Tertiary:** Create a reusable framework for AI-assisted infrastructure management

### 2.2 Success Metrics

| Metric | Target | Measurement Method |
|--------|--------|-------------------|
| Automated issue detection | 100% | N8N health check coverage |
| Successful autonomous diagnosis | >70% | Issues diagnosed without human input |
| Mean time to diagnosis | <5 minutes | Time from error to root cause identification |
| False positive rate | <10% | Incorrect diagnoses / total diagnoses |
| System uptime improvement | +15% | Compare pre/post implementation |

### 2.3 Non-Goals

- Real-time monitoring dashboard (use existing N8N UI)
- Multi-user access control (single-user home server environment)
- Integration with commercial monitoring platforms (Datadog, New Relic, etc.)
- High-risk autonomous fixes (shutdown, delete, format) without approval

-----

## 3. User Stories

### 3.1 Primary User: Home Server Administrator

**Story 1: Autonomous Error Response**

```
AS A home server administrator
I WANT N8N to automatically trigger Claude Code when errors occur
SO THAT troubleshooting begins immediately without my intervention

ACCEPTANCE CRITERIA:
- N8N health check detects error within 60 seconds
- Claude Code receives error context within 5 seconds of detection
- Troubleshooting begins autonomously without manual trigger
```

**Story 2: Comprehensive Diagnostic Access**

```
AS A home server administrator
I WANT Claude Code to have access to server logs, metrics, and diagnostic commands
SO THAT it can perform thorough root cause analysis

ACCEPTANCE CRITERIA:
- Claude Code can read system logs via Skill's Python client
- Claude Code can execute whitelisted diagnostic commands via SSH
- All diagnostic actions are logged for audit
```

**Story 3: Actionable Troubleshooting Output**

```
AS A home server administrator
I WANT Claude Code to provide structured troubleshooting results
SO THAT I can quickly understand the issue and recommended actions

ACCEPTANCE CRITERIA:
- Output includes: root cause, severity, recommended actions
- Output is stored on NAS for historical reference
- Discord notifications sent for critical issues
- Results stored in MemoryGraph for pattern learning
```

**Story 4: Safe Autonomous Operations**

```
AS A home server administrator
I WANT controls on what Claude Code can execute autonomously
SO THAT critical systems are protected from unintended changes

ACCEPTANCE CRITERIA:
- Read-only operations execute without approval
- Low-risk write operations (docker restart) allowed via whitelist
- Critical commands blocked via deny list in settings.json
```

-----

## 4. System Architecture

### 4.1 Component Overview

```
┌─────────────────────────────────────────────────────────────────────────┐
│                          Proxmox Host Network                           │
│                                                                         │
│  ┌────────────────────┐                                                 │
│  │  Docker Services   │ ← Phase 1 Target                                │
│  │  - Tdarr           │                                                 │
│  │  - Portainer       │                                                 │
│  │  - Other containers│                                                 │
│  └─────────┬──────────┘                                                 │
│            │                                                            │
│            │ SSH (key-based)                                            │
│            │                                                            │
│  ┌─────────▼──────────┐         ┌─────────────────────────────────┐     │
│  │  N8N Container     │         │  Claude Code LXC                │     │
│  │  (Health Checks)   │────────▶│  - Headless mode                │     │
│  │                    │   CLI   │  - Server Diagnostics Skill     │     │
│  │  - Cron triggers   │ invoke  │  - SSH keys installed           │     │
│  │  - Docker API      │         │  - OAuth credentials set        │     │
│  │  - Discord notify  │◀────────│  - MemoryGraph client           │     │
│  └────────────────────┘  JSON   └────────┬────────────────────────┘     │
│                          output          │                              │
│                                          │ Python client                │
│                                          ▼                              │
│                        ┌───────────────────────────────────────┐        │
│                        │  Server Diagnostics Skill             │        │
│                        │  ~/.claude/skills/server-diagnostics/ │        │
│                        │                                       │        │
│                        │  ├── SKILL.md (context & workflows)   │        │
│                        │  ├── client.py (Python diagnostic lib)│        │
│                        │  └── config.yaml (server inventory)   │        │
│                        └───────────────────────────────────────┘        │
│                                                                         │
│  ┌────────────────────┐         ┌─────────────────────────────────┐     │
│  │  NAS Storage       │         │  MemoryGraph                    │     │
│  │  - Report output   │         │  - Pattern storage              │     │
│  │  - Historical      │         │  - Solution recall              │     │
│  │    troubleshooting │         │  - Cross-session learning       │     │
│  └────────────────────┘         └─────────────────────────────────┘     │
└─────────────────────────────────────────────────────────────────────────┘
```

### 4.2 Component Details

#### 4.2.1 N8N Workflow Container

- **Technology:** N8N (new deployment)
- **Responsibilities:**
  - Execute periodic health checks on Docker containers
  - Detect error conditions based on defined thresholds
  - Aggregate error context (logs, timestamps, affected services)
  - Invoke Claude Code directly via Execute Command node
  - Parse JSON response from Claude Code
  - Send Discord notifications with recommendations
  - (Optional) Auto-execute low-risk remediation actions

#### 4.2.2 Claude Code LXC Container

- **Technology:** Dedicated LXC on Proxmox
- **Resources:** 2 vCPU, 2GB RAM, 16GB disk
- **Authentication:** Claude Max subscription (device code OAuth flow, credentials persist in ~/.claude/)
- **Responsibilities:**
  - Execute in headless mode when triggered by N8N
  - Load Server Diagnostics Skill automatically
  - Use Python client to gather diagnostic information via SSH
  - Analyze error context and logs
  - Generate structured JSON troubleshooting output
  - Store learnings in MemoryGraph

#### 4.2.3 Server Diagnostics Skill

- **Location:** `~/.claude/skills/server-diagnostics/`
- **Technology:** Python client library with CLI wrapper
- **Extends:** Existing Proxmox skill patterns
- **Responsibilities:**
  - Provide context for troubleshooting workflows
  - Expose Python functions for diagnostics:
    - `read_logs(server, log_type, lines, filter)`
    - `run_diagnostic(server, command)`
    - `get_metrics(server, metric_type)`
    - `get_docker_status(server, container)`
    - `docker_restart(server, container)` (low-risk remediation)
  - Enforce command whitelisting via skill configuration
  - Log all operations for audit

#### 4.2.4 MemoryGraph Integration

- **Technology:** Existing MemoryGraph skill
- **Responsibilities:**
  - Store successful troubleshooting patterns
  - Recall similar past issues during diagnosis
  - Track which solutions worked/failed
  - Build knowledge graph of infrastructure issues

-----

## 5. Technical Requirements

### 5.1 Claude Code Headless Mode Configuration

#### 5.1.1 Invocation Pattern (from N8N)

```bash
claude -p "<prompt>" \
  --output-format json \
  --json-schema '<schema>' \
  --allowedTools "Read,Grep,Glob,Bash(python3 ~/.claude/skills/server-diagnostics/client.py *)"
```

**Parameters:**

- `-p`: Prompt containing error context and instructions
- `--output-format json`: Structured JSON output for parsing
- `--json-schema`: Enforces structured output shape (result in `structured_output` field)
- `--allowedTools`: Pre-approve safe tools with prefix-matched Bash scoping (space before `*` is significant)

#### 5.1.2 Prompt Template

```
You are troubleshooting a server issue. Use the server-diagnostics skill.

Server: {server_name}
Error Type: {error_type}
Timestamp: {timestamp}
Error Message: {error_message}

Initial Context from N8N:
{context_from_n8n}

Instructions:
1. Use the Python diagnostic client to gather additional information
2. Check MemoryGraph for similar past issues
3. Analyze the root cause
4. If appropriate, execute low-risk remediation (e.g., docker restart)
5. Store learnings in MemoryGraph

Output a JSON response with:
{
  "root_cause": "string describing the root cause",
  "severity": "low" | "medium" | "high" | "critical",
  "affected_services": ["list", "of", "services"],
  "diagnosis_steps": ["steps", "taken", "to", "diagnose"],
  "recommended_actions": [
    {
      "action": "description",
      "command": "actual command",
      "risk_level": "none" | "low" | "medium" | "high",
      "executed": true | false
    }
  ],
  "remediation_performed": "description of any auto-remediation done",
  "memory_graph_entries": ["list of memories stored"],
  "additional_context": "any other relevant information"
}
```

### 5.2 Server Diagnostics Skill Specification

#### 5.2.1 Skill Structure

```
~/.claude/skills/server-diagnostics/
├── SKILL.md          # Skill context and workflows
├── client.py         # Python diagnostic library
├── config.yaml       # Server inventory and settings
└── commands/
    ├── docker.py     # Docker-specific diagnostics
    ├── system.py     # System-level diagnostics
    └── network.py    # Network diagnostics
```

#### 5.2.2 Python Client Interface

```python
# client.py - Core diagnostic functions
from typing import Literal


class ServerDiagnostics:
    def __init__(self, config_path: str = "config.yaml"):
        """Initialize with server inventory from config."""

    def read_logs(
        self,
        server: str,
        log_type: Literal["system", "docker", "application", "custom"],
        lines: int = 100,
        filter: str | None = None
    ) -> str:
        """Read logs from specified server via SSH."""

    def run_diagnostic(
        self,
        server: str,
        command: Literal[
            "disk_usage", "memory_usage", "cpu_usage",
            "process_list", "network_status", "docker_ps",
            "service_status", "port_check"
        ],
        params: dict | None = None
    ) -> dict:
        """Execute whitelisted diagnostic command."""

    def get_docker_status(
        self,
        server: str,
        container: str | None = None
    ) -> dict:
        """Get Docker container status and health."""

    def docker_restart(
        self,
        server: str,
        container: str
    ) -> dict:
        """Restart a Docker container (low-risk remediation)."""

    def get_metrics(
        self,
        server: str,
        metric_type: Literal["cpu", "memory", "disk", "network", "all"] = "all"
    ) -> dict:
        """Get current system metrics."""
```

#### 5.2.3 Configuration File (config.yaml)

```yaml
# Server inventory and skill settings
servers:
  proxmox-host:
    hostname: 192.168.1.100
    ssh_user: root
    ssh_key: ~/.ssh/claude_diagnostics_key
    docker_socket: /var/run/docker.sock

# Containers to monitor (Phase 1)
docker_containers:
  - name: tdarr
    critical: true
    restart_allowed: true
  - name: portainer
    critical: true
    restart_allowed: true
  - name: n8n
    critical: true
    restart_allowed: false  # Don't restart yourself!

# Command whitelist (maps to actual commands)
diagnostic_commands:
  disk_usage: "df -h"
  memory_usage: "free -h"
  cpu_usage: "top -bn1 | head -20"
  process_list: "ps aux --sort=-%mem | head -20"
  network_status: "ss -tuln"
  docker_ps: "docker ps -a --format 'table {{.Names}}\t{{.Status}}\t{{.Ports}}'"
  service_status: "systemctl status {service}"
  port_check: "nc -zv {host} {port}"

# Remediation commands (low-risk only)
remediation_commands:
  docker_restart: "docker restart {container}"
  docker_logs: "docker logs --tail 500 {container}"

# Denied commands (never execute)
denied_patterns:
  - "rm -rf"
  - "dd if="
  - "mkfs"
  - ":(){:|:&};:"
  - "shutdown"
  - "reboot"
  - "init 0"
  - "systemctl stop"
  - "> /dev/sd"
```

### 5.3 Command Control in settings.json

Claude Code's `~/.claude/settings.json` will include:

```json
{
  "permissions": {
    "allow": [
      "Bash(python3 ~/.claude/skills/server-diagnostics/client.py *)",
      "Bash(ssh proxmox-host docker *)",
      "Bash(ssh proxmox-host systemctl status *)",
      "Bash(ssh proxmox-host journalctl *)",
      "Bash(ssh proxmox-host df *)",
      "Bash(ssh proxmox-host free *)",
      "Bash(ssh proxmox-host ps *)",
      "Bash(ssh proxmox-host top *)",
      "Bash(ssh proxmox-host ss *)",
      "Bash(ssh proxmox-host nc -zv *)"
    ],
    "deny": [
      "Bash(rm -rf *)",
      "Bash(dd *)",
      "Bash(mkfs *)",
      "Bash(shutdown *)",
      "Bash(reboot *)",
      "Bash(init *)",
      "Bash(*> /dev/sd*)"
    ]
  }
}
```

> **Note:** The `--allowedTools` flag uses [permission rule
syntax](https://code.claude.com/docs/en/settings#permission-rule-syntax). The space before `*` is significant: `Bash(python3 *)` matches commands starting with `python3 `, while `Bash(python3*)` would also match `python3something`.

### 5.4 N8N Execute Command Configuration

```bash
# N8N Execute Command Node
claude -p "$(cat <<'EOF'
You are troubleshooting a server issue. Use the server-diagnostics skill.

Server: {{ $json.server }}
Error Type: {{ $json.error_type }}
Timestamp: {{ $json.timestamp }}
Error Message: {{ $json.error_message }}

Initial Context:
{{ $json.context }}

[... rest of prompt template ...]
EOF
)" --output-format json \
  --json-schema '{
    "type": "object",
    "properties": {
      "root_cause": { "type": "string" },
      "severity": { "type": "string", "enum": ["low","medium","high","critical"] },
      "affected_services": { "type": "array", "items": { "type": "string" } },
      "diagnosis_steps": { "type": "array", "items": { "type": "string" } },
      "recommended_actions": { "type": "array", "items": { "type": "object" } },
      "remediation_performed": { "type": "string" },
      "additional_context": { "type": "string" }
    },
    "required": ["root_cause","severity","affected_services"]
  }' \
  --allowedTools "Read,Grep,Glob,Bash(python3 ~/.claude/skills/server-diagnostics/client.py *)"
```

### 5.5 Security Requirements

#### 5.5.1 SSH Security

- Dedicated SSH key pair for Claude Code LXC
- Key installed only on target servers (Proxmox host initially)
- No password authentication
- Key has restricted permissions (read-only where possible)

#### 5.5.2 Command Control

- Allow list in `settings.json` for pre-approved commands
- Deny list blocks dangerous operations
- Skill config.yaml provides additional validation layer
- Denied patterns checked before command execution

#### 5.5.3 Network Security

- Claude Code LXC on internal network only
- N8N accesses Claude Code via local exec (same Proxmox host)
- No external API exposure

#### 5.5.4 Audit Logging

- All diagnostic commands logged with:
  - Timestamp
  - Command executed
  - Server targeted
  - Result summary
  - Session context
- Logs stored on NAS alongside reports
- MemoryGraph entries provide additional audit trail

-----

## 6. Data Flow

### 6.1 Happy Path: Error Detection to Resolution

```
1. N8N Schedule Trigger (every 60s)
   ↓
2. N8N Health Check (Execute Command: docker ps, check container health)
   ↓
3. Error Detected (container unhealthy or stopped)
   ↓
4. N8N aggregates context:
   - Container name and status
   - Recent docker logs (last 50 lines)
   - Current timestamp
   ↓
5. N8N Execute Command Node:
   claude -p "{prompt with context}" --output-format json --allowedTools "..."
   ↓
6. Claude Code (headless):
   - Loads server-diagnostics skill
   - Recalls similar issues from MemoryGraph
   - Runs Python diagnostic client:
     * get_docker_status(proxmox-host, container)
     * read_logs(proxmox-host, docker, 200, "error")
     * get_metrics(proxmox-host, all)
   - Analyzes gathered data
   - If low-risk: executes docker_restart()
   - Stores learnings in MemoryGraph
   - Outputs structured JSON
   ↓
7. N8N parses JSON response
   ↓
8. N8N saves report to NAS
   ↓
9. N8N sends Discord notification:
   - Severity emoji
   - Root cause summary
   - Actions taken/recommended
   - Link to full report
   ↓
10. Administrator reviews (if needed)
```

### 6.2 Error Handling

**Scenario: Claude Code invocation fails**

- N8N captures stderr and exit code
- N8N retries up to 3 times with exponential backoff
- If all retries fail, send critical Discord alert

**Scenario: SSH connection fails**

- Python client returns error in response
- Claude Code incorporates into analysis
- Recommendation: "Unable to connect to {server}, check network/SSH"

**Scenario: Malformed output from Claude Code**

- N8N attempts JSON parse
- On failure, save raw output to NAS
- Send Discord alert with raw output attached

**Scenario: Remediation fails**

- Capture error from docker restart
- Do not retry automatically
- Report failure and suggest manual intervention

-----

## 7. Implementation Phases

### Phase 1: Foundation

**Deliverables:**

- Claude Code LXC container provisioned and configured
- SSH key pair generated and installed on Proxmox host
- Server Diagnostics Skill with basic Docker tools
- N8N workflow for Docker container health checks
- Discord webhook integration
- Basic MemoryGraph integration

**Success Criteria:**

- N8N can trigger Claude Code headless
- Claude Code can diagnose Docker container issues
- Discord notifications received for test alerts
- At least one troubleshooting result stored in MemoryGraph

### Phase 2: Enhancement

**Deliverables:**

- Complete diagnostic command library
- Expand to system-level diagnostics (disk, memory, network)
- Add more Docker containers to monitoring
- Implement alert deduplication/fatigue detection
- Enhanced prompt engineering for better diagnoses

**Success Criteria:**

- System handles 10+ different error scenarios
- MemoryGraph successfully recalls relevant past issues
- <5% false positive rate achieved

### Phase 3: Expansion

**Deliverables:**

- Extend to additional Proxmox VMs/LXCs
- Integration with existing Proxmox skill
- NAS report archival and cleanup
- SMS notifications (optional)
- Documentation and runbooks

**Success Criteria:**

- System achieves >70% autonomous diagnosis rate
- All documentation complete
- Full homelab coverage

-----

## 8. Configuration Examples

### 8.1 Claude Code LXC Environment

```bash
# Authentication: Use Max subscription (no API key needed)
# Run 'claude' interactively once to authenticate via device code flow
# Credentials persist in ~/.claude/

# Skills are auto-loaded from:
# ~/.claude/skills/
```

### 8.2 N8N Workflow Nodes

```yaml
# Trigger Node
type: Schedule
interval: 1
unit: minutes

# Health Check Node
# -a includes stopped containers; the grep keeps stopped or unhealthy ones
type: Execute Command
command: ssh proxmox-host "docker ps -a --format '{{.Names}},{{.Status}}' | grep -E ',(Exited|Restarting|Dead)|\(unhealthy\)'"

# Branch Node
condition: "{{ $json.stdout !== '' }}"

# Claude Code Node
type: Execute Command
command: |
  claude -p "..." \
    --output-format json \
    --json-schema '...' \
    --allowedTools "Read,Grep,Glob,Bash(python3 ~/.claude/skills/server-diagnostics/client.py *)"
timeout: 180000

# Parse JSON Node
# With --json-schema, the structured result is in the structured_output field (see 5.1.1)
type: Code
code: return JSON.parse($json.stdout).structured_output

# Discord Node
type: Discord Webhook
content: |
  **{{ $json.severity === 'critical' ? '🔴' : $json.severity === 'high' ? '🟠' : '🟡' }} Server Alert**

  **Root Cause:** {{ $json.root_cause }}
  **Affected:** {{ $json.affected_services.join(', ') }}
  **Auto-remediation:** {{ $json.remediation_performed || 'None' }}

  **Recommended Actions:**
  {{ $json.recommended_actions.map(a => '• ' + a.action).join('\n') }}
```

### 8.3 Discord Webhook Setup

```yaml
webhook_url: https://discord.com/api/webhooks/{id}/{token}
channel: "#server-alerts"  # quoted so YAML does not read it as a comment
mentions:
  critical: "@here"
  high: ""
  medium: ""
  low: ""
```

-----

## 9. Testing Strategy

### 9.1 Unit Tests

- Python diagnostic client functions
- SSH connection handling
- Command whitelist/deny validation
- JSON output parsing

### 9.2 Integration Tests

- End-to-end: N8N → Claude Code → SSH → Response
- MemoryGraph storage and recall
- Discord notification delivery
- NAS report storage

### 9.3 Simulated Failures

| Test Case | Simulation Method | Expected Behavior |
|-----------|-------------------|-------------------|
| Container crash | `docker stop tdarr` | Detect, diagnose, restart |
| High memory | stress-ng in container | Detect, identify process |
| Disk full | Create large temp file | Detect, recommend cleanup |
| Network issue | Block container port | Detect, diagnose connectivity |
| SSH failure | Temporarily revoke key | Graceful error, alert sent |

-----

## 10. Risks and Mitigations

### 10.1 Technical Risks

| Risk | Impact | Probability | Mitigation |
|------|--------|-------------|------------|
| OAuth token expiry | Medium | Medium | Monitor for auth failures, re-authenticate when needed |
| SSH key compromise | Critical | Low | Rotate keys, minimal permissions |
| Claude hallucinations | Medium | Medium | Require approval for high-risk actions |
| N8N container failure | High | Low | Health check N8N itself |
| MemoryGraph corruption | Low | Low | Regular backups |

### 10.2 Operational Risks

| Risk | Impact | Probability | Mitigation |
|------|--------|-------------|------------|
| False positives | Medium | High | Tune thresholds, deduplication |
| Alert fatigue | Medium | Medium | Cooldown periods, severity filtering |
| Over-reliance | Medium | Low | Maintain manual runbooks |

-----

## 11. Open Questions - RESOLVED

| Question | Resolution |
|----------|------------|
| MCP vs Skills? | **Skills** - simpler, extends existing Proxmox skill |
| Bridge API needed? | **No** - N8N calls Claude Code directly |
| Where to run Claude Code? | **Dedicated LXC** on Proxmox |
| Read-only vs remediation? | **Low-risk remediation allowed** (docker restart) |
| Command control method? | **settings.json allow/deny lists** |
| Phase 1 scope? | **Docker containers** on Proxmox host |
| Notification channel? | **Discord** (SMS future enhancement) |
| MemoryGraph integration? | **Yes** - pattern learning enabled |

-----

## 12. Appendices

### Appendix A: Glossary

- **N8N:** Workflow automation platform
- **Claude Code:** AI-powered coding assistant with headless mode
- **Headless Mode:** Non-interactive execution mode for automation
- **Skill:** Claude Code context/capability extension via markdown + code
- **MemoryGraph:** Graph-based memory system for pattern storage
- **LXC:** Linux Container on Proxmox
- **MTTD:** Mean Time To Diagnosis

### Appendix B: References

- [Claude Code CLI Reference](https://docs.anthropic.com/en/docs/claude-code/cli)
- [Claude Code Skills Documentation](https://docs.anthropic.com/en/docs/claude-code/skills)
- [N8N Execute Command Node](https://docs.n8n.io/integrations/builtin/core-nodes/n8n-nodes-base.executecommand/)
- [Discord Webhooks Guide](https://discord.com/developers/docs/resources/webhook)

### Appendix C: Change Log

| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 1.0 | 2025-12-19 | System Architect | Initial draft (MCP-based) |
| 2.0 | 2025-12-19 | Cal + Jarvis | Architecture revision: Skills instead of MCP, direct N8N invocation, LXC deployment, MemoryGraph integration |

-----

**End of Document**