docs: archive headless-claude design docs to legacy/

Original planning folder (no git repo) for the server diagnostics system
that runs on CT 300. Live deployment is on claude-runner; this preserves
the Agent SDK reference, PRD with Phase 2/3 roadmap, and N8N workflow designs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Cal Corum 2026-03-01 08:15:13 -06:00
parent 28abde7c9f
commit babf062d6a
9 changed files with 3242 additions and 0 deletions


@@ -0,0 +1,65 @@
# Headless Claude - N8N Server Diagnostics
Automated server monitoring system: N8N triggers Claude Code in headless mode to diagnose and remediate home server issues.
## Agent SDK Reference
Full docs: https://code.claude.com/docs/en/headless
CLI reference: https://code.claude.com/docs/en/cli-reference
Agent SDK (Python/TS): https://platform.claude.com/docs/en/agent-sdk/overview
## Key CLI Patterns
### Invocation
All headless invocations use `-p` (print/prompt mode). This is non-interactive — no user prompts, no skill slash commands.
```bash
claude -p "<prompt>" \
--output-format json \
--json-schema '<schema>' \
--allowedTools "Read,Grep,Glob,Bash(python3 ~/.claude/skills/server-diagnostics/client.py *)"
```
### Structured Output (`--json-schema`)
- When `--json-schema` is provided, the response JSON has a `structured_output` field conforming to the schema
- Falls back to `result` field (free text) if schema is not used
- Always parse `structured_output` first, `result` as fallback
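The fallback order above can be sketched in a few lines (the `structured_output` and `result` field names are from the response format; the sample payload is invented for illustration):

```python
import json

def parse_claude_response(raw: str):
    """Prefer structured_output; fall back to the free-text result field."""
    data = json.loads(raw)
    return data.get("structured_output") or data.get("result")

# Invented sample payload for illustration
raw = '{"type":"result","result":"fallback text","structured_output":{"root_cause":"oom"}}'
parsed = parse_claude_response(raw)  # returns the structured dict, not the free text
```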
### Tool Permissions (`--allowedTools`)
- Uses permission rule syntax with prefix matching
- **Space before `*` matters**: `Bash(python3 *)` matches `python3 <anything>`, but `Bash(python3*)` also matches `python3something`
- Scope Bash narrowly — prefer `Bash(python3 ~/.claude/skills/server-diagnostics/client.py *)` over broad `Bash`
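The prefix-matching behavior can be illustrated with a toy check (this mirrors the documented semantics, not Claude Code's actual implementation):

```python
def bash_rule_matches(prefix: str, command: str) -> bool:
    """Toy model of a Bash(<prefix>*) rule: plain string prefix matching."""
    return command.startswith(prefix)

# With the trailing space, only real python3 invocations match
assert bash_rule_matches("python3 ", "python3 ~/.claude/skills/server-diagnostics/client.py status")
assert not bash_rule_matches("python3 ", "python3-evil --flag")
# Without the space, lookalike binaries slip through
assert bash_rule_matches("python3", "python3something")
```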
### Session Resumption
```bash
session_id=$(claude -p "..." --output-format json | jq -r '.session_id')
claude -p "follow up" --resume "$session_id"
```
### System Prompt Customization
- `--append-system-prompt` adds to default behavior
- `--system-prompt` fully replaces default prompt
## Infrastructure
| Component | Location | Notes |
|---|---|---|
| Claude Code LXC | CT 300, 10.10.0.148 | Runs claude CLI, SSH keys to targets |
| N8N | CT 210, 10.10.0.210 | Workflow orchestration |
| Target server | paper-dynasty, 10.10.0.88 | Docker containers monitored |
| Diagnostics skill | `~/.claude/skills/server-diagnostics/` on CT 300 | Python client + config |
## Project Structure
- `PRD.md` — Product requirements (v2.0)
- `n8n-workflow-import.json` — Production N8N workflow (importable)
- `docs/n8n-workflow-design.md` — Full 10-node workflow design
- `docs/skill-architecture.md` — Server diagnostics skill spec
- `docs/lxc-setup-guide.md` — LXC provisioning
- `docs/n8n-setup-instructions.md` — N8N credential setup
## Phase Status
- Phase 1 (Foundation): **Complete** — live, polling every 5 min
- Phase 2 (Enhancement): Pending — alert dedup, expanded monitoring
- Phase 3 (Expansion): Pending — multi-server, Proxmox VMs/LXCs


@@ -0,0 +1,786 @@
# Product Requirements Document: N8N-to-Claude Code Automated Server Troubleshooting
## Document Information
- **Version:** 2.0
- **Last Updated:** December 19, 2025
- **Author:** System Architect
- **Status:** Architecture Finalized
-----
## 1. Executive Summary
This PRD defines the implementation of an automated server troubleshooting system that integrates N8N workflow automation with Claude Code's headless mode to detect, diagnose, and resolve home server issues with minimal human intervention.
### 1.1 Problem Statement
Currently, home server errors require manual detection, log analysis, and troubleshooting. This reactive approach leads to:
- Extended downtime between error occurrence and resolution
- Manual context-gathering across multiple systems
- Repetitive troubleshooting of common issues
- Lack of automated diagnostic workflows
### 1.2 Solution Overview
An automated pipeline where N8N health checks trigger Claude Code in headless mode, enabling AI-powered autonomous troubleshooting. Claude Code uses a custom **Skill with embedded Python client** (not MCP) for server diagnostics, with results stored in MemoryGraph for pattern learning.
### 1.3 Key Architectural Decisions
| Decision | Choice | Rationale |
|----------|--------|-----------|
| Tool Execution | Claude Code Skill with Python client | Simpler than MCP, integrated with existing Proxmox skill |
| N8N Integration | Direct CLI invocation | No Bridge API needed, simpler architecture |
| Deployment | Dedicated LXC on Proxmox | Isolated, snapshotable, always-on |
| Authentication | Max subscription (OAuth) | No API key management, generous rate limits included |
| Server Access | SSH from Claude Code LXC | Key-based auth, secure |
| Command Control | Allow/Deny lists in settings.json | Native Claude Code integration |
| Phase 1 Scope | Docker containers | Most common failure point |
| Remediation | Low-risk commands enabled | e.g., `docker restart` |
| Memory | MemoryGraph integration | Pattern learning across sessions |
| Notifications | Discord | Reliable, existing infrastructure |
-----
## 2. Goals and Success Metrics
### 2.1 Goals
- **Primary:** Enable autonomous troubleshooting of home server errors without human intervention
- **Secondary:** Reduce mean time to diagnosis (MTTD) for server issues
- **Tertiary:** Create a reusable framework for AI-assisted infrastructure management
### 2.2 Success Metrics
| Metric | Target | Measurement Method |
|--------|--------|-------------------|
| Automated issue detection | 100% | N8N health check coverage |
| Successful autonomous diagnosis | >70% | Issues diagnosed without human input |
| Mean time to diagnosis | <5 minutes | Time from error to root cause identification |
| False positive rate | <10% | Incorrect diagnoses / total diagnoses |
| System uptime improvement | +15% | Compare pre/post implementation |
### 2.3 Non-Goals
- Real-time monitoring dashboard (use existing N8N UI)
- Multi-user access control (single-user home server environment)
- Integration with commercial monitoring platforms (Datadog, New Relic, etc.)
- High-risk autonomous fixes (shutdown, delete, format) without approval
-----
## 3. User Stories
### 3.1 Primary User: Home Server Administrator
**Story 1: Autonomous Error Response**
```
AS A home server administrator
I WANT N8N to automatically trigger Claude Code when errors occur
SO THAT troubleshooting begins immediately without my intervention
ACCEPTANCE CRITERIA:
- N8N health check detects error within 60 seconds
- Claude Code receives error context within 5 seconds of detection
- Troubleshooting begins autonomously without manual trigger
```
**Story 2: Comprehensive Diagnostic Access**
```
AS A home server administrator
I WANT Claude Code to have access to server logs, metrics, and diagnostic commands
SO THAT it can perform thorough root cause analysis
ACCEPTANCE CRITERIA:
- Claude Code can read system logs via Skill's Python client
- Claude Code can execute whitelisted diagnostic commands via SSH
- All diagnostic actions are logged for audit
```
**Story 3: Actionable Troubleshooting Output**
```
AS A home server administrator
I WANT Claude Code to provide structured troubleshooting results
SO THAT I can quickly understand the issue and recommended actions
ACCEPTANCE CRITERIA:
- Output includes: root cause, severity, recommended actions
- Output is stored on NAS for historical reference
- Discord notifications sent for critical issues
- Results stored in MemoryGraph for pattern learning
```
**Story 4: Safe Autonomous Operations**
```
AS A home server administrator
I WANT controls on what Claude Code can execute autonomously
SO THAT critical systems are protected from unintended changes
ACCEPTANCE CRITERIA:
- Read-only operations execute without approval
- Low-risk write operations (docker restart) allowed via whitelist
- Critical commands blocked via deny list in settings.json
```
-----
## 4. System Architecture
### 4.1 Component Overview
```
┌─────────────────────────────────────────────────────────────────────────┐
│ Proxmox Host Network │
│ │
│ ┌──────────────────┐ │
│ │ Docker Services │ ← Phase 1 Target │
│ │ - Tdarr │ │
│ │ - Portainer │ │
│ │ - Other containers │
│ └────────┬─────────┘ │
│ │ │
│ │ SSH (key-based) │
│ │ │
│ ┌────────▼─────────┐ ┌─────────────────────────────────┐ │
│ │ N8N Container │ │ Claude Code LXC │ │
│ │ (Health Checks) │────────▶│ - Headless mode │ │
│ │ │ CLI │ - Server Diagnostics Skill │ │
│ │ - Cron triggers │ invoke │ - SSH keys installed │ │
│  │ - Docker API    │         │ - OAuth creds ~/.claude         │    │
│ │ - Discord notify│◀────────│ - MemoryGraph client │ │
│ └──────────────────┘ JSON └────────┬────────────────────────┘ │
│ output │ │
│ │ Python client │
│ ▼ │
│ ┌───────────────────────────────────────┐ │
│ │ Server Diagnostics Skill │ │
│ │ ~/.claude/skills/server-diagnostics/ │ │
│ │ │ │
│ │ ├── SKILL.md (context & workflows) │ │
│ │ ├── client.py (Python diagnostic lib)│ │
│ │ └── config.yaml (server inventory) │ │
│ └───────────────────────────────────────┘ │
│ │
│ ┌──────────────────┐ ┌─────────────────────────────────┐ │
│ │ NAS Storage │ │ MemoryGraph │ │
│ │ - Report output │ │ - Pattern storage │ │
│ │ - Historical │ │ - Solution recall │ │
│ │ troubleshooting│ │ - Cross-session learning │ │
│ └──────────────────┘ └─────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
```
### 4.2 Component Details
#### 4.2.1 N8N Workflow Container
- **Technology:** N8N (new deployment)
- **Responsibilities:**
- Execute periodic health checks on Docker containers
- Detect error conditions based on defined thresholds
- Aggregate error context (logs, timestamps, affected services)
- Invoke Claude Code directly via Execute Command node
- Parse JSON response from Claude Code
- Send Discord notifications with recommendations
- (Optional) Auto-execute low-risk remediation actions
#### 4.2.2 Claude Code LXC Container
- **Technology:** Dedicated LXC on Proxmox
- **Resources:** 2 vCPU, 2GB RAM, 16GB disk
- **Authentication:** Claude Max subscription (device code OAuth flow, credentials persist in ~/.claude/)
- **Responsibilities:**
- Execute in headless mode when triggered by N8N
- Load Server Diagnostics Skill automatically
- Use Python client to gather diagnostic information via SSH
- Analyze error context and logs
- Generate structured JSON troubleshooting output
- Store learnings in MemoryGraph
#### 4.2.3 Server Diagnostics Skill
- **Location:** `~/.claude/skills/server-diagnostics/`
- **Technology:** Python client library with CLI wrapper
- **Extends:** Existing Proxmox skill patterns
- **Responsibilities:**
- Provide context for troubleshooting workflows
- Expose Python functions for diagnostics:
- `read_logs(server, log_type, lines, filter)`
- `run_diagnostic(server, command)`
- `get_metrics(server, metric_type)`
- `get_docker_status(server, container)`
- `docker_restart(server, container)` (low-risk remediation)
- Enforce command whitelisting via skill configuration
- Log all operations for audit
#### 4.2.4 MemoryGraph Integration
- **Technology:** Existing MemoryGraph skill
- **Responsibilities:**
- Store successful troubleshooting patterns
- Recall similar past issues during diagnosis
- Track which solutions worked/failed
- Build knowledge graph of infrastructure issues
-----
## 5. Technical Requirements
### 5.1 Claude Code Headless Mode Configuration
#### 5.1.1 Invocation Pattern (from N8N)
```bash
claude -p "<prompt>" \
--output-format json \
--json-schema '<schema>' \
--allowedTools "Read,Grep,Glob,Bash(python3 ~/.claude/skills/server-diagnostics/client.py *)"
```
**Parameters:**
- `-p` : Prompt containing error context and instructions
- `--output-format json` : Structured JSON output for parsing
- `--json-schema` : Enforces structured output shape (result in `structured_output` field)
- `--allowedTools` : Pre-approve safe tools with prefix-matched Bash scoping (space before `*` is significant)
#### 5.1.2 Prompt Template
```
You are troubleshooting a server issue. Use the server-diagnostics skill.
Server: {server_name}
Error Type: {error_type}
Timestamp: {timestamp}
Error Message: {error_message}
Initial Context from N8N:
{context_from_n8n}
Instructions:
1. Use the Python diagnostic client to gather additional information
2. Check MemoryGraph for similar past issues
3. Analyze the root cause
4. If appropriate, execute low-risk remediation (e.g., docker restart)
5. Store learnings in MemoryGraph
Output a JSON response with:
{
"root_cause": "string describing the root cause",
"severity": "low" | "medium" | "high" | "critical",
"affected_services": ["list", "of", "services"],
"diagnosis_steps": ["steps", "taken", "to", "diagnose"],
"recommended_actions": [
{
"action": "description",
"command": "actual command",
"risk_level": "none" | "low" | "medium" | "high",
"executed": true | false
}
],
"remediation_performed": "description of any auto-remediation done",
"memory_graph_entries": ["list of memories stored"],
"additional_context": "any other relevant information"
}
```
### 5.2 Server Diagnostics Skill Specification
#### 5.2.1 Skill Structure
```
~/.claude/skills/server-diagnostics/
├── SKILL.md # Skill context and workflows
├── client.py # Python diagnostic library
├── config.yaml # Server inventory and settings
└── commands/
├── docker.py # Docker-specific diagnostics
├── system.py # System-level diagnostics
└── network.py # Network diagnostics
```
#### 5.2.2 Python Client Interface
```python
# client.py - Core diagnostic functions
class ServerDiagnostics:
def __init__(self, config_path: str = "config.yaml"):
"""Initialize with server inventory from config."""
def read_logs(
self,
server: str,
log_type: Literal["system", "docker", "application", "custom"],
lines: int = 100,
filter: str | None = None
) -> str:
"""Read logs from specified server via SSH."""
def run_diagnostic(
self,
server: str,
command: Literal[
"disk_usage", "memory_usage", "cpu_usage",
"process_list", "network_status", "docker_ps",
"service_status", "port_check"
],
params: dict | None = None
) -> dict:
"""Execute whitelisted diagnostic command."""
def get_docker_status(
self,
server: str,
container: str | None = None
) -> dict:
"""Get Docker container status and health."""
def docker_restart(
self,
server: str,
container: str
) -> dict:
"""Restart a Docker container (low-risk remediation)."""
def get_metrics(
self,
server: str,
metric_type: Literal["cpu", "memory", "disk", "network", "all"] = "all"
) -> dict:
"""Get current system metrics."""
```
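Since `--allowedTools` scopes Bash to `python3 ... client.py *`, the class above needs a small CLI entry point. A minimal sketch (the subcommand and flag names are illustrative, not the deployed interface):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Map client.py subcommands onto ServerDiagnostics methods (names illustrative)."""
    p = argparse.ArgumentParser(prog="client.py")
    sub = p.add_subparsers(dest="cmd", required=True)

    logs = sub.add_parser("read_logs")
    logs.add_argument("server")
    logs.add_argument("log_type", choices=["system", "docker", "application", "custom"])
    logs.add_argument("--lines", type=int, default=100)
    logs.add_argument("--filter")

    restart = sub.add_parser("docker_restart")
    restart.add_argument("server")
    restart.add_argument("container")
    return p

# Parsed the same way a headless Claude Bash call would invoke it
args = build_parser().parse_args(["read_logs", "proxmox-host", "docker", "--lines", "200"])
```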
#### 5.2.3 Configuration File (config.yaml)
```yaml
# Server inventory and skill settings
servers:
proxmox-host:
hostname: 192.168.1.100
ssh_user: root
ssh_key: ~/.ssh/claude_diagnostics_key
docker_socket: /var/run/docker.sock
# Containers to monitor (Phase 1)
docker_containers:
- name: tdarr
critical: true
restart_allowed: true
- name: portainer
critical: true
restart_allowed: true
- name: n8n
critical: true
restart_allowed: false # Don't restart yourself!
# Command whitelist (maps to actual commands)
diagnostic_commands:
disk_usage: "df -h"
memory_usage: "free -h"
cpu_usage: "top -bn1 | head -20"
process_list: "ps aux --sort=-%mem | head -20"
network_status: "ss -tuln"
docker_ps: "docker ps -a --format 'table {{.Names}}\t{{.Status}}\t{{.Ports}}'"
service_status: "systemctl status {service}"
port_check: "nc -zv {host} {port}"
# Remediation commands (low-risk only)
remediation_commands:
docker_restart: "docker restart {container}"
docker_logs: "docker logs --tail 500 {container}"
# Denied commands (never execute)
denied_patterns:
- "rm -rf"
- "dd if="
- "mkfs"
- ":(){:|:&};:"
- "shutdown"
- "reboot"
- "init 0"
- "systemctl stop"
- "> /dev/sd"
```
### 5.3 Command Control in settings.json
Claude Code's `~/.claude/settings.json` will include:
```json
{
"permissions": {
"allow": [
"Bash(python3 ~/.claude/skills/server-diagnostics/client.py *)",
"Bash(ssh proxmox-host docker *)",
"Bash(ssh proxmox-host systemctl status *)",
"Bash(ssh proxmox-host journalctl *)",
"Bash(ssh proxmox-host df *)",
"Bash(ssh proxmox-host free *)",
"Bash(ssh proxmox-host ps *)",
"Bash(ssh proxmox-host top *)",
"Bash(ssh proxmox-host ss *)",
"Bash(ssh proxmox-host nc -zv *)"
],
"deny": [
"Bash(rm -rf *)",
"Bash(dd *)",
"Bash(mkfs *)",
"Bash(shutdown *)",
"Bash(reboot *)",
"Bash(init *)",
"Bash(*> /dev/sd*)"
]
}
}
```
> **Note:** The `--allowedTools` flag uses [permission rule syntax](https://code.claude.com/docs/en/settings#permission-rule-syntax). The space before `*` is significant — `Bash(python3 *)` matches commands starting with `python3 `, while `Bash(python3*)` would also match `python3something`.
### 5.4 N8N Execute Command Configuration
```bash
# N8N Execute Command Node
claude -p "$(cat <<'EOF'
You are troubleshooting a server issue. Use the server-diagnostics skill.
Server: {{ $json.server }}
Error Type: {{ $json.error_type }}
Timestamp: {{ $json.timestamp }}
Error Message: {{ $json.error_message }}
Initial Context:
{{ $json.context }}
[... rest of prompt template ...]
EOF
)" --output-format json \
--json-schema '{ "type": "object", "properties": { "root_cause": { "type": "string" }, "severity": { "type": "string", "enum": ["low","medium","high","critical"] }, "affected_services": { "type": "array", "items": { "type": "string" } }, "diagnosis_steps": { "type": "array", "items": { "type": "string" } }, "recommended_actions": { "type": "array", "items": { "type": "object" } }, "remediation_performed": { "type": "string" }, "additional_context": { "type": "string" } }, "required": ["root_cause","severity","affected_services"] }' \
--allowedTools "Read,Grep,Glob,Bash(python3 ~/.claude/skills/server-diagnostics/client.py *)"
```
### 5.5 Security Requirements
#### 5.5.1 SSH Security
- Dedicated SSH key pair for Claude Code LXC
- Key installed only on target servers (Proxmox host initially)
- No password authentication
- Key has restricted permissions (read-only where possible)
#### 5.5.2 Command Control
- Allow list in `settings.json` for pre-approved commands
- Deny list blocks dangerous operations
- Skill config.yaml provides additional validation layer
- Denied patterns checked before command execution
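The config-level validation reduces to a substring scan before anything is handed to SSH (a minimal sketch; the deployed client.py may differ):

```python
# Mirrors the denied_patterns list in config.yaml
DENIED_PATTERNS = ["rm -rf", "dd if=", "mkfs", "shutdown", "reboot",
                   "init 0", "systemctl stop", "> /dev/sd"]

def is_denied(command: str) -> bool:
    """Reject a command if any denied pattern appears anywhere in it."""
    return any(pattern in command for pattern in DENIED_PATTERNS)

assert is_denied("ssh proxmox-host shutdown -h now")
assert not is_denied("docker restart tdarr")
```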
#### 5.5.3 Network Security
- Claude Code LXC on internal network only
- N8N accesses Claude Code via local exec (same Proxmox host)
- No external API exposure
#### 5.5.4 Audit Logging
- All diagnostic commands logged with:
- Timestamp
- Command executed
- Server targeted
- Result summary
- Session context
- Logs stored on NAS alongside reports
- MemoryGraph entries provide additional audit trail
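One way to satisfy the audit fields above is one JSON line per diagnostic action (a sketch; the field names follow the bullet list, and the log path mentioned in the comment is an assumption):

```python
import json
import time

def audit_entry(command: str, server: str, result_summary: str, session_id: str) -> str:
    """Serialize one diagnostic action as a JSON line for the NAS audit log."""
    return json.dumps({
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S%z"),
        "command": command,
        "server": server,
        "result": result_summary,
        "session": session_id,
    })

# Each call appends one line to an audit file on the NAS (path assumed)
line = audit_entry("docker ps -a", "proxmox-host", "3 containers, 1 unhealthy", "sess-0001")
```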
-----
## 6. Data Flow
### 6.1 Happy Path: Error Detection to Resolution
```
1. N8N Schedule Trigger (every 60s)
2. N8N Health Check (Execute Command: docker ps, check container health)
3. Error Detected (container unhealthy or stopped)
4. N8N aggregates context:
- Container name and status
- Recent docker logs (last 50 lines)
- Current timestamp
5. N8N Execute Command Node:
claude -p "{prompt with context}" --output-format json --allowedTools "..."
6. Claude Code (headless):
- Loads server-diagnostics skill
- Recalls similar issues from MemoryGraph
- Runs Python diagnostic client:
* get_docker_status(proxmox-host, container)
* read_logs(proxmox-host, docker, 200, "error")
* get_metrics(proxmox-host, all)
- Analyzes gathered data
- If low-risk: executes docker_restart()
- Stores learnings in MemoryGraph
- Outputs structured JSON
7. N8N parses JSON response
8. N8N saves report to NAS
9. N8N sends Discord notification:
- Severity emoji
- Root cause summary
- Actions taken/recommended
- Link to full report
10. Administrator reviews (if needed)
```
### 6.2 Error Handling
**Scenario: Claude Code invocation fails**
- N8N captures stderr and exit code
- N8N retries up to 3 times with exponential backoff
- If all retries fail, send critical Discord alert
**Scenario: SSH connection fails**
- Python client returns error in response
- Claude Code incorporates into analysis
- Recommendation: "Unable to connect to {server}, check network/SSH"
**Scenario: Malformed output from Claude Code**
- N8N attempts JSON parse
- On failure, save raw output to NAS
- Send Discord alert with raw output attached
**Scenario: Remediation fails**
- Capture error from docker restart
- Do not retry automatically
- Report failure and suggest manual intervention
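The retry policy for failed Claude Code invocations can be sketched as follows (the `claude` command line is the one from 5.1.1; the 2-second base delay is an assumption):

```python
import subprocess
import time

def invoke_with_retry(cmd: list[str], retries: int = 3, base_delay: float = 2.0) -> str:
    """Run a command, retrying up to `retries` times with exponential backoff."""
    for attempt in range(retries):
        proc = subprocess.run(cmd, capture_output=True, text=True)
        if proc.returncode == 0:
            return proc.stdout
        if attempt < retries - 1:
            time.sleep(base_delay * 2 ** attempt)  # 2s, then 4s before the final try
    raise RuntimeError(f"failed after {retries} attempts: {proc.stderr.strip()}")

# e.g. invoke_with_retry(["claude", "-p", prompt, "--output-format", "json"])
out = invoke_with_retry(["echo", "diagnosed"])
```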
-----
## 7. Implementation Phases
### Phase 1: Foundation
**Deliverables:**
- Claude Code LXC container provisioned and configured
- SSH key pair generated and installed on Proxmox host
- Server Diagnostics Skill with basic Docker tools
- N8N workflow for Docker container health checks
- Discord webhook integration
- Basic MemoryGraph integration
**Success Criteria:**
- N8N can trigger Claude Code headless
- Claude Code can diagnose Docker container issues
- Discord notifications received for test alerts
- At least one troubleshooting result stored in MemoryGraph
### Phase 2: Enhancement
**Deliverables:**
- Complete diagnostic command library
- Expand to system-level diagnostics (disk, memory, network)
- Add more Docker containers to monitoring
- Implement alert deduplication/fatigue detection
- Enhanced prompt engineering for better diagnoses
**Success Criteria:**
- System handles 10+ different error scenarios
- MemoryGraph successfully recalls relevant past issues
- <5% false positive rate achieved
### Phase 3: Expansion
**Deliverables:**
- Extend to additional Proxmox VMs/LXCs
- Integration with existing Proxmox skill
- NAS report archival and cleanup
- SMS notifications (optional)
- Documentation and runbooks
**Success Criteria:**
- System achieves >70% autonomous diagnosis rate
- All documentation complete
- Full homelab coverage
-----
## 8. Configuration Examples
### 8.1 Claude Code LXC Environment
```bash
# Authentication: Use Max subscription (no API key needed)
# Run 'claude' interactively once to authenticate via device code flow
# Credentials persist in ~/.claude/
# Skills are auto-loaded from:
# ~/.claude/skills/
```
### 8.2 N8N Workflow Nodes
```yaml
# Trigger Node
type: Schedule
interval: 1
unit: minutes
# Health Check Node
type: Execute Command
command: ssh proxmox-host "docker ps --format '{{.Names}},{{.Status}}' | grep -v 'Up'"
# Branch Node
condition: "{{ $json.stdout !== '' }}"
# Claude Code Node
type: Execute Command
command: |
claude -p "..." --output-format json --allowedTools "Read,Bash,Grep,Glob"
timeout: 180000
# Parse JSON Node
type: Code
code: return JSON.parse($json.stdout)
# Discord Node
type: Discord Webhook
content: |
**{{ $json.severity === 'critical' ? '🔴' : $json.severity === 'high' ? '🟠' : '🟡' }} Server Alert**
**Root Cause:** {{ $json.root_cause }}
**Affected:** {{ $json.affected_services.join(', ') }}
**Auto-remediation:** {{ $json.remediation_performed || 'None' }}
**Recommended Actions:**
{{ $json.recommended_actions.map(a => '• ' + a.action).join('\n') }}
```
### 8.3 Discord Webhook Setup
```yaml
webhook_url: https://discord.com/api/webhooks/{id}/{token}
channel: #server-alerts
mentions:
critical: "@here"
high: ""
medium: ""
low: ""
```
-----
## 9. Testing Strategy
### 9.1 Unit Tests
- Python diagnostic client functions
- SSH connection handling
- Command whitelist/deny validation
- JSON output parsing
### 9.2 Integration Tests
- End-to-end: N8N → Claude Code → SSH → Response
- MemoryGraph storage and recall
- Discord notification delivery
- NAS report storage
### 9.3 Simulated Failures
| Test Case | Simulation Method | Expected Behavior |
|-----------|-------------------|-------------------|
| Container crash | `docker stop tdarr` | Detect, diagnose, restart |
| High memory | stress-ng in container | Detect, identify process |
| Disk full | Create large temp file | Detect, recommend cleanup |
| Network issue | Block container port | Detect, diagnose connectivity |
| SSH failure | Temporarily revoke key | Graceful error, alert sent |
-----
## 10. Risks and Mitigations
### 10.1 Technical Risks
| Risk | Impact | Probability | Mitigation |
|------|--------|-------------|------------|
| OAuth token expiry | Medium | Medium | Monitor for auth failures, re-authenticate when needed |
| SSH key compromise | Critical | Low | Rotate keys, minimal permissions |
| Claude hallucinations | Medium | Medium | Require approval for high-risk |
| N8N container failure | High | Low | Health check N8N itself |
| MemoryGraph corruption | Low | Low | Regular backups |
### 10.2 Operational Risks
| Risk | Impact | Probability | Mitigation |
|------|--------|-------------|------------|
| False positives | Medium | High | Tune thresholds, deduplication |
| Alert fatigue | Medium | Medium | Cooldown periods, severity filtering |
| Over-reliance | Medium | Low | Maintain manual runbooks |
-----
## 11. Open Questions - RESOLVED
| Question | Resolution |
|----------|------------|
| MCP vs Skills? | **Skills** - simpler, extends existing Proxmox skill |
| Bridge API needed? | **No** - N8N calls Claude Code directly |
| Where to run Claude Code? | **Dedicated LXC** on Proxmox |
| Read-only vs remediation? | **Low-risk remediation allowed** (docker restart) |
| Command control method? | **settings.json allow/deny lists** |
| Phase 1 scope? | **Docker containers** on Proxmox host |
| Notification channel? | **Discord** (SMS future enhancement) |
| MemoryGraph integration? | **Yes** - pattern learning enabled |
-----
## 12. Appendices
### Appendix A: Glossary
- **N8N:** Workflow automation platform
- **Claude Code:** AI-powered coding assistant with headless mode
- **Headless Mode:** Non-interactive execution mode for automation
- **Skill:** Claude Code context/capability extension via markdown + code
- **MemoryGraph:** Graph-based memory system for pattern storage
- **LXC:** Linux Container on Proxmox
- **MTTD:** Mean Time To Diagnosis
### Appendix B: References
- [Claude Code CLI Reference](https://docs.anthropic.com/en/docs/claude-code/cli)
- [Claude Code Skills Documentation](https://docs.anthropic.com/en/docs/claude-code/skills)
- [N8N Execute Command Node](https://docs.n8n.io/integrations/builtin/core-nodes/n8n-nodes-base.executecommand/)
- [Discord Webhooks Guide](https://discord.com/developers/docs/resources/webhook)
### Appendix C: Change Log
| Version | Date | Author | Changes |
|---------|------|--------|---------|
| 1.0 | 2025-12-19 | System Architect | Initial draft (MCP-based) |
| 2.0 | 2025-12-19 | Cal + Jarvis | Architecture revision: Skills instead of MCP, direct N8N invocation, LXC deployment, MemoryGraph integration |
-----
**End of Document**


@@ -0,0 +1,19 @@
# Headless Claude - Server Diagnostics (Archived)
Design docs and PRD for the headless Claude server monitoring system.
## Live Deployment
- **CT 300 (`claude-runner`, 10.10.0.148):** `~/.claude/skills/server-diagnostics/`
- Triggered by N8N (CT 210) on schedule
- Tier 1 (health_check.py) runs autonomously; Tier 2 (client.py) escalates to Claude
## Why This Is Archived
Original planning folder with no git repo. The implementation was deployed directly
to CT 300. This archive preserves the design docs, particularly:
- `CLAUDE.md` — Agent SDK reference, CLI patterns, structured output, tool permissions
- `PRD.md` — Full product requirements (v2.0) with Phase 2/3 roadmap
- `docs/skill-architecture.md` — Detailed skill design spec
- `docs/n8n-workflow-design.md` — 10-node N8N workflow design
## Archived
2026-03-01 from `/mnt/NV2/Development/headless-claude/`


@@ -0,0 +1,478 @@
# Claude Code LXC Setup Guide
This guide walks through setting up a dedicated LXC container on Proxmox for running Claude Code in headless mode.
## Prerequisites
- Proxmox VE host with available resources
- Ubuntu/Debian LXC template
- Claude Max subscription (recommended — no API key needed) or an Anthropic API key from console.anthropic.com
- SSH access to Proxmox host
## 1. Create LXC Container
### Via Proxmox Web UI
1. Navigate to your Proxmox node
2. Click **Create CT**
3. Configure:
- **CT ID:** 300 (or next available)
- **Hostname:** claude-code
- **Password:** Set a secure password
- **SSH Public Key:** Add your key for access
4. Template:
- **Template:** `ubuntu-22.04-standard` or `debian-12-standard`
5. Resources:
- **Disk:** 16 GB (local-lvm)
- **CPU:** 2 cores
- **Memory:** 2048 MB
- **Swap:** 512 MB
6. Network:
- **Bridge:** vmbr0
- **IPv4:** DHCP or static (e.g., 10.10.0.50/24)
- **Gateway:** Your network gateway
7. DNS:
- Use host settings or configure manually
8. Click **Finish** and start the container
### Via CLI (pct)
```bash
# SSH to Proxmox host
ssh root@proxmox-host
# Create container
pct create 300 local:vztmpl/ubuntu-22.04-standard_22.04-1_amd64.tar.zst \
--hostname claude-code \
--memory 2048 \
--swap 512 \
--cores 2 \
--rootfs local-lvm:16 \
--net0 name=eth0,bridge=vmbr0,ip=dhcp \
--unprivileged 1 \
--features nesting=1 \
--start 1
# Set password
pct exec 300 -- passwd root
```
## 2. Initial Container Setup
```bash
# Enter container
pct enter 300
# Or SSH: ssh root@10.10.0.50
# Update system
apt update && apt upgrade -y
# Install essential packages
apt install -y \
curl \
wget \
git \
openssh-client \
python3 \
python3-pip \
python3-venv \
ca-certificates \
gnupg \
jq \
vim
# Install Node.js 20.x (required for Claude Code)
curl -fsSL https://deb.nodesource.com/setup_20.x | bash -
apt install -y nodejs
# Verify installations
node --version # Should be v20.x
npm --version
python3 --version
```
## 3. Install Claude Code CLI
```bash
# Install Claude Code globally
npm install -g @anthropic-ai/claude-code
# Verify installation
claude --version
```
## 4. Configure Authentication (Max Subscription)
Using your Claude Max subscription is the recommended approach: no API key management is needed, and the Max tier rate limits are included with your subscription.
### Authenticate with Max Subscription
```bash
# Run Claude Code interactively
claude
# You'll see a device code prompt:
# ┌────────────────────────────────────────────────────┐
# │ To authenticate, visit: │
# │ https://console.anthropic.com/device │
# │ │
# │ Enter code: ABCD-1234 │
# └────────────────────────────────────────────────────┘
# 1. Open the URL in your browser (on any device)
# 2. Log in with your Anthropic account (Max subscription)
# 3. Enter the code shown in the terminal
# 4. Claude Code will confirm authentication
```
Credentials are stored in `~/.claude/` and persist across sessions.
### Verify Authentication
```bash
# Test headless mode
claude -p "Say hello" --output-format json
# Should return JSON with response like:
# {"type":"result","result":"Hello! How can I help you today?","session_id":"..."}
```
### Token Refresh
OAuth tokens may expire after weeks/months. If headless mode starts failing with authentication errors:
1. SSH into the LXC
2. Run `claude` interactively
3. Re-authenticate via device code flow
Consider adding a health check in N8N to detect auth failures and alert you.
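Such a health-check node only needs to classify the headless output; a minimal heuristic follows (the error strings checked are assumptions, not documented values):

```python
import json

def auth_looks_failed(stdout: str, exit_code: int) -> bool:
    """Treat nonzero exits, unparseable output, or auth-flavored errors as failures."""
    if exit_code != 0:
        return True
    try:
        data = json.loads(stdout)
    except json.JSONDecodeError:
        return True
    text = str(data.get("result", "")).lower()
    return data.get("type") == "error" or "authenticat" in text or "log in" in text

assert auth_looks_failed("", 1)
assert not auth_looks_failed('{"type":"result","result":"Hello!"}', 0)
```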
## 5. Set Up SSH Keys for Server Access
```bash
# Create .ssh directory
mkdir -p ~/.ssh
chmod 700 ~/.ssh
# Generate dedicated key pair for diagnostics
ssh-keygen -t ed25519 -f ~/.ssh/claude_diagnostics_key -N "" -C "claude-code-diagnostics"
# View public key (copy this)
cat ~/.ssh/claude_diagnostics_key.pub
```
### Install Key on Target Servers
```bash
# On Proxmox host (or other target servers)
# Add the public key to authorized_keys
echo "ssh-ed25519 AAAA... claude-code-diagnostics" >> ~/.ssh/authorized_keys
# Or use ssh-copy-id from Claude Code LXC
ssh-copy-id -i ~/.ssh/claude_diagnostics_key.pub root@10.10.0.11
```
### Test SSH Connection
```bash
# From Claude Code LXC
ssh -i ~/.ssh/claude_diagnostics_key root@10.10.0.11 "hostname && docker ps"
# Should show hostname and Docker containers
```
## 6. Install Python Dependencies
```bash
# Install uv (fast Python package manager)
curl -LsSf https://astral.sh/uv/install.sh | sh
source ~/.bashrc
# Create virtual environment for skills
mkdir -p ~/.claude/skills
cd ~/.claude/skills
# Install skill dependencies
uv venv
source .venv/bin/activate
uv pip install paramiko pyyaml
```
## 7. Install Server Diagnostics Skill
```bash
# Create skill directory
mkdir -p ~/.claude/skills/server-diagnostics
# Copy skill files (from development machine)
# Or clone from repo when available
```
### Create config.yaml
```bash
cat > ~/.claude/skills/server-diagnostics/config.yaml << 'EOF'
# Server Diagnostics Configuration
servers:
proxmox-host:
hostname: 10.10.0.11 # Update with actual IP
ssh_user: root
ssh_key: ~/.ssh/claude_diagnostics_key
description: "Main Proxmox host running Docker services"
docker_containers:
- name: tdarr
critical: true
restart_allowed: true
- name: portainer
critical: true
restart_allowed: true
- name: n8n
critical: true
restart_allowed: false
- name: plex
critical: true
restart_allowed: true
diagnostic_commands:
disk_usage: "df -h"
memory_usage: "free -h"
cpu_usage: "top -bn1 | head -20"
process_list: "ps aux --sort=-%mem | head -20"
network_status: "ss -tuln"
docker_ps: "docker ps -a --format 'table {{.Names}}\\t{{.Status}}\\t{{.Ports}}'"
remediation_commands:
docker_restart: "docker restart {container}"
docker_logs: "docker logs --tail 500 {container}"
denied_patterns:
- "rm -rf"
- "dd if="
- "mkfs"
- "shutdown"
- "reboot"
- "> /dev/sd"
EOF
```
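The skill client enforces `denied_patterns` by plain substring matching against each outgoing command before it is sent over SSH. A minimal sketch of that check (the function name here is illustrative; the client implements it as `_validate_command`):

```python
# Substring-based deny check, mirroring the skill client's validation step.
DENIED_PATTERNS = ["rm -rf", "dd if=", "mkfs", "shutdown", "reboot", "> /dev/sd"]

def is_denied(command: str) -> bool:
    """Return True if the command contains any denied pattern."""
    return any(pattern in command for pattern in DENIED_PATTERNS)
```

Substring matching is deliberately conservative: a harmless `echo shutdown` is also rejected, which is acceptable for an automation that should never need those words.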
## 8. Configure Claude Code settings.json
```bash
mkdir -p ~/.claude
cat > ~/.claude/settings.json << 'EOF'
{
"permissions": {
"allow": [
"Bash(ssh root@10.10.0.11 docker:*)",
"Bash(ssh root@10.10.0.11 systemctl status:*)",
"Bash(ssh root@10.10.0.11 journalctl:*)",
"Bash(ssh root@10.10.0.11 df:*)",
"Bash(ssh root@10.10.0.11 free:*)",
"Bash(ssh root@10.10.0.11 ps:*)",
"Bash(ssh root@10.10.0.11 top:*)",
"Bash(ssh root@10.10.0.11 ss:*)",
"Bash(python3 ~/.claude/skills/server-diagnostics/client.py:*)"
],
"deny": [
"Bash(rm -rf:*)",
"Bash(dd:*)",
"Bash(mkfs:*)",
"Bash(shutdown:*)",
"Bash(reboot:*)",
"Bash(*> /dev/sd*)"
]
},
"model": "sonnet"
}
EOF
```
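The `:*` suffix in these rules means prefix matching. A simplified model of how an allow rule applies (this is a sketch for intuition, not the actual Claude Code matcher, which also normalizes commands):

```python
def bash_rule_matches(rule: str, command: str) -> bool:
    """Simplified model of Bash(...) permission rules:
    'prefix:*' matches any command starting with that prefix;
    otherwise the rule requires an exact match."""
    inner = rule.removeprefix("Bash(").removesuffix(")")
    if inner.endswith(":*"):
        return command.startswith(inner[:-2])
    return command == inner
```

Under this model, `Bash(ssh root@10.10.0.11 df:*)` allows `ssh root@10.10.0.11 df -h` but not `ssh root@10.10.0.11 rm -rf /`, which is why the allow list enumerates one rule per permitted remote command.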
## 9. Configure SSH for Known Hosts
```bash
# Pre-add Proxmox host to known hosts to avoid prompts
ssh-keyscan -H 10.10.0.11 >> ~/.ssh/known_hosts
```
## 10. Create Test Script
```bash
cat > ~/test-claude-headless.sh << 'EOF'
#!/bin/bash
# Test Claude Code headless mode
echo "Testing Claude Code headless mode..."
# Simple test
result=$(claude -p "What is 2+2? Reply with just the number." --output-format json 2>&1)
if echo "$result" | jq -e '.result' > /dev/null 2>&1; then
echo "✅ Claude Code headless mode working"
echo "Response: $(echo "$result" | jq -r '.result')"
elif echo "$result" | grep -q "authenticate"; then
echo "❌ Authentication required - run 'claude' interactively to log in"
exit 1
else
echo "❌ Claude Code headless mode failed"
echo "Output: $result"
exit 1
fi
# Test SSH access
echo ""
echo "Testing SSH to Proxmox host..."
if ssh -i ~/.ssh/claude_diagnostics_key -o ConnectTimeout=5 root@10.10.0.11 "echo 'SSH OK'" 2>/dev/null; then
echo "✅ SSH connection working"
else
echo "❌ SSH connection failed"
exit 1
fi
# Test Docker access
echo ""
echo "Testing Docker access on Proxmox host..."
containers=$(ssh -i ~/.ssh/claude_diagnostics_key root@10.10.0.11 "docker ps --format '{{.Names}}'" 2>/dev/null)
if [ -n "$containers" ]; then
echo "✅ Docker access working"
echo "Containers found:"
echo "$containers" | sed 's/^/ - /'
else
echo "⚠️ No containers found or Docker not accessible"
fi
echo ""
echo "All tests completed!"
EOF
chmod +x ~/test-claude-headless.sh
```
## 11. Run Verification Tests
```bash
# Run the test script
~/test-claude-headless.sh
```
Expected output:
```
Testing Claude Code headless mode...
✅ Claude Code headless mode working
Response: 4
Testing SSH to Proxmox host...
✅ SSH connection working
Testing Docker access on Proxmox host...
✅ Docker access working
Containers found:
- tdarr
- portainer
- n8n
- plex
All tests completed!
```
## 12. Configure N8N Access (Next Step)
The N8N container needs to be able to invoke Claude Code on this LXC. Options:
### Option A: SSH from N8N to Claude LXC
```bash
# On N8N container, generate key and copy to Claude LXC
ssh-keygen -t ed25519 -f ~/.ssh/claude_lxc_key -N ""
ssh-copy-id -i ~/.ssh/claude_lxc_key.pub root@10.10.0.50
# N8N Execute Command will use:
# ssh -i ~/.ssh/claude_lxc_key root@10.10.0.50 "claude -p '...' --output-format json"
```
### Option B: Local Execution (if N8N runs on same host)
If N8N runs in a container on the Proxmox host, the host itself can run commands inside the Claude LXC with `pct exec` (which only works from the Proxmox host):
```bash
# From Proxmox host
pct exec 300 -- claude -p "..." --output-format json
```
## Troubleshooting
### Claude Code Not Found
```bash
# Check npm global path
npm config get prefix
# Ensure it's in PATH
export PATH="$PATH:$(npm config get prefix)/bin"
```
### SSH Permission Denied
```bash
# Check key permissions
chmod 600 ~/.ssh/claude_diagnostics_key
chmod 644 ~/.ssh/claude_diagnostics_key.pub
# Check authorized_keys on target
ssh root@10.10.0.11 "cat ~/.ssh/authorized_keys"
```
### Authentication Expired
```bash
# If headless mode returns auth errors, re-authenticate:
claude
# Follow the device code flow to log in again
# Credentials will be refreshed in ~/.claude/
```
### Container Network Issues
```bash
# Check network configuration
ip addr
ping -c 3 google.com
# If no connectivity, check Proxmox network settings
```
## Security Considerations
1. **OAuth Credentials**: Authentication tokens are stored in `~/.claude/`. Ensure the LXC has restricted access and backups don't expose this directory.
2. **SSH Key Scope**: The `claude_diagnostics_key` should only be installed on servers that need automated diagnostics.
3. **Minimal Permissions**: The SSH key on target servers could use `command=` restrictions for additional security (Phase 2 enhancement).
4. **Network Isolation**: Consider placing the Claude LXC on an internal-only network segment.
5. **Session Security**: Your Max subscription credentials live on this LXC. Don't share access to the container.
## Snapshot Before Production
```bash
# On Proxmox host, create snapshot
pct snapshot 300 before-production --description "Claude Code LXC fully configured"
```
## Next Steps
1. ✅ LXC created and configured
2. ✅ Claude Code installed and authenticated
3. ✅ SSH keys installed on target servers
4. ⏳ Install server-diagnostics skill (when code ready)
5. ⏳ Configure N8N workflow
6. ⏳ Test end-to-end pipeline

# N8N Setup Instructions
This guide walks through configuring the N8N health check workflow.
## Prerequisites
- N8N running at http://10.10.0.210:5678
- SSH access configured between N8N LXC and Claude Code LXC (already done)
- Discord webhook configured (already done)
## Step 1: Create SSH Credential in N8N
1. Open N8N at http://10.10.0.210:5678
2. Go to **Settings** (gear icon) → **Credentials**
3. Click **Add Credential**
4. Search for and select **SSH**
5. Configure:
- **Credential Name:** `Claude Code LXC`
- **Host:** `10.10.0.148`
- **Port:** `22`
- **Username:** `root`
- **Authentication:** `Private Key`
- **Private Key:** Paste the following:
```
-----BEGIN OPENSSH PRIVATE KEY-----
b3BlbnNzaC1rZXktdjEAAAAABG5vbmUAAAAEbm9uZQAAAAAAAAABAAAAMwAAAAtzc2gtZW
QyNTUxOQAAACAtHWX+TOsNkiL7TP1niwpO5AJjBThghODpa1q93g7bagAAAJiq1/Cyqtfw
sgAAAAtzc2gtZWQyNTUxOQAAACAtHWX+TOsNkiL7TP1niwpO5AJjBThghODpa1q93g7bag
AAAEBVXct4LSCsRexl9JYZtrYx4YVoioYBdWHlPinW+PudBS0dZf5M6w2SIvtM/WeLCk7k
AmMFOGCE4OlrWr3eDttqAAAAEm44bi10by1jbGF1ZGUtY29kZQECAw==
-----END OPENSSH PRIVATE KEY-----
```
6. Click **Save**
## Step 2: Import the Workflow
1. Go to **Workflows** in N8N
2. Click **Add Workflow** → **Import from File**
3. Select: `/mnt/NV2/Development/headless-claude/n8n-workflow-import.json`
- Or copy the workflow from this repo
4. After import, open the workflow
## Step 3: Link the SSH Credential
1. Click on the **Run Claude Diagnostics** node
2. In the **Credential** dropdown, select `Claude Code LXC`
3. Click **Save**
## Step 4: Test the Workflow
1. Click **Execute Workflow** (play button)
2. Watch each node execute:
- **Every 5 Minutes** → triggers
- **Run Claude Diagnostics** → SSHs to Claude Code LXC
- **Parse Claude Response** → extracts result
- **Has Issues?** → routes based on health
- **Discord Alert** or **Discord OK** → sends notification
3. Check Discord for the notification
## Step 5: Activate the Workflow
1. Toggle the workflow to **Active** (top right)
2. The workflow will now run every 5 minutes automatically
## Workflow Behavior
### Schedule
- Runs every 5 minutes by default
- Adjust in "Every 5 Minutes" node settings
### Health Check
- Calls Claude Code headless mode
- Runs `python3 ~/.claude/skills/server-diagnostics/client.py health paper-dynasty`
- Claude analyzes results and summarizes
### Notifications
- **Issues Found**: Red Discord embed with alert details
- **All Healthy**: Green Discord embed (disabled by default; enable the "Discord OK" node if desired)
### Cost
- Each health check costs ~$0.08 (Claude API usage)
- At 5-minute intervals: ~$0.08 × 12 × 24 = ~$23/day
- Consider increasing interval to 15-30 minutes for cost savings
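The interval/cost tradeoff above generalizes; assuming a flat per-check cost, the daily spend at any interval is:

```python
def daily_cost(cost_per_check: float, interval_minutes: int) -> float:
    """Daily cost of a health check that runs every interval_minutes."""
    checks_per_day = (24 * 60) // interval_minutes
    return cost_per_check * checks_per_day

# 5-minute interval:  288 checks/day -> $23.04
# 15-minute interval:  96 checks/day ->  $7.68
# 30-minute interval:  48 checks/day ->  $3.84
```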
## Customization
### Change Check Interval
1. Open "Every 5 Minutes" node
2. Adjust `minutesInterval` value
### Add More Servers
1. Update config.yaml on Claude Code LXC
2. Modify the Claude prompt in "Run Claude Diagnostics" node
### Enable OK Notifications
1. Click "Discord OK (Optional)" node
2. Toggle off "Disabled" setting
## Troubleshooting
### SSH Connection Failed
```bash
# Test from N8N LXC
ssh -i ~/.ssh/n8n_to_claude root@10.10.0.148 "hostname"
```
### Claude Command Not Found
```bash
# Verify Claude path on Claude Code LXC
/root/.local/bin/claude --version
```
### Authentication Expired
If Claude returns auth errors, SSH to Claude Code LXC and re-authenticate:
```bash
ssh root@10.10.0.148
/root/.local/bin/claude # Follow device code flow
```
## Architecture
```
N8N (10.10.0.210)
├── Schedule Trigger (every 5 min)
└── SSH to Claude Code LXC (10.10.0.148)
└── Claude Code Headless Mode
└── server-diagnostics skill
└── SSH to Paper Dynasty (10.10.0.88)
└── Docker health checks
```

# N8N Workflow Design - Server Troubleshooting
This document describes the N8N workflow that orchestrates the automated troubleshooting pipeline.
## Workflow Overview
```
┌─────────────┐ ┌──────────────┐ ┌───────────┐ ┌──────────────┐
│ Schedule │────▶│ Health Check │────▶│ Has Error?│────▶│ Gather │
│ Trigger │ │ (docker ps) │ │ │ │ Context │
└─────────────┘ └──────────────┘ └─────┬─────┘ └──────┬───────┘
│ │
│ No │
▼ ▼
[End] ┌──────────────┐
│ Invoke │
│ Claude Code │
└──────┬───────┘
┌─────────────────────┼─────────────────────┐
│ │ │
▼ ▼ ▼
┌────────────┐ ┌──────────────┐ ┌──────────────┐
│ Parse JSON │ │ Save Report │ │ Send Discord │
│ Response │ │ to NAS │ │ Notification │
└────────────┘ └──────────────┘ └──────────────┘
```
## Workflow JSON Export
This can be imported directly into N8N.
```json
{
"name": "Server Troubleshooting Pipeline",
"nodes": [
{
"parameters": {
"rule": {
"interval": [
{
"field": "minutes",
"minutesInterval": 1
}
]
}
},
"id": "schedule-trigger",
"name": "Every Minute",
"type": "n8n-nodes-base.scheduleTrigger",
"typeVersion": 1.2,
"position": [0, 0]
},
{
"parameters": {
"command": "ssh -i /root/.ssh/claude_lxc_key root@claude-lxc \"ssh -i ~/.ssh/claude_diagnostics_key root@10.10.0.11 'docker ps --format json'\""
},
"id": "docker-health-check",
"name": "Check Docker Status",
"type": "n8n-nodes-base.executeCommand",
"typeVersion": 1,
"position": [220, 0]
},
{
"parameters": {
"jsCode": "// Parse docker ps output and check for issues\nconst stdout = $input.first().json.stdout || '';\nconst stderr = $input.first().json.stderr || '';\n\n// Handle empty or error cases\nif (stderr && !stdout) {\n return [{\n json: {\n hasError: true,\n errorType: 'docker_unreachable',\n errorMessage: stderr,\n containers: []\n }\n }];\n}\n\n// Parse container statuses\nconst lines = stdout.trim().split('\\n').filter(l => l);\nconst containers = lines.map(line => {\n try {\n return JSON.parse(line);\n } catch {\n return null;\n }\n}).filter(c => c);\n\n// Check for unhealthy or stopped containers\nconst issues = containers.filter(c => {\n const status = (c.Status || c.State || '').toLowerCase();\n return !status.includes('up') && !status.includes('running');\n});\n\nreturn [{\n json: {\n hasError: issues.length > 0,\n errorType: issues.length > 0 ? 'container_stopped' : null,\n errorMessage: issues.length > 0 \n ? `Containers not running: ${issues.map(i => i.Names || i.Name).join(', ')}`\n : null,\n affectedContainers: issues.map(i => i.Names || i.Name),\n allContainers: containers,\n timestamp: new Date().toISOString()\n }\n}];"
},
"id": "parse-docker-status",
"name": "Parse Container Status",
"type": "n8n-nodes-base.code",
"typeVersion": 2,
"position": [440, 0]
},
{
"parameters": {
"conditions": {
"boolean": [
{
"value1": "={{ $json.hasError }}",
"value2": true
}
]
}
},
"id": "check-has-error",
"name": "Has Error?",
"type": "n8n-nodes-base.if",
"typeVersion": 2,
"position": [660, 0]
},
{
"parameters": {
"command": "ssh -i /root/.ssh/claude_lxc_key root@claude-lxc \"docker logs --tail 100 {{ $json.affectedContainers[0] }} 2>&1 | tail -50\""
},
"id": "gather-logs",
"name": "Gather Recent Logs",
"type": "n8n-nodes-base.executeCommand",
"typeVersion": 1,
"position": [880, -100]
},
{
"parameters": {
"jsCode": "// Prepare context for Claude Code\nconst prevData = $('Parse Container Status').first().json;\nconst logs = $input.first().json.stdout || 'No logs available';\n\nconst context = {\n server: 'proxmox-host',\n errorType: prevData.errorType,\n errorMessage: prevData.errorMessage,\n affectedContainers: prevData.affectedContainers,\n timestamp: prevData.timestamp,\n recentLogs: logs.substring(0, 2000), // Limit log size\n allContainerStatus: prevData.allContainers\n};\n\nreturn [{ json: context }];"
},
"id": "prepare-context",
"name": "Prepare Claude Context",
"type": "n8n-nodes-base.code",
"typeVersion": 2,
"position": [1100, -100]
},
{
"parameters": {
"command": "=ssh -i /root/.ssh/claude_lxc_key root@claude-lxc 'claude -p \"$(cat <<'\\''PROMPT'\\'''\nYou are troubleshooting a server issue. Use the server-diagnostics skill.\n\nServer: {{ $json.server }}\nError Type: {{ $json.errorType }}\nTimestamp: {{ $json.timestamp }}\nError Message: {{ $json.errorMessage }}\nAffected Containers: {{ $json.affectedContainers.join(\", \") }}\n\nRecent Logs:\n{{ $json.recentLogs }}\n\nInstructions:\n1. Use the Python diagnostic client to investigate\n2. Check MemoryGraph for similar past issues\n3. Analyze the root cause\n4. If appropriate, execute low-risk remediation (docker restart)\n5. Store learnings in MemoryGraph\nPROMPT\n)\" --output-format json --json-schema '\\''{ \"type\": \"object\", \"properties\": { \"root_cause\": { \"type\": \"string\" }, \"severity\": { \"type\": \"string\", \"enum\": [\"low\", \"medium\", \"high\", \"critical\"] }, \"affected_services\": { \"type\": \"array\", \"items\": { \"type\": \"string\" } }, \"diagnosis_steps\": { \"type\": \"array\", \"items\": { \"type\": \"string\" } }, \"recommended_actions\": { \"type\": \"array\", \"items\": { \"type\": \"object\", \"properties\": { \"action\": { \"type\": \"string\" }, \"command\": { \"type\": \"string\" }, \"risk_level\": { \"type\": \"string\" }, \"executed\": { \"type\": \"boolean\" } } } }, \"remediation_performed\": { \"type\": \"string\" }, \"additional_context\": { \"type\": \"string\" } }, \"required\": [\"root_cause\", \"severity\", \"affected_services\"] }'\\'' --allowedTools \"Read,Grep,Glob,Bash(python3 ~/.claude/skills/server-diagnostics/client.py *)\" --max-turns 10'",
"timeout": 180000
},
"id": "invoke-claude",
"name": "Invoke Claude Code",
"type": "n8n-nodes-base.executeCommand",
"typeVersion": 1,
"position": [1320, -100]
},
{
"parameters": {
"jsCode": "// Parse Claude Code JSON output\nconst stdout = $input.first().json.stdout || '';\nconst stderr = $input.first().json.stderr || '';\nconst exitCode = $input.first().json.exitCode || 0;\n\n// Try to parse JSON response\nlet claudeResponse = null;\nlet parseError = null;\n\ntry {\n // Claude outputs JSON with result field\n const parsed = JSON.parse(stdout);\n claudeResponse = parsed.result ? JSON.parse(parsed.result) : parsed;\n} catch (e) {\n // Try to find JSON in the output\n const jsonMatch = stdout.match(/\\{[\\s\\S]*\\}/);\n if (jsonMatch) {\n try {\n claudeResponse = JSON.parse(jsonMatch[0]);\n } catch {\n parseError = e.message;\n }\n } else {\n parseError = e.message;\n }\n}\n\nconst context = $('Prepare Claude Context').first().json;\n\nreturn [{\n json: {\n success: claudeResponse !== null,\n parseError: parseError,\n rawOutput: stdout.substring(0, 5000),\n stderr: stderr,\n exitCode: exitCode,\n originalContext: context,\n troubleshooting: claudeResponse || {\n root_cause: 'Failed to parse Claude response',\n severity: 'high',\n affected_services: context.affectedContainers,\n recommended_actions: [{ action: 'Manual investigation required', command: '', risk_level: 'none', executed: false }],\n additional_context: stdout.substring(0, 1000)\n },\n timestamp: new Date().toISOString()\n }\n}];"
},
"id": "parse-claude-response",
"name": "Parse Claude Response",
"type": "n8n-nodes-base.code",
"typeVersion": 2,
"position": [1540, -100]
},
{
"parameters": {
"command": "=mkdir -p /mnt/nas/claude-reports/$(date +%Y-%m) && echo '{{ JSON.stringify($json) }}' > /mnt/nas/claude-reports/$(date +%Y-%m)/{{ $json.timestamp.replace(/:/g, '-') }}.json"
},
"id": "save-report",
"name": "Save Report to NAS",
"type": "n8n-nodes-base.executeCommand",
"typeVersion": 1,
"position": [1760, -200]
},
{
"parameters": {
"httpMethod": "POST",
"path": "discord-webhook-url-here",
"options": {}
},
"id": "discord-webhook",
"name": "Discord Notification",
"type": "n8n-nodes-base.httpRequest",
"typeVersion": 4.2,
"position": [1760, 0]
}
],
"connections": {
"Every Minute": {
"main": [
[{ "node": "Check Docker Status", "type": "main", "index": 0 }]
]
},
"Check Docker Status": {
"main": [
[{ "node": "Parse Container Status", "type": "main", "index": 0 }]
]
},
"Parse Container Status": {
"main": [
[{ "node": "Has Error?", "type": "main", "index": 0 }]
]
},
"Has Error?": {
"main": [
[{ "node": "Gather Recent Logs", "type": "main", "index": 0 }],
[]
]
},
"Gather Recent Logs": {
"main": [
[{ "node": "Prepare Claude Context", "type": "main", "index": 0 }]
]
},
"Prepare Claude Context": {
"main": [
[{ "node": "Invoke Claude Code", "type": "main", "index": 0 }]
]
},
"Invoke Claude Code": {
"main": [
[{ "node": "Parse Claude Response", "type": "main", "index": 0 }]
]
},
"Parse Claude Response": {
"main": [
[
{ "node": "Save Report to NAS", "type": "main", "index": 0 },
{ "node": "Discord Notification", "type": "main", "index": 0 }
]
]
}
}
}
```
## Node Details
### 1. Schedule Trigger
- **Interval:** Every 1 minute
- **Purpose:** Regular health checks
### 2. Check Docker Status
```bash
ssh -i /root/.ssh/claude_lxc_key root@claude-lxc \
"ssh -i ~/.ssh/claude_diagnostics_key root@10.10.0.11 'docker ps --format json'"
```
This double-SSH pattern:
1. N8N → Claude LXC (where Claude Code and SSH keys live)
2. Claude LXC → Proxmox Host (where Docker runs)
### 3. Parse Container Status (JavaScript)
Parses the `docker ps` JSON output and checks for containers that are not "Up" or "running".
**Input:** Raw stdout from docker ps
**Output:**
```json
{
"hasError": true,
"errorType": "container_stopped",
"errorMessage": "Containers not running: tdarr",
"affectedContainers": ["tdarr"],
"allContainers": [...],
"timestamp": "2025-12-19T14:30:00.000Z"
}
```
### 4. Has Error? (IF Node)
Branches workflow:
- **True path:** Continue to troubleshooting
- **False path:** End (no action needed)
### 5. Gather Recent Logs
Fetches last 100 lines of logs from the first affected container.
### 6. Prepare Claude Context
Combines:
- Error information from health check
- Recent logs
- Timestamp
- Server identifier
### 7. Invoke Claude Code
The core troubleshooting invocation:
```bash
claude -p "<prompt>" \
--output-format json \
--json-schema '<schema>' \
--allowedTools "Read,Grep,Glob,Bash(python3 ~/.claude/skills/server-diagnostics/client.py *)" \
--max-turns 10
```
**Timeout:** 180 seconds (3 minutes)
### 8. Parse Claude Response
Extracts the structured troubleshooting result from Claude's JSON output.
Handles:
- Clean JSON parse
- JSON embedded in larger output
- Parse failures (falls back to raw output)
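The same extraction strategy can be reproduced in Python for testing outside N8N. The field names follow the documented headless output format (`structured_output` when `--json-schema` is used, `result` otherwise); the function name is illustrative:

```python
import json
import re

def parse_claude_output(stdout: str):
    """Extract a structured result from Claude Code's headless JSON output.
    Tries structured_output, then result (which may itself contain JSON),
    then any JSON blob embedded in the raw text; returns None on failure."""
    try:
        parsed = json.loads(stdout)
    except json.JSONDecodeError:
        # Output wasn't clean JSON; look for an embedded JSON blob.
        match = re.search(r"\{[\s\S]*\}", stdout)
        if not match:
            return None
        try:
            return json.loads(match.group(0))
        except json.JSONDecodeError:
            return None
    if isinstance(parsed, dict):
        if "structured_output" in parsed:
            return parsed["structured_output"]
        if "result" in parsed:
            try:
                return json.loads(parsed["result"])
            except (json.JSONDecodeError, TypeError):
                return parsed  # result was free text; return the envelope
    return parsed
```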
### 9. Save Report to NAS
Saves full troubleshooting report to:
```
/mnt/nas/claude-reports/YYYY-MM/TIMESTAMP.json
```
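The same path construction in Python (the workflow's shell command builds it with `date` and string replacement; the function name here is illustrative):

```python
from datetime import datetime, timezone
from pathlib import Path

def report_path(base: str, ts: datetime) -> Path:
    """Reports bucket by YYYY-MM; colons are stripped from the
    ISO timestamp so the filename is safe on any filesystem."""
    name = ts.isoformat().replace(":", "-") + ".json"
    return Path(base) / ts.strftime("%Y-%m") / name
```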
### 10. Discord Notification
Sends formatted message to Discord channel.
## Discord Webhook Configuration
### Message Format
```javascript
// Discord webhook body (in HTTP Request node)
{
"content": null,
"embeds": [
{
"title": "{{ $json.troubleshooting.severity === 'critical' ? '🔴' : $json.troubleshooting.severity === 'high' ? '🟠' : $json.troubleshooting.severity === 'medium' ? '🟡' : '🟢' }} Server Alert",
"description": "Automated troubleshooting completed",
"color": {{ $json.troubleshooting.severity === 'critical' ? 15158332 : $json.troubleshooting.severity === 'high' ? 15105570 : 16776960 }},
"fields": [
{
"name": "Root Cause",
"value": "{{ $json.troubleshooting.root_cause }}",
"inline": false
},
{
"name": "Affected Services",
"value": "{{ $json.troubleshooting.affected_services.join(', ') }}",
"inline": true
},
{
"name": "Severity",
"value": "{{ $json.troubleshooting.severity.toUpperCase() }}",
"inline": true
},
{
"name": "Auto-Remediation",
"value": "{{ $json.troubleshooting.remediation_performed || 'None performed' }}",
"inline": false
},
{
"name": "Recommended Actions",
"value": "{{ $json.troubleshooting.recommended_actions.map(a => '• ' + a.action + (a.executed ? ' ✅' : '')).join('\\n') }}",
"inline": false
}
],
"timestamp": "{{ $json.timestamp }}"
}
]
}
```
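The raw integers in the `color` expression are Discord's decimal RGB values; in hex they are easier to audit. Note the template gives `low` the same yellow as `medium`, even though its title emoji is green:

```python
# Decimal color codes from the embed template above, written as hex RGB.
SEVERITY_COLORS = {
    "critical": 0xE74C3C,  # 15158332, red
    "high": 0xE67E22,      # 15105570, orange
    "medium": 0xFFFF00,    # 16776960, yellow
    "low": 0xFFFF00,       # template falls through to yellow
}
```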
### Webhook Setup
1. In Discord, go to Server Settings → Integrations → Webhooks
2. Create new webhook in `#server-alerts` channel
3. Copy webhook URL
4. Add to N8N HTTP Request node
## Error Handling
### Claude Code Timeout
Add error handling branch after "Invoke Claude Code":
```javascript
// In error handler
if ($input.first().json.exitCode !== 0) {
return [{
json: {
success: false,
error: 'Claude Code execution failed',
stderr: $input.first().json.stderr,
fallback_action: 'Manual investigation required'
}
}];
}
```
### Retry Logic
For transient failures, add retry configuration:
```json
{
"retry": {
"enabled": true,
"maxRetries": 3,
"waitBetweenRetries": 5000
}
}
```
### Alert Deduplication (Phase 2)
Add cooldown logic to prevent alert fatigue:
```javascript
// Before triggering Claude (n8n Code node; static data persists across runs)
const staticData = $getWorkflowStaticData('global');
const lastAlertKey = `last_alert_${container}`;
const cooldownMs = 5 * 60 * 1000; // 5 minutes
if (staticData[lastAlertKey] && (Date.now() - staticData[lastAlertKey]) < cooldownMs) {
  return []; // Skip, within cooldown
}
// After sending alert
staticData[lastAlertKey] = Date.now();
```
## Credentials Required
### N8N Environment
1. **SSH Key to Claude LXC:** `/root/.ssh/claude_lxc_key`
2. **Discord Webhook URL:** Stored in N8N credentials
### Claude LXC Environment
1. **Max Subscription Auth:** Authenticated via device code flow (credentials in `~/.claude/`)
2. **SSH Key to Proxmox:** `~/.ssh/claude_diagnostics_key`
### Handling Auth Expiry
Add error detection for authentication failures:
```javascript
// In Parse Claude Response node, detect auth errors
if (stderr.includes('authenticate') || stderr.includes('unauthorized')) {
return [{
json: {
success: false,
error: 'Claude Code authentication expired',
action_required: 'SSH to Claude LXC and run: claude (to re-authenticate)',
severity: 'critical'
}
}];
}
```
## Testing the Workflow
### Manual Trigger Test
1. In N8N, click "Execute Workflow" on the workflow
2. Check each node's output
3. Verify Discord message received
### Simulated Failure Test
```bash
# Stop a container on Proxmox host
docker stop tdarr
# Wait for next health check (1 minute)
# Verify:
# 1. Claude Code invoked
# 2. Discord notification sent
# 3. Report saved to NAS
# 4. Container restarted (if remediation enabled)
```
### End-to-End Verification
```bash
# Check N8N execution history
# - Go to N8N UI → Executions
# - Review successful and failed runs
# Check NAS for reports
ls -la /mnt/nas/claude-reports/$(date +%Y-%m)/
# Check Discord channel for notifications
```
## Performance Considerations
1. **Schedule Interval:** 1 minute may be aggressive. Consider 5 minutes for production.
2. **Claude Code Timeout:** 180 seconds is generous. Most diagnoses should complete in 30-60 seconds.
3. **Log Size:** Limit logs passed to Claude to avoid token limits (2000 chars in template).
4. **Parallel Execution:** Ensure the workflow handles overlapping runs — skip or queue a new run if the previous execution is still in progress.
## Security Notes
1. **SSH Key Permissions:** Ensure keys are 600 permissions
2. **Webhook URL:** Don't expose in logs or error messages
3. **NAS Mount:** Ensure proper permissions on report directory
4. **API Key:** Never log or expose ANTHROPIC_API_KEY
## Monitoring the Workflow
### N8N Execution Metrics
- Successful executions per hour
- Failed executions per hour
- Average execution time
- Claude Code invocation count
### Alert on Workflow Failure
Create a second workflow that monitors the primary workflow's health:
```
Trigger: Every 15 minutes
Check: Last successful execution within 20 minutes
Alert: If no recent success, send Discord alert about monitoring failure
```
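The watchdog's staleness rule reduces to a single comparison; a sketch (the 20-minute threshold allows one missed 15-minute cycle plus slack; the function name is illustrative):

```python
from datetime import datetime, timedelta

def monitor_is_stale(last_success: datetime, now: datetime,
                     max_age: timedelta = timedelta(minutes=20)) -> bool:
    """True if the primary workflow has not succeeded within max_age."""
    return (now - last_success) > max_age
```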

# Server Diagnostics Skill - Architecture Design
## Overview
The `server-diagnostics` skill provides automated troubleshooting capabilities for homelab infrastructure via SSH. It follows the same architectural patterns as the existing Proxmox skill: a Python client library with CLI interface, SKILL.md for context, and YAML configuration.
## Directory Structure
```
~/.claude/skills/server-diagnostics/
├── SKILL.md # Skill context, workflows, and usage instructions
├── client.py # Main Python client library with CLI
├── config.yaml # Server inventory, command whitelist, container config
├── requirements.txt # Python dependencies (paramiko, pyyaml)
└── commands/ # Modular command implementations
├── __init__.py
├── docker.py # Docker-specific diagnostics
├── system.py # System-level diagnostics (disk, memory, CPU)
└── network.py # Network diagnostics
```
## Component Details
### 1. SKILL.md
Provides context for Claude Code when troubleshooting. Key sections:
```markdown
---
name: server-diagnostics
description: Automated server troubleshooting for Docker containers and system health.
Provides SSH-based diagnostics, log reading, metrics collection, and low-risk
remediation. USE WHEN N8N triggers troubleshooting, container issues detected,
or system health checks needed.
---
# Server Diagnostics - Automated Troubleshooting
## When to Activate This Skill
- N8N triggers with error context
- "diagnose container X", "check docker status"
- "read logs from server", "check disk usage"
- "troubleshoot server issue"
- Any automated health check response
## Quick Start
[Examples of common operations]
## Troubleshooting Workflow
[Step-by-step diagnostic process]
## MemoryGraph Integration
[How to recall/store troubleshooting patterns]
## Security Constraints
[Whitelist/deny list documentation]
```
### 2. client.py - Main Client Library
```python
#!/usr/bin/env python3
"""
Server Diagnostics Client Library
Provides SSH-based diagnostics for homelab troubleshooting
"""
import json
import subprocess
from pathlib import Path
from typing import Any, Literal
import yaml
class SecurityError(Exception):
    """Raised when a command matches a denied pattern."""


class ServerDiagnostics:
"""
Main diagnostic client for server troubleshooting.
Connects to servers via SSH and executes whitelisted diagnostic
commands. Enforces security constraints from config.yaml.
"""
def __init__(self, config_path: str | None = None):
"""
Initialize with configuration.
Args:
config_path: Path to config.yaml. Defaults to same directory.
"""
if config_path is None:
config_path = Path(__file__).parent / "config.yaml"
self.config = self._load_config(config_path)
self.servers = self.config.get("servers", {})
self.containers = self.config.get("docker_containers", [])
self.allowed_commands = self.config.get("diagnostic_commands", {})
self.remediation_commands = self.config.get("remediation_commands", {})
self.denied_patterns = self.config.get("denied_patterns", [])
def _load_config(self, path: str | Path) -> dict:
"""Load YAML configuration."""
with open(path) as f:
return yaml.safe_load(f)
def _validate_command(self, command: str) -> bool:
"""Check command against deny list."""
for pattern in self.denied_patterns:
if pattern in command:
raise SecurityError(f"Command contains denied pattern: {pattern}")
return True
def _ssh_exec(self, server: str, command: str) -> dict:
"""
Execute command on remote server via SSH.
Returns:
dict with 'stdout', 'stderr', 'returncode'
"""
self._validate_command(command)
server_config = self.servers.get(server)
if not server_config:
raise ValueError(f"Unknown server: {server}")
ssh_cmd = [
"ssh",
"-i", server_config["ssh_key"],
"-o", "StrictHostKeyChecking=no",
"-o", "ConnectTimeout=10",
f"{server_config['ssh_user']}@{server_config['hostname']}",
command
]
result = subprocess.run(
ssh_cmd,
capture_output=True,
text=True,
timeout=60
)
return {
"stdout": result.stdout,
"stderr": result.stderr,
"returncode": result.returncode,
"success": result.returncode == 0
}
# === Docker Operations ===
def get_docker_status(self, server: str, container: str | None = None) -> dict:
"""
Get Docker container status.
Args:
server: Server identifier from config
container: Specific container name (optional, all if not specified)
Returns:
dict with container statuses
"""
if container:
cmd = f"docker inspect --format '{{{{json .State}}}}' {container}"
else:
cmd = "docker ps -a --format 'json'"
result = self._ssh_exec(server, cmd)
if result["success"]:
try:
if container:
result["data"] = json.loads(result["stdout"])
else:
# Parse newline-delimited JSON
result["data"] = [
json.loads(line)
for line in result["stdout"].strip().split("\n")
if line
]
except json.JSONDecodeError:
result["data"] = None
return result
def docker_logs(self, server: str, container: str,
lines: int = 100, filter: str | None = None) -> dict:
"""
Get Docker container logs.
Args:
server: Server identifier
container: Container name
lines: Number of lines to retrieve
filter: Optional grep filter pattern
Returns:
dict with log output
"""
cmd = f"docker logs --tail {lines} {container} 2>&1"
if filter:
cmd += f" | grep -i '{filter}'"
return self._ssh_exec(server, cmd)
def docker_restart(self, server: str, container: str) -> dict:
"""
Restart a Docker container (low-risk remediation).
Args:
server: Server identifier
container: Container name
Returns:
dict with operation result
"""
# Check if container is allowed to be restarted
container_config = next(
(c for c in self.containers if c["name"] == container),
None
)
if not container_config:
return {
"success": False,
"error": f"Container {container} not in monitored list"
}
if not container_config.get("restart_allowed", False):
return {
"success": False,
"error": f"Container {container} restart not permitted"
}
cmd = f"docker restart {container}"
result = self._ssh_exec(server, cmd)
result["action"] = "docker_restart"
result["container"] = container
return result
# === System Diagnostics ===
def get_metrics(self, server: str,
metric_type: Literal["cpu", "memory", "disk", "network", "all"] = "all"
) -> dict:
"""
Get system metrics from server.
Args:
server: Server identifier
metric_type: Type of metrics to retrieve
Returns:
dict with metric data
"""
metrics = {}
if metric_type in ("cpu", "all"):
result = self._ssh_exec(server, self.allowed_commands["cpu_usage"])
metrics["cpu"] = result
if metric_type in ("memory", "all"):
result = self._ssh_exec(server, self.allowed_commands["memory_usage"])
metrics["memory"] = result
if metric_type in ("disk", "all"):
result = self._ssh_exec(server, self.allowed_commands["disk_usage"])
metrics["disk"] = result
if metric_type in ("network", "all"):
result = self._ssh_exec(server, self.allowed_commands["network_status"])
metrics["network"] = result
return {"server": server, "metrics": metrics}
def read_logs(self, server: str,
log_type: Literal["system", "docker", "application", "custom"],
lines: int = 100,
filter: str | None = None,
custom_path: str | None = None) -> dict:
"""
Read logs from server.
Args:
server: Server identifier
log_type: Type of log to read
lines: Number of lines
filter: Optional grep pattern
custom_path: Path for custom log type
Returns:
dict with log content
"""
log_paths = {
"system": "/var/log/syslog",
"docker": "/var/log/docker.log",
"application": "/var/log/application.log",
}
path = custom_path if log_type == "custom" else log_paths.get(log_type)
if not path:
return {"success": False, "error": f"Unknown log type: {log_type}"}
cmd = f"tail -n {lines} {path}"
if filter:
cmd += f" | grep -i '{filter}'"
return self._ssh_exec(server, cmd)
def run_diagnostic(self, server: str,
command: str,
params: dict | None = None) -> dict:
"""
Run a whitelisted diagnostic command.
Args:
server: Server identifier
command: Command key from config whitelist
params: Optional parameters to substitute
Returns:
dict with command output
"""
if command not in self.allowed_commands:
return {
"success": False,
"error": f"Command '{command}' not in whitelist"
}
cmd = self.allowed_commands[command]
# Substitute parameters if provided
if params:
for key, value in params.items():
cmd = cmd.replace(f"{{{key}}}", str(value))
return self._ssh_exec(server, cmd)
# === Convenience Methods ===
def quick_health_check(self, server: str) -> dict:
"""
Perform quick health check on server.
Returns summary of Docker containers, disk, and memory.
"""
health = {
"server": server,
"docker": self.get_docker_status(server),
"metrics": self.get_metrics(server, "all"),
"healthy": True,
"issues": []
}
# Check for stopped containers
if health["docker"].get("data"):
for container in health["docker"]["data"]:
status = container.get("State", container.get("Status", ""))
if "Up" not in str(status) and "running" not in str(status).lower():
health["healthy"] = False
health["issues"].append(
f"Container {container.get('Names', 'unknown')} is not running"
)
return health
def to_json(self, data: Any) -> str:
"""Convert result to JSON string."""
return json.dumps(data, indent=2, default=str)
class SecurityError(Exception):
"""Raised when a command violates security constraints."""
pass
def main():
"""CLI interface for server diagnostics."""
import argparse
parser = argparse.ArgumentParser(
description="Server Diagnostics CLI",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
%(prog)s docker-status proxmox-host
%(prog)s docker-status proxmox-host --container tdarr
%(prog)s docker-logs proxmox-host tdarr --lines 200
%(prog)s docker-restart proxmox-host tdarr
%(prog)s metrics proxmox-host --type all
%(prog)s logs proxmox-host --type system --lines 50
%(prog)s health proxmox-host
%(prog)s diagnostic proxmox-host disk_usage
"""
)
subparsers = parser.add_subparsers(dest="command", required=True)
# docker-status
p_docker = subparsers.add_parser("docker-status", help="Get Docker container status")
p_docker.add_argument("server", help="Server identifier")
p_docker.add_argument("--container", "-c", help="Specific container name")
# docker-logs
p_logs = subparsers.add_parser("docker-logs", help="Get Docker container logs")
p_logs.add_argument("server", help="Server identifier")
p_logs.add_argument("container", help="Container name")
p_logs.add_argument("--lines", "-n", type=int, default=100, help="Number of lines")
p_logs.add_argument("--filter", "-f", help="Grep filter pattern")
# docker-restart
p_restart = subparsers.add_parser("docker-restart", help="Restart Docker container")
p_restart.add_argument("server", help="Server identifier")
p_restart.add_argument("container", help="Container name")
# metrics
p_metrics = subparsers.add_parser("metrics", help="Get system metrics")
p_metrics.add_argument("server", help="Server identifier")
p_metrics.add_argument("--type", "-t", default="all",
choices=["cpu", "memory", "disk", "network", "all"],
help="Metric type")
# logs
p_syslogs = subparsers.add_parser("logs", help="Read system logs")
p_syslogs.add_argument("server", help="Server identifier")
p_syslogs.add_argument("--type", "-t", default="system",
choices=["system", "docker", "application", "custom"],
help="Log type")
p_syslogs.add_argument("--lines", "-n", type=int, default=100, help="Number of lines")
p_syslogs.add_argument("--filter", "-f", help="Grep filter pattern")
p_syslogs.add_argument("--path", help="Custom log path (for type=custom)")
# health
p_health = subparsers.add_parser("health", help="Quick health check")
p_health.add_argument("server", help="Server identifier")
# diagnostic
    p_diag = subparsers.add_parser("diagnostic", help="Run whitelisted diagnostic")
    p_diag.add_argument("server", help="Server identifier")
    # Named "diag_command" (not "command") so it doesn't clobber the
    # subparser dest and break the dispatch below
    p_diag.add_argument("diag_command", help="Command from whitelist")
    p_diag.add_argument("--params", "-p", help="JSON parameters for command substitution")
    args = parser.parse_args()
    client = ServerDiagnostics()
    if args.command == "docker-status":
        result = client.get_docker_status(args.server, args.container)
    elif args.command == "docker-logs":
        result = client.docker_logs(
            args.server, args.container, args.lines, args.filter
        )
    elif args.command == "docker-restart":
        result = client.docker_restart(args.server, args.container)
    elif args.command == "metrics":
        result = client.get_metrics(args.server, args.type)
    elif args.command == "logs":
        result = client.read_logs(
            args.server, args.type, args.lines, args.filter, args.path
        )
    elif args.command == "health":
        result = client.quick_health_check(args.server)
    elif args.command == "diagnostic":
        params = json.loads(args.params) if args.params else None
        result = client.run_diagnostic(args.server, args.diag_command, params)
print(client.to_json(result))
if __name__ == "__main__":
main()
```
### 3. config.yaml
```yaml
# Server Diagnostics Configuration
# Used by client.py for server inventory and security constraints
# Server inventory - SSH connection details
servers:
proxmox-host:
hostname: 10.10.0.11 # Update with actual IP
ssh_user: root
ssh_key: ~/.ssh/claude_diagnostics_key
description: "Main Proxmox host running Docker services"
# Docker containers to monitor
# restart_allowed: false prevents automatic remediation
docker_containers:
- name: tdarr
critical: true
restart_allowed: true
description: "Media transcoding service"
- name: portainer
critical: true
restart_allowed: true
description: "Docker management UI"
- name: n8n
critical: true
restart_allowed: false # Never restart - it triggers us!
description: "Workflow automation"
- name: plex
critical: true
restart_allowed: true
description: "Media server"
# Whitelisted diagnostic commands
# These are the ONLY commands that can be executed
diagnostic_commands:
disk_usage: "df -h"
memory_usage: "free -h"
cpu_usage: "top -bn1 | head -20"
cpu_load: "uptime"
process_list: "ps aux --sort=-%mem | head -20"
process_tree: "pstree -p"
network_status: "ss -tuln"
network_connections: "netstat -an | head -50"
docker_ps: "docker ps -a --format 'table {{.Names}}\\t{{.Status}}\\t{{.Ports}}'"
docker_stats: "docker stats --no-stream --format 'table {{.Name}}\\t{{.CPUPerc}}\\t{{.MemUsage}}'"
service_status: "systemctl status {service}"
journal_errors: "journalctl -p err -n 50 --no-pager"
port_check: "nc -zv {host} {port}"
dns_check: "dig +short {domain}"
ping_check: "ping -c 3 {host}"
# Remediation commands (low-risk only)
remediation_commands:
docker_restart: "docker restart {container}"
docker_logs: "docker logs --tail 500 {container}"
service_restart: "systemctl restart {service}" # Phase 2
# DENIED patterns - commands containing these will be rejected
# This is a security safeguard
denied_patterns:
- "rm -rf"
- "rm -r /"
- "dd if="
- "mkfs"
- ":(){:|:&};:"
- "shutdown"
- "reboot"
- "init 0"
- "init 6"
- "systemctl stop"
- "> /dev/sd"
- "chmod 777"
- "wget|sh"
- "curl|sh"
- "eval"
- "$(("
- "` `"
# Logging configuration
logging:
enabled: true
path: ~/.claude/logs/server-diagnostics.log
max_size_mb: 10
backup_count: 5
```
### 4. SKILL.md Full Content
See separate file: `SKILL.md` in the skill directory.
Key sections:
- Activation triggers
- Quick start with Python and CLI examples
- Troubleshooting workflow (step-by-step)
- MemoryGraph integration instructions
- Security constraints documentation
- Common error patterns and solutions
## Integration Points
### With Proxmox Skill
The server-diagnostics skill can leverage the Proxmox skill for:
- VM/LXC lifecycle operations (restarting the LXC that hosts Docker)
- Resource monitoring at hypervisor level
- Snapshot creation before risky operations
```python
# Example integration
from proxmox_client import ProxmoxClient
from server_diagnostics.client import ServerDiagnostics
proxmox = ProxmoxClient()
diag = ServerDiagnostics()
# Check if container needs VM-level intervention
result = diag.docker_restart("proxmox-host", "tdarr")
if not result["success"]:
    # Escalate to VM level (lxc_id is the Docker host's LXC ID from Proxmox inventory)
    proxmox.restart_container(lxc_id)
```
### With MemoryGraph
```python
# In SKILL.md, instruct Claude to:
# 1. Before diagnosis - recall similar issues
# python ~/.claude/skills/memorygraph/client.py recall "docker tdarr timeout"
# 2. After resolution - store the pattern
# python ~/.claude/skills/memorygraph/client.py store \
# --type solution \
# --title "Tdarr container memory exhaustion" \
# --content "Container exceeded memory limit due to large transcode queue..." \
# --tags "docker,tdarr,memory,troubleshooting" \
# --importance 0.7
```
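These CLI calls can also be issued programmatically from a helper. A minimal sketch that only builds the `store` invocation (the memorygraph client path and flags are taken from the comments above; in practice the list would be passed to `subprocess.run`):

```python
import shlex
from typing import List

# Path taken from the memorygraph examples above
MEMORYGRAPH = "~/.claude/skills/memorygraph/client.py"

def build_store_cmd(title: str, content: str, tags: str,
                    importance: float) -> List[str]:
    """Build (but do not run) the memorygraph 'store' invocation."""
    return ["python3", MEMORYGRAPH, "store",
            "--type", "solution",
            "--title", title,
            "--content", content,
            "--tags", tags,
            "--importance", str(importance)]

cmd = build_store_cmd("Tdarr container memory exhaustion",
                      "Container exceeded memory limit due to large transcode queue",
                      "docker,tdarr,memory,troubleshooting", 0.7)
print(shlex.join(cmd))  # ready for subprocess.run(cmd)
```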
### With N8N
N8N invokes the skill via headless Claude Code:
```bash
claude -p "
You are troubleshooting a server issue. Use the server-diagnostics skill.
Server: proxmox-host
Error Type: container_stopped
Container: tdarr
Timestamp: 2025-12-19T14:30:00Z
Use the diagnostic client to investigate and resolve if possible.
" --output-format json --allowedTools "Read,Bash,Grep,Glob"
```
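Whatever consumes this output should follow the parsing rule from the CLI notes at the top of this document: prefer `structured_output`, fall back to free-text `result`. A minimal Python sketch (the sample payload is illustrative, not captured output):

```python
import json

def parse_claude_response(stdout: str) -> dict:
    """Parse headless Claude JSON output, preferring structured_output."""
    response = json.loads(stdout)
    data = response.get("structured_output")
    if data is None:
        try:
            # result may itself be a JSON string when no schema was given
            data = json.loads(response.get("result") or "{}")
        except json.JSONDecodeError:
            data = {"summary": response.get("result", "")}
    return {
        "status": data.get("status", "unknown"),
        "summary": data.get("summary", ""),
        "severity": data.get("severity", "low"),
        "session_id": response.get("session_id"),
    }

# Illustrative payload shaped like the --json-schema in the command above
sample = json.dumps({
    "is_error": False,
    "session_id": "abc-123",
    "structured_output": {
        "status": "healthy", "severity": "low", "summary": "All systems healthy"
    },
})
print(parse_claude_response(sample)["status"])  # → healthy
```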
## Security Model
### Three-Layer Protection
1. **settings.json** - Claude Code level allow/deny lists
2. **config.yaml** - Skill-level command whitelist and denied patterns
3. **Container config** - Per-container restart permissions
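Layer 2 takes only a few lines to enforce in `client.py`. A minimal sketch — the pattern list is an abbreviated subset of `denied_patterns` above, and `SecurityError` is a local copy of the class defined in client.py:

```python
# Abbreviated from denied_patterns in config.yaml (illustrative subset)
DENIED_PATTERNS = ["rm -rf", "mkfs", "shutdown", "reboot", "dd if="]

class SecurityError(Exception):
    """Raised when a command violates security constraints."""

def validate_command(key: str, allowed_commands: dict) -> str:
    """Resolve a whitelisted command key, rejecting denied patterns."""
    if key not in allowed_commands:
        raise SecurityError(f"Command '{key}' not in whitelist")
    resolved = allowed_commands[key]
    for pattern in DENIED_PATTERNS:
        if pattern in resolved:
            raise SecurityError(f"Denied pattern '{pattern}' in '{key}'")
    return resolved

print(validate_command("disk_usage", {"disk_usage": "df -h"}))  # → df -h
```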
### Audit Trail
All operations are logged:
- Skill logs to `~/.claude/logs/server-diagnostics.log`
- MemoryGraph entries for significant troubleshooting
- N8N execution history
- NAS report storage
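The skill log can honor the rotation settings in config.yaml (`max_size_mb: 10`, `backup_count: 5`) with the standard library alone; a sketch:

```python
import logging
from logging.handlers import RotatingFileHandler
from pathlib import Path

def build_audit_logger(path: str = "~/.claude/logs/server-diagnostics.log",
                       max_size_mb: int = 10,
                       backup_count: int = 5) -> logging.Logger:
    """Rotating audit logger matching the logging block in config.yaml."""
    log_path = Path(path).expanduser()
    log_path.parent.mkdir(parents=True, exist_ok=True)
    handler = RotatingFileHandler(
        log_path, maxBytes=max_size_mb * 1024 * 1024, backupCount=backup_count
    )
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger = logging.getLogger("server-diagnostics")
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    return logger

# Usage: log each remediation so the audit trail captures what ran, where
# logger = build_audit_logger()
# logger.info("docker_restart server=paper-dynasty container=adminer success=True")
```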
## Testing Strategy
### Unit Tests
```bash
# Test command validation
python -m pytest tests/test_security.py
# Test SSH mocking
python -m pytest tests/test_ssh.py
```
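A sketch of what `tests/test_security.py` could assert. `ServerDiagnostics` is stubbed here so the tests run standalone; the real suite would import the class from `client.py` and mock `_ssh_exec` instead:

```python
# Sketch of tests/test_security.py with a stand-in class (no real SSH).

class ServerDiagnostics:  # minimal stub mirroring client.py behavior
    allowed_commands = {"disk_usage": "df -h"}
    containers = [{"name": "n8n", "restart_allowed": False}]

    def _ssh_exec(self, server: str, cmd: str) -> dict:
        return {"success": True, "stdout": ""}  # no SSH in unit tests

    def run_diagnostic(self, server, command, params=None):
        if command not in self.allowed_commands:
            return {"success": False,
                    "error": f"Command '{command}' not in whitelist"}
        return self._ssh_exec(server, self.allowed_commands[command])

    def docker_restart(self, server, container):
        cfg = next((c for c in self.containers if c["name"] == container), None)
        if not cfg or not cfg.get("restart_allowed", False):
            return {"success": False,
                    "error": f"Container {container} restart not permitted"}
        return self._ssh_exec(server, f"docker restart {container}")

def test_non_whitelisted_command_rejected():
    assert ServerDiagnostics().run_diagnostic("pd", "rm -rf /")["success"] is False

def test_n8n_restart_blocked():
    # n8n has restart_allowed: false -- it triggers these checks
    assert ServerDiagnostics().docker_restart("pd", "n8n")["success"] is False

test_non_whitelisted_command_rejected()
test_n8n_restart_blocked()
```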
### Integration Tests
```bash
# Test against real server (requires SSH access)
python client.py health proxmox-host
python client.py docker-status proxmox-host
python client.py diagnostic proxmox-host disk_usage
```
### Simulated Failures
```bash
# Stop a container and verify detection
docker stop tdarr
python client.py health proxmox-host # Should show issue
python client.py docker-restart proxmox-host tdarr # Should restart
```
## File Relationships
```
┌─────────────────────────────────────────────────────────────┐
│ Claude Code Session │
│ │
│ Loads: SKILL.md (context) │
│ ↓ │
│ Executes: python client.py <command>                        │
│ ↓ │
│ Reads: config.yaml (server inventory, whitelist) │
│ ↓ │
│ Connects: SSH to servers │
│ ↓ │
│ Returns: JSON output to Claude Code │
│ ↓ │
│ Stores: MemoryGraph (learnings) │
│ │
└─────────────────────────────────────────────────────────────┘
```
## Next Steps
1. Create actual skill files in `~/.claude/skills/server-diagnostics/`
2. Generate SSH key pair for diagnostics
3. Install key on Proxmox host
4. Test basic connectivity
5. Integrate with N8N workflow

@@ -0,0 +1,170 @@
{
"name": "Server Health Monitor - Claude Code",
"nodes": [
{
"parameters": {
"rule": {
"interval": [
{
"field": "minutes",
"minutesInterval": 5
}
]
}
},
"id": "schedule-trigger",
"name": "Every 5 Minutes",
"type": "n8n-nodes-base.scheduleTrigger",
"typeVersion": 1.2,
"position": [0, 0]
},
{
"parameters": {
"operation": "executeCommand",
"command": "/root/.local/bin/claude -p \"Run python3 ~/.claude/skills/server-diagnostics/client.py health paper-dynasty and analyze the results. If any containers are not running or there are critical issues, summarize them. Otherwise just say 'All systems healthy'.\" --output-format json --json-schema '{\"type\":\"object\",\"properties\":{\"status\":{\"type\":\"string\",\"enum\":[\"healthy\",\"issues_found\"]},\"summary\":{\"type\":\"string\"},\"root_cause\":{\"type\":\"string\"},\"severity\":{\"type\":\"string\",\"enum\":[\"low\",\"medium\",\"high\",\"critical\"]},\"affected_services\":{\"type\":\"array\",\"items\":{\"type\":\"string\"}},\"actions_taken\":{\"type\":\"array\",\"items\":{\"type\":\"string\"}}},\"required\":[\"status\",\"severity\",\"summary\"]}' --allowedTools \"Read,Grep,Glob,Bash(python3 ~/.claude/skills/server-diagnostics/client.py *)\"",
"options": {}
},
"id": "ssh-claude-code",
"name": "Run Claude Diagnostics",
"type": "n8n-nodes-base.ssh",
"typeVersion": 1,
"position": [220, 0],
"credentials": {
"sshPassword": {
"id": "REPLACE_WITH_CREDENTIAL_ID",
"name": "Claude Code LXC"
}
}
},
{
"parameters": {
"jsCode": "// Parse Claude Code JSON output (uses --json-schema for structured_output)\nconst stdout = $input.first().json.stdout || '';\n\ntry {\n const response = JSON.parse(stdout);\n const data = response.structured_output || JSON.parse(response.result || '{}');\n \n return [{\n json: {\n success: !response.is_error,\n status: data.status || 'unknown',\n summary: data.summary || response.result || 'No result',\n severity: data.severity || 'low',\n root_cause: data.root_cause || null,\n affected_services: data.affected_services || [],\n actions_taken: data.actions_taken || [],\n duration_ms: response.duration_ms,\n cost_usd: response.total_cost_usd,\n session_id: response.session_id,\n has_issues: data.status === 'issues_found',\n raw: response\n }\n }];\n} catch (e) {\n return [{\n json: {\n success: false,\n status: 'error',\n summary: 'Failed to parse Claude response',\n severity: 'high',\n error: e.message,\n raw_stdout: stdout,\n has_issues: true\n }\n }];\n}"
},
"id": "parse-response",
"name": "Parse Claude Response",
"type": "n8n-nodes-base.code",
"typeVersion": 2,
"position": [440, 0]
},
{
"parameters": {
"conditions": {
"options": {
"caseSensitive": true,
"leftValue": "",
"typeValidation": "strict"
},
"conditions": [
{
"id": "check-issues",
"leftValue": "={{ $json.has_issues }}",
"rightValue": true,
"operator": {
"type": "boolean",
"operation": "equals"
}
}
],
"combinator": "and"
},
"options": {}
},
"id": "check-issues",
"name": "Has Issues?",
"type": "n8n-nodes-base.if",
"typeVersion": 2,
"position": [660, 0]
},
{
"parameters": {
"method": "POST",
"url": "https://discord.com/api/webhooks/1451783909409816763/O9PMDiNt6ZIWRf8HKocIZ_E4vMGV_lEwq50aAiZ9HVFR2UGwO6J1N9_wOm82p0MetIqT",
"sendBody": true,
"specifyBody": "json",
"jsonBody": "={\n \"embeds\": [{\n \"title\": \"{{ $json.severity === 'critical' ? '🔴' : $json.severity === 'high' ? '🟠' : '🟡' }} Server Alert\",\n \"description\": {{ JSON.stringify($json.summary) }},\n \"color\": {{ $json.severity === 'critical' ? 15158332 : $json.severity === 'high' ? 15105570 : 16776960 }},\n \"fields\": [\n {\n \"name\": \"Severity\",\n \"value\": \"{{ $json.severity.toUpperCase() }}\",\n \"inline\": true\n },\n {\n \"name\": \"Server\",\n \"value\": \"paper-dynasty (10.10.0.88)\",\n \"inline\": true\n },\n {\n \"name\": \"Cost\",\n \"value\": \"${{ $json.cost_usd ? $json.cost_usd.toFixed(4) : '0.0000' }}\",\n \"inline\": true\n },\n {\n \"name\": \"Root Cause\",\n \"value\": \"{{ $json.root_cause || 'N/A' }}\",\n \"inline\": false\n },\n {\n \"name\": \"Affected Services\",\n \"value\": \"{{ $json.affected_services.length ? $json.affected_services.join(', ') : 'None' }}\",\n \"inline\": false\n },\n {\n \"name\": \"Actions Taken\",\n \"value\": \"{{ $json.actions_taken.length ? $json.actions_taken.join('\\n') : 'None' }}\",\n \"inline\": false\n }\n ],\n \"timestamp\": \"{{ new Date().toISOString() }}\"\n }]\n}",
"options": {}
},
"id": "discord-alert",
"name": "Discord Alert",
"type": "n8n-nodes-base.httpRequest",
"typeVersion": 4.2,
"position": [880, -100]
},
{
"parameters": {
"method": "POST",
"url": "https://discord.com/api/webhooks/1451783909409816763/O9PMDiNt6ZIWRf8HKocIZ_E4vMGV_lEwq50aAiZ9HVFR2UGwO6J1N9_wOm82p0MetIqT",
"sendBody": true,
"specifyBody": "json",
        "jsonBody": "={\n \"embeds\": [{\n \"title\": \"Health Check OK\",\n \"description\": {{ JSON.stringify($json.summary) }},\n \"color\": 3066993,\n \"fields\": [\n {\n \"name\": \"Server\",\n \"value\": \"paper-dynasty (10.10.0.88)\",\n \"inline\": true\n },\n {\n \"name\": \"Duration\",\n \"value\": \"{{ $json.duration_ms }}ms\",\n \"inline\": true\n }\n ],\n \"timestamp\": \"{{ new Date().toISOString() }}\"\n }]\n}",
"options": {}
},
"id": "discord-ok",
"name": "Discord OK (Optional)",
"type": "n8n-nodes-base.httpRequest",
"typeVersion": 4.2,
"position": [880, 100],
"disabled": true
}
],
"connections": {
"Every 5 Minutes": {
"main": [
[
{
"node": "Run Claude Diagnostics",
"type": "main",
"index": 0
}
]
]
},
"Run Claude Diagnostics": {
"main": [
[
{
"node": "Parse Claude Response",
"type": "main",
"index": 0
}
]
]
},
"Parse Claude Response": {
"main": [
[
{
"node": "Has Issues?",
"type": "main",
"index": 0
}
]
]
},
"Has Issues?": {
"main": [
[
{
"node": "Discord Alert",
"type": "main",
"index": 0
}
],
[
{
"node": "Discord OK (Optional)",
"type": "main",
"index": 0
}
]
]
}
},
"settings": {
"executionOrder": "v1"
},
"staticData": null,
"tags": [],
"triggerCount": 0,
"pinData": {}
}

@@ -0,0 +1,381 @@
{
"project": {
"name": "N8N-to-Claude Code Automated Server Troubleshooting",
"version": "1.0.0",
"created": "2025-12-19",
"updated": "2025-12-20"
},
"phases": [
{
"id": "phase-1",
"name": "Foundation",
"description": "Core infrastructure and basic Docker diagnostics",
"status": "completed",
"completed_date": "2025-12-20"
},
{
"id": "phase-2",
"name": "Enhancement",
"description": "Complete diagnostic library and system-level monitoring",
"status": "pending"
},
{
"id": "phase-3",
"name": "Expansion",
"description": "Full homelab coverage and Proxmox integration",
"status": "pending"
}
],
"infrastructure": {
"claude_code_lxc": {
"ct_id": 300,
"hostname": "claude-code",
"ip": "10.10.0.148",
"os": "Ubuntu 20.04",
"resources": "2 vCPU, 2GB RAM, 16GB disk",
"claude_version": "2.0.74",
"auth_method": "Max subscription (OAuth)"
},
"n8n": {
"ct_id": 210,
"hostname": "docker-n8n-lxc",
"ip": "10.10.0.210",
"deployment": "Docker container in LXC"
},
"target_servers": [
{
"name": "paper-dynasty",
"ip": "10.10.0.88",
"ssh_user": "cal",
"description": "Paper Dynasty Discord bots and services",
"containers": [
"paper-dynasty_discord-app_1",
"paper-dynasty_db_1",
"paper-dynasty_adminer_1",
"sba-website_sba-web_1",
"sba-ghost_sba-ghost_1"
]
}
],
"notifications": {
"discord_webhook": "configured",
"channel": "server-alerts"
}
},
"tasks": [
{
"id": "lxc-provision",
"name": "Provision Claude Code LXC",
"description": "Create dedicated LXC container on Proxmox for Claude Code (2 vCPU, 2GB RAM, 16GB disk)",
"phase": "phase-1",
"dependencies": [],
"completed": true,
"tested": true,
"completed_date": "2025-12-20",
"notes": "CT 300, Ubuntu 20.04 (22.04 template failed)"
},
{
"id": "lxc-packages",
"name": "Install LXC Dependencies",
"description": "Install Node.js, Python 3.8, and other required packages on the LXC",
"phase": "phase-1",
"dependencies": ["lxc-provision"],
"completed": true,
"tested": true,
"completed_date": "2025-12-20",
"notes": "Python 3.8 (Ubuntu 20.04 default), PyYAML installed"
},
{
"id": "claude-code-install",
"name": "Install Claude Code CLI",
"description": "Install Claude Code CLI using native installer and authenticate with Max subscription",
"phase": "phase-1",
"dependencies": ["lxc-packages"],
"completed": true,
"tested": true,
"completed_date": "2025-12-20",
"notes": "v2.0.74 installed at /root/.local/bin/claude, OAuth device code auth"
},
{
"id": "ssh-keys-generate",
"name": "Generate SSH Key Pair",
"description": "Generate dedicated SSH key pair for Claude Code diagnostics",
"phase": "phase-1",
"dependencies": ["lxc-provision"],
"completed": true,
"tested": true,
"completed_date": "2025-12-20",
"notes": "~/.ssh/claude_diagnostics_key (ed25519)"
},
{
"id": "ssh-keys-install",
"name": "Install SSH Keys on Target Server",
"description": "Add public key to Paper Dynasty server (10.10.0.88)",
"phase": "phase-1",
"dependencies": ["ssh-keys-generate"],
"completed": true,
"tested": true,
"completed_date": "2025-12-20",
"notes": "Key installed for user 'cal' on paper-dynasty server"
},
{
"id": "skill-structure",
"name": "Create Skill Directory Structure",
"description": "Create ~/.claude/skills/server-diagnostics/ with SKILL.md, client.py, config.yaml",
"phase": "phase-1",
"dependencies": ["claude-code-install"],
"completed": true,
"tested": true,
"completed_date": "2025-12-20"
},
{
"id": "skill-skillmd",
"name": "Write SKILL.md Context File",
"description": "Create comprehensive SKILL.md with troubleshooting workflows, command references, and usage instructions",
"phase": "phase-1",
"dependencies": ["skill-structure"],
"completed": true,
"tested": true,
"completed_date": "2025-12-20"
},
{
"id": "skill-config",
"name": "Write config.yaml",
"description": "Create server inventory, command whitelist, and container configuration",
"phase": "phase-1",
"dependencies": ["skill-structure"],
"completed": true,
"tested": true,
"completed_date": "2025-12-20",
"notes": "Paper Dynasty server with 5 containers configured"
},
{
"id": "skill-client",
"name": "Implement Python Diagnostic Client",
"description": "Implement ServerDiagnostics class with SSH, Docker operations, and CLI interface",
"phase": "phase-1",
"dependencies": ["skill-config", "ssh-keys-install"],
"completed": true,
"tested": true,
"completed_date": "2025-12-20",
"notes": "Python 3.8 compatible, Go template format for Docker 20.10"
},
{
"id": "settings-json",
"name": "Configure settings.json Allow/Deny Lists",
"description": "Add permission rules to Claude Code settings.json for command control",
"phase": "phase-1",
"dependencies": ["claude-code-install"],
"completed": true,
"tested": true,
"completed_date": "2025-12-20"
},
{
"id": "n8n-ssh-access",
"name": "Configure N8N SSH Access to Claude Code LXC",
"description": "Set up SSH from N8N LXC to Claude Code LXC",
"phase": "phase-1",
"dependencies": ["lxc-provision"],
"completed": true,
"tested": true,
"completed_date": "2025-12-20",
"notes": "SSH key n8n_to_claude generated and installed"
},
{
"id": "n8n-workflow",
"name": "Create N8N Health Check Workflow",
"description": "Two-stage workflow: free health check script, Claude remediation only on issues",
"phase": "phase-1",
"dependencies": ["n8n-ssh-access"],
"completed": true,
"tested": true,
"completed_date": "2025-12-20",
"notes": "Cost-optimized: $0 for healthy checks, ~$0.10-0.15 for remediation"
},
{
"id": "wrapper-scripts",
"name": "Create Wrapper Scripts",
"description": "health-check.sh (free) and remediate.sh (Claude) for N8N invocation",
"phase": "phase-1",
"dependencies": ["skill-client"],
"completed": true,
"tested": true,
"completed_date": "2025-12-20",
"notes": "Scripts handle quoting issues and provide clean interface for N8N"
},
{
"id": "discord-webhook",
"name": "Set Up Discord Webhook",
"description": "Configure Discord webhook for server alerts",
"phase": "phase-1",
"dependencies": [],
"completed": true,
"tested": true,
"completed_date": "2025-12-20"
},
{
"id": "n8n-discord",
"name": "Add Discord Notification to Workflow",
"description": "Add Discord webhook node with formatted alerts",
"phase": "phase-1",
"dependencies": ["n8n-workflow", "discord-webhook"],
"completed": true,
"tested": true,
"completed_date": "2025-12-20",
"notes": "Bullet point format (Discord doesn't support markdown tables)"
},
{
"id": "auto-remediation",
"name": "Implement Auto-Remediation",
"description": "Claude automatically restarts containers with restart_allowed: true",
"phase": "phase-1",
"dependencies": ["n8n-workflow"],
"completed": true,
"tested": true,
"completed_date": "2025-12-20",
"notes": "Successfully tested with paper-dynasty_adminer_1"
},
{
"id": "e2e-test-phase1",
"name": "End-to-End Phase 1 Test",
"description": "Stop a Docker container, verify full pipeline: detection -> diagnosis -> remediation -> notification",
"phase": "phase-1",
"dependencies": ["n8n-discord", "auto-remediation"],
"completed": true,
"tested": true,
"completed_date": "2025-12-20",
"notes": "Full pipeline working: stopped adminer -> detected -> restarted -> Discord alert"
},
{
"id": "production-activation",
"name": "Activate Workflow for Production",
"description": "Enable N8N workflow to run on 5-minute schedule",
"phase": "phase-1",
"dependencies": ["e2e-test-phase1"],
"completed": true,
"tested": true,
"completed_date": "2025-12-20",
"notes": "Running live every 5 minutes"
},
{
"id": "skill-commands-expand",
"name": "Expand Diagnostic Command Library",
"description": "Add network diagnostics, port checks, service status, and more system commands",
"phase": "phase-2",
"dependencies": ["e2e-test-phase1"],
"completed": false,
"tested": false
},
{
"id": "containers-expand",
"name": "Add More Containers to Monitoring",
"description": "Expand docker_containers list in config.yaml to include all critical containers",
"phase": "phase-2",
"dependencies": ["e2e-test-phase1"],
"completed": false,
"tested": false
},
{
"id": "multi-server",
"name": "Add Multi-Server Support",
"description": "Expand to monitor additional servers beyond Paper Dynasty",
"phase": "phase-2",
"dependencies": ["e2e-test-phase1"],
"completed": false,
"tested": false
},
{
"id": "alert-dedup",
"name": "Implement Alert Deduplication",
"description": "Add cooldown/deduplication logic to prevent alert fatigue",
"phase": "phase-2",
"dependencies": ["e2e-test-phase1"],
"completed": false,
"tested": false
},
{
"id": "memorygraph-integration",
"name": "Integrate MemoryGraph with Skill",
"description": "Add MemoryGraph recall/store calls to SKILL.md instructions for pattern learning",
"phase": "phase-2",
"dependencies": ["e2e-test-phase1"],
"completed": false,
"tested": false
},
{
"id": "nas-reports",
"name": "Configure NAS Report Storage",
"description": "Set up NAS mount and add report saving to N8N workflow",
"phase": "phase-2",
"dependencies": ["e2e-test-phase1"],
"completed": false,
"tested": false
},
{
"id": "e2e-test-phase2",
"name": "End-to-End Phase 2 Test",
"description": "Test 10+ different error scenarios for comprehensive coverage",
"phase": "phase-2",
"dependencies": ["skill-commands-expand", "containers-expand", "alert-dedup", "memorygraph-integration"],
"completed": false,
"tested": false
},
{
"id": "proxmox-skill-extend",
"name": "Extend Existing Proxmox Skill",
"description": "Integrate server-diagnostics with existing Proxmox skill for VM/LXC monitoring",
"phase": "phase-3",
"dependencies": ["e2e-test-phase2"],
"completed": false,
"tested": false
},
{
"id": "nas-archival",
"name": "Implement Report Archival and Cleanup",
"description": "Add scheduled cleanup of old reports on NAS",
"phase": "phase-3",
"dependencies": ["nas-reports"],
"completed": false,
"tested": false
},
{
"id": "documentation",
"name": "Write Documentation",
"description": "Create setup guide, runbooks, and troubleshooting documentation",
"phase": "phase-3",
"dependencies": ["e2e-test-phase2"],
"completed": false,
"tested": false
},
{
"id": "e2e-test-phase3",
"name": "End-to-End Phase 3 Test",
"description": "Full homelab coverage test with all servers and services",
"phase": "phase-3",
"dependencies": ["proxmox-skill-extend", "nas-archival", "documentation"],
"completed": false,
"tested": false
}
],
"bonuses_completed": [
{
"description": "Disabled avahi-daemon (was consuming 67+ hours CPU time)",
"date": "2025-12-20",
"server": "paper-dynasty"
},
{
"description": "Disabled gdm/GNOME desktop services (~12% CPU + 180MB RAM recovered)",
"date": "2025-12-20",
"server": "paper-dynasty"
}
],
"metadata": {
"total_tasks": 29,
"phase_1_tasks": 18,
"phase_1_completed": 18,
"phase_2_tasks": 7,
"phase_2_completed": 0,
"phase_3_tasks": 4,
"phase_3_completed": 0
}
}