Original planning folder (no git repo) for the server diagnostics system that runs on CT 300. Live deployment is on claude-runner; this preserves the Agent SDK reference, PRD with Phase 2/3 roadmap, and N8N workflow designs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Server Diagnostics Skill - Architecture Design
Overview
The server-diagnostics skill provides automated troubleshooting capabilities for homelab infrastructure via SSH. It follows the same architectural patterns as the existing Proxmox skill: a Python client library with CLI interface, SKILL.md for context, and YAML configuration.
Directory Structure
~/.claude/skills/server-diagnostics/
├── SKILL.md              # Skill context, workflows, and usage instructions
├── client.py             # Main Python client library with CLI
├── config.yaml           # Server inventory, command whitelist, container config
├── requirements.txt      # Python dependencies (paramiko, pyyaml)
└── commands/             # Modular command implementations
    ├── __init__.py
    ├── docker.py         # Docker-specific diagnostics
    ├── system.py         # System-level diagnostics (disk, memory, CPU)
    └── network.py        # Network diagnostics
Component Details
1. SKILL.md
Provides context for Claude Code when troubleshooting. Key sections:
---
name: server-diagnostics
description: Automated server troubleshooting for Docker containers and system health.
  Provides SSH-based diagnostics, log reading, metrics collection, and low-risk
  remediation. USE WHEN N8N triggers troubleshooting, container issues detected,
  or system health checks needed.
---
# Server Diagnostics - Automated Troubleshooting
## When to Activate This Skill
- N8N triggers with error context
- "diagnose container X", "check docker status"
- "read logs from server", "check disk usage"
- "troubleshoot server issue"
- Any automated health check response
## Quick Start
[Examples of common operations]
## Troubleshooting Workflow
[Step-by-step diagnostic process]
## MemoryGraph Integration
[How to recall/store troubleshooting patterns]
## Security Constraints
[Whitelist/deny list documentation]
2. client.py - Main Client Library
#!/usr/bin/env python3
"""
Server Diagnostics Client Library
Provides SSH-based diagnostics for homelab troubleshooting
"""

import json
import subprocess
from pathlib import Path
from typing import Any, Literal

import yaml


class ServerDiagnostics:
    """
    Main diagnostic client for server troubleshooting.

    Connects to servers via SSH and executes whitelisted diagnostic
    commands. Enforces security constraints from config.yaml.
    """

    def __init__(self, config_path: str | None = None):
        """
        Initialize with configuration.

        Args:
            config_path: Path to config.yaml. Defaults to same directory.
        """
        if config_path is None:
            config_path = Path(__file__).parent / "config.yaml"
        self.config = self._load_config(config_path)
        self.servers = self.config.get("servers", {})
        self.containers = self.config.get("docker_containers", [])
        self.allowed_commands = self.config.get("diagnostic_commands", {})
        self.remediation_commands = self.config.get("remediation_commands", {})
        self.denied_patterns = self.config.get("denied_patterns", [])

    def _load_config(self, path: str | Path) -> dict:
        """Load YAML configuration."""
        with open(path) as f:
            return yaml.safe_load(f)

    def _validate_command(self, command: str) -> bool:
        """Check command against deny list."""
        for pattern in self.denied_patterns:
            if pattern in command:
                raise SecurityError(f"Command contains denied pattern: {pattern}")
        return True

    def _ssh_exec(self, server: str, command: str) -> dict:
        """
        Execute command on remote server via SSH.

        Returns:
            dict with 'stdout', 'stderr', 'returncode', 'success'
        """
        self._validate_command(command)

        server_config = self.servers.get(server)
        if not server_config:
            raise ValueError(f"Unknown server: {server}")

        ssh_cmd = [
            "ssh",
            # Expand "~" in the configured key path before handing it to ssh
            "-i", str(Path(server_config["ssh_key"]).expanduser()),
            "-o", "StrictHostKeyChecking=no",
            "-o", "ConnectTimeout=10",
            f"{server_config['ssh_user']}@{server_config['hostname']}",
            command,
        ]

        try:
            result = subprocess.run(
                ssh_cmd,
                capture_output=True,
                text=True,
                timeout=60,
            )
        except subprocess.TimeoutExpired:
            # Report a hung command in the same shape as any other failure
            return {
                "stdout": "",
                "stderr": "Command timed out after 60 seconds",
                "returncode": -1,
                "success": False,
            }

        return {
            "stdout": result.stdout,
            "stderr": result.stderr,
            "returncode": result.returncode,
            "success": result.returncode == 0,
        }

    # === Docker Operations ===

    def get_docker_status(self, server: str, container: str | None = None) -> dict:
        """
        Get Docker container status.

        Args:
            server: Server identifier from config
            container: Specific container name (optional, all if not specified)

        Returns:
            dict with container statuses
        """
        if container:
            cmd = f"docker inspect --format '{{{{json .State}}}}' {container}"
        else:
            # '{{json .}}' prints one JSON object per line and also works on
            # Docker releases that lack the `--format json` shorthand
            cmd = "docker ps -a --format '{{json .}}'"

        result = self._ssh_exec(server, cmd)

        if result["success"]:
            try:
                if container:
                    result["data"] = json.loads(result["stdout"])
                else:
                    # Parse newline-delimited JSON
                    result["data"] = [
                        json.loads(line)
                        for line in result["stdout"].strip().split("\n")
                        if line
                    ]
            except json.JSONDecodeError:
                result["data"] = None

        return result

    def docker_logs(self, server: str, container: str,
                    lines: int = 100, filter: str | None = None) -> dict:
        """
        Get Docker container logs.

        Args:
            server: Server identifier
            container: Container name
            lines: Number of lines to retrieve
            filter: Optional grep filter pattern

        Returns:
            dict with log output
        """
        cmd = f"docker logs --tail {lines} {container} 2>&1"
        if filter:
            cmd += f" | grep -i '{filter}'"
        return self._ssh_exec(server, cmd)

    def docker_restart(self, server: str, container: str) -> dict:
        """
        Restart a Docker container (low-risk remediation).

        Args:
            server: Server identifier
            container: Container name

        Returns:
            dict with operation result
        """
        # Check if container is allowed to be restarted
        container_config = next(
            (c for c in self.containers if c["name"] == container),
            None
        )

        if not container_config:
            return {
                "success": False,
                "error": f"Container {container} not in monitored list"
            }

        if not container_config.get("restart_allowed", False):
            return {
                "success": False,
                "error": f"Container {container} restart not permitted"
            }

        cmd = f"docker restart {container}"
        result = self._ssh_exec(server, cmd)
        result["action"] = "docker_restart"
        result["container"] = container
        return result

    # === System Diagnostics ===

    def get_metrics(self, server: str,
                    metric_type: Literal["cpu", "memory", "disk", "network", "all"] = "all"
                    ) -> dict:
        """
        Get system metrics from server.

        Args:
            server: Server identifier
            metric_type: Type of metrics to retrieve

        Returns:
            dict with metric data
        """
        metrics = {}

        if metric_type in ("cpu", "all"):
            result = self._ssh_exec(server, self.allowed_commands["cpu_usage"])
            metrics["cpu"] = result

        if metric_type in ("memory", "all"):
            result = self._ssh_exec(server, self.allowed_commands["memory_usage"])
            metrics["memory"] = result

        if metric_type in ("disk", "all"):
            result = self._ssh_exec(server, self.allowed_commands["disk_usage"])
            metrics["disk"] = result

        if metric_type in ("network", "all"):
            result = self._ssh_exec(server, self.allowed_commands["network_status"])
            metrics["network"] = result

        return {"server": server, "metrics": metrics}

    def read_logs(self, server: str,
                  log_type: Literal["system", "docker", "application", "custom"],
                  lines: int = 100,
                  filter: str | None = None,
                  custom_path: str | None = None) -> dict:
        """
        Read logs from server.

        Args:
            server: Server identifier
            log_type: Type of log to read
            lines: Number of lines
            filter: Optional grep pattern
            custom_path: Path for custom log type

        Returns:
            dict with log content
        """
        log_paths = {
            "system": "/var/log/syslog",
            "docker": "/var/log/docker.log",
            "application": "/var/log/application.log",
        }

        path = custom_path if log_type == "custom" else log_paths.get(log_type)
        if not path:
            return {"success": False, "error": f"Unknown log type: {log_type}"}

        cmd = f"tail -n {lines} {path}"
        if filter:
            cmd += f" | grep -i '{filter}'"
        return self._ssh_exec(server, cmd)

    def run_diagnostic(self, server: str,
                       command: str,
                       params: dict | None = None) -> dict:
        """
        Run a whitelisted diagnostic command.

        Args:
            server: Server identifier
            command: Command key from config whitelist
            params: Optional parameters to substitute

        Returns:
            dict with command output
        """
        if command not in self.allowed_commands:
            return {
                "success": False,
                "error": f"Command '{command}' not in whitelist"
            }

        cmd = self.allowed_commands[command]

        # Substitute parameters if provided
        if params:
            for key, value in params.items():
                cmd = cmd.replace(f"{{{key}}}", str(value))

        return self._ssh_exec(server, cmd)

    # === Convenience Methods ===

    def quick_health_check(self, server: str) -> dict:
        """
        Perform quick health check on server.

        Returns summary of Docker containers, disk, and memory.
        """
        health = {
            "server": server,
            "docker": self.get_docker_status(server),
            "metrics": self.get_metrics(server, "all"),
            "healthy": True,
            "issues": []
        }

        # Check for stopped containers
        if health["docker"].get("data"):
            for container in health["docker"]["data"]:
                status = container.get("State", container.get("Status", ""))
                if "Up" not in str(status) and "running" not in str(status).lower():
                    health["healthy"] = False
                    health["issues"].append(
                        f"Container {container.get('Names', 'unknown')} is not running"
                    )

        return health

    def to_json(self, data: Any) -> str:
        """Convert result to JSON string."""
        return json.dumps(data, indent=2, default=str)


class SecurityError(Exception):
    """Raised when a command violates security constraints."""
    pass


def main():
    """CLI interface for server diagnostics."""
    import argparse

    parser = argparse.ArgumentParser(
        description="Server Diagnostics CLI",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  %(prog)s docker-status proxmox-host
  %(prog)s docker-status proxmox-host --container tdarr
  %(prog)s docker-logs proxmox-host tdarr --lines 200
  %(prog)s docker-restart proxmox-host tdarr
  %(prog)s metrics proxmox-host --type all
  %(prog)s logs proxmox-host --type system --lines 50
  %(prog)s health proxmox-host
  %(prog)s diagnostic proxmox-host disk_usage
        """
    )

    subparsers = parser.add_subparsers(dest="command", required=True)

    # docker-status
    p_docker = subparsers.add_parser("docker-status", help="Get Docker container status")
    p_docker.add_argument("server", help="Server identifier")
    p_docker.add_argument("--container", "-c", help="Specific container name")

    # docker-logs
    p_logs = subparsers.add_parser("docker-logs", help="Get Docker container logs")
    p_logs.add_argument("server", help="Server identifier")
    p_logs.add_argument("container", help="Container name")
    p_logs.add_argument("--lines", "-n", type=int, default=100, help="Number of lines")
    p_logs.add_argument("--filter", "-f", help="Grep filter pattern")

    # docker-restart
    p_restart = subparsers.add_parser("docker-restart", help="Restart Docker container")
    p_restart.add_argument("server", help="Server identifier")
    p_restart.add_argument("container", help="Container name")

    # metrics
    p_metrics = subparsers.add_parser("metrics", help="Get system metrics")
    p_metrics.add_argument("server", help="Server identifier")
    p_metrics.add_argument("--type", "-t", default="all",
                           choices=["cpu", "memory", "disk", "network", "all"],
                           help="Metric type")

    # logs
    p_syslogs = subparsers.add_parser("logs", help="Read system logs")
    p_syslogs.add_argument("server", help="Server identifier")
    p_syslogs.add_argument("--type", "-t", default="system",
                           choices=["system", "docker", "application", "custom"],
                           help="Log type")
    p_syslogs.add_argument("--lines", "-n", type=int, default=100, help="Number of lines")
    p_syslogs.add_argument("--filter", "-f", help="Grep filter pattern")
    p_syslogs.add_argument("--path", help="Custom log path (for type=custom)")

    # health
    p_health = subparsers.add_parser("health", help="Quick health check")
    p_health.add_argument("server", help="Server identifier")

    # diagnostic
    p_diag = subparsers.add_parser("diagnostic", help="Run whitelisted diagnostic")
    p_diag.add_argument("server", help="Server identifier")
    # Stored as args.name: a positional called "command" would overwrite the
    # subparser dest ("command"), so the dispatch below would never match
    p_diag.add_argument("name", metavar="command", help="Command from whitelist")
    p_diag.add_argument("--params", "-p", help="JSON parameters for command substitution")

    args = parser.parse_args()
    client = ServerDiagnostics()

    if args.command == "docker-status":
        result = client.get_docker_status(args.server, args.container)
    elif args.command == "docker-logs":
        result = client.docker_logs(
            args.server, args.container, args.lines, args.filter
        )
    elif args.command == "docker-restart":
        result = client.docker_restart(args.server, args.container)
    elif args.command == "metrics":
        result = client.get_metrics(args.server, args.type)
    elif args.command == "logs":
        result = client.read_logs(
            args.server, args.type, args.lines, args.filter, args.path
        )
    elif args.command == "health":
        result = client.quick_health_check(args.server)
    elif args.command == "diagnostic":
        params = json.loads(args.params) if args.params else None
        result = client.run_diagnostic(args.server, args.name, params)

    print(client.to_json(result))


if __name__ == "__main__":
    main()
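The client interpolates `container`, `filter`, and `params` values directly into the remote command string, so the deny list is the only barrier against injected shell syntax. A minimal sketch of quoting each value with `shlex.quote` before substitution follows. `build_command` is a hypothetical helper, not part of the client as designed; it mirrors the replace loop in `run_diagnostic`.

```python
import shlex


def build_command(template: str, params: dict[str, object]) -> str:
    """Substitute {key} placeholders with shell-quoted values.

    Hypothetical helper: every value is quoted so the remote shell
    receives it as a single word, whatever characters it contains.
    """
    cmd = template
    for key, value in params.items():
        cmd = cmd.replace(f"{{{key}}}", shlex.quote(str(value)))
    return cmd


safe = build_command("systemctl status {service}", {"service": "nginx"})
hostile = build_command("systemctl status {service}", {"service": "nginx; reboot"})
print(safe)     # systemctl status nginx
print(hostile)  # systemctl status 'nginx; reboot'
```

With quoting in place, the second call asks systemctl about a unit literally named `nginx; reboot` instead of running two commands.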
3. config.yaml
# Server Diagnostics Configuration
# Used by client.py for server inventory and security constraints

# Server inventory - SSH connection details
servers:
  proxmox-host:
    hostname: 10.10.0.11  # Update with actual IP
    ssh_user: root
    ssh_key: ~/.ssh/claude_diagnostics_key
    description: "Main Proxmox host running Docker services"

# Docker containers to monitor
# restart_allowed: false prevents automatic remediation
docker_containers:
  - name: tdarr
    critical: true
    restart_allowed: true
    description: "Media transcoding service"
  - name: portainer
    critical: true
    restart_allowed: true
    description: "Docker management UI"
  - name: n8n
    critical: true
    restart_allowed: false  # Never restart - it triggers us!
    description: "Workflow automation"
  - name: plex
    critical: true
    restart_allowed: true
    description: "Media server"

# Whitelisted diagnostic commands
# These are the ONLY commands that can be executed
diagnostic_commands:
  disk_usage: "df -h"
  memory_usage: "free -h"
  cpu_usage: "top -bn1 | head -20"
  cpu_load: "uptime"
  process_list: "ps aux --sort=-%mem | head -20"
  process_tree: "pstree -p"
  network_status: "ss -tuln"
  network_connections: "netstat -an | head -50"
  docker_ps: "docker ps -a --format 'table {{.Names}}\\t{{.Status}}\\t{{.Ports}}'"
  docker_stats: "docker stats --no-stream --format 'table {{.Name}}\\t{{.CPUPerc}}\\t{{.MemUsage}}'"
  service_status: "systemctl status {service}"
  journal_errors: "journalctl -p err -n 50 --no-pager"
  port_check: "nc -zv {host} {port}"
  dns_check: "dig +short {domain}"
  ping_check: "ping -c 3 {host}"

# Remediation commands (low-risk only)
remediation_commands:
  docker_restart: "docker restart {container}"
  docker_logs: "docker logs --tail 500 {container}"
  service_restart: "systemctl restart {service}"  # Phase 2
# DENIED patterns - commands containing these will be rejected
# This is a security safeguard
denied_patterns:
- "rm -rf"
- "rm -r /"
- "dd if="
- "mkfs"
- ":(){:|:&};:"
- "shutdown"
- "reboot"
- "init 0"
- "init 6"
- "systemctl stop"
- "> /dev/sd"
- "chmod 777"
- "wget|sh"
- "curl|sh"
- "eval"
- "$(("
- "` `"

# Logging configuration
logging:
  enabled: true
  path: ~/.claude/logs/server-diagnostics.log
  max_size_mb: 10
  backup_count: 5
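Because `_validate_command` uses plain substring matching, an entry such as `wget|sh` only matches that exact text and would miss `wget http://host/x | sh`. One possible tightening, sketched here under the assumption that regular expressions are acceptable in place of literal strings, is shown below. `DENIED_REGEXES` and `is_denied` are hypothetical names, and the patterns cover only a few of the entries above.

```python
import re

# Hypothetical regex equivalents of a few denied_patterns entries.
DENIED_REGEXES = [
    re.compile(r"\brm\s+-[a-z]*r"),                   # rm -r, rm -rf, rm -fr
    re.compile(r"\b(wget|curl)\b.*\|\s*(ba)?sh\b"),   # download piped into a shell
    re.compile(r"\b(shutdown|reboot)\b"),             # power state changes
    re.compile(r"`|\$\("),                            # command substitution
]


def is_denied(command: str) -> bool:
    """Return True if any denied regex matches anywhere in the command."""
    return any(rx.search(command) for rx in DENIED_REGEXES)


print(is_denied("wget http://host/x | sh"))  # True
print(is_denied("df -h"))                    # False
```

A deny list of either kind remains a safeguard of last resort; the whitelist of fixed command templates is what actually limits what can run.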
4. SKILL.md Full Content
See separate file: SKILL.md in the skill directory.
Key sections:
- Activation triggers
- Quick start with Python and CLI examples
- Troubleshooting workflow (step-by-step)
- MemoryGraph integration instructions
- Security constraints documentation
- Common error patterns and solutions
Integration Points
With Proxmox Skill
The server-diagnostics skill can leverage the Proxmox skill for:
- VM/LXC lifecycle operations (restart container that runs Docker)
- Resource monitoring at hypervisor level
- Snapshot creation before risky operations
# Example integration
from proxmox_client import ProxmoxClient
from server_diagnostics.client import ServerDiagnostics
proxmox = ProxmoxClient()
diag = ServerDiagnostics()
# Check if container needs VM-level intervention
result = diag.docker_restart("proxmox-host", "tdarr")
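if not result["success"]:
    # Escalate to VM level. lxc_id is a placeholder for the ID of the LXC
    # that hosts Docker; it is not defined in this document.
    proxmox.restart_container(lxc_id)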
With MemoryGraph
# In SKILL.md, instruct Claude to:
# 1. Before diagnosis - recall similar issues
# python ~/.claude/skills/memorygraph/client.py recall "docker tdarr timeout"
# 2. After resolution - store the pattern
# python ~/.claude/skills/memorygraph/client.py store \
# --type solution \
# --title "Tdarr container memory exhaustion" \
# --content "Container exceeded memory limit due to large transcode queue..." \
# --tags "docker,tdarr,memory,troubleshooting" \
# --importance 0.7
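When the store step is driven from Python rather than typed into a shell, building the argument list directly avoids quoting problems in titles and content. The sketch below assumes the MemoryGraph client path and flags are exactly those shown in the comments above; `memorygraph_store_argv` is a hypothetical helper.

```python
from pathlib import Path


def memorygraph_store_argv(title: str, content: str, tags: list[str],
                           importance: float = 0.7) -> list[str]:
    """Build the argv for the `store` call, using only flags shown in this document."""
    client = Path("~/.claude/skills/memorygraph/client.py").expanduser()
    return [
        "python", str(client), "store",
        "--type", "solution",
        "--title", title,
        "--content", content,
        "--tags", ",".join(tags),
        "--importance", str(importance),
    ]


argv = memorygraph_store_argv(
    "Tdarr container memory exhaustion",
    "Container exceeded memory limit due to large transcode queue...",
    ["docker", "tdarr", "memory", "troubleshooting"],
)
print(argv[2:5])  # ['store', '--type', 'solution']
```

The list can be passed to `subprocess.run(argv)` without `shell=True`, so no part of the content is interpreted by a shell.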
With N8N
N8N invokes the skill via headless Claude Code:
claude -p "
You are troubleshooting a server issue. Use the server-diagnostics skill.
Server: proxmox-host
Error Type: container_stopped
Container: tdarr
Timestamp: 2025-12-19T14:30:00Z
Use the diagnostic client to investigate and resolve if possible.
" --output-format json --allowedTools "Read,Bash,Grep,Glob"
Security Model
Three-Layer Protection
- settings.json - Claude Code level allow/deny lists
- config.yaml - Skill-level command whitelist and denied patterns
- Container config - Per-container restart permissions
Audit Trail
All operations are logged:
- Skill logs to ~/.claude/logs/server-diagnostics.log
- MemoryGraph entries for significant troubleshooting
- N8N execution history
- NAS report storage
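The skill log could be wired up from the `logging` section of config.yaml roughly as follows. This is a sketch, not the implemented behavior: `build_logger` is a hypothetical function, and it assumes `max_size_mb` and `backup_count` map onto the standard library's `RotatingFileHandler`.

```python
import logging
from logging.handlers import RotatingFileHandler
from pathlib import Path


def build_logger(cfg: dict) -> logging.Logger:
    """Create a size-rotated file logger from the `logging` config section."""
    logger = logging.getLogger("server-diagnostics")
    logger.setLevel(logging.INFO)
    if not cfg.get("enabled", False):
        # Logging disabled: swallow records instead of writing a file
        logger.addHandler(logging.NullHandler())
        return logger

    path = Path(cfg["path"]).expanduser()
    path.parent.mkdir(parents=True, exist_ok=True)
    handler = RotatingFileHandler(
        path,
        maxBytes=cfg.get("max_size_mb", 10) * 1024 * 1024,
        backupCount=cfg.get("backup_count", 5),
    )
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    return logger
```

A call such as `build_logger(config["logging"]).info("docker_restart server=%s container=%s", server, container)` inside each operation would give the audit entry; the function should be called once per process, since each call attaches another handler.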
Testing Strategy
Unit Tests
# Test command validation
python -m pytest tests/test_security.py
# Test SSH mocking
python -m pytest tests/test_ssh.py
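The two test files are not shown in this design. As a rough illustration of what they might contain, the sketch below uses stand-ins for the client code so that it runs on its own: `validate_command` copies the substring logic of `_validate_command`, and `run_remote` stands in for the `subprocess.run` call in `_ssh_exec`. Real tests would import `ServerDiagnostics` instead.

```python
import subprocess
from unittest import mock


class SecurityError(Exception):
    """Stand-in for the SecurityError defined in client.py."""


def validate_command(command: str, denied_patterns: list[str]) -> bool:
    """Stand-in for ServerDiagnostics._validate_command (same substring logic)."""
    for pattern in denied_patterns:
        if pattern in command:
            raise SecurityError(f"Command contains denied pattern: {pattern}")
    return True


def run_remote(argv: list[str]) -> subprocess.CompletedProcess:
    """Stand-in for the subprocess call made by _ssh_exec."""
    return subprocess.run(argv, capture_output=True, text=True, timeout=60)


DENIED = ["rm -rf", "shutdown", "reboot", "systemctl stop"]


def test_allows_whitelisted_command():
    assert validate_command("df -h", DENIED) is True


def test_rejects_denied_pattern():
    try:
        validate_command("docker ps; rm -rf /", DENIED)
    except SecurityError as exc:
        assert "rm -rf" in str(exc)
    else:
        raise AssertionError("expected SecurityError")


def test_ssh_is_mocked():
    # No network access: subprocess.run is replaced for the duration of the call
    fake = subprocess.CompletedProcess(args=[], returncode=0, stdout="ok\n", stderr="")
    with mock.patch("subprocess.run", return_value=fake) as run:
        result = run_remote(["ssh", "root@10.10.0.11", "uptime"])
    assert result.stdout == "ok\n"
    run.assert_called_once()
```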
Integration Tests
# Test against real server (requires SSH access)
python client.py health proxmox-host
python client.py docker-status proxmox-host
python client.py diagnostic proxmox-host disk_usage
Simulated Failures
# Stop a container and verify detection
docker stop tdarr
python client.py health proxmox-host # Should show issue
python client.py docker-restart proxmox-host tdarr # Should restart
File Relationships
┌─────────────────────────────────────────────────────────────┐
│ Claude Code Session │
│ │
│ Loads: SKILL.md (context) │
│ ↓ │
│ Executes: python client.py <command> │
│ ↓ │
│ Reads: config.yaml (server inventory, whitelist) │
│ ↓ │
│ Connects: SSH to servers │
│ ↓ │
│ Returns: JSON output to Claude Code │
│ ↓ │
│ Stores: MemoryGraph (learnings) │
│ │
└─────────────────────────────────────────────────────────────┘
Next Steps
- Create actual skill files in ~/.claude/skills/server-diagnostics/
- Generate SSH key pair for diagnostics
- Install key on Proxmox host
- Test basic connectivity
- Integrate with N8N workflow