# Server Diagnostics Skill - Architecture Design

## Overview

The `server-diagnostics` skill provides automated troubleshooting for homelab infrastructure over SSH. It follows the same architectural pattern as the existing Proxmox skill: a Python client library with a CLI, a `SKILL.md` for context, and a YAML configuration file.

## Directory Structure

```
~/.claude/skills/server-diagnostics/
├── SKILL.md              # Skill context, workflows, and usage instructions
├── client.py             # Main Python client library with CLI
├── config.yaml           # Server inventory, command whitelist, container config
├── requirements.txt      # Python dependencies (paramiko, pyyaml)
└── commands/             # Modular command implementations
    ├── __init__.py
    ├── docker.py         # Docker-specific diagnostics
    ├── system.py         # System-level diagnostics (disk, memory, CPU)
    └── network.py        # Network diagnostics
```

## Component Details

### 1. SKILL.md

Provides context for Claude Code when troubleshooting. Key sections:

```markdown
---
name: server-diagnostics
description: Automated server troubleshooting for Docker containers and system health. Provides SSH-based diagnostics, log reading, metrics collection, and low-risk remediation. USE WHEN N8N triggers troubleshooting, container issues detected, or system health checks needed.
---

# Server Diagnostics - Automated Troubleshooting

## When to Activate This Skill
- N8N triggers with error context
- "diagnose container X", "check docker status"
- "read logs from server", "check disk usage"
- "troubleshoot server issue"
- Any automated health check response

## Quick Start
[Examples of common operations]

## Troubleshooting Workflow
[Step-by-step diagnostic process]

## MemoryGraph Integration
[How to recall/store troubleshooting patterns]

## Security Constraints
[Whitelist/deny list documentation]
```

### 2. client.py - Main Client Library

```python
#!/usr/bin/env python3
"""
Server Diagnostics Client Library
Provides SSH-based diagnostics for homelab troubleshooting
"""

import json
import shlex
import subprocess
from pathlib import Path
from typing import Any, Literal

import yaml


class ServerDiagnostics:
    """
    Main diagnostic client for server troubleshooting.

    Connects to servers via SSH and executes whitelisted diagnostic commands.
    Enforces security constraints from config.yaml.
    """

    def __init__(self, config_path: str | None = None):
        """
        Initialize with configuration.

        Args:
            config_path: Path to config.yaml. Defaults to same directory.
        """
        if config_path is None:
            config_path = Path(__file__).parent / "config.yaml"

        self.config = self._load_config(config_path)
        self.servers = self.config.get("servers", {})
        self.containers = self.config.get("docker_containers", [])
        self.allowed_commands = self.config.get("diagnostic_commands", {})
        self.remediation_commands = self.config.get("remediation_commands", {})
        self.denied_patterns = self.config.get("denied_patterns", [])

    def _load_config(self, path: str | Path) -> dict:
        """Load YAML configuration."""
        with open(path) as f:
            return yaml.safe_load(f)

    def _validate_command(self, command: str) -> bool:
        """Check command against deny list (plain substring match)."""
        for pattern in self.denied_patterns:
            if pattern in command:
                raise SecurityError(f"Command contains denied pattern: {pattern}")
        return True

    def _ssh_exec(self, server: str, command: str) -> dict:
        """
        Execute command on remote server via SSH.

        Returns:
            dict with 'stdout', 'stderr', 'returncode', 'success'
        """
        self._validate_command(command)

        server_config = self.servers.get(server)
        if not server_config:
            raise ValueError(f"Unknown server: {server}")

        ssh_cmd = [
            "ssh",
            # Expand "~" explicitly so the key path does not depend on ssh doing it
            "-i", str(Path(server_config["ssh_key"]).expanduser()),
            # NOTE: "no" disables host key verification; acceptable only on a trusted LAN
            "-o", "StrictHostKeyChecking=no",
            "-o", "ConnectTimeout=10",
            f"{server_config['ssh_user']}@{server_config['hostname']}",
            command
        ]

        try:
            result = subprocess.run(
                ssh_cmd,
                capture_output=True,
                text=True,
                timeout=60
            )
        except subprocess.TimeoutExpired:
            # Report a hung command in the same shape as a normal result
            return {
                "stdout": "",
                "stderr": "SSH command timed out after 60 seconds",
                "returncode": -1,
                "success": False
            }

        return {
            "stdout": result.stdout,
            "stderr": result.stderr,
            "returncode": result.returncode,
            "success": result.returncode == 0
        }

    # === Docker Operations ===

    def get_docker_status(self, server: str, container: str | None = None) -> dict:
        """
        Get Docker container status.

        Args:
            server: Server identifier from config
            container: Specific container name (optional, all if not specified)

        Returns:
            dict with container statuses
        """
        if container:
            # Quote the name so it cannot inject extra shell syntax on the remote side
            cmd = f"docker inspect --format '{{{{json .State}}}}' {shlex.quote(container)}"
        else:
            # Go-template form emits one JSON object per line and also works on
            # Docker versions that predate the bare `--format json` shorthand
            cmd = "docker ps -a --format '{{json .}}'"

        result = self._ssh_exec(server, cmd)

        if result["success"]:
            try:
                if container:
                    result["data"] = json.loads(result["stdout"])
                else:
                    # Parse newline-delimited JSON
                    result["data"] = [
                        json.loads(line)
                        for line in result["stdout"].strip().split("\n")
                        if line
                    ]
            except json.JSONDecodeError:
                result["data"] = None

        return result

    def docker_logs(self, server: str, container: str,
                    lines: int = 100, filter: str | None = None) -> dict:
        """
        Get Docker container logs.

        Args:
            server: Server identifier
            container: Container name
            lines: Number of lines to retrieve
            filter: Optional grep filter pattern

        Returns:
            dict with log output
        """
        cmd = f"docker logs --tail {int(lines)} {shlex.quote(container)} 2>&1"
        if filter:
            cmd += f" | grep -i -- {shlex.quote(filter)}"

        return self._ssh_exec(server, cmd)

    def docker_restart(self, server: str, container: str) -> dict:
        """
        Restart a Docker container (low-risk remediation).

        Args:
            server: Server identifier
            container: Container name

        Returns:
            dict with operation result
        """
        # Check if container is allowed to be restarted
        container_config = next(
            (c for c in self.containers if c["name"] == container),
            None
        )

        if not container_config:
            return {
                "success": False,
                "error": f"Container {container} not in monitored list"
            }

        if not container_config.get("restart_allowed", False):
            return {
                "success": False,
                "error": f"Container {container} restart not permitted"
            }

        cmd = f"docker restart {shlex.quote(container)}"
        result = self._ssh_exec(server, cmd)
        result["action"] = "docker_restart"
        result["container"] = container

        return result

    # === System Diagnostics ===

    def get_metrics(self, server: str,
                    metric_type: Literal["cpu", "memory", "disk", "network", "all"] = "all"
                    ) -> dict:
        """
        Get system metrics from server.

        Args:
            server: Server identifier
            metric_type: Type of metrics to retrieve

        Returns:
            dict with metric data
        """
        metrics = {}

        if metric_type in ("cpu", "all"):
            result = self._ssh_exec(server, self.allowed_commands["cpu_usage"])
            metrics["cpu"] = result

        if metric_type in ("memory", "all"):
            result = self._ssh_exec(server, self.allowed_commands["memory_usage"])
            metrics["memory"] = result

        if metric_type in ("disk", "all"):
            result = self._ssh_exec(server, self.allowed_commands["disk_usage"])
            metrics["disk"] = result

        if metric_type in ("network", "all"):
            result = self._ssh_exec(server, self.allowed_commands["network_status"])
            metrics["network"] = result

        return {"server": server, "metrics": metrics}

    def read_logs(self, server: str,
                  log_type: Literal["system", "docker", "application", "custom"],
                  lines: int = 100,
                  filter: str | None = None,
                  custom_path: str | None = None) -> dict:
        """
        Read logs from server.

        Args:
            server: Server identifier
            log_type: Type of log to read
            lines: Number of lines
            filter: Optional grep pattern
            custom_path: Path for custom log type

        Returns:
            dict with log content
        """
        log_paths = {
            "system": "/var/log/syslog",
            "docker": "/var/log/docker.log",
            "application": "/var/log/application.log",
        }

        path = custom_path if log_type == "custom" else log_paths.get(log_type)
        if not path:
            return {
                "success": False,
                "error": f"Unknown log type or missing path: {log_type}"
            }

        cmd = f"tail -n {int(lines)} {shlex.quote(path)}"
        if filter:
            cmd += f" | grep -i -- {shlex.quote(filter)}"

        return self._ssh_exec(server, cmd)

    def run_diagnostic(self, server: str, command: str,
                       params: dict | None = None) -> dict:
        """
        Run a whitelisted diagnostic command.

        Args:
            server: Server identifier
            command: Command key from config whitelist
            params: Optional parameters to substitute

        Returns:
            dict with command output
        """
        if command not in self.allowed_commands:
            return {
                "success": False,
                "error": f"Command '{command}' not in whitelist"
            }

        cmd = self.allowed_commands[command]

        # Substitute parameters if provided, shell-quoting each value
        if params:
            for key, value in params.items():
                cmd = cmd.replace(f"{{{key}}}", shlex.quote(str(value)))

        return self._ssh_exec(server, cmd)

    # === Convenience Methods ===

    def quick_health_check(self, server: str) -> dict:
        """
        Perform quick health check on server.

        Returns summary of Docker containers, disk, and memory.
        """
        health = {
            "server": server,
            "docker": self.get_docker_status(server),
            "metrics": self.get_metrics(server, "all"),
            "healthy": True,
            "issues": []
        }

        # Check for stopped containers
        if health["docker"].get("data"):
            for container in health["docker"]["data"]:
                status = container.get("State", container.get("Status", ""))
                if "Up" not in str(status) and "running" not in str(status).lower():
                    health["healthy"] = False
                    health["issues"].append(
                        f"Container {container.get('Names', 'unknown')} is not running"
                    )

        return health

    def to_json(self, data: Any) -> str:
        """Convert result to JSON string."""
        return json.dumps(data, indent=2, default=str)


class SecurityError(Exception):
    """Raised when a command violates security constraints."""


def main():
    """CLI interface for server diagnostics."""
    import argparse

    parser = argparse.ArgumentParser(
        description="Server Diagnostics CLI",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  %(prog)s docker-status proxmox-host
  %(prog)s docker-status proxmox-host --container tdarr
  %(prog)s docker-logs proxmox-host tdarr --lines 200
  %(prog)s docker-restart proxmox-host tdarr
  %(prog)s metrics proxmox-host --type all
  %(prog)s logs proxmox-host --type system --lines 50
  %(prog)s health proxmox-host
  %(prog)s diagnostic proxmox-host disk_usage
        """
    )

    subparsers = parser.add_subparsers(dest="command", required=True)

    # docker-status
    p_docker = subparsers.add_parser("docker-status", help="Get Docker container status")
    p_docker.add_argument("server", help="Server identifier")
    p_docker.add_argument("--container", "-c", help="Specific container name")

    # docker-logs
    p_logs = subparsers.add_parser("docker-logs", help="Get Docker container logs")
    p_logs.add_argument("server", help="Server identifier")
    p_logs.add_argument("container", help="Container name")
    p_logs.add_argument("--lines", "-n", type=int, default=100, help="Number of lines")
    p_logs.add_argument("--filter", "-f", help="Grep filter pattern")

    # docker-restart
    p_restart = subparsers.add_parser("docker-restart", help="Restart Docker container")
    p_restart.add_argument("server", help="Server identifier")
    p_restart.add_argument("container", help="Container name")

    # metrics
    p_metrics = subparsers.add_parser("metrics", help="Get system metrics")
    p_metrics.add_argument("server", help="Server identifier")
    p_metrics.add_argument("--type", "-t", default="all",
                           choices=["cpu", "memory", "disk", "network", "all"],
                           help="Metric type")

    # logs
    p_syslogs = subparsers.add_parser("logs", help="Read system logs")
    p_syslogs.add_argument("server", help="Server identifier")
    p_syslogs.add_argument("--type", "-t", default="system",
                           choices=["system", "docker", "application", "custom"],
                           help="Log type")
    p_syslogs.add_argument("--lines", "-n", type=int, default=100, help="Number of lines")
    p_syslogs.add_argument("--filter", "-f", help="Grep filter pattern")
    p_syslogs.add_argument("--path", help="Custom log path (for type=custom)")

    # health
    p_health = subparsers.add_parser("health", help="Quick health check")
    p_health.add_argument("server", help="Server identifier")

    # diagnostic
    p_diag = subparsers.add_parser("diagnostic", help="Run whitelisted diagnostic")
    p_diag.add_argument("server", help="Server identifier")
    # Stored as diag_command so it does not overwrite the subparser's dest="command"
    p_diag.add_argument("diag_command", metavar="command", help="Command from whitelist")
    p_diag.add_argument("--params", "-p", help="JSON parameters for command substitution")

    args = parser.parse_args()

    client = ServerDiagnostics()

    if args.command == "docker-status":
        result = client.get_docker_status(args.server, args.container)

    elif args.command == "docker-logs":
        result = client.docker_logs(
            args.server, args.container, args.lines, args.filter
        )

    elif args.command == "docker-restart":
        result = client.docker_restart(args.server, args.container)

    elif args.command == "metrics":
        result = client.get_metrics(args.server, args.type)

    elif args.command == "logs":
        result = client.read_logs(
            args.server, args.type, args.lines, args.filter, args.path
        )

    elif args.command == "health":
        result = client.quick_health_check(args.server)

    elif args.command == "diagnostic":
        params = json.loads(args.params) if args.params else None
        result = client.run_diagnostic(args.server, args.diag_command, params)

    print(client.to_json(result))


if __name__ == "__main__":
    main()
```

### 3. config.yaml

```yaml
# Server Diagnostics Configuration
# Used by client.py for server inventory and security constraints

# Server inventory - SSH connection details
servers:
  proxmox-host:
    hostname: 10.10.0.11  # Update with actual IP
    ssh_user: root
    ssh_key: ~/.ssh/claude_diagnostics_key
    description: "Main Proxmox host running Docker services"

# Docker containers to monitor
# restart_allowed: false prevents automatic remediation
docker_containers:
  - name: tdarr
    critical: true
    restart_allowed: true
    description: "Media transcoding service"

  - name: portainer
    critical: true
    restart_allowed: true
    description: "Docker management UI"

  - name: n8n
    critical: true
    restart_allowed: false  # Never restart - it triggers us!
    description: "Workflow automation"

  - name: plex
    critical: true
    restart_allowed: true
    description: "Media server"

# Whitelisted diagnostic commands
# These are the ONLY commands that can be executed
diagnostic_commands:
  disk_usage: "df -h"
  memory_usage: "free -h"
  cpu_usage: "top -bn1 | head -20"
  cpu_load: "uptime"
  process_list: "ps aux --sort=-%mem | head -20"
  process_tree: "pstree -p"
  network_status: "ss -tuln"
  network_connections: "netstat -an | head -50"
  docker_ps: "docker ps -a --format 'table {{.Names}}\\t{{.Status}}\\t{{.Ports}}'"
  docker_stats: "docker stats --no-stream --format 'table {{.Name}}\\t{{.CPUPerc}}\\t{{.MemUsage}}'"
  service_status: "systemctl status {service}"
  journal_errors: "journalctl -p err -n 50 --no-pager"
  port_check: "nc -zv {host} {port}"
  dns_check: "dig +short {domain}"
  ping_check: "ping -c 3 {host}"

# Remediation commands (low-risk only)
remediation_commands:
  docker_restart: "docker restart {container}"
  docker_logs: "docker logs --tail 500 {container}"
  service_restart: "systemctl restart {service}"  # Phase 2

# DENIED patterns - commands containing these will be rejected
# This is a security safeguard
denied_patterns:
  - "rm -rf"
  - "rm -r /"
  - "dd if="
  - "mkfs"
  - ":(){:|:&};:"
  - "shutdown"
  - "reboot"
  - "init 0"
  - "init 6"
  - "systemctl stop"
  - "> /dev/sd"
  - "chmod 777"
  - "wget|sh"
  - "curl|sh"
  - "eval"
  - "$(("
  - "` `"

# Logging configuration
logging:
  enabled: true
  path: ~/.claude/logs/server-diagnostics.log
  max_size_mb: 10
  backup_count: 5
```

### 4. SKILL.md Full Content

See separate file: `SKILL.md` in the skill directory.

Key sections:
- Activation triggers
- Quick start with Python and CLI examples
- Troubleshooting workflow (step-by-step)
- MemoryGraph integration instructions
- Security constraints documentation
- Common error patterns and solutions

## Integration Points

### With Proxmox Skill

The server-diagnostics skill can leverage the Proxmox skill for:
- VM/LXC lifecycle operations (restart container that runs Docker)
- Resource monitoring at hypervisor level
- Snapshot creation before risky operations

```python
# Example integration
# Assumes both skill directories are importable as Python modules
from proxmox_client import ProxmoxClient
from server_diagnostics.client import ServerDiagnostics

proxmox = ProxmoxClient()
diag = ServerDiagnostics()

# Check if container needs VM-level intervention
result = diag.docker_restart("proxmox-host", "tdarr")
if not result["success"]:
    # Escalate to VM level
    # lxc_id is a placeholder for the ID of the LXC that hosts Docker
    proxmox.restart_container(lxc_id)
```

### With MemoryGraph

```python
# In SKILL.md, instruct Claude to:

# 1. Before diagnosis - recall similar issues
# python ~/.claude/skills/memorygraph/client.py recall "docker tdarr timeout"

# 2. After resolution - store the pattern
# python ~/.claude/skills/memorygraph/client.py store \
#   --type solution \
#   --title "Tdarr container memory exhaustion" \
#   --content "Container exceeded memory limit due to large transcode queue..." \
#   --tags "docker,tdarr,memory,troubleshooting" \
#   --importance 0.7
```

### With N8N

N8N invokes the skill via headless Claude Code:

```bash
claude -p "
You are troubleshooting a server issue. Use the server-diagnostics skill.

Server: proxmox-host
Error Type: container_stopped
Container: tdarr
Timestamp: 2025-12-19T14:30:00Z

Use the diagnostic client to investigate and resolve if possible.
" --output-format json --allowedTools "Read,Bash,Grep,Glob"
```

## Security Model

### Three-Layer Protection

1. **settings.json** - Claude Code level allow/deny lists
2. **config.yaml** - Skill-level command whitelist and denied patterns
3. **Container config** - Per-container restart permissions

### Audit Trail

All operations are logged:
- Skill logs to `~/.claude/logs/server-diagnostics.log`
- MemoryGraph entries for significant troubleshooting
- N8N execution history
- NAS report storage

## Testing Strategy

### Unit Tests

```bash
# Test command validation
python -m pytest tests/test_security.py

# Test SSH mocking
python -m pytest tests/test_ssh.py
```

### Integration Tests

```bash
# Test against real server (requires SSH access)
python client.py health proxmox-host
python client.py docker-status proxmox-host
python client.py diagnostic proxmox-host disk_usage
```

### Simulated Failures

```bash
# Stop a container and verify detection
docker stop tdarr
python client.py health proxmox-host  # Should show issue
python client.py docker-restart proxmox-host tdarr  # Should restart
```

## File Relationships

```
┌─────────────────────────────────────────────────────────────┐
│                     Claude Code Session                     │
│                                                             │
│  Loads: SKILL.md (context)                                  │
│           ↓                                                 │
│  Executes: python client.py                                 │
│           ↓                                                 │
│  Reads: config.yaml (server inventory, whitelist)           │
│           ↓                                                 │
│  Connects: SSH to servers                                   │
│           ↓                                                 │
│  Returns: JSON output to Claude Code                        │
│           ↓                                                 │
│  Stores: MemoryGraph (learnings)                            │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```

## Next Steps

1. Create actual skill files in `~/.claude/skills/server-diagnostics/`
2. Generate SSH key pair for diagnostics
3. Install key on Proxmox host
4. Test basic connectivity
5. Integrate with N8N workflow
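
Before step 4, the config-level layer of the security model (command whitelist plus denied patterns) can be exercised offline, without any SSH access. The following is a minimal sketch, not the skill's actual implementation: `build_command`, `DIAGNOSTIC_COMMANDS`, and `DENIED_PATTERNS` are hypothetical names, the two dictionaries inline a small assumed subset of `config.yaml`, and shell-quoting parameters with `shlex.quote` is a hardening assumption of this sketch.

```python
from __future__ import annotations

import shlex

# Assumed subset of config.yaml, inlined so the sketch runs without the file
DIAGNOSTIC_COMMANDS = {
    "disk_usage": "df -h",
    "service_status": "systemctl status {service}",
}
DENIED_PATTERNS = ["rm -rf", "shutdown", "reboot", "systemctl stop"]


class SecurityError(Exception):
    """Raised when a command violates security constraints."""


def build_command(key: str, params: dict[str, str] | None = None) -> str:
    """Resolve a whitelisted command key into a shell string.

    Parameters are shell-quoted before substitution, and the final string
    is checked against the deny list using plain substring matching.
    """
    if key not in DIAGNOSTIC_COMMANDS:
        raise SecurityError(f"Command '{key}' not in whitelist")

    cmd = DIAGNOSTIC_COMMANDS[key]
    for name, value in (params or {}).items():
        # Replace the literal "{name}" placeholder with a quoted value
        cmd = cmd.replace(f"{{{name}}}", shlex.quote(str(value)))

    for pattern in DENIED_PATTERNS:
        if pattern in cmd:
            raise SecurityError(f"Command contains denied pattern: {pattern}")
    return cmd


if __name__ == "__main__":
    print(build_command("disk_usage"))
    print(build_command("service_status", {"service": "docker"}))
```

Checks of this kind could seed `tests/test_security.py`. Because the deny list is a substring match, it also rejects a denied word that appears inside a quoted parameter, which is conservative but may produce false positives for legitimately named services.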