claude-home/legacy/headless-claude/docs/skill-architecture.md
Cal Corum babf062d6a docs: archive headless-claude design docs to legacy/
Original planning folder (no git repo) for the server diagnostics system
that runs on CT 300. Live deployment is on claude-runner; this preserves
the Agent SDK reference, PRD with Phase 2/3 roadmap, and N8N workflow designs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-01 08:15:13 -06:00


Server Diagnostics Skill - Architecture Design

Overview

The server-diagnostics skill provides automated troubleshooting for homelab infrastructure over SSH. It follows the same architectural pattern as the existing Proxmox skill: a Python client library with a CLI, a SKILL.md file for context, and a YAML configuration file.

Directory Structure

~/.claude/skills/server-diagnostics/
├── SKILL.md                    # Skill context, workflows, and usage instructions
├── client.py                   # Main Python client library with CLI
├── config.yaml                 # Server inventory, command whitelist, container config
├── requirements.txt            # Python dependencies (paramiko, pyyaml)
└── commands/                   # Modular command implementations
    ├── __init__.py
    ├── docker.py               # Docker-specific diagnostics
    ├── system.py               # System-level diagnostics (disk, memory, CPU)
    └── network.py              # Network diagnostics

Component Details

1. SKILL.md

Provides context for Claude Code when troubleshooting. Key sections:

---
name: server-diagnostics
description: Automated server troubleshooting for Docker containers and system health.
  Provides SSH-based diagnostics, log reading, metrics collection, and low-risk
  remediation. USE WHEN N8N triggers troubleshooting, container issues detected,
  or system health checks needed.
---

# Server Diagnostics - Automated Troubleshooting

## When to Activate This Skill
- N8N triggers with error context
- "diagnose container X", "check docker status"
- "read logs from server", "check disk usage"
- "troubleshoot server issue"
- Any automated health check response

## Quick Start
[Examples of common operations]

## Troubleshooting Workflow
[Step-by-step diagnostic process]

## MemoryGraph Integration
[How to recall/store troubleshooting patterns]

## Security Constraints
[Whitelist/deny list documentation]

2. client.py - Main Client Library

#!/usr/bin/env python3
"""
Server Diagnostics Client Library
Provides SSH-based diagnostics for homelab troubleshooting
"""

import json
import subprocess
from pathlib import Path
from typing import Any, Literal
import yaml

class ServerDiagnostics:
    """
    Main diagnostic client for server troubleshooting.

    Connects to servers via SSH and executes whitelisted diagnostic
    commands. Enforces security constraints from config.yaml.
    """

    def __init__(self, config_path: str | None = None):
        """
        Initialize with configuration.

        Args:
            config_path: Path to config.yaml. Defaults to same directory.
        """
        if config_path is None:
            config_path = Path(__file__).parent / "config.yaml"
        self.config = self._load_config(config_path)
        self.servers = self.config.get("servers", {})
        self.containers = self.config.get("docker_containers", [])
        self.allowed_commands = self.config.get("diagnostic_commands", {})
        self.remediation_commands = self.config.get("remediation_commands", {})
        self.denied_patterns = self.config.get("denied_patterns", [])

    def _load_config(self, path: str | Path) -> dict:
        """Load YAML configuration."""
        with open(path) as f:
            return yaml.safe_load(f)

    def _validate_command(self, command: str) -> bool:
        """Check command against deny list."""
        for pattern in self.denied_patterns:
            if pattern in command:
                raise SecurityError(f"Command contains denied pattern: {pattern}")
        return True

    def _ssh_exec(self, server: str, command: str) -> dict:
        """
        Execute command on remote server via SSH.

        Returns:
            dict with 'stdout', 'stderr', 'returncode', 'success'
        """
        self._validate_command(command)

        server_config = self.servers.get(server)
        if not server_config:
            raise ValueError(f"Unknown server: {server}")

        ssh_cmd = [
            "ssh",
            "-i", server_config["ssh_key"],
            "-o", "StrictHostKeyChecking=no",
            "-o", "ConnectTimeout=10",
            f"{server_config['ssh_user']}@{server_config['hostname']}",
            command
        ]

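        try:
            result = subprocess.run(
                ssh_cmd,
                capture_output=True,
                text=True,
                timeout=60
            )
        except subprocess.TimeoutExpired:
            # Report a hung command in the same shape as a normal result
            return {
                "stdout": "",
                "stderr": "Command timed out after 60 seconds",
                "returncode": -1,
                "success": False
            }

        return {
            "stdout": result.stdout,
            "stderr": result.stderr,
            "returncode": result.returncode,
            "success": result.returncode == 0
        }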

    # === Docker Operations ===

    def get_docker_status(self, server: str, container: str | None = None) -> dict:
        """
        Get Docker container status.

        Args:
            server: Server identifier from config
            container: Specific container name (optional, all if not specified)

        Returns:
            dict with container statuses
        """
        if container:
            cmd = f"docker inspect --format '{{{{json .State}}}}' {container}"
        else:
            cmd = "docker ps -a --format 'json'"

        result = self._ssh_exec(server, cmd)

        if result["success"]:
            try:
                if container:
                    result["data"] = json.loads(result["stdout"])
                else:
                    # Parse newline-delimited JSON
                    result["data"] = [
                        json.loads(line)
                        for line in result["stdout"].strip().split("\n")
                        if line
                    ]
            except json.JSONDecodeError:
                result["data"] = None

        return result

    def docker_logs(self, server: str, container: str,
                    lines: int = 100, filter: str | None = None) -> dict:
        """
        Get Docker container logs.

        Args:
            server: Server identifier
            container: Container name
            lines: Number of lines to retrieve
            filter: Optional grep filter pattern

        Returns:
            dict with log output
        """
        cmd = f"docker logs --tail {lines} {container} 2>&1"
        if filter:
            cmd += f" | grep -i '{filter}'"

        return self._ssh_exec(server, cmd)

    def docker_restart(self, server: str, container: str) -> dict:
        """
        Restart a Docker container (low-risk remediation).

        Args:
            server: Server identifier
            container: Container name

        Returns:
            dict with operation result
        """
        # Check if container is allowed to be restarted
        container_config = next(
            (c for c in self.containers if c["name"] == container),
            None
        )

        if not container_config:
            return {
                "success": False,
                "error": f"Container {container} not in monitored list"
            }

        if not container_config.get("restart_allowed", False):
            return {
                "success": False,
                "error": f"Container {container} restart not permitted"
            }

        cmd = f"docker restart {container}"
        result = self._ssh_exec(server, cmd)
        result["action"] = "docker_restart"
        result["container"] = container

        return result

    # === System Diagnostics ===

    def get_metrics(self, server: str,
                    metric_type: Literal["cpu", "memory", "disk", "network", "all"] = "all"
                   ) -> dict:
        """
        Get system metrics from server.

        Args:
            server: Server identifier
            metric_type: Type of metrics to retrieve

        Returns:
            dict with metric data
        """
        metrics = {}

        if metric_type in ("cpu", "all"):
            result = self._ssh_exec(server, self.allowed_commands["cpu_usage"])
            metrics["cpu"] = result

        if metric_type in ("memory", "all"):
            result = self._ssh_exec(server, self.allowed_commands["memory_usage"])
            metrics["memory"] = result

        if metric_type in ("disk", "all"):
            result = self._ssh_exec(server, self.allowed_commands["disk_usage"])
            metrics["disk"] = result

        if metric_type in ("network", "all"):
            result = self._ssh_exec(server, self.allowed_commands["network_status"])
            metrics["network"] = result

        return {"server": server, "metrics": metrics}

    def read_logs(self, server: str,
                  log_type: Literal["system", "docker", "application", "custom"],
                  lines: int = 100,
                  filter: str | None = None,
                  custom_path: str | None = None) -> dict:
        """
        Read logs from server.

        Args:
            server: Server identifier
            log_type: Type of log to read
            lines: Number of lines
            filter: Optional grep pattern
            custom_path: Path for custom log type

        Returns:
            dict with log content
        """
        log_paths = {
            "system": "/var/log/syslog",
            "docker": "/var/log/docker.log",
            "application": "/var/log/application.log",
        }

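        path = custom_path if log_type == "custom" else log_paths.get(log_type)

        if not path:
            if log_type == "custom":
                return {"success": False, "error": "custom_path is required when log_type is 'custom'"}
            return {"success": False, "error": f"Unknown log type: {log_type}"}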

        cmd = f"tail -n {lines} {path}"
        if filter:
            cmd += f" | grep -i '{filter}'"

        return self._ssh_exec(server, cmd)

    def run_diagnostic(self, server: str,
                       command: str,
                       params: dict | None = None) -> dict:
        """
        Run a whitelisted diagnostic command.

        Args:
            server: Server identifier
            command: Command key from config whitelist
            params: Optional parameters to substitute

        Returns:
            dict with command output
        """
        if command not in self.allowed_commands:
            return {
                "success": False,
                "error": f"Command '{command}' not in whitelist"
            }

        cmd = self.allowed_commands[command]

        # Substitute parameters if provided
        if params:
            for key, value in params.items():
                cmd = cmd.replace(f"{{{key}}}", str(value))

        return self._ssh_exec(server, cmd)

    # === Convenience Methods ===

    def quick_health_check(self, server: str) -> dict:
        """
        Perform quick health check on server.

        Returns Docker container status, all system metrics, and a list
        of detected issues.
        """
        health = {
            "server": server,
            "docker": self.get_docker_status(server),
            "metrics": self.get_metrics(server, "all"),
            "healthy": True,
            "issues": []
        }

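        # A failed status query is itself an issue
        if not health["docker"].get("success"):
            health["healthy"] = False
            health["issues"].append("Could not query Docker status")

        # Check for stopped containers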
        if health["docker"].get("data"):
            for container in health["docker"]["data"]:
                status = container.get("State", container.get("Status", ""))
                if "Up" not in str(status) and "running" not in str(status).lower():
                    health["healthy"] = False
                    health["issues"].append(
                        f"Container {container.get('Names', 'unknown')} is not running"
                    )

        return health

    def to_json(self, data: Any) -> str:
        """Convert result to JSON string."""
        return json.dumps(data, indent=2, default=str)


class SecurityError(Exception):
    """Raised when a command violates security constraints."""
    pass


def main():
    """CLI interface for server diagnostics."""
    import argparse

    parser = argparse.ArgumentParser(
        description="Server Diagnostics CLI",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  %(prog)s docker-status proxmox-host
  %(prog)s docker-status proxmox-host --container tdarr
  %(prog)s docker-logs proxmox-host tdarr --lines 200
  %(prog)s docker-restart proxmox-host tdarr
  %(prog)s metrics proxmox-host --type all
  %(prog)s logs proxmox-host --type system --lines 50
  %(prog)s health proxmox-host
  %(prog)s diagnostic proxmox-host disk_usage
        """
    )

    subparsers = parser.add_subparsers(dest="command", required=True)

    # docker-status
    p_docker = subparsers.add_parser("docker-status", help="Get Docker container status")
    p_docker.add_argument("server", help="Server identifier")
    p_docker.add_argument("--container", "-c", help="Specific container name")

    # docker-logs
    p_logs = subparsers.add_parser("docker-logs", help="Get Docker container logs")
    p_logs.add_argument("server", help="Server identifier")
    p_logs.add_argument("container", help="Container name")
    p_logs.add_argument("--lines", "-n", type=int, default=100, help="Number of lines")
    p_logs.add_argument("--filter", "-f", help="Grep filter pattern")

    # docker-restart
    p_restart = subparsers.add_parser("docker-restart", help="Restart Docker container")
    p_restart.add_argument("server", help="Server identifier")
    p_restart.add_argument("container", help="Container name")

    # metrics
    p_metrics = subparsers.add_parser("metrics", help="Get system metrics")
    p_metrics.add_argument("server", help="Server identifier")
    p_metrics.add_argument("--type", "-t", default="all",
                          choices=["cpu", "memory", "disk", "network", "all"],
                          help="Metric type")

    # logs
    p_syslogs = subparsers.add_parser("logs", help="Read system logs")
    p_syslogs.add_argument("server", help="Server identifier")
    p_syslogs.add_argument("--type", "-t", default="system",
                          choices=["system", "docker", "application", "custom"],
                          help="Log type")
    p_syslogs.add_argument("--lines", "-n", type=int, default=100, help="Number of lines")
    p_syslogs.add_argument("--filter", "-f", help="Grep filter pattern")
    p_syslogs.add_argument("--path", help="Custom log path (for type=custom)")

    # health
    p_health = subparsers.add_parser("health", help="Quick health check")
    p_health.add_argument("server", help="Server identifier")

    # diagnostic
    p_diag = subparsers.add_parser("diagnostic", help="Run whitelisted diagnostic")
    p_diag.add_argument("server", help="Server identifier")
    # Named diag_command because a positional called "command" would
    # overwrite the subparser dest and no branch below would match
    p_diag.add_argument("diag_command", metavar="command", help="Command from whitelist")
    p_diag.add_argument("--params", "-p", help="JSON parameters for command substitution")

    args = parser.parse_args()

    client = ServerDiagnostics()

    if args.command == "docker-status":
        result = client.get_docker_status(args.server, args.container)

    elif args.command == "docker-logs":
        result = client.docker_logs(
            args.server, args.container, args.lines, args.filter
        )

    elif args.command == "docker-restart":
        result = client.docker_restart(args.server, args.container)

    elif args.command == "metrics":
        result = client.get_metrics(args.server, args.type)

    elif args.command == "logs":
        result = client.read_logs(
            args.server, args.type, args.lines, args.filter, args.path
        )

    elif args.command == "health":
        result = client.quick_health_check(args.server)

    elif args.command == "diagnostic":
        params = json.loads(args.params) if args.params else None
        result = client.run_diagnostic(args.server, args.diag_command, params)

    print(client.to_json(result))


if __name__ == "__main__":
    main()

3. config.yaml

# Server Diagnostics Configuration
# Used by client.py for server inventory and security constraints

# Server inventory - SSH connection details
servers:
  proxmox-host:
    hostname: 10.10.0.11  # Update with actual IP
    ssh_user: root
    ssh_key: ~/.ssh/claude_diagnostics_key
    description: "Main Proxmox host running Docker services"

# Docker containers to monitor
# restart_allowed: false prevents automatic remediation
docker_containers:
  - name: tdarr
    critical: true
    restart_allowed: true
    description: "Media transcoding service"

  - name: portainer
    critical: true
    restart_allowed: true
    description: "Docker management UI"

  - name: n8n
    critical: true
    restart_allowed: false  # Never restart - it triggers us!
    description: "Workflow automation"

  - name: plex
    critical: true
    restart_allowed: true
    description: "Media server"

# Whitelisted diagnostic commands
# run_diagnostic accepts only the keys listed here
diagnostic_commands:
  disk_usage: "df -h"
  memory_usage: "free -h"
  cpu_usage: "top -bn1 | head -20"
  cpu_load: "uptime"
  process_list: "ps aux --sort=-%mem | head -20"
  process_tree: "pstree -p"
  network_status: "ss -tuln"
  network_connections: "netstat -an | head -50"
  docker_ps: "docker ps -a --format 'table {{.Names}}\\t{{.Status}}\\t{{.Ports}}'"
  docker_stats: "docker stats --no-stream --format 'table {{.Name}}\\t{{.CPUPerc}}\\t{{.MemUsage}}'"
  service_status: "systemctl status {service}"
  journal_errors: "journalctl -p err -n 50 --no-pager"
  port_check: "nc -zv {host} {port}"
  dns_check: "dig +short {domain}"
  ping_check: "ping -c 3 {host}"

# Remediation commands (low-risk only)
remediation_commands:
  docker_restart: "docker restart {container}"
  docker_logs: "docker logs --tail 500 {container}"
  service_restart: "systemctl restart {service}"  # Phase 2

# DENIED patterns - commands containing these will be rejected
# This is a security safeguard
denied_patterns:
  - "rm -rf"
  - "rm -r /"
  - "dd if="
  - "mkfs"
  - ":(){:|:&};:"
  - "shutdown"
  - "reboot"
  - "init 0"
  - "init 6"
  - "systemctl stop"
  - "> /dev/sd"
  - "chmod 777"
  - "wget|sh"
  - "curl|sh"
  - "eval"
  - "$("   # command substitution (also covers "$((")
  - "`"    # backtick command substitution

# Logging configuration
logging:
  enabled: true
  path: ~/.claude/logs/server-diagnostics.log
  max_size_mb: 10
  backup_count: 5
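
Several whitelisted commands carry placeholders such as `{service}`, `{host}`, and `{port}` that are filled from caller-supplied parameters by plain string replacement. A minimal sketch of validating those values before substitution follows. The character set in `SAFE_VALUE` and the name `fill_template` are assumptions chosen for illustration, and the pattern would need adjusting for values that legitimately contain other characters.

```python
import re

# Assumption: service names, hostnames, and ports only need these characters.
SAFE_VALUE = re.compile(r"[A-Za-z0-9._:@-]+")
# Matches {name} but not Docker format fields such as {{.Names}}
PLACEHOLDER = re.compile(r"\{(\w+)\}")


def fill_template(template: str, params: dict[str, object]) -> str:
    """Substitute {name} placeholders, rejecting missing or unsafe values."""
    def replace(match: re.Match) -> str:
        key = match.group(1)
        if key not in params:
            raise ValueError(f"Missing parameter: {key}")
        value = str(params[key])
        if not SAFE_VALUE.fullmatch(value):
            raise ValueError(f"Unsafe value for {key}: {value!r}")
        return value

    return PLACEHOLDER.sub(replace, template)
```

Unlike a bare `str.replace`, this also fails loudly when a placeholder is left unfilled, so `systemctl status {service}` is never sent to the server verbatim.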

4. SKILL.md Full Content

See separate file: SKILL.md in the skill directory.

Key sections:

  • Activation triggers
  • Quick start with Python and CLI examples
  • Troubleshooting workflow (step-by-step)
  • MemoryGraph integration instructions
  • Security constraints documentation
  • Common error patterns and solutions

Integration Points

With Proxmox Skill

The server-diagnostics skill can leverage the Proxmox skill for:

  • VM/LXC lifecycle operations (restart container that runs Docker)
  • Resource monitoring at hypervisor level
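  • Snapshot creation before risky operations

# Example integration
# Assumes both skills are importable as packages; the on-disk directory name
# server-diagnostics (with a hyphen) is not a valid Python package name.
from proxmox_client import ProxmoxClient
from server_diagnostics.client import ServerDiagnostics

proxmox = ProxmoxClient()
diag = ServerDiagnostics()

# Check if container needs VM-level intervention
result = diag.docker_restart("proxmox-host", "tdarr")
if not result["success"]:
    # Escalate to VM level; lxc_id is the ID of the LXC that hosts Docker
    # and must be defined for your environment
    proxmox.restart_container(lxc_id)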

With MemoryGraph

# In SKILL.md, instruct Claude to:

# 1. Before diagnosis - recall similar issues
# python ~/.claude/skills/memorygraph/client.py recall "docker tdarr timeout"

# 2. After resolution - store the pattern
# python ~/.claude/skills/memorygraph/client.py store \
#   --type solution \
#   --title "Tdarr container memory exhaustion" \
#   --content "Container exceeded memory limit due to large transcode queue..." \
#   --tags "docker,tdarr,memory,troubleshooting" \
#   --importance 0.7

With N8N

N8N invokes the skill via headless Claude Code:

claude -p "
You are troubleshooting a server issue. Use the server-diagnostics skill.

Server: proxmox-host
Error Type: container_stopped
Container: tdarr
Timestamp: 2025-12-19T14:30:00Z

Use the diagnostic client to investigate and resolve if possible.
" --output-format json --allowedTools "Read,Bash,Grep,Glob"

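The invocation above can also be assembled programmatically from an alert payload. The sketch below builds the argument list only and does not run anything. The payload field names (`server`, `error_type`, `container`, `timestamp`) are assumptions that mirror the prompt shown above, and `build_claude_argv` is a hypothetical helper.

```python
def build_claude_argv(alert: dict[str, str]) -> list[str]:
    """Turn an alert payload into an argv list for headless Claude Code."""
    prompt = "\n".join([
        "You are troubleshooting a server issue. Use the server-diagnostics skill.",
        "",
        f"Server: {alert['server']}",
        f"Error Type: {alert['error_type']}",
        f"Container: {alert['container']}",
        f"Timestamp: {alert['timestamp']}",
        "",
        "Use the diagnostic client to investigate and resolve if possible.",
    ])
    # Flags are the ones used in the command above
    return [
        "claude", "-p", prompt,
        "--output-format", "json",
        "--allowedTools", "Read,Bash,Grep,Glob",
    ]
```

Passing the resulting list to `subprocess.run` without `shell=True` keeps the multi-line prompt intact as a single argument, so no shell quoting is needed.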
Security Model

Three-Layer Protection

  1. settings.json - Claude Code level allow/deny lists
  2. config.yaml - Skill-level command whitelist and denied patterns
  3. Container config - Per-container restart permissions
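
Layers 2 and 3 can be pictured as two checks applied in order before a restart is issued. The following is a condensed sketch of that flow rather than the client's actual code; `check_restart` is a hypothetical name, and layer 1 (settings.json) is enforced by Claude Code itself and is not modelled here.

```python
def check_restart(container: str, containers: list[dict],
                  denied_patterns: list[str]) -> tuple[bool, str]:
    """Return (allowed, command or reason) for a restart request."""
    command = f"docker restart {container}"

    # Layer 2: skill-level deny list (plain substring match)
    for pattern in denied_patterns:
        if pattern in command:
            return False, f"denied pattern: {pattern}"

    # Layer 3: per-container restart permission
    entry = next((c for c in containers if c["name"] == container), None)
    if entry is None:
        return False, "container not in monitored list"
    if not entry.get("restart_allowed", False):
        return False, "restart not permitted"

    return True, command
```

Because layer 2 is a substring match, a pattern such as `curl|sh` only matches that exact text; `curl http://host/x | sh` would not be caught. The whitelist and the per-container flags therefore carry most of the weight.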

Audit Trail

All operations are logged:

  • Skill logs to ~/.claude/logs/server-diagnostics.log
  • MemoryGraph entries for significant troubleshooting
  • N8N execution history
  • NAS report storage
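
The skill log could be set up from the `logging` section of config.yaml with a rotating file handler. This is a sketch under assumptions: the document does not specify how that section is consumed, so the mapping of `max_size_mb` and `backup_count` onto `RotatingFileHandler`, the logger name, and the helper name `build_logger` are all illustrative.

```python
import logging
from logging.handlers import RotatingFileHandler
from pathlib import Path


def build_logger(cfg: dict) -> logging.Logger:
    """Create the skill logger from the `logging` section of config.yaml."""
    logger = logging.getLogger("server-diagnostics")  # assumed logger name
    logger.setLevel(logging.INFO)
    # Skip setup when logging is disabled or a handler is already attached
    if not cfg.get("enabled", False) or logger.handlers:
        return logger

    path = Path(cfg["path"]).expanduser()  # resolves the leading ~
    path.parent.mkdir(parents=True, exist_ok=True)
    handler = RotatingFileHandler(
        path,
        maxBytes=int(cfg["max_size_mb"]) * 1024 * 1024,
        backupCount=int(cfg["backup_count"]),
    )
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    return logger
```

A call such as `build_logger(config["logging"]).info("docker_restart tdarr")` would then append one timestamped line per operation.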

Testing Strategy

Unit Tests

# Test command validation
python -m pytest tests/test_security.py

# Test SSH mocking
python -m pytest tests/test_ssh.py
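
A test in tests/test_ssh.py might replace `subprocess.run` with a mock so that no real SSH connection is made. The sketch below is self-contained: `ssh_exec` is a reduced stand-in for `ServerDiagnostics._ssh_exec`, and the test name is hypothetical.

```python
import subprocess
from unittest.mock import patch


def ssh_exec(host: str, command: str) -> dict:
    """Stand-in for ServerDiagnostics._ssh_exec, reduced to the essentials."""
    result = subprocess.run(
        ["ssh", "-o", "ConnectTimeout=10", host, command],
        capture_output=True, text=True, timeout=60,
    )
    return {
        "stdout": result.stdout,
        "returncode": result.returncode,
        "success": result.returncode == 0,
    }


def test_ssh_exec_reports_success():
    fake = subprocess.CompletedProcess(
        args=[], returncode=0, stdout="ok\n", stderr=""
    )
    with patch("subprocess.run", return_value=fake) as run:
        result = ssh_exec("root@10.10.0.11", "uptime")

    assert result["success"]
    assert result["stdout"] == "ok\n"
    # ssh receives the remote command as its final argument
    assert run.call_args.args[0][-1] == "uptime"
```

The same pattern extends to failure cases by returning a `CompletedProcess` with a non-zero `returncode`.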

Integration Tests

# Test against real server (requires SSH access)
python client.py health proxmox-host
python client.py docker-status proxmox-host
python client.py diagnostic proxmox-host disk_usage

Simulated Failures

# Stop a container on the server, then verify detection from the client
docker stop tdarr
python client.py health proxmox-host  # Should show issue
python client.py docker-restart proxmox-host tdarr  # Should restart
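
The detection step can also be exercised without a server by feeding sample output through the same rule that `quick_health_check` applies. The following is a sketch: the sample records are illustrative and only assume that `docker ps --format json` emits one JSON object per line with `Names`, `State`, and `Status` fields; `find_stopped` is a hypothetical helper.

```python
import json


def find_stopped(ndjson: str) -> list[str]:
    """Return names of containers whose state is neither Up nor running."""
    stopped = []
    for line in ndjson.strip().splitlines():
        if not line:
            continue
        container = json.loads(line)
        state = str(container.get("State", container.get("Status", "")))
        if "Up" not in state and "running" not in state.lower():
            stopped.append(container.get("Names", "unknown"))
    return stopped


# Illustrative sample data, not captured from a real host
sample = "\n".join([
    json.dumps({"Names": "tdarr", "State": "exited",
                "Status": "Exited (137) 2 minutes ago"}),
    json.dumps({"Names": "plex", "State": "running",
                "Status": "Up 3 days"}),
])
print(find_stopped(sample))  # ['tdarr']
```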

File Relationships

┌─────────────────────────────────────────────────────────────┐
│                     Claude Code Session                      │
│                                                              │
│  Loads: SKILL.md (context)                                  │
│         ↓                                                    │
│  Executes: python client.py <command>                       │
│         ↓                                                    │
│  Reads: config.yaml (server inventory, whitelist)           │
│         ↓                                                    │
│  Connects: SSH to servers                                   │
│         ↓                                                    │
│  Returns: JSON output to Claude Code                        │
│         ↓                                                    │
│  Stores: MemoryGraph (learnings)                            │
│                                                              │
└─────────────────────────────────────────────────────────────┘

Next Steps

  1. Create actual skill files in ~/.claude/skills/server-diagnostics/
  2. Generate SSH key pair for diagnostics
  3. Install key on Proxmox host
  4. Test basic connectivity
  5. Integrate with N8N workflow