chore: add recovered CT 302 configs, archive tdarr scripts, clean up repo

- Add recovered LXC 300/302 server-diagnostics configs as reference
  (headless Claude permission patterns, health check client)
- Archive decommissioned tdarr monitoring scripts
- Gitignore rpg-art/ directory
- Delete stray temp files and swarm-test/

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Cal Corum 2026-03-01 00:41:41 -06:00
parent 64f9662f25
commit 28abde7c9f
10 changed files with 2260 additions and 0 deletions

.gitignore

@@ -17,3 +17,6 @@ __pycache__
# Large binary files
*.zip
# Art assets (managed separately)
rpg-art/

View File

@@ -0,0 +1,177 @@
---
name: server-diagnostics
description: |
Automated server troubleshooting for Docker containers and system health.
Provides SSH-based diagnostics, log reading, metrics collection, and low-risk
remediation. USE WHEN N8N triggers troubleshooting, container issues detected,
or system health checks needed.
---
# Server Diagnostics - Automated Troubleshooting
## When to Activate This Skill
- N8N triggers with error context
- "diagnose container X", "check docker status"
- "read logs from server", "check disk usage"
- "troubleshoot server issue"
- Any automated health check response
## Quick Start
### Check All Containers
```bash
python ~/.claude/skills/server-diagnostics/client.py docker-status paper-dynasty
```
### Quick Health Check (Docker + System Metrics)
```bash
python ~/.claude/skills/server-diagnostics/client.py health paper-dynasty
```
### Get Container Logs
```bash
python ~/.claude/skills/server-diagnostics/client.py docker-logs paper-dynasty paper-dynasty_discord-app_1 --lines 200
```
### Restart a Container
```bash
python ~/.claude/skills/server-diagnostics/client.py docker-restart paper-dynasty paper-dynasty_discord-app_1
```
### System Metrics
```bash
python ~/.claude/skills/server-diagnostics/client.py metrics paper-dynasty --type all
python ~/.claude/skills/server-diagnostics/client.py metrics paper-dynasty --type disk
```
### Run Diagnostic Command
```bash
python ~/.claude/skills/server-diagnostics/client.py diagnostic paper-dynasty disk_usage
python ~/.claude/skills/server-diagnostics/client.py diagnostic paper-dynasty memory_usage
```
## Troubleshooting Workflow
When an issue is reported:
1. **Quick Health Check** - Get overview of containers and system state
```bash
python ~/.claude/skills/server-diagnostics/client.py health paper-dynasty
```
2. **Check MemoryGraph** - Recall similar issues
```bash
python ~/.claude/skills/memorygraph/client.py recall "docker container error"
```
3. **Get Container Logs** - Look for errors
```bash
python ~/.claude/skills/server-diagnostics/client.py docker-logs paper-dynasty <container> --lines 500 --filter error
```
4. **Remediate if Safe** - Restart if allowed
```bash
python ~/.claude/skills/server-diagnostics/client.py docker-restart paper-dynasty <container>
```
5. **Store Solution** - Save to MemoryGraph if resolved
```bash
python ~/.claude/skills/memorygraph/client.py store \
--type solution \
--title "Fixed <container> issue" \
--content "Description of problem and solution" \
--tags "docker,paper-dynasty,troubleshooting" \
--importance 0.7
```
## Server Inventory
| Server | IP | SSH User | Description |
|--------|-----|----------|-------------|
| paper-dynasty | 10.10.0.88 | cal | Paper Dynasty Discord bots and services |
## Monitored Containers
| Container | Critical | Restart Allowed | Description |
|-----------|----------|-----------------|-------------|
| paper-dynasty_discord-app_1 | Yes | Yes | Paper Dynasty Discord bot |
| paper-dynasty_db_1 | Yes | Yes | PostgreSQL database |
| paper-dynasty_adminer_1 | No | Yes | Database admin UI |
| sba-website_sba-web_1 | Yes | Yes | SBA website |
| sba-ghost_sba-ghost_1 | No | Yes | Ghost CMS |
## Available Diagnostic Commands
- `disk_usage` - df -h
- `memory_usage` - free -h
- `cpu_usage` - top -bn1 | head -20
- `cpu_load` - uptime
- `process_list` - ps aux --sort=-%mem | head -20
- `network_status` - ss -tuln
- `docker_ps` - docker ps -a (formatted)
- `docker_stats` - docker stats --no-stream
- `journal_errors` - journalctl -p err -n 50
## Security Constraints
### DENIED Patterns (Will Be Rejected)
- rm -rf, rm -r /
- dd if=, mkfs
- shutdown, reboot
- systemctl stop
- chmod 777
- wget|sh, curl|sh
### Container Restart Rules
- Only containers in config.yaml with restart_allowed: true
- N8N container restart is NEVER allowed (it triggers us)
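The deny check is plain substring matching against `denied_patterns` (see `_validate_command` in client.py). A minimal sketch of the idea, with a subset of the pattern list:

```python
# Sketch of the deny-list check: plain substring matching.
# Pattern list here is a subset of denied_patterns in config.yaml.
DENIED_PATTERNS = ["rm -rf", "rm -r /", "dd if=", "mkfs",
                   "shutdown", "reboot", "systemctl stop",
                   "chmod 777", "wget|sh", "curl|sh"]

def is_command_allowed(command: str) -> bool:
    """Return False if the command contains any denied substring."""
    return not any(pattern in command for pattern in DENIED_PATTERNS)
```

Substring matching is deliberately blunt: a harmless command that merely mentions `reboot` (e.g. grepping logs for it) is also rejected, trading false positives for safety.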
## MemoryGraph Integration
Before troubleshooting, check for known solutions:
```bash
python ~/.claude/skills/memorygraph/client.py recall "docker paper-dynasty"
```
After resolving, store the pattern:
```bash
python ~/.claude/skills/memorygraph/client.py store \
--type solution \
--title "Brief description" \
--content "Full explanation..." \
--tags "docker,paper-dynasty,fix" \
--importance 0.7
```
## Common Issues and Solutions
### Container Not Running
1. Check logs for crash reason
2. Check disk space and memory
3. Attempt restart if allowed
4. Escalate if restart fails
### High Memory Usage
1. Check which container is consuming
2. Review docker stats
3. Check for memory leaks in logs
4. Consider container restart
### Disk Space Low
1. Run disk_usage diagnostic
2. Check docker system df
3. Consider log rotation
4. Alert user for cleanup
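The threshold logic behind the disk check can be sketched with Python's stdlib (illustrative only; the client itself shells out to `df -h` over SSH rather than running locally):

```python
import shutil

def disk_low(path: str = "/", min_free_pct: float = 10.0) -> bool:
    """Flag a filesystem as low when free space falls below min_free_pct."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total * 100 < min_free_pct
```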
## Output Format
All commands return JSON:
```json
{
"success": true,
"stdout": "...",
"stderr": "...",
"returncode": 0,
"data": {...} // Parsed data if applicable
}
```
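Consumers (e.g. an N8N workflow step) can parse this envelope directly. A sketch using a hypothetical sample payload; real `stdout` content varies by command:

```python
import json

# Hypothetical sample of the JSON envelope above.
raw = '{"success": true, "stdout": "Filesystem  Size  Used", "stderr": "", "returncode": 0}'
result = json.loads(raw)

if result["success"] and result["returncode"] == 0:
    # Act on the command output, e.g. inspect the first line.
    first_line = result["stdout"].splitlines()[0]
```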


@@ -0,0 +1,443 @@
#!/usr/bin/env python3
"""
Server Diagnostics Client Library
Provides SSH-based diagnostics for homelab troubleshooting
"""
import json
import subprocess
from pathlib import Path
from typing import Any, Optional
import yaml
class ServerDiagnostics:
"""
Main diagnostic client for server troubleshooting.
Connects to servers via SSH and executes whitelisted diagnostic
commands. Enforces security constraints from config.yaml.
"""
def __init__(self, config_path: Optional[str] = None):
"""
Initialize with configuration.
Args:
config_path: Path to config.yaml. Defaults to same directory.
"""
if config_path is None:
config_path = Path(__file__).parent / "config.yaml"
self.config = self._load_config(config_path)
self.servers = self.config.get("servers", {})
self.containers = self.config.get("docker_containers", [])
self.allowed_commands = self.config.get("diagnostic_commands", {})
self.remediation_commands = self.config.get("remediation_commands", {})
self.denied_patterns = self.config.get("denied_patterns", [])
def _load_config(self, path) -> dict:
"""Load YAML configuration."""
with open(path) as f:
return yaml.safe_load(f)
def _validate_command(self, command: str) -> bool:
"""Check command against deny list."""
for pattern in self.denied_patterns:
if pattern in command:
raise SecurityError(f"Command contains denied pattern: {pattern}")
return True
def _ssh_exec(self, server: str, command: str) -> dict:
"""
Execute command on remote server via SSH.
Returns:
dict with stdout, stderr, returncode
"""
self._validate_command(command)
server_config = self.servers.get(server)
if not server_config:
raise ValueError(f"Unknown server: {server}")
ssh_key = Path(server_config["ssh_key"]).expanduser()
ssh_user = server_config["ssh_user"]
hostname = server_config["hostname"]
ssh_cmd = [
"ssh",
"-i",
str(ssh_key),
"-o",
"StrictHostKeyChecking=no",
"-o",
"ConnectTimeout=10",
f"{ssh_user}@{hostname}",
command,
]
result = subprocess.run(ssh_cmd, capture_output=True, text=True, timeout=60)
return {
"stdout": result.stdout,
"stderr": result.stderr,
"returncode": result.returncode,
"success": result.returncode == 0,
}
# === Docker Operations ===
def get_docker_status(self, server: str, container: Optional[str] = None) -> dict:
"""
Get Docker container status.
Args:
server: Server identifier from config
container: Specific container name (optional, all if not specified)
Returns:
dict with container statuses
"""
if container:
cmd = "docker inspect --format '{{json .State}}' " + container
result = self._ssh_exec(server, cmd)
if result["success"]:
try:
result["data"] = json.loads(result["stdout"])
except json.JSONDecodeError:
result["data"] = None
else:
# Use Go template format for Docker 20.10 compatibility
# Format: Name|Status|State|Ports
cmd = "docker ps -a --format '{{.Names}}|{{.Status}}|{{.State}}|{{.Ports}}'"
result = self._ssh_exec(server, cmd)
if result["success"]:
containers = []
for line in result["stdout"].strip().split("\n"):
if line:
parts = line.split("|")
if len(parts) >= 3:
containers.append(
{
"Names": parts[0],
"Status": parts[1],
"State": parts[2],
"Ports": parts[3] if len(parts) > 3 else "",
}
)
result["data"] = containers
return result
def docker_logs(
self,
server: str,
container: str,
lines: int = 100,
log_filter: Optional[str] = None,
) -> dict:
"""
Get Docker container logs.
Args:
server: Server identifier
container: Container name
lines: Number of lines to retrieve
log_filter: Optional grep filter pattern
Returns:
dict with log output
"""
cmd = f"docker logs --tail {lines} {container} 2>&1"
if log_filter:
cmd += f" | grep -i '{log_filter}'"
return self._ssh_exec(server, cmd)
def docker_restart(self, server: str, container: str) -> dict:
"""
Restart a Docker container (low-risk remediation).
Args:
server: Server identifier
container: Container name
Returns:
dict with operation result
"""
# Check if container is allowed to be restarted
container_config = next(
(c for c in self.containers if c["name"] == container), None
)
if not container_config:
return {
"success": False,
"error": f"Container {container} not in monitored list",
}
if not container_config.get("restart_allowed", False):
return {
"success": False,
"error": f"Container {container} restart not permitted",
}
cmd = f"docker restart {container}"
result = self._ssh_exec(server, cmd)
result["action"] = "docker_restart"
result["container"] = container
return result
# === System Diagnostics ===
def get_metrics(self, server: str, metric_type: str = "all") -> dict:
"""
Get system metrics from server.
Args:
server: Server identifier
metric_type: Type of metrics (cpu, memory, disk, network, all)
Returns:
dict with metric data
"""
metrics = {}
if metric_type in ("cpu", "all"):
result = self._ssh_exec(server, self.allowed_commands["cpu_usage"])
metrics["cpu"] = result
if metric_type in ("memory", "all"):
result = self._ssh_exec(server, self.allowed_commands["memory_usage"])
metrics["memory"] = result
if metric_type in ("disk", "all"):
result = self._ssh_exec(server, self.allowed_commands["disk_usage"])
metrics["disk"] = result
if metric_type in ("network", "all"):
result = self._ssh_exec(server, self.allowed_commands["network_status"])
metrics["network"] = result
return {"server": server, "metrics": metrics}
def read_logs(
self,
server: str,
log_type: str,
lines: int = 100,
log_filter: Optional[str] = None,
custom_path: Optional[str] = None,
) -> dict:
"""
Read logs from server.
Args:
server: Server identifier
log_type: Type of log (system, docker, application, custom)
lines: Number of lines
log_filter: Optional grep pattern
custom_path: Path for custom log type
Returns:
dict with log content
"""
log_paths = {
"system": "/var/log/syslog",
"docker": "/var/log/docker.log",
"application": "/var/log/application.log",
}
path = custom_path if log_type == "custom" else log_paths.get(log_type)
if not path:
return {"success": False, "error": f"Unknown log type: {log_type}"}
cmd = f"tail -n {lines} {path}"
if log_filter:
cmd += f" | grep -i '{log_filter}'"
return self._ssh_exec(server, cmd)
def run_diagnostic(
self, server: str, command: str, params: Optional[dict] = None
) -> dict:
"""
Run a whitelisted diagnostic command.
Args:
server: Server identifier
command: Command key from config whitelist
params: Optional parameters to substitute
Returns:
dict with command output
"""
if command not in self.allowed_commands:
return {"success": False, "error": f"Command '{command}' not in whitelist"}
cmd = self.allowed_commands[command]
# Substitute parameters if provided
if params:
for key, value in params.items():
cmd = cmd.replace(f"{{{key}}}", str(value))
return self._ssh_exec(server, cmd)
# === Convenience Methods ===
def quick_health_check(self, server: str) -> dict:
"""
Perform quick health check on server.
Returns summary of Docker containers, disk, and memory.
"""
health = {
"server": server,
"docker": self.get_docker_status(server),
"metrics": self.get_metrics(server, "all"),
"healthy": True,
"issues": [],
}
# Check for stopped containers
if health["docker"].get("data"):
for container in health["docker"]["data"]:
status = container.get("State", container.get("Status", ""))
if "Up" not in str(status) and "running" not in str(status).lower():
health["healthy"] = False
health["issues"].append(
f"Container {container.get('Names', 'unknown')} is not running"
)
return health
def to_json(self, data: Any) -> str:
"""Convert result to JSON string."""
return json.dumps(data, indent=2, default=str)
class SecurityError(Exception):
"""Raised when a command violates security constraints."""
pass
def main():
"""CLI interface for server diagnostics."""
import argparse
parser = argparse.ArgumentParser(
description="Server Diagnostics CLI",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
%(prog)s docker-status paper-dynasty
%(prog)s docker-status paper-dynasty --container paper-dynasty_discord-app_1
%(prog)s docker-logs paper-dynasty paper-dynasty_discord-app_1 --lines 200
%(prog)s docker-restart paper-dynasty paper-dynasty_discord-app_1
%(prog)s metrics paper-dynasty --type all
%(prog)s health paper-dynasty
%(prog)s diagnostic paper-dynasty disk_usage
""",
)
subparsers = parser.add_subparsers(dest="command", required=True)
# docker-status
p_docker = subparsers.add_parser(
"docker-status", help="Get Docker container status"
)
p_docker.add_argument("server", help="Server identifier")
p_docker.add_argument("--container", "-c", help="Specific container name")
# docker-logs
p_logs = subparsers.add_parser("docker-logs", help="Get Docker container logs")
p_logs.add_argument("server", help="Server identifier")
p_logs.add_argument("container", help="Container name")
p_logs.add_argument("--lines", "-n", type=int, default=100, help="Number of lines")
p_logs.add_argument("--filter", "-f", dest="log_filter", help="Grep filter pattern")
# docker-restart
p_restart = subparsers.add_parser("docker-restart", help="Restart Docker container")
p_restart.add_argument("server", help="Server identifier")
p_restart.add_argument("container", help="Container name")
# metrics
p_metrics = subparsers.add_parser("metrics", help="Get system metrics")
p_metrics.add_argument("server", help="Server identifier")
p_metrics.add_argument(
"--type",
"-t",
default="all",
choices=["cpu", "memory", "disk", "network", "all"],
help="Metric type",
)
# logs
p_syslogs = subparsers.add_parser("logs", help="Read system logs")
p_syslogs.add_argument("server", help="Server identifier")
p_syslogs.add_argument(
"--type",
"-t",
default="system",
choices=["system", "docker", "application", "custom"],
help="Log type",
)
p_syslogs.add_argument(
"--lines", "-n", type=int, default=100, help="Number of lines"
)
p_syslogs.add_argument(
"--filter", "-f", dest="log_filter", help="Grep filter pattern"
)
p_syslogs.add_argument("--path", help="Custom log path (for type=custom)")
# health
p_health = subparsers.add_parser("health", help="Quick health check")
p_health.add_argument("server", help="Server identifier")
# diagnostic
p_diag = subparsers.add_parser("diagnostic", help="Run whitelisted diagnostic")
p_diag.add_argument("server", help="Server identifier")
p_diag.add_argument("diagnostic_cmd", help="Command from whitelist")
p_diag.add_argument(
"--params", "-p", help="JSON parameters for command substitution"
)
args = parser.parse_args()
client = ServerDiagnostics()
if args.command == "docker-status":
result = client.get_docker_status(args.server, args.container)
elif args.command == "docker-logs":
result = client.docker_logs(
args.server, args.container, args.lines, args.log_filter
)
elif args.command == "docker-restart":
result = client.docker_restart(args.server, args.container)
elif args.command == "metrics":
result = client.get_metrics(args.server, args.type)
elif args.command == "logs":
result = client.read_logs(
args.server, args.type, args.lines, args.log_filter, args.path
)
elif args.command == "health":
result = client.quick_health_check(args.server)
elif args.command == "diagnostic":
params = json.loads(args.params) if args.params else None
result = client.run_diagnostic(args.server, args.diagnostic_cmd, params)
print(client.to_json(result))
if __name__ == "__main__":
main()


@@ -0,0 +1,72 @@
# Server Diagnostics Configuration
# Used by client.py for server inventory and security constraints
# Server inventory - SSH connection details
servers:
paper-dynasty:
hostname: 10.10.0.88
ssh_user: cal
ssh_key: ~/.ssh/claude_diagnostics_key
description: "Paper Dynasty Discord bots and services"
# Docker containers to monitor
# restart_allowed: false prevents automatic remediation
docker_containers:
- name: paper-dynasty_discord-app_1
critical: true
restart_allowed: true
description: "Paper Dynasty Discord bot"
- name: paper-dynasty_db_1
critical: true
restart_allowed: true
description: "Paper Dynasty PostgreSQL database"
- name: paper-dynasty_adminer_1
critical: false
restart_allowed: true
description: "Database admin UI"
- name: sba-website_sba-web_1
critical: true
restart_allowed: true
description: "SBA website"
- name: sba-ghost_sba-ghost_1
critical: false
restart_allowed: true
description: "SBA Ghost CMS"
# Whitelisted diagnostic commands
diagnostic_commands:
disk_usage: "df -h"
memory_usage: "free -h"
cpu_usage: "top -bn1 | head -20"
cpu_load: "uptime"
process_list: "ps aux --sort=-%mem | head -20"
network_status: "ss -tuln"
docker_ps: "docker ps -a --format 'table {{.Names}}\\t{{.Status}}\\t{{.Ports}}'"
docker_stats: "docker stats --no-stream --format 'table {{.Name}}\\t{{.CPUPerc}}\\t{{.MemUsage}}'"
journal_errors: "journalctl -p err -n 50 --no-pager"
# Remediation commands (low-risk only)
remediation_commands:
docker_restart: "docker restart {container}"
docker_logs: "docker logs --tail 500 {container}"
# DENIED patterns - commands containing these will be rejected
denied_patterns:
- "rm -rf"
- "rm -r /"
- "dd if="
- "mkfs"
- ":(){:|:&};:"
- "shutdown"
- "reboot"
- "init 0"
- "init 6"
- "systemctl stop"
- "> /dev/sd"
- "chmod 777"
- "wget|sh"
- "curl|sh"


@@ -0,0 +1 @@
pyyaml>=6.0


@@ -0,0 +1,26 @@
{
"permissions": {
"allow": [
"Bash(python3 ~/.claude/skills/server-diagnostics/client.py:*)",
"Bash(ssh -i ~/.ssh/claude_diagnostics_key:*)",
"Read(~/.claude/skills/**)",
"Read(~/.claude/logs/**)",
"Glob(*)",
"Grep(*)"
],
"deny": [
"Bash(rm -rf:*)",
"Bash(rm -r /:*)",
"Bash(dd:*)",
"Bash(mkfs:*)",
"Bash(shutdown:*)",
"Bash(reboot:*)",
"Bash(*> /dev/sd*)",
"Bash(chmod 777:*)",
"Bash(*|sh)",
"Bash(*curl*|*bash*)",
"Bash(*wget*|*bash*)"
]
},
"model": "sonnet"
}

tdarr/archive/README.md

@@ -0,0 +1,19 @@
# Legacy Tdarr Scripts
## tdarr_monitor_local_node.py
Full-featured Tdarr monitoring script (~1200 lines) built for when the local workstation (nobara-pc) ran as an unmapped remote Tdarr node with GPU transcoding.
**Features:** Stuck job detection via cross-run state comparison (pickle file), automatic worker killing, Discord alerts, configurable thresholds, rotating log files.
**Why it existed:** The unmapped remote node architecture was prone to stuck jobs caused by network issues during file transfers between the remote node and server. The monitor ran every minute via cron to detect and kill stuck workers.
**Why it's archived:** Transcoding moved to ubuntu-manticore (10.10.0.226) as a local mapped node with shared NFS storage. No remote transfers means no stuck jobs. Tdarr manages its own workers natively. Archived February 2026.
## tdarr_file_monitor_local_node.py + tdarr-file-monitor-cron_local_node.sh
File completion monitor that watched the local Tdarr cache directory for finished `.mkv` transcodes and copied the smallest version to a backup location. The cron wrapper ran it every minute.
**Why it existed:** When the local workstation ran as an unmapped Tdarr node, completed transcodes landed in the local NVMe cache. This monitor detected completion (by tracking size stability) and kept the best copy.
**Why it's archived:** Same reason as above - mapped node on manticore writes directly to the shared NFS media mount. No local cache to monitor. Archived February 2026.
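The size-stability heuristic both scripts relied on is simple; a minimal sketch (illustrative, mirroring the `_is_file_complete` check in the archived monitor):

```python
import time
from typing import Optional

COMPLETION_WAIT_SECONDS = 60  # size must hold steady this long

def is_complete(last_size_change: float, now: Optional[float] = None) -> bool:
    """Treat a transcode as finished once its size has been stable
    for COMPLETION_WAIT_SECONDS."""
    if now is None:
        now = time.time()
    return now - last_size_change >= COMPLETION_WAIT_SECONDS
```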


@@ -0,0 +1,6 @@
#!/bin/bash
# Cron job wrapper for Tdarr file monitor
# Add this to crontab with: * * * * * /mnt/NV2/Development/claude-home/monitoring/scripts/tdarr-file-monitor-cron.sh
cd /mnt/NV2/Development/claude-home/monitoring/scripts
/usr/bin/python3 /mnt/NV2/Development/claude-home/monitoring/scripts/tdarr_file_monitor.py


@@ -0,0 +1,286 @@
#!/usr/bin/env python3
"""
Tdarr File Monitor - Monitors Tdarr cache directory for completed .mkv files and copies them to backup location.
Detects file completion by monitoring size changes and always keeps the smallest version of duplicate files.
"""
import shutil
import json
import time
import logging
from pathlib import Path
from dataclasses import dataclass, asdict
from typing import Dict, Optional
@dataclass
class FileState:
"""Tracks the state of a monitored file."""
path: str
size: int
last_modified: float
first_seen: float
last_size_change: float
check_count: int = 0
class TdarrFileMonitor:
"""Monitors Tdarr cache directory for completed .mkv files."""
def __init__(
self,
source_dir: str = "/mnt/NV2/tdarr-cache/nobara-pc-gpu-unmapped/temp",
media_dir: str = "/mnt/NV2/tdarr-cache/nobara-pc-gpu-unmapped/media",
dest_dir: str = "/mnt/NV2/tdarr-cache/manual-backup",
state_file: str = "/mnt/NV2/Development/claude-home/logs/tdarr_file_monitor_state.json",
completion_wait_seconds: int = 60,
log_file: str = "/mnt/NV2/Development/claude-home/logs/tdarr_file_monitor.log"
):
self.source_dir = Path(source_dir)
self.media_dir = Path(media_dir)
self.dest_dir = Path(dest_dir)
self.state_file = Path(state_file)
self.completion_wait_seconds = completion_wait_seconds
self.monitored_files: Dict[str, FileState] = {}
# Setup logging
logging.basicConfig(
level=logging.INFO,
format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
handlers=[
logging.FileHandler(log_file),
logging.StreamHandler()
]
)
self.logger = logging.getLogger(f'{__name__}.TdarrFileMonitor')
# Ensure destination directory exists
self.dest_dir.mkdir(parents=True, exist_ok=True)
# Load previous state
self._load_state()
def _load_state(self) -> None:
"""Load monitored files state from disk."""
if self.state_file.exists():
try:
with open(self.state_file, 'r') as f:
data = json.load(f)
self.monitored_files = {
path: FileState(**file_data)
for path, file_data in data.items()
}
self.logger.info(f"Loaded state for {len(self.monitored_files)} monitored files")
except Exception as e:
self.logger.error(f"Failed to load state file: {e}")
self.monitored_files = {}
def _save_state(self) -> None:
"""Save monitored files state to disk."""
try:
with open(self.state_file, 'w') as f:
data = {path: asdict(state) for path, state in self.monitored_files.items()}
json.dump(data, f, indent=2)
except Exception as e:
self.logger.error(f"Failed to save state file: {e}")
def _scan_for_mkv_files(self) -> Dict[str, Path]:
"""Scan source directory for .mkv files in all subdirectories."""
mkv_files = {}
try:
for mkv_file in self.source_dir.rglob("*.mkv"):
if mkv_file.is_file():
mkv_files[str(mkv_file)] = mkv_file
except Exception as e:
self.logger.error(f"Error scanning source directory: {e}")
return mkv_files
def _get_file_info(self, file_path: Path) -> Optional[tuple]:
"""Get file size and modification time, return None if file doesn't exist or can't be accessed."""
try:
stat = file_path.stat()
return stat.st_size, stat.st_mtime
except (OSError, FileNotFoundError) as e:
self.logger.warning(f"Cannot access file {file_path}: {e}")
return None
def _validate_file_pair(self, temp_file_path: Path, temp_file_size: int) -> bool:
"""Validate that a matching file exists in media directory with exact same name and size."""
try:
# Search for matching file in media directory tree
for media_file in self.media_dir.rglob(temp_file_path.name):
if media_file.is_file():
media_file_info = self._get_file_info(media_file)
if media_file_info:
media_size, _ = media_file_info
if media_size == temp_file_size:
self.logger.debug(f"Found matching file: {temp_file_path.name} ({temp_file_size:,} bytes) in temp and media directories")
return True
else:
self.logger.debug(f"Size mismatch for {temp_file_path.name}: temp={temp_file_size:,}, media={media_size:,}")
# No matching file found
self.logger.info(f"No matching file found in media directory for {temp_file_path.name} ({temp_file_size:,} bytes)")
return False
except Exception as e:
self.logger.error(f"Error validating file pair for {temp_file_path.name}: {e}")
return False
def _is_file_complete(self, file_state: FileState, current_time: float) -> bool:
"""Check if file is complete based on size stability."""
stale_time = current_time - file_state.last_size_change
return stale_time >= self.completion_wait_seconds
def _should_copy_file(self, source_path: Path, dest_path: Path) -> bool:
"""Determine if we should copy the file (always keep smaller version)."""
if not dest_path.exists():
return True
source_size = source_path.stat().st_size
dest_size = dest_path.stat().st_size
if source_size < dest_size:
self.logger.info(f"Source file {source_path.name} ({source_size:,} bytes) is smaller than existing destination ({dest_size:,} bytes), will replace")
return True
else:
self.logger.info(f"Source file {source_path.name} ({source_size:,} bytes) is not smaller than existing destination ({dest_size:,} bytes), skipping")
return False
def _copy_file_with_retry(self, source_path: Path, dest_path: Path) -> bool:
"""Copy file with retry logic and cleanup on failure."""
temp_dest = dest_path.with_suffix(dest_path.suffix + '.tmp')
for attempt in range(2): # Try twice
try:
start_time = time.time()
self.logger.info(f"Attempt {attempt + 1}: Copying {source_path.name} ({source_path.stat().st_size:,} bytes)")
# Copy to temporary file first
shutil.copy2(source_path, temp_dest)
# Verify copy completed successfully
if temp_dest.stat().st_size != source_path.stat().st_size:
raise Exception("Copy verification failed: size mismatch")
# Move temp file to final destination
if dest_path.exists():
dest_path.unlink() # Remove existing file
temp_dest.rename(dest_path)
copy_time = time.time() - start_time
final_size = dest_path.stat().st_size
self.logger.info(f"Successfully copied {source_path.name} ({final_size:,} bytes) in {copy_time:.2f}s")
return True
except Exception as e:
self.logger.error(f"Copy attempt {attempt + 1} failed for {source_path.name}: {e}")
# Cleanup temporary file if it exists
if temp_dest.exists():
try:
temp_dest.unlink()
except Exception as cleanup_error:
self.logger.error(f"Failed to cleanup temp file {temp_dest}: {cleanup_error}")
if attempt == 1: # Last attempt failed
self.logger.error(f"All copy attempts failed for {source_path.name}, giving up")
return False
else:
time.sleep(5) # Wait before retry
return False
def run_check(self) -> None:
"""Run a single monitoring check cycle."""
current_time = time.time()
self.logger.info("Starting monitoring check cycle")
# Scan for current .mkv files
current_files = self._scan_for_mkv_files()
self.logger.info(f"Found {len(current_files)} .mkv files in source directory")
# Remove files from monitoring that no longer exist
missing_files = set(self.monitored_files.keys()) - set(current_files.keys())
for missing_file in missing_files:
self.logger.info(f"File no longer exists, removing from monitoring: {Path(missing_file).name}")
del self.monitored_files[missing_file]
# Process each current file
files_to_copy = []
for file_path_str, file_path in current_files.items():
file_info = self._get_file_info(file_path)
if not file_info:
continue
current_size, current_mtime = file_info
# Update or create file state
if file_path_str in self.monitored_files:
file_state = self.monitored_files[file_path_str]
file_state.check_count += 1
# Check if size changed
if current_size != file_state.size:
file_state.size = current_size
file_state.last_size_change = current_time
self.logger.debug(f"Size changed for {file_path.name}: {current_size:,} bytes")
file_state.last_modified = current_mtime
else:
# New file discovered - validate before tracking
if not self._validate_file_pair(file_path, current_size):
# File doesn't have a matching pair in media directory, skip tracking
continue
file_state = FileState(
path=file_path_str,
size=current_size,
last_modified=current_mtime,
first_seen=current_time,
last_size_change=current_time,
check_count=1
)
self.monitored_files[file_path_str] = file_state
self.logger.info(f"Started monitoring validated file: {file_path.name} ({current_size:,} bytes)")
# Log current state
stale_time = current_time - file_state.last_size_change
self.logger.info(f"Checking {file_path.name}: {current_size:,} bytes, stale for {stale_time:.1f}s (checks: {file_state.check_count})")
# Check if file is complete
if self._is_file_complete(file_state, current_time):
dest_path = self.dest_dir / file_path.name
if self._should_copy_file(file_path, dest_path):
files_to_copy.append((file_path, dest_path, file_state))
# Copy completed files
for source_path, dest_path, file_state in files_to_copy:
self.logger.info(f"File appears complete: {source_path.name} (stable for {current_time - file_state.last_size_change:.1f}s)")
if self._copy_file_with_retry(source_path, dest_path):
# Remove from monitoring after successful copy
del self.monitored_files[str(source_path)]
self.logger.info(f"Successfully processed and removed from monitoring: {source_path.name}")
else:
self.logger.error(f"Failed to copy {source_path.name}, will continue monitoring")
# Save state
self._save_state()
self.logger.info(f"Check cycle completed, monitoring {len(self.monitored_files)} files")
def main():
"""Main entry point for the script."""
monitor = TdarrFileMonitor()
monitor.run_check()
if __name__ == "__main__":
main()

File diff suppressed because it is too large