CLAUDE: Add comprehensive Tdarr API monitoring with dataclass-based status tracking

- Add tdarr_monitor.py: Python-based API monitoring client with type-safe dataclasses - ServerStatus, QueueStatus, NodeStatus, LibraryStatus, StatisticsStatus, HealthStatus - Support for health checks, queue monitoring, node status, library scans - JSON and pretty-print output formats with proper exit codes - Integration with existing Discord monitoring system - Create scripts/monitoring/README.md: Complete monitoring documentation - Comprehensive usage examples and command-line options - Integration patterns with gaming-aware scheduling - Best practices for automated health monitoring - Update CLAUDE.md: Enhanced Tdarr keyword triggers and documentation structure - Add "monitoring" and "api" keywords to automatically load monitoring docs - Reference new tdarr_monitor.py with dataclass-based status tracking - Update documentation structure to show monitoring script location 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
2025-08-12 12:15:41 -05:00 · 2025-08-12 12:15:41 -05:00 · aed1f54d91
commit aed1f54d91
parent 34702a37fc
3 changed files with 619 additions and 1 deletions
--- a/CLAUDE.md
+++ b/CLAUDE.md
@ -120,11 +120,13 @@ When user mentions specific terms, automatically load relevant docs:
  - Load: `examples/vm-management/`

 **Tdarr Keywords**
- "tdarr", "transcode", "ffmpeg", "gpu transcoding", "nvenc", "forEach error", "gaming detection", "scheduler"
+- "tdarr", "transcode", "ffmpeg", "gpu transcoding", "nvenc", "forEach error", "gaming detection", "scheduler", "monitoring", "api"
  - Load: `reference/docker/tdarr-troubleshooting.md`
  - Load: `patterns/docker/distributed-transcoding.md`
  - Load: `scripts/tdarr/README.md` (for automation and scheduling)
+  - Load: `scripts/monitoring/README.md` (for monitoring and health checks)
  - Note: Gaming-aware scheduling system with configurable time windows available
+  - Note: Comprehensive API monitoring available via `tdarr_monitor.py` with dataclass-based status tracking

 **Windows Monitoring Keywords**
 - "windows reboot", "discord notification", "system monitor", "windows desktop", "power outage", "windows update"
@ -154,6 +156,7 @@ When user mentions specific terms, automatically load relevant docs:
 /scripts/           # Active scripts and utilities for home lab operations
  ├── tdarr/        # Tdarr automation with gaming-aware scheduling
  ├── monitoring/   # System monitoring and alerting
+  │   ├── tdarr_monitor.py  # Comprehensive Tdarr API monitoring with dataclasses
  │   └── windows-desktop/  # Windows reboot monitoring with Discord notifications
  └── <future>/     # Other organized automation subsystems
 ```
--- a/scripts/monitoring/README.md
+++ b/scripts/monitoring/README.md
@ -0,0 +1,117 @@
+# Monitoring Scripts
+
+This directory contains various monitoring scripts and tools for the home lab infrastructure.
+
+## Available Scripts
+
+### Tdarr Monitoring
+
+#### tdarr_monitor.py
+A comprehensive Python-based monitoring tool for Tdarr media transcoding servers. Features dataclass-based return types for improved type safety and IDE support.
+
+**Features:**
+- Server status and health monitoring
+- Queue status and statistics tracking
+- Node connectivity and performance monitoring
+- Library scan progress monitoring  
+- Worker activity tracking
+- Comprehensive health checks
+- JSON and pretty-print output formats
+- Configurable timeouts and logging
+
+**Usage:**
+```bash
+# Basic health check
+./tdarr_monitor.py --server http://10.10.0.43:8265 --check health
+
+# Monitor queue status
+./tdarr_monitor.py --server http://10.10.0.43:8265 --check queue
+
+# Get all status information
+./tdarr_monitor.py --server http://10.10.0.43:8265 --check all --output json
+
+# Monitor nodes with verbose logging
+./tdarr_monitor.py --server http://10.10.0.43:8265 --check nodes --verbose
+```
+
+**Available Checks:**
+- `health` - Comprehensive health check (default)
+- `status` - Server status and configuration
+- `queue` - Transcoding queue statistics
+- `nodes` - Connected nodes status
+- `libraries` - Library scan progress
+- `stats` - Overall transcoding statistics
+- `all` - All checks combined
+
+**Output Formats:**
+- `pretty` - Human-readable format (default)
+- `json` - Structured JSON output
+
+**Exit Codes:**
+- `0` - Success, all systems healthy
+- `1` - Error or unhealthy status detected
+
+**Requirements:**
+- Python 3.7+
+- `requests` library
+- Access to Tdarr server API endpoints
+
+#### tdarr-timeout-monitor.sh
+Shell script for monitoring Tdarr timeouts and system status.
+
+**Usage:**
+```bash
+./tdarr-timeout-monitor.sh
+```
+
+### System Monitoring
+
+#### Windows Desktop Monitoring
+Complete Windows desktop monitoring system with Discord notifications for reboots and system events.
+
+Location: `windows-desktop/`
+- Full setup instructions in `windows-desktop/README.md`
+- PowerShell monitoring scripts
+- Windows Task Scheduler integration
+- Discord webhook notifications
+
+**Features:**
+- Automatic reboot detection
+- System startup/shutdown monitoring  
+- Discord notifications with timestamps
+- Configurable monitoring intervals
+- Windows Task Scheduler integration
+
+### Setup and Configuration
+
+#### Discord Integration
+See `setup-discord-monitoring.md` for Discord webhook setup instructions.
+
+## Integration with Home Lab
+
+### Tdarr Keywords Trigger
+When working with Tdarr-related tasks, the following documentation is automatically loaded:
+- `reference/docker/tdarr-troubleshooting.md`
+- `patterns/docker/distributed-transcoding.md`
+- `scripts/tdarr/README.md`
+
+### Gaming-Aware Scheduling
+The monitoring scripts integrate with the gaming-aware Tdarr scheduling system that provides:
+- Configurable time windows for transcoding
+- Gaming session detection
+- Automated resource management
+- Smart scheduling to avoid performance conflicts
+
+## Best Practices
+
+1. **Regular Monitoring**: Set up cron jobs or scheduled tasks for regular status checks
+2. **Health Checks**: Use the health check endpoints for automated monitoring
+3. **Logging**: Enable verbose logging for troubleshooting
+4. **Timeout Configuration**: Adjust timeouts based on network conditions
+5. **Error Handling**: Monitor exit codes for automated alerting
+
+## Related Documentation
+
+- `/patterns/docker/distributed-transcoding.md` - Tdarr architecture patterns
+- `/reference/docker/tdarr-troubleshooting.md` - Troubleshooting guide
+- `/scripts/tdarr/README.md` - Tdarr management scripts
--- a/scripts/monitoring/tdarr_monitor.py
+++ b/scripts/monitoring/tdarr_monitor.py
@ -0,0 +1,498 @@
+#!/usr/bin/env python3
+"""
+Tdarr API Monitoring Script
+
+Monitors Tdarr server via its web API endpoints:
+- Server status and health
+- Queue status and statistics
+- Node status and performance
+- Library scan progress
+- Worker activity
+
+Usage:
+    python3 tdarr_monitor.py --server http://10.10.0.43:8265 --check all
+    python3 tdarr_monitor.py --server http://10.10.0.43:8265 --check queue
+    python3 tdarr_monitor.py --server http://10.10.0.43:8265 --check nodes
+"""
+
+import argparse
+import json
+import logging
+import sys
+from dataclasses import dataclass, asdict
+from datetime import datetime
+from typing import Dict, List, Optional, Any
+import requests
+from urllib.parse import urljoin
+
+
+@dataclass
+class ServerStatus:
+    timestamp: str
+    server_url: str
+    status: str
+    error: Optional[str] = None
+    version: Optional[str] = None
+    server_id: Optional[str] = None
+    uptime: Optional[str] = None
+    system_info: Optional[Dict[str, Any]] = None
+
+
+@dataclass
+class QueueStats:
+    total_files: int
+    queued: int
+    processing: int
+    completed: int
+    queue_items: List[Dict[str, Any]]
+
+
+@dataclass
+class QueueStatus:
+    timestamp: str
+    queue_stats: Optional[QueueStats] = None
+    error: Optional[str] = None
+
+
+@dataclass
+class NodeInfo:
+    id: Optional[str]
+    nodeName: Optional[str]
+    status: str
+    lastSeen: Optional[int]
+    version: Optional[str]
+    platform: Optional[str]
+    workers: Dict[str, int]
+    processing: List[Dict[str, Any]]
+
+
+@dataclass
+class NodeSummary:
+    total_nodes: int
+    online_nodes: int
+    offline_nodes: int
+    online_details: List[NodeInfo]
+    offline_details: List[NodeInfo]
+
+
+@dataclass
+class NodeStatus:
+    timestamp: str
+    nodes: List[Dict[str, Any]]
+    node_summary: Optional[NodeSummary] = None
+    error: Optional[str] = None
+
+
+@dataclass
+class LibraryInfo:
+    name: Optional[str]
+    path: Optional[str]
+    file_count: int
+    scan_progress: int
+    last_scan: Optional[str]
+    is_scanning: bool
+
+
+@dataclass
+class ScanStatus:
+    total_libraries: int
+    total_files: int
+    scanning_libraries: int
+
+
+@dataclass
+class LibraryStatus:
+    timestamp: str
+    libraries: List[LibraryInfo]
+    scan_status: Optional[ScanStatus] = None
+    error: Optional[str] = None
+
+
+@dataclass
+class Statistics:
+    total_transcodes: int
+    space_saved: int
+    total_files_processed: int
+    failed_transcodes: int
+    processing_speed: int
+    eta: Optional[str]
+
+
+@dataclass
+class StatisticsStatus:
+    timestamp: str
+    statistics: Optional[Statistics] = None
+    error: Optional[str] = None
+
+
+@dataclass
+class HealthCheck:
+    status: str
+    healthy: bool
+    online_count: Optional[int] = None
+    total_count: Optional[int] = None
+    accessible: Optional[bool] = None
+    total_items: Optional[int] = None
+
+
+@dataclass
+class HealthStatus:
+    timestamp: str
+    overall_status: str
+    checks: Dict[str, HealthCheck]
+
+
+class TdarrMonitor:
+    def __init__(self, server_url: str, timeout: int = 30):
+        """Initialize Tdarr monitor with server URL."""
+        self.server_url = server_url.rstrip('/')
+        self.timeout = timeout
+        self.session = requests.Session()
+        
+        # Configure logging
+        logging.basicConfig(
+            level=logging.INFO,
+            format='%(asctime)s - %(levelname)s - %(message)s'
+        )
+        self.logger = logging.getLogger(__name__)
+
+    def _make_request(self, endpoint: str) -> Optional[Dict[str, Any]]:
+        """Make HTTP request to Tdarr API endpoint."""
+        url = urljoin(self.server_url, endpoint)
+        
+        try:
+            response = self.session.get(url, timeout=self.timeout)
+            response.raise_for_status()
+            return response.json()
+            
+        except requests.exceptions.RequestException as e:
+            self.logger.error(f"Request failed for {url}: {e}")
+            return None
+        except json.JSONDecodeError as e:
+            self.logger.error(f"JSON decode failed for {url}: {e}")
+            return None
+
+    def get_server_status(self) -> ServerStatus:
+        """Get overall server status and configuration."""
+        timestamp = datetime.now().isoformat()
+        
+        # Try to get server info from API
+        data = self._make_request('/api/v2/get-server-info')
+        if data:
+            return ServerStatus(
+                timestamp=timestamp,
+                server_url=self.server_url,
+                status='online',
+                version=data.get('version'),
+                server_id=data.get('serverId'),
+                uptime=data.get('uptime'),
+                system_info=data.get('systemInfo', {})
+            )
+        else:
+            return ServerStatus(
+                timestamp=timestamp,
+                server_url=self.server_url,
+                status='offline',
+                error='Unable to connect to Tdarr server'
+            )
+
+    def get_queue_status(self) -> QueueStatus:
+        """Get transcoding queue status and statistics."""
+        timestamp = datetime.now().isoformat()
+        
+        # Get queue information
+        data = self._make_request('/api/v2/get-queue')
+        if data:
+            queue_data = data.get('queue', [])
+            
+            # Calculate queue statistics
+            total_files = len(queue_data)
+            queued_files = len([f for f in queue_data if f.get('status') == 'Queued'])
+            processing_files = len([f for f in queue_data if f.get('status') == 'Processing'])
+            completed_files = len([f for f in queue_data if f.get('status') == 'Completed'])
+            
+            queue_stats = QueueStats(
+                total_files=total_files,
+                queued=queued_files,
+                processing=processing_files,
+                completed=completed_files,
+                queue_items=queue_data[:10]  # First 10 items for details
+            )
+            
+            return QueueStatus(
+                timestamp=timestamp,
+                queue_stats=queue_stats
+            )
+        else:
+            return QueueStatus(
+                timestamp=timestamp,
+                error='Unable to fetch queue data'
+            )
+
+    def get_node_status(self) -> NodeStatus:
+        """Get status of all connected nodes."""
+        timestamp = datetime.now().isoformat()
+        
+        # Get nodes information
+        data = self._make_request('/api/v2/get-nodes')
+        if data:
+            nodes = data.get('nodes', [])
+            
+            # Process node information
+            online_nodes = []
+            offline_nodes = []
+            
+            for node in nodes:
+                node_info = NodeInfo(
+                    id=node.get('_id'),
+                    nodeName=node.get('nodeName'),
+                    status='online' if node.get('lastSeen', 0) > 0 else 'offline',
+                    lastSeen=node.get('lastSeen'),
+                    version=node.get('version'),
+                    platform=node.get('platform'),
+                    workers={
+                        'cpu': node.get('workers', {}).get('CPU', 0),
+                        'gpu': node.get('workers', {}).get('GPU', 0)
+                    },
+                    processing=node.get('currentJobs', [])
+                )
+                
+                if node_info.status == 'online':
+                    online_nodes.append(node_info)
+                else:
+                    offline_nodes.append(node_info)
+            
+            node_summary = NodeSummary(
+                total_nodes=len(nodes),
+                online_nodes=len(online_nodes),
+                offline_nodes=len(offline_nodes),
+                online_details=online_nodes,
+                offline_details=offline_nodes
+            )
+            
+            return NodeStatus(
+                timestamp=timestamp,
+                nodes=nodes,
+                node_summary=node_summary
+            )
+        else:
+            return NodeStatus(
+                timestamp=timestamp,
+                nodes=[],
+                error='Unable to fetch node data'
+            )
+
+    def get_library_status(self) -> LibraryStatus:
+        """Get library scan status and file statistics."""
+        timestamp = datetime.now().isoformat()
+        
+        # Get library information
+        data = self._make_request('/api/v2/get-libraries')
+        if data:
+            libraries = data.get('libraries', [])
+            
+            library_stats = []
+            total_files = 0
+            
+            for lib in libraries:
+                lib_info = LibraryInfo(
+                    name=lib.get('name'),
+                    path=lib.get('path'),
+                    file_count=lib.get('totalFiles', 0),
+                    scan_progress=lib.get('scanProgress', 0),
+                    last_scan=lib.get('lastScan'),
+                    is_scanning=lib.get('isScanning', False)
+                )
+                library_stats.append(lib_info)
+                total_files += lib_info.file_count
+            
+            scan_status = ScanStatus(
+                total_libraries=len(libraries),
+                total_files=total_files,
+                scanning_libraries=len([l for l in library_stats if l.is_scanning])
+            )
+            
+            return LibraryStatus(
+                timestamp=timestamp,
+                libraries=library_stats,
+                scan_status=scan_status
+            )
+        else:
+            return LibraryStatus(
+                timestamp=timestamp,
+                libraries=[],
+                error='Unable to fetch library data'
+            )
+
+    def get_statistics(self) -> StatisticsStatus:
+        """Get overall Tdarr statistics and health metrics."""
+        timestamp = datetime.now().isoformat()
+        
+        # Get statistics
+        data = self._make_request('/api/v2/get-stats')
+        if data:
+            stats = data.get('stats', {})
+            statistics = Statistics(
+                total_transcodes=stats.get('totalTranscodes', 0),
+                space_saved=stats.get('spaceSaved', 0),
+                total_files_processed=stats.get('totalFilesProcessed', 0),
+                failed_transcodes=stats.get('failedTranscodes', 0),
+                processing_speed=stats.get('processingSpeed', 0),
+                eta=stats.get('eta')
+            )
+            
+            return StatisticsStatus(
+                timestamp=timestamp,
+                statistics=statistics
+            )
+        else:
+            return StatisticsStatus(
+                timestamp=timestamp,
+                error='Unable to fetch statistics'
+            )
+
+    def health_check(self) -> HealthStatus:
+        """Perform comprehensive health check."""
+        timestamp = datetime.now().isoformat()
+        
+        # Server connectivity
+        server_status = self.get_server_status()
+        server_check = HealthCheck(
+            status=server_status.status,
+            healthy=server_status.status == 'online'
+        )
+        
+        # Node connectivity  
+        node_status = self.get_node_status()
+        nodes_healthy = (
+            node_status.node_summary.online_nodes > 0 if node_status.node_summary else False
+        ) and not node_status.error
+        
+        nodes_check = HealthCheck(
+            status='online' if nodes_healthy else 'offline',
+            healthy=nodes_healthy,
+            online_count=node_status.node_summary.online_nodes if node_status.node_summary else 0,
+            total_count=node_status.node_summary.total_nodes if node_status.node_summary else 0
+        )
+        
+        # Queue status
+        queue_status = self.get_queue_status()
+        queue_healthy = not queue_status.error
+        queue_check = HealthCheck(
+            status='accessible' if queue_healthy else 'error',
+            healthy=queue_healthy,
+            accessible=queue_healthy,
+            total_items=queue_status.queue_stats.total_files if queue_status.queue_stats else 0
+        )
+        
+        checks = {
+            'server': server_check,
+            'nodes': nodes_check,
+            'queue': queue_check
+        }
+        
+        # Determine overall health
+        all_checks_healthy = all(check.healthy for check in checks.values())
+        overall_status = 'healthy' if all_checks_healthy else 'unhealthy'
+        
+        return HealthStatus(
+            timestamp=timestamp,
+            overall_status=overall_status,
+            checks=checks
+        )
+
+
+def main():
+    parser = argparse.ArgumentParser(description='Monitor Tdarr server via API')
+    parser.add_argument('--server', required=True, help='Tdarr server URL (e.g., http://10.10.0.43:8265)')
+    parser.add_argument('--check', choices=['all', 'status', 'queue', 'nodes', 'libraries', 'stats', 'health'],
+                       default='health', help='Type of check to perform')
+    parser.add_argument('--timeout', type=int, default=30, help='Request timeout in seconds')
+    parser.add_argument('--output', choices=['json', 'pretty'], default='pretty', help='Output format')
+    parser.add_argument('--verbose', action='store_true', help='Enable verbose logging')
+    
+    args = parser.parse_args()
+    
+    if args.verbose:
+        logging.getLogger().setLevel(logging.DEBUG)
+    
+    # Initialize monitor
+    monitor = TdarrMonitor(args.server, args.timeout)
+    
+    # Perform requested check
+    result = None
+    if args.check == 'all':
+        result = {
+            'server_status': monitor.get_server_status(),
+            'queue_status': monitor.get_queue_status(),
+            'node_status': monitor.get_node_status(),
+            'library_status': monitor.get_library_status(),
+            'statistics': monitor.get_statistics()
+        }
+    elif args.check == 'status':
+        result = monitor.get_server_status()
+    elif args.check == 'queue':
+        result = monitor.get_queue_status()
+    elif args.check == 'nodes':
+        result = monitor.get_node_status()
+    elif args.check == 'libraries':
+        result = monitor.get_library_status()
+    elif args.check == 'stats':
+        result = monitor.get_statistics()
+    elif args.check == 'health':
+        result = monitor.health_check()
+    
+    # Output results
+    if args.output == 'json':
+        # Convert dataclasses to dictionaries for JSON serialization
+        if args.check == 'all':
+            json_result = {}
+            for key, value in result.items():
+                json_result[key] = asdict(value)
+            print(json.dumps(json_result, indent=2))
+        else:
+            print(json.dumps(asdict(result), indent=2))
+    else:
+        # Pretty print format
+        print(f"=== Tdarr Monitor Results - {datetime.now().strftime('%Y-%m-%d %H:%M:%S')} ===")
+        
+        if args.check == 'health' or (hasattr(result, 'overall_status') and result.overall_status):
+            health = result if hasattr(result, 'overall_status') else None
+            if health:
+                status = health.overall_status
+                print(f"Overall Status: {status.upper()}")
+                
+                if health.checks:
+                    print("\nHealth Checks:")
+                    for check_name, check_data in health.checks.items():
+                        status_icon = "✓" if check_data.healthy else "✗"
+                        print(f"  {status_icon} {check_name.title()}: {asdict(check_data)}")
+        
+        if args.check == 'all':
+            for section, data in result.items():
+                print(f"\n=== {section.replace('_', ' ').title()} ===")
+                print(json.dumps(asdict(data), indent=2))
+        elif args.check != 'health':
+            print(json.dumps(asdict(result), indent=2))
+    
+    # Exit with appropriate code
+    if result:
+        # Check for unhealthy status in health check
+        if isinstance(result, HealthStatus) and result.overall_status == 'unhealthy':
+            sys.exit(1)
+        # Check for errors in individual status objects (all status classes except HealthStatus have error attribute)
+        elif (isinstance(result, (ServerStatus, QueueStatus, NodeStatus, LibraryStatus, StatisticsStatus)) 
+              and result.error):
+            sys.exit(1)
+        # Check for errors in 'all' results
+        elif isinstance(result, dict):
+            for status_obj in result.values():
+                if (isinstance(status_obj, (ServerStatus, QueueStatus, NodeStatus, LibraryStatus, StatisticsStatus)) 
+                    and status_obj.error):
+                    sys.exit(1)
+    
+    sys.exit(0)
+
+
+if __name__ == '__main__':
+    main()