# Monitoring Scripts - Operational Context

## Script Overview

This directory contains active operational scripts for system monitoring, health checks, alert notifications, and automation across the homelab infrastructure.

## Core Monitoring Scripts

### Jellyfin GPU Health Monitor

**Script**: `jellyfin_gpu_monitor.py`
**Purpose**: Monitor Jellyfin container GPU access, with Discord alerts and auto-restart capability

**Key Features**:
- GPU accessibility monitoring via nvidia-smi in the container
- Container status verification
- Discord webhook notifications for GPU issues
- Automatic container restart on GPU access loss (configurable)
- Comprehensive logging with decision tracking

**Schedule**: Every 5 minutes via cron
```bash
*/5 * * * * /usr/bin/python3 /home/cal/scripts/jellyfin_gpu_monitor.py --check --discord-alerts --auto-restart >> /home/cal/logs/jellyfin-gpu-monitor.log 2>&1
```

**Usage**:
```bash
# Health check with Discord alerts
python3 jellyfin_gpu_monitor.py --check --discord-alerts

# With auto-restart on failure
python3 jellyfin_gpu_monitor.py --check --discord-alerts --auto-restart

# Test Discord integration
python3 jellyfin_gpu_monitor.py --discord-test

# JSON output for parsing
python3 jellyfin_gpu_monitor.py --check --output json
```

**Alert Types**:
- 🔴 **GPU Access Lost** - nvidia-smi fails in the container; transcoding will fail
- 🟢 **GPU Access Restored** - GPU working again after a successful restart
- ⚠️ **Restart Failed** - Host-level issue (requires host reboot)

**Locations**:
- Script: `/home/cal/scripts/jellyfin_gpu_monitor.py` (on ubuntu-manticore)
- Logs: `/home/cal/logs/jellyfin-gpu-monitor.log`
- Remote execution via SSH from the monitoring system

**Limitations**: A container restart cannot fix host-level NVIDIA driver issues. If the restart fails with a driver/library mismatch, a host reboot is required.
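The check-and-restart flow described above can be sketched roughly as follows. This is a minimal illustration, not the actual script: the container name `jellyfin`, a Docker runtime, and the function names are all assumptions.

```python
import subprocess

def check_gpu_access(container: str = "jellyfin") -> bool:
    """Return True if nvidia-smi succeeds inside the container.

    NOTE: container name and Docker runtime are assumptions for this sketch.
    """
    try:
        result = subprocess.run(
            ["docker", "exec", container, "nvidia-smi"],
            capture_output=True, text=True, timeout=30,
        )
        return result.returncode == 0
    except (subprocess.TimeoutExpired, FileNotFoundError):
        return False

def decide_action(gpu_ok: bool, auto_restart: bool) -> str:
    """Decide what the monitor should do after a check:
    healthy -> log only; failed -> restart (if --auto-restart) or alert only."""
    if gpu_ok:
        return "ok"
    return "restart" if auto_restart else "alert-only"
```

The decision step is kept separate from the subprocess call so it can be logged and tested independently, matching the "decision tracking" behavior noted above.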
### NVIDIA Driver Update Monitor

**Script**: `nvidia_update_checker.py`
**Purpose**: Weekly monitoring for NVIDIA driver updates on held packages, with Discord notifications

**Key Features**:
- Checks for available updates to held NVIDIA packages
- Sends Discord alerts when new driver versions are available
- Includes manual update instructions in the alert
- JSON and pretty output modes
- Remote execution via SSH

**Schedule**: Weekly (Mondays at 9 AM)
```bash
0 9 * * 1 /usr/bin/python3 /home/cal/scripts/nvidia_update_checker.py --check --discord-alerts >> /home/cal/logs/nvidia-update-checker.log 2>&1
```

**Usage**:
```bash
# Check for updates with Discord alert
python3 nvidia_update_checker.py --check --discord-alerts

# Silent check (for cron)
python3 nvidia_update_checker.py --check

# Test Discord integration
python3 nvidia_update_checker.py --discord-test

# JSON output
python3 nvidia_update_checker.py --check --output json
```

**Alert Content**:
- 🔔 **Update Available** - Lists package versions (current → available)
- ⚠️ **Action Required** - Includes manual update procedure with commands
- Package list with version comparison
- Reminder that packages are held and won't auto-update

**Locations**:
- Script: `/home/cal/scripts/nvidia_update_checker.py` (on ubuntu-manticore)
- Logs: `/home/cal/logs/nvidia-update-checker.log`

**Context**: Part of the NVIDIA driver management strategy to prevent surprise auto-updates from causing GPU access loss. See `/media-servers/jellyfin-ubuntu-manticore.md` for full driver management documentation.
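One way to detect updates for held packages is to parse `apt list --upgradable` output and intersect it with the held set. The sketch below assumes that approach and apt's usual `package/suite version arch [upgradable from: old]` line format; the package name shown is only an example and the real script may query apt differently.

```python
def find_held_updates(upgradable_output: str, held_packages: set) -> dict:
    """Parse `apt list --upgradable` output; return {package: candidate_version}
    for packages that are in the held set.

    Expected line shape (an assumption of this sketch):
        nvidia-driver-535/jammy 535.183.01 amd64 [upgradable from: 535.171.04]
    """
    updates = {}
    for line in upgradable_output.splitlines():
        if "/" not in line:
            continue  # skip headers like "Listing..."
        name = line.split("/", 1)[0].strip()
        parts = line.split()
        if name in held_packages and len(parts) >= 2:
            updates[name] = parts[1]  # candidate version field
    return updates
```

A held set could be built from `apt-mark showhold`; the alert then lists each `current → available` pair, as described above.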
### Tdarr API Monitor

**Script**: `tdarr_monitor.py`
**Purpose**: Comprehensive Tdarr server/node monitoring with Discord notifications and dataclass-based status tracking

**Key Features**:
- Server health monitoring (API connectivity, database status)
- Node status tracking (worker count, queue depth, GPU usage)
- Transcode statistics (files processed, queue size, errors)
- Discord notifications for critical issues
- Dataclass-based status representation for type safety
- JSON and pretty output modes

**Usage**:
```bash
# Full health check
python3 tdarr_monitor.py --check

# With Discord alerts
python3 tdarr_monitor.py --check --discord-alerts

# Monitor specific node
python3 tdarr_monitor.py --check --node-id tdarr-node-gpu

# JSON output
python3 tdarr_monitor.py --check --output json
```

**Monitoring Scope**:
- **Server Health**: API availability, response times, database connectivity
- **Node Health**: Worker status, GPU availability, processing capacity
- **Queue Status**: Files waiting, active transcodes, completion rate
- **Error Detection**: Failed transcodes, stuck jobs, node disconnections

**Integration**: Works with Tdarr's gaming-aware scheduler. Monitors both unmapped nodes (local cache) and standard nodes.

**Documentation**: See `/tdarr/CONTEXT.md` and `/tdarr/scripts/CONTEXT.md` for Tdarr infrastructure details.
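The "dataclass-based status representation" mentioned above might look something like the following. The field names and the `summarize` helper are illustrative assumptions, not the script's actual types.

```python
from dataclasses import dataclass, field

@dataclass
class NodeStatus:
    """Illustrative status record for one Tdarr node (field names assumed)."""
    node_id: str
    online: bool
    workers_active: int
    queue_depth: int
    errors: list = field(default_factory=list)

    @property
    def healthy(self) -> bool:
        # A node is healthy when it is reachable and has no recorded errors.
        return self.online and not self.errors

def summarize(nodes: list) -> dict:
    """Roll node statuses up into the kind of summary a Discord alert needs."""
    return {
        "nodes_total": len(nodes),
        "nodes_healthy": sum(1 for n in nodes if n.healthy),
        "queue_total": sum(n.queue_depth for n in nodes),
    }
```

Dataclasses make each status field explicit and type-annotated, which is the "type safety" benefit the feature list refers to, and they serialize cleanly to JSON for the `--output json` mode.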
### Tdarr File Monitor

**Script**: `tdarr_file_monitor.py`
**Purpose**: Monitors the Tdarr cache directory for completed .mkv files and backs them up

**Key Features**:
- Recursive .mkv file detection in Tdarr cache
- Size change monitoring to detect completion
- Configurable completion wait time (default: 60 seconds)
- Automatic backup to manual-backup directory
- Persistent state tracking across runs
- Duplicate handling (keeps smallest version)
- Comprehensive logging

**Schedule**: Via cron wrapper `tdarr-file-monitor-cron.sh`

**Usage**:
```bash
# Run file monitor scan
python3 tdarr_file_monitor.py

# Custom directories
python3 tdarr_file_monitor.py --source /path/to/cache --dest /path/to/backup
```

**Configuration**:
- **Source**: `/mnt/NV2/tdarr-cache/nobara-pc-gpu-unmapped/temp`
- **Media**: `/mnt/NV2/tdarr-cache/nobara-pc-gpu-unmapped/media`
- **Destination**: `/mnt/NV2/tdarr-cache/manual-backup`
- **State File**: `/mnt/NV2/Development/claude-home/logs/tdarr_file_monitor_state.json`
- **Log File**: `/mnt/NV2/Development/claude-home/logs/tdarr_file_monitor.log`

**Completion Detection**:
1. File discovered in cache directory
2. Size tracked over time
3. When size stable for completion_wait_seconds (60s), marked complete
4. File copied to backup location
5. State persisted for next run

**Cron Wrapper**: `tdarr-file-monitor-cron.sh`
```bash
#!/bin/bash
# Wrapper for tdarr_file_monitor.py with logging
cd /mnt/NV2/Development/claude-home/monitoring/scripts/
python3 tdarr_file_monitor.py >> /mnt/NV2/Development/claude-home/logs/tdarr-file-monitor-cron.log 2>&1
```

**Note**: Schedule not currently active. Enable when needed for automatic backup of completed transcodes.
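The size-stability check in steps 2-3 of the completion-detection flow can be sketched as a pure function over the persisted state, which makes the logic easy to test. The state layout and function name here are assumptions; the real script's state file may be shaped differently.

```python
def update_tracking(state: dict, path: str, size: int, now: float,
                    wait_seconds: int = 60) -> bool:
    """Track a file's size across scans; return True once the size has been
    stable for wait_seconds (the file is then considered complete).

    state maps path -> {"size": last seen size, "stable_since": timestamp}.
    """
    entry = state.get(path)
    if entry is None or entry["size"] != size:
        # New file, or still growing: (re)start the stability timer.
        state[path] = {"size": size, "stable_since": now}
        return False
    return (now - entry["stable_since"]) >= wait_seconds
```

Because each scan only compares against the persisted `state` dict, the monitor can run from cron without any long-lived process: a file growing between runs resets the timer, and a file unchanged for the full wait window is flagged for backup.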
### Windows Desktop Monitoring

**Directory**: `windows-desktop/`
**Purpose**: Monitor Windows machine reboots and system events with Discord notifications

**Core Script**: `windows-reboot-monitor.ps1`

**Features**:
- System startup monitoring (normal and unexpected)
- Shutdown detection (planned and unplanned)
- Reboot reason analysis (Windows Updates, power outages, user-initiated)
- System uptime and boot statistics tracking
- Discord webhook notifications with color coding
- Event log analysis for root cause determination

**Task Scheduler Integration**:
- **Startup Task**: `windows-reboot-task-startup.xml`
- **Shutdown Task**: `windows-reboot-task-shutdown.xml`

**Notification Types**:
- 🟢 **Normal Startup** - System booted after planned shutdown
- 🔴 **Unexpected Restart** - Recovery from power loss/crash/forced reboot
- 🟡 **Planned Shutdown** - System shutting down gracefully

**Information Captured**:
- Computer name and timestamp
- Boot/shutdown reasons (detailed)
- System uptime duration
- Boot counter for restart frequency tracking
- Event log context

**Use Cases**:
- Power outage detection
- Windows Update monitoring
- Hardware failure alerts
- Remote system availability tracking
- Uptime statistics

**Setup**: See `windows-desktop/windows-setup-instructions.md` for complete installation guide

## Operational Patterns

### Monitoring Schedule

**Active Cron Jobs** (on ubuntu-manticore via cal user):
```bash
# Jellyfin GPU monitoring - Every 5 minutes
*/5 * * * * /usr/bin/python3 /home/cal/scripts/jellyfin_gpu_monitor.py --check --discord-alerts --auto-restart >> /home/cal/logs/jellyfin-gpu-monitor.log 2>&1

# NVIDIA driver update checks - Weekly (Mondays at 9 AM)
0 9 * * 1 /usr/bin/python3 /home/cal/scripts/nvidia_update_checker.py --check --discord-alerts >> /home/cal/logs/nvidia-update-checker.log 2>&1
```

**Manual/On-Demand**:
- `tdarr_monitor.py` - Run as needed for Tdarr health checks
- `tdarr_file_monitor.py` - Can be scheduled if automatic backup needed
- Windows monitoring - Automatic via Task Scheduler on Windows machines

### Discord Integration

**All monitoring scripts use Discord webhooks** for notifications:
- Standardized embed format with color coding
- Timestamp inclusion
- Actionable information in alerts
- Field-based structured data presentation

**Webhook Configuration**:
- Default webhook embedded in scripts
- Can be overridden via command-line arguments
- Test commands available for verification

**Color Coding**:
- 🔴 Red (`0xff6b6b`) - Critical issues, failures
- 🟡 Orange (`0xffa500`) - Warnings, actions needed
- 🟢 Green (`0x28a745`) - Success, recovery, normal operations

### Logging Strategy

**Centralized Log Locations**:
- **ubuntu-manticore**: `/home/cal/logs/` (GPU and driver monitoring)
- **Development repo**: `/mnt/NV2/Development/claude-home/logs/` (file monitor, state files)

**Log Formats**:
- Timestamped entries
- Structured logging with severity levels
- Decision reasoning included
- Error stack traces when applicable

**Log Rotation**:
- Manual cleanup recommended
- Focus on recent activity (last 30-90 days)
- State files maintained separately

**Accessing Logs**:
```bash
# GPU monitor logs
ssh cal@10.10.0.226 "tail -f /home/cal/logs/jellyfin-gpu-monitor.log"

# Driver update logs
ssh cal@10.10.0.226 "tail -f /home/cal/logs/nvidia-update-checker.log"

# File monitor logs
tail -f /mnt/NV2/Development/claude-home/logs/tdarr_file_monitor.log
```

## Troubleshooting Context

### Common Issues

**1. Discord Webhooks Not Working**
```bash
# Test webhook connectivity
python3
```
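The standardized embed format and color coding used by the scripts can be sketched as below. This is a minimal illustration using only the stdlib; the function names are assumptions, and the webhook URL must be supplied (it is deliberately not shown here).

```python
import json
import urllib.request

# Color scheme documented above: red = critical, orange = warning, green = success.
COLORS = {"critical": 0xFF6B6B, "warning": 0xFFA500, "success": 0x28A745}

def build_embed(title, description, severity, fields=None):
    """Build a Discord webhook payload with one color-coded embed."""
    return {
        "embeds": [{
            "title": title,
            "description": description,
            "color": COLORS[severity],
            "fields": [{"name": k, "value": v, "inline": True}
                       for k, v in (fields or {}).items()],
        }]
    }

def post_webhook(url, payload):
    """POST the payload to a Discord webhook URL; returns the HTTP status."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status
```

For example, `build_embed("GPU Access Lost", "nvidia-smi failed in container", "critical", {"Host": "ubuntu-manticore"})` produces a red embed with a structured field, matching the format the alerts use.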