---
title: "Monitoring Scripts Context"
description: "Operational context for all monitoring scripts: Proxmox backup checker, CT 302 self-health, Jellyfin GPU health monitor, NVIDIA driver update checker, Tdarr API/file monitors, and Windows reboot detection. Includes cron schedules, Discord integration patterns, and troubleshooting."
type: context
domain: monitoring
tags: [proxmox, backup, jellyfin, gpu, nvidia, tdarr, discord, cron, python, bash, windows, scripts]
---

# Monitoring Scripts - Operational Context

## Script Overview

This directory contains active operational scripts for system monitoring, health checks, alert notifications, and automation across the homelab infrastructure.

## Core Monitoring Scripts

### Proxmox Backup Verification

**Script**: `proxmox-backup-check.sh`
**Purpose**: Weekly check that every running VM/CT has a successful vzdump backup within 7 days. Posts a color-coded Discord embed with per-guest status.

**Key Features**:
- SSHes to Proxmox host and queries `pvesh` task history + guest lists via API
- Categorizes each guest: 🟢 green (backed up), 🟡 yellow (overdue), 🔴 red (no backup)
- Sorts output by VMID; only posts to Discord, with no local side effects
- `--dry-run` mode prints the Discord payload without sending
- `--days N` overrides the default 7-day window

**Schedule**: Weekly on Monday 08:00 UTC (CT 302 cron)
```bash
0 8 * * 1 DISCORD_WEBHOOK="" /root/scripts/proxmox-backup-check.sh >> /var/log/proxmox-backup-check.log 2>&1
```

**Usage**:
```bash
# Dry run (no Discord)
proxmox-backup-check.sh --dry-run

# Post to Discord
DISCORD_WEBHOOK="https://discord.com/api/webhooks/..." proxmox-backup-check.sh

# Custom window
proxmox-backup-check.sh --days 14 --discord-webhook "https://..."
```

**Dependencies**: `jq`, `curl`, SSH access to Proxmox host alias `proxmox`

**Install on CT 302**:
```bash
cp proxmox-backup-check.sh /root/scripts/
chmod +x /root/scripts/proxmox-backup-check.sh
```

### CT 302 Self-Health Monitor

**Script**: `ct302-self-health.sh`
**Purpose**: Monitors disk usage on CT 302 (claude-runner) itself. Alerts to Discord when any filesystem exceeds the threshold (default 80%). Runs silently when healthy, so there is no Discord spam on green.

**Key Features**:
- Checks all non-virtual filesystems (`df`, excludes tmpfs/devtmpfs/overlay)
- Only sends a Discord alert when a filesystem is at or above threshold
- `--always-post` flag forces a post even when healthy (useful for testing)
- `--dry-run` mode prints payload without sending

**Schedule**: Daily at 07:00 UTC (CT 302 cron)
```bash
0 7 * * * DISCORD_WEBHOOK="" /root/scripts/ct302-self-health.sh >> /var/log/ct302-self-health.log 2>&1
```

**Usage**:
```bash
# Check and alert if over 80%
DISCORD_WEBHOOK="https://discord.com/api/webhooks/..." ct302-self-health.sh

# Lower threshold test
ct302-self-health.sh --threshold 50 --dry-run

# Always post (weekly status report pattern)
ct302-self-health.sh --always-post --discord-webhook "https://..."
```

**Dependencies**: `jq`, `curl`, `df`

**Install on CT 302**:
```bash
cp ct302-self-health.sh /root/scripts/
chmod +x /root/scripts/ct302-self-health.sh
```

### Jellyfin GPU Health Monitor

**Script**: `jellyfin_gpu_monitor.py`
**Purpose**: Monitor Jellyfin container GPU access with Discord alerts and auto-restart capability

**Key Features**:
- GPU accessibility monitoring via nvidia-smi in container
- Container status verification
- Discord webhook notifications for GPU issues
- Automatic container restart on GPU access loss (configurable)
- Comprehensive logging with decision tracking

**Schedule**: Every 5 minutes via cron
```bash
*/5 * * * * /usr/bin/python3 /home/cal/scripts/jellyfin_gpu_monitor.py --check --discord-alerts --auto-restart >> /home/cal/logs/jellyfin-gpu-monitor.log 2>&1
```

**Usage**:
```bash
# Health check with Discord alerts
python3 jellyfin_gpu_monitor.py --check --discord-alerts

# With auto-restart on failure
python3 jellyfin_gpu_monitor.py --check --discord-alerts --auto-restart

# Test Discord integration
python3 jellyfin_gpu_monitor.py --discord-test

# JSON output for parsing
python3 jellyfin_gpu_monitor.py --check --output json
```

**Alert Types**:
- 🔴 **GPU Access Lost** - nvidia-smi fails in container, transcoding will fail
- 🟢 **GPU Access Restored** - After successful restart, GPU working again
- ⚠️ **Restart Failed** - Host-level issue (requires host reboot)

**Locations**:
- Script: `/home/cal/scripts/jellyfin_gpu_monitor.py` (on ubuntu-manticore)
- Logs: `/home/cal/logs/jellyfin-gpu-monitor.log`
- Remote execution via SSH from monitoring system

**Limitations**: Container restart cannot fix host-level NVIDIA driver issues. If "Restart failed" with driver/library mismatch, host reboot required.

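The check-and-restart flow described for the Jellyfin GPU monitor can be sketched roughly as follows. This is a minimal illustration and not the contents of `jellyfin_gpu_monitor.py`: the container name `jellyfin`, the function names, the returned status strings, and the assumption that the container runs under Docker are all hypothetical.

```python
import subprocess
from typing import Callable, List

# Hypothetical container name; the real script's configuration is not shown here.
CONTAINER = "jellyfin"


def run_ok(cmd: List[str]) -> bool:
    """Return True if the command exits with status 0 within the timeout."""
    try:
        return subprocess.run(cmd, capture_output=True, timeout=30).returncode == 0
    except (OSError, subprocess.TimeoutExpired):
        return False


def check_gpu(auto_restart: bool, run: Callable[[List[str]], bool] = run_ok) -> str:
    """Return one of: 'healthy', 'gpu_lost', 'restored', 'restart_failed'."""
    # Assumes Docker; probing nvidia-smi inside the container mirrors the
    # "GPU accessibility monitoring via nvidia-smi in container" feature.
    probe = ["docker", "exec", CONTAINER, "nvidia-smi"]
    if run(probe):
        return "healthy"
    if not auto_restart:
        return "gpu_lost"  # alert only, leave the container alone
    # Restart once, then probe again to see whether GPU access came back.
    if run(["docker", "restart", CONTAINER]) and run(probe):
        return "restored"
    return "restart_failed"  # likely a host-level driver issue
```

The three non-healthy return values line up with the alert types listed above. Passing the command runner as a parameter keeps the decision logic testable without a GPU or a container runtime.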
### NVIDIA Driver Update Monitor

**Script**: `nvidia_update_checker.py`
**Purpose**: Weekly monitoring for NVIDIA driver updates on held packages with Discord notifications

**Key Features**:
- Checks for available updates to held NVIDIA packages
- Sends Discord alerts when new driver versions available
- Includes manual update instructions in alert
- JSON and pretty output modes
- Remote execution via SSH

**Schedule**: Weekly (Mondays at 9 AM)
```bash
0 9 * * 1 /usr/bin/python3 /home/cal/scripts/nvidia_update_checker.py --check --discord-alerts >> /home/cal/logs/nvidia-update-checker.log 2>&1
```

**Usage**:
```bash
# Check for updates with Discord alert
python3 nvidia_update_checker.py --check --discord-alerts

# Silent check (for cron)
python3 nvidia_update_checker.py --check

# Test Discord integration
python3 nvidia_update_checker.py --discord-test

# JSON output
python3 nvidia_update_checker.py --check --output json
```

**Alert Content**:
- 🔔 **Update Available** - Lists package versions (current → available)
- ⚠️ **Action Required** - Includes manual update procedure with commands
- Package list with version comparison
- Reminder that packages are held and won't auto-update

**Locations**:
- Script: `/home/cal/scripts/nvidia_update_checker.py` (on ubuntu-manticore)
- Logs: `/home/cal/logs/nvidia-update-checker.log`

**Context**: Part of NVIDIA driver management strategy to prevent surprise auto-updates causing GPU access loss. See `/media-servers/jellyfin-ubuntu-manticore.md` for full driver management documentation.

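One plausible way to build the "current → available" comparison for held packages is to intersect the package names printed by `apt-mark showhold` with the upgradable entries printed by `apt list --upgradable`. The sketch below assumes that approach and that the text of both commands has already been captured; how `nvidia_update_checker.py` actually gathers this data is not documented here, and the function name and regex are hypothetical.

```python
import re
from typing import Dict, Iterable, Tuple

# Assumed line shape from `apt list --upgradable`:
#   <name>/<suite> <new-version> <arch> [upgradable from: <old-version>]
UPGRADABLE_RE = re.compile(
    r"^(?P<name>[^/\s]+)/\S+\s+(?P<new>\S+)\s+\S+\s+"
    r"\[upgradable from: (?P<old>[^\]]+)\]$"
)


def held_updates(apt_list_output: str, held: Iterable[str]) -> Dict[str, Tuple[str, str]]:
    """Map each held package with a pending update to (current, available)."""
    held_set = set(held)
    updates: Dict[str, Tuple[str, str]] = {}
    for line in apt_list_output.splitlines():
        match = UPGRADABLE_RE.match(line.strip())
        # Non-matching lines (such as the "Listing..." header) are skipped.
        if match and match["name"] in held_set:
            updates[match["name"]] = (match["old"], match["new"])
    return updates
```

An empty result means no alert is needed. A non-empty one carries the version pairs for the Discord message.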
### Tdarr API Monitor

**Script**: `tdarr_monitor.py`
**Purpose**: Comprehensive Tdarr server/node monitoring with Discord notifications and dataclass-based status tracking

**Key Features**:
- Server health monitoring (API connectivity, database status)
- Node status tracking (worker count, queue depth, GPU usage)
- Transcode statistics (files processed, queue size, errors)
- Discord notifications for critical issues
- Dataclass-based status representation for type safety
- JSON and pretty output modes

**Usage**:
```bash
# Full health check
python3 tdarr_monitor.py --check

# With Discord alerts
python3 tdarr_monitor.py --check --discord-alerts

# Monitor specific node
python3 tdarr_monitor.py --check --node-id tdarr-node-gpu

# JSON output
python3 tdarr_monitor.py --check --output json
```

**Monitoring Scope**:
- **Server Health**: API availability, response times, database connectivity
- **Node Health**: Worker status, GPU availability, processing capacity
- **Queue Status**: Files waiting, active transcodes, completion rate
- **Error Detection**: Failed transcodes, stuck jobs, node disconnections

**Integration**: Works with Tdarr's gaming-aware scheduler. Monitors both unmapped nodes (local cache) and standard nodes.

**Documentation**: See `/tdarr/CONTEXT.md` and `/tdarr/scripts/CONTEXT.md` for Tdarr infrastructure details.

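To illustrate what a dataclass-based status representation with a JSON output mode might look like, here is a minimal sketch. The class names, field names, and the `healthy` rule are hypothetical; they reflect neither Tdarr's API schema nor the actual structures in `tdarr_monitor.py`.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class NodeStatus:
    # Field names are illustrative, not Tdarr's API schema.
    node_id: str
    online: bool
    active_workers: int = 0


@dataclass
class ServerStatus:
    api_reachable: bool
    queue_size: int = 0
    nodes: List[NodeStatus] = field(default_factory=list)

    @property
    def healthy(self) -> bool:
        """Healthy when the API answers and every known node is online."""
        return self.api_reachable and all(n.online for n in self.nodes)


def to_json(status: ServerStatus) -> str:
    """Serialize the nested dataclasses for a JSON output mode."""
    payload = asdict(status)  # recurses into the list of NodeStatus
    payload["healthy"] = status.healthy  # properties are not included by asdict
    return json.dumps(payload, indent=2)
```

Typed fields let a type checker catch a misspelled attribute before a cron run does, and `asdict` keeps the JSON and pretty output modes backed by the same data.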
### Tdarr File Monitor

**Script**: `tdarr_file_monitor.py`
**Purpose**: Monitors Tdarr cache directory for completed .mkv files and backs them up

**Key Features**:
- Recursive .mkv file detection in Tdarr cache
- Size change monitoring to detect completion
- Configurable completion wait time (default: 60 seconds)
- Automatic backup to manual-backup directory
- Persistent state tracking across runs
- Duplicate handling (keeps smallest version)
- Comprehensive logging

**Schedule**: Via cron wrapper `tdarr-file-monitor-cron.sh`

**Usage**:
```bash
# Run file monitor scan
python3 tdarr_file_monitor.py

# Custom directories
python3 tdarr_file_monitor.py --source /path/to/cache --dest /path/to/backup
```

**Configuration**:
- **Source**: `/mnt/NV2/tdarr-cache/nobara-pc-gpu-unmapped/temp`
- **Media**: `/mnt/NV2/tdarr-cache/nobara-pc-gpu-unmapped/media`
- **Destination**: `/mnt/NV2/tdarr-cache/manual-backup`
- **State File**: `/mnt/NV2/Development/claude-home/logs/tdarr_file_monitor_state.json`
- **Log File**: `/mnt/NV2/Development/claude-home/logs/tdarr_file_monitor.log`

**Completion Detection**:
1. File discovered in cache directory
2. Size tracked over time
3. When size stable for completion_wait_seconds (60s), marked complete
4. File copied to backup location
5. State persisted for next run

**Cron Wrapper**: `tdarr-file-monitor-cron.sh`
```bash
#!/bin/bash
# Wrapper for tdarr_file_monitor.py with logging
cd /mnt/NV2/Development/claude-home/monitoring/scripts/
python3 tdarr_file_monitor.py >> /mnt/NV2/Development/claude-home/logs/tdarr-file-monitor-cron.log 2>&1
```

**Note**: Schedule not currently active. Enable when needed for automatic backup of completed transcodes.

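The size-stability rule in the completion detection steps (steps 2 and 3) can be sketched as a small state update function. This is an assumption-laden illustration, not the logic of `tdarr_file_monitor.py`: the function name, the state layout (`size`, `stable_since`), and the injectable `now` parameter are hypothetical.

```python
import time
from typing import Dict, Optional

COMPLETION_WAIT_SECONDS = 60  # default wait stated in the feature list


def update_and_check(state: Dict[str, dict], path: str, size: int,
                     now: Optional[float] = None,
                     wait: float = COMPLETION_WAIT_SECONDS) -> bool:
    """Record the latest size for path; return True once the size has been
    unchanged for at least `wait` seconds."""
    now = time.time() if now is None else now
    entry = state.get(path)
    if entry is None or entry["size"] != size:
        # New file or size changed: restart the stability clock.
        state[path] = {"size": size, "stable_since": now}
        return False
    return now - entry["stable_since"] >= wait
```

Because `state` is a plain dictionary of strings and numbers, it can be written out with `json.dump` between runs, which is one way the persistent state file could carry the stability clock across cron invocations.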
### Windows Desktop Monitoring

**Directory**: `windows-desktop/`
**Purpose**: Monitor Windows machine reboots and system events with Discord notifications

**Core Script**: `windows-reboot-monitor.ps1`

**Features**:
- System startup monitoring (normal and unexpected)
- Shutdown detection (planned and unplanned)
- Reboot reason analysis (Windows Updates, power outages, user-initiated)
- System uptime and boot statistics tracking
- Discord webhook notifications with color coding
- Event log analysis for root cause determination

**Task Scheduler Integration**:
- **Startup Task**: `windows-reboot-task-startup.xml`
- **Shutdown Task**: `windows-reboot-task-shutdown.xml`

**Notification Types**:
- 🟢 **Normal Startup** - System booted after planned shutdown
- 🔴 **Unexpected Restart** - Recovery from power loss/crash/forced reboot
- 🟡 **Planned Shutdown** - System shutting down gracefully

**Information Captured**:
- Computer name and timestamp
- Boot/shutdown reasons (detailed)
- System uptime duration
- Boot counter for restart frequency tracking
- Event log context

**Use Cases**:
- Power outage detection
- Windows Update monitoring
- Hardware failure alerts
- Remote system availability tracking
- Uptime statistics

**Setup**: See `windows-desktop/windows-setup-instructions.md` for complete installation guide

## Operational Patterns

### Monitoring Schedule

**Active Cron Jobs** (on ubuntu-manticore via cal user):
```bash
# Jellyfin GPU monitoring - Every 5 minutes
*/5 * * * * /usr/bin/python3 /home/cal/scripts/jellyfin_gpu_monitor.py --check --discord-alerts --auto-restart >> /home/cal/logs/jellyfin-gpu-monitor.log 2>&1

# NVIDIA driver update checks - Weekly (Mondays at 9 AM)
0 9 * * 1 /usr/bin/python3 /home/cal/scripts/nvidia_update_checker.py --check --discord-alerts >> /home/cal/logs/nvidia-update-checker.log 2>&1
```

**Active Cron Jobs** (on CT 302 / claude-runner, root user):
```bash
# Proxmox backup verification - Weekly (Mondays at 8 AM UTC)
0 8 * * 1 DISCORD_WEBHOOK="" /root/scripts/proxmox-backup-check.sh >> /var/log/proxmox-backup-check.log 2>&1

# CT 302 self-health disk check - Daily at 7 AM UTC (alerts only when >80%)
0 7 * * * DISCORD_WEBHOOK="" /root/scripts/ct302-self-health.sh >> /var/log/ct302-self-health.log 2>&1
```

**Note**: Scripts must be installed manually on CT 302. Source of truth is `monitoring/scripts/` in this repo; copy to `/root/scripts/` on CT 302 to deploy.

**Manual/On-Demand**:
- `tdarr_monitor.py` - Run as needed for Tdarr health checks
- `tdarr_file_monitor.py` - Can be scheduled if automatic backup needed
- Windows monitoring - Automatic via Task Scheduler on Windows machines

### Discord Integration

**All monitoring scripts use Discord webhooks** for notifications:
- Standardized embed format with color coding
- Timestamp inclusion
- Actionable information in alerts
- Field-based structured data presentation

**Webhook Configuration**:
- Default webhook embedded in scripts
- Can be overridden via command-line arguments
- Test commands available for verification

**Color Coding**:
- 🔴 Red (`0xff6b6b`) - Critical issues, failures
- 🟡 Orange (`0xffa500`) - Warnings, actions needed
- 🟢 Green (`0x28a745`) - Success, recovery, normal operations

### Logging Strategy

**Centralized Log Locations**:
- **ubuntu-manticore**: `/home/cal/logs/` (GPU and driver monitoring)
- **Development repo**: `/mnt/NV2/Development/claude-home/logs/` (file monitor, state files)

**Log Formats**:
- Timestamped entries
- Structured logging with severity levels
- Decision reasoning included
- Error stack traces when applicable

**Log Rotation**:
- Manual cleanup recommended
- Focus on recent activity (last 30-90 days)
- State files maintained separately

**Accessing Logs**:
```bash
# GPU monitor logs
ssh cal@10.10.0.226 "tail -f /home/cal/logs/jellyfin-gpu-monitor.log"

# Driver update logs
ssh cal@10.10.0.226 "tail -f /home/cal/logs/nvidia-update-checker.log"

# File monitor logs
tail -f /mnt/NV2/Development/claude-home/logs/tdarr_file_monitor.log
```

## Troubleshooting Context

### Common Issues

**1. Discord Webhooks Not Working**
```bash
# Test webhook connectivity
python3