From d0dbe86fba4ec065ce58f82fb6ab206690151f3a Mon Sep 17 00:00:00 2001
From: Cal Corum
Date: Sat, 7 Feb 2026 22:21:00 -0600
Subject: [PATCH] Add NVIDIA update checker and monitoring scripts documentation

Add nvidia_update_checker.py for weekly driver update monitoring with
Discord alerts. Add scripts CONTEXT.md and update README.

Co-Authored-By: Claude Opus 4.6
---
 monitoring/scripts/CONTEXT.md               | 463 ++++++++++++++++++++
 monitoring/scripts/README.md                |  14 +-
 monitoring/scripts/nvidia_update_checker.py | 300 +++++++++++++
 3 files changed, 771 insertions(+), 6 deletions(-)
 create mode 100644 monitoring/scripts/CONTEXT.md
 create mode 100644 monitoring/scripts/nvidia_update_checker.py

diff --git a/monitoring/scripts/CONTEXT.md b/monitoring/scripts/CONTEXT.md
new file mode 100644
index 0000000..b838f75
--- /dev/null
+++ b/monitoring/scripts/CONTEXT.md
@@ -0,0 +1,463 @@
+# Monitoring Scripts - Operational Context
+
+## Script Overview
+This directory contains active operational scripts for system monitoring, health checks, alert notifications, and automation across the homelab infrastructure.
+
+## Core Monitoring Scripts
+
+### Jellyfin GPU Health Monitor
+**Script**: `jellyfin_gpu_monitor.py`
+**Purpose**: Monitor Jellyfin container GPU access with Discord alerts and auto-restart capability
+
+**Key Features**:
+- GPU accessibility monitoring via nvidia-smi in container
+- Container status verification
+- Discord webhook notifications for GPU issues
+- Automatic container restart on GPU access loss (configurable)
+- Comprehensive logging with decision tracking
+
+**Schedule**: Every 5 minutes via cron
+```bash
+*/5 * * * * /usr/bin/python3 /home/cal/scripts/jellyfin_gpu_monitor.py --check --discord-alerts --auto-restart >> /home/cal/logs/jellyfin-gpu-monitor.log 2>&1
+```
+
+**Usage**:
+```bash
+# Health check with Discord alerts
+python3 jellyfin_gpu_monitor.py --check --discord-alerts
+
+# With auto-restart on failure
+python3 jellyfin_gpu_monitor.py --check --discord-alerts --auto-restart
+
+# Test Discord integration
+python3 jellyfin_gpu_monitor.py --discord-test
+
+# JSON output for parsing
+python3 jellyfin_gpu_monitor.py --check --output json
+```
+
+**Alert Types**:
+- 🔴 **GPU Access Lost** - nvidia-smi fails in container, transcoding will fail
+- 🟢 **GPU Access Restored** - After successful restart, GPU working again
+- ⚠️ **Restart Failed** - Host-level issue (requires host reboot)
+
+**Locations**:
+- Script: `/home/cal/scripts/jellyfin_gpu_monitor.py` (on ubuntu-manticore)
+- Logs: `/home/cal/logs/jellyfin-gpu-monitor.log`
+- Remote execution via SSH from monitoring system
+
+**Limitations**: Container restart cannot fix host-level NVIDIA driver issues. If "Restart failed" with driver/library mismatch, host reboot required.
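**Illustrative Sketch**: The check-then-restart flow described above could look roughly like the following. This is a minimal sketch and not the contents of `jellyfin_gpu_monitor.py`: the container name `jellyfin`, the function names, and the returned status strings are assumptions made for illustration. Only the `docker exec` and `docker restart` CLI calls are standard.

```python
import subprocess
from typing import Callable, Sequence

# A runner takes a command and returns its exit code. Injecting it keeps the
# decision logic testable without Docker installed.
Runner = Callable[[Sequence[str]], int]


def run_command(cmd: Sequence[str]) -> int:
    """Run a command, discard its output, and return the exit code."""
    try:
        return subprocess.run(cmd, capture_output=True, timeout=60).returncode
    except (OSError, subprocess.TimeoutExpired):
        return 1  # treat a missing binary or a hang as a failure


def check_gpu(container: str = "jellyfin",  # hypothetical container name
              auto_restart: bool = False,
              run: Runner = run_command) -> str:
    """Return one of four hypothetical status strings mirroring the alert types."""
    probe = ["docker", "exec", container, "nvidia-smi"]
    if run(probe) == 0:
        return "ok"
    if not auto_restart:
        return "gpu_access_lost"
    # One restart attempt, then probe again to see whether the GPU came back.
    restarted = run(["docker", "restart", container]) == 0
    if restarted and run(probe) == 0:
        return "gpu_access_restored"
    # Likely a host-level driver problem that a container restart cannot fix.
    return "restart_failed"
```

Passing a fake `run` callable makes it possible to exercise each branch without touching a real container.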
+
+### NVIDIA Driver Update Monitor
+**Script**: `nvidia_update_checker.py`
+**Purpose**: Weekly monitoring for NVIDIA driver updates on held packages with Discord notifications
+
+**Key Features**:
+- Checks for available updates to held NVIDIA packages
+- Sends Discord alerts when new driver versions available
+- Includes manual update instructions in alert
+- JSON and pretty output modes
+- Remote execution via SSH
+
+**Schedule**: Weekly (Mondays at 9 AM)
+```bash
+0 9 * * 1 /usr/bin/python3 /home/cal/scripts/nvidia_update_checker.py --check --discord-alerts >> /home/cal/logs/nvidia-update-checker.log 2>&1
+```
+
+**Usage**:
+```bash
+# Check for updates with Discord alert
+python3 nvidia_update_checker.py --check --discord-alerts
+
+# Silent check (for cron)
+python3 nvidia_update_checker.py --check
+
+# Test Discord integration
+python3 nvidia_update_checker.py --discord-test
+
+# JSON output
+python3 nvidia_update_checker.py --check --output json
+```
+
+**Alert Content**:
+- 🔔 **Update Available** - Lists package versions (current → available)
+- ⚠️ **Action Required** - Includes manual update procedure with commands
+- Package list with version comparison
+- Reminder that packages are held and won't auto-update
+
+**Locations**:
+- Script: `/home/cal/scripts/nvidia_update_checker.py` (on ubuntu-manticore)
+- Logs: `/home/cal/logs/nvidia-update-checker.log`
+
+**Context**: Part of NVIDIA driver management strategy to prevent surprise auto-updates causing GPU access loss. See `/media-servers/jellyfin-ubuntu-manticore.md` for full driver management documentation.
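**Illustrative Sketch**: One plausible way to find updates for held packages is to intersect the output of `apt-mark showhold` with `apt list --upgradable`. Whether `nvidia_update_checker.py` works this way is an assumption. The function name, the regular expression, and the version strings in the usage example are made up for illustration.

```python
import re
from typing import Dict, Iterable, Tuple

# Matches lines such as:
#   pkg/suite 2.0-1 amd64 [upgradable from: 1.0-1]
UPGRADABLE = re.compile(
    r"^(?P<name>[^/\s]+)/\S+\s+(?P<new>\S+)\s+\S+\s+"
    r"\[upgradable from: (?P<old>[^\]]+)\]$"
)


def held_updates(held: Iterable[str],
                 apt_list_output: str) -> Dict[str, Tuple[str, str]]:
    """Map each held package with an update to (current, available).

    `held` is the line-split output of `apt-mark showhold`;
    `apt_list_output` is the raw text of `apt list --upgradable`.
    """
    held_set = {name.strip() for name in held if name.strip()}
    updates: Dict[str, Tuple[str, str]] = {}
    for line in apt_list_output.splitlines():
        match = UPGRADABLE.match(line.strip())
        if match and match.group("name") in held_set:
            updates[match.group("name")] = (match.group("old"), match.group("new"))
    return updates


if __name__ == "__main__":
    # Made-up sample data, not real driver versions.
    sample = (
        "Listing...\n"
        "nvidia-driver-570/noble-updates 570.2-0ubuntu1 amd64 "
        "[upgradable from: 570.1-0ubuntu1]\n"
    )
    for name, (old, new) in held_updates(["nvidia-driver-570"], sample).items():
        print(f"{name}: {old} → {new}")
```

Keeping the parser separate from the command execution means it can be checked against captured output before the cron job relies on it.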
+
+### Tdarr API Monitor
+**Script**: `tdarr_monitor.py`
+**Purpose**: Comprehensive Tdarr server/node monitoring with Discord notifications and dataclass-based status tracking
+
+**Key Features**:
+- Server health monitoring (API connectivity, database status)
+- Node status tracking (worker count, queue depth, GPU usage)
+- Transcode statistics (files processed, queue size, errors)
+- Discord notifications for critical issues
+- Dataclass-based status representation for type safety
+- JSON and pretty output modes
+
+**Usage**:
+```bash
+# Full health check
+python3 tdarr_monitor.py --check
+
+# With Discord alerts
+python3 tdarr_monitor.py --check --discord-alerts
+
+# Monitor specific node
+python3 tdarr_monitor.py --check --node-id tdarr-node-gpu
+
+# JSON output
+python3 tdarr_monitor.py --check --output json
+```
+
+**Monitoring Scope**:
+- **Server Health**: API availability, response times, database connectivity
+- **Node Health**: Worker status, GPU availability, processing capacity
+- **Queue Status**: Files waiting, active transcodes, completion rate
+- **Error Detection**: Failed transcodes, stuck jobs, node disconnections
+
+**Integration**: Works with Tdarr's gaming-aware scheduler. Monitors both unmapped nodes (local cache) and standard nodes.
+
+**Documentation**: See `/tdarr/CONTEXT.md` and `/tdarr/scripts/CONTEXT.md` for Tdarr infrastructure details.
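**Illustrative Sketch**: The dataclass-based status representation and JSON output mode mentioned above could be modeled along these lines. The class names, field names, and the `healthy` rule are all hypothetical and are not taken from `tdarr_monitor.py` or from the Tdarr API.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class NodeStatus:
    """Hypothetical per-node snapshot."""
    node_id: str
    online: bool
    active_workers: int = 0


@dataclass
class ServerStatus:
    """Hypothetical server-level snapshot that aggregates node snapshots."""
    api_reachable: bool
    queue_size: int = 0
    nodes: List[NodeStatus] = field(default_factory=list)

    @property
    def healthy(self) -> bool:
        # Assumed rule: healthy when the API answers and no node is offline.
        return self.api_reachable and all(node.online for node in self.nodes)


def to_json(status: ServerStatus) -> str:
    """Serialize a status snapshot, including the derived health flag."""
    payload = asdict(status)  # recurses into the nested NodeStatus entries
    payload["healthy"] = status.healthy
    return json.dumps(payload, indent=2)
```

Typed fields let a type checker catch a misspelled key that a plain dictionary would silently accept, which is the usual argument for dataclasses here.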
+
+### Tdarr File Monitor
+**Script**: `tdarr_file_monitor.py`
+**Purpose**: Monitors Tdarr cache directory for completed .mkv files and backs them up
+
+**Key Features**:
+- Recursive .mkv file detection in Tdarr cache
+- Size change monitoring to detect completion
+- Configurable completion wait time (default: 60 seconds)
+- Automatic backup to manual-backup directory
+- Persistent state tracking across runs
+- Duplicate handling (keeps smallest version)
+- Comprehensive logging
+
+**Schedule**: Via cron wrapper `tdarr-file-monitor-cron.sh`
+
+**Usage**:
+```bash
+# Run file monitor scan
+python3 tdarr_file_monitor.py
+
+# Custom directories
+python3 tdarr_file_monitor.py --source /path/to/cache --dest /path/to/backup
+```
+
+**Configuration**:
+- **Source**: `/mnt/NV2/tdarr-cache/nobara-pc-gpu-unmapped/temp`
+- **Media**: `/mnt/NV2/tdarr-cache/nobara-pc-gpu-unmapped/media`
+- **Destination**: `/mnt/NV2/tdarr-cache/manual-backup`
+- **State File**: `/mnt/NV2/Development/claude-home/logs/tdarr_file_monitor_state.json`
+- **Log File**: `/mnt/NV2/Development/claude-home/logs/tdarr_file_monitor.log`
+
+**Completion Detection**:
+1. File discovered in cache directory
+2. Size tracked over time
+3. When size stable for completion_wait_seconds (60s), marked complete
+4. File copied to backup location
+5. State persisted for next run
+
+**Cron Wrapper**: `tdarr-file-monitor-cron.sh`
+```bash
+#!/bin/bash
+# Wrapper for tdarr_file_monitor.py with logging
+cd /mnt/NV2/Development/claude-home/monitoring/scripts/
+python3 tdarr_file_monitor.py >> /mnt/NV2/Development/claude-home/logs/tdarr-file-monitor-cron.log 2>&1
+```
+
+**Note**: Schedule not currently active. Enable when needed for automatic backup of completed transcodes.
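**Illustrative Sketch**: The size-stability rule in the Completion Detection steps above can be expressed as a small function over a state dictionary. This is a sketch under assumptions, not the logic in `tdarr_file_monitor.py`: the function name, the state layout (`size`, `stable_since`), and the use of caller-supplied timestamps are hypothetical.

```python
from typing import Dict, TypedDict


class FileState(TypedDict):
    size: int            # last observed size in bytes
    stable_since: float  # timestamp when this size was first observed


def is_complete(state: Dict[str, FileState], path: str, size: int,
                now: float, wait_seconds: float = 60.0) -> bool:
    """Record one observation and report whether the file has stopped growing.

    A file counts as complete once its size has stayed the same for at
    least `wait_seconds`. Any size change restarts the clock.
    """
    previous = state.get(path)
    if previous is None or previous["size"] != size:
        state[path] = {"size": size, "stable_since": now}
        return False
    return now - previous["stable_since"] >= wait_seconds
```

Because `state` holds only strings and numbers, it can be written with `json.dump` at the end of a scan and reloaded on the next one, which fits the persistent state tracking described above.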
+
+### Windows Desktop Monitoring
+**Directory**: `windows-desktop/`
+**Purpose**: Monitor Windows machine reboots and system events with Discord notifications
+
+**Core Script**: `windows-reboot-monitor.ps1`
+**Features**:
+- System startup monitoring (normal and unexpected)
+- Shutdown detection (planned and unplanned)
+- Reboot reason analysis (Windows Updates, power outages, user-initiated)
+- System uptime and boot statistics tracking
+- Discord webhook notifications with color coding
+- Event log analysis for root cause determination
+
+**Task Scheduler Integration**:
+- **Startup Task**: `windows-reboot-task-startup.xml`
+- **Shutdown Task**: `windows-reboot-task-shutdown.xml`
+
+**Notification Types**:
+- 🟢 **Normal Startup** - System booted after planned shutdown
+- 🔴 **Unexpected Restart** - Recovery from power loss/crash/forced reboot
+- 🟡 **Planned Shutdown** - System shutting down gracefully
+
+**Information Captured**:
+- Computer name and timestamp
+- Boot/shutdown reasons (detailed)
+- System uptime duration
+- Boot counter for restart frequency tracking
+- Event log context
+
+**Use Cases**:
+- Power outage detection
+- Windows Update monitoring
+- Hardware failure alerts
+- Remote system availability tracking
+- Uptime statistics
+
+**Setup**: See `windows-desktop/windows-setup-instructions.md` for complete installation guide
+
+## Operational Patterns
+
+### Monitoring Schedule
+
+**Active Cron Jobs** (on ubuntu-manticore via cal user):
+```bash
+# Jellyfin GPU monitoring - Every 5 minutes
+*/5 * * * * /usr/bin/python3 /home/cal/scripts/jellyfin_gpu_monitor.py --check --discord-alerts --auto-restart >> /home/cal/logs/jellyfin-gpu-monitor.log 2>&1
+
+# NVIDIA driver update checks - Weekly (Mondays at 9 AM)
+0 9 * * 1 /usr/bin/python3 /home/cal/scripts/nvidia_update_checker.py --check --discord-alerts >> /home/cal/logs/nvidia-update-checker.log 2>&1
+```
+
+**Manual/On-Demand**:
+- `tdarr_monitor.py` - Run as needed for Tdarr health checks
+- `tdarr_file_monitor.py` - Can be scheduled if automatic backup needed
+- Windows monitoring - Automatic via Task Scheduler on Windows machines
+
+### Discord Integration
+
+**All monitoring scripts use Discord webhooks** for notifications:
+- Standardized embed format with color coding
+- Timestamp inclusion
+- Actionable information in alerts
+- Field-based structured data presentation
+
+**Webhook Configuration**:
+- Default webhook embedded in scripts
+- Can be overridden via command-line arguments
+- Test commands available for verification
+
+**Color Coding**:
+- 🔴 Red (`0xff6b6b`) - Critical issues, failures
+- 🟡 Orange (`0xffa500`) - Warnings, actions needed
+- 🟢 Green (`0x28a745`) - Success, recovery, normal operations
+
+### Logging Strategy
+
+**Centralized Log Locations**:
+- **ubuntu-manticore**: `/home/cal/logs/` (GPU and driver monitoring)
+- **Development repo**: `/mnt/NV2/Development/claude-home/logs/` (file monitor, state files)
+
+**Log Formats**:
+- Timestamped entries
+- Structured logging with severity levels
+- Decision reasoning included
+- Error stack traces when applicable
+
+**Log Rotation**:
+- Manual cleanup recommended
+- Focus on recent activity (last 30-90 days)
+- State files maintained separately
+
+**Accessing Logs**:
+```bash
+# GPU monitor logs
+ssh cal@10.10.0.226 "tail -f /home/cal/logs/jellyfin-gpu-monitor.log"
+
+# Driver update logs
+ssh cal@10.10.0.226 "tail -f /home/cal/logs/nvidia-update-checker.log"
+
+# File monitor logs
+tail -f /mnt/NV2/Development/claude-home/logs/tdarr_file_monitor.log
+```
+
+## Troubleshooting Context
+
+### Common Issues
+
+**1. Discord Webhooks Not Working**
+```bash
+# Test webhook connectivity
+python3