claude-home/monitoring/scripts/CONTEXT.md
Cal Corum d0dbe86fba Add NVIDIA update checker and monitoring scripts documentation
Add nvidia_update_checker.py for weekly driver update monitoring with
Discord alerts. Add scripts CONTEXT.md and update README.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-07 22:21:00 -06:00


# Monitoring Scripts - Operational Context
## Script Overview
This directory contains active operational scripts for system monitoring, health checks, alert notifications, and automation across the homelab infrastructure.
## Core Monitoring Scripts
### Jellyfin GPU Health Monitor
**Script**: `jellyfin_gpu_monitor.py`
**Purpose**: Monitor Jellyfin container GPU access with Discord alerts and auto-restart capability
**Key Features**:
- GPU accessibility monitoring via nvidia-smi in container
- Container status verification
- Discord webhook notifications for GPU issues
- Automatic container restart on GPU access loss (configurable)
- Comprehensive logging with decision tracking
**Schedule**: Every 5 minutes via cron
```bash
*/5 * * * * /usr/bin/python3 /home/cal/scripts/jellyfin_gpu_monitor.py --check --discord-alerts --auto-restart >> /home/cal/logs/jellyfin-gpu-monitor.log 2>&1
```
**Usage**:
```bash
# Health check with Discord alerts
python3 jellyfin_gpu_monitor.py --check --discord-alerts
# With auto-restart on failure
python3 jellyfin_gpu_monitor.py --check --discord-alerts --auto-restart
# Test Discord integration
python3 jellyfin_gpu_monitor.py --discord-test
# JSON output for parsing
python3 jellyfin_gpu_monitor.py --check --output json
```
**Alert Types**:
- 🔴 **GPU Access Lost** - nvidia-smi fails in container, transcoding will fail
- 🟢 **GPU Access Restored** - After successful restart, GPU working again
- ⚠️ **Restart Failed** - Host-level issue (requires host reboot)
**Locations**:
- Script: `/home/cal/scripts/jellyfin_gpu_monitor.py` (on ubuntu-manticore)
- Logs: `/home/cal/logs/jellyfin-gpu-monitor.log`
- Remote execution via SSH from monitoring system
**Limitations**: A container restart cannot fix host-level NVIDIA driver issues. If the alert reports "Restart Failed" because of a driver/library mismatch, the host must be rebooted.
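The check-and-restart decision described in this section can be sketched roughly as follows. This is a minimal illustration, not the contents of `jellyfin_gpu_monitor.py`: the container name `jellyfin`, the function names, and the action labels are all assumptions made for the example.

```python
import subprocess

CONTAINER = "jellyfin"  # assumed container name, not confirmed by this document


def gpu_accessible(container: str = CONTAINER, timeout: int = 30) -> bool:
    """Return True if nvidia-smi exits cleanly inside the container."""
    try:
        result = subprocess.run(
            ["docker", "exec", container, "nvidia-smi"],
            capture_output=True,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        # docker missing, not executable, or the check hung
        return False
    return result.returncode == 0


def decide_action(container_running: bool, gpu_ok: bool, auto_restart: bool) -> str:
    """Map the observed state to one action label (labels are hypothetical)."""
    if not container_running:
        return "alert_container_down"
    if gpu_ok:
        return "ok"
    # GPU lost: restart only when --auto-restart style behavior is enabled
    return "restart" if auto_restart else "alert_gpu_lost"
```

Keeping the decision in a pure function separate from the `docker exec` call makes the logic testable without a GPU or a running container.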
### NVIDIA Driver Update Monitor
**Script**: `nvidia_update_checker.py`
**Purpose**: Weekly monitoring for NVIDIA driver updates on held packages with Discord notifications
**Key Features**:
- Checks for available updates to held NVIDIA packages
- Sends Discord alerts when new driver versions available
- Includes manual update instructions in alert
- JSON and pretty output modes
- Remote execution via SSH
**Schedule**: Weekly (Mondays at 9 AM)
```bash
0 9 * * 1 /usr/bin/python3 /home/cal/scripts/nvidia_update_checker.py --check --discord-alerts >> /home/cal/logs/nvidia-update-checker.log 2>&1
```
**Usage**:
```bash
# Check for updates with Discord alert
python3 nvidia_update_checker.py --check --discord-alerts
# Silent check (for cron)
python3 nvidia_update_checker.py --check
# Test Discord integration
python3 nvidia_update_checker.py --discord-test
# JSON output
python3 nvidia_update_checker.py --check --output json
```
**Alert Content**:
- 🔔 **Update Available** - Lists package versions (current → available)
- ⚠️ **Action Required** - Includes manual update procedure with commands
- Package list with version comparison
- Reminder that packages are held and won't auto-update
**Locations**:
- Script: `/home/cal/scripts/nvidia_update_checker.py` (on ubuntu-manticore)
- Logs: `/home/cal/logs/nvidia-update-checker.log`
**Context**: Part of NVIDIA driver management strategy to prevent surprise auto-updates causing GPU access loss. See `/media-servers/jellyfin-ubuntu-manticore.md` for full driver management documentation.
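One way to detect updates for held packages is to intersect the output of `apt-mark showhold` with the output of `apt list --upgradable`. The sketch below assumes that approach; whether `nvidia_update_checker.py` uses these exact commands is not stated here, and the function name is hypothetical.

```python
import re
from typing import Dict, Tuple

# Matches `apt list --upgradable` lines of the form:
#   <name>/<suite> <new-version> <arch> [upgradable from: <old-version>]
UPGRADABLE_RE = re.compile(
    r"^(?P<name>[^/\s]+)/\S+\s+(?P<new>\S+)\s+\S+\s+"
    r"\[upgradable from: (?P<old>[^\]]+)\]$"
)


def held_updates(showhold_output: str, upgradable_output: str) -> Dict[str, Tuple[str, str]]:
    """Return {package: (current, available)} for held packages with updates."""
    held = {line.strip() for line in showhold_output.splitlines() if line.strip()}
    updates = {}
    for line in upgradable_output.splitlines():
        match = UPGRADABLE_RE.match(line.strip())
        # Non-matching lines such as "Listing... Done" are skipped
        if match and match["name"] in held:
            updates[match["name"]] = (match["old"], match["new"])
    return updates
```

The resulting mapping gives the "current → available" pairs needed for the alert body.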
### Tdarr API Monitor
**Script**: `tdarr_monitor.py`
**Purpose**: Comprehensive Tdarr server/node monitoring with Discord notifications and dataclass-based status tracking
**Key Features**:
- Server health monitoring (API connectivity, database status)
- Node status tracking (worker count, queue depth, GPU usage)
- Transcode statistics (files processed, queue size, errors)
- Discord notifications for critical issues
- Dataclass-based status representation for type safety
- JSON and pretty output modes
**Usage**:
```bash
# Full health check
python3 tdarr_monitor.py --check
# With Discord alerts
python3 tdarr_monitor.py --check --discord-alerts
# Monitor specific node
python3 tdarr_monitor.py --check --node-id tdarr-node-gpu
# JSON output
python3 tdarr_monitor.py --check --output json
```
**Monitoring Scope**:
- **Server Health**: API availability, response times, database connectivity
- **Node Health**: Worker status, GPU availability, processing capacity
- **Queue Status**: Files waiting, active transcodes, completion rate
- **Error Detection**: Failed transcodes, stuck jobs, node disconnections
**Integration**: Works with Tdarr's gaming-aware scheduler. Monitors both unmapped nodes (local cache) and standard nodes.
**Documentation**: See `/tdarr/CONTEXT.md` and `/tdarr/scripts/CONTEXT.md` for Tdarr infrastructure details.
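As an illustration of dataclass-based status tracking with JSON output, a minimal sketch might look like the following. The class and field names are hypothetical and do not reflect the actual structures in `tdarr_monitor.py` or the Tdarr API.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class NodeStatus:
    node_id: str
    online: bool
    active_workers: int = 0


@dataclass
class TdarrStatus:
    server_reachable: bool
    queue_size: int = 0
    nodes: List[NodeStatus] = field(default_factory=list)

    @property
    def healthy(self) -> bool:
        """Healthy when the server answers and every known node is online."""
        return self.server_reachable and all(n.online for n in self.nodes)


def to_json(status: TdarrStatus) -> str:
    """Serialize the nested dataclasses for JSON-style output."""
    payload = asdict(status)  # recursively converts nested dataclasses
    payload["healthy"] = status.healthy
    return json.dumps(payload, indent=2)
```

Typed fields catch misspelled keys at construction time, and `asdict` keeps the JSON output in sync with the status definition.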
### Tdarr File Monitor
**Script**: `tdarr_file_monitor.py`
**Purpose**: Monitors Tdarr cache directory for completed .mkv files and backs them up
**Key Features**:
- Recursive .mkv file detection in Tdarr cache
- Size change monitoring to detect completion
- Configurable completion wait time (default: 60 seconds)
- Automatic backup to manual-backup directory
- Persistent state tracking across runs
- Duplicate handling (keeps smallest version)
- Comprehensive logging
**Schedule**: Via cron wrapper `tdarr-file-monitor-cron.sh` (not currently active; see the note at the end of this section)
**Usage**:
```bash
# Run file monitor scan
python3 tdarr_file_monitor.py
# Custom directories
python3 tdarr_file_monitor.py --source /path/to/cache --dest /path/to/backup
```
**Configuration**:
- **Source**: `/mnt/NV2/tdarr-cache/nobara-pc-gpu-unmapped/temp`
- **Media**: `/mnt/NV2/tdarr-cache/nobara-pc-gpu-unmapped/media`
- **Destination**: `/mnt/NV2/tdarr-cache/manual-backup`
- **State File**: `/mnt/NV2/Development/claude-home/logs/tdarr_file_monitor_state.json`
- **Log File**: `/mnt/NV2/Development/claude-home/logs/tdarr_file_monitor.log`
**Completion Detection**:
1. File discovered in cache directory
2. Size tracked over time
3. When the size has been stable for `completion_wait_seconds` (default 60 s), the file is marked complete
4. File copied to backup location
5. State persisted for next run
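The size-stability check in the steps above can be sketched as a small pure function. This is an illustration under assumptions: the state layout (`size`, `stable_since`) and the function name are hypothetical, not taken from `tdarr_file_monitor.py`.

```python
import time
from typing import Dict, Optional

COMPLETION_WAIT_SECONDS = 60  # default wait described above


def update_and_check(state: Dict[str, dict], path: str, size: int,
                     now: Optional[float] = None,
                     wait: float = COMPLETION_WAIT_SECONDS) -> bool:
    """Record the latest size for `path`; return True once the size has
    been unchanged for at least `wait` seconds."""
    now = time.time() if now is None else now
    entry = state.get(path)
    if entry is None or entry["size"] != size:
        # New file or size changed: (re)start the stability clock
        state[path] = {"size": size, "stable_since": now}
        return False
    return now - entry["stable_since"] >= wait
```

Because `state` is a plain dictionary of JSON-compatible values, it can be written out with `json.dump` between runs, which is what makes the check work across separate cron invocations.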
**Cron Wrapper**: `tdarr-file-monitor-cron.sh`
```bash
#!/bin/bash
# Wrapper for tdarr_file_monitor.py with logging
cd /mnt/NV2/Development/claude-home/monitoring/scripts/ || exit 1
python3 tdarr_file_monitor.py >> /mnt/NV2/Development/claude-home/logs/tdarr-file-monitor-cron.log 2>&1
```
**Note**: Schedule not currently active. Enable when needed for automatic backup of completed transcodes.
### Windows Desktop Monitoring
**Directory**: `windows-desktop/`
**Purpose**: Monitor Windows machine reboots and system events with Discord notifications
**Core Script**: `windows-reboot-monitor.ps1`
**Features**:
- System startup monitoring (normal and unexpected)
- Shutdown detection (planned and unplanned)
- Reboot reason analysis (Windows Updates, power outages, user-initiated)
- System uptime and boot statistics tracking
- Discord webhook notifications with color coding
- Event log analysis for root cause determination
**Task Scheduler Integration**:
- **Startup Task**: `windows-reboot-task-startup.xml`
- **Shutdown Task**: `windows-reboot-task-shutdown.xml`
**Notification Types**:
- 🟢 **Normal Startup** - System booted after planned shutdown
- 🔴 **Unexpected Restart** - Recovery from power loss/crash/forced reboot
- 🟡 **Planned Shutdown** - System shutting down gracefully
**Information Captured**:
- Computer name and timestamp
- Boot/shutdown reasons (detailed)
- System uptime duration
- Boot counter for restart frequency tracking
- Event log context
**Use Cases**:
- Power outage detection
- Windows Update monitoring
- Hardware failure alerts
- Remote system availability tracking
- Uptime statistics
**Setup**: See `windows-desktop/windows-setup-instructions.md` for complete installation guide
## Operational Patterns
### Monitoring Schedule
**Active Cron Jobs** (on ubuntu-manticore, in the `cal` user's crontab):
```bash
# Jellyfin GPU monitoring - Every 5 minutes
*/5 * * * * /usr/bin/python3 /home/cal/scripts/jellyfin_gpu_monitor.py --check --discord-alerts --auto-restart >> /home/cal/logs/jellyfin-gpu-monitor.log 2>&1
# NVIDIA driver update checks - Weekly (Mondays at 9 AM)
0 9 * * 1 /usr/bin/python3 /home/cal/scripts/nvidia_update_checker.py --check --discord-alerts >> /home/cal/logs/nvidia-update-checker.log 2>&1
```
**Manual/On-Demand**:
- `tdarr_monitor.py` - Run as needed for Tdarr health checks
- `tdarr_file_monitor.py` - Can be scheduled if automatic backup needed
- Windows monitoring - Not cron-based; runs automatically via Task Scheduler on the Windows machines
### Discord Integration
**All monitoring scripts use Discord webhooks** for notifications:
- Standardized embed format with color coding
- Timestamp inclusion
- Actionable information in alerts
- Field-based structured data presentation
**Webhook Configuration**:
- Default webhook embedded in scripts
- Can be overridden via command-line arguments
- Test commands available for verification
**Color Coding**:
- 🔴 Red (`0xff6b6b`) - Critical issues, failures
- 🟡 Orange (`0xffa500`) - Warnings, actions needed
- 🟢 Green (`0x28a745`) - Success, recovery, normal operations
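A payload following the embed conventions above could be built as sketched below. The color values come from the list above; the severity keys and the function name are assumptions for illustration.

```python
from datetime import datetime, timezone

# Color values from the list above; the severity keys are hypothetical
COLORS = {"critical": 0xFF6B6B, "warning": 0xFFA500, "ok": 0x28A745}


def build_embed(title: str, description: str, severity: str, fields: dict) -> dict:
    """Build a Discord webhook payload containing one color-coded embed."""
    return {
        "embeds": [{
            "title": title,
            "description": description,
            "color": COLORS[severity],
            # Discord expects an ISO 8601 timestamp
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "fields": [
                {"name": name, "value": str(value), "inline": True}
                for name, value in fields.items()
            ],
        }]
    }
```

The payload can then be sent with `requests.post(webhook_url, json=payload, timeout=10)`, where `webhook_url` is whichever URL the script is configured with.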
### Logging Strategy
**Centralized Log Locations**:
- **ubuntu-manticore**: `/home/cal/logs/` (GPU and driver monitoring)
- **Development repo**: `/mnt/NV2/Development/claude-home/logs/` (file monitor, state files)
**Log Formats**:
- Timestamped entries
- Structured logging with severity levels
- Decision reasoning included
- Error stack traces when applicable
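A log format with timestamps and severity levels can be produced with Python's standard `logging` module, for example as sketched here. The format string and the helper name are assumptions; the actual scripts may configure logging differently.

```python
import logging


def make_logger(name: str, log_path: str) -> logging.Logger:
    """Create a file logger with timestamped, severity-tagged entries."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        handler = logging.FileHandler(log_path)
        handler.setFormatter(
            logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")
        )
        logger.addHandler(handler)
    return logger
```

Calling `logger.exception("...")` inside an `except` block appends the stack trace to the entry, which covers the last item in the list above.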
**Log Rotation**:
- Manual cleanup recommended
- Focus on recent activity (last 30-90 days)
- State files maintained separately
**Accessing Logs**:
```bash
# GPU monitor logs
ssh cal@10.10.0.226 "tail -f /home/cal/logs/jellyfin-gpu-monitor.log"
# Driver update logs
ssh cal@10.10.0.226 "tail -f /home/cal/logs/nvidia-update-checker.log"
# File monitor logs
tail -f /mnt/NV2/Development/claude-home/logs/tdarr_file_monitor.log
```
## Troubleshooting Context
### Common Issues
**1. Discord Webhooks Not Working**
```bash
# Test webhook connectivity
python3 <script>.py --discord-test
# Check network connectivity by posting a minimal message directly
curl -H "Content-Type: application/json" -d '{"content": "webhook test"}' <webhook_url>
# Verify webhook URL is correct and active
```
**2. Cron Jobs Not Running**
```bash
# Verify cron service
ssh cal@10.10.0.226 "systemctl status cron"
# Check crontab
ssh cal@10.10.0.226 "crontab -l"
# Check cron logs
ssh cal@10.10.0.226 "grep CRON /var/log/syslog | tail -20"
# Test script manually
ssh cal@10.10.0.226 "/usr/bin/python3 /home/cal/scripts/jellyfin_gpu_monitor.py --check"
```
**3. GPU Monitor Reports "Restart Failed"**
```bash
# This indicates host-level GPU issue, not container issue
# Check host GPU status
ssh cal@10.10.0.226 "nvidia-smi"
# If driver/library mismatch, reboot host
ssh cal@10.10.0.226 "sudo reboot"
```
**4. File Monitor Not Detecting Files**
```bash
# Check source directory accessibility
ls -la /mnt/NV2/tdarr-cache/nobara-pc-gpu-unmapped/temp/
# Verify state file
cat /mnt/NV2/Development/claude-home/logs/tdarr_file_monitor_state.json
# Check permissions
stat /mnt/NV2/tdarr-cache/manual-backup/
```
**5. Windows Monitor Not Sending Alerts**
```powershell
# Check Task Scheduler tasks
Get-ScheduledTask | Where-Object {$_.TaskName -like "*reboot*"}
# Test script manually
powershell.exe -ExecutionPolicy Bypass -File windows-reboot-monitor.ps1 -EventType Startup
# Check Windows Event Logs
Get-EventLog -LogName System -Newest 20
```
### Diagnostic Commands
```bash
# Test all monitoring scripts
python3 jellyfin_gpu_monitor.py --discord-test
python3 nvidia_update_checker.py --discord-test
python3 tdarr_monitor.py --check
# Check monitoring script locations
ssh cal@10.10.0.226 "ls -la /home/cal/scripts/"
# Verify log file creation
ssh cal@10.10.0.226 "ls -lah /home/cal/logs/"
# Check script dependencies
python3 -c "import requests; print('requests OK')"
python3 -c "import dataclasses; print('dataclasses OK')"
# Monitor real-time logs
ssh cal@10.10.0.226 "tail -f /home/cal/logs/*.log"
```
## Integration Points
### External Dependencies
- **Discord Webhooks**: For all notification delivery
- **SSH**: Remote script execution on ubuntu-manticore
- **Python 3**: Runtime for all Python scripts
- **requests library**: HTTP communication for webhooks and APIs
- **cron**: Scheduled task execution
- **nvidia-smi**: GPU status checks
- **Docker**: Container inspection and management
- **Task Scheduler**: Windows automation (for windows-desktop monitoring)
### File System Dependencies
- **Script Locations**:
  - ubuntu-manticore: `/home/cal/scripts/`
  - Development repo: `/mnt/NV2/Development/claude-home/monitoring/scripts/`
- **Log Directories**:
  - ubuntu-manticore: `/home/cal/logs/`
  - Development repo: `/mnt/NV2/Development/claude-home/logs/`
- **State Files**: JSON persistence for stateful monitors
- **Tdarr Cache**: `/mnt/NV2/tdarr-cache/` (for file monitor)
### Network Dependencies
- **Discord API**: `discord.com` over HTTPS (webhook URLs have the form `https://discord.com/api/webhooks/<id>/<token>`)
- **Tdarr API**: For tdarr_monitor.py (typically localhost or LAN)
- **SSH Access**: For remote script execution and log access
- **DNS Resolution**: For hostname-based connections
## Security Considerations
### Webhook Security
- Webhook URLs are embedded in scripts, so anyone who can read a script can post to the channel (consider environment variables)
- Webhooks provide one-way notification (no command execution risk)
- Discord-side rate limiting limits, but does not eliminate, abuse of a leaked URL
- No sensitive data in notifications (system status only)
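If the webhook URL is moved out of the scripts, the lookup order could be sketched as below: command-line value first, then the environment, then any embedded default. The variable name `DISCORD_WEBHOOK_URL` and the function name are hypothetical.

```python
import os
from typing import Mapping, Optional

DEFAULT_WEBHOOK_URL = ""  # placeholder; left empty here on purpose


def resolve_webhook_url(cli_value: Optional[str] = None,
                        env: Mapping[str, str] = os.environ) -> str:
    """Prefer the CLI argument, then the environment, then the default."""
    # DISCORD_WEBHOOK_URL is an assumed variable name for this sketch
    url = cli_value or env.get("DISCORD_WEBHOOK_URL") or DEFAULT_WEBHOOK_URL
    if not url:
        raise ValueError("no Discord webhook URL configured")
    return url
```

For cron jobs, the variable can be set at the top of the crontab or in the wrapper script, so the URL is kept out of the repository.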
### Script Execution
- Scripts run as cal user (non-root)
- Cron jobs execute with user permissions
- No password authentication required (key-based SSH)
- Scripts cannot modify system configuration (except via sudo)
### Log Files
- Logs may contain system information
- File permissions: 644 (owner read/write; readable by everyone, writable only by the owner)
- No secrets or credentials in logs
- Regular cleanup recommended
## Performance Considerations
### Monitoring Overhead
- **GPU Monitor**: Minimal (docker exec + nvidia-smi)
- **Update Checker**: Low (weekly, apt cache updates)
- **Tdarr Monitor**: Low (API calls only)
- **File Monitor**: Medium (recursive directory scanning)
- **Windows Monitor**: Negligible (event-triggered only)
### Optimization Strategies
- Cron schedules spaced to avoid conflicts
- JSON state files for persistence (avoid redundant work)
- Efficient file scanning (globbing vs full directory walks)
- Short-circuit logic (fail fast on errors)
### Resource Usage
- **CPU**: <1% during normal operation
- **Memory**: Minimal (Python scripts ~10-50 MB each)
- **Disk I/O**: Logs grow slowly (<100 MB/year typical)
- **Network**: Minimal (webhook POSTs only)
## Future Enhancements
**Planned Improvements**:
- Centralized monitoring dashboard
- Grafana/Prometheus integration
- Automatic log rotation
- Status aggregation across all monitors
- Retry logic for failed webhook deliveries
- Enhanced error recovery procedures
- Multi-channel notification support (email, SMS)
## Related Documentation
- **Technology Overview**: `/monitoring/CONTEXT.md`
- **Troubleshooting**: `/monitoring/troubleshooting.md`
- **Cron Management**: `/monitoring/examples/cron-job-management.md`
- **Tdarr Integration**: `/tdarr/scripts/CONTEXT.md`
- **Jellyfin Setup**: `/media-servers/jellyfin-ubuntu-manticore.md`
- **Main Instructions**: `/CLAUDE.md` - Context loading rules
## Notes
This monitoring infrastructure provides comprehensive visibility into homelab services with minimal overhead. The Discord-based notification system ensures prompt awareness of issues while maintaining simplicity.
Scripts are designed for reliability and ease of troubleshooting, with extensive logging and test modes for validation. The modular approach allows individual monitors to be enabled/disabled independently based on current needs.
GPU monitoring and driver update checking were added specifically to prevent unplanned downtime from NVIDIA driver auto-updates, demonstrating the system's evolution based on operational learnings.