# Monitoring Scripts - Operational Context

## Script Overview
This directory contains active operational scripts for system monitoring, health checks, alert notifications, and automation across the homelab infrastructure.

## Core Monitoring Scripts

### Jellyfin GPU Health Monitor
**Script**: `jellyfin_gpu_monitor.py`
**Purpose**: Monitor Jellyfin container GPU access with Discord alerts and auto-restart capability

**Key Features**:
- GPU accessibility monitoring via nvidia-smi in container
- Container status verification
- Discord webhook notifications for GPU issues
- Automatic container restart on GPU access loss (configurable)
- Comprehensive logging with decision tracking

**Schedule**: Every 5 minutes via cron
```bash
*/5 * * * * /usr/bin/python3 /home/cal/scripts/jellyfin_gpu_monitor.py --check --discord-alerts --auto-restart >> /home/cal/logs/jellyfin-gpu-monitor.log 2>&1
```

**Usage**:
```bash
# Health check with Discord alerts
python3 jellyfin_gpu_monitor.py --check --discord-alerts

# With auto-restart on failure
python3 jellyfin_gpu_monitor.py --check --discord-alerts --auto-restart

# Test Discord integration
python3 jellyfin_gpu_monitor.py --discord-test

# JSON output for parsing
python3 jellyfin_gpu_monitor.py --check --output json
```

**Alert Types**:
- 🔴 **GPU Access Lost** - nvidia-smi fails in container, transcoding will fail
- 🟢 **GPU Access Restored** - After successful restart, GPU working again
- ⚠️ **Restart Failed** - Host-level issue (requires host reboot)

**Locations**:
- Script: `/home/cal/scripts/jellyfin_gpu_monitor.py` (on ubuntu-manticore)
- Logs: `/home/cal/logs/jellyfin-gpu-monitor.log`
- Remote execution via SSH from monitoring system

**Limitations**: A container restart cannot fix host-level NVIDIA driver issues. If a "Restart Failed" alert coincides with a driver/library mismatch on the host, a host reboot is required.

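The check-then-restart decision described for the Jellyfin GPU monitor can be sketched roughly as follows. This is a minimal illustration, not the actual contents of `jellyfin_gpu_monitor.py`: the container name `jellyfin`, the `check_gpu` function, and its return strings are all assumptions made for the example. The command runner is injectable so the logic can be exercised without Docker.

```python
import subprocess
from typing import Callable, Sequence

# Hypothetical container name; the real script's configuration is not shown here.
CONTAINER = "jellyfin"

Runner = Callable[[Sequence[str]], int]


def run(cmd: Sequence[str]) -> int:
    """Run a command and return its exit code (1 if it cannot run or times out)."""
    try:
        return subprocess.run(cmd, capture_output=True, timeout=30).returncode
    except (OSError, subprocess.TimeoutExpired):
        return 1


def check_gpu(runner: Runner = run, auto_restart: bool = False) -> str:
    """Return 'ok', 'gpu_lost', 'restored', or 'restart_failed'."""
    probe = ["docker", "exec", CONTAINER, "nvidia-smi"]
    if runner(probe) == 0:
        return "ok"
    if not auto_restart:
        return "gpu_lost"
    if runner(["docker", "restart", CONTAINER]) != 0:
        return "restart_failed"
    # Probe again: a restart cannot fix a host-level driver mismatch.
    return "restored" if runner(probe) == 0 else "restart_failed"
```

In this sketch the three non-`ok` outcomes line up with the three alert types listed above, and `restart_failed` covers both a failed `docker restart` and a GPU that is still unreachable afterwards.
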
### NVIDIA Driver Update Monitor
**Script**: `nvidia_update_checker.py`
**Purpose**: Weekly monitoring for NVIDIA driver updates on held packages with Discord notifications

**Key Features**:
- Checks for available updates to held NVIDIA packages
- Sends Discord alerts when new driver versions are available
- Includes manual update instructions in alert
- JSON and pretty output modes
- Remote execution via SSH

**Schedule**: Weekly (Mondays at 9 AM)
```bash
0 9 * * 1 /usr/bin/python3 /home/cal/scripts/nvidia_update_checker.py --check --discord-alerts >> /home/cal/logs/nvidia-update-checker.log 2>&1
```

**Usage**:
```bash
# Check for updates with Discord alert
python3 nvidia_update_checker.py --check --discord-alerts

# Check without sending a Discord alert
python3 nvidia_update_checker.py --check

# Test Discord integration
python3 nvidia_update_checker.py --discord-test

# JSON output
python3 nvidia_update_checker.py --check --output json
```

**Alert Content**:
- 🔔 **Update Available** - Lists package versions (current → available)
- ⚠️ **Action Required** - Includes manual update procedure with commands
- Package list with version comparison
- Reminder that packages are held and won't auto-update

**Locations**:
- Script: `/home/cal/scripts/nvidia_update_checker.py` (on ubuntu-manticore)
- Logs: `/home/cal/logs/nvidia-update-checker.log`

**Context**: Part of the NVIDIA driver management strategy to prevent surprise auto-updates from causing GPU access loss. See `/media-servers/jellyfin-ubuntu-manticore.md` for full driver management documentation.

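One way the "current → available" comparison for held packages could be produced is by intersecting the held-package list with the upgradable list. The sketch below assumes the inputs come from `apt-mark showhold` and `apt list --upgradable`; how `nvidia_update_checker.py` actually gathers this data is not documented here, and the function name `held_nvidia_updates` is hypothetical.

```python
import re
from typing import Dict, Iterable, Tuple

# Matches one line of `apt list --upgradable` output, which has the shape:
#   <name>/<suite> <new-version> <arch> [upgradable from: <old-version>]
UPGRADABLE = re.compile(
    r"^(?P<name>[^/\s]+)/\S+\s+(?P<new>\S+)\s+\S+\s+"
    r"\[upgradable from: (?P<old>[^\]]+)\]"
)


def held_nvidia_updates(held: Iterable[str],
                        apt_list_output: str) -> Dict[str, Tuple[str, str]]:
    """Map each held NVIDIA package that has an update to (current, available)."""
    held_set = {p.strip() for p in held if "nvidia" in p}
    updates: Dict[str, Tuple[str, str]] = {}
    for line in apt_list_output.splitlines():
        match = UPGRADABLE.match(line.strip())
        if match and match.group("name") in held_set:
            updates[match.group("name")] = (match.group("old"), match.group("new"))
    return updates
```

An empty result would mean no alert is needed; a non-empty one supplies the package list and version pairs for the alert body.
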
### Tdarr API Monitor
**Script**: `tdarr_monitor.py`
**Purpose**: Comprehensive Tdarr server/node monitoring with Discord notifications and dataclass-based status tracking

**Key Features**:
- Server health monitoring (API connectivity, database status)
- Node status tracking (worker count, queue depth, GPU usage)
- Transcode statistics (files processed, queue size, errors)
- Discord notifications for critical issues
- Dataclass-based status representation for type safety
- JSON and pretty output modes

**Usage**:
```bash
# Full health check
python3 tdarr_monitor.py --check

# With Discord alerts
python3 tdarr_monitor.py --check --discord-alerts

# Monitor specific node
python3 tdarr_monitor.py --check --node-id tdarr-node-gpu

# JSON output
python3 tdarr_monitor.py --check --output json
```

**Monitoring Scope**:
- **Server Health**: API availability, response times, database connectivity
- **Node Health**: Worker status, GPU availability, processing capacity
- **Queue Status**: Files waiting, active transcodes, completion rate
- **Error Detection**: Failed transcodes, stuck jobs, node disconnections

**Integration**: Works with Tdarr's gaming-aware scheduler. Monitors both unmapped nodes (local cache) and standard nodes.

**Documentation**: See `/tdarr/CONTEXT.md` and `/tdarr/scripts/CONTEXT.md` for Tdarr infrastructure details.

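To illustrate what "dataclass-based status representation" with JSON output can look like, here is a small sketch. The class names `NodeStatus` and `ServerStatus`, their fields, and the `healthy` rule are hypothetical and chosen only to mirror the monitoring scope above; they are not taken from `tdarr_monitor.py`.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class NodeStatus:
    node_id: str
    online: bool
    workers: int = 0
    gpu_available: bool = False


@dataclass
class ServerStatus:
    api_reachable: bool
    queue_size: int = 0
    failed_transcodes: int = 0
    nodes: List[NodeStatus] = field(default_factory=list)

    @property
    def healthy(self) -> bool:
        # Illustrative rule: API must respond and every known node must be online.
        return self.api_reachable and all(n.online for n in self.nodes)


def to_json(status: ServerStatus) -> str:
    """Serialize the status tree, adding the derived 'healthy' flag."""
    payload = asdict(status)  # recurses into the nested NodeStatus entries
    payload["healthy"] = status.healthy
    return json.dumps(payload, indent=2)
```

Typed fields make a misspelled or missing attribute fail early, and `asdict` gives a JSON output mode without any hand-written serialization.
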
### Tdarr File Monitor
**Script**: `tdarr_file_monitor.py`
**Purpose**: Monitors Tdarr cache directory for completed .mkv files and backs them up

**Key Features**:
- Recursive .mkv file detection in Tdarr cache
- Size change monitoring to detect completion
- Configurable completion wait time (default: 60 seconds)
- Automatic backup to manual-backup directory
- Persistent state tracking across runs
- Duplicate handling (keeps smallest version)
- Comprehensive logging

**Schedule**: Via cron wrapper `tdarr-file-monitor-cron.sh` (not currently active, see note below)

**Usage**:
```bash
# Run file monitor scan
python3 tdarr_file_monitor.py

# Custom directories
python3 tdarr_file_monitor.py --source /path/to/cache --dest /path/to/backup
```

**Configuration**:
- **Source**: `/mnt/NV2/tdarr-cache/nobara-pc-gpu-unmapped/temp`
- **Media**: `/mnt/NV2/tdarr-cache/nobara-pc-gpu-unmapped/media`
- **Destination**: `/mnt/NV2/tdarr-cache/manual-backup`
- **State File**: `/mnt/NV2/Development/claude-home/logs/tdarr_file_monitor_state.json`
- **Log File**: `/mnt/NV2/Development/claude-home/logs/tdarr_file_monitor.log`

**Completion Detection**:
1. File discovered in cache directory
2. Size tracked over time
3. When the size has been stable for `completion_wait_seconds` (60s), the file is marked complete
4. File copied to backup location
5. State persisted for next run

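The size-stability check in steps 2 and 3 above can be sketched as follows. This is an illustration under assumptions: the `is_complete` function and the shape of the state dictionary are hypothetical, and only the 60-second default comes from this document. The state is a plain dictionary so it could be persisted as JSON between runs.

```python
import time
from typing import Dict, Optional

COMPLETION_WAIT_SECONDS = 60  # default wait described above

# Maps file path -> {"size": last seen size, "stable_since": timestamp}
State = Dict[str, Dict[str, float]]


def is_complete(state: State, path: str, size: int,
                now: Optional[float] = None,
                wait: float = COMPLETION_WAIT_SECONDS) -> bool:
    """Record the observed size and report whether it has been stable long enough."""
    now = time.time() if now is None else now
    entry = state.get(path)
    if entry is None or entry["size"] != size:
        # New file or size changed: restart the stability clock.
        state[path] = {"size": size, "stable_since": now}
        return False
    return now - entry["stable_since"] >= wait
```

Because the clock restarts on every size change, a file that is still being written is never reported complete, at the cost of needing at least two scans spaced `wait` seconds apart before a backup happens.
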
**Cron Wrapper**: `tdarr-file-monitor-cron.sh`
```bash
#!/bin/bash
# Wrapper for tdarr_file_monitor.py with logging
# Abort if the scripts directory is missing instead of running from the wrong place
cd /mnt/NV2/Development/claude-home/monitoring/scripts/ || exit 1
python3 tdarr_file_monitor.py >> /mnt/NV2/Development/claude-home/logs/tdarr-file-monitor-cron.log 2>&1
```

**Note**: Schedule not currently active. Enable when needed for automatic backup of completed transcodes.

### Windows Desktop Monitoring
**Directory**: `windows-desktop/`
**Purpose**: Monitor Windows machine reboots and system events with Discord notifications

**Core Script**: `windows-reboot-monitor.ps1`
**Features**:
- System startup monitoring (normal and unexpected)
- Shutdown detection (planned and unplanned)
- Reboot reason analysis (Windows Updates, power outages, user-initiated)
- System uptime and boot statistics tracking
- Discord webhook notifications with color coding
- Event log analysis for root cause determination

**Task Scheduler Integration**:
- **Startup Task**: `windows-reboot-task-startup.xml`
- **Shutdown Task**: `windows-reboot-task-shutdown.xml`

**Notification Types**:
- 🟢 **Normal Startup** - System booted after planned shutdown
- 🔴 **Unexpected Restart** - Recovery from power loss/crash/forced reboot
- 🟡 **Planned Shutdown** - System shutting down gracefully

**Information Captured**:
- Computer name and timestamp
- Boot/shutdown reasons (detailed)
- System uptime duration
- Boot counter for restart frequency tracking
- Event log context

**Use Cases**:
- Power outage detection
- Windows Update monitoring
- Hardware failure alerts
- Remote system availability tracking
- Uptime statistics

**Setup**: See `windows-desktop/windows-setup-instructions.md` for complete installation guide

## Operational Patterns

### Monitoring Schedule

**Active Cron Jobs** (on ubuntu-manticore via cal user):
```bash
# Jellyfin GPU monitoring - Every 5 minutes
*/5 * * * * /usr/bin/python3 /home/cal/scripts/jellyfin_gpu_monitor.py --check --discord-alerts --auto-restart >> /home/cal/logs/jellyfin-gpu-monitor.log 2>&1

# NVIDIA driver update checks - Weekly (Mondays at 9 AM)
0 9 * * 1 /usr/bin/python3 /home/cal/scripts/nvidia_update_checker.py --check --discord-alerts >> /home/cal/logs/nvidia-update-checker.log 2>&1
```

**Manual/On-Demand**:
- `tdarr_monitor.py` - Run as needed for Tdarr health checks
- `tdarr_file_monitor.py` - Can be scheduled if automatic backup needed
- Windows monitoring - Automatic via Task Scheduler on Windows machines

### Discord Integration

**All monitoring scripts use Discord webhooks** for notifications:
- Standardized embed format with color coding
- Timestamp inclusion
- Actionable information in alerts
- Field-based structured data presentation

**Webhook Configuration**:
- Default webhook embedded in scripts
- Can be overridden via command-line arguments
- Test commands available for verification

**Color Coding**:
- 🔴 Red (`0xff6b6b`) - Critical issues, failures
- 🟡 Orange (`0xffa500`) - Warnings, actions needed
- 🟢 Green (`0x28a745`) - Success, recovery, normal operations

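As a rough sketch of the standardized embed format, the function below builds a Discord webhook payload with one color-coded embed. The hex colors are the ones listed above; the `build_embed` helper, the severity keys, and the choice of inline fields are assumptions for illustration and may differ from what the scripts actually send.

```python
from datetime import datetime, timezone
from typing import Dict, Optional

# Colors from the list above; the severity names are hypothetical.
COLORS = {"critical": 0xFF6B6B, "warning": 0xFFA500, "ok": 0x28A745}


def build_embed(title: str, description: str, severity: str,
                fields: Optional[Dict[str, str]] = None) -> dict:
    """Build a webhook JSON payload containing a single color-coded embed."""
    embed = {
        "title": title,
        "description": description,
        "color": COLORS[severity],
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "fields": [
            {"name": name, "value": value, "inline": True}
            for name, value in (fields or {}).items()
        ],
    }
    return {"embeds": [embed]}
```

The resulting dictionary can be sent with `requests.post(webhook_url, json=payload, timeout=10)`, where `webhook_url` is whichever URL the script is configured with.
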
### Logging Strategy

**Centralized Log Locations**:
- **ubuntu-manticore**: `/home/cal/logs/` (GPU and driver monitoring)
- **Development repo**: `/mnt/NV2/Development/claude-home/logs/` (file monitor, state files)

**Log Formats**:
- Timestamped entries
- Structured logging with severity levels
- Decision reasoning included
- Error stack traces when applicable

**Log Rotation**:
- Manual cleanup recommended
- Focus on recent activity (last 30-90 days)
- State files maintained separately

**Accessing Logs**:
```bash
# GPU monitor logs
ssh cal@10.10.0.226 "tail -f /home/cal/logs/jellyfin-gpu-monitor.log"

# Driver update logs
ssh cal@10.10.0.226 "tail -f /home/cal/logs/nvidia-update-checker.log"

# File monitor logs
tail -f /mnt/NV2/Development/claude-home/logs/tdarr_file_monitor.log
```

## Troubleshooting Context

### Common Issues

**1. Discord Webhooks Not Working**
```bash
# Test webhook connectivity
python3 <script>.py --discord-test

# Check network connectivity by posting a minimal JSON message
# (a POST with no body is rejected by Discord even when the URL is valid)
curl -H "Content-Type: application/json" -d '{"content": "webhook test"}' "<webhook_url>"

# Verify webhook URL is correct and active
```

**2. Cron Jobs Not Running**
```bash
# Verify cron service
ssh cal@10.10.0.226 "systemctl status cron"

# Check crontab
ssh cal@10.10.0.226 "crontab -l"

# Check cron logs
ssh cal@10.10.0.226 "grep CRON /var/log/syslog | tail -20"

# Test script manually
ssh cal@10.10.0.226 "/usr/bin/python3 /home/cal/scripts/jellyfin_gpu_monitor.py --check"
```

**3. GPU Monitor Reports "Restart Failed"**
```bash
# This indicates host-level GPU issue, not container issue
# Check host GPU status
ssh cal@10.10.0.226 "nvidia-smi"

# If driver/library mismatch, reboot host
ssh cal@10.10.0.226 "sudo reboot"
```

**4. File Monitor Not Detecting Files**
```bash
# Check source directory accessibility
ls -la /mnt/NV2/tdarr-cache/nobara-pc-gpu-unmapped/temp/

# Verify state file
cat /mnt/NV2/Development/claude-home/logs/tdarr_file_monitor_state.json

# Check permissions
stat /mnt/NV2/tdarr-cache/manual-backup/
```

**5. Windows Monitor Not Sending Alerts**
```powershell
# Check Task Scheduler tasks
Get-ScheduledTask | Where-Object {$_.TaskName -like "*reboot*"}

# Test script manually
powershell.exe -ExecutionPolicy Bypass -File windows-reboot-monitor.ps1 -EventType Startup

# Check Windows Event Logs
Get-EventLog -LogName System -Newest 20
```

### Diagnostic Commands

```bash
# Test all monitoring scripts
python3 jellyfin_gpu_monitor.py --discord-test
python3 nvidia_update_checker.py --discord-test
python3 tdarr_monitor.py --check

# Check monitoring script locations
ssh cal@10.10.0.226 "ls -la /home/cal/scripts/"

# Verify log file creation
ssh cal@10.10.0.226 "ls -lah /home/cal/logs/"

# Check script dependencies
python3 -c "import requests; print('requests OK')"
python3 -c "import dataclasses; print('dataclasses OK')"

# Monitor real-time logs
ssh cal@10.10.0.226 "tail -f /home/cal/logs/*.log"
```

## Integration Points

### External Dependencies
- **Discord Webhooks**: For all notification delivery
- **SSH**: Remote script execution on ubuntu-manticore
- **Python 3**: Runtime for all Python scripts
- **requests library**: HTTP communication for webhooks and APIs
- **cron**: Scheduled task execution
- **nvidia-smi**: GPU status checks
- **Docker**: Container inspection and management
- **Task Scheduler**: Windows automation (for windows-desktop monitoring)

### File System Dependencies
- **Script Locations**:
  - Ubuntu-manticore: `/home/cal/scripts/`
  - Development repo: `/mnt/NV2/Development/claude-home/monitoring/scripts/`
- **Log Directories**:
  - Ubuntu-manticore: `/home/cal/logs/`
  - Development repo: `/mnt/NV2/Development/claude-home/logs/`
- **State Files**: JSON persistence for stateful monitors
- **Tdarr Cache**: `/mnt/NV2/tdarr-cache/` (for file monitor)

### Network Dependencies
- **Discord API**: Webhook endpoints on `discord.com` (HTTPS)
- **Tdarr API**: For tdarr_monitor.py (typically localhost or LAN)
- **SSH Access**: For remote script execution and log access
- **DNS Resolution**: For hostname-based connections

## Security Considerations

### Webhook Security
- Webhook URLs embedded in scripts (consider environment variables)
- Webhooks provide one-way notification (no command execution risk)
- Rate limiting on Discord side prevents abuse
- No sensitive data in notifications (system status only)

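Moving the webhook URL out of the script source, as suggested above, could look something like this. The variable name `DISCORD_WEBHOOK_URL` and the `resolve_webhook` helper are hypothetical; the scripts as documented use an embedded default with a command-line override.

```python
import os
from typing import Mapping, Optional

ENV_VAR = "DISCORD_WEBHOOK_URL"  # hypothetical variable name


def resolve_webhook(cli_value: Optional[str] = None,
                    env: Mapping[str, str] = os.environ) -> str:
    """Prefer the command-line value, then the environment; no hardcoded fallback."""
    url = cli_value or env.get(ENV_VAR, "")
    if not url.startswith("https://"):
        raise ValueError(f"No valid webhook URL; set {ENV_VAR} or pass one explicitly")
    return url
```

Cron runs jobs with a minimal environment, so a variable like this would need to be defined in the crontab itself (as a `NAME=value` line) or exported by a wrapper script before the monitor starts.
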
### Script Execution
- Scripts run as cal user (non-root)
- Cron jobs execute with user permissions
- No password authentication required (key-based SSH)
- Scripts cannot modify system configuration (except via sudo)

### Log Files
- Logs may contain system information
- File permissions: 644 (writable by the owner only, readable by all users)
- No secrets or credentials in logs
- Regular cleanup recommended

## Performance Considerations

### Monitoring Overhead
- **GPU Monitor**: Minimal (docker exec + nvidia-smi)
- **Update Checker**: Low (weekly, apt cache updates)
- **Tdarr Monitor**: Low (API calls only)
- **File Monitor**: Medium (recursive directory scanning)
- **Windows Monitor**: Negligible (event-triggered only)

### Optimization Strategies
- Cron schedules spaced to avoid conflicts
- JSON state files for persistence (avoid redundant work)
- Efficient file scanning (globbing vs full directory walks)
- Short-circuit logic (fail fast on errors)

### Resource Usage
- **CPU**: <1% during normal operation
- **Memory**: Minimal (Python scripts ~10-50 MB each)
- **Disk I/O**: Logs grow slowly (<100 MB/year typical)
- **Network**: Minimal (webhook POSTs only)

## Future Enhancements

**Planned Improvements**:
- Centralized monitoring dashboard
- Grafana/Prometheus integration
- Automatic log rotation
- Status aggregation across all monitors
- Retry logic for failed webhook deliveries
- Enhanced error recovery procedures
- Multi-channel notification support (email, SMS)

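As one possible shape for the planned webhook retry logic, the sketch below retries a delivery with exponential backoff. Nothing here exists in the current scripts: `send_with_retry`, its defaults, and the convention that `send` returns `True` on success are all assumptions for illustration.

```python
import time
from typing import Callable


def send_with_retry(send: Callable[[], bool], attempts: int = 3,
                    base_delay: float = 2.0,
                    sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call send() until it succeeds, doubling the delay after each failure."""
    for attempt in range(attempts):
        if send():
            return True
        if attempt < attempts - 1:
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** attempt)
    return False
```

Keeping the attempt count small matters for the 5-minute GPU monitor, since a long retry loop could overlap with the next cron run.
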
## Related Documentation

- **Technology Overview**: `/monitoring/CONTEXT.md`
- **Troubleshooting**: `/monitoring/troubleshooting.md`
- **Cron Management**: `/monitoring/examples/cron-job-management.md`
- **Tdarr Integration**: `/tdarr/scripts/CONTEXT.md`
- **Jellyfin Setup**: `/media-servers/jellyfin-ubuntu-manticore.md`
- **Main Instructions**: `/CLAUDE.md` - Context loading rules

## Notes

This monitoring infrastructure provides comprehensive visibility into homelab services with minimal overhead. The Discord-based notification system ensures prompt awareness of issues while maintaining simplicity.

Scripts are designed for reliability and ease of troubleshooting, with extensive logging and test modes for validation. The modular approach allows individual monitors to be enabled/disabled independently based on current needs.

GPU monitoring and driver update checking were added specifically to prevent unplanned downtime from NVIDIA driver auto-updates, demonstrating the system's evolution based on operational learnings.