Add NVIDIA update checker and monitoring scripts documentation
Add nvidia_update_checker.py for weekly driver update monitoring with Discord alerts. Add scripts CONTEXT.md and update README.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
parent 0d552a839e
commit d0dbe86fba
monitoring/scripts/CONTEXT.md
Normal file
@@ -0,0 +1,463 @@
# Monitoring Scripts - Operational Context

## Script Overview
This directory contains active operational scripts for system monitoring, health checks, alert notifications, and automation across the homelab infrastructure.

## Core Monitoring Scripts

### Jellyfin GPU Health Monitor
**Script**: `jellyfin_gpu_monitor.py`
**Purpose**: Monitor Jellyfin container GPU access with Discord alerts and auto-restart capability

**Key Features**:
- GPU accessibility monitoring via nvidia-smi in container
- Container status verification
- Discord webhook notifications for GPU issues
- Automatic container restart on GPU access loss (configurable)
- Comprehensive logging with decision tracking

**Schedule**: Every 5 minutes via cron
```bash
*/5 * * * * /usr/bin/python3 /home/cal/scripts/jellyfin_gpu_monitor.py --check --discord-alerts --auto-restart >> /home/cal/logs/jellyfin-gpu-monitor.log 2>&1
```

**Usage**:
```bash
# Health check with Discord alerts
python3 jellyfin_gpu_monitor.py --check --discord-alerts

# With auto-restart on failure
python3 jellyfin_gpu_monitor.py --check --discord-alerts --auto-restart

# Test Discord integration
python3 jellyfin_gpu_monitor.py --discord-test

# JSON output for parsing
python3 jellyfin_gpu_monitor.py --check --output json
```

**Alert Types**:
- 🔴 **GPU Access Lost** - nvidia-smi fails in container, transcoding will fail
- 🟢 **GPU Access Restored** - After successful restart, GPU working again
- ⚠️ **Restart Failed** - Host-level issue (requires host reboot)

**Locations**:
- Script: `/home/cal/scripts/jellyfin_gpu_monitor.py` (on ubuntu-manticore)
- Logs: `/home/cal/logs/jellyfin-gpu-monitor.log`
- Remote execution via SSH from monitoring system

**Limitations**: A container restart cannot fix host-level NVIDIA driver issues. If the monitor reports "Restart Failed" because of a driver/library mismatch, the host must be rebooted.
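
The check-then-restart flow described in this section can be sketched as follows. This is a minimal illustration, not the actual contents of `jellyfin_gpu_monitor.py`: the container name `jellyfin`, the function names, and the returned status strings are assumptions made for the example. The command runner is injectable so the decision logic can be exercised without Docker.

```python
import subprocess
from typing import Callable, List, Tuple

Runner = Callable[[List[str]], Tuple[int, str]]


def run(cmd: List[str]) -> Tuple[int, str]:
    """Run a command and return (exit code, combined output)."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired) as exc:
        return 1, str(exc)
    return proc.returncode, proc.stdout + proc.stderr


def check_gpu(container: str = "jellyfin", auto_restart: bool = False,
              runner: Runner = run) -> str:
    """Return 'ok', 'gpu_access_lost', 'restored', or 'restart_failed'.

    The container name and status strings are hypothetical.
    """
    code, _ = runner(["docker", "exec", container, "nvidia-smi"])
    if code == 0:
        return "ok"
    if not auto_restart:
        return "gpu_access_lost"
    # One restart attempt, then re-check GPU access inside the container.
    runner(["docker", "restart", container])
    code, _ = runner(["docker", "exec", container, "nvidia-smi"])
    return "restored" if code == 0 else "restart_failed"
```

A `restart_failed` result corresponds to the host-level case above, where only a host reboot helps.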

### NVIDIA Driver Update Monitor
**Script**: `nvidia_update_checker.py`
**Purpose**: Weekly monitoring for NVIDIA driver updates on held packages with Discord notifications

**Key Features**:
- Checks for available updates to held NVIDIA packages
- Sends Discord alerts when new driver versions are available
- Includes manual update instructions in the alert
- JSON and pretty output modes
- Remote execution via SSH

**Schedule**: Weekly (Mondays at 9 AM)
```bash
0 9 * * 1 /usr/bin/python3 /home/cal/scripts/nvidia_update_checker.py --check --discord-alerts >> /home/cal/logs/nvidia-update-checker.log 2>&1
```

**Usage**:
```bash
# Check for updates with Discord alert
python3 nvidia_update_checker.py --check --discord-alerts

# Check without Discord alerts (console/log output only)
python3 nvidia_update_checker.py --check

# Test Discord integration
python3 nvidia_update_checker.py --discord-test

# JSON output
python3 nvidia_update_checker.py --check --output json
```

**Alert Content**:
- 🔔 **Update Available** - Lists package versions (current → available)
- ⚠️ **Action Required** - Includes manual update procedure with commands
- Package list with version comparison
- Reminder that packages are held and won't auto-update

**Locations**:
- Script: `/home/cal/scripts/nvidia_update_checker.py` (on ubuntu-manticore)
- Logs: `/home/cal/logs/nvidia-update-checker.log`

**Context**: Part of the NVIDIA driver management strategy that prevents surprise auto-updates from causing GPU access loss. See `/media-servers/jellyfin-ubuntu-manticore.md` for full driver management documentation.
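
The core of the update check is parsing `apt list --upgradable` output and keeping only held NVIDIA packages. The sketch below follows the same line format that `nvidia_update_checker.py` parses (`package/release version arch [upgradable from: old_version]`); the `Upgrade` tuple, the function names, and the sample release and version strings are made up for illustration.

```python
from typing import Iterable, List, NamedTuple, Optional, Set


class Upgrade(NamedTuple):
    name: str
    current: str
    available: str


def parse_upgradable_line(line: str) -> Optional[Upgrade]:
    """Parse 'pkg/release new_ver arch [upgradable from: old_ver]'."""
    if "/" not in line or "[upgradable from:" not in line:
        return None  # header line such as "Listing..." or unrelated output
    parts = line.split()
    if len(parts) < 6:
        return None
    name = parts[0].split("/")[0]
    return Upgrade(name=name, current=parts[5].rstrip("]"), available=parts[1])


def held_nvidia_upgrades(lines: Iterable[str], held: Set[str]) -> List[Upgrade]:
    """Keep upgrades whose package name contains 'nvidia' and is on hold."""
    parsed = (parse_upgradable_line(line) for line in lines)
    return [u for u in parsed if u and "nvidia" in u.name.lower() and u.name in held]


# Sample data: release name and versions are placeholders, not real values.
sample = [
    "Listing...",
    "nvidia-driver-570/example-release 2.0-1 amd64 [upgradable from: 1.0-1]",
    "curl/example-release 8.1 amd64 [upgradable from: 8.0]",
]
print(held_nvidia_upgrades(sample, held={"nvidia-driver-570"}))
```

The `held` set would come from `apt-mark showhold`, one package name per line.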

### Tdarr API Monitor
**Script**: `tdarr_monitor.py`
**Purpose**: Comprehensive Tdarr server/node monitoring with Discord notifications and dataclass-based status tracking

**Key Features**:
- Server health monitoring (API connectivity, database status)
- Node status tracking (worker count, queue depth, GPU usage)
- Transcode statistics (files processed, queue size, errors)
- Discord notifications for critical issues
- Dataclass-based status representation for type safety
- JSON and pretty output modes

**Usage**:
```bash
# Full health check
python3 tdarr_monitor.py --check

# With Discord alerts
python3 tdarr_monitor.py --check --discord-alerts

# Monitor specific node
python3 tdarr_monitor.py --check --node-id tdarr-node-gpu

# JSON output
python3 tdarr_monitor.py --check --output json
```

**Monitoring Scope**:
- **Server Health**: API availability, response times, database connectivity
- **Node Health**: Worker status, GPU availability, processing capacity
- **Queue Status**: Files waiting, active transcodes, completion rate
- **Error Detection**: Failed transcodes, stuck jobs, node disconnections
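
To illustrate what a dataclass-based status representation with JSON output might look like, here is a small sketch. The class and field names are hypothetical and are not taken from `tdarr_monitor.py`; only the node ID `tdarr-node-gpu` comes from the usage example above.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class NodeStatus:
    node_id: str
    online: bool
    active_workers: int = 0


@dataclass
class TdarrStatus:
    server_reachable: bool
    queue_size: int
    nodes: List[NodeStatus] = field(default_factory=list)

    @property
    def healthy(self) -> bool:
        """Healthy when the server answers and every known node is online."""
        return self.server_reachable and all(n.online for n in self.nodes)


status = TdarrStatus(
    server_reachable=True,
    queue_size=12,
    nodes=[NodeStatus("tdarr-node-gpu", online=True, active_workers=2)],
)
# asdict() recurses into nested dataclasses, so the result is JSON-ready.
print(json.dumps(asdict(status), indent=2))
```

Typed fields catch misspelled keys early, and `asdict` gives the `--output json` mode a single serialization path.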

**Integration**: Works with Tdarr's gaming-aware scheduler. Monitors both unmapped nodes (local cache) and standard nodes.

**Documentation**: See `/tdarr/CONTEXT.md` and `/tdarr/scripts/CONTEXT.md` for Tdarr infrastructure details.

### Tdarr File Monitor
**Script**: `tdarr_file_monitor.py`
**Purpose**: Monitors the Tdarr cache directory for completed .mkv files and backs them up

**Key Features**:
- Recursive .mkv file detection in Tdarr cache
- Size change monitoring to detect completion
- Configurable completion wait time (default: 60 seconds)
- Automatic backup to manual-backup directory
- Persistent state tracking across runs
- Duplicate handling (keeps smallest version)
- Comprehensive logging

**Schedule**: Via cron wrapper `tdarr-file-monitor-cron.sh`

**Usage**:
```bash
# Run file monitor scan
python3 tdarr_file_monitor.py

# Custom directories
python3 tdarr_file_monitor.py --source /path/to/cache --dest /path/to/backup
```

**Configuration**:
- **Source**: `/mnt/NV2/tdarr-cache/nobara-pc-gpu-unmapped/temp`
- **Media**: `/mnt/NV2/tdarr-cache/nobara-pc-gpu-unmapped/media`
- **Destination**: `/mnt/NV2/tdarr-cache/manual-backup`
- **State File**: `/mnt/NV2/Development/claude-home/logs/tdarr_file_monitor_state.json`
- **Log File**: `/mnt/NV2/Development/claude-home/logs/tdarr_file_monitor.log`

**Completion Detection**:
1. File is discovered in the cache directory
2. Its size is tracked over time
3. Once the size has been stable for `completion_wait_seconds` (default 60 s), the file is marked complete
4. The file is copied to the backup location
5. State is persisted for the next run
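
The size-stability rule in steps 2 and 3 can be sketched as below. This is an assumed shape for the state, not the monitor's real state-file schema: each path maps to the last observed size and the time that size was first seen. The function name and dictionary keys are hypothetical.

```python
import time
from typing import Dict, Optional


def update_state(state: Dict[str, dict], path: str, size: int,
                 wait_seconds: float = 60.0,
                 now: Optional[float] = None) -> bool:
    """Record the observed size; return True once it has been stable long enough."""
    now = time.time() if now is None else now
    entry = state.get(path)
    if entry is None or entry["size"] != size:
        # New file, or the size changed: restart the stability clock.
        state[path] = {"size": size, "stable_since": now}
        return False
    return now - entry["stable_since"] >= wait_seconds


state: Dict[str, dict] = {}
update_state(state, "movie.mkv", 100, now=0)              # first sighting
update_state(state, "movie.mkv", 150, now=40)             # still growing
done = update_state(state, "movie.mkv", 150, now=100)     # unchanged for 60 s
print(done)
```

Because the state is a plain dictionary of numbers, it can be written to a JSON file between cron runs, which is what allows stability to be judged across separate invocations.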

**Cron Wrapper**: `tdarr-file-monitor-cron.sh`
```bash
#!/bin/bash
# Wrapper for tdarr_file_monitor.py with logging
cd /mnt/NV2/Development/claude-home/monitoring/scripts/ || exit 1
python3 tdarr_file_monitor.py >> /mnt/NV2/Development/claude-home/logs/tdarr-file-monitor-cron.log 2>&1
```

**Note**: The schedule is not currently active. Enable it when automatic backup of completed transcodes is needed.

### Windows Desktop Monitoring
**Directory**: `windows-desktop/`
**Purpose**: Monitor Windows machine reboots and system events with Discord notifications

**Core Script**: `windows-reboot-monitor.ps1`
**Features**:
- System startup monitoring (normal and unexpected)
- Shutdown detection (planned and unplanned)
- Reboot reason analysis (Windows Updates, power outages, user-initiated)
- System uptime and boot statistics tracking
- Discord webhook notifications with color coding
- Event log analysis for root cause determination

**Task Scheduler Integration**:
- **Startup Task**: `windows-reboot-task-startup.xml`
- **Shutdown Task**: `windows-reboot-task-shutdown.xml`

**Notification Types**:
- 🟢 **Normal Startup** - System booted after planned shutdown
- 🔴 **Unexpected Restart** - Recovery from power loss/crash/forced reboot
- 🟡 **Planned Shutdown** - System shutting down gracefully

**Information Captured**:
- Computer name and timestamp
- Boot/shutdown reasons (detailed)
- System uptime duration
- Boot counter for restart frequency tracking
- Event log context

**Use Cases**:
- Power outage detection
- Windows Update monitoring
- Hardware failure alerts
- Remote system availability tracking
- Uptime statistics

**Setup**: See `windows-desktop/windows-setup-instructions.md` for the complete installation guide

## Operational Patterns

### Monitoring Schedule

**Active Cron Jobs** (on ubuntu-manticore, as user `cal`):
```bash
# Jellyfin GPU monitoring - Every 5 minutes
*/5 * * * * /usr/bin/python3 /home/cal/scripts/jellyfin_gpu_monitor.py --check --discord-alerts --auto-restart >> /home/cal/logs/jellyfin-gpu-monitor.log 2>&1

# NVIDIA driver update checks - Weekly (Mondays at 9 AM)
0 9 * * 1 /usr/bin/python3 /home/cal/scripts/nvidia_update_checker.py --check --discord-alerts >> /home/cal/logs/nvidia-update-checker.log 2>&1
```

**Manual/On-Demand**:
- `tdarr_monitor.py` - Run as needed for Tdarr health checks
- `tdarr_file_monitor.py` - Can be scheduled if automatic backup is needed
- Windows monitoring - Automatic via Task Scheduler on Windows machines

### Discord Integration

**All monitoring scripts use Discord webhooks** for notifications:
- Standardized embed format with color coding
- Timestamp inclusion
- Actionable information in alerts
- Field-based structured data presentation

**Webhook Configuration**:
- Default webhook embedded in scripts
- Can be overridden via command-line arguments
- Test commands available for verification

**Color Coding**:
- 🔴 Red (`0xff6b6b`) - Critical issues, failures
- 🟡 Orange (`0xffa500`) - Warnings, actions needed
- 🟢 Green (`0x28a745`) - Success, recovery, normal operations
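
A shared helper for the standardized embed format could look like the following sketch. The payload keys (`username`, `embeds`, `title`, `description`, `color`, `timestamp`, `fields`) match what `nvidia_update_checker.py` sends; the helper name, the severity labels, and the default username are assumptions for the example.

```python
from datetime import datetime, timezone
from typing import Dict, Optional

# Color values from the table above; the severity names are hypothetical.
COLORS = {"critical": 0xFF6B6B, "warning": 0xFFA500, "ok": 0x28A745}


def build_embed(title: str, description: str, severity: str = "warning",
                fields: Optional[Dict[str, str]] = None,
                username: str = "Homelab Monitor") -> dict:
    """Build a webhook payload containing one color-coded embed."""
    embed = {
        "title": title,
        "description": description,
        "color": COLORS[severity],
        # ISO 8601 timestamp with an explicit UTC offset
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "fields": [
            {"name": name, "value": value, "inline": True}
            for name, value in (fields or {}).items()
        ],
    }
    return {"username": username, "embeds": [embed]}


payload = build_embed("GPU Access Lost", "nvidia-smi failed in container",
                      severity="critical", fields={"Host": "ubuntu-manticore"})
print(payload["embeds"][0]["color"] == 0xFF6B6B)
```

The resulting dictionary can be posted as the JSON body of the webhook request, for example with `requests.post(url, json=payload, timeout=10)`.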

### Logging Strategy

**Centralized Log Locations**:
- **ubuntu-manticore**: `/home/cal/logs/` (GPU and driver monitoring)
- **Development repo**: `/mnt/NV2/Development/claude-home/logs/` (file monitor, state files)

**Log Formats**:
- Timestamped entries
- Structured logging with severity levels
- Decision reasoning included
- Error stack traces when applicable

**Log Rotation**:
- Manual cleanup recommended
- Focus on recent activity (last 30-90 days)
- State files maintained separately

**Accessing Logs**:
```bash
# GPU monitor logs
ssh cal@10.10.0.226 "tail -f /home/cal/logs/jellyfin-gpu-monitor.log"

# Driver update logs
ssh cal@10.10.0.226 "tail -f /home/cal/logs/nvidia-update-checker.log"

# File monitor logs
tail -f /mnt/NV2/Development/claude-home/logs/tdarr_file_monitor.log
```

## Troubleshooting Context

### Common Issues

**1. Discord Webhooks Not Working**
```bash
# Test webhook connectivity
python3 <script>.py --discord-test

# Check network connectivity by posting a minimal test message directly
curl -H "Content-Type: application/json" -d '{"content": "webhook test"}' "<webhook_url>"

# Verify webhook URL is correct and active (a GET returns the webhook's JSON if it exists)
curl "<webhook_url>"
```

**2. Cron Jobs Not Running**
```bash
# Verify cron service
ssh cal@10.10.0.226 "systemctl status cron"

# Check crontab
ssh cal@10.10.0.226 "crontab -l"

# Check cron logs
ssh cal@10.10.0.226 "grep CRON /var/log/syslog | tail -20"

# Test script manually
ssh cal@10.10.0.226 "/usr/bin/python3 /home/cal/scripts/jellyfin_gpu_monitor.py --check"
```

**3. GPU Monitor Reports "Restart Failed"**
```bash
# This indicates host-level GPU issue, not container issue
# Check host GPU status
ssh cal@10.10.0.226 "nvidia-smi"

# If driver/library mismatch, reboot host
ssh cal@10.10.0.226 "sudo reboot"
```

**4. File Monitor Not Detecting Files**
```bash
# Check source directory accessibility
ls -la /mnt/NV2/tdarr-cache/nobara-pc-gpu-unmapped/temp/

# Verify state file
cat /mnt/NV2/Development/claude-home/logs/tdarr_file_monitor_state.json

# Check permissions
stat /mnt/NV2/tdarr-cache/manual-backup/
```

**5. Windows Monitor Not Sending Alerts**
```powershell
# Check Task Scheduler tasks
Get-ScheduledTask | Where-Object {$_.TaskName -like "*reboot*"}

# Test script manually
powershell.exe -ExecutionPolicy Bypass -File windows-reboot-monitor.ps1 -EventType Startup

# Check Windows Event Logs
Get-EventLog -LogName System -Newest 20
```

### Diagnostic Commands

```bash
# Test all monitoring scripts
python3 jellyfin_gpu_monitor.py --discord-test
python3 nvidia_update_checker.py --discord-test
python3 tdarr_monitor.py --check

# Check monitoring script locations
ssh cal@10.10.0.226 "ls -la /home/cal/scripts/"

# Verify log file creation
ssh cal@10.10.0.226 "ls -lah /home/cal/logs/"

# Check script dependencies
python3 -c "import requests; print('requests OK')"
python3 -c "import dataclasses; print('dataclasses OK')"

# Monitor real-time logs
ssh cal@10.10.0.226 "tail -f /home/cal/logs/*.log"
```

## Integration Points

### External Dependencies
- **Discord Webhooks**: For all notification delivery
- **SSH**: Remote script execution on ubuntu-manticore
- **Python 3**: Runtime for all Python scripts
- **requests library**: HTTP communication for webhooks and APIs
- **cron**: Scheduled task execution
- **nvidia-smi**: GPU status checks
- **Docker**: Container inspection and management
- **Task Scheduler**: Windows automation (for windows-desktop monitoring)

### File System Dependencies
- **Script Locations**:
  - Ubuntu-manticore: `/home/cal/scripts/`
  - Development repo: `/mnt/NV2/Development/claude-home/monitoring/scripts/`
- **Log Directories**:
  - Ubuntu-manticore: `/home/cal/logs/`
  - Development repo: `/mnt/NV2/Development/claude-home/logs/`
- **State Files**: JSON persistence for stateful monitors
- **Tdarr Cache**: `/mnt/NV2/tdarr-cache/` (for file monitor)

### Network Dependencies
- **Discord API**: `discord.com` webhook endpoints (HTTPS)
- **Tdarr API**: For tdarr_monitor.py (typically localhost or LAN)
- **SSH Access**: For remote script execution and log access
- **DNS Resolution**: For hostname-based connections

## Security Considerations

### Webhook Security
- Webhook URLs are embedded in scripts, so anyone with repository access can post to the channel (consider environment variables)
- Webhooks provide one-way notification (no command execution risk)
- Rate limiting on Discord's side limits abuse
- No sensitive data in notifications (system status only)
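
One way to act on the environment-variable suggestion is sketched below. The variable name `DISCORD_WEBHOOK_URL` and the helper are hypothetical; the scripts in this directory currently carry the URL as an argparse default instead.

```python
import os
from typing import Optional

WEBHOOK_PREFIX = "https://discord.com/api/webhooks/"


def resolve_webhook_url(cli_value: Optional[str] = None,
                        env_var: str = "DISCORD_WEBHOOK_URL") -> str:
    """Prefer an explicit CLI value, then the environment; never a hardcoded default."""
    url = cli_value or os.environ.get(env_var, "")
    if not url.startswith(WEBHOOK_PREFIX):
        raise ValueError(f"No valid webhook URL: pass one explicitly or set {env_var}")
    return url
```

With this in place, the argparse option could default to `None`, and the crontab entry or a sourced environment file would supply the variable, keeping the token out of version control.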

### Script Execution
- Scripts run as the `cal` user (non-root)
- Cron jobs execute with user permissions
- SSH uses key-based authentication (no password prompts)
- Scripts cannot modify system configuration (except via sudo)

### Log Files
- Logs may contain system information
- File permissions: 644 (world-readable, writable only by the owner)
- No secrets or credentials in logs
- Regular cleanup recommended

## Performance Considerations

### Monitoring Overhead
- **GPU Monitor**: Minimal (docker exec + nvidia-smi)
- **Update Checker**: Low (weekly, apt cache updates)
- **Tdarr Monitor**: Low (API calls only)
- **File Monitor**: Medium (recursive directory scanning)
- **Windows Monitor**: Negligible (event-triggered only)

### Optimization Strategies
- Cron schedules spaced to avoid conflicts
- JSON state files for persistence (avoid redundant work)
- Efficient file scanning (globbing vs full directory walks)
- Short-circuit logic (fail fast on errors)
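
As a rough illustration of pattern-based scanning, the sketch below collects only `.mkv` files under a root directory with `Path.rglob`. The function name is hypothetical and the demo runs against a throwaway directory rather than the real Tdarr cache.

```python
import tempfile
from pathlib import Path
from typing import Dict


def scan_mkv_sizes(root: Path) -> Dict[str, int]:
    """Map each .mkv file under root (recursively) to its size in bytes."""
    return {
        str(p.relative_to(root)): p.stat().st_size
        for p in root.rglob("*.mkv")
        if p.is_file()
    }


# Demo against a temporary directory tree
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "show" / "season1").mkdir(parents=True)
    (root / "show" / "season1" / "ep1.mkv").write_bytes(b"x" * 10)
    (root / "notes.txt").write_text("ignored")
    sizes = scan_mkv_sizes(root)
    print(sizes)
```

The size map produced here is the kind of input a completion check needs on each run; non-matching files are never stat'ed.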

### Resource Usage
- **CPU**: <1% during normal operation
- **Memory**: Minimal (Python scripts ~10-50 MB each)
- **Disk I/O**: Logs grow slowly (<100 MB/year typical)
- **Network**: Minimal (webhook POSTs only)

## Future Enhancements

**Planned Improvements**:
- Centralized monitoring dashboard
- Grafana/Prometheus integration
- Automatic log rotation
- Status aggregation across all monitors
- Retry logic for failed webhook deliveries
- Enhanced error recovery procedures
- Multi-channel notification support (email, SMS)
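
Retry logic for webhook delivery is listed as planned, not implemented. A possible shape for it, assuming exponential backoff and a `send` callable that returns `True` on success (as `DiscordNotifier.send_alert` does), is sketched here; the function name and defaults are hypothetical.

```python
import time
from typing import Callable


def send_with_retry(send: Callable[[], bool], attempts: int = 3,
                    base_delay: float = 1.0,
                    sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call send() until it succeeds or attempts run out, doubling the delay each time."""
    for attempt in range(attempts):
        if send():
            return True
        if attempt < attempts - 1:
            # Wait base_delay, 2*base_delay, 4*base_delay, ... between tries.
            sleep(base_delay * 2 ** attempt)
    return False
```

Usage would be along the lines of `send_with_retry(lambda: notifier.send_alert(title, description))`. The `sleep` parameter exists only so the backoff can be tested without waiting.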

## Related Documentation

- **Technology Overview**: `/monitoring/CONTEXT.md`
- **Troubleshooting**: `/monitoring/troubleshooting.md`
- **Cron Management**: `/monitoring/examples/cron-job-management.md`
- **Tdarr Integration**: `/tdarr/scripts/CONTEXT.md`
- **Jellyfin Setup**: `/media-servers/jellyfin-ubuntu-manticore.md`
- **Main Instructions**: `/CLAUDE.md` - Context loading rules

## Notes

This monitoring infrastructure provides visibility into homelab services with minimal overhead. Discord-based notifications surface issues promptly while keeping the setup simple.

Scripts are designed for reliability and ease of troubleshooting, with extensive logging and test modes for validation. Because the monitors are modular, each one can be enabled or disabled independently as needs change.

GPU monitoring and driver update checking were added specifically to prevent unplanned downtime from NVIDIA driver auto-updates, an example of the system evolving in response to operational experience.
@@ -91,9 +91,9 @@ See `setup-discord-monitoring.md` for Discord webhook setup instructions.
 
 ### Tdarr Keywords Trigger
 When working with Tdarr-related tasks, the following documentation is automatically loaded:
-- `reference/docker/tdarr-troubleshooting.md`
-- `patterns/docker/distributed-transcoding.md`
-- `scripts/tdarr/README.md`
+- `docker/examples/tdarr-troubleshooting.md`
+- `docker/examples/distributed-transcoding.md`
+- `tdarr/scripts/README.md`
 
 ### Gaming-Aware Scheduling
 The monitoring scripts integrate with the gaming-aware Tdarr scheduling system that provides:
@@ -112,6 +112,8 @@ The monitoring scripts integrate with the gaming-aware Tdarr scheduling system t
 
 ## Related Documentation
 
-- `/patterns/docker/distributed-transcoding.md` - Tdarr architecture patterns
-- `/reference/docker/tdarr-troubleshooting.md` - Troubleshooting guide
-- `/scripts/tdarr/README.md` - Tdarr management scripts
+- `/docker/examples/distributed-transcoding.md` - Tdarr architecture patterns
+- `/docker/examples/tdarr-troubleshooting.md` - Troubleshooting guide
+- `/tdarr/scripts/README.md` - Tdarr management scripts
+- `/tdarr/CONTEXT.md` - Tdarr technology overview
+- `/monitoring/CONTEXT.md` - Monitoring overview and patterns
monitoring/scripts/nvidia_update_checker.py
Normal file
@@ -0,0 +1,300 @@
#!/usr/bin/env python3
"""
NVIDIA Driver Update Checker

Monitors for available updates to held NVIDIA packages and sends
Discord notifications when new versions are available.

This allows manual, planned updates during maintenance windows
rather than surprise auto-updates causing downtime.

Usage:
    # Check for updates (with Discord alert)
    python3 nvidia_update_checker.py --check --discord-alerts

    # Check silently (cron job logging)
    python3 nvidia_update_checker.py --check

    # Test Discord integration
    python3 nvidia_update_checker.py --discord-test
"""

import argparse
import json
import logging
import subprocess
import sys
from dataclasses import dataclass, asdict
from datetime import datetime
from typing import List, Optional
import requests


@dataclass
class PackageUpdate:
    name: str
    current_version: str
    available_version: str
    held: bool


@dataclass
class UpdateCheckResult:
    timestamp: str
    updates_available: bool
    held_packages: List[PackageUpdate]
    other_packages: List[PackageUpdate]
    total_updates: int


class DiscordNotifier:
    def __init__(self, webhook_url: str, timeout: int = 10):
        self.webhook_url = webhook_url
        self.timeout = timeout
        self.logger = logging.getLogger(f"{__name__}.DiscordNotifier")

    def send_alert(self, title: str, description: str, color: int = 0xffa500,
                   fields: Optional[list] = None) -> bool:
        """Send embed alert to Discord."""
        embed = {
            "title": title,
            "description": description,
            "color": color,
            # Include the local UTC offset so the embed timestamp is unambiguous
            "timestamp": datetime.now().astimezone().isoformat(),
            "fields": fields or []
        }

        payload = {
            "username": "NVIDIA Update Monitor",
            "embeds": [embed]
        }

        try:
            response = requests.post(
                self.webhook_url,
                json=payload,
                timeout=self.timeout
            )
            response.raise_for_status()
            self.logger.info("Discord notification sent successfully")
            return True
        except requests.RequestException as e:
            # Covers connection errors, timeouts, and non-2xx responses
            self.logger.error(f"Failed to send Discord notification: {e}")
            return False

    def send_update_available_alert(self, updates: List[PackageUpdate]) -> bool:
        """Send alert when NVIDIA driver updates are available."""
        version_list = "\n".join([
            f"• **{pkg.name}**: {pkg.current_version} → {pkg.available_version}"
            for pkg in updates
        ])

        fields = [
            {
                "name": "Available Updates",
                "value": version_list,
                "inline": False
            },
            {
                "name": "⚠️ Action Required",
                "value": (
                    "These packages are held and will NOT auto-update.\n"
                    "Plan a maintenance window to update manually:\n"
                    "```bash\n"
                    "sudo apt-mark unhold nvidia-driver-570\n"
                    "sudo apt update && sudo apt upgrade\n"
                    "sudo reboot\n"
                    "```"
                ),
                "inline": False
            }
        ]

        return self.send_alert(
            title="🔔 NVIDIA Driver Update Available",
            description=f"New NVIDIA driver version(s) available for ubuntu-manticore ({len(updates)} package(s))",
            color=0xffa500,  # Orange
            fields=fields
        )


class NvidiaUpdateChecker:
    def __init__(self, ssh_host: Optional[str] = None, discord_webhook: Optional[str] = None,
                 enable_discord: bool = False):
        self.ssh_host = ssh_host
        self.logger = logging.getLogger(__name__)

        self.discord = None
        if enable_discord and discord_webhook:
            self.discord = DiscordNotifier(discord_webhook)

    def _run_command(self, cmd: list, timeout: int = 30) -> tuple:
        """Run command locally or via SSH."""
        if self.ssh_host:
            # Pass the remote command to ssh as a single string argument
            cmd = ["ssh", self.ssh_host, " ".join(cmd)]

        try:
            # Use an argument list without shell=True: with shell=True and a
            # list, only the first element ("ssh") would be executed.
            result = subprocess.run(
                cmd,
                capture_output=True,
                text=True,
                timeout=timeout
            )
            return result.returncode, result.stdout.strip(), result.stderr.strip()
        except subprocess.TimeoutExpired:
            return -1, "", "Command timed out"
        except Exception as e:
            return -1, "", str(e)

    def get_held_packages(self) -> List[str]:
        """Get list of held packages."""
        cmd = ["apt-mark", "showhold"]
        code, stdout, stderr = self._run_command(cmd)

        if code != 0:
            self.logger.error(f"Failed to get held packages: {stderr}")
            return []

        return [line.strip() for line in stdout.split("\n") if line.strip()]

    def check_package_updates(self) -> List[PackageUpdate]:
        """Check for available updates."""
        # Update package cache (normally needs root; fall back to the existing cache)
        update_cmd = ["apt-get", "update", "-qq"]
        code, _, stderr = self._run_command(update_cmd)
        if code != 0:
            self.logger.warning(f"apt-get update failed, using existing cache: {stderr}")

        # Get list of upgradable packages
        cmd = ["apt", "list", "--upgradable"]
        code, stdout, stderr = self._run_command(cmd)

        if code != 0:
            self.logger.error(f"Failed to check updates: {stderr}")
            return []

        held_packages = self.get_held_packages()
        updates = []

        for line in stdout.split("\n"):
            if "/" not in line or "[upgradable" not in line:
                continue

            # Parse: package/release version arch [upgradable from: old_version]
            parts = line.split()
            if len(parts) < 6:
                continue

            package_name = parts[0].split("/")[0]
            new_version = parts[1]
            old_version = parts[5].rstrip("]")

            # Filter for NVIDIA packages
            if "nvidia" in package_name.lower():
                updates.append(PackageUpdate(
                    name=package_name,
                    current_version=old_version,
                    available_version=new_version,
                    held=package_name in held_packages
                ))

        return updates

    def check_updates(self) -> UpdateCheckResult:
        """Perform full update check."""
        timestamp = datetime.now().isoformat()

        updates = self.check_package_updates()
        held_updates = [u for u in updates if u.held]
        other_updates = [u for u in updates if not u.held]

        result = UpdateCheckResult(
            timestamp=timestamp,
            updates_available=len(held_updates) > 0,
            held_packages=held_updates,
            other_packages=other_updates,
            total_updates=len(updates)
        )

        # Send Discord alert for held packages with updates
        if result.updates_available and self.discord:
            self.discord.send_update_available_alert(held_updates)

        return result


def main():
    parser = argparse.ArgumentParser(
        description='Monitor NVIDIA driver updates on held packages'
    )
    parser.add_argument('--check', action='store_true', help='Check for updates')
    parser.add_argument('--discord-webhook',
                        default='https://discord.com/api/webhooks/1404105821549498398/y2Ud1RK9rzFjv58xbypUfQNe3jrL7ZUq1FkQHa4_dfOHm2ylp93z0f4tY0O8Z-vQgKhD',
                        help='Discord webhook URL')
    parser.add_argument('--discord-alerts', action='store_true',
                        help='Enable Discord alerts')
    parser.add_argument('--discord-test', action='store_true',
                        help='Test Discord integration')
    parser.add_argument('--ssh-host', default='cal@10.10.0.226',
                        help='SSH host for remote monitoring')
    parser.add_argument('--output', choices=['json', 'pretty'], default='pretty')
    parser.add_argument('--verbose', action='store_true', help='Verbose logging')

    args = parser.parse_args()

    # Configure logging
    level = logging.DEBUG if args.verbose else logging.INFO
    logging.basicConfig(
        level=level,
        format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
    )

    # Discord test
    if args.discord_test:
        notifier = DiscordNotifier(args.discord_webhook)
        success = notifier.send_alert(
            title="NVIDIA Update Monitor Test",
            description="Discord integration is working correctly.",
            color=0x00ff00,
            fields=[
                {"name": "Host", "value": args.ssh_host, "inline": True},
                {"name": "Status", "value": "Test successful", "inline": True}
            ]
        )
        sys.exit(0 if success else 1)

    # Update check
    if args.check:
        checker = NvidiaUpdateChecker(
            ssh_host=args.ssh_host,
            discord_webhook=args.discord_webhook,
            enable_discord=args.discord_alerts
        )

        result = checker.check_updates()

        if args.output == 'json':
            print(json.dumps(asdict(result), indent=2))
        else:
            print(f"=== NVIDIA Update Check - {result.timestamp} ===")

            if result.updates_available:
                print(f"\n⚠️ {len(result.held_packages)} held package(s) have updates:")
                for pkg in result.held_packages:
                    print(f" • {pkg.name}: {pkg.current_version} → {pkg.available_version}")
                print("\nThese packages will NOT auto-update (held)")
                print("Plan a maintenance window to update manually")
            else:
                print("\n✅ All held NVIDIA packages are up to date")

            if result.other_packages:
                print(f"\nℹ️ {len(result.other_packages)} other NVIDIA package(s) have updates:")
                for pkg in result.other_packages:
                    print(f" • {pkg.name}: {pkg.current_version} → {pkg.available_version}")

        sys.exit(0 if not result.updates_available else 1)

    parser.print_help()


if __name__ == '__main__':
    main()