Add NVIDIA update checker and monitoring scripts documentation

Add nvidia_update_checker.py for weekly driver update monitoring with
Discord alerts. Add scripts CONTEXT.md and update README.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Cal Corum 2026-02-07 22:21:00 -06:00
parent 0d552a839e
commit d0dbe86fba
3 changed files with 771 additions and 6 deletions


@ -0,0 +1,463 @@
# Monitoring Scripts - Operational Context
## Script Overview
This directory contains active operational scripts for system monitoring, health checks, alert notifications, and automation across the homelab infrastructure.
## Core Monitoring Scripts
### Jellyfin GPU Health Monitor
**Script**: `jellyfin_gpu_monitor.py`
**Purpose**: Monitor Jellyfin container GPU access with Discord alerts and auto-restart capability
**Key Features**:
- GPU accessibility monitoring via nvidia-smi in container
- Container status verification
- Discord webhook notifications for GPU issues
- Automatic container restart on GPU access loss (configurable)
- Comprehensive logging with decision tracking
**Schedule**: Every 5 minutes via cron
```bash
*/5 * * * * /usr/bin/python3 /home/cal/scripts/jellyfin_gpu_monitor.py --check --discord-alerts --auto-restart >> /home/cal/logs/jellyfin-gpu-monitor.log 2>&1
```
**Usage**:
```bash
# Health check with Discord alerts
python3 jellyfin_gpu_monitor.py --check --discord-alerts
# With auto-restart on failure
python3 jellyfin_gpu_monitor.py --check --discord-alerts --auto-restart
# Test Discord integration
python3 jellyfin_gpu_monitor.py --discord-test
# JSON output for parsing
python3 jellyfin_gpu_monitor.py --check --output json
```
**Alert Types**:
- 🔴 **GPU Access Lost** - nvidia-smi fails in container, transcoding will fail
- 🟢 **GPU Access Restored** - After successful restart, GPU working again
- ⚠️ **Restart Failed** - Host-level issue (requires host reboot)
**Locations**:
- Script: `/home/cal/scripts/jellyfin_gpu_monitor.py` (on ubuntu-manticore)
- Logs: `/home/cal/logs/jellyfin-gpu-monitor.log`
- Remote execution via SSH from monitoring system
**Limitations**: Container restart cannot fix host-level NVIDIA driver issues. If "Restart failed" with driver/library mismatch, host reboot required.
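The check-and-classify flow described above can be sketched as follows. This is an illustrative sketch, not the script's actual implementation; the container name and the mismatch-detection heuristic are assumptions.

```python
# Hypothetical sketch of the GPU-access check: run nvidia-smi inside the
# container and classify the result into the alert states listed above.
import subprocess

CONTAINER = "jellyfin"  # assumed container name


def classify_gpu_check(returncode: int, stderr: str) -> str:
    """Map a docker-exec nvidia-smi result to an alert state."""
    if returncode == 0:
        return "gpu_ok"
    if "mismatch" in stderr.lower():
        # Driver/library mismatch is a host-level problem;
        # restarting the container will not help.
        return "host_driver_issue"
    return "gpu_access_lost"  # candidate for --auto-restart


def check_container_gpu(container: str = CONTAINER) -> str:
    proc = subprocess.run(
        ["docker", "exec", container, "nvidia-smi"],
        capture_output=True, text=True, timeout=30,
    )
    return classify_gpu_check(proc.returncode, proc.stderr)
```

Keeping the classification separate from the subprocess call makes the decision logic easy to test without a running container.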
### NVIDIA Driver Update Monitor
**Script**: `nvidia_update_checker.py`
**Purpose**: Weekly monitoring for NVIDIA driver updates on held packages with Discord notifications
**Key Features**:
- Checks for available updates to held NVIDIA packages
- Sends Discord alerts when new driver versions available
- Includes manual update instructions in alert
- JSON and pretty output modes
- Remote execution via SSH
**Schedule**: Weekly (Mondays at 9 AM)
```bash
0 9 * * 1 /usr/bin/python3 /home/cal/scripts/nvidia_update_checker.py --check --discord-alerts >> /home/cal/logs/nvidia-update-checker.log 2>&1
```
**Usage**:
```bash
# Check for updates with Discord alert
python3 nvidia_update_checker.py --check --discord-alerts
# Silent check (for cron)
python3 nvidia_update_checker.py --check
# Test Discord integration
python3 nvidia_update_checker.py --discord-test
# JSON output
python3 nvidia_update_checker.py --check --output json
```
**Alert Content**:
- 🔔 **Update Available** - Lists package versions (current → available)
- ⚠️ **Action Required** - Includes manual update procedure with commands
- Package list with version comparison
- Reminder that packages are held and won't auto-update
**Locations**:
- Script: `/home/cal/scripts/nvidia_update_checker.py` (on ubuntu-manticore)
- Logs: `/home/cal/logs/nvidia-update-checker.log`
**Context**: Part of NVIDIA driver management strategy to prevent surprise auto-updates causing GPU access loss. See `/media-servers/jellyfin-ubuntu-manticore.md` for full driver management documentation.
### Tdarr API Monitor
**Script**: `tdarr_monitor.py`
**Purpose**: Comprehensive Tdarr server/node monitoring with Discord notifications and dataclass-based status tracking
**Key Features**:
- Server health monitoring (API connectivity, database status)
- Node status tracking (worker count, queue depth, GPU usage)
- Transcode statistics (files processed, queue size, errors)
- Discord notifications for critical issues
- Dataclass-based status representation for type safety
- JSON and pretty output modes
**Usage**:
```bash
# Full health check
python3 tdarr_monitor.py --check
# With Discord alerts
python3 tdarr_monitor.py --check --discord-alerts
# Monitor specific node
python3 tdarr_monitor.py --check --node-id tdarr-node-gpu
# JSON output
python3 tdarr_monitor.py --check --output json
```
**Monitoring Scope**:
- **Server Health**: API availability, response times, database connectivity
- **Node Health**: Worker status, GPU availability, processing capacity
- **Queue Status**: Files waiting, active transcodes, completion rate
- **Error Detection**: Failed transcodes, stuck jobs, node disconnections
**Integration**: Works with Tdarr's gaming-aware scheduler. Monitors both unmapped nodes (local cache) and standard nodes.
**Documentation**: See `/tdarr/CONTEXT.md` and `/tdarr/scripts/CONTEXT.md` for Tdarr infrastructure details.
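The dataclass-based status representation mentioned above can be illustrated with a minimal sketch; the field names here are assumptions, not the script's actual schema.

```python
# Illustrative sketch of dataclass-based node status (field names assumed).
from dataclasses import dataclass, asdict


@dataclass
class NodeStatus:
    node_id: str
    workers_active: int
    queue_depth: int
    gpu_available: bool


status = NodeStatus(node_id="tdarr-node-gpu", workers_active=2,
                    queue_depth=14, gpu_available=True)

# asdict() gives a JSON-ready dict, which is what makes
# an --output json mode nearly free to support.
status_dict = asdict(status)
```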
### Tdarr File Monitor
**Script**: `tdarr_file_monitor.py`
**Purpose**: Monitors Tdarr cache directory for completed .mkv files and backs them up
**Key Features**:
- Recursive .mkv file detection in Tdarr cache
- Size change monitoring to detect completion
- Configurable completion wait time (default: 60 seconds)
- Automatic backup to manual-backup directory
- Persistent state tracking across runs
- Duplicate handling (keeps smallest version)
- Comprehensive logging
**Schedule**: Via cron wrapper `tdarr-file-monitor-cron.sh`
**Usage**:
```bash
# Run file monitor scan
python3 tdarr_file_monitor.py
# Custom directories
python3 tdarr_file_monitor.py --source /path/to/cache --dest /path/to/backup
```
**Configuration**:
- **Source**: `/mnt/NV2/tdarr-cache/nobara-pc-gpu-unmapped/temp`
- **Media**: `/mnt/NV2/tdarr-cache/nobara-pc-gpu-unmapped/media`
- **Destination**: `/mnt/NV2/tdarr-cache/manual-backup`
- **State File**: `/mnt/NV2/Development/claude-home/logs/tdarr_file_monitor_state.json`
- **Log File**: `/mnt/NV2/Development/claude-home/logs/tdarr_file_monitor.log`
**Completion Detection**:
1. File discovered in cache directory
2. Size tracked over time
3. When size stable for completion_wait_seconds (60s), marked complete
4. File copied to backup location
5. State persisted for next run
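The size-stability rule in steps 2–3 can be sketched as a pure function over observed (timestamp, size) samples; the 60-second constant matches the default above, but the sample-history shape is an assumption for illustration.

```python
# Sketch of the completion-detection rule: a file counts as complete once
# its size has been unchanged for at least wait_seconds.
COMPLETION_WAIT_SECONDS = 60


def is_complete(samples, wait_seconds=COMPLETION_WAIT_SECONDS):
    """samples: list of (timestamp, size_bytes) observations, oldest first."""
    if len(samples) < 2:
        return False
    last_t, last_size = samples[-1]
    # Walk backwards to find when the current size was first observed.
    stable_since = last_t
    for t, size in reversed(samples[:-1]):
        if size != last_size:
            break
        stable_since = t
    return (last_t - stable_since) >= wait_seconds
```

A file whose size changed 30 seconds ago is still "in progress"; one stable for 90 seconds is eligible for backup.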
**Cron Wrapper**: `tdarr-file-monitor-cron.sh`
```bash
#!/bin/bash
# Wrapper for tdarr_file_monitor.py with logging
cd /mnt/NV2/Development/claude-home/monitoring/scripts/
python3 tdarr_file_monitor.py >> /mnt/NV2/Development/claude-home/logs/tdarr-file-monitor-cron.log 2>&1
```
**Note**: Schedule not currently active. Enable when needed for automatic backup of completed transcodes.
### Windows Desktop Monitoring
**Directory**: `windows-desktop/`
**Purpose**: Monitor Windows machine reboots and system events with Discord notifications
**Core Script**: `windows-reboot-monitor.ps1`
**Features**:
- System startup monitoring (normal and unexpected)
- Shutdown detection (planned and unplanned)
- Reboot reason analysis (Windows Updates, power outages, user-initiated)
- System uptime and boot statistics tracking
- Discord webhook notifications with color coding
- Event log analysis for root cause determination
**Task Scheduler Integration**:
- **Startup Task**: `windows-reboot-task-startup.xml`
- **Shutdown Task**: `windows-reboot-task-shutdown.xml`
**Notification Types**:
- 🟢 **Normal Startup** - System booted after planned shutdown
- 🔴 **Unexpected Restart** - Recovery from power loss/crash/forced reboot
- 🟡 **Planned Shutdown** - System shutting down gracefully
**Information Captured**:
- Computer name and timestamp
- Boot/shutdown reasons (detailed)
- System uptime duration
- Boot counter for restart frequency tracking
- Event log context
**Use Cases**:
- Power outage detection
- Windows Update monitoring
- Hardware failure alerts
- Remote system availability tracking
- Uptime statistics
**Setup**: See `windows-desktop/windows-setup-instructions.md` for complete installation guide
## Operational Patterns
### Monitoring Schedule
**Active Cron Jobs** (on ubuntu-manticore via cal user):
```bash
# Jellyfin GPU monitoring - Every 5 minutes
*/5 * * * * /usr/bin/python3 /home/cal/scripts/jellyfin_gpu_monitor.py --check --discord-alerts --auto-restart >> /home/cal/logs/jellyfin-gpu-monitor.log 2>&1
# NVIDIA driver update checks - Weekly (Mondays at 9 AM)
0 9 * * 1 /usr/bin/python3 /home/cal/scripts/nvidia_update_checker.py --check --discord-alerts >> /home/cal/logs/nvidia-update-checker.log 2>&1
```
**Manual/On-Demand**:
- `tdarr_monitor.py` - Run as needed for Tdarr health checks
- `tdarr_file_monitor.py` - Can be scheduled if automatic backup needed
- Windows monitoring - Automatic via Task Scheduler on Windows machines
### Discord Integration
**All monitoring scripts use Discord webhooks** for notifications:
- Standardized embed format with color coding
- Timestamp inclusion
- Actionable information in alerts
- Field-based structured data presentation
**Webhook Configuration**:
- Default webhook embedded in scripts
- Can be overridden via command-line arguments
- Test commands available for verification
**Color Coding**:
- 🔴 Red (`0xff6b6b`) - Critical issues, failures
- 🟡 Orange (`0xffa500`) - Warnings, actions needed
- 🟢 Green (`0x28a745`) - Success, recovery, normal operations
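The shared embed convention can be sketched as a small helper; this is a sketch of the pattern the scripts follow (colors from the table above), not a shared module that actually exists in the repo.

```python
# Minimal sketch of the standardized embed format used across monitors.
from datetime import datetime

COLORS = {"critical": 0xff6b6b, "warning": 0xffa500, "ok": 0x28a745}


def build_embed(title, description, severity="warning", fields=None):
    """Build a Discord embed dict with the shared color/timestamp convention."""
    return {
        "title": title,
        "description": description,
        "color": COLORS[severity],
        "timestamp": datetime.now().isoformat(),
        "fields": fields or [],
    }


embed = build_embed("GPU Access Lost", "nvidia-smi failed in container",
                    severity="critical")
```

The resulting dict is posted as `{"username": ..., "embeds": [embed]}` to the webhook URL, as the scripts below do.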
### Logging Strategy
**Centralized Log Locations**:
- **ubuntu-manticore**: `/home/cal/logs/` (GPU and driver monitoring)
- **Development repo**: `/mnt/NV2/Development/claude-home/logs/` (file monitor, state files)
**Log Formats**:
- Timestamped entries
- Structured logging with severity levels
- Decision reasoning included
- Error stack traces when applicable
**Log Rotation**:
- Manual cleanup recommended
- Focus on recent activity (last 30-90 days)
- State files maintained separately
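The manual cleanup policy above could be driven by a small helper like this sketch; the 90-day cutoff follows the guideline above, and the (path, mtime) input shape is an assumption for illustration.

```python
# Hedged sketch of log pruning: select entries older than the retention window.
def stale_logs(entries, now_epoch, max_age_days=90):
    """entries: list of (path, mtime_epoch). Returns paths past the cutoff."""
    cutoff = now_epoch - max_age_days * 86400
    return [path for path, mtime in entries if mtime < cutoff]

# With pathlib, the returned paths could then be removed via Path(p).unlink().
```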
**Accessing Logs**:
```bash
# GPU monitor logs
ssh cal@10.10.0.226 "tail -f /home/cal/logs/jellyfin-gpu-monitor.log"
# Driver update logs
ssh cal@10.10.0.226 "tail -f /home/cal/logs/nvidia-update-checker.log"
# File monitor logs
tail -f /mnt/NV2/Development/claude-home/logs/tdarr_file_monitor.log
```
## Troubleshooting Context
### Common Issues
**1. Discord Webhooks Not Working**
```bash
# Test webhook connectivity
python3 <script>.py --discord-test
# Check network connectivity (a bare POST gets a JSON error back, proving reachability)
curl -X POST -H "Content-Type: application/json" -d '{"content":"webhook test"}' <webhook_url>
# Verify webhook URL is correct and active
```
**2. Cron Jobs Not Running**
```bash
# Verify cron service
ssh cal@10.10.0.226 "systemctl status cron"
# Check crontab
ssh cal@10.10.0.226 "crontab -l"
# Check cron logs
ssh cal@10.10.0.226 "grep CRON /var/log/syslog | tail -20"
# Test script manually
ssh cal@10.10.0.226 "/usr/bin/python3 /home/cal/scripts/jellyfin_gpu_monitor.py --check"
```
**3. GPU Monitor Reports "Restart Failed"**
```bash
# This indicates host-level GPU issue, not container issue
# Check host GPU status
ssh cal@10.10.0.226 "nvidia-smi"
# If driver/library mismatch, reboot host
ssh cal@10.10.0.226 "sudo reboot"
```
**4. File Monitor Not Detecting Files**
```bash
# Check source directory accessibility
ls -la /mnt/NV2/tdarr-cache/nobara-pc-gpu-unmapped/temp/
# Verify state file
cat /mnt/NV2/Development/claude-home/logs/tdarr_file_monitor_state.json
# Check permissions
stat /mnt/NV2/tdarr-cache/manual-backup/
```
**5. Windows Monitor Not Sending Alerts**
```powershell
# Check Task Scheduler tasks
Get-ScheduledTask | Where-Object {$_.TaskName -like "*reboot*"}
# Test script manually
powershell.exe -ExecutionPolicy Bypass -File windows-reboot-monitor.ps1 -EventType Startup
# Check Windows Event Logs
Get-EventLog -LogName System -Newest 20
```
### Diagnostic Commands
```bash
# Test all monitoring scripts
python3 jellyfin_gpu_monitor.py --discord-test
python3 nvidia_update_checker.py --discord-test
python3 tdarr_monitor.py --check
# Check monitoring script locations
ssh cal@10.10.0.226 "ls -la /home/cal/scripts/"
# Verify log file creation
ssh cal@10.10.0.226 "ls -lah /home/cal/logs/"
# Check script dependencies
python3 -c "import requests; print('requests OK')"
python3 -c "import dataclasses; print('dataclasses OK')"
# Monitor real-time logs
ssh cal@10.10.0.226 "tail -f /home/cal/logs/*.log"
```
## Integration Points
### External Dependencies
- **Discord Webhooks**: For all notification delivery
- **SSH**: Remote script execution on ubuntu-manticore
- **Python 3**: Runtime for all Python scripts
- **requests library**: HTTP communication for webhooks and APIs
- **cron**: Scheduled task execution
- **nvidia-smi**: GPU status checks
- **Docker**: Container inspection and management
- **Task Scheduler**: Windows automation (for windows-desktop monitoring)
### File System Dependencies
- **Script Locations**:
- Ubuntu-manticore: `/home/cal/scripts/`
- Development repo: `/mnt/NV2/Development/claude-home/monitoring/scripts/`
- **Log Directories**:
- Ubuntu-manticore: `/home/cal/logs/`
- Development repo: `/mnt/NV2/Development/claude-home/logs/`
- **State Files**: JSON persistence for stateful monitors
- **Tdarr Cache**: `/mnt/NV2/tdarr-cache/` (for file monitor)
### Network Dependencies
- **Discord API**: `discord.com/api/webhooks` (HTTPS)
- **Tdarr API**: For tdarr_monitor.py (typically localhost or LAN)
- **SSH Access**: For remote script execution and log access
- **DNS Resolution**: For hostname-based connections
## Security Considerations
### Webhook Security
- Webhook URLs embedded in scripts (consider environment variables)
- Webhooks provide one-way notification (no command execution risk)
- Rate limiting on Discord side prevents abuse
- No sensitive data in notifications (system status only)
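The environment-variable approach suggested above could look like this sketch; the `DISCORD_WEBHOOK_URL` variable name is an assumption, not something the scripts currently read.

```python
# Sketch of moving the webhook URL out of the script body.
import os

DEFAULT_WEBHOOK = "https://discord.com/api/webhooks/..."  # placeholder default


def get_webhook_url(default=DEFAULT_WEBHOOK):
    # Prefer DISCORD_WEBHOOK_URL from the environment; fall back to the default.
    return os.environ.get("DISCORD_WEBHOOK_URL", default)
```

Cron entries would then export the variable (or source a protected env file) instead of carrying the URL in the script.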
### Script Execution
- Scripts run as cal user (non-root)
- Cron jobs execute with user permissions
- No password authentication required (key-based SSH)
- Scripts cannot modify system configuration (except via sudo)
### Log Files
- Logs may contain system information
- File permissions: 644 (world-readable, writable only by owner)
- No secrets or credentials in logs
- Regular cleanup recommended
## Performance Considerations
### Monitoring Overhead
- **GPU Monitor**: Minimal (docker exec + nvidia-smi)
- **Update Checker**: Low (weekly, apt cache updates)
- **Tdarr Monitor**: Low (API calls only)
- **File Monitor**: Medium (recursive directory scanning)
- **Windows Monitor**: Negligible (event-triggered only)
### Optimization Strategies
- Cron schedules spaced to avoid conflicts
- JSON state files for persistence (avoid redundant work)
- Efficient file scanning (globbing vs full directory walks)
- Short-circuit logic (fail fast on errors)
### Resource Usage
- **CPU**: <1% during normal operation
- **Memory**: Minimal (Python scripts ~10-50 MB each)
- **Disk I/O**: Logs grow slowly (<100 MB/year typical)
- **Network**: Minimal (webhook POSTs only)
## Future Enhancements
**Planned Improvements**:
- Centralized monitoring dashboard
- Grafana/Prometheus integration
- Automatic log rotation
- Status aggregation across all monitors
- Retry logic for failed webhook deliveries
- Enhanced error recovery procedures
- Multi-channel notification support (email, SMS)
## Related Documentation
- **Technology Overview**: `/monitoring/CONTEXT.md`
- **Troubleshooting**: `/monitoring/troubleshooting.md`
- **Cron Management**: `/monitoring/examples/cron-job-management.md`
- **Tdarr Integration**: `/tdarr/scripts/CONTEXT.md`
- **Jellyfin Setup**: `/media-servers/jellyfin-ubuntu-manticore.md`
- **Main Instructions**: `/CLAUDE.md` - Context loading rules
## Notes
This monitoring infrastructure provides comprehensive visibility into homelab services with minimal overhead. The Discord-based notification system ensures prompt awareness of issues while maintaining simplicity.
Scripts are designed for reliability and ease of troubleshooting, with extensive logging and test modes for validation. The modular approach allows individual monitors to be enabled/disabled independently based on current needs.
GPU monitoring and driver update checking were added specifically to prevent unplanned downtime from NVIDIA driver auto-updates, demonstrating the system's evolution based on operational learnings.


@ -91,9 +91,9 @@ See `setup-discord-monitoring.md` for Discord webhook setup instructions.
### Tdarr Keywords Trigger
When working with Tdarr-related tasks, the following documentation is automatically loaded:
- `reference/docker/tdarr-troubleshooting.md`
- `patterns/docker/distributed-transcoding.md`
- `scripts/tdarr/README.md`
- `docker/examples/tdarr-troubleshooting.md`
- `docker/examples/distributed-transcoding.md`
- `tdarr/scripts/README.md`
### Gaming-Aware Scheduling
The monitoring scripts integrate with the gaming-aware Tdarr scheduling system that provides:
@ -112,6 +112,8 @@ The monitoring scripts integrate with the gaming-aware Tdarr scheduling system t
## Related Documentation
- `/patterns/docker/distributed-transcoding.md` - Tdarr architecture patterns
- `/reference/docker/tdarr-troubleshooting.md` - Troubleshooting guide
- `/scripts/tdarr/README.md` - Tdarr management scripts
- `/docker/examples/distributed-transcoding.md` - Tdarr architecture patterns
- `/docker/examples/tdarr-troubleshooting.md` - Troubleshooting guide
- `/tdarr/scripts/README.md` - Tdarr management scripts
- `/tdarr/CONTEXT.md` - Tdarr technology overview
- `/monitoring/CONTEXT.md` - Monitoring overview and patterns


@ -0,0 +1,300 @@
#!/usr/bin/env python3
"""
NVIDIA Driver Update Checker

Monitors for available updates to held NVIDIA packages and sends
Discord notifications when new versions are available.

This allows manual, planned updates during maintenance windows
rather than surprise auto-updates causing downtime.

Usage:
    # Check for updates (with Discord alert)
    python3 nvidia_update_checker.py --check --discord-alerts

    # Check silently (cron job logging)
    python3 nvidia_update_checker.py --check

    # Test Discord integration
    python3 nvidia_update_checker.py --discord-test
"""
import argparse
import json
import logging
import subprocess
import sys
from dataclasses import dataclass, asdict
from datetime import datetime
from typing import List

import requests


@dataclass
class PackageUpdate:
    name: str
    current_version: str
    available_version: str
    held: bool


@dataclass
class UpdateCheckResult:
    timestamp: str
    updates_available: bool
    held_packages: List[PackageUpdate]
    other_packages: List[PackageUpdate]
    total_updates: int


class DiscordNotifier:
    def __init__(self, webhook_url: str, timeout: int = 10):
        self.webhook_url = webhook_url
        self.timeout = timeout
        self.logger = logging.getLogger(f"{__name__}.DiscordNotifier")

    def send_alert(self, title: str, description: str, color: int = 0xffa500,
                   fields: list = None) -> bool:
        """Send embed alert to Discord."""
        embed = {
            "title": title,
            "description": description,
            "color": color,
            "timestamp": datetime.now().isoformat(),
            "fields": fields or []
        }
        payload = {
            "username": "NVIDIA Update Monitor",
            "embeds": [embed]
        }
        try:
            response = requests.post(
                self.webhook_url,
                json=payload,
                timeout=self.timeout
            )
            response.raise_for_status()
            self.logger.info("Discord notification sent successfully")
            return True
        except Exception as e:
            self.logger.error(f"Failed to send Discord notification: {e}")
            return False

    def send_update_available_alert(self, updates: List[PackageUpdate]) -> bool:
        """Send alert when NVIDIA driver updates are available."""
        version_list = "\n".join([
            f"• **{pkg.name}**: {pkg.current_version} → {pkg.available_version}"
            for pkg in updates
        ])
        fields = [
            {
                "name": "Available Updates",
                "value": version_list,
                "inline": False
            },
            {
                "name": "⚠️ Action Required",
                "value": (
                    "These packages are held and will NOT auto-update.\n"
                    "Plan a maintenance window to update manually:\n"
                    "```bash\n"
                    "sudo apt-mark unhold nvidia-driver-570\n"
                    "sudo apt update && sudo apt upgrade\n"
                    "sudo reboot\n"
                    "```"
                ),
                "inline": False
            }
        ]
        return self.send_alert(
            title="🔔 NVIDIA Driver Update Available",
            description=(
                f"New NVIDIA driver version(s) available for "
                f"ubuntu-manticore ({len(updates)} package(s))"
            ),
            color=0xffa500,  # Orange
            fields=fields
        )


class NvidiaUpdateChecker:
    def __init__(self, ssh_host: str = None, discord_webhook: str = None,
                 enable_discord: bool = False):
        self.ssh_host = ssh_host
        self.logger = logging.getLogger(__name__)
        self.discord = None
        if enable_discord and discord_webhook:
            self.discord = DiscordNotifier(discord_webhook)

    def _run_command(self, cmd: list, timeout: int = 30) -> tuple:
        """Run command locally or via SSH."""
        if self.ssh_host:
            # ssh takes the remote command as a single string argument;
            # we still pass an argument list locally, so no shell is needed.
            cmd = ["ssh", self.ssh_host, " ".join(cmd)]
        try:
            result = subprocess.run(
                cmd,
                capture_output=True,
                text=True,
                timeout=timeout
            )
            return result.returncode, result.stdout.strip(), result.stderr.strip()
        except subprocess.TimeoutExpired:
            return -1, "", "Command timed out"
        except Exception as e:
            return -1, "", str(e)

    def get_held_packages(self) -> List[str]:
        """Get list of held packages."""
        cmd = ["apt-mark", "showhold"]
        code, stdout, stderr = self._run_command(cmd)
        if code != 0:
            self.logger.error(f"Failed to get held packages: {stderr}")
            return []
        return [line.strip() for line in stdout.split("\n") if line.strip()]

    def check_package_updates(self) -> List[PackageUpdate]:
        """Check for available updates."""
        # Update package cache
        update_cmd = ["apt-get", "update", "-qq"]
        self._run_command(update_cmd)
        # Get list of upgradable packages
        cmd = ["apt", "list", "--upgradable"]
        code, stdout, stderr = self._run_command(cmd)
        if code != 0:
            self.logger.error(f"Failed to check updates: {stderr}")
            return []
        held_packages = self.get_held_packages()
        updates = []
        for line in stdout.split("\n"):
            if "/" not in line or "[upgradable" not in line:
                continue
            # Parse: package/release version arch [upgradable from: old_version]
            parts = line.split()
            if len(parts) < 6:
                continue
            package_name = parts[0].split("/")[0]
            new_version = parts[1]
            old_version = parts[5].rstrip("]")
            # Filter for NVIDIA packages
            if "nvidia" in package_name.lower():
                updates.append(PackageUpdate(
                    name=package_name,
                    current_version=old_version,
                    available_version=new_version,
                    held=package_name in held_packages
                ))
        return updates

    def check_updates(self) -> UpdateCheckResult:
        """Perform full update check."""
        timestamp = datetime.now().isoformat()
        updates = self.check_package_updates()
        held_updates = [u for u in updates if u.held]
        other_updates = [u for u in updates if not u.held]
        result = UpdateCheckResult(
            timestamp=timestamp,
            updates_available=len(held_updates) > 0,
            held_packages=held_updates,
            other_packages=other_updates,
            total_updates=len(updates)
        )
        # Send Discord alert for held packages with updates
        if result.updates_available and self.discord:
            self.discord.send_update_available_alert(held_updates)
        return result


def main():
    parser = argparse.ArgumentParser(
        description='Monitor NVIDIA driver updates on held packages'
    )
    parser.add_argument('--check', action='store_true', help='Check for updates')
    parser.add_argument('--discord-webhook',
                        default='https://discord.com/api/webhooks/1404105821549498398/y2Ud1RK9rzFjv58xbypUfQNe3jrL7ZUq1FkQHa4_dfOHm2ylp93z0f4tY0O8Z-vQgKhD',
                        help='Discord webhook URL')
    parser.add_argument('--discord-alerts', action='store_true',
                        help='Enable Discord alerts')
    parser.add_argument('--discord-test', action='store_true',
                        help='Test Discord integration')
    parser.add_argument('--ssh-host', default='cal@10.10.0.226',
                        help='SSH host for remote monitoring')
    parser.add_argument('--output', choices=['json', 'pretty'], default='pretty')
    parser.add_argument('--verbose', action='store_true', help='Verbose logging')
    args = parser.parse_args()

    # Configure logging
    level = logging.DEBUG if args.verbose else logging.INFO
    logging.basicConfig(
        level=level,
        format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
    )

    # Discord test
    if args.discord_test:
        notifier = DiscordNotifier(args.discord_webhook)
        success = notifier.send_alert(
            title="NVIDIA Update Monitor Test",
            description="Discord integration is working correctly.",
            color=0x00ff00,
            fields=[
                {"name": "Host", "value": args.ssh_host, "inline": True},
                {"name": "Status", "value": "Test successful", "inline": True}
            ]
        )
        sys.exit(0 if success else 1)

    # Update check
    if args.check:
        checker = NvidiaUpdateChecker(
            ssh_host=args.ssh_host,
            discord_webhook=args.discord_webhook,
            enable_discord=args.discord_alerts
        )
        result = checker.check_updates()
        if args.output == 'json':
            print(json.dumps(asdict(result), indent=2))
        else:
            print(f"=== NVIDIA Update Check - {result.timestamp} ===")
            if result.updates_available:
                print(f"\n⚠️ {len(result.held_packages)} held package(s) have updates:")
                for pkg in result.held_packages:
                    print(f"  {pkg.name}: {pkg.current_version} → {pkg.available_version}")
                print("\nThese packages will NOT auto-update (held)")
                print("Plan a maintenance window to update manually")
            else:
                print("\n✅ All held NVIDIA packages are up to date")
            if result.other_packages:
                print(f"\n{len(result.other_packages)} other NVIDIA package(s) have updates:")
                for pkg in result.other_packages:
                    print(f"  {pkg.name}: {pkg.current_version} → {pkg.available_version}")
        # Non-zero exit when updates are pending, so cron logs flag attention
        sys.exit(0 if not result.updates_available else 1)

    parser.print_help()


if __name__ == '__main__':
    main()