Closes #27
- proxmox-backup-check.sh: SSHes to Proxmox, queries pvesh task history, classifies each running VM/CT as green/yellow/red by backup recency, posts a Discord embed summary. Designed for weekly cron on CT 302.
- ct302-self-health.sh: Checks disk usage on CT 302 itself, silently exits when healthy, posts a Discord alert when any filesystem exceeds 80% threshold. Closes the blind spot where the monitoring system cannot monitor itself externally.
- Updated monitoring/scripts/CONTEXT.md with full operational docs, install instructions, and cron schedules for both new scripts.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
| title | description | type | domain | tags |
|---|---|---|---|---|
| Monitoring Scripts Context | Operational context for all monitoring scripts: Proxmox backup checker, CT 302 self-health, Jellyfin GPU health monitor, NVIDIA driver update checker, Tdarr API/file monitors, and Windows reboot detection. Includes cron schedules, Discord integration patterns, and troubleshooting. | context | monitoring | |
Monitoring Scripts - Operational Context
Script Overview
This directory contains active operational scripts for system monitoring, health checks, alert notifications, and automation across the homelab infrastructure.
Core Monitoring Scripts
Proxmox Backup Verification
Script: proxmox-backup-check.sh
Purpose: Weekly check that every running VM/CT has a successful vzdump backup within 7 days. Posts a color-coded Discord embed with per-guest status.
Key Features:
- SSHes to Proxmox host and queries pvesh task history + guest lists via API
- Categorizes each guest: 🟢 green (backed up), 🟡 yellow (overdue), 🔴 red (no backup)
- Sorts output by VMID; only posts to Discord, with no local side effects
- --dry-run mode prints the Discord payload without sending
- --days N overrides the default 7-day window
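The green/yellow/red categorization can be sketched as follows. This is a minimal illustration in Python, not the logic of proxmox-backup-check.sh itself (which is a shell script); the function names and the example VMID 100 are hypothetical, and it assumes the timestamp of each guest's last successful vzdump task has already been extracted from the pvesh task history.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def classify_guest(last_backup: Optional[datetime], now: datetime, days: int = 7) -> str:
    """Return 'green', 'yellow', or 'red' for a single guest."""
    if last_backup is None:
        return "red"       # no successful backup found at all
    if now - last_backup <= timedelta(days=days):
        return "green"     # backed up inside the window
    return "yellow"        # a backup exists, but it is overdue


def summarize(guests: dict, now: datetime, days: int = 7) -> list:
    """Classify every guest and return (vmid, status) pairs sorted by VMID."""
    return [(vmid, classify_guest(ts, now, days)) for vmid, ts in sorted(guests.items())]


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    # Hypothetical guests: VMID 100 backed up two days ago, VMID 302 never backed up.
    print(summarize({302: None, 100: now - timedelta(days=2)}, now))
```

Passing a larger `days` value corresponds to the --days N override described above.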
Schedule: Weekly on Monday 08:00 UTC (CT 302 cron)
0 8 * * 1 DISCORD_WEBHOOK="<url>" /root/scripts/proxmox-backup-check.sh >> /var/log/proxmox-backup-check.log 2>&1
Usage:
# Dry run (no Discord)
proxmox-backup-check.sh --dry-run
# Post to Discord
DISCORD_WEBHOOK="https://discord.com/api/webhooks/..." proxmox-backup-check.sh
# Custom window
proxmox-backup-check.sh --days 14 --discord-webhook "https://..."
Dependencies: jq, curl, SSH access to Proxmox host alias proxmox
Install on CT 302:
cp proxmox-backup-check.sh /root/scripts/
chmod +x /root/scripts/proxmox-backup-check.sh
CT 302 Self-Health Monitor
Script: ct302-self-health.sh
Purpose: Monitors disk usage on CT 302 (claude-runner) itself. Alerts to Discord when any filesystem exceeds the threshold (default 80%). Runs silently when healthy — no Discord spam on green.
Key Features:
- Checks all non-virtual filesystems (df, excludes tmpfs/devtmpfs/overlay)
- Only sends a Discord alert when a filesystem is at or above threshold
- --always-post flag forces a post even when healthy (useful for testing)
- --dry-run mode prints payload without sending
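The filesystem check can be illustrated with a short sketch. It assumes the input is the text output of `df -PT` (POSIX format with a type column); the actual script may invoke df differently, and the function name is hypothetical. Following the feature list, a filesystem counts as alerting when its usage is at or above the threshold.

```python
EXCLUDED_TYPES = {"tmpfs", "devtmpfs", "overlay"}


def filesystems_over_threshold(df_output: str, threshold: int = 80) -> list:
    """Parse `df -PT` text and return (mount point, use%) pairs at or above threshold."""
    hits = []
    for line in df_output.strip().splitlines()[1:]:   # skip the header row
        fields = line.split()
        if len(fields) < 7:
            continue                                  # malformed or wrapped line
        fs_type, use_pct = fields[1], fields[5]
        mount = " ".join(fields[6:])                  # mount points may contain spaces
        if fs_type in EXCLUDED_TYPES:
            continue                                  # ignore virtual filesystems
        pct = int(use_pct.rstrip("%"))
        if pct >= threshold:
            hits.append((mount, pct))
    return hits
```

An empty result corresponds to the silent, healthy case; a non-empty list would be turned into a Discord alert.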
Schedule: Daily at 07:00 UTC (CT 302 cron)
0 7 * * * DISCORD_WEBHOOK="<url>" /root/scripts/ct302-self-health.sh >> /var/log/ct302-self-health.log 2>&1
Usage:
# Check and alert if over 80%
DISCORD_WEBHOOK="https://discord.com/api/webhooks/..." ct302-self-health.sh
# Lower threshold test
ct302-self-health.sh --threshold 50 --dry-run
# Always post (weekly status report pattern)
ct302-self-health.sh --always-post --discord-webhook "https://..."
Dependencies: jq, curl, df
Install on CT 302:
cp ct302-self-health.sh /root/scripts/
chmod +x /root/scripts/ct302-self-health.sh
Jellyfin GPU Health Monitor
Script: jellyfin_gpu_monitor.py
Purpose: Monitor Jellyfin container GPU access with Discord alerts and auto-restart capability
Key Features:
- GPU accessibility monitoring via nvidia-smi in container
- Container status verification
- Discord webhook notifications for GPU issues
- Automatic container restart on GPU access loss (configurable)
- Comprehensive logging with decision tracking
Schedule: Every 5 minutes via cron
*/5 * * * * /usr/bin/python3 /home/cal/scripts/jellyfin_gpu_monitor.py --check --discord-alerts --auto-restart >> /home/cal/logs/jellyfin-gpu-monitor.log 2>&1
Usage:
# Health check with Discord alerts
python3 jellyfin_gpu_monitor.py --check --discord-alerts
# With auto-restart on failure
python3 jellyfin_gpu_monitor.py --check --discord-alerts --auto-restart
# Test Discord integration
python3 jellyfin_gpu_monitor.py --discord-test
# JSON output for parsing
python3 jellyfin_gpu_monitor.py --check --output json
Alert Types:
- 🔴 GPU Access Lost - nvidia-smi fails in container, transcoding will fail
- 🟢 GPU Access Restored - After successful restart, GPU working again
- ⚠️ Restart Failed - Host-level issue (requires host reboot)
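How the three alert types relate to one monitoring pass can be sketched as below. This is an illustrative outline only, not the code of jellyfin_gpu_monitor.py: `check_gpu` is a hypothetical name, and the probe and restart actions are passed in as callables so the sketch makes no assumptions about container names or Docker commands.

```python
from typing import Callable, List


def check_gpu(gpu_ok: Callable[[], bool],
              restart: Callable[[], None],
              auto_restart: bool) -> List[str]:
    """Return the alerts a single monitoring pass would raise, in order."""
    if gpu_ok():
        return []                         # healthy: nothing to report
    alerts = ["GPU Access Lost"]          # nvidia-smi failed inside the container
    if not auto_restart:
        return alerts                     # alert only, leave the container alone
    restart()                             # e.g. restart the Jellyfin container
    # Probe again: recovery means the restart fixed it, otherwise suspect the host.
    alerts.append("GPU Access Restored" if gpu_ok() else "Restart Failed")
    return alerts
```

In real use, `gpu_ok` would wrap the nvidia-smi probe in the container and `restart` the container restart.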
Locations:
- Script: /home/cal/scripts/jellyfin_gpu_monitor.py (on ubuntu-manticore)
- Logs: /home/cal/logs/jellyfin-gpu-monitor.log
- Remote execution via SSH from monitoring system
Limitations: A container restart cannot fix host-level NVIDIA driver issues. If the monitor reports "Restart Failed" together with a driver/library mismatch, the host must be rebooted.
NVIDIA Driver Update Monitor
Script: nvidia_update_checker.py
Purpose: Weekly monitoring for NVIDIA driver updates on held packages with Discord notifications
Key Features:
- Checks for available updates to held NVIDIA packages
- Sends Discord alerts when new driver versions available
- Includes manual update instructions in alert
- JSON and pretty output modes
- Remote execution via SSH
Schedule: Weekly (Mondays at 9 AM)
0 9 * * 1 /usr/bin/python3 /home/cal/scripts/nvidia_update_checker.py --check --discord-alerts >> /home/cal/logs/nvidia-update-checker.log 2>&1
Usage:
# Check for updates with Discord alert
python3 nvidia_update_checker.py --check --discord-alerts
# Silent check (for cron)
python3 nvidia_update_checker.py --check
# Test Discord integration
python3 nvidia_update_checker.py --discord-test
# JSON output
python3 nvidia_update_checker.py --check --output json
Alert Content:
- 🔔 Update Available - Lists package versions (current → available)
- ⚠️ Action Required - Includes manual update procedure with commands
- Package list with version comparison
- Reminder that packages are held and won't auto-update
Locations:
- Script: /home/cal/scripts/nvidia_update_checker.py (on ubuntu-manticore)
- Logs: /home/cal/logs/nvidia-update-checker.log
Context: Part of NVIDIA driver management strategy to prevent surprise auto-updates causing GPU access loss. See /media-servers/jellyfin-ubuntu-manticore.md for full driver management documentation.
Tdarr API Monitor
Script: tdarr_monitor.py
Purpose: Comprehensive Tdarr server/node monitoring with Discord notifications and dataclass-based status tracking
Key Features:
- Server health monitoring (API connectivity, database status)
- Node status tracking (worker count, queue depth, GPU usage)
- Transcode statistics (files processed, queue size, errors)
- Discord notifications for critical issues
- Dataclass-based status representation for type safety
- JSON and pretty output modes
Usage:
# Full health check
python3 tdarr_monitor.py --check
# With Discord alerts
python3 tdarr_monitor.py --check --discord-alerts
# Monitor specific node
python3 tdarr_monitor.py --check --node-id tdarr-node-gpu
# JSON output
python3 tdarr_monitor.py --check --output json
Monitoring Scope:
- Server Health: API availability, response times, database connectivity
- Node Health: Worker status, GPU availability, processing capacity
- Queue Status: Files waiting, active transcodes, completion rate
- Error Detection: Failed transcodes, stuck jobs, node disconnections
Integration: Works with Tdarr's gaming-aware scheduler. Monitors both unmapped nodes (local cache) and standard nodes.
Documentation: See /tdarr/CONTEXT.md and /tdarr/scripts/CONTEXT.md for Tdarr infrastructure details.
Tdarr File Monitor
Script: tdarr_file_monitor.py
Purpose: Monitors Tdarr cache directory for completed .mkv files and backs them up
Key Features:
- Recursive .mkv file detection in Tdarr cache
- Size change monitoring to detect completion
- Configurable completion wait time (default: 60 seconds)
- Automatic backup to manual-backup directory
- Persistent state tracking across runs
- Duplicate handling (keeps smallest version)
- Comprehensive logging
Schedule: Via cron wrapper tdarr-file-monitor-cron.sh
Usage:
# Run file monitor scan
python3 tdarr_file_monitor.py
# Custom directories
python3 tdarr_file_monitor.py --source /path/to/cache --dest /path/to/backup
Configuration:
- Source: /mnt/NV2/tdarr-cache/nobara-pc-gpu-unmapped/temp
- Media: /mnt/NV2/tdarr-cache/nobara-pc-gpu-unmapped/media
- Destination: /mnt/NV2/tdarr-cache/manual-backup
- State File: /mnt/NV2/Development/claude-home/logs/tdarr_file_monitor_state.json
- Log File: /mnt/NV2/Development/claude-home/logs/tdarr_file_monitor.log
Completion Detection:
- File discovered in cache directory
- Size tracked over time
- When size stable for completion_wait_seconds (60s), marked complete
- File copied to backup location
- State persisted for next run
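The size-stability check in the steps above can be sketched as follows. This is a minimal illustration rather than the code in tdarr_file_monitor.py; the function name and the layout of the state dictionary are assumptions. The state holds only plain values, so it could be written to the JSON state file between runs.

```python
def update_file_state(state: dict, path: str, size: int, now: float,
                      wait_seconds: float = 60.0) -> bool:
    """Record one observation of `path`; return True once its size has been stable long enough."""
    entry = state.get(path)
    if entry is None or entry["size"] != size:
        # New file, or the size changed since the last scan: restart the stability clock.
        state[path] = {"size": size, "stable_since": now, "complete": False}
        return False
    if now - entry["stable_since"] >= wait_seconds:
        entry["complete"] = True          # unchanged for the full wait period
    return entry["complete"]
```

A caller would pass the current file size and `time.time()` on each scan and copy the file to the backup location the first time the function returns True.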
Cron Wrapper: tdarr-file-monitor-cron.sh
#!/bin/bash
# Wrapper for tdarr_file_monitor.py with logging
cd /mnt/NV2/Development/claude-home/monitoring/scripts/
python3 tdarr_file_monitor.py >> /mnt/NV2/Development/claude-home/logs/tdarr-file-monitor-cron.log 2>&1
Note: Schedule not currently active. Enable when needed for automatic backup of completed transcodes.
Windows Desktop Monitoring
Directory: windows-desktop/
Purpose: Monitor Windows machine reboots and system events with Discord notifications
Core Script: windows-reboot-monitor.ps1
Features:
- System startup monitoring (normal and unexpected)
- Shutdown detection (planned and unplanned)
- Reboot reason analysis (Windows Updates, power outages, user-initiated)
- System uptime and boot statistics tracking
- Discord webhook notifications with color coding
- Event log analysis for root cause determination
Task Scheduler Integration:
- Startup Task: windows-reboot-task-startup.xml
- Shutdown Task: windows-reboot-task-shutdown.xml
Notification Types:
- 🟢 Normal Startup - System booted after planned shutdown
- 🔴 Unexpected Restart - Recovery from power loss/crash/forced reboot
- 🟡 Planned Shutdown - System shutting down gracefully
Information Captured:
- Computer name and timestamp
- Boot/shutdown reasons (detailed)
- System uptime duration
- Boot counter for restart frequency tracking
- Event log context
Use Cases:
- Power outage detection
- Windows Update monitoring
- Hardware failure alerts
- Remote system availability tracking
- Uptime statistics
Setup: See windows-desktop/windows-setup-instructions.md for complete installation guide
Operational Patterns
Monitoring Schedule
Active Cron Jobs (on ubuntu-manticore via cal user):
# Jellyfin GPU monitoring - Every 5 minutes
*/5 * * * * /usr/bin/python3 /home/cal/scripts/jellyfin_gpu_monitor.py --check --discord-alerts --auto-restart >> /home/cal/logs/jellyfin-gpu-monitor.log 2>&1
# NVIDIA driver update checks - Weekly (Mondays at 9 AM)
0 9 * * 1 /usr/bin/python3 /home/cal/scripts/nvidia_update_checker.py --check --discord-alerts >> /home/cal/logs/nvidia-update-checker.log 2>&1
Active Cron Jobs (on CT 302 / claude-runner, root user):
# Proxmox backup verification - Weekly (Mondays at 8 AM UTC)
0 8 * * 1 DISCORD_WEBHOOK="<homelab-alerts-webhook>" /root/scripts/proxmox-backup-check.sh >> /var/log/proxmox-backup-check.log 2>&1
# CT 302 self-health disk check - Daily at 7 AM UTC (alerts only when >80%)
0 7 * * * DISCORD_WEBHOOK="<homelab-alerts-webhook>" /root/scripts/ct302-self-health.sh >> /var/log/ct302-self-health.log 2>&1
Note: Scripts must be installed manually on CT 302. Source of truth is monitoring/scripts/ in this repo — copy to /root/scripts/ on CT 302 to deploy.
Manual/On-Demand:
- tdarr_monitor.py - Run as needed for Tdarr health checks
- tdarr_file_monitor.py - Can be scheduled if automatic backup needed
- Windows monitoring - Automatic via Task Scheduler on Windows machines
Discord Integration
All monitoring scripts use Discord webhooks for notifications:
- Standardized embed format with color coding
- Timestamp inclusion
- Actionable information in alerts
- Field-based structured data presentation
Webhook Configuration:
- Default webhook embedded in scripts
- Can be overridden via command-line arguments
- Test commands available for verification
Color Coding:
- 🔴 Red (0xff6b6b) - Critical issues, failures
- 🟡 Orange (0xffa500) - Warnings, actions needed
- 🟢 Green (0x28a745) - Success, recovery, normal operations
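As an illustration of the embed format and color coding, the sketch below builds a webhook payload with a color, timestamp, and structured fields. The helper name and severity keys are hypothetical and not taken from the scripts; the payload shape (an `embeds` list whose entries carry `title`, `description`, `color`, `timestamp`, and `fields`) follows Discord's webhook format.

```python
from datetime import datetime, timezone

# Color values from the list above, keyed by hypothetical severity names.
COLORS = {"critical": 0xFF6B6B, "warning": 0xFFA500, "ok": 0x28A745}


def build_embed_payload(title, description, severity, fields=None, now=None):
    """Build a JSON-serializable Discord webhook payload containing one embed."""
    now = now or datetime.now(timezone.utc)
    embed = {
        "title": title,
        "description": description,
        "color": COLORS[severity],        # Discord expects the color as an integer
        "timestamp": now.isoformat(),     # ISO 8601 timestamp
        "fields": [
            {"name": name, "value": str(value), "inline": True}
            for name, value in (fields or {}).items()
        ],
    }
    return {"embeds": [embed]}
```

The resulting dictionary can be sent as the JSON body of an HTTP POST to the webhook URL, for example with `requests.post(url, json=payload)`.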
Logging Strategy
Centralized Log Locations:
- ubuntu-manticore: /home/cal/logs/ (GPU and driver monitoring)
- Development repo: /mnt/NV2/Development/claude-home/logs/ (file monitor, state files)
Log Formats:
- Timestamped entries
- Structured logging with severity levels
- Decision reasoning included
- Error stack traces when applicable
Log Rotation:
- Manual cleanup recommended
- Focus on recent activity (last 30-90 days)
- State files maintained separately
Accessing Logs:
# GPU monitor logs
ssh cal@10.10.0.226 "tail -f /home/cal/logs/jellyfin-gpu-monitor.log"
# Driver update logs
ssh cal@10.10.0.226 "tail -f /home/cal/logs/nvidia-update-checker.log"
# File monitor logs
tail -f /mnt/NV2/Development/claude-home/logs/tdarr_file_monitor.log
Troubleshooting Context
Common Issues
1. Discord Webhooks Not Working
# Test webhook connectivity
python3 <script>.py --discord-test
# Check for network connectivity
curl -H "Content-Type: application/json" -d '{"content": "webhook test"}' <webhook_url>
# Verify webhook URL is correct and active
2. Cron Jobs Not Running
# Verify cron service
ssh cal@10.10.0.226 "systemctl status cron"
# Check crontab
ssh cal@10.10.0.226 "crontab -l"
# Check cron logs
ssh cal@10.10.0.226 "grep CRON /var/log/syslog | tail -20"
# Test script manually
ssh cal@10.10.0.226 "/usr/bin/python3 /home/cal/scripts/jellyfin_gpu_monitor.py --check"
3. GPU Monitor Reports "Restart Failed"
# This indicates host-level GPU issue, not container issue
# Check host GPU status
ssh cal@10.10.0.226 "nvidia-smi"
# If driver/library mismatch, reboot host
ssh cal@10.10.0.226 "sudo reboot"
4. File Monitor Not Detecting Files
# Check source directory accessibility
ls -la /mnt/NV2/tdarr-cache/nobara-pc-gpu-unmapped/temp/
# Verify state file
cat /mnt/NV2/Development/claude-home/logs/tdarr_file_monitor_state.json
# Check permissions
stat /mnt/NV2/tdarr-cache/manual-backup/
5. Windows Monitor Not Sending Alerts
# Check Task Scheduler tasks
Get-ScheduledTask | Where-Object {$_.TaskName -like "*reboot*"}
# Test script manually
powershell.exe -ExecutionPolicy Bypass -File windows-reboot-monitor.ps1 -EventType Startup
# Check Windows Event Logs
Get-EventLog -LogName System -Newest 20
Diagnostic Commands
# Test all monitoring scripts
python3 jellyfin_gpu_monitor.py --discord-test
python3 nvidia_update_checker.py --discord-test
python3 tdarr_monitor.py --check
# Check monitoring script locations
ssh cal@10.10.0.226 "ls -la /home/cal/scripts/"
# Verify log file creation
ssh cal@10.10.0.226 "ls -lah /home/cal/logs/"
# Check script dependencies
python3 -c "import requests; print('requests OK')"
python3 -c "import dataclasses; print('dataclasses OK')"
# Monitor real-time logs
ssh cal@10.10.0.226 "tail -f /home/cal/logs/*.log"
Integration Points
External Dependencies
- Discord Webhooks: For all notification delivery
- SSH: Remote script execution on ubuntu-manticore
- Python 3: Runtime for all Python scripts
- requests library: HTTP communication for webhooks and APIs
- cron: Scheduled task execution
- nvidia-smi: GPU status checks
- Docker: Container inspection and management
- Task Scheduler: Windows automation (for windows-desktop monitoring)
File System Dependencies
- Script Locations:
  - Ubuntu-manticore: /home/cal/scripts/
  - Development repo: /mnt/NV2/Development/claude-home/monitoring/scripts/
- Log Directories:
  - Ubuntu-manticore: /home/cal/logs/
  - Development repo: /mnt/NV2/Development/claude-home/logs/
- State Files: JSON persistence for stateful monitors
- Tdarr Cache: /mnt/NV2/tdarr-cache/ (for file monitor)
Network Dependencies
- Discord API: discord.com/api/webhooks (HTTPS)
- Tdarr API: For tdarr_monitor.py (typically localhost or LAN)
- SSH Access: For remote script execution and log access
- DNS Resolution: For hostname-based connections
Security Considerations
Webhook Security
- Webhook URLs embedded in scripts (consider environment variables)
- Webhooks provide one-way notification (no command execution risk)
- Rate limiting on Discord side prevents abuse
- No sensitive data in notifications (system status only)
Script Execution
- Scripts on ubuntu-manticore run as the cal user (non-root); the CT 302 scripts run from root's crontab
- Cron jobs execute with user permissions
- No password authentication required (key-based SSH)
- Scripts cannot modify system configuration (except via sudo)
Log Files
- Logs may contain system information
- File permissions: 644 (world-readable, writable only by the owner)
- No secrets or credentials in logs
- Regular cleanup recommended
Performance Considerations
Monitoring Overhead
- GPU Monitor: Minimal (docker exec + nvidia-smi)
- Update Checker: Low (weekly, apt cache updates)
- Tdarr Monitor: Low (API calls only)
- File Monitor: Medium (recursive directory scanning)
- Windows Monitor: Negligible (event-triggered only)
Optimization Strategies
- Cron schedules spaced to avoid conflicts
- JSON state files for persistence (avoid redundant work)
- Efficient file scanning (globbing vs full directory walks)
- Short-circuit logic (fail fast on errors)
Resource Usage
- CPU: <1% during normal operation
- Memory: Minimal (Python scripts ~10-50 MB each)
- Disk I/O: Logs grow slowly (<100 MB/year typical)
- Network: Minimal (webhook POSTs only)
Future Enhancements
Planned Improvements:
- Centralized monitoring dashboard
- Grafana/Prometheus integration
- Automatic log rotation
- Status aggregation across all monitors
- Retry logic for failed webhook deliveries
- Enhanced error recovery procedures
- Multi-channel notification support (email, SMS)
Related Documentation
- Technology Overview: /monitoring/CONTEXT.md
- Troubleshooting: /monitoring/troubleshooting.md
- Cron Management: /monitoring/examples/cron-job-management.md
- Tdarr Integration: /tdarr/scripts/CONTEXT.md
- Jellyfin Setup: /media-servers/jellyfin-ubuntu-manticore.md
- Main Instructions: /CLAUDE.md - Context loading rules
Notes
This monitoring infrastructure provides comprehensive visibility into homelab services with minimal overhead. The Discord-based notification system ensures prompt awareness of issues while maintaining simplicity.
Scripts are designed for reliability and ease of troubleshooting, with extensive logging and test modes for validation. The modular approach allows individual monitors to be enabled/disabled independently based on current needs.
GPU monitoring and driver update checking were added specifically to prevent unplanned downtime from NVIDIA driver auto-updates, demonstrating the system's evolution based on operational learnings.