From 0d552a839eb7f2a3424b9ab02ea26611f1727e9d Mon Sep 17 00:00:00 2001
From: Cal Corum
Date: Sat, 7 Feb 2026 22:20:55 -0600
Subject: [PATCH] Add NVIDIA driver management and media server troubleshooting

Document NVIDIA driver hold/update workflow, GPU health monitoring, and
update checker integration for Jellyfin on ubuntu-manticore. Add
media-servers troubleshooting guide.

Co-Authored-By: Claude Opus 4.6
---
 media-servers/jellyfin-ubuntu-manticore.md |  96 ++++
 media-servers/troubleshooting.md           | 524 +++++++++++++++++++++
 2 files changed, 620 insertions(+)
 create mode 100644 media-servers/troubleshooting.md

diff --git a/media-servers/jellyfin-ubuntu-manticore.md b/media-servers/jellyfin-ubuntu-manticore.md
index 64d8ff7..aff57d9 100644
--- a/media-servers/jellyfin-ubuntu-manticore.md
+++ b/media-servers/jellyfin-ubuntu-manticore.md
@@ -129,6 +129,86 @@ For syncing watch history between Plex and Jellyfin:
 - Syncs via API, not NFO files
 - NFO files don't store watch state
 
+## NVIDIA Driver Management
+
+### Auto-Update Prevention
+
+**Issue**: NVIDIA driver auto-updates can cause driver/library version mismatches, breaking GPU access until the host is rebooted. This causes Jellyfin downtime.
+
+**Solution**: Driver packages are held to prevent automatic updates:
+```bash
+# Packages currently held (as of 2026-02-05):
+nvidia-driver-570
+nvidia-kernel-common-570
+nvidia-dkms-570
+```
+
+**Verify held packages:**
+```bash
+apt-mark showhold
+```
+
+### Update Monitoring
+
+A monitoring script checks for NVIDIA driver updates weekly and sends Discord alerts when new versions are available:
+
+**Script**: `/home/cal/scripts/nvidia_update_checker.py`
+**Schedule**: Every Monday at 9 AM
+**Logs**: `/home/cal/logs/nvidia-update-checker.log`
+
+**Manual check:**
+```bash
+python3 /home/cal/scripts/nvidia_update_checker.py --check --discord-alerts
+```
+
+**Test Discord integration:**
+```bash
+python3 /home/cal/scripts/nvidia_update_checker.py --discord-test
+```
+
+### Planned Driver Updates
+
+When a Discord alert reports available updates, plan a maintenance window:
+
+1. **Unhold packages:**
+   ```bash
+   sudo apt-mark unhold nvidia-driver-570 nvidia-kernel-common-570 nvidia-dkms-570
+   ```
+
+2. **Update drivers:**
+   ```bash
+   sudo apt update && sudo apt install --only-upgrade nvidia-driver-570 nvidia-kernel-common-570 nvidia-dkms-570
+   ```
+
+3. **Reboot immediately** (driver changes require reboot):
+   ```bash
+   sudo reboot
+   ```
+
+4. **Verify after reboot:**
+   ```bash
+   nvidia-smi
+   docker exec jellyfin nvidia-smi
+   ```
+
+5. **Re-hold packages:**
+   ```bash
+   sudo apt-mark hold nvidia-driver-570 nvidia-kernel-common-570 nvidia-dkms-570
+   ```
+
+### GPU Health Monitoring
+
+Jellyfin GPU access is monitored every 5 minutes:
+
+**Script**: `/home/cal/scripts/jellyfin_gpu_monitor.py`
+**Features**:
+- Detects GPU access loss
+- Sends Discord alerts
+- Auto-restarts the container (if the GPU is accessible on the host)
+- Logs to `/home/cal/logs/jellyfin-gpu-monitor.log`
+
+**Note**: A container restart cannot fix host-level driver issues. If Discord alerts show "Restart failed" with a driver/library mismatch, a host reboot is required.
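The restart-versus-reboot distinction in the note above can be sketched as a small shell function. This is an illustrative sketch only, not the contents of `jellyfin_gpu_monitor.py`: the function name `decide_gpu_action` and its action strings are hypothetical, and it assumes the caller passes in the exit codes of `nvidia-smi` on the host and of `docker exec jellyfin nvidia-smi`.

```shell
#!/bin/sh
# Hypothetical helper (not part of the real monitor script).
# Maps two exit codes to a recovery action:
#   $1 = exit code of `nvidia-smi` run on the host
#   $2 = exit code of `docker exec jellyfin nvidia-smi`
decide_gpu_action() {
    host_rc=$1
    container_rc=$2
    if [ "$host_rc" -ne 0 ]; then
        # The host driver itself is failing (e.g. driver/library mismatch),
        # so restarting the container cannot help.
        echo "reboot-host"
    elif [ "$container_rc" -ne 0 ]; then
        # The host GPU works but the container lost access to it.
        echo "restart-container"
    else
        echo "ok"
    fi
}

# Example wiring on the host (left commented out so the sketch has no side effects):
#   nvidia-smi >/dev/null 2>&1; host_rc=$?
#   docker exec jellyfin nvidia-smi >/dev/null 2>&1; container_rc=$?
#   decide_gpu_action "$host_rc" "$container_rc"
```

Checking the host first keeps the monitor from restarting the container in a loop when only a reboot can restore GPU access.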
+
 ## Troubleshooting
 
 ### GPU Not Detected in Transcoding
@@ -149,6 +229,22 @@ Check Jellyfin logs in Dashboard → Logs or:
 docker logs jellyfin 2>&1 | tail -50
 ```
 
+### Driver/Library Version Mismatch
+
+**Symptoms**:
+- `nvidia-smi` fails with "driver/library version mismatch"
+- Jellyfin container fails to start with an NVML error
+- GPU monitoring alerts show "Restart failed"
+
+**Cause**: The NVIDIA driver was updated but the kernel modules were not reloaded
+
+**Solution**: Reboot the host
+```bash
+sudo reboot
+```
+
 ## Related Documentation
 - Server inventory: `networking/server-inventory.md`
 - Tdarr setup: `tdarr/ubuntu-manticore-setup.md`
+- GPU monitoring: `monitoring/scripts/jellyfin_gpu_monitor.py`
+- Update monitoring: `monitoring/scripts/nvidia_update_checker.py`
diff --git a/media-servers/troubleshooting.md b/media-servers/troubleshooting.md
new file mode 100644
index 0000000..32d4970
--- /dev/null
+++ b/media-servers/troubleshooting.md
@@ -0,0 +1,524 @@
+# Media Servers - Troubleshooting Guide
+
+## Common Issues and Solutions
+
+### GPU Transcoding Problems
+
+#### GPU Not Detected in Container
+**Symptoms**:
+- Jellyfin shows "No hardware acceleration available"
+- Transcoding falls back to CPU (slow performance)
+- Container logs show NVIDIA device not found
+
+**Diagnosis**:
+```bash
+# Check GPU accessibility from container
+docker exec jellyfin nvidia-smi
+
+# Verify NVIDIA runtime is configured
+docker info | grep -i nvidia
+
+# Check container GPU configuration
+docker inspect jellyfin | grep -i gpu
+```
+
+**Solutions**:
+1. **Verify NVIDIA Container Runtime**:
+   ```bash
+   # On host
+   nvidia-smi  # Should work
+
+   # Install nvidia-container-toolkit if missing
+   sudo apt install nvidia-container-toolkit
+   sudo systemctl restart docker
+   ```
+
+2. **Fix Docker Compose Configuration**:
+   ```yaml
+   services:
+     jellyfin:
+       deploy:
+         resources:
+           reservations:
+             devices:
+               - driver: nvidia
+                 count: all
+                 capabilities: [gpu]
+   ```
+
+3. **Restart Container**:
+   ```bash
+   docker compose down
+   docker compose up -d
+   ```
+
+#### Driver/Library Version Mismatch
+**Symptoms**:
+- `nvidia-smi` fails with "driver/library version mismatch"
+- Container fails to start with an NVML error
+- GPU monitoring shows "Restart failed"
+
+**Cause**: The NVIDIA driver was updated on the host but the kernel modules were not reloaded
+
+**Solution**:
+```bash
+# Check host GPU status
+nvidia-smi  # Will fail with mismatch error
+
+# Reboot required to reload kernel modules
+sudo reboot
+
+# After reboot, verify
+nvidia-smi
+docker exec jellyfin nvidia-smi
+```
+
+**Prevention**:
+- See `/media-servers/jellyfin-ubuntu-manticore.md` NVIDIA Driver Management section
+- Hold driver packages to prevent auto-updates
+- Monitor for updates weekly via automated checks
+
+#### Transcoding Starts Then Fails
+**Symptoms**:
+- Playback begins then stops
+- Jellyfin logs show ffmpeg errors
+- GPU memory errors in logs
+
+**Diagnosis**:
+```bash
+# Check GPU memory usage
+nvidia-smi
+
+# Check for concurrent GPU users (Tdarr, other containers)
+docker ps | grep -E "tdarr|jellyfin"
+
+# Check Jellyfin transcode logs
+docker logs jellyfin 2>&1 | grep -i transcode | tail -50
+```
+
+**Solutions**:
+1. **GPU Resource Conflict**: If Tdarr is using the GPU, pause its transcoding or limit concurrent jobs
+2. **Insufficient GPU Memory**:
+   ```bash
+   # Check GPU memory
+   nvidia-smi --query-gpu=memory.used,memory.total --format=csv
+
+   # Reduce Jellyfin transcode resolution or bitrate
+   ```
+3. **Codec Not Supported**: Verify the codec is supported by the GPU encoder
+   ```bash
+   # Check available encoders
+   docker exec jellyfin ffmpeg -encoders 2>/dev/null | grep nvenc
+   ```
+
+### Container Startup Issues
+
+#### Container Won't Start After Update
+**Symptoms**:
+- Container exits immediately after `docker compose up -d`
+- Exit code indicates error (non-zero)
+
+**Diagnosis**:
+```bash
+# Check container logs
+docker logs jellyfin
+
+# Check exit code
+docker inspect jellyfin | grep ExitCode
+
+# Try starting in foreground for detailed output
+docker compose up
+```
+
+**Common Causes & Solutions**:
+
+1. **Permission Issues**:
+   ```bash
+   # Fix ownership of config/cache directories
+   sudo chown -R 1000:1000 ~/docker/jellyfin/config
+   sudo chown -R 1000:1000 /mnt/NV2/jellyfin-cache
+   ```
+
+2. **Port Already in Use**:
+   ```bash
+   # Check if port 8096 is in use
+   sudo lsof -i :8096
+
+   # Kill conflicting process or change Jellyfin port
+   ```
+
+3. **Volume Mount Failures**:
+   ```bash
+   # Verify all mount points exist and are accessible
+   ls -la ~/docker/jellyfin/config
+   ls -la /mnt/NV2/jellyfin-cache
+   mount | grep /mnt/truenas/media
+   ```
+
+#### Container Stuck in "Restarting" Loop
+**Symptoms**:
+- Docker shows container constantly restarting
+- Brief uptime then crash
+
+**Diagnosis**:
+```bash
+# Watch restart behavior
+docker stats jellyfin
+
+# Check logs for crash reason
+docker logs jellyfin --tail 200
+
+# Check resource limits
+docker inspect jellyfin | grep -A 10 Resources
+```
+
+**Solutions**:
+1. **Database Corruption**:
+   ```bash
+   # Stop container
+   docker stop jellyfin
+
+   # Backup database
+   cp ~/docker/jellyfin/config/data/library.db{,.bak}
+
+   # Try recovery
+   sqlite3 ~/docker/jellyfin/config/data/library.db "PRAGMA integrity_check;"
+   ```
+
+2. **Configuration File Issue**:
+   ```bash
+   # Rename config to force regeneration
+   mv ~/docker/jellyfin/config/system.xml{,.bak}
+
+   # Restart container
+   docker compose up -d
+   ```
+
+### Network & Connectivity
+
+#### Can't Access Web Interface
+**Symptoms**:
+- http://10.10.0.226:8096 not responding
+- Connection timeout or refused
+
+**Diagnosis**:
+```bash
+# Check if container is running
+docker ps | grep jellyfin
+
+# Check port binding
+docker port jellyfin
+
+# Test local connectivity
+curl -I http://localhost:8096
+curl -I http://10.10.0.226:8096
+
+# Check firewall
+sudo ufw status | grep 8096
+```
+
+**Solutions**:
+1. **Container Not Running**: Start container
+   ```bash
+   docker compose up -d
+   ```
+
+2. **Port Not Bound Correctly**:
+   ```yaml
+   # Fix docker-compose.yml
+   ports:
+     - "8096:8096"  # Not "0.0.0.0:8096:8096" on some systems
+   ```
+
+3. **Firewall Blocking**:
+   ```bash
+   sudo ufw allow 8096/tcp
+   ```
+
+#### Client Discovery Not Working
+**Symptoms**:
+- Jellyfin apps can't auto-discover server
+- Must manually enter IP address
+
+**Diagnosis**:
+```bash
+# Check UDP discovery port
+docker port jellyfin | grep 7359
+
+# Verify UDP traffic allowed
+sudo ufw status | grep 7359
+```
+
+**Solution**:
+```bash
+# Ensure UDP port exposed
+# In docker-compose.yml:
+#   ports:
+#     - "7359:7359/udp"
+
+# Allow in firewall
+sudo ufw allow 7359/udp
+```
+
+### Performance Issues
+
+#### Slow Transcoding Performance
+**Symptoms**:
+- Buffering during playback
+- High CPU usage despite GPU available
+- Transcoding slower than real-time
+
+**Diagnosis**:
+```bash
+# Check if GPU transcoding is actually being used
+nvidia-smi dmon -s u -c 5  # Monitor GPU usage
+
+# Check Jellyfin Dashboard > Playback for active transcodes
+
+# Verify hardware accel is enabled in Jellyfin settings
+```
+
+**Solutions**:
+1. **Hardware Acceleration Not Enabled**:
+   - Dashboard → Playback → Transcoding
+   - Select "NVIDIA NVENC"
+   - Enable desired codecs
+
+2. **GPU Busy with Other Tasks**:
+   ```bash
+   # Check what else is using GPU
+   nvidia-smi
+
+   # Pause Tdarr if running
+   docker stop tdarr-node-gpu
+   ```
+
+3. **Cache on Slow Storage**:
+   ```bash
+   # Verify cache is on NVMe, not network storage
+   docker inspect jellyfin | grep -A 5 cache
+
+   # Should be /mnt/NV2/jellyfin-cache (NVMe)
+   # NOT /mnt/truenas/... (network)
+   ```
+
+#### High Memory Usage
+**Symptoms**:
+- Jellyfin using excessive RAM
+- Server becomes unresponsive
+- OOM (Out of Memory) errors
+
+**Diagnosis**:
+```bash
+# Check memory usage
+docker stats jellyfin
+
+# Check for memory leaks in logs
+docker logs jellyfin 2>&1 | grep -i memory
+```
+
+**Solutions**:
+1. **Set Memory Limits**:
+   ```yaml
+   # In docker-compose.yml
+   deploy:
+     resources:
+       limits:
+         memory: 4G
+   ```
+
+2. **Reduce Transcode Throttle**:
+   - Dashboard → Playback
+   - Lower "Throttle Transcodes" value
+
+3. **Clear Transcode Cache**:
+   ```bash
+   # Stop container
+   docker stop jellyfin
+
+   # Clear transcode cache
+   rm -rf /mnt/NV2/jellyfin-cache/transcodes/*
+
+   # Start container
+   docker start jellyfin
+   ```
+
+### Playback Problems
+
+#### Playback Stuttering Despite Good Network
+**Symptoms**:
+- Video plays but stutters/buffers frequently
+- Network speed is adequate
+- Direct play works, transcoding stutters
+
+**Solutions**:
+1. **Check Transcode Quality Settings**:
+   - Lower bitrate in client settings
+   - Reduce resolution if needed
+
+2. **Verify GPU Transcoding Active**:
+   ```bash
+   # While playing, check GPU usage
+   nvidia-smi dmon -s u
+   # Should show encoder (enc) usage
+   ```
+
+3. **Check Storage I/O**:
+   ```bash
+   # Monitor disk I/O during playback
+   iostat -x 2 5
+   ```
+
+#### Audio/Video Sync Issues
+**Symptoms**:
+- Audio and video out of sync during playback
+
+**Solutions**:
+1. **Enable Audio Passthrough** (if supported by client)
+2. **Update ffmpeg** in container (usually handled by Jellyfin updates)
+3. **Try Different Transcode Settings**:
+   - Disable subtitle burn-in if not needed
+   - Change audio codec settings
+
+### Monitoring & Alerts
+
+#### GPU Monitor Alerts Not Working
+**Symptoms**:
+- No Discord notifications when GPU issues occur
+- Monitoring script seems to run but no alerts
+
+**Diagnosis**:
+```bash
+# Test Discord webhook
+python3 /home/cal/scripts/jellyfin_gpu_monitor.py --discord-test
+
+# Check monitoring logs
+tail -f /home/cal/logs/jellyfin-gpu-monitor.log
+
+# Verify cron job is running
+crontab -l | grep jellyfin_gpu
+```
+
+**Solutions**:
+1. **Webhook URL Invalid**:
+   - Verify webhook URL in script
+   - Test with curl: `curl -X POST `
+
+2. **Script Permissions**:
+   ```bash
+   chmod +x /home/cal/scripts/jellyfin_gpu_monitor.py
+   ```
+
+3. **Cron Environment Issues**:
+   ```bash
+   # Test script manually
+   /usr/bin/python3 /home/cal/scripts/jellyfin_gpu_monitor.py --check --discord-alerts
+   ```
+
+## Emergency Recovery Procedures
+
+### Complete System Recovery
+
+#### Jellyfin Won't Start (All Else Failed)
+1. **Stop Container**:
+   ```bash
+   docker stop jellyfin
+   docker rm jellyfin
+   ```
+
+2. **Backup Configuration**:
+   ```bash
+   cp -r ~/docker/jellyfin/config ~/docker/jellyfin/config.backup.$(date +%Y%m%d)
+   ```
+
+3. **Pull Fresh Image**:
+   ```bash
+   docker pull jellyfin/jellyfin:latest
+   ```
+
+4. **Recreate Container**:
+   ```bash
+   cd ~/docker/jellyfin
+   docker compose up -d
+   ```
+
+5. **Restore Settings** (if needed):
+   - Copy specific config files from backup
+   - Don't restore corrupt database
+
+#### GPU Completely Broken
+1. **Verify Host GPU**:
+   ```bash
+   # If nvidia-smi fails with driver mismatch
+   sudo reboot
+   ```
+
+2. **Remove GPU Access** (temporary workaround):
+   ```yaml
+   # Comment out GPU sections in docker-compose.yml
+   # CPU transcoding only until GPU fixed
+   ```
+
+3. **Reinstall NVIDIA Drivers** (if reboot doesn't help):
+   ```bash
+   # Unhold packages
+   sudo apt-mark unhold nvidia-driver-570 nvidia-kernel-common-570 nvidia-dkms-570
+
+   # Reinstall (pattern is quoted so the shell does not expand it against local files)
+   sudo apt remove --purge 'nvidia-*'
+   sudo apt install nvidia-driver-570
+   sudo reboot
+
+   # Re-hold after working
+   sudo apt-mark hold nvidia-driver-570 nvidia-kernel-common-570 nvidia-dkms-570
+   ```
+
+## Preventive Maintenance
+
+### Regular Checks (Weekly)
+```bash
+# Check GPU health
+nvidia-smi
+
+# Verify Jellyfin accessible
+curl -I http://10.10.0.226:8096
+
+# Check disk space (cache can grow large)
+df -h /mnt/NV2
+df -h ~/docker/jellyfin/config
+
+# Review the last 7 days of logs for errors (--since takes hours, not days)
+docker logs jellyfin --since 168h 2>&1 | grep -i error
+```
+
+### Monthly Tasks
+```bash
+# Update Jellyfin
+cd ~/docker/jellyfin
+docker compose pull
+docker compose up -d
+
+# Clean old transcodes
+find /mnt/NV2/jellyfin-cache/transcodes/ -type f -mtime +7 -delete
+
+# Backup configuration
+tar -czf ~/jellyfin-config-backup-$(date +%Y%m%d).tar.gz ~/docker/jellyfin/config/
+```
+
+### Before Major Changes
+- Create snapshot if on Proxmox
+- Backup full config directory
+- Test on non-production instance if possible
+- Document current working configuration
+
+## Related Documentation
+- **Setup Guide**: `/media-servers/jellyfin-ubuntu-manticore.md`
+- **NVIDIA Driver Management**: See jellyfin-ubuntu-manticore.md
+- **GPU Monitoring**: `/monitoring/scripts/CONTEXT.md`
+- **Technology Overview**: `/media-servers/CONTEXT.md`
+- **Main Instructions**: `/CLAUDE.md`
+
+## Support Resources
+- **Jellyfin Docs**: https://jellyfin.org/docs/
+- **NVIDIA Container Toolkit**: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/
+- **Discord Monitoring**: See `/monitoring/scripts/jellyfin_gpu_monitor.py`
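
As a possible addition to the weekly checks under Preventive Maintenance, the hold state of the driver packages could be verified automatically after a planned update or reinstall. The following is a minimal sketch, not an existing script in this repository: the function name `missing_holds` is hypothetical, and it assumes `apt-mark showhold` prints one package name per line.

```shell
#!/bin/sh
# Hypothetical helper: read `apt-mark showhold` output on stdin and print
# every expected NVIDIA package that is NOT currently held.
missing_holds() {
    held=$(cat)
    for pkg in nvidia-driver-570 nvidia-kernel-common-570 nvidia-dkms-570; do
        # -x matches whole lines only; -F treats the name as a literal string
        if ! printf '%s\n' "$held" | grep -qxF "$pkg"; then
            echo "$pkg"
        fi
    done
}

# Example on the host (left commented out so the sketch has no side effects):
#   apt-mark showhold | missing_holds
```

Empty output means all three packages are held; any printed name is a package that an unattended upgrade could still replace.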