Media Servers - Technology Context

Overview

Media server infrastructure for home lab environments, covering streaming services like Jellyfin and Plex with hardware-accelerated transcoding, library management, and client discovery.

Current Deployments

Jellyfin on ubuntu-manticore

  • Location: 10.10.0.226:8096
  • GPU: NVIDIA GTX 1070 (NVENC/NVDEC)
  • Documentation: jellyfin-ubuntu-manticore.md

Plex (Existing)

  • Location: TBD (potential migration to ubuntu-manticore)
  • Note: Currently running elsewhere, may migrate for GPU access

Architecture Patterns

GPU-Accelerated Transcoding

Pattern: Hardware encoding/decoding for real-time streaming

# Docker Compose GPU passthrough
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          count: all
          capabilities: [gpu]
environment:
  - NVIDIA_DRIVER_CAPABILITIES=all
  - NVIDIA_VISIBLE_DEVICES=all
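
This reservation syntax requires the NVIDIA Container Toolkit on the Docker host; it is the Compose equivalent of docker run --gpus all.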

Storage Strategy

Pattern: Tiered storage for different access patterns (compose sketch after this list)

  • Config: Local SSD (small, fast database access)
  • Cache: Local NVMe (transcoding temp, thumbnails)
  • Media: Network storage (large capacity, read-only mount)
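
A minimal compose sketch of this tiering, using the official jellyfin/jellyfin image's /config and /cache volumes; the SSD and NVMe host paths are illustrative assumptions, not the actual ubuntu-manticore layout.

# Tiered storage sketch (SSD/NVMe host paths are assumptions)
services:
  jellyfin:
    image: jellyfin/jellyfin:latest
    volumes:
      - /opt/jellyfin/config:/config          # local SSD: database and settings
      - /mnt/nvme/jellyfin-cache:/cache       # local NVMe: transcode temp, thumbnails
      - /mnt/truenas/media:/media:ro          # network storage: media library, read-only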

Multi-Service GPU Sharing

Pattern: Resource allocation when multiple services share a GPU (compose sketch after this list)

  • Limit background tasks (Tdarr) to fewer concurrent jobs
  • Prioritize real-time services (Jellyfin/Plex playback)
  • Consumer GPUs are limited to a few concurrent NVENC sessions (driver-enforced; historically 2-3)
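
A hedged compose sketch of the sharing pattern: both services get the same GPU reservation via a YAML anchor, and the background worker's concurrency is capped in its own application settings. The Tdarr node image name is an assumption; see tdarr/CONTEXT.md for the real deployment.

# Sketch: two services sharing one GPU (image names assumed)
x-gpu: &gpu
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: all
            capabilities: [gpu]

services:
  jellyfin:
    image: jellyfin/jellyfin:latest
    <<: *gpu
  tdarr-node:
    image: ghcr.io/haveagitgat/tdarr_node:latest
    <<: *gpu
    # Keep Tdarr's GPU worker count low (set in the Tdarr UI) so real-time
    # playback transcodes always get an NVENC session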

Common Configurations

NVIDIA GPU Setup

# Verify GPU in container
docker exec <container> nvidia-smi

# Check encoder/decoder utilization
nvidia-smi dmon -s u

Media Volume Mounts

volumes:
  - /mnt/truenas/media:/media:ro  # Read-only for safety

Client Discovery

  • Jellyfin: UDP 7359
  • Plex: UDP 32410, 32412-32414 (GDM discovery; compose example below)
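
When the servers run as bridge-mode containers, these discovery ports must be published explicitly (network_mode: host avoids this). A minimal compose fragment, with the main TCP ports included for context:

# Discovery and main ports when not using host networking
services:
  jellyfin:
    ports:
      - "8096:8096/tcp"        # web UI / API
      - "7359:7359/udp"        # client auto-discovery
  plex:
    ports:
      - "32400:32400/tcp"      # Plex web / API
      - "32410:32410/udp"      # GDM discovery
      - "32412-32414:32412-32414/udp"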

Integration Points

Watch History Sync

  • Tool: watchstate (ghcr.io/arabcoders/watchstate)
  • Method: API-based sync between services
  • Note: NFO files do NOT store watch history

Tdarr Integration

  • Tdarr pre-processes media for optimal streaming
  • Shared GPU resources require coordination
  • See tdarr/CONTEXT.md for transcoding system details

Best Practices

Performance

  1. Use NVMe for cache/transcoding temp directories
  2. Mount media read-only to prevent accidental modifications
  3. Enable hardware transcoding for all supported codecs
  4. Limit concurrent transcodes based on GPU capability

Reliability

  1. Use restart: unless-stopped for containers (compose sketch after this list)
  2. Separate config from cache (different failure modes)
  3. Monitor disk space on cache volumes
  4. Regular database backups (config directory)
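
A sketch of points 1-2 in compose terms, plus an optional health check against Jellyfin's /health endpoint (assumes curl is available in the image); backups and disk-space monitoring live outside compose.

# Restart policy, separated config/cache, basic health check (sketch)
services:
  jellyfin:
    image: jellyfin/jellyfin:latest
    restart: unless-stopped
    volumes:
      - /opt/jellyfin/config:/config       # back this up: database and settings
      - /mnt/nvme/jellyfin-cache:/cache    # disposable: safe to wipe if it fills
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8096/health"]
      interval: 1m
      timeout: 10s
      retries: 3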

Security

  1. Run containers as non-root (PUID/PGID; sketch after this list)
  2. Use read-only media mounts
  3. Limit network exposure (internal LAN only)
  4. Regular container image updates
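
A sketch of the container-side security items. PUID/PGID apply to linuxserver.io-style images (the official image uses the compose user: key instead); the image and IDs below are assumptions, while the LAN bind address comes from the deployment above.

# Non-root user, read-only media, LAN-only port binding (values assumed)
services:
  jellyfin:
    image: lscr.io/linuxserver/jellyfin:latest
    environment:
      - PUID=1000
      - PGID=1000
    volumes:
      - /mnt/truenas/media:/media:ro
    ports:
      - "10.10.0.226:8096:8096"   # bind to the LAN interface, not 0.0.0.0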

GPU Compatibility Notes

NVIDIA Pascal (GTX 10-series)

  • NVENC: H.264, HEVC (no B-frames for HEVC)
  • NVDEC: H.264, HEVC, VP8, VP9
  • Sessions: consumer limit is driver-enforced (historically 2-3 concurrent; recent drivers allow more)

NVIDIA Turing+ (RTX 20-series and newer)

  • NVENC: H.264, HEVC (with B-frames); AV1 encode requires Ada Lovelace (RTX 40-series) or newer
  • NVDEC: H.264, HEVC, VP8, VP9; AV1 decode requires Ampere (RTX 30-series) or newer
  • Sessions: same driver-enforced consumer limit (higher on recent drivers)

GPU Health Monitoring

Jellyfin GPU Monitor

  • Location: ubuntu-manticore:~/scripts/jellyfin_gpu_monitor.py
  • Schedule: Every 5 minutes via cron
  • Logs: ~/logs/jellyfin-gpu-monitor.log

The monitor detects when the Jellyfin container loses GPU access (common after driver updates or Docker restarts) and automatically:

  1. Sends Discord alert
  2. Restarts the container to restore GPU access
  3. Confirms GPU is restored

Manual check:

ssh ubuntu-manticore "python3 ~/scripts/jellyfin_gpu_monitor.py --check"

FFmpeg exit code 187: Indicates NVENC failure due to lost GPU access. The monitor catches this condition before users report playback failures.

Troubleshooting

Common Issues

  1. No GPU in container: Check Docker/Podman GPU passthrough config
  2. Transcoding failures: Verify codec support for your GPU generation
  3. Slow playback start: Check network mount performance
  4. Cache filling up: Monitor trickplay/thumbnail generation
  5. FFmpeg exit code 187: GPU access lost - monitor should auto-restart

Diagnostic Commands

# GPU status
nvidia-smi

# Container GPU access
docker exec <container> nvidia-smi

# Encoder/decoder utilization
nvidia-smi dmon -s u

# Container logs
docker logs <container> 2>&1 | tail -50