# Media Servers - Technology Context

## Overview

Media server infrastructure for home lab environments, covering streaming services like Jellyfin and Plex with hardware-accelerated transcoding, library management, and client discovery.

## Current Deployments

### Jellyfin on ubuntu-manticore

- **Location**: 10.10.0.226:8096
- **GPU**: NVIDIA GTX 1070 (NVENC/NVDEC)
- **Documentation**: `jellyfin-ubuntu-manticore.md`

### Plex (Existing)

- **Location**: TBD (potential migration to ubuntu-manticore)
- **Note**: Currently running elsewhere; may migrate to ubuntu-manticore for GPU access

## Architecture Patterns

### GPU-Accelerated Transcoding

**Pattern**: Hardware encoding/decoding for real-time streaming

```yaml
# Docker Compose GPU passthrough
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          count: all
          capabilities: [gpu]
environment:
  - NVIDIA_DRIVER_CAPABILITIES=all
  - NVIDIA_VISIBLE_DEVICES=all
```

### Storage Strategy

**Pattern**: Tiered storage for different access patterns

- **Config**: Local SSD (small, fast database access)
- **Cache**: Local NVMe (transcoding temp, thumbnails)
- **Media**: Network storage (large capacity, read-only mount)
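
A minimal compose excerpt sketching how these tiers might map to mounts; the config and cache host paths are illustrative, not taken from the actual deployment:

```yaml
volumes:
  - /opt/jellyfin/config:/config       # local SSD: database and settings (assumed path)
  - /mnt/nvme/jellyfin-cache:/cache    # local NVMe: transcode temp, thumbnails (assumed path)
  - /mnt/truenas/media:/media:ro       # network share: media library, read-only
```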

### Multi-Service GPU Sharing

**Pattern**: Resource allocation when multiple services share one GPU

- Limit background tasks (Tdarr) to fewer concurrent jobs
- Prioritize real-time services (Jellyfin/Plex playback)
- Consumer GPUs enforce a driver-level cap on concurrent NVENC sessions (historically 2-3; raised on recent drivers); a quick way to check current usage is shown below
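
One way to see how many NVENC sessions are active before scheduling background work; these `nvidia-smi` query fields are available on reasonably recent drivers:

```bash
# Active encode sessions and average encoder FPS
nvidia-smi --query-gpu=encoder.stats.sessionCount,encoder.stats.averageFps --format=csv
```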

## Common Configurations

### NVIDIA GPU Setup

```bash
# Verify GPU in container
docker exec <container> nvidia-smi

# Check encoder/decoder utilization
nvidia-smi dmon -s u
```

### Media Volume Mounts

```yaml
volumes:
  - /mnt/truenas/media:/media:ro  # Read-only for safety
```

### Client Discovery

- **Jellyfin**: UDP 7359
- **Plex**: UDP 32410-32414 (GDM)
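
If a host firewall blocks discovery, rules along these lines would open it to the LAN only; `ufw` is assumed as the firewall, and the `10.10.0.0/24` subnet is inferred from the Jellyfin address above:

```bash
sudo ufw allow from 10.10.0.0/24 to any port 7359 proto udp          # Jellyfin discovery
sudo ufw allow from 10.10.0.0/24 to any port 32410:32414 proto udp   # Plex GDM
```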

## Integration Points

### Watch History Sync

- **Tool**: watchstate (ghcr.io/arabcoders/watchstate)
- **Method**: API-based sync between services
- **Note**: NFO files do NOT store watch history
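
A minimal, illustrative compose service for watchstate; the mount point and other specifics below are assumptions, so consult the project's README for the required environment variables and WebUI settings:

```yaml
services:
  watchstate:
    image: ghcr.io/arabcoders/watchstate:latest
    container_name: watchstate
    restart: unless-stopped
    volumes:
      - ./watchstate:/config   # assumed config mount point
```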

### Tdarr Integration

- Tdarr pre-processes media for optimal streaming
- Shared GPU resources require coordination
- See `tdarr/CONTEXT.md` for transcoding system details

## Best Practices

### Performance

1. Use NVMe for cache/transcoding temp directories
2. Mount media read-only to prevent accidental modifications
3. Enable hardware transcoding for all supported codecs
4. Limit concurrent transcodes based on GPU capability

### Reliability

1. Use `restart: unless-stopped` for containers
2. Separate config from cache (different failure modes)
3. Monitor disk space on cache volumes
4. Back up the database regularly (config directory); a sketch follows below
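
An illustrative nightly backup of the config volume; container name and paths are assumptions. Stopping the container first avoids copying the SQLite database mid-write:

```bash
docker stop jellyfin
tar -czf "/mnt/backups/jellyfin-config-$(date +%F).tar.gz" -C /opt/jellyfin config
docker start jellyfin
```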

### Security

1. Run containers as non-root (PUID/PGID)
2. Use read-only media mounts
3. Limit network exposure (internal LAN only)
4. Update container images regularly (see the one-liner below)
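
For routine image updates, the usual compose flow applies:

```bash
docker compose pull && docker compose up -d
```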

## GPU Compatibility Notes

### NVIDIA Pascal (GTX 10-series)

- NVENC: H.264, HEVC (no B-frames for HEVC)
- NVDEC: H.264, HEVC, VP8, VP9
- Sessions: driver-enforced consumer cap (historically 2-3; raised on recent drivers)

### NVIDIA Turing+ (RTX 20-series and newer)

- NVENC: H.264, HEVC (with B-frames); AV1 from Ada (RTX 40-series) onward
- NVDEC: H.264, HEVC, VP8, VP9; AV1 from Ampere (RTX 30-series) onward
- Sessions: 3+ concurrent (same driver-enforced consumer cap applies)
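
One way to confirm which NVENC encoders your combination of GPU, driver, and FFmpeg build actually exposes:

```bash
ffmpeg -hide_banner -encoders | grep nvenc
```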

## GPU Health Monitoring

### Jellyfin GPU Monitor

**Location**: `ubuntu-manticore:~/scripts/jellyfin_gpu_monitor.py`
**Schedule**: Every 5 minutes via cron
**Logs**: `~/logs/jellyfin-gpu-monitor.log`
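
The crontab entry presumably looks something like this; the interpreter path and log redirection are assumptions:

```bash
*/5 * * * * /usr/bin/python3 "$HOME/scripts/jellyfin_gpu_monitor.py" >> "$HOME/logs/jellyfin-gpu-monitor.log" 2>&1
```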

The monitor detects when the Jellyfin container loses GPU access (common after driver updates or Docker restarts) and automatically:

1. Sends a Discord alert
2. Restarts the container to restore GPU access
3. Confirms GPU access is restored
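
In shell terms, the check-and-restart logic approximates to the following; the real script adds logging and richer alert content, and the webhook URL is a placeholder:

```bash
WEBHOOK_URL="https://discord.com/api/webhooks/..."   # your webhook here
if ! docker exec jellyfin nvidia-smi >/dev/null 2>&1; then
    # GPU check failed inside the container: alert, then restart
    curl -sS -H "Content-Type: application/json" \
         -d '{"content":"Jellyfin lost GPU access; restarting container"}' \
         "$WEBHOOK_URL"
    docker restart jellyfin
fi
```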

**Manual check:**

```bash
ssh ubuntu-manticore "python3 ~/scripts/jellyfin_gpu_monitor.py --check"
```

**FFmpeg exit code 187**: Indicates NVENC failure due to lost GPU access. The monitor catches this condition before users report playback failures.

## Troubleshooting

### Common Issues

1. **No GPU in container**: Check Docker/Podman GPU passthrough config
2. **Transcoding failures**: Verify codec support for your GPU generation
3. **Slow playback start**: Check network mount performance
4. **Cache filling up**: Monitor trickplay/thumbnail generation
5. **FFmpeg exit code 187**: GPU access lost; the monitor should auto-restart the container

### Diagnostic Commands

```bash
# GPU status
nvidia-smi

# Container GPU access
docker exec <container> nvidia-smi

# Encoder/decoder utilization
nvidia-smi dmon -s u

# Container logs
docker logs <container> 2>&1 | tail -50
```