claude-home/patterns/docker/gpu-acceleration.md

# GPU Acceleration in Docker Containers

## Overview
Patterns for enabling GPU acceleration in Docker containers, particularly for media transcoding workloads.

## NVIDIA Container Toolkit Approach

### Modern Method (CDI - Container Device Interface)
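```bash
# Generate the CDI specification on the host
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
```

```yaml
# docker-compose.yml: request all GPUs by CDI device name
# (assumes the Docker daemon has CDI support enabled)
services:
  app:
    devices:
      - nvidia.com/gpu=all
```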
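Before pointing a container at `nvidia.com/gpu=all`, it can help to confirm that a CDI spec actually exists on the host. The following is a minimal sketch, not an official check: `cdi_spec_count` is a hypothetical helper name, and `/etc/cdi` is assumed to be the spec directory because that is where the command above writes its output.

```bash
#!/usr/bin/env bash
# Sketch: confirm a CDI spec is present before using nvidia.com/gpu=all.
# cdi_spec_count is a hypothetical helper, not part of nvidia-ctk.

# Print the number of CDI spec files (*.yaml, *.json) in a directory.
cdi_spec_count() {
  dir="${1:-/etc/cdi}"
  count=0
  for f in "$dir"/*.yaml "$dir"/*.json; do
    # An unmatched glob stays literal, so test that the file really exists.
    if [ -e "$f" ]; then
      count=$((count + 1))
    fi
  done
  echo "$count"
}

echo "CDI spec files in /etc/cdi: $(cdi_spec_count /etc/cdi)"

# If the toolkit is installed, also list the device names the spec exposes.
if command -v nvidia-ctk >/dev/null 2>&1; then
  nvidia-ctk cdi list || echo "nvidia-ctk could not list CDI devices"
fi
```

A count of 0 suggests the `nvidia-ctk cdi generate` step has not run yet or wrote its output somewhere else.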

### Legacy Method (Runtime)
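```bash
# Register the nvidia runtime in the Docker daemon configuration
sudo nvidia-ctk runtime configure --runtime=docker

# Restart Docker so the new runtime is picked up
sudo systemctl restart docker
```

```yaml
# docker-compose.yml: select the nvidia runtime explicitly
services:
  app:
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
```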
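A quick way to see whether the runtime registration took effect is to look for `nvidia` in the runtimes Docker reports. This is a sketch under the assumption that `docker info --format '{{json .Runtimes}}'` prints a JSON object keyed by runtime name; `has_nvidia_runtime` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Sketch: check whether Docker knows about the "nvidia" runtime.
# has_nvidia_runtime is a hypothetical helper; it reads the runtimes JSON on stdin.

has_nvidia_runtime() {
  # The runtime name appears as a quoted JSON key.
  grep -q '"nvidia"'
}

if command -v docker >/dev/null 2>&1; then
  if docker info --format '{{json .Runtimes}}' 2>/dev/null | has_nvidia_runtime; then
    echo "nvidia runtime is registered"
  else
    echo "nvidia runtime not found: rerun nvidia-ctk runtime configure and restart Docker"
  fi
fi
```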

### Compose v3 Method (Deploy)
```yaml
services:
  app:
    deploy:
      resources:
        reservations:
          devices:
          - driver: nvidia
            count: all
            capabilities: [gpu]
```

## Hardware Considerations

### High-End Consumer GPUs (RTX 4080/4090)
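- Strong NVENC/NVDEC throughput, including hardware AV1 encoding
- Can run several transcodes concurrently, though GeForce drivers cap the number of simultaneous NVENC sessions
- Ample VRAM (16 GB on the RTX 4080, 24 GB on the RTX 4090) for high-resolution or many parallel streams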

### Multi-GPU Setups
```yaml
environment:
  # Expose specific GPUs by index
  - NVIDIA_VISIBLE_DEVICES=0,1
  # Or expose every GPU instead (use one of the two lines, not both)
  # - NVIDIA_VISIBLE_DEVICES=all
```
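To find out which indices exist on a host, the GPU list can be read from `nvidia-smi` and joined into a value for `NVIDIA_VISIBLE_DEVICES`. The sketch below assumes the CSV output has the index in the first column; `gpu_indices` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Sketch: build a NVIDIA_VISIBLE_DEVICES value from the GPUs on this host.
# gpu_indices is a hypothetical helper; it reads "index, name" CSV lines on stdin.

gpu_indices() {
  # Keep the first CSV column, strip spaces, join the lines with commas.
  cut -d, -f1 | tr -d ' ' | paste -sd, -
}

if command -v nvidia-smi >/dev/null 2>&1; then
  # Show index and name so the right cards can be picked by hand.
  nvidia-smi --query-gpu=index,name --format=csv,noheader
  echo "NVIDIA_VISIBLE_DEVICES=$(nvidia-smi --query-gpu=index,name --format=csv,noheader | gpu_indices)"
fi
```

`NVIDIA_VISIBLE_DEVICES` also accepts GPU UUIDs, which may be the safer choice if index ordering is not stable on a given machine.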

## Troubleshooting Patterns

### Gradual Enablement
1. Start with CPU-only configuration
2. Verify container functionality
3. Add GPU support incrementally
4. Test with simple workloads first
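The steps above can be sketched as a staged check that stops at the first failure. This is an illustrative script rather than a fixed procedure: `run_stage` and `gradual_check` are hypothetical helpers, and the `ubuntu:20.04` image is reused from the testing commands elsewhere in this document.

```bash
#!/usr/bin/env bash
# Sketch: staged GPU enablement checks. Helper names are hypothetical.

# run_stage NAME CMD...: run one check quietly and report the result.
run_stage() {
  name="$1"
  shift
  if "$@" >/dev/null 2>&1; then
    echo "ok: $name"
  else
    echo "FAIL: $name"
    return 1
  fi
}

# Run the stages in order; the && chain stops at the first failing stage.
gradual_check() {
  run_stage "container engine reachable" docker info &&
  run_stage "CPU-only container runs" docker run --rm ubuntu:20.04 true &&
  run_stage "GPU visible in container" docker run --rm --gpus all ubuntu:20.04 nvidia-smi
}
```

Source the file and call `gradual_check`. The last stage printed shows how far the setup gets before GPU support breaks.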

### Fallback Strategy
```yaml
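# Expose the Intel/AMD render device alongside the NVIDIA reservation.
# Note: the NVIDIA reservation is a hard requirement, so the container
# will not start on a host without a working NVIDIA driver.
devices:
  - /dev/dri:/dev/dri  # Intel/AMD GPU fallback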
deploy:
  resources:
    reservations:
      devices:
      - driver: nvidia
        count: all
        capabilities: [gpu]
```
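One way to get a true fallback is to keep the NVIDIA reservation in a separate Compose override file and include it only when a GPU is detected. The sketch below assumes two hypothetical files, `docker-compose.yml` (CPU or `/dev/dri` only) and `docker-compose.gpu.yml` (the NVIDIA reservation); `compose_files` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Sketch: add the GPU override file only when a probe command succeeds.
# File names and compose_files are hypothetical.

# compose_files PROBE_CMD...: print the -f arguments for docker compose.
compose_files() {
  if "$@" >/dev/null 2>&1; then
    echo "-f docker-compose.yml -f docker-compose.gpu.yml"
  else
    echo "-f docker-compose.yml"
  fi
}

# nvidia-smi -L lists GPUs and fails when no driver or device is present.
echo "Would run: docker compose $(compose_files nvidia-smi -L) up -d"
```

Passing the probe as arguments keeps the selection logic easy to test and lets a different check be swapped in.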

## Common Issues
- Docker service restart failures after toolkit install
- CDI vs runtime configuration conflicts
- Distribution-specific package differences
- Permission issues with device access

## Critical Fedora/Nobara GPU Issue

### Problem: Docker Desktop GPU Integration Failure
On Fedora-family systems (Fedora, RHEL, CentOS, Nobara), Docker Desktop often fails to work with the NVIDIA Container Toolkit. Typical symptoms:
- `CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected`
- `unknown or invalid runtime name: nvidia`
- Manually mounting the device nodes works, but the CUDA runtime still fails to initialize
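
When these symptoms appear, it is worth confirming which engine the `docker` CLI is actually talking to. The sketch below assumes Docker Desktop reports itself as `Docker Desktop` in the `OperatingSystem` field of `docker info`; `is_docker_desktop` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Sketch: detect whether the docker CLI is connected to Docker Desktop.
# is_docker_desktop is a hypothetical helper.

# Succeeds when the given OperatingSystem string mentions Docker Desktop.
is_docker_desktop() {
  case "$1" in
    *"Docker Desktop"*) return 0 ;;
    *) return 1 ;;
  esac
}

if command -v docker >/dev/null 2>&1; then
  os="$(docker info --format '{{.OperatingSystem}}' 2>/dev/null || true)"
  if is_docker_desktop "$os"; then
    echo "Docker Desktop detected: containers run inside a VM"
  else
    echo "Engine reports: $os"
  fi
fi
```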

### Solution: Use Podman Instead
```bash
# Podman with CDI (requires the CDI spec at /etc/cdi/nvidia.yaml)
podman run -d --name container-name \
    --device nvidia.com/gpu=all \
    --restart unless-stopped \
    -e NVIDIA_DRIVER_CAPABILITIES=all \
    -e NVIDIA_VISIBLE_DEVICES=all \
    image:tag
```

### Why Podman Works Better on Fedora
- Native systemd integration
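- Runs containers directly on the host kernel, with no VM layer between the container and the GPU
- Default container engine for RHEL/Fedora
- Consumes CDI device names natively through `--device`, so no separate runtime registration is needed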

### Testing Commands
```bash
# Test Docker (often fails on Fedora)
docker run --rm --gpus all ubuntu:20.04 nvidia-smi

# Test Podman (works on Fedora)
podman run --rm --device nvidia.com/gpu=all ubuntu:20.04 nvidia-smi
```

### Recommendation by OS
- **Fedora/RHEL/CentOS/Nobara**: Use Podman
- **Ubuntu/Debian**: Use Docker
- **When in doubt**: Test both, use what works
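
The recommendation above can be sketched as a lookup against `/etc/os-release`. This is only an illustration: `recommend_engine` is a hypothetical helper, and it assumes the distribution identifies itself through `ID` or `ID_LIKE` values such as `fedora`, `rhel`, `centos`, `nobara`, `ubuntu`, or `debian`.

```bash
#!/usr/bin/env bash
# Sketch: map the distribution family to the engine recommended above.
# recommend_engine is a hypothetical helper.

# Runs in a subshell so the sourced variables do not leak into the caller.
recommend_engine() (
  file="${1:-/etc/os-release}"
  ID=""
  ID_LIKE=""
  # os-release files use shell-compatible KEY=value assignments.
  . "$file"
  case " $ID $ID_LIKE " in
    *" fedora "*|*" rhel "*|*" centos "*|*" nobara "*) echo "podman" ;;
    *" ubuntu "*|*" debian "*) echo "docker" ;;
    *) echo "test both" ;;
  esac
)

if [ -r /etc/os-release ]; then
  echo "Suggested engine: $(recommend_engine /etc/os-release)"
fi
```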

## Media Transcoding Example (Tdarr)
```bash
# Working Podman command for Tdarr on Fedora
podman run -d --name tdarr-node-gpu \
    --device nvidia.com/gpu=all \
    --restart unless-stopped \
    -e nodeName=workstation-gpu \
    -e serverIP=10.10.0.43 \
    -e NVIDIA_VISIBLE_DEVICES=all \
    -v ./media:/media \
    -v ./tmp:/temp \
    ghcr.io/haveagitgat/tdarr_node:latest
```
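
Once the node is running, a short check can confirm that the GPU is visible from inside the container. The sketch below reuses the container name `tdarr-node-gpu` from the command above and assumes `nvidia-smi` is injected into the container by the CDI device; `gpu_visible_in` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Sketch: verify GPU visibility inside a running container.
# gpu_visible_in is a hypothetical helper.

# gpu_visible_in ENGINE CONTAINER: succeed if nvidia-smi -L lists a GPU.
gpu_visible_in() {
  # nvidia-smi -L prints one line per device, each starting with "GPU ".
  "$1" exec "$2" nvidia-smi -L 2>/dev/null | grep -q '^GPU '
}

if command -v podman >/dev/null 2>&1; then
  if gpu_visible_in podman tdarr-node-gpu; then
    echo "GPU is visible inside tdarr-node-gpu"
  else
    echo "No GPU visible: check the CDI spec and the --device flag"
  fi
fi
```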