# NVIDIA GPU Container Troubleshooting Guide

## Key Insights from Fedora/Nobara GPU Container Issues

### Problem: Docker Desktop vs Podman GPU Support on Fedora-based Systems

**Issue**: Docker Desktop on Fedora/Nobara systems often fails to expose NVIDIA GPUs to containers, even when the NVIDIA Container Toolkit is installed and configured correctly.

**Symptoms**:
- `CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected`
- `unknown or invalid runtime name: nvidia`
- Device nodes are created, but the CUDA runtime fails to initialize
- Manual device creation (`mknod`) succeeds, but CUDA still fails

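**Root Cause**: Docker Desktop for Linux runs its Docker Engine inside a virtual machine, so containers do not get direct access to the host GPU. This virtualization layer interferes with hardware access on Fedora-based systems.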
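
Before spending time on GPU flags, it can help to confirm which Docker endpoint the `docker` CLI is talking to. The following is a minimal sketch, assuming Docker Desktop reports `Docker Desktop` in the `OperatingSystem` field of `docker info`; the function names `is_docker_desktop` and `has_nvidia_runtime` are illustrative and not part of any tool.

```bash
# Return 0 if the active Docker endpoint identifies itself as Docker Desktop.
# Assumption: Docker Desktop reports "Docker Desktop" as its OperatingSystem.
is_docker_desktop() {
    os_name=$(docker info --format '{{.OperatingSystem}}' 2>/dev/null) || return 1
    case "$os_name" in
        *"Docker Desktop"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Return 0 if a runtime named "nvidia" is registered with the daemon.
# A missing entry matches the "unknown or invalid runtime name: nvidia" symptom.
has_nvidia_runtime() {
    docker info --format '{{json .Runtimes}}' 2>/dev/null | grep -q '"nvidia"'
}
```

For example, `is_docker_desktop && echo "Docker Desktop VM in use"` gives a quick hint that the symptoms above may come from the VM layer rather than from the toolkit.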

## Solution: Use Podman Instead of Docker

### Why Podman Works Better on Fedora
- **Native integration**: Better integration with systemd and Linux security contexts
- **Direct hardware access**: No VM layer interfering with GPU communication
- **Superior NVIDIA toolkit support**: Works with the same nvidia-container-toolkit installation
- **Built for Fedora**: Designed as the default container engine for RHEL/Fedora systems

### Verification Commands
```bash
# Test basic GPU access with Podman (should work)
podman run --rm --device nvidia.com/gpu=all ubuntu:20.04 nvidia-smi

# Test basic GPU access with Docker (often fails on Fedora)
docker run --rm --gpus all ubuntu:20.04 nvidia-smi
```

## Complete GPU Container Setup for Fedora/Nobara

### Prerequisites
1. NVIDIA drivers installed and working (`nvidia-smi` functional)
2. nvidia-container-toolkit installed via DNF
3. Podman installed (`dnf install podman`)
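
The prerequisites above can be checked in one pass. This is a small sketch; `check_commands` is a hypothetical helper name, and it only verifies that each command is on `PATH`, not that the driver itself works.

```bash
# Print "ok: <cmd>" or "missing: <cmd>" for each command name given.
# Returns non-zero if any command is not found on PATH.
check_commands() {
    status=0
    for cmd in "$@"; do
        if command -v "$cmd" >/dev/null 2>&1; then
            echo "ok: $cmd"
        else
            echo "missing: $cmd"
            status=1
        fi
    done
    return "$status"
}
```

Typical usage for this guide would be `check_commands nvidia-smi nvidia-ctk podman`.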

### NVIDIA Container Toolkit Installation
```bash
# Install NVIDIA container toolkit
sudo dnf install nvidia-container-toolkit

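# Configure the Docker runtime (may not help with Docker Desktop, but worth trying)
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker    # restart the daemon so the new runtime is picked up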

# The key insight: Podman works without additional configuration!
```
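
The `--device nvidia.com/gpu=all` syntax used with Podman relies on a CDI (Container Device Interface) specification. If Podman reports that the device cannot be resolved, the spec may not have been generated on that host. The sketch below assumes the spec lives in `/etc/cdi` or `/var/run/cdi`; `cdi_spec_present` is a hypothetical helper name.

```bash
# Return 0 if at least one CDI spec file (*.yaml or *.json) exists
# in any of the directories passed as arguments.
cdi_spec_present() {
    for dir in "$@"; do
        for spec in "$dir"/*.yaml "$dir"/*.json; do
            if [ -f "$spec" ]; then
                return 0
            fi
        done
    done
    return 1
}

# Possible flow on a GPU host (commented out because it needs root):
# cdi_spec_present /etc/cdi /var/run/cdi ||
#     sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
# nvidia-ctk cdi list    # lists the device names Podman can resolve
```

On systems where the spec is already generated automatically, the check simply succeeds and nothing else is needed.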

### Working Podman Command Template
```bash
podman run -d --name container-name \
    --device nvidia.com/gpu=all \
    --restart unless-stopped \
    -e NVIDIA_DRIVER_CAPABILITIES=all \
    -e NVIDIA_VISIBLE_DEVICES=all \
    [other options] \
    image:tag
```

## Troubleshooting Steps (In Order)

### 1. Verify Host GPU Access
```bash
nvidia-smi                    # Should show GPU info
lsmod | grep nvidia          # Should show nvidia modules loaded
ls -la /dev/nvidia*          # Should show device files
```
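
For scripted checks, the same host-level facts can be read without parsing command output by eye. This is a sketch under the assumption that `/proc/modules` and `/dev` are the sources of truth; both helper names are illustrative, and the optional path arguments exist only so the functions can be run against a saved copy.

```bash
# Return 0 if the "nvidia" kernel module appears in a modules listing.
# The listing defaults to /proc/modules.
nvidia_module_loaded() {
    modules_file=${1:-/proc/modules}
    grep -q '^nvidia ' "$modules_file"
}

# Print the number of entries named nvidia* under a directory (default: /dev).
count_nvidia_devices() {
    dev_dir=${1:-/dev}
    count=0
    for node in "$dev_dir"/nvidia*; do
        if [ -e "$node" ]; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}
```

A result of `0` from `count_nvidia_devices`, or a failing `nvidia_module_loaded`, points to a host driver problem rather than a container runtime problem.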

### 2. Test Container Runtime
```bash
# Try Podman first (recommended for Fedora)
podman run --rm --device nvidia.com/gpu=all ubuntu:20.04 nvidia-smi

# If Podman works but Docker doesn't, use Podman for production
```
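
To compare both engines with the same test, the runtime check can be wrapped in a function that reports a simple PASS or FAIL. This is a sketch that reuses the commands from this guide; `gpu_flag_for` and `gpu_smoke_test` are hypothetical names.

```bash
# Map a container engine to the GPU flag this guide uses for it.
gpu_flag_for() {
    case "$1" in
        podman) echo "--device nvidia.com/gpu=all" ;;
        docker) echo "--gpus all" ;;
        *) echo "unsupported engine: $1" >&2; return 1 ;;
    esac
}

# Run nvidia-smi in a throwaway container and report PASS or FAIL.
gpu_smoke_test() {
    engine=$1
    flag=$(gpu_flag_for "$engine") || return 1
    # $flag is intentionally unquoted so it splits into separate arguments.
    if "$engine" run --rm $flag ubuntu:20.04 nvidia-smi >/dev/null 2>&1; then
        echo "PASS: $engine"
    else
        echo "FAIL: $engine"
        return 1
    fi
}
```

Running `gpu_smoke_test podman` and then `gpu_smoke_test docker` gives a side-by-side result that is easy to record when documenting which runtime works.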

### 3. Check NVIDIA Container Toolkit
```bash
rpm -qa | grep nvidia-container-toolkit
nvidia-ctk --version
```

### 4. Verify CUDA Library Locations
```bash
# Find CUDA libraries
rpm -ql nvidia-driver-cuda-libs | grep libcuda
ldconfig -p | grep cuda

# Common locations:
# /usr/lib64/libcuda.so*
# /usr/lib64/libnvidia-encode.so*
```
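
The library lookup can also be scripted. The sketch below assumes the usual `ldconfig -p` output format, in which each line starts with whitespace followed by the library name; `lib_in_ld_cache` is an illustrative helper name.

```bash
# Return 0 if the dynamic linker cache lists a library whose name
# starts with the given prefix (for example "libcuda.so").
lib_in_ld_cache() {
    ldconfig -p 2>/dev/null | grep -q "[[:space:]]$1"
}

# Print one status line per library prefix passed as an argument.
report_libs() {
    for lib in "$@"; do
        if lib_in_ld_cache "$lib"; then
            echo "found: $lib"
        else
            echo "not in cache: $lib"
        fi
    done
}
```

For the libraries named above, the call would be `report_libs libcuda.so libnvidia-encode.so`.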

## Common Misconceptions

### ❌ Docker Should Always Work
**Reality**: Docker Desktop has known issues with GPU access on some Linux distributions, especially Fedora-based systems.

### ❌ More Privileges = Better GPU Access
**Reality**: Adding `privileged: true` or mounting devices manually doesn't solve Docker Desktop's fundamental GPU integration issues.

### ❌ The NVIDIA Container Toolkit Is the Problem
**Reality**: The toolkit works fine; the issue is Docker Desktop's compatibility with it on Fedora systems.

## Best Practices

### For Fedora/RHEL/CentOS Systems
1. **Use Podman by default** for GPU containers
2. Test Docker as fallback, but expect issues
3. Podman Compose works for orchestration
4. No special configuration needed beyond nvidia-container-toolkit

### For Production Deployments
1. Test both Docker and Podman in your environment
2. Use whichever works reliably (often Podman on Fedora)
3. Document which container runtime is used
4. Include runtime in deployment scripts
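
One way to make the chosen runtime explicit in deployment scripts is to resolve it in a single place. This is a minimal sketch; the `CONTAINER_ENGINE` variable and the `select_engine` function are hypothetical names chosen for illustration, with Podman as the default in line with the recommendation above.

```bash
# Print the container engine to use for deployments.
# CONTAINER_ENGINE is a hypothetical override; the default is podman.
select_engine() {
    engine=${CONTAINER_ENGINE:-podman}
    case "$engine" in
        podman|docker) echo "$engine" ;;
        *) echo "unsupported CONTAINER_ENGINE: $engine" >&2; return 1 ;;
    esac
}
```

A script can then call `engine=$(select_engine) || exit 1` once and use `"$engine"` everywhere, which keeps the documented runtime and the one actually used in sync.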

## Success Indicators

### GPU Container Working Correctly
- `nvidia-smi` runs inside container
- NVENC/CUDA applications detect GPU
- No "CUDA_ERROR_NO_DEVICE" errors
- Hardware encoder shows as available in applications

### Example: Successful Tdarr Node
```bash
# Container logs should show:
# h264_nvenc-true-true,hevc_nvenc-true-true,av1_nvenc-true-true

# FFmpeg test should succeed:
podman exec container-name ffmpeg -f lavfi -i testsrc2=duration=1:size=320x240:rate=1 -c:v h264_nvenc -t 1 /tmp/test.mp4
```
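
To see which NVENC encoders the container's FFmpeg build knows about, its encoder listing can be filtered. The sketch below assumes the usual `ffmpeg -encoders` layout, where the encoder name is the second whitespace-separated column; `list_nvenc_encoders` is an illustrative name.

```bash
# Read an `ffmpeg -encoders` listing on stdin and print the names of
# the NVENC encoders it contains, one per line.
list_nvenc_encoders() {
    awk '$2 ~ /_nvenc$/ { print $2 }'
}
```

A possible invocation is `podman exec container-name ffmpeg -hide_banner -encoders | list_nvenc_encoders`. An encoder appearing in this list only shows that it was compiled in, so the actual encode test above remains the real proof that the GPU is reachable.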

## System-Specific Notes

### Nobara/Fedora 42
- Docker Desktop: ❌ GPU support problematic
- Podman: ✅ GPU support works out of the box
- NVIDIA Driver version: 570.169 (tested working)
- Container Toolkit version: 1.17.8 (tested working)

### Key Files and Locations
- GPU devices: `/dev/nvidia*` (auto-created)
- CUDA libraries: `/usr/lib64/libcuda.so*` (via nvidia-driver-cuda-libs package)
- Container toolkit: `nvidia-ctk` command available
- Docker daemon config: `/etc/docker/daemon.json` (used by a native Docker Engine; editing it may not help with Docker Desktop)

## Future Reference

When encountering GPU container issues on Fedora-based systems:
1. Try Podman first - it likely works immediately
2. Don't waste time troubleshooting Docker Desktop GPU issues
3. Use the same container images and configurations
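4. Podman commands are nearly identical to Docker commands (in this guide, the main difference is the GPU flag: `--device nvidia.com/gpu=all` instead of `--gpus all`)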

This approach can save hours of debugging Docker Desktop GPU integration issues on Fedora systems.