claude-home/reference/docker/nvidia-gpu-troubleshooting.md
Cal Corum d723924bdf CLAUDE: Add complete GPU transcoding solution for Tdarr containers
- Add working Podman-based GPU Tdarr startup script for Fedora systems
- Document critical Docker Desktop GPU issues on Fedora/Nobara systems
- Add comprehensive Tdarr configuration examples (CPU and GPU variants)
- Add GPU acceleration patterns and troubleshooting documentation
- Provide working solution for NVIDIA RTX GPU hardware transcoding

Key insight: Podman works immediately for GPU access on Fedora systems
where Docker Desktop fails due to virtualization layer conflicts.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-08-09 00:47:12 -05:00

NVIDIA GPU Container Troubleshooting Guide

Key Insights from Fedora/Nobara GPU Container Issues

Problem: Docker Desktop vs Podman GPU Support on Fedora-based Systems

Issue: On Fedora/Nobara systems, Docker Desktop often fails to expose NVIDIA GPUs to containers, even when the NVIDIA Container Toolkit is installed and configured correctly.

Symptoms:

  • CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
  • unknown or invalid runtime name: nvidia
  • Device nodes created but CUDA runtime fails to initialize
  • Manual device creation (mknod) works but CUDA still fails

Root Cause: Docker Desktop runs containers inside a virtual machine, and that virtualization layer interferes with direct access to the host GPU on Fedora-based systems.
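
When scanning container logs, it can help to map an error line onto the symptoms listed above. The following is a minimal sketch: the function name and the class labels it prints are hypothetical and chosen only for illustration.

```shell
# Map a log line to one of the symptom classes listed above.
# The labels (cuda-no-device, nvidia-runtime-unknown, unknown) are
# hypothetical names for this sketch, not toolkit terminology.
classify_gpu_error() {
    case "$1" in
        *CUDA_ERROR_NO_DEVICE*)
            echo "cuda-no-device" ;;
        *"unknown or invalid runtime name: nvidia"*)
            echo "nvidia-runtime-unknown" ;;
        *)
            echo "unknown" ;;
    esac
}

# Usage (container-name is a placeholder):
# classify_gpu_error "$(podman logs --tail 1 container-name 2>&1)"
```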

Solution: Use Podman Instead of Docker

Why Podman Works Better on Fedora

  • Native integration: Integrates with systemd and Linux security contexts
  • Direct hardware access: No VM layer between the container and the GPU
  • Same toolkit: Uses the same nvidia-container-toolkit installation, with no separate runtime registration
  • Fedora default: Ships as the default container engine on RHEL/Fedora systems

Verification Commands

# Test basic GPU access with Podman (should work)
podman run --rm --device nvidia.com/gpu=all ubuntu:20.04 nvidia-smi

# Test basic GPU access with Docker (often fails under Docker Desktop on Fedora)
docker run --rm --gpus all ubuntu:20.04 nvidia-smi
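
To compare both runtimes in one pass, the two test commands can be wrapped in a small helper that hides the output and reports only success or failure. This is a sketch; `probe` is a hypothetical helper name.

```shell
# Run a command quietly and report whether it succeeded.
# $1 is a label for the report; the remaining arguments are the command.
probe() {
    label=$1
    shift
    if "$@" >/dev/null 2>&1; then
        echo "$label: ok"
    else
        echo "$label: FAILED"
        return 1
    fi
}

# Usage (the same test commands as above):
# probe podman podman run --rm --device nvidia.com/gpu=all ubuntu:20.04 nvidia-smi
# probe docker docker run --rm --gpus all ubuntu:20.04 nvidia-smi
```

Because the helper discards the command output, rerun a failing command directly to see the actual error message.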

Complete GPU Container Setup for Fedora/Nobara

Prerequisites

  1. NVIDIA drivers installed and working (nvidia-smi functional)
  2. nvidia-container-toolkit installed via DNF
  3. Podman installed (dnf install podman)
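
The prerequisites above can be checked quickly before going further. The sketch below only confirms that each command is on the PATH; it does not prove that nvidia-smi can actually reach the GPU. `missing_tools` is a hypothetical helper name.

```shell
# Print the name of every command from the argument list that is not on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Usage: report what is missing from the prerequisites, if anything
# missing=$(missing_tools nvidia-smi nvidia-ctk podman)
# if [ -z "$missing" ]; then echo "all prerequisites found"; else echo "missing: $missing"; fi
```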

NVIDIA Container Toolkit Installation

# Install NVIDIA container toolkit
sudo dnf install nvidia-container-toolkit

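# Optional: register the NVIDIA runtime with Docker Engine (may not help under Docker Desktop)
sudo nvidia-ctk runtime configure --runtime=docker
# Restart the daemon so the change takes effect
sudo systemctl restart docker

# Podman needs no equivalent step: it worked here without additional configuration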
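
The `nvidia.com/gpu=all` device name used with Podman is a CDI (Container Device Interface) name, resolved through a CDI specification file. On the system described here Podman worked without extra steps. If Podman cannot resolve the device name on another machine, one thing worth checking is whether a spec file exists. The sketch below assumes the spec lives at /etc/cdi/nvidia.yaml, which may differ on your system; `cdi_spec_present` is a hypothetical helper name.

```shell
# Report whether a CDI spec file exists and is non-empty.
# The default path is an assumption; pass another path as $1 if needed.
cdi_spec_present() {
    [ -s "${1:-/etc/cdi/nvidia.yaml}" ]
}

# Usage: generate the spec only when it is missing, then list the device names
# cdi_spec_present || sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
# nvidia-ctk cdi list
```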

Working Podman Command Template

podman run -d --name container-name \
    --device nvidia.com/gpu=all \
    --restart unless-stopped \
    -e NVIDIA_DRIVER_CAPABILITIES=all \
    -e NVIDIA_VISIBLE_DEVICES=all \
    [other options] \
    image:tag

Troubleshooting Steps (In Order)

1. Verify Host GPU Access

nvidia-smi                    # Should show GPU info
lsmod | grep nvidia          # Should show nvidia modules loaded
ls -la /dev/nvidia*          # Should show device files

2. Test Container Runtime

# Try Podman first (recommended for Fedora)
podman run --rm --device nvidia.com/gpu=all ubuntu:20.04 nvidia-smi

# If Podman works but Docker doesn't, use Podman for production

3. Check NVIDIA Container Toolkit

rpm -qa | grep nvidia-container-toolkit
nvidia-ctk --version

4. Verify CUDA Library Locations

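# Find CUDA libraries (the package name depends on the driver repository;
# RPM Fusion, for example, uses xorg-x11-drv-nvidia-cuda-libs)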
rpm -ql nvidia-driver-cuda-libs | grep libcuda
ldconfig -p | grep cuda

# Common locations:
# /usr/lib64/libcuda.so*
# /usr/lib64/libnvidia-encode.so*
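
To list only the resolved paths of the two libraries named above, the `ldconfig -p` output can be filtered. This is a minimal sketch; `gpu_lib_paths` is a hypothetical helper name, and it relies on the resolved path being the last field of each `ldconfig -p` line.

```shell
# Read `ldconfig -p` output on stdin and print the resolved path of every
# libcuda / libnvidia-encode entry (the last field on each matching line).
gpu_lib_paths() {
    awk '/libcuda\.so|libnvidia-encode\.so/ { print $NF }'
}

# Usage:
# ldconfig -p | gpu_lib_paths
```

Empty output means the dynamic linker cache does not know either library, which points to a driver package problem rather than a container problem.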

Common Misconceptions

Docker Should Always Work

Wrong: Docker Desktop has known issues with GPU access on some Linux distributions, especially Fedora-based systems.

More Privileges = Better GPU Access

Wrong: Adding privileged: true or mounting devices manually does not fix Docker Desktop's fundamental GPU integration issues.

The NVIDIA Container Toolkit Is the Problem

Wrong: The toolkit itself works fine; the failure lies in Docker Desktop's compatibility with it on Fedora systems.

Best Practices

For Fedora/RHEL/CentOS Systems

  1. Use Podman by default for GPU containers
  2. Test Docker as a fallback, but expect issues
  3. Use Podman Compose for orchestration
  4. Install nvidia-container-toolkit; no other special configuration is needed

For Production Deployments

  1. Test both Docker and Podman in your environment
  2. Use whichever works reliably (often Podman on Fedora)
  3. Document which container runtime is used
  4. Name the runtime explicitly in deployment scripts
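
One way to make the runtime explicit in a deployment script is to resolve it once at the top and print it. The sketch below is an illustration only: `pick_runtime` and the `CONTAINER_RUNTIME` override variable are hypothetical names, and the preference for Podman over Docker mirrors the recommendation in this guide.

```shell
# Print the container runtime a deployment script should use.
# CONTAINER_RUNTIME is a hypothetical override variable for this sketch;
# without it, Podman is preferred and Docker is the fallback.
pick_runtime() {
    if [ -n "${CONTAINER_RUNTIME:-}" ]; then
        echo "$CONTAINER_RUNTIME"
        return 0
    fi
    for rt in podman docker; do
        if command -v "$rt" >/dev/null 2>&1; then
            echo "$rt"
            return 0
        fi
    done
    echo "no container runtime found" >&2
    return 1
}

# Usage:
# RUNTIME=$(pick_runtime) || exit 1
# echo "Using container runtime: $RUNTIME"
```

The GPU flag still differs between the two runtimes (`--device nvidia.com/gpu=all` for Podman, `--gpus all` for Docker), so a script that supports both has to switch on the result.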

Success Indicators

GPU Container Working Correctly

  • nvidia-smi runs inside container
  • NVENC/CUDA applications detect GPU
  • No "CUDA_ERROR_NO_DEVICE" errors
  • Hardware encoder shows as available in applications

Example: Successful Tdarr Node

# Container logs should show:
# h264_nvenc-true-true,hevc_nvenc-true-true,av1_nvenc-true-true

# FFmpeg test should succeed:
podman exec container-name ffmpeg -f lavfi -i testsrc2=duration=1:size=320x240:rate=1 -c:v h264_nvenc -t 1 /tmp/test.mp4
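
# The encoder status line can also be checked mechanically. This is a sketch:
# working_encoders is a hypothetical helper, and treating a "-true-true"
# suffix as "encoder usable" is an assumption based on the example line above.
# It expects a line holding only the comma-separated list.

```shell
# Read an encoder status line on stdin and print the name of every encoder
# whose entry ends in "-true-true", one per line.
working_encoders() {
    tr ',' '\n' | sed -n 's/-true-true$//p'
}

# Usage:
# echo 'h264_nvenc-true-true,hevc_nvenc-true-true,av1_nvenc-true-true' | working_encoders
```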

System-Specific Notes

Nobara/Fedora 42

  • Docker Desktop: GPU support problematic
  • Podman: GPU support works out of the box
  • NVIDIA Driver version: 570.169 (tested working)
  • Container Toolkit version: 1.17.8 (tested working)

Key Files and Locations

  • GPU devices: /dev/nvidia* (auto-created)
  • CUDA libraries: /usr/lib64/libcuda.so* (via nvidia-driver-cuda-libs package)
  • Container toolkit: nvidia-ctk command available
  • Docker daemon config: /etc/docker/daemon.json (editing it may not help under Docker Desktop)

Future Reference

When encountering GPU container issues on Fedora-based systems:

  1. Try Podman first - it likely works immediately
  2. Don't waste time troubleshooting Docker Desktop GPU issues
  3. Use the same container images and configurations
  4. Podman commands are nearly identical to Docker commands

This approach saves hours of debugging Docker Desktop GPU integration issues on Fedora systems.