claude-home/docker/examples/nvidia-gpu-troubleshooting.md
Cal Corum 10c9e0d854 CLAUDE: Migrate to technology-first documentation architecture
2025-08-12 23:20:15 -05:00

NVIDIA GPU Container Troubleshooting Guide

Key Insights from Fedora/Nobara GPU Container Issues

Problem: Docker Desktop vs Podman GPU Support on Fedora-based Systems

Issue: Docker Desktop on Fedora/Nobara systems frequently fails to expose NVIDIA GPUs to containers, even when the NVIDIA Container Toolkit is installed and configured correctly.

Symptoms:

  • CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
  • unknown or invalid runtime name: nvidia
  • Device nodes created but CUDA runtime fails to initialize
  • Manual device creation (mknod) works but CUDA still fails

Root Cause: Docker Desktop runs containers inside its own virtual machine, and that virtualization layer prevents direct access to the host GPU on Fedora-based systems.
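
Because the root cause is specific to Docker Desktop, it can help to confirm which Docker backend the CLI is actually talking to. The following is a minimal sketch, assuming that `docker info` reports the string `Docker Desktop` in its `OperatingSystem` field on Docker Desktop installs; the function name `is_docker_desktop` is hypothetical.

```shell
#!/usr/bin/env bash
# Decide whether an "Operating System" string reported by `docker info`
# belongs to Docker Desktop. This is a pure string check, so it can be
# exercised without a running Docker daemon.
is_docker_desktop() {
    case "$1" in
        *"Docker Desktop"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage (requires a reachable Docker daemon):
#   os="$(docker info --format '{{.OperatingSystem}}' 2>/dev/null)"
#   if is_docker_desktop "$os"; then
#       echo "Docker Desktop detected: containers run inside a VM"
#   fi
```

If the check matches, the GPU failures described above are the expected outcome and switching to Podman is likely the quicker path.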

Solution: Use Podman Instead of Docker

Why Podman Works Better on Fedora

  • Native integration: Better integration with systemd and Linux security contexts
  • Direct hardware access: No VM layer interfering with GPU communication
  • Superior NVIDIA toolkit support: Works with the same nvidia-container-toolkit installation
  • Built for Fedora: Designed as the default container engine for RHEL/Fedora systems

Verification Commands

# Test basic GPU access with Podman (should work)
podman run --rm --device nvidia.com/gpu=all ubuntu:20.04 nvidia-smi

# Test basic GPU access with Docker (often fails on Fedora)
docker run --rm --gpus all ubuntu:20.04 nvidia-smi

Complete GPU Container Setup for Fedora/Nobara

Prerequisites

  1. NVIDIA drivers installed and working (nvidia-smi functional)
  2. nvidia-container-toolkit installed via DNF
  3. Podman installed (sudo dnf install podman)
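
The prerequisites above can be checked in one pass before going further. This is a rough sketch; the helper name `missing_commands` is hypothetical, and it only verifies that each command is on the PATH, not that the driver itself is healthy.

```shell
#!/usr/bin/env bash
# Print the name of every command in the argument list that is not on PATH.
# Prints nothing when all commands are present.
missing_commands() {
    local cmd
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
    done
}

# Usage:
#   missing="$(missing_commands nvidia-smi nvidia-ctk podman)"
#   [ -z "$missing" ] || echo "Missing prerequisites: $missing"
```

A clean result here still needs to be followed by actually running nvidia-smi, since an installed binary does not guarantee a loaded driver.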

NVIDIA Container Toolkit Installation

# Install NVIDIA container toolkit
sudo dnf install nvidia-container-toolkit

# Configure the Docker runtime (may not work, but worth trying)
sudo nvidia-ctk runtime configure --runtime=docker
# Restart the Docker service so the runtime change takes effect
sudo systemctl restart docker

# The key insight: Podman works without additional configuration!
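
The `--device nvidia.com/gpu=all` syntax used throughout this guide is a CDI (Container Device Interface) device name. On the tested system no extra step was needed, but if Podman reports that it cannot resolve that device name, checking for a CDI spec file may be worth a try. The sketch below assumes the commonly used spec directories `/etc/cdi` and `/var/run/cdi`; the helper name `has_cdi_spec` is hypothetical.

```shell
#!/usr/bin/env bash
# Succeed if any of the given directories contains a CDI spec file
# (*.yaml or *.json). Directories that do not exist are simply skipped.
has_cdi_spec() {
    local dir f
    for dir in "$@"; do
        for f in "$dir"/*.yaml "$dir"/*.json; do
            # An unmatched glob stays literal and fails the -f test.
            [ -f "$f" ] && return 0
        done
    done
    return 1
}

# Usage (fallback only, not required on the tested setup):
#   has_cdi_spec /etc/cdi /var/run/cdi || \
#       sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
#   nvidia-ctk cdi list    # should list nvidia.com/gpu=all
```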

Working Podman Command Template

podman run -d --name container-name \
    --device nvidia.com/gpu=all \
    --restart unless-stopped \
    -e NVIDIA_DRIVER_CAPABILITIES=all \
    -e NVIDIA_VISIBLE_DEVICES=all \
    [other options] \
    image:tag

Troubleshooting Steps (In Order)

1. Verify Host GPU Access

nvidia-smi                    # Should show GPU info
lsmod | grep nvidia          # Should show nvidia modules loaded
ls -la /dev/nvidia*          # Should show device files
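
The three host checks above can be wrapped so that each one reports a clear pass or fail. This is only a sketch; `run_check` is a hypothetical helper, and it hides the command output, so rerun a failing command by hand to see the details.

```shell
#!/usr/bin/env bash
# Run a command quietly and report PASS or FAIL with a human-readable label.
# Returns non-zero on failure so callers can stop at the first broken step.
run_check() {
    local label="$1"
    shift
    if "$@" >/dev/null 2>&1; then
        printf 'PASS: %s\n' "$label"
    else
        printf 'FAIL: %s\n' "$label"
        return 1
    fi
}

# Usage, in the same order as the manual checks:
#   run_check "nvidia-smi runs"       nvidia-smi
#   run_check "nvidia modules loaded" sh -c 'lsmod | grep -q nvidia'
#   run_check "device nodes exist"    sh -c 'ls /dev/nvidia* >/dev/null'
```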

2. Test Container Runtime

# Try Podman first (recommended for Fedora)
podman run --rm --device nvidia.com/gpu=all ubuntu:20.04 nvidia-smi

# If Podman works but Docker doesn't, use Podman for production

3. Check NVIDIA Container Toolkit

rpm -qa | grep nvidia-container-toolkit
nvidia-ctk --version

4. Verify CUDA Library Locations

# Find CUDA libraries
rpm -ql nvidia-driver-cuda-libs | grep libcuda
ldconfig -p | grep cuda

# Common locations:
# /usr/lib64/libcuda.so*
# /usr/lib64/libnvidia-encode.so*
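
To confirm that both libraries named above are visible to the dynamic linker, the `ldconfig -p` output can be searched per library. A minimal sketch, with `lib_in_ldconfig` as a hypothetical helper that reads the cache listing from stdin:

```shell
#!/usr/bin/env bash
# Read `ldconfig -p` output on stdin and succeed if the given library name
# appears anywhere in it (fixed-string match, not a regular expression).
lib_in_ldconfig() {
    grep -qF -- "$1"
}

# Usage:
#   cache="$(ldconfig -p)"
#   for lib in libcuda.so libnvidia-encode.so; do
#       printf '%s\n' "$cache" | lib_in_ldconfig "$lib" \
#           || echo "not in linker cache: $lib"
#   done
```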

Common Misconceptions

Docker Should Always Work

Wrong: Docker Desktop has known issues with GPU access on some Linux distributions, especially Fedora-based systems.

More Privileges = Better GPU Access

Wrong: Adding privileged: true or manual device mounting doesn't solve Docker Desktop's fundamental GPU integration issues.

The NVIDIA Container Toolkit Is the Problem

Wrong: The toolkit works fine - the issue is Docker Desktop's compatibility with it on Fedora systems.

Best Practices

For Fedora/RHEL/CentOS Systems

  1. Use Podman by default for GPU containers
  2. Test Docker as a fallback, but expect issues
  3. Podman Compose works for orchestration
  4. No special configuration needed beyond nvidia-container-toolkit

For Production Deployments

  1. Test both Docker and Podman in your environment
  2. Use whichever works reliably (often Podman on Fedora)
  3. Document which container runtime is used
  4. Include runtime in deployment scripts
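
One way to include the runtime in deployment scripts is to keep the runtime name in a single variable and derive the GPU flags from it. The sketch below uses only the two flag forms shown earlier in this guide; the function name `gpu_flags` and the variable `CONTAINER_RUNTIME` are hypothetical.

```shell
#!/usr/bin/env bash
# Print the GPU flags this guide uses for the given container runtime.
# Fails with a message on stderr for any runtime it does not know.
gpu_flags() {
    case "$1" in
        podman) printf '%s\n' '--device nvidia.com/gpu=all' ;;
        docker) printf '%s\n' '--gpus all' ;;
        *) printf 'unsupported runtime: %s\n' "$1" >&2; return 1 ;;
    esac
}

# Usage:
#   CONTAINER_RUNTIME="${CONTAINER_RUNTIME:-podman}"
#   # The command substitution is left unquoted on purpose so that the
#   # flags split into separate arguments.
#   "$CONTAINER_RUNTIME" run --rm $(gpu_flags "$CONTAINER_RUNTIME") \
#       ubuntu:20.04 nvidia-smi
```

Defaulting to podman matches the recommendation for Fedora; overriding the variable documents any deviation in the script itself.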

Success Indicators

GPU Container Working Correctly

  • nvidia-smi runs inside container
  • NVENC/CUDA applications detect GPU
  • No "CUDA_ERROR_NO_DEVICE" errors
  • Hardware encoder shows as available in applications
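
The "no errors" indicator can be checked mechanically by scanning container logs for the failure strings listed in the Symptoms section. A rough sketch, with `find_gpu_errors` as a hypothetical helper; it only knows those two strings, so an empty result is a good sign rather than proof.

```shell
#!/usr/bin/env bash
# Read log text on stdin and print each known GPU failure string found in it,
# one per line, without duplicates. Prints nothing when none are present.
find_gpu_errors() {
    grep -oF \
        -e 'CUDA_ERROR_NO_DEVICE' \
        -e 'unknown or invalid runtime name: nvidia' | sort -u
}

# Usage:
#   podman logs container-name 2>&1 | find_gpu_errors
```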

Example: Successful Tdarr Node

# Container logs should show:
# h264_nvenc-true-true,hevc_nvenc-true-true,av1_nvenc-true-true

# FFmpeg test should succeed:
podman exec container-name ffmpeg -y -f lavfi -i testsrc2=duration=1:size=320x240:rate=1 -c:v h264_nvenc -t 1 /tmp/test.mp4
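
The encoder summary line from the logs can also be checked per encoder. The sketch below assumes that an entry of the form `name-true-true` means the encoder is usable; the meaning of the two individual flags is not documented here, so treat that reading as an assumption. The helper name `encoder_ok` is hypothetical.

```shell
#!/usr/bin/env bash
# Succeed if the comma-separated encoder summary line contains an entry
# "<name>-true-true" for the given encoder name.
encoder_ok() {
    local name="$1" line="$2"
    # Surround with commas so the first and last entries match like the rest.
    case ",$line," in
        *",$name-true-true,"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage:
#   line='h264_nvenc-true-true,hevc_nvenc-true-true,av1_nvenc-true-true'
#   encoder_ok hevc_nvenc "$line" && echo "hevc_nvenc available"
```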

System-Specific Notes

Nobara/Fedora 42

  • Docker Desktop: GPU support problematic
  • Podman: GPU support works out of the box
  • NVIDIA Driver version: 570.169 (tested working)
  • Container Toolkit version: 1.17.8 (tested working)

Key Files and Locations

  • GPU devices: /dev/nvidia* (auto-created)
  • CUDA libraries: /usr/lib64/libcuda.so* (via nvidia-driver-cuda-libs package)
  • Container toolkit: nvidia-ctk command available
  • Docker daemon config: /etc/docker/daemon.json (may not help)

Future Reference

When encountering GPU container issues on Fedora-based systems:

  1. Try Podman first - it likely works immediately
  2. Don't waste time troubleshooting Docker Desktop GPU issues
  3. Use the same container images and configurations
  4. Podman commands are nearly identical to Docker commands

This approach saves hours of debugging Docker Desktop GPU integration issues on Fedora systems.