Cal Corum 10c9e0d854 CLAUDE: Migrate to technology-first documentation architecture

2025-08-12 23:20:15 -05:00

NVIDIA GPU Container Troubleshooting Guide

Key Insights from Fedora/Nobara GPU Container Issues

Problem: Docker Desktop vs Podman GPU Support on Fedora-based Systems

Issue: Docker Desktop on Fedora/Nobara systems has significant compatibility issues with NVIDIA Container Toolkit integration, even when properly configured.

Symptoms:

CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
unknown or invalid runtime name: nvidia
Device nodes created but CUDA runtime fails to initialize
Manual device creation (mknod) works but CUDA still fails

Root Cause: Docker Desktop runs its Docker daemon inside a virtual machine, and that virtualization layer interferes with direct hardware access on Fedora-based systems.
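
Before debugging further, it can help to confirm which backend the docker CLI is actually talking to. The sketch below assumes Docker Desktop reports itself as "Docker Desktop" in the OperatingSystem field of docker info; the function name classify_docker_backend is hypothetical. It also prints the registered runtimes, which shows whether a runtime named nvidia exists at all.

```shell
# Classify the value of `docker info --format '{{.OperatingSystem}}'`.
# Assumption: Docker Desktop identifies itself with the string "Docker Desktop".
classify_docker_backend() {
    case "$1" in
        *"Docker Desktop"*) echo "desktop" ;;
        "")                 echo "unknown" ;;
        *)                  echo "engine" ;;
    esac
}

if command -v docker >/dev/null 2>&1; then
    os_name="$(docker info --format '{{.OperatingSystem}}' 2>/dev/null || true)"
    echo "Docker backend: $(classify_docker_backend "$os_name")"
    # Lists registered runtimes; look for an entry named "nvidia".
    docker info --format '{{json .Runtimes}}' 2>/dev/null || true
else
    echo "docker CLI not found"
fi
```

If the backend is reported as desktop, the symptoms above are more likely to come from the VM layer than from the toolkit installation.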

Solution: Use Podman Instead of Docker

Why Podman Works Better on Fedora

Native integration: Better integration with systemd and Linux security contexts
Direct hardware access: No VM layer interfering with GPU communication
NVIDIA toolkit support: Works with the same nvidia-container-toolkit installation that Docker uses
Built for Fedora: Designed as the default container engine for RHEL/Fedora systems

Verification Commands

# Test basic GPU access with Podman (should work)
podman run --rm --device nvidia.com/gpu=all ubuntu:20.04 nvidia-smi

# Test basic GPU access with Docker (often fails on Fedora)
docker run --rm --gpus all ubuntu:20.04 nvidia-smi
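
The two verification commands differ only in the runtime name and the GPU flag, so they can be wrapped in one helper. This is a minimal sketch; gpu_flag_for and gpu_test are hypothetical names, and the image default simply mirrors the commands above.

```shell
# Print the GPU flag each runtime expects.
gpu_flag_for() {
    case "$1" in
        podman) echo "--device nvidia.com/gpu=all" ;;   # CDI device name
        docker) echo "--gpus all" ;;
        *) echo "unsupported runtime: $1" >&2; return 2 ;;
    esac
}

# Run nvidia-smi in a throwaway container under the given runtime.
gpu_test() {
    flag="$(gpu_flag_for "$1")" || return 2
    # $flag is left unquoted on purpose so it splits into separate arguments.
    "$1" run --rm $flag "${2:-ubuntu:20.04}" nvidia-smi
}

# Show the command that gpu_test would run for each runtime.
for rt in podman docker; do
    echo "$rt: $rt run --rm $(gpu_flag_for "$rt") ubuntu:20.04 nvidia-smi"
done
```

Running gpu_test podman and then gpu_test docker gives a side-by-side comparison of the two runtimes on the same host.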

Complete GPU Container Setup for Fedora/Nobara

Prerequisites

NVIDIA drivers installed and working (nvidia-smi functional)
nvidia-container-toolkit installed via DNF
Podman installed (dnf install podman)
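
The prerequisites can be checked in one pass. The sketch below only confirms that each tool is on the PATH; it does not prove the driver is functional, which still needs a real nvidia-smi run. The function name check_cmd is hypothetical.

```shell
# Report whether a command is available on the PATH.
check_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "ok: $1"
    else
        echo "missing: $1"
        return 1
    fi
}

missing=0
for tool in nvidia-smi nvidia-ctk podman; do
    check_cmd "$tool" || missing=$((missing + 1))
done
echo "$missing prerequisite(s) missing"
```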

NVIDIA Container Toolkit Installation

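# Install the NVIDIA Container Toolkit
sudo dnf install nvidia-container-toolkit

# Register the nvidia runtime with Docker (writes /etc/docker/daemon.json).
# Worth trying, but it may not help under Docker Desktop.
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker    # lets Docker Engine pick up the new runtime

# The key insight: Podman needed no additional configuration in this setup.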
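
Podman resolves the nvidia.com/gpu=all device name through a CDI (Container Device Interface) spec. If Podman reports that the device cannot be resolved, a missing spec is one possible cause. The sketch below looks for a spec in the two standard CDI directories and prints the nvidia-ctk command that generates one; cdi_spec_present is a hypothetical helper, and whether the spec already exists on a given system is an assumption to verify.

```shell
# Return 0 if any CDI spec file (*.yaml or *.json) exists in the given directories.
cdi_spec_present() {
    for dir in "$@"; do
        for spec in "$dir"/*.yaml "$dir"/*.json; do
            [ -f "$spec" ] && return 0
        done
    done
    return 1
}

if cdi_spec_present /etc/cdi /var/run/cdi; then
    echo "CDI spec found; list device names with: nvidia-ctk cdi list"
else
    echo "No CDI spec found; generate one with:"
    echo "  sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml"
fi
```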

Working Podman Command Template

podman run -d --name container-name \
    --device nvidia.com/gpu=all \
    --restart unless-stopped \
    -e NVIDIA_DRIVER_CAPABILITIES=all \
    -e NVIDIA_VISIBLE_DEVICES=all \
    [other options] \
    image:tag
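
After starting a container from the template, a quick check is to count the GPUs it can see. This sketch assumes nvidia-smi is available inside the container and reuses the container-name placeholder from the template; count_gpus is a hypothetical helper that parses nvidia-smi -L output.

```shell
# Count the GPUs listed in `nvidia-smi -L` output read from stdin.
# Each GPU appears on a line beginning with "GPU <index>:".
count_gpus() {
    grep -c '^GPU [0-9][0-9]*:' || true
}

# Only query the container if Podman is installed and the container exists.
if command -v podman >/dev/null 2>&1 && podman container exists container-name; then
    echo "GPUs visible in container: $(podman exec container-name nvidia-smi -L | count_gpus)"
fi
```

A count of 0 from a running container suggests the --device flag was dropped or the device name was not resolved.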

Troubleshooting Steps (In Order)

1. Verify Host GPU Access

nvidia-smi                    # Should show GPU info
lsmod | grep nvidia          # Should show nvidia modules loaded
ls -la /dev/nvidia*          # Should show device files
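
The same host checks can be scripted so the result is easy to paste into a bug report. This is a sketch: loaded_nvidia_modules is a hypothetical helper that reads /proc/modules, the same data lsmod displays.

```shell
# Print the NVIDIA kernel modules listed in a /proc/modules-style file.
loaded_nvidia_modules() {
    awk '$1 ~ /^nvidia/ { print $1 }' "${1:-/proc/modules}"
}

mods="$(loaded_nvidia_modules 2>/dev/null || true)"
if [ -n "$mods" ]; then
    echo "nvidia modules loaded:"
    echo "$mods"
else
    echo "no nvidia modules loaded"
fi

# Device nodes are what containers ultimately need access to.
if ls /dev/nvidia* >/dev/null 2>&1; then
    echo "device nodes present"
else
    echo "no /dev/nvidia* device nodes"
fi
```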

2. Test Container Runtime

# Try Podman first (recommended for Fedora)
podman run --rm --device nvidia.com/gpu=all ubuntu:20.04 nvidia-smi

# If Podman works but Docker doesn't, use Podman for production

3. Check NVIDIA Container Toolkit

rpm -qa | grep nvidia-container-toolkit
nvidia-ctk --version

4. Verify CUDA Library Locations

# Find CUDA libraries
rpm -ql nvidia-driver-cuda-libs | grep libcuda
ldconfig -p | grep cuda

# Common locations:
# /usr/lib64/libcuda.so*
# /usr/lib64/libnvidia-encode.so*
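
To check both libraries at once, the linker cache listing can be piped through a small filter. This sketch assumes the two library names listed above are the ones that matter for CUDA and NVENC workloads; check_gpu_libs is a hypothetical helper.

```shell
# Report whether each required library appears in `ldconfig -p` output on stdin.
check_gpu_libs() {
    cache="$(cat)"
    status=0
    for lib in libcuda.so libnvidia-encode.so; do
        if printf '%s\n' "$cache" | grep -qF "$lib"; then
            echo "found: $lib"
        else
            echo "missing: $lib"
            status=1
        fi
    done
    return "$status"
}

if command -v ldconfig >/dev/null 2>&1; then
    ldconfig -p | check_gpu_libs || echo "one or more GPU libraries are not in the linker cache"
fi
```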

Common Misconceptions

❌ Docker Should Always Work

Wrong: Docker Desktop has known issues with GPU access on some Linux distributions, especially Fedora-based systems.

❌ More Privileges = Better GPU Access

Wrong: Adding privileged: true or manual device mounting doesn't solve Docker Desktop's fundamental GPU integration issues.

❌ The NVIDIA Container Toolkit Is the Problem

Wrong: The toolkit works fine - the issue is Docker Desktop's compatibility with it on Fedora systems.

Best Practices

For Fedora/RHEL/CentOS Systems

Use Podman by default for GPU containers
Test Docker as a fallback, but expect issues
Podman Compose works for orchestration
No special configuration is needed beyond installing nvidia-container-toolkit

For Production Deployments

Test both Docker and Podman in your environment
Use whichever works reliably (often Podman on Fedora)
Document which container runtime is used
Name the runtime explicitly in deployment scripts
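
One way to make the runtime explicit in a deployment script is to resolve it once at the top. This is a sketch: the CONTAINER_RUNTIME variable and the pick_runtime function are hypothetical names, and preferring Podman over Docker follows the recommendation in this guide.

```shell
# Choose the container runtime: an explicit CONTAINER_RUNTIME wins,
# otherwise fall back to the first of podman, docker found on the PATH.
pick_runtime() {
    if [ -n "${CONTAINER_RUNTIME:-}" ]; then
        echo "$CONTAINER_RUNTIME"
        return 0
    fi
    for candidate in podman docker; do
        if command -v "$candidate" >/dev/null 2>&1; then
            echo "$candidate"
            return 0
        fi
    done
    echo "no container runtime found" >&2
    return 1
}

if runtime="$(pick_runtime 2>/dev/null)"; then
    echo "Deploying with: $runtime"
else
    echo "Install podman or docker before deploying"
fi
```

The rest of the script can then call "$runtime" instead of hard-coding either tool, which also records the choice in the deployment log.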

Success Indicators

GPU Container Working Correctly

nvidia-smi runs inside container
NVENC/CUDA applications detect GPU
No "CUDA_ERROR_NO_DEVICE" errors
Hardware encoder shows as available in applications

Example: Successful Tdarr Node

# Container logs should show:
# h264_nvenc-true-true,hevc_nvenc-true-true,av1_nvenc-true-true

# FFmpeg test should succeed:
podman exec container-name ffmpeg -y -f lavfi -i testsrc2=duration=1:size=320x240:rate=1 -c:v h264_nvenc -t 1 /tmp/test.mp4
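
The encoder status line can also be checked mechanically. The sketch below treats any entry that does not end in -true-true as unavailable; that reading of the two flags is an assumption based only on the sample line above, and failed_encoders is a hypothetical helper.

```shell
# Print every encoder entry that does not end in "-true-true" from a
# comma-separated status line; print nothing when all entries pass.
failed_encoders() {
    printf '%s\n' "$1" | tr ',' '\n' | grep -v -- '-true-true$' || true
}

status_line='h264_nvenc-true-true,hevc_nvenc-true-true,av1_nvenc-true-true'
failed="$(failed_encoders "$status_line")"
if [ -z "$failed" ]; then
    echo "all encoders available"
else
    echo "unavailable: $failed"
fi
```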

System-Specific Notes

Nobara/Fedora 42

Docker Desktop: ❌ GPU support problematic
Podman: ✅ GPU support works out of the box
NVIDIA Driver version: 570.169 (tested working)
Container Toolkit version: 1.17.8 (tested working)

Key Files and Locations

GPU devices: /dev/nvidia* (auto-created)
CUDA libraries: /usr/lib64/libcuda.so* (via nvidia-driver-cuda-libs package)
Container toolkit: nvidia-ctk command available
Docker daemon config: /etc/docker/daemon.json (read by Docker Engine; editing it may not help with Docker Desktop)

Future Reference

When encountering GPU container issues on Fedora-based systems:

Try Podman first - it likely works immediately
Don't waste time troubleshooting Docker Desktop GPU issues
Use the same container images and configurations
Podman commands are nearly identical to Docker commands

This approach saves hours of debugging Docker Desktop GPU integration issues on Fedora systems.
