# NVIDIA GPU Container Troubleshooting Guide

## Key Insights from Fedora/Nobara GPU Container Issues

### Problem: Docker Desktop vs Podman GPU Support on Fedora-based Systems

**Issue**: Docker Desktop on Fedora/Nobara systems has significant compatibility issues with NVIDIA Container Toolkit integration, even when properly configured.

**Symptoms**:
- `CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected`
- `unknown or invalid runtime name: nvidia`
- Device nodes are created, but the CUDA runtime fails to initialize
- Manual device creation (`mknod`) succeeds, but CUDA still fails

**Root Cause**: Docker Desktop runs containers inside a virtual machine, and that virtualization layer interferes with direct access to the host GPU on Fedora-based systems.
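
Because the root cause is specific to Docker Desktop, it can help to confirm which backend the `docker` CLI is actually talking to. The sketch below assumes that Docker Desktop reports `Docker Desktop` in the `OperatingSystem` field of `docker info`; the helper name `is_docker_desktop` is hypothetical and not part of any tool.

```bash
# Hypothetical helper: succeeds if the Docker backend identifies itself as
# Docker Desktop. Assumption: Docker Desktop reports "Docker Desktop" in the
# OperatingSystem field of `docker info`.
# An explicit string can be passed as $1 instead of querying the daemon.
is_docker_desktop() {
    os_name=${1:-$(docker info --format '{{.OperatingSystem}}' 2>/dev/null)}
    case "$os_name" in
        *"Docker Desktop"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example use (commented out so that sourcing this file has no side effects):
# if is_docker_desktop; then
#     echo "Docker Desktop detected: GPU containers may fail, try Podman"
# fi
```

If the check succeeds on a Fedora-based host, the symptoms above are more likely caused by the VM layer than by the driver or toolkit installation.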

## Solution: Use Podman Instead of Docker

### Why Podman Works Better on Fedora
- **Native integration**: Integrates more closely with systemd and Linux security contexts
- **Direct hardware access**: No VM layer between the container and the GPU
- **Superior NVIDIA toolkit support**: Works with the same nvidia-container-toolkit installation
- **Built for Fedora**: The default container engine on RHEL/Fedora systems

### Verification Commands
```bash
# Test basic GPU access with Podman (should work)
podman run --rm --device nvidia.com/gpu=all ubuntu:20.04 nvidia-smi

# Test basic GPU access with Docker (often fails on Fedora)
docker run --rm --gpus all ubuntu:20.04 nvidia-smi
```

## Complete GPU Container Setup for Fedora/Nobara

### Prerequisites
1. NVIDIA drivers installed and working (`nvidia-smi` functional)
2. nvidia-container-toolkit installed via DNF
3. Podman installed (`sudo dnf install podman`)

### NVIDIA Container Toolkit Installation
```bash
# Install NVIDIA container toolkit
sudo dnf install nvidia-container-toolkit

# Configure Docker runtime (may not work but worth trying)
sudo nvidia-ctk runtime configure --runtime=docker

# Key insight: on the tested system, Podman worked without additional configuration
```
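
The `nvidia.com/gpu=all` device name used with Podman is a CDI (Container Device Interface) device, which Podman resolves from spec files conventionally stored under `/etc/cdi` or `/var/run/cdi`. If Podman does not recognize that device name on another machine, a missing spec is one thing worth checking. The helper `has_cdi_spec` below is a hypothetical sketch; its directory arguments exist only so it can be pointed at other locations.

```bash
# Hypothetical helper: succeeds if any CDI spec file (*.yaml or *.json) exists
# in the given directories (defaults: /etc/cdi and /var/run/cdi).
has_cdi_spec() {
    [ "$#" -gt 0 ] || set -- /etc/cdi /var/run/cdi
    for dir in "$@"; do
        for spec in "$dir"/*.yaml "$dir"/*.json; do
            # An unmatched glob stays literal, so test that the file exists.
            [ -f "$spec" ] && return 0
        done
    done
    return 1
}

# Example use (commented out; needs root and the NVIDIA Container Toolkit):
# has_cdi_spec || sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
# nvidia-ctk cdi list    # the listing should include nvidia.com/gpu=all
```

On the tested Nobara system this step was not needed, so treat it as a fallback rather than a required part of the setup.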

### Working Podman Command Template
```bash
podman run -d --name container-name \
  --device nvidia.com/gpu=all \
  --restart unless-stopped \
  -e NVIDIA_DRIVER_CAPABILITIES=all \
  -e NVIDIA_VISIBLE_DEVICES=all \
  [other options] \
  image:tag
```
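
To review the final command before starting a container, the template can be wrapped in a small dry-run helper that only prints it. This is a sketch: `podman_gpu_cmd` is a hypothetical name, and the container name, image, and volume option in the example are placeholders rather than values from a real deployment.

```bash
# Hypothetical helper: print (do not run) the Podman command from the template.
# Usage: podman_gpu_cmd NAME IMAGE [other options...]
podman_gpu_cmd() {
    name=$1
    image=$2
    shift 2
    cmd="podman run -d --name $name"
    cmd="$cmd --device nvidia.com/gpu=all"
    cmd="$cmd --restart unless-stopped"
    cmd="$cmd -e NVIDIA_DRIVER_CAPABILITIES=all"
    cmd="$cmd -e NVIDIA_VISIBLE_DEVICES=all"
    # Append any remaining arguments in the "[other options]" position.
    if [ "$#" -gt 0 ]; then
        cmd="$cmd $*"
    fi
    printf '%s %s\n' "$cmd" "$image"
}

# Example use (placeholder values):
# podman_gpu_cmd container-name image:tag -v /media:/media
```

Printing the command first keeps the GPU-related flags in one place, so they are not accidentally dropped when other options change.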

## Troubleshooting Steps (In Order)

### 1. Verify Host GPU Access
```bash
nvidia-smi              # Should show GPU info
lsmod | grep nvidia     # Should show nvidia modules loaded
ls -la /dev/nvidia*     # Should show device files
```
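
The three host checks above can be combined into one short report. The sketch below reads `/proc/modules` directly (the same data `lsmod` formats) and looks for `nvidia*` entries under `/dev`; all function names are hypothetical, and the optional path arguments exist only so the checks can be pointed at other locations.

```bash
# Hypothetical helper: succeeds if the core "nvidia" kernel module is loaded.
# $1 is an optional path to a modules file (default: /proc/modules).
nvidia_module_loaded() {
    grep -q '^nvidia ' "${1:-/proc/modules}" 2>/dev/null
}

# Hypothetical helper: succeeds if at least one nvidia* node exists.
# $1 is an optional device directory (default: /dev).
nvidia_device_nodes_present() {
    for node in "${1:-/dev}"/nvidia*; do
        [ -e "$node" ] && return 0
    done
    return 1
}

# Print one status line per check.
host_gpu_report() {
    nvidia_module_loaded && echo "module: ok" || echo "module: nvidia not loaded"
    nvidia_device_nodes_present && echo "devices: ok" || echo "devices: none found"
    nvidia-smi >/dev/null 2>&1 && echo "nvidia-smi: ok" || echo "nvidia-smi: failed or missing"
}

# Example use (commented out): host_gpu_report
```

If any line reports a failure, fix the host driver installation before investigating the container runtime.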

### 2. Test Container Runtime
```bash
# Try Podman first (recommended for Fedora)
podman run --rm --device nvidia.com/gpu=all ubuntu:20.04 nvidia-smi

# If Podman works but Docker doesn't, use Podman for production
```
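
The Podman and Docker smoke tests differ only in how the GPU is requested (`--device nvidia.com/gpu=all` versus `--gpus all`). A minimal sketch of that mapping follows; `gpu_smoke_cmd` is a hypothetical helper that prints the command for a given runtime so both can be tried in a loop.

```bash
# Hypothetical helper: print the GPU smoke-test command for a runtime.
gpu_smoke_cmd() {
    case "$1" in
        podman) gpu_flag="--device nvidia.com/gpu=all" ;;
        docker) gpu_flag="--gpus all" ;;
        *)
            echo "unsupported runtime: $1" >&2
            return 1
            ;;
    esac
    echo "$1 run --rm $gpu_flag ubuntu:20.04 nvidia-smi"
}

# Example use (commented out): test every runtime that is installed.
# for rt in podman docker; do
#     if command -v "$rt" >/dev/null 2>&1; then
#         sh -c "$(gpu_smoke_cmd "$rt")" && echo "$rt: GPU ok" || echo "$rt: GPU failed"
#     fi
# done
```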

### 3. Check NVIDIA Container Toolkit
```bash
rpm -qa | grep nvidia-container-toolkit
nvidia-ctk --version
```

### 4. Verify CUDA Library Locations
```bash
# Find CUDA libraries
rpm -ql nvidia-driver-cuda-libs | grep libcuda
ldconfig -p | grep cuda

# Common locations:
# /usr/lib64/libcuda.so*
# /usr/lib64/libnvidia-encode.so*
```
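
To check both libraries at once, the `ldconfig -p` output can be piped through a small filter that names whichever one is absent from the linker cache. This is a sketch; `missing_gpu_libs` is a hypothetical helper, and the two library names are the ones listed above.

```bash
# Hypothetical helper: read `ldconfig -p` output on stdin and print the name
# of each expected GPU library that does not appear in it.
# Empty output means both libraries were found.
missing_gpu_libs() {
    cache=$(cat)
    for lib in libcuda.so libnvidia-encode.so; do
        printf '%s\n' "$cache" | grep -q -F "$lib" || echo "$lib"
    done
}

# Example use (commented out):
# ldconfig -p | missing_gpu_libs
```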

## Common Misconceptions

### ❌ Docker Should Always Work
**Wrong**: Docker Desktop has known issues with GPU access on some Linux distributions, especially Fedora-based systems.

### ❌ More Privileges = Better GPU Access
**Wrong**: Adding `privileged: true` or mounting devices manually doesn't solve Docker Desktop's fundamental GPU integration issues.

### ❌ The NVIDIA Container Toolkit Is the Problem
**Wrong**: The toolkit works fine; the issue is Docker Desktop's compatibility with it on Fedora systems.

## Best Practices

### For Fedora/RHEL/CentOS Systems
1. **Use Podman by default** for GPU containers
2. Test Docker as a fallback, but expect issues
3. Podman Compose works for orchestration
4. No special configuration is needed beyond nvidia-container-toolkit

### For Production Deployments
1. Test both Docker and Podman in your environment
2. Use whichever works reliably (often Podman on Fedora)
3. Document which container runtime is used
4. Include the runtime in deployment scripts
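
One way to include the runtime in deployment scripts is to resolve it once at the top of the script and reuse the result. The sketch below prefers Podman and falls back to Docker; `pick_runtime` and the `CONTAINER_RUNTIME` override variable are hypothetical names chosen for illustration, not a standard convention.

```bash
# Hypothetical helper: print the container runtime a deployment should use.
# CONTAINER_RUNTIME is an assumed override variable, not a standard one.
pick_runtime() {
    if [ -n "${CONTAINER_RUNTIME:-}" ]; then
        echo "$CONTAINER_RUNTIME"
        return 0
    fi
    # Prefer Podman, as recommended above for Fedora-based hosts.
    for candidate in podman docker; do
        if command -v "$candidate" >/dev/null 2>&1; then
            echo "$candidate"
            return 0
        fi
    done
    echo "no container runtime found" >&2
    return 1
}

# Example use (commented out):
# RUNTIME=$(pick_runtime) || exit 1
# "$RUNTIME" ps
```

Setting the override explicitly in production also documents the choice, which covers the third point in the list.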

## Success Indicators

### GPU Container Working Correctly
- `nvidia-smi` runs inside container
- NVENC/CUDA applications detect GPU
- No "CUDA_ERROR_NO_DEVICE" errors
- Hardware encoder shows as available in applications

### Example: Successful Tdarr Node
```bash
# Container logs should show:
# h264_nvenc-true-true,hevc_nvenc-true-true,av1_nvenc-true-true

# FFmpeg test should succeed:
podman exec container-name ffmpeg -f lavfi -i testsrc2=duration=1:size=320x240:rate=1 -c:v h264_nvenc -t 1 /tmp/test.mp4
```
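
The encoder line from the logs can also be checked in a script. The sketch below assumes the line has exactly the comma-separated `name-flag-flag` shape shown above and treats an encoder as working only when both flags are `true`; `encoder_ok` is a hypothetical helper, not part of Tdarr.

```bash
# Hypothetical helper: succeeds if encoder $2 is reported as "-true-true"
# in the comma-separated encoder line $1.
encoder_ok() {
    case ",$1," in
        *",$2-true-true,"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example use (commented out), with the line copied from the container logs:
# line="h264_nvenc-true-true,hevc_nvenc-true-true,av1_nvenc-true-true"
# encoder_ok "$line" hevc_nvenc && echo "HEVC hardware encoding available"
```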

## System-Specific Notes

### Nobara/Fedora 42
- Docker Desktop: ❌ GPU support problematic
- Podman: ✅ GPU support works out of the box
- NVIDIA Driver version: 570.169 (tested working)
- Container Toolkit version: 1.17.8 (tested working)

### Key Files and Locations
- GPU devices: `/dev/nvidia*` (auto-created)
- CUDA libraries: `/usr/lib64/libcuda.so*` (via nvidia-driver-cuda-libs package)
- Container toolkit: `nvidia-ctk` command available
- Docker daemon config: `/etc/docker/daemon.json` (may not help)

## Future Reference

When encountering GPU container issues on Fedora-based systems:
1. Try Podman first; it will likely work immediately
2. Don't waste time troubleshooting Docker Desktop GPU issues
3. Use the same container images and configurations
4. Podman commands are nearly identical to Docker commands

This approach saves hours of debugging Docker Desktop GPU integration issues on Fedora systems.