claude-home/docker/examples/nvidia-troubleshooting.md
Cal Corum 10c9e0d854 CLAUDE: Migrate to technology-first documentation architecture
2025-08-12 23:20:15 -05:00

NVIDIA Container Toolkit Troubleshooting

Installation by Distribution

Fedora/Nobara (DNF)

# Remove conflicting packages
sudo dnf remove golang-github-nvidia-container-toolkit

# Add official repository
curl -s -L https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo | \
  sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo

# Install toolkit
sudo dnf install -y nvidia-container-toolkit

# Configure Docker, then restart it so the new runtime is loaded
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

Ubuntu/Debian (APT)

# Add the signing key, then the repository
# (the escaped \$(ARCH) is written literally; apt expands it itself)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

echo "deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] \
  https://nvidia.github.io/libnvidia-container/stable/deb/\$(ARCH) /" | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit

# Configure Docker, then restart it so the new runtime is loaded
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
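
Which of the two sections above applies can usually be read from /etc/os-release. The following is a minimal sketch, not part of the toolkit: `pkg_family` is a hypothetical helper, and it assumes that Nobara lists `fedora` (or `rhel`) in `ID` or `ID_LIKE`.

```shell
# pkg_family [FILE]: print "dnf", "apt", or "unknown" for an os-release file.
# Hypothetical helper; FILE defaults to /etc/os-release.
# The ( ) body runs in a subshell, so sourcing FILE cannot leak variables.
pkg_family() (
    file="${1:-/etc/os-release}"
    ids=$( . "$file" 2>/dev/null && printf '%s %s' "${ID:-}" "${ID_LIKE:-}" )
    case " $ids " in
        *" fedora "*|*" rhel "*)   echo dnf ;;
        *" debian "*|*" ubuntu "*) echo apt ;;
        *)                         echo unknown ;;
    esac
)

echo "package family: $(pkg_family)"
```

An `unknown` result means neither section applies as written.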

Common Issues

Docker Service Won't Start

# Check daemon logs
sudo journalctl -xeu docker.service

# Common fix: restart the socket unit, then the service
sudo systemctl stop docker.socket
sudo systemctl start docker.socket
sudo systemctl start docker

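# Or move daemon.json aside to start with defaults
# (this also drops the nvidia runtime entry; re-run nvidia-ctk runtime configure afterwards)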
sudo mv /etc/docker/daemon.json /etc/docker/daemon.json.backup
sudo systemctl restart docker
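
A malformed daemon.json is one possible reason the daemon refuses to start, so it can be worth checking the file before moving it aside. This sketch assumes `python3` is available on the host; `daemon_json_ok` is a hypothetical helper name.

```shell
# daemon_json_ok [FILE]: succeed if FILE exists and parses as JSON.
# Hypothetical helper; assumes python3 is on PATH.
daemon_json_ok() {
    python3 -m json.tool "${1:-/etc/docker/daemon.json}" >/dev/null 2>&1
}

if daemon_json_ok; then
    echo "daemon.json parses; check the journal for another cause"
else
    echo "daemon.json is missing or not valid JSON"
fi
```

A missing file is not an error for Docker by itself; only a file that exists but fails to parse points at the configuration.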

GPU Not Detected

# Verify nvidia-smi works
nvidia-smi

# Check runtime registration
docker info | grep -i runtime

# Test with simple container (CUDA image tags use the full three-part version)
docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu20.04 nvidia-smi
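
The `grep -i runtime` check also matches the "Default Runtime" line, which can hide a missing registration. A narrower check might look like the sketch below; `has_nvidia_runtime` is a hypothetical helper, and the plain-text layout of `docker info` (a `Runtimes:` line listing names separated by spaces) is an assumption.

```shell
# has_nvidia_runtime: read `docker info` text on stdin and succeed only if
# the "Runtimes:" line lists nvidia as a whole word. Hypothetical helper.
has_nvidia_runtime() {
    grep -iE '^[[:space:]]*Runtimes:' | grep -qw 'nvidia'
}

# Usage (needs a running Docker daemon):
#   docker info 2>/dev/null | has_nvidia_runtime && echo "nvidia runtime registered"
```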

CDI Method (Alternative)

# Generate CDI spec
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

# Use in compose (docker-compose.yml; needs a Docker Engine with CDI support enabled)
services:
  app:
    devices:
      - nvidia.com/gpu=all
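
Before referencing a CDI device in compose, it may help to confirm that the generated spec actually exposes it. The sketch below filters the output of `nvidia-ctk cdi list`; `has_cdi_device` is a hypothetical helper, and it assumes each device name appears as a separate whitespace-delimited token.

```shell
# has_cdi_device [NAME]: read `nvidia-ctk cdi list` output on stdin and
# succeed if NAME appears as a token. Hypothetical helper.
has_cdi_device() {
    awk -v want="${1:-nvidia.com/gpu=all}" '
        { for (i = 1; i <= NF; i++) if ($i == want) found = 1 }
        END { exit !found }
    '
}

# Usage (on a host with the toolkit installed):
#   nvidia-ctk cdi list 2>/dev/null | has_cdi_device nvidia.com/gpu=all \
#       && echo "CDI device available"
```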

Configuration Patterns

daemon.json Structure

{
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    }
}
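
To confirm that a given daemon.json follows this structure, the runtime path can be read back out of it. This is a sketch under assumptions: `nvidia_runtime_path` is a hypothetical helper and `python3` is assumed to be installed.

```shell
# nvidia_runtime_path [FILE]: print runtimes.nvidia.path from a daemon.json,
# or fail if the file or the entry is missing. Hypothetical helper.
nvidia_runtime_path() {
    python3 - "${1:-/etc/docker/daemon.json}" <<'PY'
import json
import sys

try:
    with open(sys.argv[1]) as fh:
        print(json.load(fh)["runtimes"]["nvidia"]["path"])
except (OSError, ValueError, KeyError, TypeError):
    sys.exit(1)
PY
}

if path=$(nvidia_runtime_path); then
    echo "nvidia runtime path: $path"
else
    echo "no nvidia runtime entry found"
fi
```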

Testing GPU Access

# Test with Tdarr node image
docker run --rm --gpus all ghcr.io/haveagitgat/tdarr_node:latest nvidia-smi

# Expected output: GPU information table
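
For scripted checks, the information table is awkward to parse; `nvidia-smi -L` prints one line per GPU instead. The sketch below counts those lines. `count_gpus` is a hypothetical helper, and the `GPU <index>:` line prefix is assumed from typical `nvidia-smi -L` output.

```shell
# count_gpus: count "GPU <n>:" lines in `nvidia-smi -L` output read from stdin.
# Hypothetical helper; prints 0 when no GPU line is present.
count_gpus() {
    grep -cE '^GPU [0-9]+:' || true
}

# Usage:
#   docker run --rm --gpus all ghcr.io/haveagitgat/tdarr_node:latest nvidia-smi -L | count_gpus
```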

Fallback Strategies

  1. Start with CPU-only configuration
  2. Verify container functionality first
  3. Add GPU support incrementally
  4. Keep Intel/AMD GPU fallback enabled
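
The fallback order above can be sketched as a helper that picks extra `docker run` flags. This is an illustration only: `gpu_run_flags` and both of its parameters are hypothetical, the probe only checks the host driver (not the Docker runtime registration), and passing `/dev/dri` through is assumed to be how the Intel/AMD fallback is exposed.

```shell
# gpu_run_flags [PROBE] [DRI_DIR]: print extra `docker run` flags.
#   PROBE   command that succeeds when the NVIDIA driver works (default: nvidia-smi)
#   DRI_DIR render-device directory for Intel/AMD (default: /dev/dri)
# Both parameters are hypothetical knobs added for illustration and testing.
gpu_run_flags() (
    probe="${1:-nvidia-smi}"
    dri="${2:-/dev/dri}"
    if command -v "$probe" >/dev/null 2>&1 && "$probe" >/dev/null 2>&1; then
        echo "--gpus all"
    elif [ -d "$dri" ]; then
        echo "--device $dri:$dri"
    fi
    # Prints nothing when neither is available: CPU-only.
)

echo "extra docker run flags: $(gpu_run_flags)"
```

An empty result corresponds to step 1, the CPU-only configuration.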