claude-home/docker/troubleshooting.md
Cal Corum 10c9e0d854 CLAUDE: Migrate to technology-first documentation architecture
2025-08-12 23:20:15 -05:00

Docker Container Troubleshooting Guide

Container Startup Issues

Container Won't Start

Check container logs first:

# Docker
docker logs <container_name>
docker logs --tail 50 -f <container_name>

# Podman  
podman logs <container_name>
podman logs --tail 50 -f <container_name>

Common Startup Failures

Port Conflicts

Symptoms: bind: address already in use error

Solution:

# Find conflicting process
sudo netstat -tulpn | grep <port>
docker ps | grep <port>

# Change port mapping
docker run -p 8081:8080 myapp  # Use different host port
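A plain grep on the port number also matches longer ports (grep 80 hits 8080 and 18080). A minimal sketch of an exact match, assuming the `ss -tuln` column layout where the fifth field is the local address; `listeners_on_port` is a hypothetical helper name, not part of Docker or iproute2:

```shell
# Print only the listeners whose local port equals $1 exactly.
# Reads `ss -tuln` (or `ss -tulpn`) output on stdin.
listeners_on_port() {
    awk -v port="$1" '{
        # Field 5 is the local address; the port is the piece after the last ":"
        n = split($5, parts, ":")
        if (parts[n] == port) print
    }'
}

# Usage: sudo ss -tulpn | listeners_on_port 8080
```

Splitting on the last colon keeps IPv6 listeners such as `[::]:8080` in the result.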

Permission Errors

Symptoms: permission denied when accessing files/volumes

Solutions:

# Check file ownership
ls -la /host/volume/path

# Fix ownership (match container user)
sudo chown -R 1000:1000 /host/volume/path

# Use correct UID/GID in container
docker run -e PUID=1000 -e PGID=1000 myapp

Missing Environment Variables

Symptoms: Application fails with configuration errors

Diagnostic:

# Check container environment
docker exec -it <container> env
docker exec -it <container> printenv

# Verify required variables are set
docker inspect <container> | grep -A 20 "Env"
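Checking for specific variable names can be scripted instead of read off the Env array by eye. A sketch, assuming the required names are known in advance; `missing_env` is a hypothetical helper, while `.Config.Env` is the field that `docker inspect` reports:

```shell
# Print each required variable that is not set in the container's config.
# Returns 0 if all are present, 1 if any are missing, 2 if inspect fails.
missing_env() {
    local container="$1" env_lines name status=0
    shift
    # One NAME=value pair per line
    env_lines=$(docker inspect --format '{{range .Config.Env}}{{println .}}{{end}}' "$container") || return 2
    for name in "$@"; do
        if ! printf '%s\n' "$env_lines" | grep -q "^${name}="; then
            echo "missing: $name"
            status=1
        fi
    done
    return "$status"
}

# Usage: missing_env <container> PUID PGID TZ
```

This reads the configuration the container was created with, so it also works on a container that has already exited.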

Resource Constraints

Symptoms: Container killed or OOM errors

Solutions:

# Check resource usage
docker stats <container>

# Increase memory limit
docker run -m 4g myapp

# Check system resources
free -h
df -h
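Whether a stopped container was actually killed for memory can be read from its recorded state. A sketch using the `State.OOMKilled` and `State.ExitCode` fields of `docker inspect`; `oom_report` is a hypothetical helper name:

```shell
# Summarize whether a container was killed by the kernel OOM killer.
oom_report() {
    local container="$1" state oom exit_code
    state=$(docker inspect --format '{{.State.OOMKilled}} {{.State.ExitCode}}' "$container") || return 2
    oom=${state%% *}        # first word: true or false
    exit_code=${state##* }  # second word: numeric exit code
    if [ "$oom" = "true" ]; then
        echo "$container: killed by the OOM killer (exit code $exit_code)"
    elif [ "$exit_code" = "137" ]; then
        # 137 = 128 + 9 (SIGKILL), e.g. docker kill or a host-level OOM
        echo "$container: exit code 137 (SIGKILL), OOMKilled not set"
    else
        echo "$container: not OOM-killed (exit code $exit_code)"
    fi
}

# Usage: oom_report <container>
```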

Debug Running Containers

# Access container shell
docker exec -it <container> /bin/bash
docker exec -it <container> /bin/sh  # if bash not available

# Check container processes
docker exec <container> ps aux

# Check container filesystem
docker exec <container> ls -la /app

Build Issues

Build Failures

Clear build cache when encountering issues:

# Docker
docker system prune -a
docker builder prune

# Podman
podman system prune -a
podman image prune -a

Verbose Build Output

# Docker
docker build --progress=plain --no-cache .

# Podman (output is already plain text; --layers=false disables intermediate layer caching)
podman build --layers=false .

Common Build Problems

COPY/ADD Errors

Issue: Files not found during build

Solutions:

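# Check .dockerignore file
# Verify file paths relative to build context
# (Dockerfile comments must sit on their own line; trailing text is read as COPY arguments)

# ✅ Correct: source path is relative to the build context
COPY ./src /app/src

# ❌ Wrong: sources are always resolved inside the build context,
# so a host absolute path is not found
COPY /absolute/path /app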

Package Installation Failures

Issue: apt/yum/dnf package installation fails

Solutions:

# Update package lists first
RUN apt-get update && apt-get install -y package-name

# Combine RUN commands to reduce layers
RUN apt-get update && \
    apt-get install -y package1 package2 && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

Network Issues During Build

Issue: Cannot reach package repositories

Solutions:

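# Build with host networking (bypasses the bridge network and its DNS setup)
docker build --network host .

# Use custom DNS: docker build has no --dns flag, so set "dns" in
# /etc/docker/daemon.json and restart the daemon; podman build accepts --dns
podman build --dns 8.8.8.8 .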

GPU Container Issues

NVIDIA GPU Support Problems

Docker Desktop vs Podman on Fedora/Nobara

Issue: Docker Desktop has GPU compatibility issues on Fedora-based systems

Symptoms:

  • CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
  • unknown or invalid runtime name: nvidia
  • Device nodes exist but CUDA fails to initialize

Solution: Use Podman instead of Docker on Fedora systems

# Verify host GPU works
nvidia-smi

# Test with Podman (recommended)
podman run --rm --device nvidia.com/gpu=all ubuntu:20.04 nvidia-smi

# Test with Docker (may fail on Fedora)
docker run --rm --gpus all ubuntu:20.04 nvidia-smi

GPU Container Configuration

Working Podman GPU template:

podman run -d --name gpu-container \
    --device nvidia.com/gpu=all \
    --restart unless-stopped \
    -e NVIDIA_DRIVER_CAPABILITIES=all \
    -e NVIDIA_VISIBLE_DEVICES=all \
    myapp:latest

Working Docker GPU template:

docker run -d --name gpu-container \
    --gpus all \
    --restart unless-stopped \
    -e NVIDIA_DRIVER_CAPABILITIES=all \
    -e NVIDIA_VISIBLE_DEVICES=all \
    myapp:latest

GPU Troubleshooting Steps

  1. Verify Host GPU Access:

    nvidia-smi                    # Should show GPU info
    lsmod | grep nvidia          # Should show nvidia modules
    ls -la /dev/nvidia*          # Should show device files
    
  2. Check NVIDIA Container Toolkit:

    rpm -qa | grep nvidia-container-toolkit  # Fedora/RHEL
    dpkg -l | grep nvidia-container-toolkit  # Ubuntu/Debian
    nvidia-ctk --version
    
  3. Test GPU in Container:

    # Should show GPU information
    podman exec gpu-container nvidia-smi
    
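    # Query the GPU through NVML (nvidia-ml-py is a Python package, not a command)
    podman exec gpu-container nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv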
    

Platform-Specific GPU Notes

Fedora/Nobara/RHEL:

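  • Podman: Works with GPU support once the NVIDIA Container Toolkit is installed and its CDI spec is available (nvidia.com/gpu=all is a CDI device name)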
  • Docker Desktop: Known GPU integration issues
  • Solution: Use Podman for GPU workloads

Ubuntu/Debian:

  • Docker: Generally works well with proper NVIDIA toolkit setup
  • Podman: Also works well
  • Solution: Either runtime typically works

Performance Issues

Resource Monitoring

Real-time resource usage:

# Overall container stats
docker stats
podman stats

# Inside container analysis
docker exec <container> top
docker exec <container> free -h
docker exec <container> df -h

# Network usage
docker exec <container> netstat -i

Image Size Optimization

Analyze image layers:

# Check image sizes
docker images --format "table {{.Repository}}\t{{.Tag}}\t{{.Size}}"

# Analyze layer history
docker history <image>

# Find large files in container (quote the command so the glob expands inside the container, not on the host)
docker exec <container> sh -c 'du -sh /* 2>/dev/null' | sort -hr

Optimization strategies:

# Use multi-stage builds
FROM node:18 AS builder
# ... build steps ...

FROM node:18-alpine AS production
COPY --from=builder /app/dist /app
# Smaller final image

# Combine RUN commands
RUN apt-get update && \
    apt-get install -y package && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

# Use .dockerignore
# .dockerignore
node_modules
.git
*.log

Storage Performance Issues

Slow volume performance:

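# Test volume I/O performance (writes a 1 GB file; conv=fsync flushes to disk before dd reports a rate)
docker exec <container> dd if=/dev/zero of=/volume/test bs=1M count=1000 conv=fsync
docker exec <container> rm /volume/test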

# Check volume mount options
docker inspect <container> | grep -A 10 "Mounts"

# Consider using tmpfs for temporary data
docker run --tmpfs /tmp myapp

Network Debugging

Network Connectivity Issues

Inspect network configuration:

# List networks
docker network ls
podman network ls

# Inspect specific network
docker network inspect <network_name>

# Check container networking
docker exec <container> ip addr show
docker exec <container> ip route show

Service Discovery Problems

Test connectivity between containers:

# Test by container name (needs a shared user-defined network; the default bridge has no name resolution)
docker exec container1 ping container2

# Test by IP address
docker exec container1 ping 172.17.0.3

# Check DNS resolution
docker exec container1 nslookup container2
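Name lookups only succeed when both containers sit on the same user-defined network, so it helps to list the networks they share first. A sketch built on the `NetworkSettings.Networks` map from `docker inspect`; `container_networks` and `shared_networks` are hypothetical helper names:

```shell
# List the networks a container is attached to, one per line.
container_networks() {
    docker inspect --format '{{range $name, $cfg := .NetworkSettings.Networks}}{{println $name}}{{end}}' "$1"
}

# Print the networks that two containers have in common.
shared_networks() {
    local nets_a nets_b net
    nets_a=$(container_networks "$1") || return 2
    nets_b=$(container_networks "$2") || return 2
    for net in $nets_a; do
        # -F fixed string, -x whole line, -q quiet
        if printf '%s\n' "$nets_b" | grep -Fxq -- "$net"; then
            echo "$net"
        fi
    done
}

# Usage: shared_networks container1 container2
```

An empty result, or a result of only `bridge`, means lookups by container name are not expected to work; attaching both containers to one user-defined network with `docker network connect` is the usual fix.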

Port Binding Issues

Verify port mappings:

# Check exposed ports
docker port <container>

# Test external connectivity
curl localhost:8080

# Check if port is bound to all interfaces
netstat -tulpn | grep :8080
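A port published on a loopback address is reachable from the host but not from other machines, which is easy to miss in the `docker port` output. A sketch that flags such mappings, assuming the usual `8080/tcp -> 127.0.0.1:8080` line format; `loopback_only_bindings` is a hypothetical helper name:

```shell
# Print the port mappings of a container that are bound to loopback only.
loopback_only_bindings() {
    # Field 3 of each line is the host address and port
    docker port "$1" | awk '$3 ~ /^127\./ || $3 ~ /^\[::1\]/ { print }'
}

# Usage: loopback_only_bindings <container>
```

No output means every published port is bound to a non-loopback address such as 0.0.0.0.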

Emergency Recovery

Complete Container Reset

Remove all containers and start fresh:

# Stop all containers
docker stop $(docker ps -q)
podman stop --all

# Remove all containers
docker container prune -f
podman container prune -f

# Remove all images
docker image prune -a -f
podman image prune -a -f

# Remove unused volumes (CAUTION: data loss; Docker 23.0+ prunes only anonymous volumes unless --all is given)
docker volume prune -f
podman volume prune -f

# Complete system cleanup
docker system prune -a --volumes -f
podman system prune -a --volumes -f

Container Recovery

Recover from corrupted container:

# Create backup of container data
docker cp <container>:/important/data ./backup/

# Export container filesystem
docker export <container> > container-backup.tar

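# Import and restart (export drops image metadata such as CMD and ENV,
# so pass the command to run explicitly)
docker import container-backup.tar new-image:latest
docker run -d --name new-container new-image:latest <command>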

Data Recovery

Recover data from volumes:

# List volumes
docker volume ls

# Inspect volume location
docker volume inspect <volume_name>

# Access volume data directly
sudo ls -la /var/lib/docker/volumes/<volume_name>/_data

# Mount volume to temporary container
docker run --rm -v <volume_name>:/data alpine ls -la /data
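The temporary-container approach also works for taking a copy of a volume before attempting repairs. A sketch, assuming the `alpine` image is available and the destination is an absolute host path; `backup_volume` and the archive naming are hypothetical:

```shell
# Archive a named volume into a timestamped tarball on the host.
backup_volume() {
    local volume="$1" dest_dir="${2:-$PWD}" archive
    archive="${volume}-$(date +%Y%m%d-%H%M%S).tar.gz"
    # Mount the volume read-only so the backup cannot modify it
    docker run --rm \
        -v "${volume}:/data:ro" \
        -v "${dest_dir}:/backup" \
        alpine tar czf "/backup/${archive}" -C /data .
}

# Usage: backup_volume <volume_name> /absolute/path/to/backups
```

Unlike reading /var/lib/docker/volumes directly, this does not need sudo on the host and does not depend on where the volume data is stored.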

Health Check Issues

Container Health Checks

Implement health checks:

# Dockerfile health check
HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
    CMD curl -f http://localhost:3000/health || exit 1

Debug health check failures:

# Check health status
docker inspect <container> | grep -A 10 Health

# Manual health check test
docker exec <container> curl -f http://localhost:3000/health

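# Check health check output (exit code and output of the most recent probes)
docker inspect --format '{{json .State.Health}}' <container>

# Watch health status changes as they happen
docker events --filter container=<container> --filter event=health_status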
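Scripts that start a container and then depend on it can poll the health status instead of sleeping for a fixed time. A sketch, assuming the container defines a HEALTHCHECK (inspect fails on `.State.Health.Status` otherwise); `wait_healthy` and its 60-second default are hypothetical:

```shell
# Wait until a container reports healthy.
# Returns 0 when healthy, 1 when unhealthy or timed out, 2 if inspect fails.
wait_healthy() {
    local container="$1" timeout="${2:-60}" elapsed=0 status
    while :; do
        status=$(docker inspect --format '{{.State.Health.Status}}' "$container") || return 2
        case "$status" in
            healthy) return 0 ;;
            unhealthy) echo "$container is unhealthy" >&2; return 1 ;;
        esac
        # Any other status (normally "starting") keeps polling
        if [ "$elapsed" -ge "$timeout" ]; then
            echo "timed out after ${timeout}s waiting for $container (status: $status)" >&2
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
}

# Usage: wait_healthy <container> 120 && echo "ready"
```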

Log Analysis

Log Management

View and manage container logs:

# View recent logs
docker logs --tail 100 <container>

# Follow logs in real-time
docker logs -f <container>

# Logs with timestamps
docker logs -t <container>

# Search logs for errors
docker logs <container> 2>&1 | grep ERROR

Log Rotation Issues

Configure log rotation to prevent disk filling:

# Run with log size limits
docker run --log-opt max-size=10m --log-opt max-file=3 myapp

# Check log file sizes (the glob must expand as root, since /var/lib/docker is normally root-only)
sudo sh -c 'du -sh /var/lib/docker/containers/*/'
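To size the log of a single container without scanning every directory, the log file location can be taken from `docker inspect`. A sketch, assuming a file-based logging driver such as json-file (other drivers report an empty `LogPath`) and read access to the file, which usually means running as root; `container_log_bytes` is a hypothetical helper name:

```shell
# Print the size in bytes of a container's log file.
container_log_bytes() {
    local log_path
    log_path=$(docker inspect --format '{{.LogPath}}' "$1") || return 2
    if [ -z "$log_path" ]; then
        echo "no log file for $1 (logging driver does not write to disk?)" >&2
        return 1
    fi
    wc -c < "$log_path"
}

# Usage: container_log_bytes <container>
```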

Platform-Specific Issues

Fedora/Nobara/RHEL Systems

  • GPU Support: Use Podman instead of Docker Desktop
  • SELinux: Bind mounts may need relabeling (:z or :Z suffix on -v volume mounts)
  • Firewall: Configure firewalld for container networking

Ubuntu/Debian Systems

  • AppArmor: May restrict container operations
  • Snap Docker: May have permission issues vs native package

General Linux Issues

  • cgroups v2: Some older containers need cgroups v1
  • User namespaces: May cause UID/GID mapping issues
  • systemd: Integration differences between Docker/Podman

Prevention Best Practices

  1. Resource Limits: Always set memory and CPU limits
  2. Health Checks: Implement application health monitoring
  3. Log Rotation: Configure to prevent disk space issues
  4. Security Scanning: Regular vulnerability scans
  5. Backup Strategy: Regular data and configuration backups
  6. Testing: Test containers in staging before production
  7. Documentation: Document container configurations and dependencies

This troubleshooting guide covers the most common Docker and Podman container issues encountered in home lab and production environments.