# Docker Container Troubleshooting Guide

## Container Startup Issues

### Container Won't Start

**Check container logs first**:

```bash
# Docker
docker logs <container>
docker logs --tail 50 -f <container>

# Podman
podman logs <container>
podman logs --tail 50 -f <container>
```

### Common Startup Failures

#### Port Conflicts

**Symptoms**: `bind: address already in use` error

**Solution**:

```bash
# Find conflicting process
sudo netstat -tulpn | grep <port>
docker ps | grep <port>

# Change port mapping
docker run -p 8081:8080 myapp  # Use different host port
```

#### Permission Errors

**Symptoms**: `permission denied` when accessing files/volumes

**Solutions**:

```bash
# Check file ownership
ls -la /host/volume/path

# Fix ownership (match container user)
sudo chown -R 1000:1000 /host/volume/path

# Use correct UID/GID in container
# (PUID/PGID only work with images that honor them; otherwise use --user 1000:1000)
docker run -e PUID=1000 -e PGID=1000 myapp
```

#### Missing Environment Variables

**Symptoms**: Application fails with configuration errors

**Diagnostic**:

```bash
# Check container environment
docker exec -it <container> env
docker exec -it <container> printenv

# Verify required variables are set
docker inspect <container> | grep -A 20 "Env"
```

#### Resource Constraints

**Symptoms**: Container killed or OOM errors

**Solutions**:

```bash
# Check resource usage
docker stats

# Increase memory limit
docker run -m 4g myapp

# Check system resources
free -h
df -h
```

### Debug Running Containers

```bash
# Access container shell
docker exec -it <container> /bin/bash
docker exec -it <container> /bin/sh  # if bash not available

# Check container processes
docker exec <container> ps aux

# Check container filesystem
docker exec <container> ls -la /app
```

## Build Issues

### Build Failures

**Clear build cache when encountering issues**:

```bash
# Docker
docker system prune -a
docker builder prune

# Podman
podman system prune -a
podman image prune -a
```

### Verbose Build Output

```bash
# Docker (plain, uncollapsed output without cache)
docker build --progress=plain --no-cache .

# Podman (rebuild without caching intermediate layers)
podman build --layers=false .
```

### Common Build Problems

#### COPY/ADD Errors

**Issue**: Files not found during build

**Solutions**:

```dockerfile
# Check the .dockerignore file
# Verify file paths relative to the build context

# ✅ Correct - source path is relative to the build context
COPY ./src /app/src

# ❌ Wrong - host absolute paths are not available; sources always resolve inside the build context
COPY /absolute/path /app
```

#### Package Installation Failures

**Issue**: apt/yum/dnf package installation fails

**Solutions**:

```dockerfile
# Update package lists first
RUN apt-get update && apt-get install -y package-name

# Combine RUN commands to reduce layers
RUN apt-get update && \
    apt-get install -y package1 package2 && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*
```

#### Network Issues During Build

**Issue**: Cannot reach package repositories

**Solutions**:

```bash
# Build on the host network stack (rules out bridge network and DNS problems)
docker build --network host .

# Use custom DNS (Podman only; `docker build` has no --dns flag,
# so for Docker set "dns" in /etc/docker/daemon.json instead)
podman build --dns 8.8.8.8 .
```

## GPU Container Issues

### NVIDIA GPU Support Problems

#### Docker Desktop vs Podman on Fedora/Nobara

**Issue**: Docker Desktop has GPU compatibility issues on Fedora-based systems

**Symptoms**:
- `CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected`
- `unknown or invalid runtime name: nvidia`
- Device nodes exist but CUDA fails to initialize

**Solution**: Use Podman instead of Docker on Fedora systems

```bash
# Verify host GPU works
nvidia-smi

# Test with Podman (recommended)
podman run --rm --device nvidia.com/gpu=all ubuntu:20.04 nvidia-smi

# Test with Docker (may fail on Fedora)
docker run --rm --gpus all ubuntu:20.04 nvidia-smi
```

#### GPU Container Configuration

**Working Podman GPU template**:

```bash
podman run -d --name gpu-container \
  --device nvidia.com/gpu=all \
  --restart unless-stopped \
  -e NVIDIA_DRIVER_CAPABILITIES=all \
  -e NVIDIA_VISIBLE_DEVICES=all \
  myapp:latest
```

**Working Docker GPU template**:

```bash
docker run -d --name gpu-container \
  --gpus all \
  --restart unless-stopped \
  -e NVIDIA_DRIVER_CAPABILITIES=all \
  -e NVIDIA_VISIBLE_DEVICES=all \
  myapp:latest
```

#### GPU Troubleshooting Steps

1. **Verify Host GPU Access**:

   ```bash
   nvidia-smi                # Should show GPU info
   lsmod | grep nvidia       # Should show nvidia modules
   ls -la /dev/nvidia*       # Should show device files
   ```

2. **Check NVIDIA Container Toolkit**:

   ```bash
   rpm -qa | grep nvidia-container-toolkit   # Fedora/RHEL
   dpkg -l | grep nvidia-container-toolkit   # Ubuntu/Debian
   nvidia-ctk --version
   ```

3. **Test GPU in Container**:

   ```bash
   # Should show GPU information
   podman exec gpu-container nvidia-smi

   # Test GPU access from Python (requires the nvidia-ml-py package in the image)
   podman exec gpu-container python3 -c "import pynvml; pynvml.nvmlInit(); print(pynvml.nvmlDeviceGetCount())"
   ```

#### Platform-Specific GPU Notes

**Fedora/Nobara/RHEL**:
- ✅ Podman: Works with GPU support via CDI (`--device nvidia.com/gpu=all`) once the NVIDIA Container Toolkit is installed
- ❌ Docker Desktop: Known GPU integration issues
- Solution: Use Podman for GPU workloads

**Ubuntu/Debian**:
- ✅ Docker: Generally works well with proper NVIDIA toolkit setup
- ✅ Podman: Also works well
- Solution: Either runtime typically works

## Performance Issues

### Resource Monitoring

**Real-time resource usage**:

```bash
# Overall container stats
docker stats
podman stats

# Inside container analysis
docker exec <container> top
docker exec <container> free -h
docker exec <container> df -h

# Network usage
docker exec <container> netstat -i
```

### Image Size Optimization

**Analyze image layers**:

```bash
# Check image sizes
docker images --format "table {{.Repository}}\t{{.Tag}}\t{{.Size}}"

# Analyze layer history
docker history <image>

# Find large files in container (sh -c so the glob and pipe run inside the container)
docker exec <container> sh -c 'du -sh /* | sort -hr'
```

**Optimization strategies**:

```dockerfile
# Use multi-stage builds
FROM node:18 AS builder
# ... build steps ...
FROM node:18-alpine AS production
# Copy only the build output so the final image stays small
COPY --from=builder /app/dist /app

# Combine RUN commands (Debian/Ubuntu-based images)
RUN apt-get update && \
    apt-get install -y package && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

# Use a .dockerignore file containing, for example:
#   node_modules
#   .git
#   *.log
```

### Storage Performance Issues

**Slow volume performance**:

```bash
# Test volume I/O performance (writes a 1 GB test file; delete it afterwards)
docker exec <container> dd if=/dev/zero of=/volume/test bs=1M count=1000

# Check volume mount options
docker inspect <container> | grep -A 10 "Mounts"

# Consider using tmpfs for temporary data
docker run --tmpfs /tmp myapp
```

## Network Debugging

### Network Connectivity Issues

**Inspect network configuration**:

```bash
# List networks
docker network ls
podman network ls

# Inspect specific network
docker network inspect <network>

# Check container networking
docker exec <container> ip addr show
docker exec <container> ip route show
```

### Service Discovery Problems

**Test connectivity between containers**:

```bash
# Test by container name (same network)
docker exec container1 ping container2

# Test by IP address
docker exec container1 ping 172.17.0.3

# Check DNS resolution
docker exec container1 nslookup container2
```

### Port Binding Issues

**Verify port mappings**:

```bash
# Check exposed ports
docker port <container>

# Test external connectivity
curl localhost:8080

# Check if port is bound to all interfaces
netstat -tulpn | grep :8080
```

## Emergency Recovery

### Complete Container Reset

**Remove all containers and start fresh**:

```bash
# Stop all containers
docker stop $(docker ps -q)
podman stop --all

# Remove all stopped containers
docker container prune -f
podman container prune -f

# Remove all unused images
docker image prune -a -f
podman image prune -a -f

# Remove unused volumes (CAUTION: data loss)
docker volume prune -f
podman volume prune -f

# Complete system cleanup
docker system prune -a --volumes -f
podman system prune -a --volumes -f
```

### Container Recovery

**Recover from corrupted container**:

```bash
# Create backup of container data
docker cp <container>:/important/data ./backup/

# Export container filesystem
docker export <container> > container-backup.tar

# Import and restart
# (an imported image has no CMD/ENTRYPOINT, so pass the command explicitly)
docker import container-backup.tar new-image:latest
docker run -d --name new-container new-image:latest <command>
```

### Data Recovery

**Recover data from volumes**:

```bash
# List volumes
docker volume ls

# Inspect volume location
docker volume inspect <volume>

# Access volume data directly
sudo ls -la /var/lib/docker/volumes/<volume>/_data

# Mount volume to temporary container
docker run --rm -v <volume>:/data alpine ls -la /data
```

## Health Check Issues

### Container Health Checks

**Implement health checks**:

```dockerfile
# Dockerfile health check
HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
  CMD curl -f http://localhost:3000/health || exit 1
```

**Debug health check failures**:

```bash
# Check health status
docker inspect <container> | grep -A 10 Health

# Manual health check test
docker exec <container> curl -f http://localhost:3000/health

# Watch health check events
docker events --filter container=<container>
```

## Log Analysis

### Log Management

**View and manage container logs**:

```bash
# View recent logs
docker logs --tail 100 <container>

# Follow logs in real-time
docker logs -f <container>

# Logs with timestamps
docker logs -t <container>

# Search logs for errors
docker logs <container> 2>&1 | grep ERROR
```

### Log Rotation Issues

**Configure log rotation to prevent disk filling**:

```bash
# Run with log size limits
docker run --log-opt max-size=10m --log-opt max-file=3 myapp

# Check log file sizes (sh -c so the glob is expanded with root permissions)
sudo sh -c 'du -sh /var/lib/docker/containers/*/'
```

## Platform-Specific Issues

### Fedora/Nobara/RHEL Systems

- **GPU Support**: Use Podman instead of Docker Desktop
- **SELinux**: Volume mounts may need relabeling with the `:Z` or `:z` mount option (e.g. `-v /host/path:/data:Z`)
- **Firewall**: Configure firewalld for container networking

### Ubuntu/Debian Systems

- **AppArmor**: May restrict container operations
- **Snap Docker**: May have permission issues vs native package

### General Linux Issues

- **cgroups v2**: Some older containers need cgroups v1
- **User namespaces**: May cause UID/GID mapping issues
- **systemd**: Integration differences between Docker/Podman

## Prevention Best Practices

1. **Resource Limits**: Always set memory and CPU limits
2. **Health Checks**: Implement application health monitoring
3. **Log Rotation**: Configure to prevent disk space issues
4. **Security Scanning**: Run regular vulnerability scans
5. **Backup Strategy**: Back up data and configuration regularly
6. **Testing**: Test containers in staging before production
7. **Documentation**: Document container configurations and dependencies

This troubleshooting guide covers the most common Docker and Podman container issues encountered in home lab and production environments.
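
The first three prevention practices (resource limits, health checks, log rotation) can be combined in a single `docker run` invocation. The following is a minimal sketch, not a definitive setup: the helper name `hardened_run_cmd`, the container name `web`, the limit values, and the port 3000 `/health` endpoint are illustrative assumptions, and the health command assumes `curl` is installed in the image. The helper only prints the command so it can be reviewed before running.

```shell
#!/bin/sh
# Hypothetical helper: prints a `docker run` command that applies
# resource limits, a health check, and log rotation.
# $1 = container name, $2 = image (both supplied by the caller).
hardened_run_cmd() {
  cat <<EOF
docker run -d --name $1 \\
  --memory 512m --cpus 1.0 \\
  --health-cmd 'curl -f http://localhost:3000/health || exit 1' \\
  --health-interval 30s --health-timeout 3s --health-retries 3 \\
  --log-opt max-size=10m --log-opt max-file=3 \\
  $2
EOF
}

# Print the command for review; the name and image here are placeholders.
hardened_run_cmd web myapp:latest
```

Printing rather than executing keeps the sketch safe to try on any machine; once the values match the application, the output can be pasted into a shell or piped to `sh`.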