claude-home/docker/troubleshooting.md
Cal Corum 4b7eca8a46
docs: add YAML frontmatter to all 151 markdown files
Adds title, description, type, domain, and tags frontmatter to every
doc for improved KB semantic search. The description field is prepended
to every search chunk, and domain/type/tags enable filtered queries.

Type values: context, guide, runbook, reference, troubleshooting
Domain values match directory structure (networking, docker, etc.)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 09:00:44 -05:00


---
title: Docker Container Troubleshooting
description: Troubleshooting guide for Docker and Podman containers covering startup failures, GPU issues, build errors, networking, performance, and emergency recovery procedures.
type: troubleshooting
domain: docker
tags: [docker, podman, troubleshooting, gpu, nvidia, networking, performance, fedora]
---

Docker Container Troubleshooting Guide

Container Startup Issues

Container Won't Start

Check container logs first:

# Docker
docker logs <container_name>
docker logs --tail 50 -f <container_name>

# Podman  
podman logs <container_name>
podman logs --tail 50 -f <container_name>

Common Startup Failures

Port Conflicts

Symptoms: bind: address already in use error
Solution:

# Find conflicting process
sudo netstat -tulpn | grep <port>
docker ps | grep <port>

# Change port mapping
docker run -p 8081:8080 myapp  # Use different host port
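
When a port is taken, picking the replacement host port can be scripted. The sketch below is one possible helper, not part of the original guide: `next_free_port` is a hypothetical name, and feeding it from `ss` (rather than `netstat`) is an assumption about what is installed on the host.

```shell
# next_free_port START
# Reads used port numbers (one per line) on stdin and prints the first
# port >= START that is not in that list.
next_free_port() {
    local port="$1" used
    used="$(cat)"
    while printf '%s\n' "$used" | grep -qx "$port"; do
        port=$((port + 1))
    done
    printf '%s\n' "$port"
}

# Example (assumes ss is available; column 4 is "Local Address:Port"):
#   ss -Htln | awk '{ n = split($4, a, ":"); print a[n] }' | next_free_port 8080
```

The printed port can then be used on the host side of the mapping, for example `docker run -p <port>:8080 myapp`.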

Permission Errors

Symptoms: permission denied when accessing files/volumes
Solutions:

# Check file ownership
ls -la /host/volume/path

# Fix ownership (match container user)
sudo chown -R 1000:1000 /host/volume/path

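# Run the container as the volume owner's UID/GID
docker run --user 1000:1000 myapp

# Some images (e.g. linuxserver.io) read PUID/PGID variables instead
docker run -e PUID=1000 -e PGID=1000 myapp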
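
Before changing ownership, it can help to confirm that a mismatch actually exists. This is a minimal sketch assuming GNU `stat` (as found on most Linux hosts); `owner_matches` is a hypothetical helper name and `1000:1000` is just the example UID/GID used above.

```shell
# owner_matches PATH UID:GID
# Succeeds if PATH is owned by the given numeric UID:GID; otherwise reports
# the actual owner on stderr. Assumes GNU stat (-c format option).
owner_matches() {
    local actual
    actual="$(stat -c '%u:%g' "$1")" || return 2
    if [ "$actual" = "$2" ]; then
        return 0
    fi
    echo "owner of $1 is $actual, expected $2" >&2
    return 1
}

# Example:
#   owner_matches /host/volume/path 1000:1000 || sudo chown -R 1000:1000 /host/volume/path
```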

Missing Environment Variables

Symptoms: Application fails with configuration errors
Diagnostic:

# Check container environment
docker exec -it <container> env
docker exec -it <container> printenv

# Verify required variables are set
docker inspect <container> | grep -A 20 "Env"
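
Scanning `env` output by eye gets tedious when an application needs many variables. The following sketch filters that output for a list of required names; `missing_vars` is a hypothetical helper, and the variable names in the example are placeholders rather than anything this guide prescribes.

```shell
# missing_vars NAME...
# Reads env-style NAME=value lines on stdin and prints every requested
# NAME that does not appear.
missing_vars() {
    local env_dump name
    env_dump="$(cat)"
    for name in "$@"; do
        if ! printf '%s\n' "$env_dump" | grep -q "^${name}="; then
            printf '%s\n' "$name"
        fi
    done
}

# Example (DATABASE_URL and API_KEY are placeholder names):
#   docker exec <container> env | missing_vars DATABASE_URL API_KEY
```

Empty output means every listed variable is present in the container.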

Resource Constraints

Symptoms: Container killed or OOM errors
Solutions:

# Check resource usage
docker stats <container>

# Increase memory limit
docker run -m 4g myapp

# Check system resources
free -h
df -h
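
To tell an out-of-memory kill apart from an ordinary crash, the container state recorded by the runtime can be inspected after the fact. The helper below is a sketch: `classify_exit` is a hypothetical name and its labels are arbitrary, while `.State.OOMKilled` and `.State.ExitCode` are fields reported by `docker inspect`.

```shell
# classify_exit OOM_KILLED EXIT_CODE
# Turns the two state fields into a short label.
classify_exit() {
    local oom="$1" code="$2"
    if [ "$oom" = "true" ]; then
        echo "oom-killed"
    elif [ "$code" = "137" ]; then
        # 137 = 128 + 9 (SIGKILL): could be docker kill or a host-level OOM kill
        echo "sigkill"
    elif [ "$code" = "0" ]; then
        echo "clean-exit"
    else
        echo "app-error"
    fi
}

# Example:
#   classify_exit $(docker inspect --format '{{.State.OOMKilled}} {{.State.ExitCode}}' <container>)
```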

Debug Running Containers

# Access container shell
docker exec -it <container> /bin/bash
docker exec -it <container> /bin/sh  # if bash not available

# Check container processes
docker exec <container> ps aux

# Check container filesystem
docker exec <container> ls -la /app

Build Issues

Build Failures

Clear build cache when encountering issues:

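# Docker (builder prune clears only the build cache; system prune -a also
# removes all unused images, stopped containers, and unused networks)
docker builder prune
docker system prune -a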

# Podman
podman system prune -a
podman image prune -a

Verbose Build Output

# Docker
docker build --progress=plain --no-cache .

# Podman (rebuild every step without the layer cache)
podman build --no-cache .

Common Build Problems

COPY/ADD Errors

Issue: Files not found during build
Solutions:

# Check .dockerignore file
# Verify file paths relative to build context
COPY ./src /app/src  # ✅ Correct
COPY /absolute/path /app  # ❌ Wrong - sources resolve inside the build context, not the host filesystem

Package Installation Failures

Issue: apt/yum/dnf package installation fails
Solutions:

# Update package lists first
RUN apt-get update && apt-get install -y package-name

# Combine RUN commands to reduce layers
RUN apt-get update && \
    apt-get install -y package1 package2 && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

Network Issues During Build

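Issue: Cannot reach package repositories
Solutions:

# Build with the host's network stack (bypasses the bridge network and its DNS setup)
docker build --network host .

# Use custom DNS: docker build has no --dns flag, so set "dns": ["8.8.8.8"]
# in /etc/docker/daemon.json and restart the daemon; Podman accepts the flag directly
podman build --dns 8.8.8.8 .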

GPU Container Issues

NVIDIA GPU Support Problems

Docker Desktop vs Podman on Fedora/Nobara

Issue: Docker Desktop has GPU compatibility issues on Fedora-based systems
Symptoms:

  • CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
  • unknown or invalid runtime name: nvidia
  • Device nodes exist but CUDA fails to initialize

Solution: Use Podman instead of Docker on Fedora systems

# Verify host GPU works
nvidia-smi

# Test with Podman (recommended)
podman run --rm --device nvidia.com/gpu=all ubuntu:20.04 nvidia-smi

# Test with Docker (may fail on Fedora)
docker run --rm --gpus all ubuntu:20.04 nvidia-smi

GPU Container Configuration

Working Podman GPU template:

podman run -d --name gpu-container \
    --device nvidia.com/gpu=all \
    --restart unless-stopped \
    -e NVIDIA_DRIVER_CAPABILITIES=all \
    -e NVIDIA_VISIBLE_DEVICES=all \
    myapp:latest

Working Docker GPU template:

docker run -d --name gpu-container \
    --gpus all \
    --restart unless-stopped \
    -e NVIDIA_DRIVER_CAPABILITIES=all \
    -e NVIDIA_VISIBLE_DEVICES=all \
    myapp:latest

GPU Troubleshooting Steps

  1. Verify Host GPU Access:

    nvidia-smi                    # Should show GPU info
    lsmod | grep nvidia          # Should show nvidia modules
    ls -la /dev/nvidia*          # Should show device files
    
  2. Check NVIDIA Container Toolkit:

    rpm -qa | grep nvidia-container-toolkit  # Fedora/RHEL
    dpkg -l | grep nvidia-container-toolkit  # Ubuntu/Debian
    nvidia-ctk --version
    
  3. Test GPU in Container:

    # Should show GPU information
    podman exec gpu-container nvidia-smi
    
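    # Query the GPU through NVML (nvidia-ml-py is a Python package, not a command;
    # assumes python3 and nvidia-ml-py are installed in the image)
    podman exec gpu-container python3 -c "import pynvml; pynvml.nvmlInit(); print(pynvml.nvmlDeviceGetCount())"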
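
The host-side checks in steps 1 and 2 all depend on certain commands being installed, so a quick pre-check can save time. This is a small sketch; `missing_tools` is a hypothetical helper, and the tool list in the example is only a suggestion.

```shell
# missing_tools NAME...
# Prints each command name that cannot be found on PATH.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Example:
#   missing_tools nvidia-smi nvidia-ctk podman
```

Empty output means every listed command is available; anything printed should be installed before continuing with the steps above.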
    

Platform-Specific GPU Notes

Fedora/Nobara/RHEL:

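  • Podman: Works with GPU support once the NVIDIA Container Toolkit is installed and a CDI spec exists (the nvidia.com/gpu=all device name relies on CDI)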
  • Docker Desktop: Known GPU integration issues
  • Solution: Use Podman for GPU workloads

Ubuntu/Debian:

  • Docker: Generally works well with proper NVIDIA toolkit setup
  • Podman: Also works well
  • Solution: Either runtime typically works

Performance Issues

Resource Monitoring

Real-time resource usage:

# Overall container stats
docker stats
podman stats

# Inside container analysis
docker exec <container> top
docker exec <container> free -h
docker exec <container> df -h

# Network usage
docker exec <container> netstat -i

Image Size Optimization

Analyze image layers:

# Check image sizes
docker images --format "table {{.Repository}}\t{{.Tag}}\t{{.Size}}"

# Analyze layer history
docker history <image>

# Find large directories in container (the /* glob must expand inside the container, not on the host)
docker exec <container> sh -c 'du -sh /* 2>/dev/null' | sort -hr

Optimization strategies:

# Use multi-stage builds
FROM node:18 AS builder
# ... build steps ...

FROM node:18-alpine AS production
COPY --from=builder /app/dist /app
# Smaller final image

# Combine RUN commands
RUN apt-get update && \
    apt-get install -y package && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

# Use .dockerignore
# .dockerignore
node_modules
.git
*.log

Storage Performance Issues

Slow volume performance:

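# Test volume I/O performance (conv=fsync flushes to disk so the page cache doesn't inflate the result)
docker exec <container> dd if=/dev/zero of=/volume/test bs=1M count=1000 conv=fsync
docker exec <container> rm /volume/test  # remove the ~1 GB test file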

# Check volume mount options
docker inspect <container> | grep -A 10 "Mounts"

# Consider using tmpfs for temporary data
docker run --tmpfs /tmp myapp

Network Debugging

Network Connectivity Issues

Inspect network configuration:

# List networks
docker network ls
podman network ls

# Inspect specific network
docker network inspect <network_name>

# Check container networking
docker exec <container> ip addr show
docker exec <container> ip route show

Service Discovery Problems

Test connectivity between containers:

# Test by container name (same network)
docker exec container1 ping container2

# Test by IP address
docker exec container1 ping 172.17.0.3

# Check DNS resolution
docker exec container1 nslookup container2

Port Binding Issues

Verify port mappings:

# Check exposed ports
docker port <container>

# Test external connectivity
curl localhost:8080

# Check if port is bound to all interfaces
netstat -tulpn | grep :8080
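
A common cause of "works locally, unreachable from other hosts" is a port published only on the loopback address. The filter below is a sketch that reads `docker port` output in its usual `PORT/PROTO -> ADDRESS:PORT` form; `loopback_only` is a hypothetical name.

```shell
# loopback_only
# Reads `docker port <container>` output on stdin and prints the container
# ports that are published only on a loopback address.
loopback_only() {
    awk '$2 == "->" && $3 ~ /^(127\.|\[::1\])/ { print $1 }'
}

# Example:
#   docker port <container> | loopback_only
```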

Emergency Recovery

Complete Container Reset

Remove all containers and start fresh:

# Stop all containers
docker stop $(docker ps -q)
podman stop --all

# Remove all containers
docker container prune -f
podman container prune -f

# Remove all images
docker image prune -a -f
podman image prune -a -f

# Remove unused volumes (CAUTION: data loss)
# On Docker 23.0+ this removes only anonymous volumes; add -a to include named ones
docker volume prune -f
podman volume prune -f

# Complete system cleanup
docker system prune -a --volumes -f
podman system prune -a --volumes -f
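
Because these commands are destructive and use `-f` to skip the runtime's own prompt, a wrapper script may want an explicit confirmation step. The guard below is a sketch; `confirm_reset` is a hypothetical helper and the prompt wording is arbitrary.

```shell
# confirm_reset
# Prompts on stderr and succeeds only if the user types exactly "yes".
confirm_reset() {
    local answer
    printf 'This removes ALL containers, images, and volumes. Type "yes" to continue: ' >&2
    read -r answer || return 1
    [ "$answer" = "yes" ]
}

# Example:
#   confirm_reset && docker system prune -a --volumes -f
```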

Container Recovery

Recover from corrupted container:

# Create backup of container data
docker cp <container>:/important/data ./backup/

# Export container filesystem
docker export <container> > container-backup.tar

# Import and restart (docker import drops CMD, ENTRYPOINT, and ENV,
# so pass the start command explicitly)
docker import container-backup.tar new-image:latest
docker run -d --name new-container new-image:latest <command>

Data Recovery

Recover data from volumes:

# List volumes
docker volume ls

# Inspect volume location
docker volume inspect <volume_name>

# Access volume data directly
sudo ls -la /var/lib/docker/volumes/<volume_name>/_data

# Mount volume to temporary container
docker run --rm -v <volume_name>:/data alpine ls -la /data
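
The temporary-container approach also works for copying a volume out to an archive. The function below is a sketch under a few assumptions: `backup_volume` and the `CONTAINER_RUNTIME` variable are hypothetical names, the destination must be an absolute host path, and the `alpine` image's `tar` is used to write the archive.

```shell
# backup_volume VOLUME DEST_DIR
# Archives the contents of VOLUME into DEST_DIR/VOLUME.tar.gz using a
# throwaway container. Set CONTAINER_RUNTIME=podman to use Podman instead.
backup_volume() {
    local volume="$1" dest="$2" runtime="${CONTAINER_RUNTIME:-docker}"
    "$runtime" run --rm \
        -v "${volume}:/data:ro" \
        -v "${dest}:/backup" \
        alpine tar czf "/backup/${volume}.tar.gz" -C /data .
}

# Example:
#   backup_volume <volume_name> /srv/backups
```

The volume is mounted read-only so the backup cannot modify the data it is copying. On SELinux hosts the destination bind mount may additionally need a `:Z` suffix.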

Health Check Issues

Container Health Checks

Implement health checks:

# Dockerfile health check
HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
    CMD curl -f http://localhost:3000/health || exit 1

Debug health check failures:

# Check health status
docker inspect <container> | grep -A 10 Health

# Manual health check test
docker exec <container> curl -f http://localhost:3000/health

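# Show recent health check results (exit codes and output)
docker inspect --format '{{json .State.Health}}' <container>

# Watch health status changes as they happen
docker events --filter container=<container> --filter event=health_status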
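
In deployment scripts it is often useful to block until a container reports healthy. The loop below is a sketch, assuming the container defines a HEALTHCHECK and that the runtime exposes `.State.Health.Status` (true for Docker; some Podman versions may name the field differently). `wait_healthy` and `CONTAINER_RUNTIME` are hypothetical names.

```shell
# wait_healthy CONTAINER [TRIES]
# Polls the health status once per second, up to TRIES times (default 30).
# Returns 0 when healthy, 1 when unhealthy or on timeout.
wait_healthy() {
    local name="$1" tries="${2:-30}" runtime="${CONTAINER_RUNTIME:-docker}" status
    while [ "$tries" -gt 0 ]; do
        status="$("$runtime" inspect --format '{{.State.Health.Status}}' "$name" 2>/dev/null)"
        case "$status" in
            healthy) return 0 ;;
            unhealthy) echo "$name is unhealthy" >&2; return 1 ;;
        esac
        tries=$((tries - 1))
        sleep 1
    done
    echo "timed out waiting for $name" >&2
    return 1
}

# Example:
#   wait_healthy <container> 60 && echo "ready"
```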

Log Analysis

Log Management

View and manage container logs:

# View recent logs
docker logs --tail 100 <container>

# Follow logs in real-time
docker logs -f <container>

# Logs with timestamps
docker logs -t <container>

# Search logs for errors
docker logs <container> 2>&1 | grep ERROR

Log Rotation Issues

Configure log rotation to prevent disk filling:

# Run with log size limits
docker run --log-opt max-size=10m --log-opt max-file=3 myapp

# Check log file sizes
sudo du -sh /var/lib/docker/containers/*/

Platform-Specific Issues

Fedora/Nobara/RHEL Systems

  • GPU Support: Use Podman instead of Docker Desktop
  • SELinux: Bind mounts may need relabeling with the :z or :Z volume option (e.g. -v /host/path:/data:Z)
  • Firewall: Configure firewalld for container networking

Ubuntu/Debian Systems

  • AppArmor: May restrict container operations
  • Snap Docker: May have permission issues vs native package

General Linux Issues

  • cgroups v2: Some older containers need cgroups v1
  • User namespaces: May cause UID/GID mapping issues
  • systemd: Integration differences between Docker/Podman

Prevention Best Practices

  1. Resource Limits: Always set memory and CPU limits
  2. Health Checks: Implement application health monitoring
  3. Log Rotation: Configure to prevent disk space issues
  4. Security Scanning: Regular vulnerability scans
  5. Backup Strategy: Regular data and configuration backups
  6. Testing: Test containers in staging before production
  7. Documentation: Document container configurations and dependencies

This troubleshooting guide covers the most common Docker and Podman container issues encountered in home lab and production environments.