claude-home/docker/troubleshooting.md
Cal Corum 4b7eca8a46
docs: add YAML frontmatter to all 151 markdown files
Adds title, description, type, domain, and tags frontmatter to every
doc for improved KB semantic search. The description field is prepended
to every search chunk, and domain/type/tags enable filtered queries.

Type values: context, guide, runbook, reference, troubleshooting
Domain values match directory structure (networking, docker, etc.)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 09:00:44 -05:00


---
title: "Docker Container Troubleshooting"
description: "Troubleshooting guide for Docker and Podman containers covering startup failures, GPU issues, build errors, networking, performance, and emergency recovery procedures."
type: troubleshooting
domain: docker
tags: [docker, podman, troubleshooting, gpu, nvidia, networking, performance, fedora]
---
# Docker Container Troubleshooting Guide
## Container Startup Issues
### Container Won't Start
**Check container logs first**:
```bash
# Docker
docker logs <container_name>
docker logs --tail 50 -f <container_name>
# Podman
podman logs <container_name>
podman logs --tail 50 -f <container_name>
```
### Common Startup Failures
#### Port Conflicts
**Symptoms**: `bind: address already in use` error
**Solution**:
```bash
# Find conflicting process
sudo netstat -tulpn | grep <port>
docker ps | grep <port>
# Change port mapping
docker run -p 8081:8080 myapp # Use different host port
```
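When the free port is not known in advance, the check above can be scripted. This is a minimal sketch, assuming `ss` (from iproute2) is installed on the host; `port_in_use` and `first_free_port` are hypothetical helper names, not Docker commands.

```bash
# Hypothetical helper: succeed if something is listening on the given TCP port.
port_in_use() {
  local port="$1"
  # -H: no header, -t: TCP, -l: listening sockets, -n: numeric ports
  [ -n "$(ss -Htln sport = ":$port")" ]
}

# Hypothetical helper: print the first free TCP port at or above the given one.
first_free_port() {
  local port="$1"
  while port_in_use "$port"; do
    port=$((port + 1))
  done
  echo "$port"
}
```

Typical use would be `docker run -p "$(first_free_port 8080)":8080 myapp`. Another process can still claim the port between the check and the `docker run`, so treat the result as a hint rather than a guarantee.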
#### Permission Errors
**Symptoms**: `permission denied` when accessing files/volumes
**Solutions**:
```bash
# Check file ownership
ls -la /host/volume/path
# Fix ownership (match container user)
sudo chown -R 1000:1000 /host/volume/path
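# Run the container process as a matching UID/GID
docker run --user 1000:1000 myapp
# Or pass PUID/PGID (only honored by images that implement them, e.g. linuxserver.io images)
docker run -e PUID=1000 -e PGID=1000 myapp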
```
#### Missing Environment Variables
**Symptoms**: Application fails with configuration errors
**Diagnostic**:
```bash
# Check container environment
docker exec -it <container> env
docker exec -it <container> printenv
# Verify required variables are set
docker inspect <container> | grep -A 20 "Env"
```
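To check a known list of required variables in one pass, the diagnostic above can be wrapped in a loop. This is a sketch under two assumptions: the container is running, and its image ships `printenv`. `missing_env_vars` is a hypothetical helper name, and the variable names in the usage note are placeholders.

```bash
# Hypothetical helper: print each required variable that is unset in the container.
# Usage: missing_env_vars <container> VAR1 VAR2 ...
missing_env_vars() {
  local container="$1"
  shift
  local var
  for var in "$@"; do
    # printenv exits non-zero when the variable is not set
    if ! docker exec "$container" printenv "$var" >/dev/null 2>&1; then
      echo "$var"
    fi
  done
}
```

For example, `missing_env_vars myapp DATABASE_URL API_KEY` prints nothing when both variables are set and prints the missing names otherwise.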
#### Resource Constraints
**Symptoms**: Container killed or OOM errors
**Solutions**:
```bash
# Check resource usage
docker stats <container>
# Increase memory limit
docker run -m 4g myapp
# Check system resources
free -h
df -h
```
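To confirm that a stopped container was actually killed for exceeding its memory limit, the container state can be inspected directly. A minimal sketch follows; `was_oom_killed` is a hypothetical helper name, while `.State.OOMKilled` and `.State.ExitCode` are fields reported by `docker inspect`.

```bash
# Hypothetical helper: report whether the OOM killer stopped a container.
was_oom_killed() {
  local container="$1"
  local oom exit_code
  oom="$(docker inspect --format '{{.State.OOMKilled}}' "$container")" || return 2
  exit_code="$(docker inspect --format '{{.State.ExitCode}}' "$container")" || return 2
  if [ "$oom" = "true" ]; then
    echo "$container: OOM-killed (exit code $exit_code); consider raising -m"
  else
    echo "$container: not OOM-killed (exit code $exit_code)"
  fi
}
```

An OOM-killed container usually exits with code 137 (128 + SIGKILL), but that code alone does not prove an OOM kill, which is why the sketch reads the `OOMKilled` flag as well.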
### Debug Running Containers
```bash
# Access container shell
docker exec -it <container> /bin/bash
docker exec -it <container> /bin/sh # if bash not available
# Check container processes
docker exec <container> ps aux
# Check container filesystem
docker exec <container> ls -la /app
```
## Build Issues
### Build Failures
**Clear build cache when encountering issues**:
```bash
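# Docker
docker builder prune       # build cache only
docker system prune -a     # also removes stopped containers, unused networks, and all unused images
# Podman
podman image prune -a      # all unused images
podman system prune -a     # also removes stopped containers, unused pods, and unused networks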
```
### Verbose Build Output
```bash
# Docker
docker build --progress=plain --no-cache .
# Podman
podman build --layers=false .
```
### Common Build Problems
#### COPY/ADD Errors
**Issue**: Files not found during build
**Solutions**:
```dockerfile
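# Check the .dockerignore file: it may be excluding the files you need
# Source paths are resolved relative to the build context
# Dockerfile comments must be on their own line; a trailing "#" is parsed as an argument
# ✅ Correct: path inside the build context
COPY ./src /app/src
# ❌ Wrong: host paths outside the build context cannot be copied
# COPY /absolute/path /app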
```
#### Package Installation Failures
**Issue**: apt/yum/dnf package installation fails
**Solutions**:
```dockerfile
# Update package lists first
RUN apt-get update && apt-get install -y package-name
# Combine RUN commands to reduce layers
RUN apt-get update && \
    apt-get install -y package1 package2 && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*
```
#### Network Issues During Build
**Issue**: Cannot reach package repositories
**Solutions**:
```bash
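# Build on the host's network stack (bypasses the bridge network and its DNS setup)
docker build --network host .
# Use custom DNS (Podman only; `docker build` has no --dns flag, so for Docker
# configure DNS on the daemon instead, e.g. the "dns" key in /etc/docker/daemon.json)
podman build --dns 8.8.8.8 .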
```
## GPU Container Issues
### NVIDIA GPU Support Problems
#### Docker Desktop vs Podman on Fedora/Nobara
**Issue**: Docker Desktop has GPU compatibility issues on Fedora-based systems
**Symptoms**:
- `CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected`
- `unknown or invalid runtime name: nvidia`
- Device nodes exist but CUDA fails to initialize
**Solution**: Use Podman instead of Docker on Fedora systems
```bash
# Verify host GPU works
nvidia-smi
# Test with Podman (recommended)
podman run --rm --device nvidia.com/gpu=all ubuntu:20.04 nvidia-smi
# Test with Docker (may fail on Fedora)
docker run --rm --gpus all ubuntu:20.04 nvidia-smi
```
#### GPU Container Configuration
**Working Podman GPU template**:
```bash
podman run -d --name gpu-container \
--device nvidia.com/gpu=all \
--restart unless-stopped \
-e NVIDIA_DRIVER_CAPABILITIES=all \
-e NVIDIA_VISIBLE_DEVICES=all \
myapp:latest
```
**Working Docker GPU template**:
```bash
docker run -d --name gpu-container \
--gpus all \
--restart unless-stopped \
-e NVIDIA_DRIVER_CAPABILITIES=all \
-e NVIDIA_VISIBLE_DEVICES=all \
myapp:latest
```
#### GPU Troubleshooting Steps
1. **Verify Host GPU Access**:
```bash
nvidia-smi # Should show GPU info
lsmod | grep nvidia # Should show nvidia modules
ls -la /dev/nvidia* # Should show device files
```
2. **Check NVIDIA Container Toolkit**:
```bash
rpm -qa | grep nvidia-container-toolkit # Fedora/RHEL
dpkg -l | grep nvidia-container-toolkit # Ubuntu/Debian
nvidia-ctk --version
```
3. **Test GPU in Container**:
```bash
# Should show GPU information
podman exec gpu-container nvidia-smi
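# Test NVML access from Python. nvidia-ml-py is a Python package (module pynvml), not a command;
# this assumes python3 and nvidia-ml-py are installed in the image
podman exec gpu-container python3 -c "import pynvml; pynvml.nvmlInit(); print(pynvml.nvmlDeviceGetCount())"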
```
#### Platform-Specific GPU Notes
**Fedora/Nobara/RHEL**:
- ✅ Podman: Works out-of-the-box with GPU support
- ❌ Docker Desktop: Known GPU integration issues
- Solution: Use Podman for GPU workloads
**Ubuntu/Debian**:
- ✅ Docker: Generally works well with proper NVIDIA toolkit setup
- ✅ Podman: Also works well
- Solution: Either runtime typically works
## Performance Issues
### Resource Monitoring
**Real-time resource usage**:
```bash
# Overall container stats
docker stats
podman stats
# Inside container analysis
docker exec <container> top
docker exec <container> free -h
docker exec <container> df -h
# Network usage
docker exec <container> netstat -i
```
### Image Size Optimization
**Analyze image layers**:
```bash
# Check image sizes
docker images --format "table {{.Repository}}\t{{.Tag}}\t{{.Size}}"
# Analyze layer history
docker history <image>
# Find large directories in container (sh -c makes /* expand inside the container, not on the host)
docker exec <container> sh -c 'du -sh /* 2>/dev/null' | sort -hr
```
**Optimization strategies**:
```dockerfile
# Use multi-stage builds
FROM node:18 AS builder
# ... build steps ...
FROM node:18-alpine AS production
COPY --from=builder /app/dist /app
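# The alpine base gives a smaller final image
# Combine RUN commands (shown for Debian/Ubuntu-based images)
RUN apt-get update && \
    apt-get install -y package && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*
# Use a .dockerignore file (a separate file in the build context, not part of the Dockerfile)
# with entries such as:
#   node_modules
#   .git
#   *.log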
```
### Storage Performance Issues
**Slow volume performance**:
```bash
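# Test volume write performance (writes about 1 GB; conv=fsync flushes to disk before dd reports)
docker exec <container> dd if=/dev/zero of=/volume/test bs=1M count=1000 conv=fsync
# Remove the test file afterwards
docker exec <container> rm /volume/test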
# Check volume mount options
docker inspect <container> | grep -A 10 "Mounts"
# Consider using tmpfs for temporary data
docker run --tmpfs /tmp myapp
```
## Network Debugging
### Network Connectivity Issues
**Inspect network configuration**:
```bash
# List networks
docker network ls
podman network ls
# Inspect specific network
docker network inspect <network_name>
# Check container networking
docker exec <container> ip addr show
docker exec <container> ip route show
```
### Service Discovery Problems
**Test connectivity between containers**:
```bash
# Test by container name (same network)
docker exec container1 ping container2
# Test by IP address
docker exec container1 ping 172.17.0.3
# Check DNS resolution
docker exec container1 nslookup container2
```
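Name-based lookups only work when both containers are attached to the same user-defined network; Docker's default `bridge` network does not provide DNS between containers. The sketch below lists the networks two containers have in common. `container_networks` and `shared_networks` are hypothetical helper names; `.NetworkSettings.Networks` is the map reported by `docker inspect`.

```bash
# Hypothetical helper: print the names of the networks a container is attached to.
container_networks() {
  docker inspect \
    --format '{{range $name, $cfg := .NetworkSettings.Networks}}{{println $name}}{{end}}' \
    "$1"
}

# Hypothetical helper: print the networks that two containers share.
shared_networks() {
  local nets_b net
  nets_b="$(container_networks "$2")" || return 2
  for net in $(container_networks "$1"); do
    # -x: match the whole line, -F: treat the name as a fixed string
    if printf '%s\n' "$nets_b" | grep -qxF -- "$net"; then
      echo "$net"
    fi
  done
}
```

If `shared_networks container1 container2` prints nothing, or prints only `bridge`, attaching both containers to one user-defined network (`docker network connect <network_name> <container>`) is the usual next step.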
### Port Binding Issues
**Verify port mappings**:
```bash
# Check exposed ports
docker port <container>
# Test external connectivity
curl localhost:8080
# Check if port is bound to all interfaces
netstat -tulpn | grep :8080
```
## Emergency Recovery
### Complete Container Reset
**Remove all containers and start fresh**:
```bash
# Stop all containers
docker stop $(docker ps -q)
podman stop --all
# Remove all containers
docker container prune -f
podman container prune -f
# Remove all images
docker image prune -a -f
podman image prune -a -f
# Remove unused volumes (CAUTION: data loss)
# On Docker 23.0+ this removes only anonymous volumes; add -a to include named volumes
docker volume prune -f
podman volume prune -f
# Complete system cleanup
docker system prune -a --volumes -f
podman system prune -a --volumes -f
```
### Container Recovery
**Recover from corrupted container**:
```bash
# Create backup of container data
docker cp <container>:/important/data ./backup/
# Export container filesystem
docker export <container> > container-backup.tar
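# Import and restart (an imported image carries no CMD/ENTRYPOINT metadata,
# so the command to run must be passed explicitly)
docker import container-backup.tar new-image:latest
docker run -d --name new-container new-image:latest <command>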
```
### Data Recovery
**Recover data from volumes**:
```bash
# List volumes
docker volume ls
# Inspect volume location
docker volume inspect <volume_name>
# Access volume data directly
sudo ls -la /var/lib/docker/volumes/<volume_name>/_data
# Mount volume to temporary container
docker run --rm -v <volume_name>:/data alpine ls -la /data
```
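The temporary-container approach also works for taking a copy of a volume before attempting repairs. This is a minimal sketch, assuming the `alpine` image is available and the destination is an absolute host path; `backup_volume` is a hypothetical helper name.

```bash
# Hypothetical helper: archive a named volume into <dest_dir>/<volume>.tar.gz.
# Usage: backup_volume <volume_name> [dest_dir]   (dest_dir defaults to the current directory)
backup_volume() {
  local volume="$1" dest_dir="${2:-$PWD}"
  # Mount the volume read-only so the backup cannot modify the data
  docker run --rm \
    -v "${volume}:/data:ro" \
    -v "${dest_dir}:/backup" \
    alpine tar czf "/backup/${volume}.tar.gz" -C /data .
}
```

Stopping the containers that write to the volume first avoids archiving files mid-write. On SELinux hosts the bind-mounted destination may also need a `:Z` label.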
## Health Check Issues
### Container Health Checks
**Implement health checks**:
```dockerfile
# Dockerfile health check
HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
CMD curl -f http://localhost:3000/health || exit 1
```
**Debug health check failures**:
```bash
# Check health status
docker inspect <container> | grep -A 10 Health
# Manual health check test
docker exec <container> curl -f http://localhost:3000/health
# Check health check logs
docker events --filter container=<container>
```
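For scripts that must wait until a container reports healthy, the status can be polled rather than grepped. A sketch under stated assumptions: `health_status` and `wait_healthy` are hypothetical helper names, and the 60-second default timeout is arbitrary. The template guards against containers that define no health check.

```bash
# Hypothetical helper: print starting, healthy, unhealthy, or none (no HEALTHCHECK defined).
health_status() {
  docker inspect \
    --format '{{if .State.Health}}{{.State.Health.Status}}{{else}}none{{end}}' \
    "$1"
}

# Hypothetical helper: poll once per second until healthy, or fail.
# Usage: wait_healthy <container> [timeout_seconds]
wait_healthy() {
  local container="$1" timeout="${2:-60}" waited=0 status
  while [ "$waited" -lt "$timeout" ]; do
    status="$(health_status "$container")" || return 2
    case "$status" in
      healthy) return 0 ;;
      unhealthy|none) echo "$container: $status" >&2; return 1 ;;
    esac
    sleep 1
    waited=$((waited + 1))
  done
  echo "$container: still '$status' after ${timeout}s" >&2
  return 1
}
```

The output of recent probes is available with `docker inspect --format '{{json .State.Health.Log}}' <container>`, which is often quicker to read than the event stream.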
## Log Analysis
### Log Management
**View and manage container logs**:
```bash
# View recent logs
docker logs --tail 100 <container>
# Follow logs in real-time
docker logs -f <container>
# Logs with timestamps
docker logs -t <container>
# Search logs for errors
docker logs <container> 2>&1 | grep ERROR
```
### Log Rotation Issues
**Configure log rotation to prevent disk filling**:
```bash
# Run with log size limits
docker run --log-opt max-size=10m --log-opt max-file=3 myapp
# Check log file sizes
sudo du -sh /var/lib/docker/containers/*/
```
## Platform-Specific Issues
### Fedora/Nobara/RHEL Systems
- **GPU Support**: Use Podman instead of Docker Desktop
- **SELinux**: Volume mounts may need relabeling (`:Z` or `:z` suffix on `-v`)
- **Firewall**: Configure firewalld for container networking
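On SELinux hosts, bind mounts typically need the `:Z` (private) or `:z` (shared) suffix on the `-v` option so the container is allowed to read them. The sketch below adds the suffix only when SELinux is enforcing. `volume_arg` is a hypothetical helper name, and the paths in the usage note are placeholders.

```bash
# Hypothetical helper: build a -v argument, adding :Z when SELinux is enforcing.
# Usage: volume_arg <host_path> <container_path>
volume_arg() {
  local host_path="$1" container_path="$2"
  if command -v getenforce >/dev/null 2>&1 && [ "$(getenforce)" = "Enforcing" ]; then
    # :Z relabels the host path for exclusive use by this container
    echo "${host_path}:${container_path}:Z"
  else
    echo "${host_path}:${container_path}"
  fi
}
```

Usage might look like `podman run -v "$(volume_arg /srv/appdata /data)" myapp`. Because `:Z` rewrites the SELinux label on the host path, avoid applying it to system directories or to a whole home directory.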
### Ubuntu/Debian Systems
- **AppArmor**: May restrict container operations
- **Snap Docker**: May have permission issues vs native package
### General Linux Issues
- **cgroups v2**: Some older containers need cgroups v1
- **User namespaces**: May cause UID/GID mapping issues
- **systemd**: Integration differences between Docker/Podman
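To find out which cgroup version a host is running, the filesystem type mounted at `/sys/fs/cgroup` is a common indicator. A minimal sketch follows, assuming GNU `stat`; `cgroup_version` is a hypothetical helper name.

```bash
# Hypothetical helper: print v1, v2, or unknown based on the cgroup mount type.
# Usage: cgroup_version [mount_point]   (defaults to /sys/fs/cgroup)
cgroup_version() {
  local fstype
  # -f: report on the filesystem, %T: filesystem type in human-readable form
  fstype="$(stat -fc %T "${1:-/sys/fs/cgroup}")" || return 2
  case "$fstype" in
    cgroup2fs) echo "v2" ;;
    tmpfs)     echo "v1" ;;
    *)         echo "unknown ($fstype)" ;;
  esac
}
```

Docker also reports this directly: `docker info --format '{{.CgroupVersion}}'`.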
## Prevention Best Practices
1. **Resource Limits**: Always set memory and CPU limits
2. **Health Checks**: Implement application health monitoring
3. **Log Rotation**: Configure to prevent disk space issues
4. **Security Scanning**: Regular vulnerability scans
5. **Backup Strategy**: Regular data and configuration backups
6. **Testing**: Test containers in staging before production
7. **Documentation**: Document container configurations and dependencies
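Several of these practices can be applied in a single `docker run` invocation. The wrapper below is a sketch only: `run_with_limits` is a hypothetical helper name, and the limit values are illustrative assumptions to adjust per workload.

```bash
# Hypothetical helper: start a container with resource limits, log rotation,
# and a restart policy.
# Usage: run_with_limits <name> <image> [command...]
run_with_limits() {
  local name="$1" image="$2"
  shift 2
  docker run -d --name "$name" \
    --memory 512m --cpus 1.0 \
    --log-opt max-size=10m --log-opt max-file=3 \
    --restart unless-stopped \
    "$image" "$@"
}
```

For example, `run_with_limits web myapp:latest` starts `myapp:latest` capped at 512 MB of memory and one CPU, with at most three 10 MB log files.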
This troubleshooting guide covers the most common Docker and Podman container issues encountered in home lab and production environments.