---
title: "Docker Container Troubleshooting"
description: "Troubleshooting guide for Docker and Podman containers covering startup failures, GPU issues, build errors, networking, performance, and emergency recovery procedures."
type: troubleshooting
domain: docker
tags: [docker, podman, troubleshooting, gpu, nvidia, networking, performance, fedora]
---

# Docker Container Troubleshooting Guide

## Container Startup Issues

### Container Won't Start
**Check container logs first**:
```bash
# Docker
docker logs <container_name>
docker logs --tail 50 -f <container_name>

# Podman
podman logs <container_name>
podman logs --tail 50 -f <container_name>
```

### Common Startup Failures

#### Port Conflicts
**Symptoms**: `bind: address already in use` error
**Solution**:
```bash
# Find the conflicting process (use ss on systems without net-tools)
sudo netstat -tulpn | grep <port>
sudo ss -tulpn | grep <port>
docker ps | grep <port>

# Change the port mapping (host port 8081 -> container port 8080)
docker run -p 8081:8080 myapp
```

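The port check above can be wrapped in a small helper. This is a minimal sketch, assuming `ss` from iproute2 is installed on the host; the function names `port_in_use` and `next_free_port` are hypothetical and not part of Docker or Podman.

```bash
# Hypothetical helpers; they only inspect listening TCP sockets on the host.

# Succeeds if something is already listening on TCP port $1.
port_in_use() {
  # Column 4 of `ss -ltnH` is "Local Address:Port"; match ":<port>" at its end.
  ss -ltnH | awk -v pat=":$1\$" '$4 ~ pat { found = 1 } END { exit !found }'
}

# Prints the first free port at or above $1.
next_free_port() (
  port="$1"
  while port_in_use "$port"; do
    port=$((port + 1))
  done
  echo "$port"
)
```

One possible use is `docker run -p "$(next_free_port 8080)":8080 myapp`, which picks a host port that is not taken at the moment the check runs.
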
#### Permission Errors
**Symptoms**: `permission denied` when accessing files/volumes
**Solutions**:
```bash
# Check file ownership
ls -la /host/volume/path

# Fix ownership (match the UID:GID the container process runs as)
sudo chown -R 1000:1000 /host/volume/path

# Run the container process as a specific UID:GID
docker run --user 1000:1000 myapp

# PUID/PGID only work for images whose entrypoint reads them (e.g. linuxserver.io images)
docker run -e PUID=1000 -e PGID=1000 myapp
```

#### Missing Environment Variables
**Symptoms**: Application fails with configuration errors
**Diagnostic**:
```bash
# Check container environment
docker exec -it <container> env
docker exec -it <container> printenv

# Verify required variables are set
docker inspect <container> | grep -A 20 "Env"
```

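When an application needs a known set of variables, the check can be scripted instead of read by eye. A minimal sketch, assuming the container is running and its image provides `env`; `missing_env_vars` is a hypothetical helper name.

```bash
# Hypothetical helper: prints each required variable that is NOT set in the container.
# Usage: missing_env_vars <container> VAR1 [VAR2 ...]
missing_env_vars() (
  container="$1"
  shift
  # Capture the environment once, then test every requested name against it.
  env_dump="$(docker exec "$container" env)" || exit 1
  for name in "$@"; do
    printf '%s\n' "$env_dump" | grep -q "^${name}=" || echo "$name"
  done
)
```

For example, `missing_env_vars <container> DB_HOST DB_PASSWORD` prints nothing when both variables are present. The variable names here are placeholders.
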
#### Resource Constraints
**Symptoms**: Container killed or OOM errors
**Solutions**:
```bash
# Check resource usage
docker stats <container>

# Increase memory limit
docker run -m 4g myapp

# Check system resources
free -h
df -h
```

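To confirm that a container was actually killed for exceeding its memory limit, the inspect output can be queried directly. A minimal sketch using the `.State.OOMKilled` and `.HostConfig.Memory` fields of `docker inspect`; the helper names are hypothetical.

```bash
# Hypothetical helper: succeeds if the container's last exit was an OOM kill.
was_oom_killed() {
  [ "$(docker inspect --format '{{.State.OOMKilled}}' "$1")" = "true" ]
}

# Hypothetical helper: prints the configured memory limit in bytes (0 means unlimited).
memory_limit_bytes() {
  docker inspect --format '{{.HostConfig.Memory}}' "$1"
}
```

A check such as `was_oom_killed <container> && memory_limit_bytes <container>` shows whether raising `-m` is the right fix or whether the failure has another cause.
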
### Debug Running Containers
```bash
# Access container shell
docker exec -it <container> /bin/bash
docker exec -it <container> /bin/sh  # if bash not available

# Check container processes
docker exec <container> ps aux

# Check container filesystem
docker exec <container> ls -la /app
```

## Build Issues

### Build Failures
**Clear build cache when encountering issues**:
```bash
# Docker
docker builder prune      # removes build cache only
docker system prune -a    # CAUTION: also removes stopped containers, unused networks, and all unused images

# Podman
podman system prune -a    # CAUTION: same scope as the Docker command above
podman image prune -a
```

### Verbose Build Output
```bash
# Docker: plain, unabridged step output without using the cache
docker build --progress=plain --no-cache .

# Podman: rebuild without reusing cached intermediate layers
podman build --layers=false .
```

### Common Build Problems

#### COPY/ADD Errors
**Issue**: Files not found during build
**Solutions**:
```dockerfile
# Check that .dockerignore does not exclude the files you need.
# Source paths are resolved relative to the build context.
# Keep comments on their own lines: a Dockerfile has no inline comments.

# ✅ Correct: path inside the build context
COPY ./src /app/src

# ❌ Wrong: host paths outside the build context cannot be copied
COPY /absolute/path /app
```

#### Package Installation Failures
**Issue**: apt/yum/dnf package installation fails
**Solutions**:
```dockerfile
# Update package lists first
RUN apt-get update && apt-get install -y package-name

# Combine RUN commands to reduce layers
RUN apt-get update && \
    apt-get install -y package1 package2 && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*
```

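#### Network Issues During Build
**Issue**: Cannot reach package repositories
**Solutions**:
```bash
# Build on the host network to rule out bridge or DNS problems
docker build --network host .

# Use custom DNS: `docker build` has no --dns flag, so set "dns" in
# /etc/docker/daemon.json and restart the daemon. Podman accepts the flag directly:
podman build --dns 8.8.8.8 .
```
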
## GPU Container Issues

### NVIDIA GPU Support Problems

#### Docker Desktop vs Podman on Fedora/Nobara
**Issue**: Docker Desktop has GPU compatibility issues on Fedora-based systems
**Symptoms**:
- `CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected`
- `unknown or invalid runtime name: nvidia`
- Device nodes exist but CUDA fails to initialize

**Solution**: Use Podman instead of Docker on Fedora systems
```bash
# Verify host GPU works
nvidia-smi

# Test with Podman (recommended). nvidia.com/gpu=all is a CDI device name,
# so the NVIDIA Container Toolkit and a generated CDI spec must be present.
podman run --rm --device nvidia.com/gpu=all ubuntu:20.04 nvidia-smi

# Test with Docker (may fail on Fedora)
docker run --rm --gpus all ubuntu:20.04 nvidia-smi
```

#### GPU Container Configuration
**Working Podman GPU template**:
```bash
podman run -d --name gpu-container \
  --device nvidia.com/gpu=all \
  --restart unless-stopped \
  -e NVIDIA_DRIVER_CAPABILITIES=all \
  -e NVIDIA_VISIBLE_DEVICES=all \
  myapp:latest
```

**Working Docker GPU template**:
```bash
docker run -d --name gpu-container \
  --gpus all \
  --restart unless-stopped \
  -e NVIDIA_DRIVER_CAPABILITIES=all \
  -e NVIDIA_VISIBLE_DEVICES=all \
  myapp:latest
```

#### GPU Troubleshooting Steps
1. **Verify Host GPU Access**:
   ```bash
   nvidia-smi              # Should show GPU info
   lsmod | grep nvidia     # Should show nvidia modules
   ls -la /dev/nvidia*     # Should show device files
   ```

2. **Check NVIDIA Container Toolkit**:
   ```bash
   rpm -qa | grep nvidia-container-toolkit   # Fedora/RHEL
   dpkg -l | grep nvidia-container-toolkit   # Ubuntu/Debian
   nvidia-ctk --version
   ```

3. **Test GPU in Container**:
   ```bash
   # Should show GPU information
   podman exec gpu-container nvidia-smi

   # Query the GPU through NVML. nvidia-ml-py is a Python package (module pynvml),
   # not a command, so this needs python3 and that package inside the image.
   podman exec gpu-container python3 -c "import pynvml; pynvml.nvmlInit(); print(pynvml.nvmlDeviceGetCount())"
   ```

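The host-side checks from steps 1 and 2 can be run in one pass before any container is started. A minimal sketch; `gpu_preflight` is a hypothetical function name, and the checks are limited to the tools already used above (`nvidia-smi`, `lsmod`, `nvidia-ctk`).

```bash
# Hypothetical preflight for the host-side checks; prints one line per failed check.
gpu_preflight() (
  failed=0
  # Step 1: driver userspace tool and kernel module
  if ! { command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi >/dev/null 2>&1; }; then
    echo "FAIL: nvidia-smi is missing or cannot talk to the driver"
    failed=1
  fi
  if ! lsmod | grep -q '^nvidia'; then
    echo "FAIL: nvidia kernel module is not loaded"
    failed=1
  fi
  # Step 2: NVIDIA Container Toolkit CLI
  if ! command -v nvidia-ctk >/dev/null 2>&1; then
    echo "FAIL: nvidia-ctk not found (is nvidia-container-toolkit installed?)"
    failed=1
  fi
  [ "$failed" -eq 0 ] && echo "OK: host GPU prerequisites look fine"
  exit "$failed"
)
```

If `gpu_preflight` reports a failure, fixing the host first is usually more productive than debugging the container runtime.
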
#### Platform-Specific GPU Notes
**Fedora/Nobara/RHEL**:
- ✅ Podman: Works with GPU support once the NVIDIA Container Toolkit is installed
- ❌ Docker Desktop: Known GPU integration issues
- Solution: Use Podman for GPU workloads

**Ubuntu/Debian**:
- ✅ Docker: Generally works well with proper NVIDIA toolkit setup
- ✅ Podman: Also works well
- Solution: Either runtime typically works

## Performance Issues

### Resource Monitoring
**Real-time resource usage**:
```bash
# Overall container stats
docker stats
podman stats

# Inside container analysis
docker exec <container> top
docker exec <container> free -h
docker exec <container> df -h

# Network usage
docker exec <container> netstat -i
```

### Image Size Optimization
**Analyze image layers**:
```bash
# Check image sizes
docker images --format "table {{.Repository}}\t{{.Tag}}\t{{.Size}}"

# Analyze layer history
docker history <image>

# Find large directories in the container. Run the whole pipeline inside the
# container, otherwise /* is expanded by the host shell.
docker exec <container> sh -c 'du -sh /* 2>/dev/null | sort -hr'
```

**Optimization strategies**:
```dockerfile
# Use multi-stage builds
FROM node:18 AS builder
# ... build steps ...

FROM node:18-alpine AS production
COPY --from=builder /app/dist /app
# Smaller final image

# Combine RUN commands (separate example for Debian/Ubuntu-based images;
# Alpine images use apk instead of apt-get)
RUN apt-get update && \
    apt-get install -y package && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

# Use .dockerignore: a separate file next to the Dockerfile, for example:
#   node_modules
#   .git
#   *.log
```

### Storage Performance Issues
**Slow volume performance**:
```bash
# Test volume write throughput (conv=fsync flushes to disk before dd reports), then clean up
docker exec <container> dd if=/dev/zero of=/volume/test bs=1M count=1000 conv=fsync
docker exec <container> rm /volume/test

# Check volume mount options
docker inspect <container> | grep -A 10 "Mounts"

# Consider using tmpfs for temporary data
docker run --tmpfs /tmp myapp
```

## Network Debugging

### Network Connectivity Issues
**Inspect network configuration**:
```bash
# List networks
docker network ls
podman network ls

# Inspect specific network
docker network inspect <network_name>

# Check container networking
docker exec <container> ip addr show
docker exec <container> ip route show
```

### Service Discovery Problems
**Test connectivity between containers**:
```bash
# Test by container name (needs a shared user-defined network;
# Docker's default bridge network does not resolve container names)
docker exec container1 ping -c 3 container2

# Test by IP address
docker exec container1 ping -c 3 172.17.0.3

# Check DNS resolution
docker exec container1 nslookup container2
```

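Name resolution between two containers depends on both being attached to the same network, so it helps to list the networks they have in common. A minimal sketch using the `.NetworkSettings.Networks` map from `docker inspect`; the helper names `container_networks` and `shared_networks` are hypothetical.

```bash
# Hypothetical helper: prints one network name per line for a container.
container_networks() {
  docker inspect --format '{{range $name, $cfg := .NetworkSettings.Networks}}{{println $name}}{{end}}' "$1"
}

# Hypothetical helper: prints the networks both containers are attached to.
shared_networks() (
  nets_b="$(container_networks "$2")"
  container_networks "$1" | while IFS= read -r net; do
    [ -n "$net" ] || continue
    # Exact, whole-line match against the second container's network list.
    if printf '%s\n' "$nets_b" | grep -qxF -- "$net"; then
      echo "$net"
    fi
  done
)
```

If `shared_networks container1 container2` prints nothing, or only `bridge`, lookups by container name are expected to fail even when the containers can reach each other by IP.
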
### Port Binding Issues
**Verify port mappings**:
```bash
# Check exposed ports
docker port <container>

# Test external connectivity
curl localhost:8080

# Check if port is bound to all interfaces
netstat -tulpn | grep :8080
```

## Emergency Recovery

### Complete Container Reset
**Remove all containers and start fresh**:
```bash
# Stop all containers
docker stop $(docker ps -q)
podman stop --all

# Remove all stopped containers
docker container prune -f
podman container prune -f

# Remove all unused images
docker image prune -a -f
podman image prune -a -f

# Remove unused volumes (CAUTION: data loss)
# Docker 23.0+ prunes only anonymous volumes unless --all is given
docker volume prune -f
podman volume prune -f

# Complete system cleanup
docker system prune -a --volumes -f
podman system prune -a --volumes -f
```

### Container Recovery
**Recover from corrupted container**:
```bash
# Create backup of container data
docker cp <container>:/important/data ./backup/

# Export container filesystem (volume contents are not included)
docker export <container> > container-backup.tar

# Import and restart. An imported image has no CMD or ENTRYPOINT,
# so pass the command to run explicitly.
docker import container-backup.tar new-image:latest
docker run -d --name new-container new-image:latest <command>
```

### Data Recovery
**Recover data from volumes**:
```bash
# List volumes
docker volume ls

# Inspect volume location
docker volume inspect <volume_name>

# Access volume data directly
sudo ls -la /var/lib/docker/volumes/<volume_name>/_data

# Mount volume to temporary container
docker run --rm -v <volume_name>:/data alpine ls -la /data
```

## Health Check Issues

### Container Health Checks
**Implement health checks**:
```dockerfile
# Dockerfile health check (curl must be installed in the image)
HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
  CMD curl -f http://localhost:3000/health || exit 1
```

**Debug health check failures**:
```bash
# Check health status and the output of recent probes
docker inspect --format '{{json .State.Health}}' <container>

# Manual health check test
docker exec <container> curl -f http://localhost:3000/health

# Watch health status changes as they happen
docker events --filter container=<container> --filter event=health_status
```

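Scripts that start a container and then depend on it often need to wait for the health check to pass. A minimal sketch that polls `.State.Health.Status`; `wait_healthy` is a hypothetical helper, and the 2-second poll interval and 60-second default timeout are arbitrary choices.

```bash
# Hypothetical helper: waits until the container reports healthy or unhealthy.
# Usage: wait_healthy <container> [timeout_seconds]
wait_healthy() (
  container="$1"
  timeout="${2:-60}"
  elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    # Containers without a HEALTHCHECK produce no status and run into the timeout.
    status="$(docker inspect --format '{{.State.Health.Status}}' "$container" 2>/dev/null)"
    case "$status" in
      healthy)   echo "healthy after ${elapsed}s"; exit 0 ;;
      unhealthy) echo "unhealthy after ${elapsed}s"; exit 1 ;;
    esac
    sleep 2
    elapsed=$((elapsed + 2))
  done
  echo "timed out after ${timeout}s"
  exit 1
)
```

A typical call is `wait_healthy <container> 120 || docker logs --tail 50 <container>`, which prints recent logs when the container never becomes healthy.
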
## Log Analysis

### Log Management
**View and manage container logs**:
```bash
# View recent logs
docker logs --tail 100 <container>

# Follow logs in real-time
docker logs -f <container>

# Logs with timestamps
docker logs -t <container>

# Search logs for errors
docker logs <container> 2>&1 | grep ERROR
```

### Log Rotation Issues
**Configure log rotation to prevent disk filling**:
```bash
# Run with log size limits
docker run --log-opt max-size=10m --log-opt max-file=3 myapp

# Check log file sizes (the glob must expand as root, so run it in a root shell)
sudo sh -c 'du -sh /var/lib/docker/containers/*/'
```

## Platform-Specific Issues

### Fedora/Nobara/RHEL Systems
- **GPU Support**: Use Podman instead of Docker Desktop
- **SELinux**: Bind mounts may need relabeling (`:z` or `:Z` suffix on the volume mount)
- **Firewall**: Configure firewalld for container networking

### Ubuntu/Debian Systems
- **AppArmor**: May restrict container operations
- **Snap Docker**: May have permission issues vs native package

### General Linux Issues
- **cgroups v2**: Some older containers need cgroups v1
- **User namespaces**: May cause UID/GID mapping issues
- **systemd**: Integration differences between Docker/Podman

## Prevention Best Practices

1. **Resource Limits**: Always set memory and CPU limits
2. **Health Checks**: Implement application health monitoring
3. **Log Rotation**: Configure to prevent disk space issues
4. **Security Scanning**: Regular vulnerability scans
5. **Backup Strategy**: Regular data and configuration backups
6. **Testing**: Test containers in staging before production
7. **Documentation**: Document container configurations and dependencies

This troubleshooting guide covers the most common Docker and Podman container issues encountered in home lab and production environments.