# Docker Container Troubleshooting Guide

## Container Startup Issues

### Container Won't Start
**Check container logs first**:
```bash
# Docker
docker logs <container_name>
docker logs --tail 50 -f <container_name>

# Podman
podman logs <container_name>
podman logs --tail 50 -f <container_name>
```

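When the logs are empty or inconclusive, the container's exit code often narrows things down. The helper below is a minimal sketch: the function name `explain_exit_code` is hypothetical, and the meanings follow the documented `docker run` exit statuses (125, 126, 127) plus the usual 128+signal convention.

```bash
# explain_exit_code: hypothetical helper that maps a container exit code
# to its most common meaning.
explain_exit_code() {
  case "$1" in
    0)   echo "exited cleanly" ;;
    125) echo "docker run itself failed (bad flag or daemon error)" ;;
    126) echo "container command could not be invoked (not executable)" ;;
    127) echo "container command not found" ;;
    137) echo "killed by SIGKILL (often the OOM killer or docker kill)" ;;
    139) echo "segmentation fault (SIGSEGV)" ;;
    143) echo "terminated by SIGTERM (docker stop)" ;;
    *)   echo "application-specific exit code $1" ;;
  esac
}

# Usage (requires a Docker host):
#   code=$(docker inspect --format '{{.State.ExitCode}}' <container_name>)
#   explain_exit_code "$code"
```
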
### Common Startup Failures

#### Port Conflicts
**Symptoms**: `bind: address already in use` error
**Solution**:
```bash
# Find conflicting process
sudo netstat -tulpn | grep <port>
docker ps | grep <port>

# Change port mapping
docker run -p 8081:8080 myapp  # Use different host port
```

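A plain `grep <port>` over `docker ps` output also matches container names, image tags, and container-side ports. The sketch below filters on the published (host) port only. The function name `containers_on_port` is hypothetical; it assumes tab-separated `name<TAB>ports` input such as `docker ps --format '{{.Names}}\t{{.Ports}}'` produces.

```bash
# containers_on_port: hypothetical helper. Reads "name<TAB>ports" lines on
# stdin and prints the names of containers that publish the given host port.
containers_on_port() {
  local port="$1"
  # Published ports look like "0.0.0.0:8080->80/tcp", so match ":PORT->".
  awk -F'\t' -v pat=":${port}->" 'index($2, pat) > 0 { print $1 }'
}

# Usage (requires a Docker host):
#   docker ps --format '{{.Names}}\t{{.Ports}}' | containers_on_port 8080
```
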
#### Permission Errors
**Symptoms**: `permission denied` when accessing files/volumes
**Solutions**:
```bash
# Check file ownership
ls -la /host/volume/path

# Fix ownership (match container user)
sudo chown -R 1000:1000 /host/volume/path

# Use correct UID/GID in container
# (PUID/PGID only work with images that read them, e.g. linuxserver.io images)
docker run -e PUID=1000 -e PGID=1000 myapp
# For other images, set the user directly
docker run --user 1000:1000 myapp
```

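Before changing ownership, it can help to confirm that a mismatch is really the problem. This is a small sketch under the assumption of a Linux host with GNU `stat`; the function name `owner_matches` is hypothetical.

```bash
# owner_matches: hypothetical helper. Succeeds when the path is owned by the
# expected "UID:GID" pair. Uses GNU stat (-c), as found on most Linux hosts.
owner_matches() {
  local path="$1" expected="$2"
  local actual
  actual=$(stat -c '%u:%g' "$path") || return 2
  [ "$actual" = "$expected" ]
}

# Usage:
#   owner_matches /host/volume/path 1000:1000 || echo "ownership mismatch"
```
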
#### Missing Environment Variables
**Symptoms**: Application fails with configuration errors
**Diagnostic**:
```bash
# Check container environment (container must be running)
docker exec -it <container> env
docker exec -it <container> printenv

# Verify required variables are set (also works for stopped containers)
docker inspect <container> | grep -A 20 "Env"
```

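Scanning the environment by eye gets tedious when an application needs many variables. The sketch below reports which required names are absent; the function name `missing_vars` and the variable names in the usage comment are placeholders, not names taken from any particular image.

```bash
# missing_vars: hypothetical helper. Reads KEY=VALUE lines on stdin and
# prints each required variable name (given as arguments) that is absent.
# Returns non-zero if anything is missing.
missing_vars() {
  local env_dump name status=0
  env_dump=$(cat)
  for name in "$@"; do
    if ! printf '%s\n' "$env_dump" | grep -q "^${name}="; then
      echo "$name"
      status=1
    fi
  done
  return "$status"
}

# Usage (DATABASE_URL and API_KEY are placeholder names):
#   docker exec <container> env | missing_vars DATABASE_URL API_KEY
```
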
#### Resource Constraints
**Symptoms**: Container killed or OOM errors
**Solutions**:
```bash
# Check resource usage
docker stats <container>

# Increase memory limit
docker run -m 4g myapp

# Check system resources
free -h
df -h
```

### Debug Running Containers
```bash
# Access container shell
docker exec -it <container> /bin/bash
docker exec -it <container> /bin/sh  # if bash not available

# Check container processes
docker exec <container> ps aux

# Check container filesystem
docker exec <container> ls -la /app
```

## Build Issues

### Build Failures
**Clear build cache when encountering issues**:
```bash
# Docker
docker system prune -a   # Also removes stopped containers, unused networks, and all unused images
docker builder prune     # Removes build cache only

# Podman
podman system prune -a
podman image prune -a
```

### Verbose Build Output
```bash
# Docker
docker build --progress=plain --no-cache .

# Podman (disables layer caching so every step runs again)
podman build --layers=false .
```

### Common Build Problems

#### COPY/ADD Errors
**Issue**: Files not found during build
**Solutions**:
```dockerfile
# Check .dockerignore file
# Verify file paths relative to build context

# ✅ Correct: source path is relative to the build context
COPY ./src /app/src

# ❌ Wrong: a leading slash still resolves inside the build context, not the host filesystem
COPY /absolute/path /app
```

#### Package Installation Failures
**Issue**: apt/yum/dnf package installation fails
**Solutions**:
```dockerfile
# Update package lists first
RUN apt-get update && apt-get install -y package-name

# Combine RUN commands to reduce layers
RUN apt-get update && \
    apt-get install -y package1 package2 && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*
```

#### Network Issues During Build
**Issue**: Cannot reach package repositories
**Solutions**:
```bash
# Build with the host network stack (bypasses bridge network and DNS problems)
docker build --network host .

# Use custom DNS: docker build has no --dns flag, so set "dns": ["8.8.8.8"]
# in /etc/docker/daemon.json, restart the daemon, then rebuild
sudo systemctl restart docker
docker build .
```

## GPU Container Issues

### NVIDIA GPU Support Problems

#### Docker Desktop vs Podman on Fedora/Nobara
**Issue**: Docker Desktop has GPU compatibility issues on Fedora-based systems
**Symptoms**:
- `CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected`
- `unknown or invalid runtime name: nvidia`
- Device nodes exist but CUDA fails to initialize

**Solution**: Use Podman instead of Docker Desktop on Fedora-based systems
```bash
# Verify host GPU works
nvidia-smi

# Test with Podman (recommended)
podman run --rm --device nvidia.com/gpu=all ubuntu:20.04 nvidia-smi

# Test with Docker (may fail on Fedora)
docker run --rm --gpus all ubuntu:20.04 nvidia-smi
```

#### GPU Container Configuration
**Working Podman GPU template**:
```bash
podman run -d --name gpu-container \
  --device nvidia.com/gpu=all \
  --restart unless-stopped \
  -e NVIDIA_DRIVER_CAPABILITIES=all \
  -e NVIDIA_VISIBLE_DEVICES=all \
  myapp:latest
```

**Working Docker GPU template**:
```bash
docker run -d --name gpu-container \
  --gpus all \
  --restart unless-stopped \
  -e NVIDIA_DRIVER_CAPABILITIES=all \
  -e NVIDIA_VISIBLE_DEVICES=all \
  myapp:latest
```

#### GPU Troubleshooting Steps
1. **Verify Host GPU Access**:
   ```bash
   nvidia-smi           # Should show GPU info
   lsmod | grep nvidia  # Should show nvidia modules
   ls -la /dev/nvidia*  # Should show device files
   ```

2. **Check NVIDIA Container Toolkit**:
   ```bash
   rpm -qa | grep nvidia-container-toolkit  # Fedora/RHEL
   dpkg -l | grep nvidia-container-toolkit  # Ubuntu/Debian
   nvidia-ctk --version
   ```

3. **Test GPU in Container**:
   ```bash
   # Should show GPU information
   podman exec gpu-container nvidia-smi

   # Query the GPU through NVML (nvidia-ml-py is a Python package, not a command;
   # this assumes python3 and nvidia-ml-py are installed in the image)
   podman exec gpu-container python3 -c "import pynvml; pynvml.nvmlInit(); print(pynvml.nvmlDeviceGetCount())"
   ```

#### Platform-Specific GPU Notes
**Fedora/Nobara/RHEL**:
- ✅ Podman: Works out-of-the-box with GPU support
- ❌ Docker Desktop: Known GPU integration issues
- Solution: Use Podman for GPU workloads

**Ubuntu/Debian**:
- ✅ Docker: Generally works well with proper NVIDIA toolkit setup
- ✅ Podman: Also works well
- Solution: Either runtime typically works

## Performance Issues

### Resource Monitoring
**Real-time resource usage**:
```bash
# Overall container stats
docker stats
podman stats

# Inside container analysis
docker exec <container> top
docker exec <container> free -h
docker exec <container> df -h

# Network usage
docker exec <container> netstat -i
```

### Image Size Optimization
**Analyze image layers**:
```bash
# Check image sizes
docker images --format "table {{.Repository}}\t{{.Tag}}\t{{.Size}}"

# Analyze layer history
docker history <image>

# Find large files in container
# (run the glob inside the container; an unquoted /* would expand on the host)
docker exec <container> sh -c 'du -sh /* 2>/dev/null' | sort -hr
```

**Optimization strategies**:
```dockerfile
# Use multi-stage builds
FROM node:18 AS builder
# ... build steps ...

# Smaller final image
FROM node:18-alpine AS production
COPY --from=builder /app/dist /app

# Combine RUN commands (Debian/Ubuntu-based images; Alpine uses apk instead)
RUN apt-get update && \
    apt-get install -y package && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

# Use a .dockerignore file next to the Dockerfile, for example:
#   node_modules
#   .git
#   *.log
```

### Storage Performance Issues
**Slow volume performance**:
```bash
# Test volume I/O performance (writes a 1 GB test file, then removes it)
docker exec <container> dd if=/dev/zero of=/volume/test bs=1M count=1000
docker exec <container> rm /volume/test

# Check volume mount options
docker inspect <container> | grep -A 10 "Mounts"

# Consider using tmpfs for temporary data
docker run --tmpfs /tmp myapp
```

## Network Debugging

### Network Connectivity Issues
**Inspect network configuration**:
```bash
# List networks
docker network ls
podman network ls

# Inspect specific network
docker network inspect <network_name>

# Check container networking
docker exec <container> ip addr show
docker exec <container> ip route show
```

### Service Discovery Problems
**Test connectivity between containers**:
```bash
# Test by container name
# (both containers must share a user-defined network; the default bridge has no name resolution)
docker exec container1 ping container2

# Test by IP address
docker exec container1 ping 172.17.0.3

# Check DNS resolution
docker exec container1 nslookup container2
```

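If name lookups fail because the containers sit on the default bridge, attaching them to a shared user-defined network is one way to get name resolution. The following is a sketch, not a definitive procedure: the function name `join_network` and the network name `appnet` are hypothetical, while `docker network inspect`, `create`, and `connect` are standard subcommands.

```bash
# join_network: hypothetical helper. Creates a user-defined bridge network
# if it does not exist yet, then attaches each named container to it so the
# containers can resolve one another by name.
join_network() {
  local network="$1"
  shift
  docker network inspect "$network" >/dev/null 2>&1 ||
    docker network create "$network" >/dev/null
  local container
  for container in "$@"; do
    docker network connect "$network" "$container"
  done
}

# Usage (appnet is a placeholder network name):
#   join_network appnet container1 container2
#   docker exec container1 ping -c 3 container2
```
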
### Port Binding Issues
**Verify port mappings**:
```bash
# Check exposed ports
docker port <container>

# Test external connectivity
curl localhost:8080

# Check if port is bound to all interfaces
netstat -tulpn | grep :8080
```

## Emergency Recovery

### Complete Container Reset
**Remove all containers and start fresh**:
```bash
# Stop all containers
docker stop $(docker ps -q)
podman stop --all

# Remove all stopped containers
docker container prune -f
podman container prune -f

# Remove all unused images
docker image prune -a -f
podman image prune -a -f

# Remove unused volumes (CAUTION: data loss)
# On Docker 23.0 and later this removes only anonymous volumes; add -a to include named volumes
docker volume prune -f
podman volume prune -f

# Complete system cleanup
docker system prune -a --volumes -f
podman system prune -a --volumes -f
```

### Container Recovery
**Recover from corrupted container**:
```bash
# Create backup of container data
docker cp <container>:/important/data ./backup/

# Export container filesystem
docker export <container> > container-backup.tar

# Import and restart
# (docker import keeps only the filesystem; CMD, ENTRYPOINT, and ENV are lost,
# so pass the command to run explicitly)
docker import container-backup.tar new-image:latest
docker run -d --name new-container new-image:latest <command>
```

### Data Recovery
**Recover data from volumes**:
```bash
# List volumes
docker volume ls

# Inspect volume location
docker volume inspect <volume_name>

# Access volume data directly
sudo ls -la /var/lib/docker/volumes/<volume_name>/_data

# Mount volume to temporary container
docker run --rm -v <volume_name>:/data alpine ls -la /data
```

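The temporary-container approach extends naturally to taking a full archive of a volume before attempting repairs. A minimal sketch, assuming the `alpine` image is available and the destination is an absolute host path; the function name `backup_volume` is hypothetical.

```bash
# backup_volume: hypothetical helper. Archives a named volume into a
# gzip-compressed tarball in the given host directory, using a throwaway
# Alpine container that mounts the volume read-only.
backup_volume() {
  local volume="$1" dest_dir="$2"
  docker run --rm \
    -v "${volume}:/data:ro" \
    -v "${dest_dir}:/backup" \
    alpine tar czf "/backup/${volume}.tar.gz" -C /data .
}

# Usage (dest_dir must be an absolute host path):
#   backup_volume <volume_name> "$PWD/backup"
```

Mounting the volume read-only keeps the backup step from modifying data that may already be damaged.
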
## Health Check Issues

### Container Health Checks
**Implement health checks**:
```dockerfile
# Dockerfile health check (curl must be installed in the image)
HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
    CMD curl -f http://localhost:3000/health || exit 1
```

**Debug health check failures**:
```bash
# Check health status
docker inspect <container> | grep -A 10 Health

# Manual health check test
docker exec <container> curl -f http://localhost:3000/health

# Check health check logs
docker events --filter container=<container>
```

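In scripts it is often more useful to wait for the health status than to read it once. The helper below is a sketch under the assumption that the container defines a `HEALTHCHECK`; the function name `wait_for_healthy` and its default values are hypothetical, while `{{.State.Health.Status}}` is the standard `docker inspect` field.

```bash
# wait_for_healthy: hypothetical helper. Polls a container's health status
# until it reports "healthy" (returns 0), reports "unhealthy" (returns 1),
# or the attempts run out (returns 2).
# Arguments: container name, max attempts (default 30), delay in seconds (default 2).
wait_for_healthy() {
  local container="$1" attempts="${2:-30}" delay="${3:-2}"
  local status i=0
  while [ "$i" -lt "$attempts" ]; do
    status=$(docker inspect --format '{{.State.Health.Status}}' "$container" 2>/dev/null) || status="unknown"
    case "$status" in
      healthy)   return 0 ;;
      unhealthy) return 1 ;;
    esac
    sleep "$delay"
    i=$((i + 1))
  done
  return 2  # still "starting" or unknown after all attempts
}

# Usage:
#   wait_for_healthy <container> 30 2 && echo "ready"
```
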
## Log Analysis

### Log Management
**View and manage container logs**:
```bash
# View recent logs
docker logs --tail 100 <container>

# Follow logs in real-time
docker logs -f <container>

# Logs with timestamps
docker logs -t <container>

# Search logs for errors
docker logs <container> 2>&1 | grep ERROR
```

### Log Rotation Issues
**Configure log rotation to prevent disk filling**:
```bash
# Run with log size limits
docker run --log-opt max-size=10m --log-opt max-file=3 myapp

# Check log file sizes
# (the glob must expand as root, so run it inside a root shell)
sudo sh -c 'du -sh /var/lib/docker/containers/*/'
```

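Per-container `--log-opt` flags are easy to forget, so the same limits can also be set as daemon-wide defaults. The sketch below writes an example fragment to a scratch file rather than touching the live configuration. The function name `write_log_rotation_config` and the output file name are hypothetical; the `log-driver` and `log-opts` keys are standard `daemon.json` settings, and they only affect containers created after the daemon restarts.

```bash
# write_log_rotation_config: hypothetical helper. Writes a daemon.json
# fragment with default json-file log rotation limits to the given file.
write_log_rotation_config() {
  local target="$1"
  cat > "$target" <<'EOF'
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}
EOF
}

# Usage: write to a scratch file, review it, then merge it into
# /etc/docker/daemon.json and restart the Docker daemon.
#   write_log_rotation_config ./daemon.json.example
```
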
## Platform-Specific Issues

### Fedora/Nobara/RHEL Systems
- **GPU Support**: Use Podman instead of Docker Desktop
- **SELinux**: Bind mounts may need relabeling (`:z` or `:Z` suffix on the `-v` volume option)
- **Firewall**: Configure firewalld for container networking

### Ubuntu/Debian Systems
- **AppArmor**: May restrict container operations
- **Snap Docker**: May have permission issues vs native package

### General Linux Issues
- **cgroups v2**: Some older containers need cgroups v1
- **User namespaces**: May cause UID/GID mapping issues
- **systemd**: Integration differences between Docker/Podman

## Prevention Best Practices

1. **Resource Limits**: Always set memory and CPU limits
2. **Health Checks**: Implement application health monitoring
3. **Log Rotation**: Configure to prevent disk space issues
4. **Security Scanning**: Regular vulnerability scans
5. **Backup Strategy**: Regular data and configuration backups
6. **Testing**: Test containers in staging before production
7. **Documentation**: Document container configurations and dependencies

This troubleshooting guide covers the most common Docker and Podman container issues encountered in home lab and production environments.