claude-home/docker/troubleshooting.md
Cal Corum 10c9e0d854 CLAUDE: Migrate to technology-first documentation architecture
2025-08-12 23:20:15 -05:00


# Docker Container Troubleshooting Guide
## Container Startup Issues
### Container Won't Start
**Check container logs first**:
```bash
# Docker
docker logs <container_name>
docker logs --tail 50 -f <container_name>
# Podman
podman logs <container_name>
podman logs --tail 50 -f <container_name>
```
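When the logs are empty or inconclusive, the container's exit code often narrows down the cause. The sketch below maps commonly seen codes to their usual meaning. `explain_exit_code` is a hypothetical helper name, and the mapping relies on the shell convention that codes above 128 mean "killed by signal (code minus 128)".

```bash
# Hypothetical helper: translate a container exit code into a likely cause.
explain_exit_code() {
  local code="$1"
  case "$code" in
    0)   echo "exited normally" ;;
    125) echo "container runtime error (the run command itself failed)" ;;
    126) echo "command found but not executable" ;;
    127) echo "command not found in the image" ;;
    137) echo "killed by SIGKILL (often the OOM killer or a forced stop)" ;;
    139) echo "segmentation fault (SIGSEGV)" ;;
    143) echo "terminated by SIGTERM (normal stop request)" ;;
    *)   # Non-numeric input falls through to the generic branch
         if [ "$code" -gt 128 ] 2>/dev/null; then
           echo "killed by signal $((code - 128))"
         else
           echo "application-defined error"
         fi ;;
  esac
}

# Usage (assumes the container still exists, i.e. it was not started with --rm):
#   explain_exit_code "$(docker inspect --format '{{.State.ExitCode}}' <container_name>)"
```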
### Common Startup Failures
#### Port Conflicts
**Symptoms**: `bind: address already in use` error
**Solution**:
```bash
# Find conflicting process
sudo netstat -tulpn | grep <port>
docker ps | grep <port>
# Change port mapping
docker run -p 8081:8080 myapp # Use different host port
```
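To check a port from a script before starting a container, the listening-socket table can be parsed directly. This is a minimal sketch that assumes `ss` from iproute2 is available as an alternative to `netstat`; `port_is_listening` is a hypothetical helper name.

```bash
# Hypothetical helper: succeed if PORT appears as a local listening port
# in `ss -ltn` style output read from stdin.
port_is_listening() {
  local port="$1"
  # Column 4 of `ss -ltn` is the local address, e.g. 0.0.0.0:8080 or [::]:8080.
  # The port is whatever follows the last colon.
  awk -v p="$port" '
    NR > 1 { n = split($4, parts, ":"); if (parts[n] == p) found = 1 }
    END    { exit !found }
  '
}

# Usage:
#   ss -ltn | port_is_listening 8080 && echo "port 8080 is already in use"
```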
#### Permission Errors
**Symptoms**: `permission denied` when accessing files/volumes
**Solutions**:
```bash
# Check file ownership
ls -la /host/volume/path
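# Fix ownership (match the UID:GID the container process runs as)
sudo chown -R 1000:1000 /host/volume/path
# Run the container process as a matching UID:GID
docker run --user 1000:1000 myapp
# Images that support PUID/PGID (e.g. linuxserver.io images) take them as env vars instead
docker run -e PUID=1000 -e PGID=1000 myapp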
```
#### Missing Environment Variables
**Symptoms**: Application fails with configuration errors
**Diagnostic**:
```bash
# Check container environment
docker exec -it <container> env
docker exec -it <container> printenv
# Verify required variables are set
docker inspect <container> | grep -A 20 "Env"
```
#### Resource Constraints
**Symptoms**: Container killed or OOM errors
**Solutions**:
```bash
# Check resource usage
docker stats <container>
# Increase memory limit
docker run -m 4g myapp
# Check system resources
free -h
df -h
```
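To confirm that a stopped container was killed for exceeding its memory limit, the runtime records an `OOMKilled` flag in the container state. A minimal sketch, with `was_oom_killed` as a hypothetical helper name:

```bash
# Hypothetical helper: succeed if the kernel OOM killer ended the container.
# Returns 2 if the container cannot be inspected at all.
was_oom_killed() {
  local name="$1" flag
  flag="$(docker inspect --format '{{.State.OOMKilled}}' "$name")" || return 2
  [ "$flag" = "true" ]
}

# Usage:
#   was_oom_killed <container> && echo "raise the -m limit or reduce memory use"
```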
### Debug Running Containers
```bash
# Access container shell
docker exec -it <container> /bin/bash
docker exec -it <container> /bin/sh # if bash not available
# Check container processes
docker exec <container> ps aux
# Check container filesystem
docker exec <container> ls -la /app
```
## Build Issues
### Build Failures
**Clear build cache when encountering issues**:
```bash
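# Docker: builder prune clears only the build cache
docker builder prune
# system prune -a also removes stopped containers, unused networks, and ALL unused images
docker system prune -a
# Podman
podman system prune -a
podman image prune -a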
```
### Verbose Build Output
```bash
# Docker
docker build --progress=plain --no-cache .
# Podman (--no-cache re-runs every step; --log-level=debug adds runtime detail)
podman --log-level=debug build --no-cache .
```
### Common Build Problems
#### COPY/ADD Errors
**Issue**: Files not found during build
**Solutions**:
```dockerfile
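# Check .dockerignore: excluded files are invisible to COPY/ADD
# Source paths are resolved relative to the build context
# ✅ Correct: path inside the build context
COPY ./src /app/src
# ❌ Wrong: host paths outside the build context cannot be copied
COPY /absolute/path /app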
```
#### Package Installation Failures
**Issue**: apt/yum/dnf package installation fails
**Solutions**:
```dockerfile
# Update package lists first
RUN apt-get update && apt-get install -y package-name
# Combine RUN commands to reduce layers
RUN apt-get update && \
    apt-get install -y package1 package2 && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*
```
#### Network Issues During Build
**Issue**: Cannot reach package repositories
**Solutions**:
```bash
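# Build on the host's network stack (bypasses bridge network and DNS problems)
docker build --network host .
# Use custom DNS: docker build has no --dns flag, so DNS is typically set for the
# daemon instead ("dns": ["8.8.8.8"] in /etc/docker/daemon.json, then restart it).
# Podman accepts the flag directly:
podman build --dns 8.8.8.8 .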
```
## GPU Container Issues
### NVIDIA GPU Support Problems
#### Docker Desktop vs Podman on Fedora/Nobara
**Issue**: Docker Desktop has GPU compatibility issues on Fedora-based systems
**Symptoms**:
- `CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected`
- `unknown or invalid runtime name: nvidia`
- Device nodes exist but CUDA fails to initialize
**Solution**: Use Podman instead of Docker on Fedora systems
```bash
# Verify host GPU works
nvidia-smi
# Test with Podman (recommended)
podman run --rm --device nvidia.com/gpu=all ubuntu:20.04 nvidia-smi
# Test with Docker (may fail on Fedora)
docker run --rm --gpus all ubuntu:20.04 nvidia-smi
```
#### GPU Container Configuration
**Working Podman GPU template**:
```bash
podman run -d --name gpu-container \
  --device nvidia.com/gpu=all \
  --restart unless-stopped \
  -e NVIDIA_DRIVER_CAPABILITIES=all \
  -e NVIDIA_VISIBLE_DEVICES=all \
  myapp:latest
```
**Working Docker GPU template**:
```bash
docker run -d --name gpu-container \
  --gpus all \
  --restart unless-stopped \
  -e NVIDIA_DRIVER_CAPABILITIES=all \
  -e NVIDIA_VISIBLE_DEVICES=all \
  myapp:latest
```
#### GPU Troubleshooting Steps
1. **Verify Host GPU Access**:
```bash
nvidia-smi # Should show GPU info
lsmod | grep nvidia # Should show nvidia modules
ls -la /dev/nvidia* # Should show device files
```
2. **Check NVIDIA Container Toolkit**:
```bash
rpm -qa | grep nvidia-container-toolkit # Fedora/RHEL
dpkg -l | grep nvidia-container-toolkit # Ubuntu/Debian
nvidia-ctk --version
```
3. **Test GPU in Container**:
```bash
# Should show GPU information
podman exec gpu-container nvidia-smi
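# Query the driver through NVML (nvidia-ml-py is a Python package, not a command;
# this assumes python3 and nvidia-ml-py are installed in the image)
podman exec gpu-container python3 -c "import pynvml; pynvml.nvmlInit(); print(pynvml.nvmlDeviceGetCount())"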
```
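The Podman commands in this section address the GPU as `nvidia.com/gpu=all`, which is a Container Device Interface (CDI) device name, so they depend on a CDI spec existing on the host. The check below is a minimal sketch. It assumes the spec lives at `/etc/cdi/nvidia.yaml` (a common location, but not one this guide confirms), and `cdi_spec_status` is a hypothetical helper name.

```bash
# Hypothetical helper: report whether a CDI spec file exists and is non-empty.
# The default path is an assumption; pass a different one as the first argument.
cdi_spec_status() {
  local spec="${1:-/etc/cdi/nvidia.yaml}"
  if [ -s "$spec" ]; then
    echo "CDI spec present: $spec"
  else
    echo "CDI spec missing; generate it with: sudo nvidia-ctk cdi generate --output=$spec"
    return 1
  fi
}

# Usage:
#   cdi_spec_status && nvidia-ctk cdi list
```

If the spec is missing, regenerating it after a driver upgrade is also worth trying, since the spec references driver files by path.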
#### Platform-Specific GPU Notes
**Fedora/Nobara/RHEL**:
- ✅ Podman: Works out-of-the-box with GPU support
- ❌ Docker Desktop: Known GPU integration issues
- Solution: Use Podman for GPU workloads
**Ubuntu/Debian**:
- ✅ Docker: Generally works well with proper NVIDIA toolkit setup
- ✅ Podman: Also works well
- Solution: Either runtime typically works
## Performance Issues
### Resource Monitoring
**Real-time resource usage**:
```bash
# Overall container stats
docker stats
podman stats
# Inside container analysis
docker exec <container> top
docker exec <container> free -h
docker exec <container> df -h
# Network usage
docker exec <container> netstat -i
```
### Image Size Optimization
**Analyze image layers**:
```bash
# Check image sizes
docker images --format "table {{.Repository}}\t{{.Tag}}\t{{.Size}}"
# Analyze layer history
docker history <image>
# Find large directories in container (sh -c makes the glob expand inside the container, not on the host)
docker exec <container> sh -c 'du -sh /* 2>/dev/null' | sort -hr
```
**Optimization strategies**:
```dockerfile
# Use multi-stage builds: build in a full image, ship a smaller final image
FROM node:18 AS builder
# ... build steps ...
FROM node:18-alpine AS production
COPY --from=builder /app/dist /app
# Combine RUN commands (Debian/Ubuntu-based images; Alpine images use apk instead)
RUN apt-get update && \
    apt-get install -y package && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*
# Use a .dockerignore file in the build context root, for example:
#   node_modules
#   .git
#   *.log
```
### Storage Performance Issues
**Slow volume performance**:
```bash
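# Test volume write throughput (conv=fsync flushes to disk so the page cache does not inflate the result)
docker exec <container> dd if=/dev/zero of=/volume/test bs=1M count=1000 conv=fsync
# Remove the test file afterwards
docker exec <container> rm /volume/test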
# Check volume mount options
docker inspect <container> | grep -A 10 "Mounts"
# Consider using tmpfs for temporary data
docker run --tmpfs /tmp myapp
```
## Network Debugging
### Network Connectivity Issues
**Inspect network configuration**:
```bash
# List networks
docker network ls
podman network ls
# Inspect specific network
docker network inspect <network_name>
# Check container networking
docker exec <container> ip addr show
docker exec <container> ip route show
```
### Service Discovery Problems
**Test connectivity between containers**:
```bash
# Test by container name (needs a shared user-defined network; Docker's default bridge does not resolve names)
docker exec container1 ping container2
# Test by IP address
docker exec container1 ping 172.17.0.3
# Check DNS resolution
docker exec container1 nslookup container2
```
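Name-based lookups only succeed when both containers are attached to the same network, so it helps to list what they have in common. A minimal sketch; `container_networks` and `shared_networks` are hypothetical helper names.

```bash
# Hypothetical helper: print the networks a container is attached to, one per line.
container_networks() {
  docker inspect \
    --format '{{range $k, $v := .NetworkSettings.Networks}}{{println $k}}{{end}}' "$1" \
    | sed '/^$/d' | sort
}

# Hypothetical helper: print networks shared by two containers.
# Fails (non-zero exit) when they have none in common.
shared_networks() {
  local a b
  a="$(container_networks "$1")"
  b="$(container_networks "$2")"
  [ -n "$a" ] && [ -n "$b" ] || return 1
  # Each line of $b is treated as a fixed-string, whole-line pattern
  printf '%s\n' "$a" | grep -Fx -e "$b"
}

# Usage:
#   shared_networks container1 container2 || echo "no common network"
```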
### Port Binding Issues
**Verify port mappings**:
```bash
# Check exposed ports
docker port <container>
# Test external connectivity
curl localhost:8080
# Check if port is bound to all interfaces
netstat -tulpn | grep :8080
```
## Emergency Recovery
### Complete Container Reset
**Remove all containers, images, and volumes and start fresh** (destructive; back up anything you need first):
```bash
# Stop all containers
docker stop $(docker ps -q)
podman stop --all
# Remove all containers
docker container prune -f
podman container prune -f
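# Remove all images (with no containers left, every image counts as unused)
docker image prune -a -f
podman image prune -a -f
# Remove unused volumes (CAUTION: data loss)
# Recent Docker releases prune only anonymous volumes here unless --all is added
docker volume prune -f
podman volume prune -f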
# Complete system cleanup
docker system prune -a --volumes -f
podman system prune -a --volumes -f
```
### Container Recovery
**Recover from corrupted container**:
```bash
# Create backup of container data
docker cp <container>:/important/data ./backup/
# Export container filesystem
docker export <container> > container-backup.tar
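# Import as a new image; docker import drops CMD, ENTRYPOINT, and ENV metadata,
# so supply the command explicitly when starting the container
docker import container-backup.tar new-image:latest
docker run -d --name new-container new-image:latest <command>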
```
### Data Recovery
**Recover data from volumes**:
```bash
# List volumes
docker volume ls
# Inspect volume location
docker volume inspect <volume_name>
# Access volume data directly
sudo ls -la /var/lib/docker/volumes/<volume_name>/_data
# Mount volume to temporary container
docker run --rm -v <volume_name>:/data alpine ls -la /data
```
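Before attempting repairs on a volume, it is usually worth archiving its contents through a throwaway container rather than touching `/var/lib/docker` directly. A minimal sketch; `backup_volume` is a hypothetical helper name, and the `alpine` image is used only because it ships `tar`.

```bash
# Hypothetical helper: archive a named volume into DEST/<volume>.tar.gz.
# The volume is mounted read-only so the backup cannot modify it.
backup_volume() {
  local volume="$1" dest="${2:-$PWD}"
  docker run --rm \
    -v "${volume}:/data:ro" \
    -v "${dest}:/backup" \
    alpine tar czf "/backup/${volume}.tar.gz" -C /data .
}

# Usage:
#   backup_volume <volume_name> /path/to/backups
```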
## Health Check Issues
### Container Health Checks
**Implement health checks**:
```dockerfile
# Dockerfile health check
HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
CMD curl -f http://localhost:3000/health || exit 1
```
**Debug health check failures**:
```bash
# Check health status and the output of recent probes
docker inspect --format '{{json .State.Health}}' <container>
# Manual health check test
docker exec <container> curl -f http://localhost:3000/health
# Watch health status changes as they happen
docker events --filter container=<container> --filter event=health_status
```
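Scripts that start a container and then depend on it can poll the health status instead of sleeping for a fixed time. A minimal sketch; `wait_until_healthy` is a hypothetical helper name, and the 60-second default timeout is an arbitrary assumption.

```bash
# Hypothetical helper: poll until the container reports "healthy",
# reports "unhealthy", or the timeout (in seconds) runs out.
wait_until_healthy() {
  local name="$1" timeout="${2:-60}" waited=0 status
  while [ "$waited" -lt "$timeout" ]; do
    # The status is empty if the container has no HEALTHCHECK or does not exist
    status="$(docker inspect --format '{{.State.Health.Status}}' "$name" 2>/dev/null)"
    case "$status" in
      healthy)   return 0 ;;
      unhealthy) echo "$name is unhealthy" >&2; return 1 ;;
    esac
    sleep 1
    waited=$((waited + 1))
  done
  echo "timed out after ${timeout}s waiting for $name" >&2
  return 1
}

# Usage:
#   wait_until_healthy <container> 120 && echo "ready"
```

A container without a health check never reports a status, so this helper runs until the timeout in that case.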
## Log Analysis
### Log Management
**View and manage container logs**:
```bash
# View recent logs
docker logs --tail 100 <container>
# Follow logs in real-time
docker logs -f <container>
# Logs with timestamps
docker logs -t <container>
# Search logs for errors
docker logs <container> 2>&1 | grep ERROR
```
### Log Rotation Issues
**Configure log rotation to prevent disk filling**:
```bash
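# Run with log size limits (applies to the json-file and local logging drivers)
docker run --log-opt max-size=10m --log-opt max-file=3 myapp
# Check log file sizes (glob expanded as root, since /var/lib/docker is usually not readable by other users)
sudo sh -c 'du -sh /var/lib/docker/containers/*/'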
```
## Platform-Specific Issues
### Fedora/Nobara/RHEL Systems
- **GPU Support**: Use Podman instead of Docker Desktop
- **SELinux**: Volume mounts may need relabeling with the `:Z` (private) or `:z` (shared) mount suffix
- **Firewall**: Configure firewalld for container networking
### Ubuntu/Debian Systems
- **AppArmor**: May restrict container operations
- **Snap Docker**: May have permission issues vs native package
### General Linux Issues
- **cgroups v2**: Some older containers need cgroups v1
- **User namespaces**: May cause UID/GID mapping issues
- **systemd**: Integration differences between Docker/Podman
## Prevention Best Practices
1. **Resource Limits**: Always set memory and CPU limits
2. **Health Checks**: Implement application health monitoring
3. **Log Rotation**: Configure to prevent disk space issues
4. **Security Scanning**: Regular vulnerability scans
5. **Backup Strategy**: Regular data and configuration backups
6. **Testing**: Test containers in staging before production
7. **Documentation**: Document container configurations and dependencies
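
Several of these practices can be baked into a small wrapper so they are applied consistently. This is a minimal sketch. `run_with_guardrails` is a hypothetical helper name, and the limit values are placeholder assumptions to tune per workload.

```bash
# Hypothetical helper: start a container with resource limits, log rotation,
# and a restart policy. Extra `docker run` flags go after the image name.
run_with_guardrails() {
  local image="$1"
  shift
  docker run -d \
    --memory 512m --cpus 1.0 \
    --log-opt max-size=10m --log-opt max-file=3 \
    --restart unless-stopped \
    "$@" "$image"
}

# Usage:
#   run_with_guardrails myapp:latest --name myapp -p 8080:8080
```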
This troubleshooting guide covers the most common Docker and Podman container issues encountered in home lab and production environments.