VM Management Troubleshooting Guide
Complete troubleshooting guide for Proxmox VM provisioning, SSH connectivity, Docker installation, and common configuration issues.
Common Issues and Solutions
1. VM Provisioning Failures
Cloud-Init Not Working
Symptoms:
- VM starts but cloud-init configuration not applied
- User account not created
- SSH keys not installed
Diagnosis:
# Check cloud-init status
ssh root@<vm-ip> 'cloud-init status --long'
# View cloud-init logs
ssh root@<vm-ip> 'cat /var/log/cloud-init.log'
ssh root@<vm-ip> 'cat /var/log/cloud-init-output.log'
# Check cloud-init configuration
ssh root@<vm-ip> 'cloud-init query userdata'
Solutions:
# Re-run cloud-init (if safe to do so)
ssh root@<vm-ip> 'cloud-init clean --logs'
ssh root@<vm-ip> 'cloud-init init --local'
ssh root@<vm-ip> 'cloud-init init'
ssh root@<vm-ip> 'cloud-init modules --mode=config'
ssh root@<vm-ip> 'cloud-init modules --mode=final'
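# Force user creation if missing
# (useradd fails if a listed group does not exist, so create docker first when needed)
ssh root@<vm-ip> 'getent group docker >/dev/null || groupadd docker; useradd -m -s /bin/bash -G sudo,docker cal'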
# Fix YAML syntax in cloud-init if needed
# Common issues: incorrect indentation, missing quotes
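The YAML problems mentioned above can often be caught before a VM is ever booted. The following is a minimal sketch, assuming the user-data is kept in a local file; `check_user_data` is a hypothetical helper name, not a cloud-init command. It checks only two things: that the first line is the `#cloud-config` header, and that no line is indented with a tab.

```shell
# Sketch: sanity-check a cloud-init user-data file before attaching it to a VM.
# check_user_data is a hypothetical helper, not part of cloud-init.
check_user_data() {
    file="$1"
    if [ ! -s "$file" ]; then
        echo "missing or empty: $file"
        return 1
    fi
    # cloud-config user-data must start with this exact header line.
    first_line=$(head -n 1 "$file")
    if [ "$first_line" != "#cloud-config" ]; then
        echo "first line is not '#cloud-config'"
        return 1
    fi
    # YAML does not allow tabs for indentation.
    if grep -q "^ *$(printf '\t')" "$file"; then
        echo "tab used for indentation (YAML requires spaces)"
        return 1
    fi
    echo "ok"
}

# Example (path is an assumption):
# check_user_data ./user-data.yaml
```

This does not replace a full YAML parse; it only rules out two frequent causes of a silently ignored configuration.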
VM Won't Start
Symptoms:
- VM fails to boot
- Kernel panic or boot errors
- Hangs during startup
Diagnosis:
# Check VM configuration in Proxmox
pvesh get /nodes/pve/qemu/<vmid>/config
# View console output
# Use Proxmox web interface Console tab
# Check VM resource allocation
pvesh get /nodes/pve/qemu/<vmid>/status/current
Solutions:
# Increase memory if low
pvesh set /nodes/pve/qemu/<vmid>/config -memory 2048
# Check disk space and format
pvesh get /nodes/pve/storage
# Reset to safe configuration
pvesh set /nodes/pve/qemu/<vmid>/config -cpu host -cores 2
2. SSH Connection Issues
Cannot Connect to VM
Symptoms:
- Connection timeout
- Connection refused
- Host unreachable
Diagnosis:
# Test network connectivity
ping <vm-ip>
# Check SSH port
nc -zv <vm-ip> 22
nmap -p 22 <vm-ip>
# Check from Proxmox console
# Use Proxmox web interface -> VM -> Console
systemctl status sshd
ss -tlnp | grep :22 # or: netstat -tlnp | grep :22 if net-tools is installed
Solutions:
# Start SSH service (via console)
# On Ubuntu the unit is named ssh; sshd is an alias, and systemctl refuses to enable an alias
systemctl start ssh
systemctl enable ssh
# Check firewall (via console)
ufw status
# If active and blocking SSH:
ufw allow ssh
# Reset network configuration
ip addr show
dhclient # If using DHCP
netplan apply # On netplan-based Ubuntu; on ifupdown systems use: systemctl restart networking
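The three symptoms listed for this issue point to different causes, and the SSH client's own error text usually tells them apart. As a rough sketch, the mapping could look like the following; `classify_ssh_error` and its category labels are hypothetical names chosen for illustration.

```shell
# Sketch: map an SSH client error message to the most likely cause.
# classify_ssh_error is a hypothetical helper; the labels are illustrative.
classify_ssh_error() {
    case "$1" in
        *"Connection timed out"*) echo "network-or-firewall" ;;
        *"Connection refused"*)   echo "sshd-not-listening" ;;
        *"No route to host"*)     echo "host-unreachable" ;;
        *"Permission denied"*)    echo "authentication" ;;
        *)                        echo "unknown" ;;
    esac
}

# Example: capture only stderr from a non-interactive attempt, then classify it.
# err=$(ssh -o BatchMode=yes -o ConnectTimeout=5 cal@<vm-ip> true 2>&1 >/dev/null)
# classify_ssh_error "$err"
```

A timeout suggests checking the firewall and routing first, while a refusal means the host answered and the SSH service itself is the place to look.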
SSH Key Authentication Fails
Symptoms:
- Password prompts despite keys being installed
- Permission denied (publickey)
- "No more authentication methods to try"
Diagnosis:
# Verbose SSH connection
ssh -vvv cal@<vm-ip>
# Check authorized_keys file (via console or password auth)
ls -la ~/.ssh/
cat ~/.ssh/authorized_keys
Solutions:
# Fix file permissions
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys
# Verify key content
cat ~/.ssh/authorized_keys | wc -l # Should show 2 keys
# Re-deploy keys manually
# (run on the VM, after copying the .pub files there, e.g. via console or password auth)
cat ~/.ssh/homelab_rsa.pub >> ~/.ssh/authorized_keys
cat ~/.ssh/emergency_homelab_rsa.pub >> ~/.ssh/authorized_keys
# Check SSH configuration
sudo grep -E "(PubkeyAuth|PasswordAuth)" /etc/ssh/sshd_config
sudo systemctl restart sshd
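The permission and key-count checks above can be combined into one function. This is a sketch under two assumptions: GNU `stat` is available (as on Ubuntu), and the target modes are the ones this guide recommends (700 and 600). `check_ssh_perms` is a hypothetical helper name.

```shell
# Sketch only: check_ssh_perms is a hypothetical helper, not an OpenSSH tool.
# Assumes GNU stat for the -c '%a' format.
check_ssh_perms() {
    dir="$1"
    status=0
    dir_mode=$(stat -c '%a' "$dir") || return 2
    file_mode=$(stat -c '%a' "$dir/authorized_keys") || return 2
    if [ "$dir_mode" != "700" ]; then
        echo "$dir has mode $dir_mode, expected 700"
        status=1
    fi
    if [ "$file_mode" != "600" ]; then
        echo "authorized_keys has mode $file_mode, expected 600"
        status=1
    fi
    # Count key entries, ignoring blank lines and comments.
    keys=$(grep -cvE '^[[:space:]]*(#|$)' "$dir/authorized_keys" || true)
    echo "keys=$keys"
    return "$status"
}

# Example, run on the VM as the affected user:
# check_ssh_perms ~/.ssh
```

Counting only non-comment, non-blank lines avoids a false "2 keys" result when the file contains a comment or a trailing empty line.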
SSH Configuration Problems
Symptoms:
- SSH works but with wrong settings
- Root access when it should be disabled
- Password authentication enabled
Diagnosis:
# Check effective SSH configuration
sudo sshd -T | grep -E "(passwordauth|pubkeyauth|permitroot)"
# View SSH configuration files
cat /etc/ssh/sshd_config
ls /etc/ssh/sshd_config.d/
Solutions:
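# Apply security hardening manually
# (the Protocol keyword is omitted: current OpenSSH supports only protocol 2 and reports it as deprecated)
sudo tee /etc/ssh/sshd_config.d/99-homelab-security.conf << 'EOF'
PasswordAuthentication no
PubkeyAuthentication yes
PermitRootLogin no
AllowUsers cal
ClientAliveInterval 300
ClientAliveCountMax 2
MaxAuthTries 3
EOF
# Validate the configuration before restarting so a typo cannot lock you out
sudo sshd -t && sudo systemctl restart sshd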
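After applying hardening, it helps to confirm that the effective settings match what was intended, since a setting elsewhere in the configuration can take precedence. The sketch below reads `sshd -T` output on standard input and reports any of three settings that differ from the values used in this guide. `verify_sshd_settings` is a hypothetical name.

```shell
# Sketch: compare effective sshd settings against the values this guide expects.
# verify_sshd_settings is a hypothetical helper; it reads `sshd -T` output on stdin.
verify_sshd_settings() {
    awk '
        BEGIN {
            want["passwordauthentication"] = "no"
            want["pubkeyauthentication"]   = "yes"
            want["permitrootlogin"]        = "no"
        }
        ($1 in want) {
            seen[$1] = 1
            if ($2 != want[$1]) {
                printf "%s is %s, expected %s\n", $1, $2, want[$1]
                bad = 1
            }
        }
        END {
            # Flag expected settings that never appeared in the input.
            for (k in want) if (!(k in seen)) { printf "%s not reported\n", k; bad = 1 }
            exit bad
        }
    '
}

# Example, run on the VM:
# sudo sshd -T | verify_sshd_settings && echo "hardening in effect"
```

The function exits non-zero on any mismatch, so it can gate a later step in a provisioning script.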
3. Docker Installation Issues
Docker Installation Fails
Symptoms:
- Docker packages not found
- GPG key verification fails
- Permission denied errors
Diagnosis:
# Check internet connectivity
ping google.com
curl -I https://download.docker.com
# Check repository configuration
cat /etc/apt/sources.list.d/docker.list
apt-cache policy docker-ce
# Check for conflicting packages
dpkg -l | grep docker
Solutions:
# Remove conflicting packages
sudo apt remove -y docker docker-engine docker.io containerd runc
# Re-add Docker repository
sudo mkdir -p /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list
# Install Docker
sudo apt update
sudo apt install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
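One way the repository step can go wrong is an empty codename: if `lsb_release` is missing, the one-line `echo` above still writes a file, but the entry is invalid and `apt update` then fails. A small guard, sketched below with the hypothetical name `docker_apt_line`, refuses to produce the line unless both values are present.

```shell
# Sketch: build the Docker apt source line only when both inputs are non-empty.
# docker_apt_line is a hypothetical helper name.
docker_apt_line() {
    arch="$1"
    codename="$2"
    if [ -z "$arch" ] || [ -z "$codename" ]; then
        echo "arch and codename must both be non-empty" >&2
        return 1
    fi
    echo "deb [arch=$arch signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu $codename stable"
}

# Example: generate the line first, then write the file only on success.
# line=$(docker_apt_line "$(dpkg --print-architecture)" "$(lsb_release -cs)") \
#     && echo "$line" | sudo tee /etc/apt/sources.list.d/docker.list
```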
Docker Permission Issues
Symptoms:
- "Permission denied" when running docker commands
- Must use sudo for docker commands
- User not in docker group
Diagnosis:
# Check user groups
groups
groups cal
# Check docker group exists
getent group docker
# Check docker service
systemctl status docker
Solutions:
# Create docker group if missing (-f succeeds even if the group already exists)
sudo groupadd -f docker
# Add user to docker group
sudo usermod -aG docker cal
# Apply group membership (log out and back in, or start a shell with the new group)
newgrp docker
# Fix socket permissions (660 is the default; the socket should not be world-readable)
sudo chown root:docker /var/run/docker.sock
sudo chmod 660 /var/run/docker.sock
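A common source of confusion here is that group membership is recorded immediately but only applies to sessions started afterwards. `id -nG cal` reads the account database, while `id -nG` with no argument reports the groups of the current process. The sketch below compares the two; `user_in_group` is a hypothetical helper.

```shell
# Sketch: test whether a group name appears in a space-separated group list.
# user_in_group is a hypothetical helper name.
user_in_group() {
    # $1: group list, e.g. the output of `id -nG`; $2: group to look for
    case " $1 " in
        *" $2 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Example: membership is recorded, but the current session predates it.
# if user_in_group "$(id -nG cal)" docker && ! user_in_group "$(id -nG)" docker; then
#     echo "cal is in docker, but this session needs a re-login (or newgrp docker)"
# fi
```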
Docker Service Won't Start
Symptoms:
- Docker daemon not running
- Socket connection errors
- systemctl shows failed status
Diagnosis:
# Check service status
systemctl status docker
journalctl -u docker.service -f
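# Check daemon logs by running the daemon in the foreground
# (stop the service first so two instances do not conflict; press Ctrl+C to exit)
sudo systemctl stop docker
sudo dockerd --debug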
# Check system resources
df -h
free -h
Solutions:
# Restart Docker service
sudo systemctl restart docker
sudo systemctl enable docker
# Clear Docker data if corrupted
sudo systemctl stop docker
# The glob must expand as root because /var/lib/docker is not readable by normal users
sudo sh -c 'rm -rf /var/lib/docker/tmp/*'
sudo systemctl start docker
# Reset Docker configuration
sudo systemctl stop docker
sudo mv /etc/docker/daemon.json /etc/docker/daemon.json.bak
sudo systemctl start docker
4. System Update Issues
Package Update Failures
Symptoms:
- apt update fails
- Repository errors
- Dependency conflicts
Diagnosis:
# Check repository status
sudo apt update
cat /etc/apt/sources.list
ls /etc/apt/sources.list.d/
# Check disk space
df -h /
df -h /var
Solutions:
# Fix broken packages
sudo apt --fix-broken install
sudo dpkg --configure -a
# Clean package cache
sudo apt clean
sudo apt autoclean
sudo apt autoremove
# Reset sources if needed
sudo cp /etc/apt/sources.list /etc/apt/sources.list.backup
# Manually edit to use main Ubuntu repositories
5. Network Configuration Problems
IP Configuration Issues
Symptoms:
- VM has wrong IP address
- No network connectivity
- DNS resolution fails
Diagnosis:
# Check network configuration
ip addr show
ip route show
cat /etc/netplan/*.yaml
# Test connectivity
ping 10.10.0.1 # Gateway
ping 8.8.8.8 # External DNS
nslookup google.com
Solutions:
# Fix netplan configuration
sudo nano /etc/netplan/00-installer-config.yaml
# Example correct configuration:
network:
  version: 2
  ethernets:
    ens18:
      dhcp4: false
      addresses: [10.10.0.200/24]
      gateway4: 10.10.0.1
      nameservers:
        addresses: [10.10.0.16, 8.8.8.8]
# Apply configuration
sudo netplan apply
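Because netplan files are indentation-sensitive, generating them from parameters is less error-prone than editing by hand. The following sketch prints a static configuration equivalent to the example above. `render_netplan` is a hypothetical helper, and it uses a `routes` entry with `to: default` in place of `gateway4`, on the assumption that the target release ships a netplan version that treats `gateway4` as deprecated.

```shell
# Sketch: print a static netplan configuration with consistent two-space indentation.
# render_netplan is a hypothetical helper; arguments are interface, CIDR address,
# gateway, and a comma-separated DNS list.
render_netplan() {
    iface="$1"
    cidr="$2"
    gateway="$3"
    dns="$4"
    printf '%s\n' \
        "network:" \
        "  version: 2" \
        "  ethernets:" \
        "    $iface:" \
        "      dhcp4: false" \
        "      addresses: [$cidr]" \
        "      routes:" \
        "        - to: default" \
        "          via: $gateway" \
        "      nameservers:" \
        "        addresses: [$dns]"
}

# Example (review the output before writing it):
# render_netplan ens18 10.10.0.200/24 10.10.0.1 "10.10.0.16, 8.8.8.8" \
#     | sudo tee /etc/netplan/00-installer-config.yaml
# sudo netplan try    # rolls back automatically unless the change is confirmed
```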
DNS Resolution Problems
Symptoms:
- Cannot resolve domain names
- Package installation fails
- Hostname lookups fail
Diagnosis:
# Check DNS configuration
cat /etc/resolv.conf
resolvectl status # on older releases: systemd-resolve --status
# Test DNS
nslookup google.com
dig google.com
Solutions:
# Fix DNS in netplan
sudo nano /etc/netplan/00-installer-config.yaml
# Add nameservers section as shown above
# Temporary DNS fix
echo "nameserver 8.8.8.8" | sudo tee /etc/resolv.conf
# Restart networking
sudo netplan apply
sudo systemctl restart systemd-resolved
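On hosts that use systemd-resolved, `/etc/resolv.conf` typically lists only the local stub address `127.0.0.53`, so that file alone does not show which upstream servers are in use. The sketch below extracts the nameserver entries and flags the stub-only case. Both function names are hypothetical.

```shell
# Sketch: inspect a resolv.conf-style file.
# list_nameservers and only_stub_resolver are hypothetical helper names.
list_nameservers() {
    awk '$1 == "nameserver" { print $2 }' "$1"
}

only_stub_resolver() {
    # True when the single configured nameserver is the systemd-resolved stub.
    servers=$(list_nameservers "$1")
    [ "$servers" = "127.0.0.53" ]
}

# Example:
# if only_stub_resolver /etc/resolv.conf; then
#     echo "stub resolver in use; check upstream servers with: resolvectl status"
# fi
```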
6. Storage and Disk Issues
Disk Space Problems
Symptoms:
- VM runs out of disk space
- Cannot install packages
- Docker images won't download
Diagnosis:
# Check disk usage
df -h
du -sh /home/*
du -sh /var/*
# Check for large files
sudo find / -xdev -type f -size +100M 2>/dev/null | head -10
Solutions:
# Clean system
sudo apt clean
sudo apt autoremove
docker system prune -a
# Extend disk in Proxmox (if needed)
# Use Proxmox web interface: VM -> Hardware -> Hard Disk -> Resize
# Extend filesystem after disk resize
# (confirm the device and partition number with lsblk first; resize2fs applies to ext2/3/4)
sudo growpart /dev/sda 1
sudo resize2fs /dev/sda1
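To spot a filling disk before it causes failures, the `df` output can be filtered against a threshold. This is a sketch that reads `df -P` output on standard input; `full_filesystems` is a hypothetical name, and mount points containing spaces are not handled.

```shell
# Sketch: list mount points at or above a usage threshold.
# full_filesystems is a hypothetical helper; it expects `df -P` output on stdin.
full_filesystems() {
    threshold="$1"
    awk -v limit="$threshold" '
        NR > 1 {
            use = $5
            sub(/%/, "", use)          # strip the percent sign from the Capacity column
            if (use + 0 >= limit + 0) print $6, use "%"
        }
    '
}

# Example:
# df -P | full_filesystems 90
```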
Advanced Troubleshooting
Post-Install Script Debug Mode
# Run script with debug output
bash -x ./scripts/vm-management/vm-post-install.sh <vm-ip> <user>
# Check specific steps manually
ssh cal@<vm-ip> 'docker --version'
ssh cal@<vm-ip> 'sudo systemctl status sshd'
ssh cal@<vm-ip> 'cat ~/.ssh/authorized_keys | wc -l'
Recovery Procedures
Emergency SSH Access
# If primary SSH key fails, use emergency key
ssh -i ~/.ssh/emergency_homelab_rsa cal@<vm-ip>
# If all SSH access fails, use Proxmox console
# VM -> Console in Proxmox web interface
# Reset SSH configuration
sudo cp /etc/ssh/sshd_config.backup /etc/ssh/sshd_config
sudo systemctl restart sshd
Complete VM Reset
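# If VM is completely broken, restore from template
# (a running VM cannot be deleted, so stop it first)
pvesh create /nodes/pve/qemu/<vmid>/status/stop
pvesh delete /nodes/pve/qemu/<vmid>
pvesh create /nodes/pve/qemu/<template-id>/clone -newid <vmid> -name <new-name>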
# Or re-run cloud-init provisioning
# Delete VM and recreate with same cloud-init configuration
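Because deleting a VM is destructive, it can be safer to print the exact commands for review before running anything. The sketch below only prints a plan and executes nothing. `vm_reset_plan` is a hypothetical helper; the node name, IDs, and VM name are supplied by the caller.

```shell
# Sketch: print (not run) the pvesh commands for a stop, delete, and re-clone.
# vm_reset_plan is a hypothetical helper name.
vm_reset_plan() {
    node="$1"
    vmid="$2"
    template="$3"
    name="$4"
    # Reject empty or non-numeric IDs so a typo cannot target the wrong path.
    for id in "$vmid" "$template"; do
        case "$id" in
            ''|*[!0-9]*)
                echo "vmid and template id must be numeric" >&2
                return 1
                ;;
        esac
    done
    echo "pvesh create /nodes/$node/qemu/$vmid/status/stop"
    echo "pvesh delete /nodes/$node/qemu/$vmid"
    echo "pvesh create /nodes/$node/qemu/$template/clone -newid $vmid -name $name"
}

# Example (IDs and name are placeholders):
# vm_reset_plan pve 200 9000 docker-01
```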
Prevention Best Practices
Pre-Deployment Checks
# Verify SSH keys exist
ls -la ~/.ssh/homelab_rsa*
ls -la ~/.ssh/emergency_homelab_rsa*
# Test network connectivity to target subnet
ping 10.10.0.1
# Verify Proxmox storage space
pvesh get /nodes/pve/storage
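The key check above lists files but does not fail when one half of a pair is missing. A sketch of a stricter check follows; `check_key_pair` is a hypothetical helper that takes the private key path and expects the public key alongside it with a `.pub` suffix.

```shell
# Sketch: confirm both halves of an SSH key pair exist before provisioning.
# check_key_pair is a hypothetical helper name.
check_key_pair() {
    base="$1"   # path to the private key, e.g. ~/.ssh/homelab_rsa
    if [ ! -f "$base" ]; then
        echo "missing private key: $base"
        return 1
    fi
    if [ ! -f "$base.pub" ]; then
        echo "missing public key: $base.pub"
        return 1
    fi
    echo "ok: $base"
}

# Example:
# check_key_pair ~/.ssh/homelab_rsa && check_key_pair ~/.ssh/emergency_homelab_rsa
```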
Monitoring and Alerts
# Create health check script
#!/bin/bash
# vm-health-monitor.sh
for ip in 10.10.0.{200..210}; do
    # BatchMode makes ssh fail instead of prompting, so the script cannot hang under cron
    if ssh -o BatchMode=yes -o ConnectTimeout=5 "cal@$ip" 'uptime' >/dev/null 2>&1; then
        echo "✅ $ip: OK"
    else
        echo "❌ $ip: FAILED"
    fi
done
# Schedule regular checks
# Add to crontab: */15 * * * * /path/to/vm-health-monitor.sh
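For scheduled use, it can help if the health check also returns a non-zero exit status when any host fails, so that a wrapper or alerting hook can react. The variant below is a sketch under that assumption. `probe_ssh`, `check_hosts`, and the `PROBE` variable are hypothetical names; the probe is swappable so the loop can be exercised without real hosts.

```shell
# Sketch: health check that reports per-host status and fails if any host fails.
# probe_ssh, check_hosts, and PROBE are hypothetical names.
probe_ssh() {
    # BatchMode=yes makes ssh fail rather than prompt for a password.
    ssh -o BatchMode=yes -o ConnectTimeout=5 "cal@$1" uptime >/dev/null 2>&1
}

check_hosts() {
    probe="${PROBE:-probe_ssh}"
    failed=0
    for ip in "$@"; do
        if "$probe" "$ip"; then
            echo "OK $ip"
        else
            echo "FAILED $ip"
            failed=$((failed + 1))
        fi
    done
    # Exit status is zero only when every host responded.
    [ "$failed" -eq 0 ]
}

# Example:
# check_hosts 10.10.0.200 10.10.0.201 || echo "at least one VM is unreachable"
```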
Emergency Contacts and Resources
Documentation Links
- Proxmox Documentation: https://pve.proxmox.com/wiki/
- Cloud-Init Documentation: https://cloud-init.readthedocs.io/
- Docker Installation Guide: https://docs.docker.com/engine/install/ubuntu/
Recovery Information
- SSH Keys Location: /mnt/NV2/ssh-keys/backup-*/
- Emergency Access: Use Proxmox console for direct VM access
- Backup Strategy: VM snapshots before major changes
Quick Reference Commands
# VM Status
pvesh get /nodes/pve/qemu/<vmid>/status/current
# Start/Stop VM (stop is a hard power-off; use status/shutdown for a clean guest shutdown)
pvesh create /nodes/pve/qemu/<vmid>/status/start
pvesh create /nodes/pve/qemu/<vmid>/status/stop
# SSH with different key
ssh -i ~/.ssh/emergency_homelab_rsa cal@<vm-ip>
# Docker system info
docker system info
docker system df