# VM Management Troubleshooting Guide

Complete troubleshooting guide for Proxmox VM provisioning, SSH connectivity, Docker installation, and common configuration issues.

## Common Issues and Solutions

### 1. VM Provisioning Failures

#### Cloud-Init Not Working

**Symptoms:**
- VM starts but cloud-init configuration not applied
- User account not created
- SSH keys not installed

**Diagnosis:**
```bash
# Check cloud-init status
ssh root@ 'cloud-init status --long'

# View cloud-init logs
ssh root@ 'cat /var/log/cloud-init.log'
ssh root@ 'cat /var/log/cloud-init-output.log'

# Check cloud-init configuration
ssh root@ 'cloud-init query userdata'
```

**Solutions:**
```bash
# Re-run cloud-init (if safe to do so)
ssh root@ 'cloud-init clean --logs'
ssh root@ 'cloud-init init --local'
ssh root@ 'cloud-init init'
ssh root@ 'cloud-init modules --mode=config'
ssh root@ 'cloud-init modules --mode=final'

# Force user creation if missing
ssh root@ 'useradd -m -s /bin/bash -G sudo,docker cal'

# Fix YAML syntax in cloud-init if needed
# Common issues: incorrect indentation, missing quotes
```

#### VM Won't Start

**Symptoms:**
- VM fails to boot
- Kernel panic or boot errors
- Hangs during startup

**Diagnosis:**
```bash
# Check VM configuration in Proxmox
pvesh get /nodes/pve/qemu//config

# View console output
# Use Proxmox web interface Console tab

# Check VM resource allocation
pvesh get /nodes/pve/qemu//status/current
```

**Solutions:**
```bash
# Increase memory if low
pvesh set /nodes/pve/qemu//config -memory 2048

# Check disk space and format
pvesh get /nodes/pve/storage

# Reset to safe configuration
pvesh set /nodes/pve/qemu//config -cpu host -cores 2
```

### 2. SSH Connection Issues

#### Cannot Connect to VM

**Symptoms:**
- Connection timeout
- Connection refused
- Host unreachable

**Diagnosis:**
```bash
# Test network connectivity
ping

# Check SSH port
nc -zv 22
nmap -p 22

# Check from Proxmox console
# Use Proxmox web interface -> VM -> Console
systemctl status sshd
netstat -tlnp | grep :22
```

**Solutions:**
```bash
# Start SSH service (via console)
systemctl start sshd
systemctl enable sshd

# Check firewall (via console)
ufw status
# If active and blocking SSH:
ufw allow ssh

# Reset network configuration
ip addr show
dhclient  # If using DHCP
systemctl restart networking
```

#### SSH Key Authentication Fails

**Symptoms:**
- Password prompts despite keys being installed
- Permission denied (publickey)
- "No more authentication methods to try"

**Diagnosis:**
```bash
# Verbose SSH connection
ssh -vvv cal@

# Check authorized_keys file (via console or password auth)
ls -la ~/.ssh/
cat ~/.ssh/authorized_keys
```

**Solutions:**
```bash
# Fix file permissions
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys

# Verify key content
cat ~/.ssh/authorized_keys | wc -l  # Should show 2 keys

# Re-deploy keys manually
cat ~/.ssh/homelab_rsa.pub >> ~/.ssh/authorized_keys
cat ~/.ssh/emergency_homelab_rsa.pub >> ~/.ssh/authorized_keys

# Check SSH configuration
sudo grep -E "(PubkeyAuth|PasswordAuth)" /etc/ssh/sshd_config
sudo systemctl restart sshd
```

#### SSH Configuration Problems

**Symptoms:**
- SSH works but with wrong settings
- Root access when it should be disabled
- Password authentication enabled

**Diagnosis:**
```bash
# Check effective SSH configuration
sudo sshd -T | grep -E "(passwordauth|pubkeyauth|permitroot)"

# View SSH configuration files
cat /etc/ssh/sshd_config
ls /etc/ssh/sshd_config.d/
```

**Solutions:**
```bash
# Apply security hardening manually
sudo tee /etc/ssh/sshd_config.d/99-homelab-security.conf << 'EOF'
PasswordAuthentication no
PubkeyAuthentication yes
PermitRootLogin no
AllowUsers cal
Protocol 2
ClientAliveInterval 300
ClientAliveCountMax 2
MaxAuthTries 3
EOF

sudo systemctl restart sshd
```

### 3. Docker Installation Issues

#### Docker Installation Fails

**Symptoms:**
- Docker packages not found
- GPG key verification fails
- Permission denied errors

**Diagnosis:**
```bash
# Check internet connectivity
ping google.com
curl -I https://download.docker.com

# Check repository configuration
cat /etc/apt/sources.list.d/docker.list
apt-cache policy docker-ce

# Check for conflicting packages
dpkg -l | grep docker
```

**Solutions:**
```bash
# Remove conflicting packages
sudo apt remove -y docker docker-engine docker.io containerd runc

# Re-add Docker repository
sudo mkdir -p /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list

# Install Docker
sudo apt update
sudo apt install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
```

#### Docker Permission Issues

**Symptoms:**
- "Permission denied" when running docker commands
- Must use sudo for docker commands
- User not in docker group

**Diagnosis:**
```bash
# Check user groups
groups
groups cal

# Check docker group exists
getent group docker

# Check docker service
systemctl status docker
```

**Solutions:**
```bash
# Add user to docker group
sudo usermod -aG docker cal

# Create docker group if missing
sudo groupadd docker
sudo usermod -aG docker cal

# Apply group membership (logout/login or)
newgrp docker

# Fix socket permissions
sudo chown root:docker /var/run/docker.sock
sudo chmod 664 /var/run/docker.sock
```

#### Docker Service Won't Start

**Symptoms:**
- Docker daemon not running
- Socket connection errors
- systemctl shows failed status

**Diagnosis:**
```bash
# Check service status
systemctl status docker
journalctl -u docker.service -f

# Check daemon logs
sudo dockerd --debug

# Check system resources
df -h
free -h
```

**Solutions:**
```bash
# Restart Docker service
sudo systemctl restart docker
sudo systemctl enable docker

# Clear Docker data if corrupted
sudo systemctl stop docker
sudo rm -rf /var/lib/docker/tmp/*
sudo systemctl start docker

# Reset Docker configuration
sudo systemctl stop docker
sudo mv /etc/docker/daemon.json /etc/docker/daemon.json.bak
sudo systemctl start docker
```

### 4. System Update Issues

#### Package Update Failures

**Symptoms:**
- apt update fails
- Repository errors
- Dependency conflicts

**Diagnosis:**
```bash
# Check repository status
sudo apt update
cat /etc/apt/sources.list
ls /etc/apt/sources.list.d/

# Check disk space
df -h /
df -h /var
```

**Solutions:**
```bash
# Fix broken packages
sudo apt --fix-broken install
sudo dpkg --configure -a

# Clean package cache
sudo apt clean
sudo apt autoclean
sudo apt autoremove

# Reset sources if needed
sudo cp /etc/apt/sources.list /etc/apt/sources.list.backup
# Manually edit to use main Ubuntu repositories
```

### 5. Network Configuration Problems

#### IP Configuration Issues

**Symptoms:**
- VM has wrong IP address
- No network connectivity
- DNS resolution fails

**Diagnosis:**
```bash
# Check network configuration
ip addr show
ip route show
cat /etc/netplan/*.yaml

# Test connectivity
ping 10.10.0.1  # Gateway
ping 8.8.8.8    # External DNS
nslookup google.com
```

**Solutions:**
```bash
# Fix netplan configuration
sudo nano /etc/netplan/00-installer-config.yaml

# Example correct configuration:
network:
  version: 2
  ethernets:
    ens18:
      dhcp4: false
      addresses: [10.10.0.200/24]
      gateway4: 10.10.0.1
      nameservers:
        addresses: [10.10.0.16, 8.8.8.8]

# Apply configuration
sudo netplan apply
```

#### DNS Resolution Problems

**Symptoms:**
- Cannot resolve domain names
- Package installation fails
- Hostname lookups fail

**Diagnosis:**
```bash
# Check DNS configuration
cat /etc/resolv.conf
systemd-resolve --status

# Test DNS
nslookup google.com
dig google.com
```

**Solutions:**
```bash
# Fix DNS in netplan
sudo nano /etc/netplan/00-installer-config.yaml
# Add nameservers section as shown above

# Temporary DNS fix
echo "nameserver 8.8.8.8" | sudo tee /etc/resolv.conf

# Restart networking
sudo netplan apply
sudo systemctl restart systemd-resolved
```

### 6. Storage and Disk Issues

#### Disk Space Problems

**Symptoms:**
- VM runs out of disk space
- Cannot install packages
- Docker images won't download

**Diagnosis:**
```bash
# Check disk usage
df -h
du -sh /home/*
du -sh /var/*

# Check for large files
find / -size +100M 2>/dev/null | head -10
```

**Solutions:**
```bash
# Clean system
sudo apt clean
sudo apt autoremove
docker system prune -a

# Extend disk in Proxmox (if needed)
# Use Proxmox web interface: VM -> Hardware -> Hard Disk -> Resize

# Extend filesystem after disk resize
sudo growpart /dev/sda 1
sudo resize2fs /dev/sda1
```

## Advanced Troubleshooting

### Post-Install Script Debug Mode

```bash
# Run script with debug output
bash -x ./scripts/vm-management/vm-post-install.sh

# Check specific steps manually
ssh cal@ 'docker --version'
ssh cal@ 'sudo systemctl status sshd'
ssh cal@ 'cat ~/.ssh/authorized_keys | wc -l'
```

### Recovery Procedures

#### Emergency SSH Access

```bash
# If primary SSH key fails, use emergency key
ssh -i ~/.ssh/emergency_homelab_rsa cal@

# If all SSH access fails, use Proxmox console
# VM -> Console in Proxmox web interface

# Reset SSH configuration
sudo cp /etc/ssh/sshd_config.backup /etc/ssh/sshd_config
sudo systemctl restart sshd
```

#### Complete VM Reset

```bash
# If VM is completely broken, restore from template
pvesh delete /nodes/pve/qemu/
pvesh create /nodes/pve/qemu//clone -newid -name

# Or re-run cloud-init provisioning
# Delete VM and recreate with same cloud-init configuration
```

## Prevention Best Practices

### Pre-Deployment Checks

```bash
# Verify SSH keys exist
ls -la ~/.ssh/homelab_rsa*
ls -la ~/.ssh/emergency_homelab_rsa*

# Test network connectivity to target subnet
ping 10.10.0.1

# Verify Proxmox storage space
pvesh get /nodes/pve/storage
```

### Monitoring and Alerts

```bash
# Create health check script
#!/bin/bash
# vm-health-monitor.sh
for ip in 10.10.0.{200..210}; do
    if ssh -o ConnectTimeout=5 cal@$ip 'uptime' >/dev/null 2>&1; then
        echo "✅ $ip: OK"
    else
        echo "❌ $ip: FAILED"
    fi
done

# Schedule regular checks
# Add to crontab: */15 * * * * /path/to/vm-health-monitor.sh
```

## Emergency Contacts and Resources

### Documentation Links
- **Proxmox Documentation**: https://pve.proxmox.com/wiki/
- **Cloud-Init Documentation**: https://cloud-init.readthedocs.io/
- **Docker Installation Guide**: https://docs.docker.com/engine/install/ubuntu/

### Recovery Information
- **SSH Keys Location**: `/mnt/NV2/ssh-keys/backup-*/`
- **Emergency Access**: Use Proxmox console for direct VM access
- **Backup Strategy**: VM snapshots before major changes

### Quick Reference Commands

```bash
# VM Status
pvesh get /nodes/pve/qemu//status/current

# Start/Stop VM
pvesh create /nodes/pve/qemu//status/start
pvesh create /nodes/pve/qemu//status/stop

# SSH with different key
ssh -i ~/.ssh/emergency_homelab_rsa cal@

# Docker system info
docker system info
docker system df
```
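
### Matching SSH Errors to Guide Sections

The SSH symptoms listed in section 2 map fairly directly onto the text of the error the `ssh` client prints. As a rough sketch, a small helper could point from an error message to the subsection worth reading first. The function name `classify_ssh_error` is hypothetical and not part of any existing script in this guide, and the matched strings are assumptions based on typical OpenSSH client messages, whose exact wording can vary between versions.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the existing scripts): map an SSH client
# error message to the section of this guide that covers it.
# The matched substrings are assumed typical OpenSSH wording.
classify_ssh_error() {
    case "$1" in
        *"Connection timed out"*|*"Connection refused"*|*"No route to host"*)
            # Network or service-level failure
            echo "Cannot Connect to VM" ;;
        *"Permission denied (publickey"*|*"No more authentication methods"*)
            # The server was reached but rejected the offered keys
            echo "SSH Key Authentication Fails" ;;
        *)
            echo "unclassified" ;;
    esac
}

# Example calls with literal messages (no network access required)
classify_ssh_error "ssh: connect to host 10.10.0.200 port 22: Connection refused"
classify_ssh_error "cal@10.10.0.200: Permission denied (publickey)."
```

In practice the message could be captured from a non-interactive attempt, for example by redirecting stderr of `ssh -o BatchMode=yes -o ConnectTimeout=5` into a variable and passing that variable to the function. Anything reported as `unclassified` still needs the verbose `ssh -vvv` diagnosis described earlier.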