# Virtual Machine Management Troubleshooting Guide

## VM Provisioning Issues

### Cloud-Init Configuration Problems

#### Cloud-Init Not Executing

**Symptoms**:
- VM starts but user accounts not created
- SSH keys not deployed
- Packages not installed
- Configuration not applied

**Diagnosis**:
```bash
# Check cloud-init status and logs
ssh root@<vm-ip> 'cloud-init status --long'
ssh root@<vm-ip> 'cat /var/log/cloud-init.log'
ssh root@<vm-ip> 'cat /var/log/cloud-init-output.log'

# Verify cloud-init configuration
ssh root@<vm-ip> 'cloud-init query userdata'

# Check for YAML syntax errors
# (newer cloud-init releases expose this as `cloud-init schema`)
ssh root@<vm-ip> 'cloud-init devel schema --config-file /var/lib/cloud/instance/user-data.txt'
```

**Solutions**:
```bash
# Re-run cloud-init (CAUTION: may overwrite changes)
ssh root@<vm-ip> 'cloud-init clean --logs'
ssh root@<vm-ip> 'cloud-init init --local'
ssh root@<vm-ip> 'cloud-init init'
ssh root@<vm-ip> 'cloud-init modules --mode=config'
ssh root@<vm-ip> 'cloud-init modules --mode=final'

# Manual user creation if cloud-init fails
# (useradd fails if the docker group does not exist yet; drop it from -G in that case)
ssh root@<vm-ip> 'useradd -m -s /bin/bash -G sudo,docker cal'
ssh root@<vm-ip> 'mkdir -p /home/cal/.ssh'
ssh root@<vm-ip> 'chown cal:cal /home/cal/.ssh'
ssh root@<vm-ip> 'chmod 700 /home/cal/.ssh'
```

#### Invalid Cloud-Init YAML

**Symptoms**:
- Cloud-init fails with syntax errors
- Parser errors in cloud-init logs
- Partial configuration application

**Common YAML Issues**:
```yaml
# ❌ Incorrect indentation
users:
- name: cal
groups: [sudo, docker]  # Wrong indentation

# ✅ Correct indentation
users:
  - name: cal
    groups: [sudo, docker]  # Proper indentation

# ❌ Missing quotes for special characters
ssh_authorized_keys:
  - ssh-rsa AAAAB3NzaC1... user@host  # May fail with special chars

# ✅ Quoted strings
ssh_authorized_keys:
  - "ssh-rsa AAAAB3NzaC1... user@host"
```

### VM Boot and Startup Issues

#### VM Won't Start

**Symptoms**:
- VM fails to boot from Proxmox
- Kernel panic messages
- Boot loop or hanging

**Diagnosis**:
```bash
# Check VM configuration
pvesh get /nodes/pve/qemu/<vmid>/config

# Check resource allocation
pvesh get /nodes/pve/qemu/<vmid>/status/current

# Review VM logs via Proxmox console
# Use Proxmox web interface -> VM -> Console

# Check Proxmox host resources
pvesh get /nodes/pve/status
```

**Solutions**:
```bash
# Increase memory allocation
pvesh set /nodes/pve/qemu/<vmid>/config -memory 4096

# Reset CPU configuration
pvesh set /nodes/pve/qemu/<vmid>/config -cpu host -cores 2

# Check and repair disk
# Stop VM, then:
pvesh get /nodes/pve/qemu/<vmid>/config | grep scsi0
# Use fsck on the disk image if needed
```

#### Resource Constraints

**Symptoms**:
- VM extremely slow performance
- Out-of-memory kills
- Disk I/O bottlenecks

**Diagnosis**:
```bash
# Inside VM resource check
free -h
df -h
iostat 1 5
vmstat 1 5

# Proxmox host resource check
pvesh get /nodes/pve/status
cat /proc/meminfo
df -h /var/lib/vz
```

**Solutions**:
```bash
# Increase VM resources via Proxmox
pvesh set /nodes/pve/qemu/<vmid>/config -memory 8192
pvesh set /nodes/pve/qemu/<vmid>/config -cores 4

# Resize VM disk
# Proxmox GUI: Hardware -> Hard Disk -> Resize
# Then extend filesystem:
sudo growpart /dev/sda 1
sudo resize2fs /dev/sda1
```

## SSH Access Issues

### SSH Connection Failures

#### Cannot Connect to VM

**Symptoms**:
- Connection timeout
- Connection refused
- Host unreachable

**Diagnosis**:
```bash
# Network connectivity tests
ping -c 3 <vm-ip>
traceroute <vm-ip>

# SSH service tests
nc -zv <vm-ip> 22
nmap -p 22 <vm-ip>

# From Proxmox console, check SSH service
systemctl status sshd
ss -tlnp | grep :22
```

**Solutions**:
```bash
# Via Proxmox console - restart SSH
# (on Debian/Ubuntu the unit is ssh.service; use `ssh` if the `sshd` alias is refused)
systemctl start sshd
systemctl enable sshd

# Check and configure firewall
ufw status
# If blocking SSH:
ufw allow ssh
ufw allow 22/tcp

# Network configuration reset
ip addr show
dhclient  # For DHCP
systemctl restart networking
```
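
The three symptoms above usually point at different layers, so it can help to triage from the error text before choosing a fix. The following is a minimal sketch, assuming the error strings that OpenSSH and `nc` print on Linux; the function name `classify_ssh_failure` and its output labels are hypothetical and not part of any existing script in this guide.

```bash
#!/bin/bash
# classify_ssh_failure: hypothetical helper that maps a connection error
# message to the layer most likely at fault. The mapping is a heuristic:
# a firewall configured to reject (rather than drop) can also produce
# "Connection refused" or "No route to host".
classify_ssh_failure() {
    local msg="$1"
    case "$msg" in
        *"Connection refused"*)
            # Host answered, but nothing accepted the connection on the port
            echo "service" ;;
        *"timed out"*)
            # Packets silently dropped: firewall rule or wrong IP address
            echo "firewall-or-ip" ;;
        *"No route to host"*|*"Network is unreachable"*)
            # Routing or link problem between client and VM
            echo "network" ;;
        *)
            echo "unknown" ;;
    esac
}
```

One way to use it is to capture the client error first, for example `classify_ssh_failure "$(ssh -o ConnectTimeout=5 -o BatchMode=yes cal@<vm-ip> true 2>&1)"`, then start with the SSH service checks for `service`, the firewall checks for `firewall-or-ip`, and the network reset for `network`.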

#### SSH Key Authentication Failures

**Symptoms**:
- Password prompts despite key installation
- "Permission denied (publickey)"
- "No more authentication methods"

**Diagnosis**:
```bash
# Verbose SSH debugging
ssh -vvv cal@<vm-ip>

# Check key files locally
ls -la ~/.ssh/homelab_rsa*
ls -la ~/.ssh/emergency_homelab_rsa*

# Via console or password auth, check VM
ls -la ~/.ssh/
cat ~/.ssh/authorized_keys
```

**Solutions**:
```bash
# Fix SSH directory permissions
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys
chown -R cal:cal ~/.ssh

# Re-deploy SSH keys (overwrites the existing authorized_keys)
cat > ~/.ssh/authorized_keys << 'EOF'
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQC... # primary key
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQD... # emergency key
EOF

# Verify SSH server configuration
sudo sshd -T | grep -E "(passwordauth|pubkeyauth|permitroot)"
```

#### SSH Security Configuration Issues

**Symptoms**:
- Password authentication still enabled
- Root login allowed
- Insecure SSH settings

**Diagnosis**:
```bash
# Check effective SSH configuration
sudo sshd -T | grep -E "(passwordauth|pubkeyauth|permitroot|allowusers)"

# Review SSH config files
cat /etc/ssh/sshd_config
ls /etc/ssh/sshd_config.d/
```

**Solutions**:
```bash
# Apply security hardening
sudo tee /etc/ssh/sshd_config.d/99-homelab-security.conf << 'EOF'
PasswordAuthentication no
PubkeyAuthentication yes
PermitRootLogin no
AllowUsers cal
Protocol 2
ClientAliveInterval 300
ClientAliveCountMax 2
MaxAuthTries 3
X11Forwarding no
EOF

# Validate the configuration before restarting to avoid a lockout
sudo sshd -t && sudo systemctl restart sshd
```

## Docker Installation and Configuration Issues

### Docker Installation Failures

#### Package Installation Fails

**Symptoms**:
- Docker packages not found
- GPG key verification errors
- Repository access failures

**Diagnosis**:
```bash
# Test internet connectivity
ping -c 3 google.com
curl -I https://download.docker.com

# Check repository configuration
cat /etc/apt/sources.list.d/docker.list
apt-cache policy docker-ce

# Check for package conflicts
dpkg -l | grep docker
```

**Solutions**:
```bash
# Remove conflicting packages
sudo apt remove -y docker docker-engine docker.io containerd runc

# Reinstall Docker repository
sudo mkdir -p /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list

# Install Docker
sudo apt update
sudo apt install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
```

#### Docker Service Issues

**Symptoms**:
- Docker daemon won't start
- Socket connection errors
- Service failure on boot

**Diagnosis**:
```bash
# Check service status
systemctl status docker
journalctl -u docker.service -f

# Check system resources
df -h
free -h

# Test daemon manually (stop the service first so the two do not conflict)
sudo dockerd --debug
```

**Solutions**:
```bash
# Restart Docker service
sudo systemctl stop docker
sudo systemctl start docker
sudo systemctl enable docker

# Clear corrupted Docker data
sudo systemctl stop docker
sudo rm -rf /var/lib/docker/tmp/*
sudo systemctl start docker

# Reset Docker configuration
sudo mv /etc/docker/daemon.json /etc/docker/daemon.json.bak 2>/dev/null || true
sudo systemctl restart docker
```

### Docker Permission and Access Issues

#### Permission Denied Errors

**Symptoms**:
- Must use sudo for Docker commands
- "Permission denied" when accessing Docker socket
- User not in docker group

**Diagnosis**:
```bash
# Check user groups
groups
groups cal
getent group docker

# Check Docker socket permissions
ls -la /var/run/docker.sock

# Verify Docker service is running
systemctl status docker
```

**Solutions**:
```bash
# Add user to docker group
sudo usermod -aG docker cal

# Create docker group if missing
sudo groupadd docker 2>/dev/null || true
sudo usermod -aG docker cal

# Apply group membership (requires logout/login or):
newgrp docker

# Fix socket permissions (660 is the default: owner and docker group only)
sudo chown root:docker /var/run/docker.sock
sudo chmod 660 /var/run/docker.sock
```

## Network Configuration Problems

### IP Address and Connectivity Issues

#### Incorrect IP Configuration

**Symptoms**:
- VM has wrong IP address
- No network connectivity
- Cannot reach default gateway

**Diagnosis**:
```bash
# Check network configuration
ip addr show
ip route show
cat /etc/netplan/*.yaml

# Test connectivity
ping -c 3 $(ip route | grep default | awk '{print $3}')  # Gateway
ping -c 3 8.8.8.8                                        # External connectivity
```

**Solutions**:
```bash
# Fix netplan configuration
# (gateway4 is deprecated on newer netplan releases, which expect a default route under `routes:`)
sudo tee /etc/netplan/00-installer-config.yaml << 'EOF'
network:
  version: 2
  ethernets:
    ens18:
      dhcp4: false
      addresses: [10.10.0.200/24]
      gateway4: 10.10.0.1
      nameservers:
        addresses: [10.10.0.16, 8.8.8.8]
EOF

# Apply network configuration
sudo netplan apply
```

#### DNS Resolution Problems

**Symptoms**:
- Cannot resolve domain names
- Package downloads fail
- Host lookup failures

**Diagnosis**:
```bash
# Check DNS configuration
cat /etc/resolv.conf
resolvectl status  # older releases: systemd-resolve --status

# Test DNS resolution
nslookup google.com
dig google.com @8.8.8.8
```

**Solutions**:
```bash
# Fix DNS in netplan (see above example)
sudo netplan apply

# Temporary DNS fix
echo "nameserver 8.8.8.8" | sudo tee /etc/resolv.conf

# Restart DNS services
sudo systemctl restart systemd-resolved
sudo systemctl restart networking
```

## System Maintenance Issues

### Package Management Problems

#### Update Failures

**Symptoms**:
- apt update fails
- Repository signature errors
- Dependency conflicts

**Diagnosis**:
```bash
# Check repository status
sudo apt update
apt-cache policy

# Check disk space
df -h /
df -h /var

# Check for held packages
apt-mark showhold
```

**Solutions**:
```bash
# Fix broken packages
sudo apt --fix-broken install
sudo dpkg --configure -a

# Clean package cache
sudo apt clean
sudo apt autoclean
sudo apt autoremove

# Reset problematic repositories
# (apt-key is deprecated on newer releases; prefer a keyring referenced via signed-by)
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys <key-id>
sudo apt update
```

### Storage and Disk Space Issues

#### Disk Space Exhaustion

**Symptoms**:
- Cannot install packages
- Docker operations fail
- System becomes unresponsive

**Diagnosis**:
```bash
# Check disk usage
df -h
du -sh /home/* /var/* /opt/* 2>/dev/null

# Find large files (stay on the root filesystem)
find / -xdev -type f -size +100M 2>/dev/null | head -20
```

**Solutions**:
```bash
# Clean system files
sudo apt clean
sudo apt autoremove
sudo journalctl --vacuum-time=7d

# Clean Docker data (CAUTION: removes all unused images and volumes)
docker system prune -a -f
docker volume prune -f

# Extend disk (Proxmox GUI: Hardware -> Resize)
# Then extend filesystem:
sudo growpart /dev/sda 1
sudo resize2fs /dev/sda1
```

## Emergency Recovery Procedures

### SSH Access Recovery

#### Complete SSH Lockout

**Recovery Steps**:
1. **Use Proxmox console** for direct VM access
2. **Reset SSH configuration**:
   ```bash
   # Via console
   sudo cp /etc/ssh/sshd_config.backup /etc/ssh/sshd_config 2>/dev/null || true
   sudo systemctl restart sshd
   ```
3. **Re-enable emergency access**:
   ```bash
   # Temporary password access for recovery
   sudo passwd cal
   # Drop-in files such as 99-homelab-security.conf also set this option, so edit them too
   sudo sed -i 's/PasswordAuthentication no/PasswordAuthentication yes/' /etc/ssh/sshd_config /etc/ssh/sshd_config.d/*.conf
   sudo systemctl restart sshd
   ```

#### Emergency SSH Key Deployment

**If primary keys fail**:
```bash
# Use emergency key
ssh -i ~/.ssh/emergency_homelab_rsa cal@<vm-ip>

# Or deploy keys via console
mkdir -p ~/.ssh
chmod 700 ~/.ssh
cat > ~/.ssh/authorized_keys << 'EOF'
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQC... # primary key
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQD... # emergency key
EOF
chmod 600 ~/.ssh/authorized_keys
```

### VM Recovery and Rebuild

#### Corrupt VM Recovery

**Steps**:
1. **Create snapshot** before attempting recovery
2. **Export VM data**:
   ```bash
   # Backup important data
   rsync -av cal@<vm-ip>:/home/cal/ ./vm-backup/
   ```
3. **Restore from template**:
   ```bash
   # Delete corrupt VM
   pvesh delete /nodes/pve/qemu/<vmid>

   # Clone from template
   pvesh create /nodes/pve/qemu/<template-id>/clone -newid <new-vmid> -name <vm-name>
   ```

#### Post-Install Script Recovery

**If automation fails**:
```bash
# Run in debug mode
bash -x ./scripts/vm-management/vm-post-install.sh

# Manual step execution
ssh cal@<vm-ip> 'sudo apt update && sudo apt upgrade -y'
ssh cal@<vm-ip> 'curl -fsSL https://get.docker.com | sh'
ssh cal@<vm-ip> 'sudo usermod -aG docker cal'
```

## Prevention and Monitoring

### Pre-Deployment Validation

```bash
# Verify prerequisites
ls -la ~/.ssh/homelab_rsa*
ls -la ~/.ssh/emergency_homelab_rsa*
ping -c 3 10.10.0.1

# Test cloud-init YAML (requires PyYAML)
python3 -c "import yaml; yaml.safe_load(open('cloud-init-user-data.yaml'))"
```

### Health Monitoring Script

```bash
#!/bin/bash
# vm-health-check.sh
VM_IPS="10.10.0.200 10.10.0.201 10.10.0.202"

for ip in $VM_IPS; do
    if ssh -o ConnectTimeout=5 -o BatchMode=yes "cal@$ip" 'uptime' >/dev/null 2>&1; then
        echo "✅ $ip: SSH OK"

        # Check Docker
        if ssh -o ConnectTimeout=5 -o BatchMode=yes "cal@$ip" 'docker info >/dev/null 2>&1'; then
            echo "✅ $ip: Docker OK"
        else
            echo "❌ $ip: Docker FAILED"
        fi
    else
        echo "❌ $ip: SSH FAILED"
    fi
done
```

### Automated Backup

```bash
#!/bin/bash
# vm-backup.sh
# Schedule in crontab: 0 2 * * * /path/to/vm-backup.sh
# Relative paths resolve against the working directory (the user's home when run from cron)
for vm_ip in 10.10.0.{200..210}; do
    if ping -c1 "$vm_ip" >/dev/null 2>&1; then
        mkdir -p "./backups/$vm_ip"
        rsync -av --exclude='.cache' "cal@$vm_ip:/home/cal/" "./backups/$vm_ip/"
    fi
done
```

## Quick Reference Commands

### Essential VM Management

```bash
# VM control via Proxmox
pvesh get /nodes/pve/qemu/<vmid>/status/current
pvesh create /nodes/pve/qemu/<vmid>/status/start
pvesh create /nodes/pve/qemu/<vmid>/status/stop

# SSH with alternative keys
ssh -i ~/.ssh/emergency_homelab_rsa cal@<vm-ip>

# System health checks
free -h && df -h && systemctl status docker
docker system info && docker system df
```

### Recovery Resources

- **SSH Keys Backup**: `/mnt/NV2/ssh-keys/backup-*/`
- **Proxmox Console**: Direct VM access when SSH fails
- **Emergency Contact**: Use Discord notifications for critical issues

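
When several `backup-*` directories exist under the SSH keys backup path, the newest one is usually the one to restore from. The following is a minimal sketch, assuming the directory suffixes are timestamps that sort lexicographically (for example `backup-20240101`); that naming scheme and the function name `latest_backup_dir` are assumptions, not something this guide specifies.

```bash
#!/bin/bash
# latest_backup_dir: hypothetical helper that prints the backup-* directory
# whose name sorts last under the given base path.
# Returns non-zero if no such directory exists.
latest_backup_dir() {
    local base="$1" dir latest=""
    for dir in "$base"/backup-*/; do
        [ -d "$dir" ] || continue   # glob matched nothing
        latest="${dir%/}"           # globs expand in sorted order, so the last match wins
    done
    [ -n "$latest" ] || return 1
    echo "$latest"
}
```

For example, `latest_backup_dir /mnt/NV2/ssh-keys` would print the last-sorting backup directory, which can then be inspected before copying any key back into `~/.ssh/`.
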
This troubleshooting guide covers comprehensive recovery procedures for VM management issues in home lab environments.