| title | description | type | domain | tags |
|---|---|---|---|---|
| VM Management Troubleshooting | Troubleshooting guide for Proxmox VM issues: cloud-init failures, SSH access problems, Docker installation errors, network configuration, disk space, and emergency recovery procedures. | troubleshooting | vm-management | |
Virtual Machine Management Troubleshooting Guide
VM Provisioning Issues
Cloud-Init Configuration Problems
Cloud-Init Not Executing
Symptoms:
- VM starts but user accounts not created
- SSH keys not deployed
- Packages not installed
- Configuration not applied
Diagnosis:
# Check cloud-init status and logs
ssh root@<vm-ip> 'cloud-init status --long'
ssh root@<vm-ip> 'cat /var/log/cloud-init.log'
ssh root@<vm-ip> 'cat /var/log/cloud-init-output.log'
# Verify cloud-init configuration
ssh root@<vm-ip> 'cloud-init query userdata'
# Check for YAML syntax errors
ssh root@<vm-ip> 'cloud-init devel schema --config-file /var/lib/cloud/instance/user-data.txt'
Solutions:
# Re-run cloud-init (CAUTION: may overwrite changes)
ssh root@<vm-ip> 'cloud-init clean --logs'
ssh root@<vm-ip> 'cloud-init init --local'
ssh root@<vm-ip> 'cloud-init init'
ssh root@<vm-ip> 'cloud-init modules --mode=config'
ssh root@<vm-ip> 'cloud-init modules --mode=final'
# Manual user creation if cloud-init fails
ssh root@<vm-ip> 'groupadd -f docker && useradd -m -s /bin/bash -G sudo,docker cal'  # useradd fails if the docker group does not exist yet
ssh root@<vm-ip> 'mkdir -p /home/cal/.ssh'
ssh root@<vm-ip> 'chown cal:cal /home/cal/.ssh'
ssh root@<vm-ip> 'chmod 700 /home/cal/.ssh'
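Before cleaning and re-running cloud-init, it can help to branch on the state that `cloud-init status` reports. The following is a minimal sketch, not part of the original tooling: the function names `cloud_init_state` and `cloud_init_next_step` are hypothetical, and the suggested next steps are illustrative wording only.

```shell
# Extract the state word from `cloud-init status` output on stdin,
# e.g. "status: error" -> "error".
cloud_init_state() {
  awk -F': *' '/^status:/ { print $2; exit }'
}

# Map a state to a suggested next step (illustrative wording).
cloud_init_next_step() {
  case "$1" in
    "done")    echo "finished: inspect /var/log/cloud-init-output.log if config is missing" ;;
    running)   echo "still running: block until it finishes with 'cloud-init status --wait'" ;;
    error)     echo "failed: read /var/log/cloud-init.log before running 'cloud-init clean'" ;;
    disabled)  echo "disabled: cloud-init will not run on this VM" ;;
    *)         echo "unrecognized state: $1" ;;
  esac
}

# Usage from the workstation:
#   state=$(ssh root@<vm-ip> 'cloud-init status' | cloud_init_state)
#   cloud_init_next_step "$state"
```

Checking for `running` first avoids cleaning a VM that simply has not finished its first boot.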
Invalid Cloud-Init YAML
Symptoms:
- Cloud-init fails with syntax errors
- Parser errors in cloud-init logs
- Partial configuration application
Common YAML Issues:
# ❌ Incorrect indentation
users:
  - name: cal
  groups: [sudo, docker]  # Wrong indentation
# ✅ Correct indentation
users:
  - name: cal
    groups: [sudo, docker]  # Proper indentation
# ❌ Missing quotes for special characters
ssh_authorized_keys:
  - ssh-rsa AAAAB3NzaC1... user@host  # May fail with special chars
# ✅ Quoted strings
ssh_authorized_keys:
  - "ssh-rsa AAAAB3NzaC1... user@host"
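Two problems that a YAML parser alone will not flag are a missing `#cloud-config` first line (cloud-init then does not treat the file as cloud-config) and tab characters used for indentation. A rough pre-flight check might look like the sketch below; the function name `check_user_data` is hypothetical, and it complements rather than replaces the schema check shown earlier.

```shell
# Pre-flight check for a cloud-init user-data file.
# Runs in a subshell so its variables do not leak into the caller.
check_user_data() (
  file=$1
  if [ ! -s "$file" ]; then
    echo "missing or empty: $file"
    return 1
  fi
  first=$(head -n 1 "$file")
  if [ "$first" != "#cloud-config" ]; then
    echo "first line must be exactly '#cloud-config', found: $first"
    return 1
  fi
  tab=$(printf '\t')
  # YAML does not allow tabs in indentation.
  if grep -q "^ *$tab" "$file"; then
    echo "tab character found in indentation"
    return 1
  fi
  echo "ok: $file"
)

# Usage:
#   check_user_data cloud-init-user-data.yaml
```

The exact-match comparison on the first line also catches Windows line endings, since a trailing carriage return makes the comparison fail.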
VM Boot and Startup Issues
VM Won't Start
Symptoms:
- VM fails to boot from Proxmox
- Kernel panic messages
- Boot loop or hanging
Diagnosis:
# Check VM configuration
pvesh get /nodes/pve/qemu/<vmid>/config
# Check resource allocation
pvesh get /nodes/pve/qemu/<vmid>/status/current
# Review VM logs via Proxmox console
# Use Proxmox web interface -> VM -> Console
# Check Proxmox host resources
pvesh get /nodes/pve/status
Solutions:
# Increase memory allocation
pvesh set /nodes/pve/qemu/<vmid>/config -memory 4096
# Reset CPU configuration
pvesh set /nodes/pve/qemu/<vmid>/config -cpu host -cores 2
# Check and repair disk
# Stop VM, then:
pvesh get /nodes/pve/qemu/<vmid>/config | grep scsi0
# Use fsck on the disk image if needed
Resource Constraints
Symptoms:
- VM extremely slow performance
- Out-of-memory kills
- Disk I/O bottlenecks
Diagnosis:
# Inside VM resource check
free -h
df -h
iostat 1 5
vmstat 1 5
# Proxmox host resource check
pvesh get /nodes/pve/status
cat /proc/meminfo
df -h /var/lib/vz
Solutions:
# Increase VM resources via Proxmox
pvesh set /nodes/pve/qemu/<vmid>/config -memory 8192
pvesh set /nodes/pve/qemu/<vmid>/config -cores 4
# Resize VM disk
# Proxmox GUI: Hardware -> Hard Disk -> Resize
# Then extend filesystem:
sudo growpart /dev/sda 1
sudo resize2fs /dev/sda1
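The `growpart` call above takes the disk and the partition number as separate arguments and assumes the root filesystem lives on `/dev/sda1`. VirtIO disks usually appear as `/dev/vda`, and NVMe partitions use a `p` separator. As a sketch under those assumptions, a small helper (the name `split_partition` is hypothetical) can split a partition path into the two arguments; it deliberately rejects whole disks and device-mapper paths.

```shell
# Split a partition device path into "<disk> <partition-number>".
# Handles sdX/vdX style names and nvme/mmcblk "p" style names only.
split_partition() (
  dev=$1
  case "$dev" in
    /dev/nvme[0-9]*n[0-9]*p[0-9]*|/dev/mmcblk[0-9]*p[0-9]*)
      part=${dev##*p}          # text after the last "p"
      disk=${dev%p"$part"}     # strip "p<part>" from the end
      ;;
    /dev/nvme*|/dev/mmcblk*|/dev/mapper/*|/dev/dm-*)
      return 1                 # whole disk or LVM: nothing to split
      ;;
    /dev/[a-z]*[0-9])
      disk=$(printf '%s\n' "$dev" | sed 's/[0-9]*$//')
      part=${dev#"$disk"}
      ;;
    *)
      return 1
      ;;
  esac
  printf '%s %s\n' "$disk" "$part"
)

# Usage inside the VM (findmnt prints the device backing /):
#   root_dev=$(findmnt -n -o SOURCE /)
#   set -- $(split_partition "$root_dev") && sudo growpart "$1" "$2" && sudo resize2fs "$root_dev"
```

Note that `resize2fs` applies to ext2/3/4 filesystems only; an LVM or XFS root needs different steps.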
SSH Access Issues
SSH Connection Failures
Cannot Connect to VM
Symptoms:
- Connection timeout
- Connection refused
- Host unreachable
Diagnosis:
# Network connectivity tests
ping <vm-ip>
traceroute <vm-ip>
# SSH service tests
nc -zv <vm-ip> 22
nmap -p 22 <vm-ip>
# From Proxmox console, check SSH service
systemctl status sshd
ss -tlnp | grep :22
Solutions:
# Via Proxmox console - start SSH and enable it at boot
# (on Ubuntu the unit is named ssh; sshd is only an alias and cannot be enabled directly)
systemctl enable --now ssh
# Check and configure firewall
ufw status
# If blocking SSH:
ufw allow ssh
ufw allow 22/tcp
# Network configuration reset
ip addr show
dhclient # For DHCP
systemctl restart systemd-networkd  # netplan-based Ubuntu; use 'systemctl restart networking' on ifupdown systems
SSH Key Authentication Failures
Symptoms:
- Password prompts despite key installation
- "Permission denied (publickey)"
- "No more authentication methods"
Diagnosis:
# Verbose SSH debugging
ssh -vvv cal@<vm-ip>
# Check key files locally
ls -la ~/.ssh/homelab_rsa*
ls -la ~/.ssh/emergency_homelab_rsa*
# Via console or password auth, check VM
ls -la ~/.ssh/
cat ~/.ssh/authorized_keys
Solutions:
# Fix SSH directory permissions
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys
chown -R cal:cal ~/.ssh
# Re-deploy SSH keys
cat > ~/.ssh/authorized_keys << 'EOF'
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQC... # primary key
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQD... # emergency key
EOF
# Verify SSH server configuration
sudo sshd -T | grep -E "(passwordauth|pubkeyauth|permitroot)"
SSH Security Configuration Issues
Symptoms:
- Password authentication still enabled
- Root login allowed
- Insecure SSH settings
Diagnosis:
# Check effective SSH configuration
sudo sshd -T | grep -E "(passwordauth|pubkeyauth|permitroot|allowusers)"
# Review SSH config files
cat /etc/ssh/sshd_config
ls /etc/ssh/sshd_config.d/
Solutions:
# Apply security hardening
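# Note: sshd keeps the first value it reads for each keyword, so a lower-numbered file
# in /etc/ssh/sshd_config.d/ (for example one written by cloud-init) can override this one
sudo tee /etc/ssh/sshd_config.d/99-homelab-security.conf << 'EOF'
PasswordAuthentication no
PubkeyAuthentication yes
PermitRootLogin no
AllowUsers cal
ClientAliveInterval 300
ClientAliveCountMax 2
MaxAuthTries 3
X11Forwarding no
EOF
# Validate the configuration first so a typo cannot lock you out, then restart
sudo sshd -t && sudo systemctl restart ssh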
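After restarting, the effective settings can be checked mechanically rather than by eye. The sketch below reads `sshd -T` output (which prints lowercase keywords) on stdin and reports any of the three core settings that differ from the hardened values above. The function name `audit_sshd` and the `FAIL:` prefix are hypothetical.

```shell
# Read `sshd -T` output on stdin; print settings that are not hardened
# and exit non-zero if any are found.
audit_sshd() {
  awk '
    $1 == "passwordauthentication" && $2 != "no"  { print "FAIL: " $0; bad = 1 }
    $1 == "permitrootlogin"        && $2 != "no"  { print "FAIL: " $0; bad = 1 }
    $1 == "pubkeyauthentication"   && $2 != "yes" { print "FAIL: " $0; bad = 1 }
    END { exit bad + 0 }
  '
}

# Usage on the VM:
#   sudo sshd -T | audit_sshd && echo "SSH hardening is in effect"
```

Because `sshd -T` shows the merged result of all configuration files, this also reveals when another drop-in file has taken precedence over the hardening file.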
Docker Installation and Configuration Issues
Docker Installation Failures
Package Installation Fails
Symptoms:
- Docker packages not found
- GPG key verification errors
- Repository access failures
Diagnosis:
# Test internet connectivity
ping google.com
curl -I https://download.docker.com
# Check repository configuration
cat /etc/apt/sources.list.d/docker.list
apt-cache policy docker-ce
# Check for package conflicts
dpkg -l | grep docker
Solutions:
# Remove conflicting packages
sudo apt remove -y docker docker-engine docker.io containerd runc
# Reinstall Docker repository
sudo mkdir -p /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list
# Install Docker
sudo apt update
sudo apt install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
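The repository line above depends on `lsb_release`, which minimal cloud images do not always ship. One alternative is to read the codename from `/etc/os-release`, as in this sketch; the function name `os_codename` is hypothetical, and the optional file argument exists only to make the function easy to try against a sample file.

```shell
# Print the release codename from an os-release file (default: /etc/os-release).
# Runs in a subshell so the sourced variables do not leak into the caller.
os_codename() (
  file=${1:-/etc/os-release}
  [ -r "$file" ] || return 1
  . "$file"
  echo "${UBUNTU_CODENAME:-$VERSION_CODENAME}"
)

# Usage when writing the Docker repository entry:
#   codename=$(os_codename) && echo "suite will be: $codename"
```

Preferring `UBUNTU_CODENAME` keeps the result correct on Ubuntu derivatives whose own `VERSION_CODENAME` does not exist in Docker's repository.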
Docker Service Issues
Symptoms:
- Docker daemon won't start
- Socket connection errors
- Service failure on boot
Diagnosis:
# Check service status
systemctl status docker
journalctl -u docker.service -f
# Check system resources
df -h
free -h
# Test daemon manually
sudo dockerd --debug
Solutions:
# Restart Docker service
sudo systemctl stop docker
sudo systemctl start docker
sudo systemctl enable docker
# Clear corrupted Docker data
sudo systemctl stop docker
sudo rm -rf /var/lib/docker/tmp/*
sudo systemctl start docker
# Reset Docker configuration
sudo mv /etc/docker/daemon.json /etc/docker/daemon.json.bak 2>/dev/null || true
sudo systemctl restart docker
Docker Permission and Access Issues
Permission Denied Errors
Symptoms:
- Must use sudo for Docker commands
- "Permission denied" when accessing Docker socket
- User not in docker group
Diagnosis:
# Check user groups
groups
groups cal
getent group docker
# Check Docker socket permissions
ls -la /var/run/docker.sock
# Verify Docker service is running
systemctl status docker
Solutions:
# Create docker group if missing, then add the user
sudo groupadd -f docker
sudo usermod -aG docker cal
# Apply group membership (requires logout/login or):
newgrp docker
# Restore default socket ownership and mode (the socket is recreated when Docker restarts)
sudo chown root:docker /var/run/docker.sock
sudo chmod 660 /var/run/docker.sock
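A frequent source of confusion here is that `id -nG cal` reads the group database, while `id -nG` with no argument shows the groups of the current session, which only change after a new login. The sketch below (the function name `has_group` is hypothetical) checks either list for an exact group name.

```shell
# Succeed if the group named in $1 appears in the space-separated list on stdin.
# -x and -F make the match exact, so "dockerusers" does not count as "docker".
has_group() {
  tr ' ' '\n' | grep -qxF -- "$1"
}

# Usage on the VM:
#   id -nG cal | has_group docker && echo "configured: cal is in the docker group"
#   id -nG     | has_group docker || echo "this session predates the change: log in again"
```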
Network Configuration Problems
IP Address and Connectivity Issues
Incorrect IP Configuration
Symptoms:
- VM has wrong IP address
- No network connectivity
- Cannot reach default gateway
Diagnosis:
# Check network configuration
ip addr show
ip route show
cat /etc/netplan/*.yaml
# Test connectivity
ping $(ip route | grep default | awk '{print $3}') # Gateway
ping 8.8.8.8 # External connectivity
Solutions:
# Fix netplan configuration
sudo tee /etc/netplan/00-installer-config.yaml << 'EOF'
network:
  version: 2
  ethernets:
    ens18:
      dhcp4: false
      addresses: [10.10.0.200/24]
      routes:
        - to: default   # newer netplan deprecates gateway4; on older releases use gateway4: 10.10.0.1
          via: 10.10.0.1
      nameservers:
        addresses: [10.10.0.16, 8.8.8.8]
EOF
# Apply network configuration
sudo netplan apply
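When several VMs need the same layout with different addresses, the netplan file can be generated from parameters instead of edited by hand. This is a sketch only: the function name `render_netplan` and its argument order are hypothetical, and it uses a `routes` entry with `to: default`, which newer netplan releases prefer over `gateway4`.

```shell
# Print a static-IP netplan document.
# Arguments: interface, address/prefix, gateway, comma-separated DNS servers.
render_netplan() (
  iface=$1 addr=$2 gw=$3 dns=$4
  cat <<EOF
network:
  version: 2
  ethernets:
    $iface:
      dhcp4: false
      addresses: [$addr]
      routes:
        - to: default
          via: $gw
      nameservers:
        addresses: [$dns]
EOF
)

# Usage on the VM:
#   render_netplan ens18 10.10.0.200/24 10.10.0.1 "10.10.0.16, 8.8.8.8" \
#     | sudo tee /etc/netplan/00-installer-config.yaml
#   sudo netplan try   # rolls back automatically unless confirmed
```

Over SSH, `netplan try` is a safer choice than `netplan apply` because a wrong address is reverted when the confirmation prompt times out.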
DNS Resolution Problems
Symptoms:
- Cannot resolve domain names
- Package downloads fail
- Host lookup failures
Diagnosis:
# Check DNS configuration
cat /etc/resolv.conf
resolvectl status  # 'systemd-resolve --status' on older releases
# Test DNS resolution
nslookup google.com
dig google.com @8.8.8.8
Solutions:
# Fix DNS in netplan (see above example)
sudo netplan apply
# Temporary DNS fix
echo "nameserver 8.8.8.8" | sudo tee /etc/resolv.conf
# Restart DNS services
sudo systemctl restart systemd-resolved
sudo systemctl restart systemd-networkd  # or 'networking' on ifupdown systems
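On systems running systemd-resolved, `/etc/resolv.conf` usually lists only the local stub resolver at 127.0.0.53, so the file alone does not show which upstream servers are in use. The sketch below (function names are hypothetical) lists the configured nameservers and detects the stub case.

```shell
# Print the nameserver addresses from a resolv.conf-style file
# (default: /etc/resolv.conf).
list_nameservers() {
  awk '$1 == "nameserver" { print $2 }' "${1:-/etc/resolv.conf}"
}

# Succeed if the file points at the systemd-resolved stub listener.
uses_resolved_stub() {
  list_nameservers "$1" | grep -qxF 127.0.0.53
}

# Usage on the VM:
#   if uses_resolved_stub; then resolvectl status; else list_nameservers; fi
```

This also explains why overwriting `/etc/resolv.conf` is only a temporary fix: the netplan nameservers remain the persistent source of truth.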
System Maintenance Issues
Package Management Problems
Update Failures
Symptoms:
- apt update fails
- Repository signature errors
- Dependency conflicts
Diagnosis:
# Check repository status
sudo apt update
apt-cache policy
# Check disk space
df -h /
df -h /var
# Check for held packages
apt-mark showhold
Solutions:
# Fix broken packages
sudo apt --fix-broken install
sudo dpkg --configure -a
# Clean package cache
sudo apt clean
sudo apt autoclean
sudo apt autoremove
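# Reset problematic repositories
# Note: apt-key is deprecated on recent Ubuntu/Debian releases; where it is unavailable,
# store the key under /etc/apt/keyrings/ and reference it with signed-by, as in the Docker repository setup
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys <KEYID>
sudo apt update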
Storage and Disk Space Issues
Disk Space Exhaustion
Symptoms:
- Cannot install packages
- Docker operations fail
- System becomes unresponsive
Diagnosis:
# Check disk usage
df -h
du -sh /home/* /var/* /opt/* 2>/dev/null
# Find large files
sudo find / -xdev -type f -size +100M 2>/dev/null | head -20  # -xdev stays on the root filesystem
Solutions:
# Clean system files
sudo apt clean
sudo apt autoremove
sudo journalctl --vacuum-time=7d
# Clean Docker data (CAUTION: deletes stopped containers, all unused images, and unused volumes)
docker system prune -a -f
docker volume prune -f
# Extend disk (Proxmox GUI: Hardware -> Resize)
# Then extend filesystem:
sudo growpart /dev/sda 1
sudo resize2fs /dev/sda1
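Disk exhaustion is easier to handle before it causes failures. As a sketch, the filter below reads `df -P` output and prints any mount point at or above a usage threshold; the function name `over_threshold` and the 90 percent figure in the usage line are assumptions, not values from the original guide.

```shell
# Read `df -P` output on stdin; print "<mount> <use%>" for every filesystem
# whose usage is at or above the percentage given in $1.
over_threshold() {
  awk -v limit="$1" '
    NR > 1 {
      pct = $5
      sub(/%/, "", pct)
      if (pct + 0 >= limit + 0) print $6 " " $5
    }
  '
}

# Usage on the VM:
#   df -P | over_threshold 90
```

The `-P` flag forces the POSIX single-line format, so long device names cannot wrap and shift the columns. Mount points containing spaces are not handled.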
Emergency Recovery Procedures
SSH Access Recovery
Complete SSH Lockout
Recovery Steps:
- Use Proxmox console for direct VM access
- Reset SSH configuration:
# Via console
sudo cp /etc/ssh/sshd_config.backup /etc/ssh/sshd_config 2>/dev/null || true
sudo systemctl restart sshd
- Re-enable emergency access:
# Temporary password access for recovery
sudo passwd cal
sudo sed -i 's/PasswordAuthentication no/PasswordAuthentication yes/' /etc/ssh/sshd_config
sudo systemctl restart sshd
Emergency SSH Key Deployment
If primary keys fail:
# Use emergency key
ssh -i ~/.ssh/emergency_homelab_rsa cal@<vm-ip>
# Or deploy keys via console
mkdir -p ~/.ssh
chmod 700 ~/.ssh
cat > ~/.ssh/authorized_keys << 'EOF'
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQC... # primary key
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQD... # emergency key
EOF
chmod 600 ~/.ssh/authorized_keys
VM Recovery and Rebuild
Corrupt VM Recovery
Steps:
- Create snapshot before attempting recovery
- Export VM data:
# Backup important data
rsync -av cal@<vm-ip>:/home/cal/ ./vm-backup/
- Restore from template:
# Delete corrupt VM
pvesh delete /nodes/pve/qemu/<vmid>
# Clone from template
pvesh create /nodes/pve/qemu/<template-id>/clone -newid <vmid> -name <vm-name>
Post-Install Script Recovery
If automation fails:
# Run in debug mode
bash -x ./scripts/vm-management/vm-post-install.sh <vm-ip> <user>
# Manual step execution
ssh cal@<vm-ip> 'sudo apt update && sudo apt upgrade -y'
ssh cal@<vm-ip> 'curl -fsSL https://get.docker.com | sh'
ssh cal@<vm-ip> 'sudo usermod -aG docker cal'
Prevention and Monitoring
Pre-Deployment Validation
# Verify prerequisites
ls -la ~/.ssh/homelab_rsa*
ls -la ~/.ssh/emergency_homelab_rsa*
ping 10.10.0.1
# Test cloud-init YAML
python3 -c "import yaml; yaml.safe_load(open('cloud-init-user-data.yaml'))"
Health Monitoring Script
#!/bin/bash
# vm-health-check.sh
VM_IPS="10.10.0.200 10.10.0.201 10.10.0.202"
SSH_OPTS="-o ConnectTimeout=5 -o BatchMode=yes"
for ip in $VM_IPS; do
    # $SSH_OPTS is left unquoted on purpose so it splits into separate options
    if ssh $SSH_OPTS "cal@$ip" 'uptime' >/dev/null 2>&1; then
        echo "✅ $ip: SSH OK"
        # Check Docker (same timeout and batch options, so a hung VM cannot stall the loop)
        if ssh $SSH_OPTS "cal@$ip" 'docker info >/dev/null 2>&1'; then
            echo "✅ $ip: Docker OK"
        else
            echo "❌ $ip: Docker FAILED"
        fi
    else
        echo "❌ $ip: SSH FAILED"
    fi
done
Automated Backup
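#!/bin/bash
# vm-backup.sh
# Schedule in crontab: 0 2 * * * /path/to/vm-backup.sh
# Backups are written relative to the working directory (cron starts in $HOME)
for vm_ip in 10.10.0.{200..210}; do
    if ping -c1 -W2 "$vm_ip" >/dev/null 2>&1; then
        # rsync only creates the last path component, so create the full path first
        mkdir -p "./backups/$vm_ip"
        rsync -av --exclude='.cache' "cal@$vm_ip:/home/cal/" "./backups/$vm_ip/"
    fi
done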
Quick Reference Commands
Essential VM Management
# VM control via Proxmox
pvesh get /nodes/pve/qemu/<vmid>/status/current
pvesh create /nodes/pve/qemu/<vmid>/status/start
pvesh create /nodes/pve/qemu/<vmid>/status/stop
# SSH with alternative keys
ssh -i ~/.ssh/emergency_homelab_rsa cal@<vm-ip>
# System health checks
free -h && df -h && systemctl status docker
docker system info && docker system df
Recovery Resources
- SSH Keys Backup: /mnt/NV2/ssh-keys/backup-*/
- Proxmox Console: Direct VM access when SSH fails
- Emergency Contact: Use Discord notifications for critical issues
This troubleshooting guide covers comprehensive recovery procedures for VM management issues in home lab environments.
ubuntu-manticore crash recovery — initramfs fsck on wrong device (2026-03-24)
Severity: High — server unbootable, all services down (pihole DNS, jellyfin, tdarr, kb-rag)
Problem: After a physical server crash, ubuntu-manticore dropped to initramfs shell. Boot fsck targeted /dev/nvme0n1p2 (an NTFS data partition labeled "2TB 970") instead of the actual ext4 root on /dev/nvme1n1p2. The generic busybox fsck -y wrapper didn't invoke the ext4 backend.
Root Cause: Two issues compounded: (1) The crash corrupted the ext4 root filesystem (block/inode count mismatches across ~15 groups). (2) The initramfs resolved the root device UUID to the wrong NVMe drive — nvme0n1p2 instead of nvme1n1p2. NVMe device enumeration order can shift between boots; fstab uses UUIDs correctly but the initramfs got confused during this boot.
Fix: Ran /usr/sbin/fsck.ext4 -y /dev/nvme1n1p2 directly from initramfs (identified correct partition via blkid). After exit, boot completed normally and all 9 Docker containers came up automatically via restart policies.
Crash cause investigation:
- Kernel panic: BUG: unable to handle page fault for address: fffffb2320041d50 (supervisor write to not-present page)
- PCIe AER correctable errors (Data Link Layer Timeout) on port 0000:00:01.2 (AMD X470/B450 root port) logged on Mar 19
- Nvidia proprietary driver loaded, kernel tainted; a common source of page faults
- AMD Zen1 DIV0 bug flagged at boot (Ryzen 5 2600)
- SMART data: both Samsung 970s healthy (PASSED), but nvme1 (250GB root drive, 22k hours) has 1 Media and Data Integrity Error — monitor for growth
- nvme0 (2TB data): 0 media errors, 2% used, 1,571 hours — clean
- Most likely cause: Nvidia driver panic or PCIe timeout on NVMe controller
Remediation: Upgraded Nvidia driver 570.211.01 → 580.126.09. The 570 packages were in a broken state (partial upgrade left metapackage pinned at .24.04.1 while deps moved to .24.04.2), requiring explicit removal of nvidia-driver-570 nvidia-dkms-570 nvidia-kernel-source-570 nvidia-kernel-common-570 libnvidia-common-570 libnvidia-gl-570 with --allow-change-held-packages before 580 could install cleanly. Note: 590 drivers reported unstable — avoid.
Lesson:
- Always use blkid in initramfs to confirm the actual root partition before running fsck; NVMe device ordering is not stable across boots
- Use /usr/sbin/fsck.ext4 -y directly rather than the busybox fsck wrapper, which may not invoke the correct backend
- Docker containers with restart policies recovered without intervention, which validates that approach
- Install smartmontools on bare-metal servers proactively; it wasn't available during initial investigation
- Monitor nvme1 media integrity error count; if it increments, plan replacement
- When upgrading Nvidia driver major versions on Ubuntu, apt often can't resolve conflicts automatically; explicitly remove the old driver packages with --allow-change-held-packages first
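The first lesson, confirming the root partition with blkid before running fsck, can be reduced to a one-line filter that uses only tools normally present in a busybox initramfs. This is a sketch: the function name `ext4_partitions` is hypothetical, and it assumes blkid prints one `device: KEY="value" ...` line per partition, which should be verified on the actual system.

```shell
# Read blkid output on stdin and print the device path of every ext4 partition.
# The leading space in the pattern avoids matching SEC_TYPE="...".
ext4_partitions() {
  grep ' TYPE="ext4"' | cut -d: -f1
}

# Usage from the initramfs shell:
#   blkid | ext4_partitions
#   /usr/sbin/fsck.ext4 -y <device printed above>
```

If more than one ext4 partition is listed, compare the UUID shown by blkid with the root entry in fstab before running fsck on any of them.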