claude-home/vm-management/troubleshooting.md
Cal Corum 4b7eca8a46
2026-03-12 09:00:44 -05:00

title: VM Management Troubleshooting
description: Troubleshooting guide for Proxmox VM issues: cloud-init failures, SSH access problems, Docker installation errors, network configuration, disk space, and emergency recovery procedures.
type: troubleshooting
domain: vm-management
tags: proxmox, vm, ssh, docker, cloud-init, networking, recovery

Virtual Machine Management Troubleshooting Guide

VM Provisioning Issues

Cloud-Init Configuration Problems

Cloud-Init Not Executing

Symptoms:

  • VM starts but user accounts not created
  • SSH keys not deployed
  • Packages not installed
  • Configuration not applied

Diagnosis:

# Check cloud-init status and logs
ssh root@<vm-ip> 'cloud-init status --long'
ssh root@<vm-ip> 'cat /var/log/cloud-init.log'
ssh root@<vm-ip> 'cat /var/log/cloud-init-output.log'

# Verify cloud-init configuration
ssh root@<vm-ip> 'cloud-init query userdata'

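# Check for YAML syntax errors
# (on newer cloud-init releases, where the `devel` subcommand is deprecated, use: cloud-init schema --system)
ssh root@<vm-ip> 'cloud-init devel schema --config-file /var/lib/cloud/instance/user-data.txt'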
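For scripted checks, the status line can be reduced to a coarse verdict. This is a minimal sketch: the function name `cloud_init_verdict` is hypothetical, and the status strings it matches are assumptions based on commonly seen `cloud-init status` output, so adjust them for the installed version.

```shell
# Map the "status:" line of `cloud-init status --long` (read from stdin)
# to a coarse verdict. The status strings below are assumptions; verify
# them against the output of your cloud-init version.
cloud_init_verdict() {
    ci_status=$(sed -n 's/^status:[[:space:]]*//p' | head -n 1)
    case "$ci_status" in
        done)                              echo "ok" ;;
        running)                           echo "in-progress" ;;
        error*)                            echo "failed" ;;
        disabled|"not run"|"not started")  echo "never-ran" ;;
        *)                                 echo "unknown" ;;
    esac
}
```

Usage: `ssh root@<vm-ip> 'cloud-init status --long' | cloud_init_verdict`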

Solutions:

# Re-run cloud-init (CAUTION: may overwrite changes)
ssh root@<vm-ip> 'cloud-init clean --logs'
ssh root@<vm-ip> 'cloud-init init --local'
ssh root@<vm-ip> 'cloud-init init'
ssh root@<vm-ip> 'cloud-init modules --mode=config'
ssh root@<vm-ip> 'cloud-init modules --mode=final'

# Manual user creation if cloud-init fails
# (useradd fails if the docker group does not exist yet; drop it from -G until Docker is installed)
ssh root@<vm-ip> 'useradd -m -s /bin/bash -G sudo,docker cal'
ssh root@<vm-ip> 'mkdir -p /home/cal/.ssh'
ssh root@<vm-ip> 'chown cal:cal /home/cal/.ssh'
ssh root@<vm-ip> 'chmod 700 /home/cal/.ssh'

Invalid Cloud-Init YAML

Symptoms:

  • Cloud-init fails with syntax errors
  • Parser errors in cloud-init logs
  • Partial configuration application

Common YAML Issues:

# ❌ Incorrect indentation
users:
- name: cal
groups: [sudo, docker]  # Wrong indentation: parsed as a top-level key, not part of the user entry

# ✅ Correct indentation
users:
  - name: cal
    groups: [sudo, docker]  # Proper indentation

# ❌ Missing quotes for special characters
ssh_authorized_keys:
  - ssh-rsa AAAAB3NzaC1... user@host  # May fail with special chars

# ✅ Quoted strings
ssh_authorized_keys:
  - "ssh-rsa AAAAB3NzaC1... user@host"
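A related cause of silently ignored user data is a missing `#cloud-config` first line, which cloud-init requires before it treats the file as cloud-config YAML. A minimal pre-deployment check, assuming a hypothetical helper name `has_cloud_config_header`:

```shell
# Succeed only if the first line of the given user-data file is exactly
# "#cloud-config". The helper name is illustrative, not part of cloud-init.
has_cloud_config_header() {
    [ "$(head -n 1 "$1" 2>/dev/null)" = "#cloud-config" ]
}
```

Usage: `has_cloud_config_header cloud-init-user-data.yaml || echo "missing #cloud-config header"`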

VM Boot and Startup Issues

VM Won't Start

Symptoms:

  • VM fails to boot from Proxmox
  • Kernel panic messages
  • Boot loop or hanging

Diagnosis:

# Check VM configuration
pvesh get /nodes/pve/qemu/<vmid>/config

# Check resource allocation
pvesh get /nodes/pve/qemu/<vmid>/status/current

# Review VM logs via Proxmox console
# Use Proxmox web interface -> VM -> Console

# Check Proxmox host resources
pvesh get /nodes/pve/status

Solutions:

# Increase memory allocation
pvesh set /nodes/pve/qemu/<vmid>/config -memory 4096

# Reset CPU configuration
pvesh set /nodes/pve/qemu/<vmid>/config -cpu host -cores 2

# Check and repair disk
# Stop VM, then:
pvesh get /nodes/pve/qemu/<vmid>/config | grep scsi0
# Use fsck on the disk image if needed

Resource Constraints

Symptoms:

  • VM extremely slow performance
  • Out-of-memory kills
  • Disk I/O bottlenecks

Diagnosis:

# Inside VM resource check
free -h
df -h
iostat 1 5
vmstat 1 5

# Proxmox host resource check
pvesh get /nodes/pve/status
cat /proc/meminfo
df -h /var/lib/vz
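To turn the memory figures into a single number that can be compared against a threshold, the sketch below computes available memory as a percentage of total. The function name `mem_available_pct` is hypothetical; it assumes the standard `MemTotal` and `MemAvailable` fields of `/proc/meminfo`.

```shell
# Print MemAvailable as a whole-number percentage of MemTotal, reading
# /proc/meminfo-formatted text from stdin. Exits non-zero if MemTotal
# is missing from the input.
mem_available_pct() {
    awk '/^MemTotal:/     { total = $2 }
         /^MemAvailable:/ { avail = $2 }
         END {
             if (total > 0) printf "%d\n", avail * 100 / total
             else exit 1
         }'
}
```

Usage: `mem_available_pct < /proc/meminfo` (inside the VM or on the Proxmox host). A persistently low value alongside out-of-memory kills suggests raising the VM's memory allocation.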

Solutions:

# Increase VM resources via Proxmox
pvesh set /nodes/pve/qemu/<vmid>/config -memory 8192
pvesh set /nodes/pve/qemu/<vmid>/config -cores 4

# Resize VM disk
# Proxmox GUI: Hardware -> Hard Disk -> Resize
# Then extend filesystem:
sudo growpart /dev/sda 1
sudo resize2fs /dev/sda1

SSH Access Issues

SSH Connection Failures

Cannot Connect to VM

Symptoms:

  • Connection timeout
  • Connection refused
  • Host unreachable

Diagnosis:

# Network connectivity tests
ping <vm-ip>
traceroute <vm-ip>

# SSH service tests
nc -zv <vm-ip> 22
nmap -p 22 <vm-ip>

# From Proxmox console, check SSH service
systemctl status sshd
ss -tlnp | grep :22
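The three symptoms usually point at different layers. As a rough triage aid, the sketch below maps an `ssh` or `nc` error message to the place to look first. The function name `ssh_failure_hint` and the hint wording are illustrative, and the mapping is a rule of thumb rather than a definitive diagnosis.

```shell
# Suggest where to look first, given an ssh/nc error message on stdin.
# Rules of thumb only: a firewall can also produce "refused" or "timed out".
ssh_failure_hint() {
    msg=$(cat)
    case "$msg" in
        *"Connection refused"*)
            echo "service: host is up but nothing accepts port 22; check the SSH service" ;;
        *"timed out"*)
            echo "filtered: no reply; check VM power state, IP address, and firewall" ;;
        *"No route to host"*|*"unreachable"*)
            echo "network: check VM network config, bridge, and gateway" ;;
        *)
            echo "unknown: rerun with ssh -vvv" ;;
    esac
}
```

Usage: `ssh -o ConnectTimeout=5 cal@<vm-ip> true 2>&1 | ssh_failure_hint`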

Solutions:

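# Via Proxmox console - restart SSH
# (the unit is named ssh on Debian/Ubuntu; sshd is only an alias there and cannot be enabled by that name)
systemctl restart ssh
systemctl enable ssh

# Check and configure firewall
ufw status
# If blocking SSH (either rule is sufficient):
ufw allow ssh
ufw allow 22/tcp

# Network configuration reset
ip addr show
dhclient  # For DHCP
netplan apply  # Ubuntu with netplan; on ifupdown systems use: systemctl restart networking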

SSH Key Authentication Failures

Symptoms:

  • Password prompts despite key installation
  • "Permission denied (publickey)"
  • "No more authentication methods"

Diagnosis:

# Verbose SSH debugging
ssh -vvv cal@<vm-ip>

# Check key files locally
ls -la ~/.ssh/homelab_rsa*
ls -la ~/.ssh/emergency_homelab_rsa*

# Via console or password auth, check VM
ls -la ~/.ssh/
cat ~/.ssh/authorized_keys

Solutions:

# Fix SSH directory permissions
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys
chown -R cal:cal ~/.ssh

# Re-deploy SSH keys
cat > ~/.ssh/authorized_keys << 'EOF'
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQC... # primary key
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQD... # emergency key  
EOF

# Verify SSH server configuration
sudo sshd -T | grep -E "(passwordauth|pubkeyauth|permitroot)"
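To confirm the permission fix took effect, the modes can be checked without reading them by eye. A minimal sketch, assuming a hypothetical helper `check_ssh_perms` and the 700/600 modes used in this guide:

```shell
# Check that an .ssh directory is mode 700 and its authorized_keys file
# is mode 600, printing the fix for anything that differs.
# $1 is the path to the .ssh directory.
check_ssh_perms() {
    dir_mode=$(ls -ld "$1" 2>/dev/null | cut -c1-10)
    file_mode=$(ls -l "$1/authorized_keys" 2>/dev/null | cut -c1-10)
    rc=0
    [ "$dir_mode" = "drwx------" ] || { echo "fix: chmod 700 $1"; rc=1; }
    [ "$file_mode" = "-rw-------" ] || { echo "fix: chmod 600 $1/authorized_keys"; rc=1; }
    return $rc
}
```

Usage: `check_ssh_perms /home/cal/.ssh`. Passing the absolute path avoids checking root's `~/.ssh` by mistake when working from a root console.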

SSH Security Configuration Issues

Symptoms:

  • Password authentication still enabled
  • Root login allowed
  • Insecure SSH settings

Diagnosis:

# Check effective SSH configuration
sudo sshd -T | grep -E "(passwordauth|pubkeyauth|permitroot|allowusers)"

# Review SSH config files
cat /etc/ssh/sshd_config
ls /etc/ssh/sshd_config.d/

Solutions:

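# Apply security hardening
# Note: sshd keeps the first value it reads for each keyword, and drop-ins load in
# lexical order, so an earlier file (e.g. one written by cloud-init) can override this one.
# Confirm the effective settings with `sshd -T` afterwards.
sudo tee /etc/ssh/sshd_config.d/99-homelab-security.conf << 'EOF'
PasswordAuthentication no
PubkeyAuthentication yes
PermitRootLogin no
AllowUsers cal
ClientAliveInterval 300
ClientAliveCountMax 2
MaxAuthTries 3
X11Forwarding no
EOF

# Validate the configuration before restarting, so a typo cannot lock you out
sudo sshd -t && sudo systemctl restart ssh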
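Because several files can set the same keyword, it helps to audit the effective configuration rather than any single file. The sketch below compares `sshd -T` output against three of the hardening values above. The function name `audit_sshd_config` is hypothetical, and the expected values are the ones this guide uses.

```shell
# Compare `sshd -T` output (stdin) against the hardening values from this
# guide. Prints one line per setting that differs or is missing and exits
# non-zero if any mismatch is found.
audit_sshd_config() {
    awk '
        BEGIN {
            want["passwordauthentication"] = "no"
            want["pubkeyauthentication"]   = "yes"
            want["permitrootlogin"]        = "no"
        }
        ($1 in want) { seen[$1] = $2 }
        END {
            bad = 0
            for (k in want) {
                got = (k in seen) ? seen[k] : "unset"
                if (got != want[k]) {
                    printf "%s: expected %s, got %s\n", k, want[k], got
                    bad = 1
                }
            }
            exit bad
        }'
}
```

Usage: `sudo sshd -T | audit_sshd_config`. No output and a zero exit status mean the effective settings match.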

Docker Installation and Configuration Issues

Docker Installation Failures

Package Installation Fails

Symptoms:

  • Docker packages not found
  • GPG key verification errors
  • Repository access failures

Diagnosis:

# Test internet connectivity
ping google.com
curl -I https://download.docker.com

# Check repository configuration
cat /etc/apt/sources.list.d/docker.list
apt-cache policy docker-ce

# Check for package conflicts
dpkg -l | grep docker

Solutions:

# Remove conflicting packages
sudo apt remove -y docker docker-engine docker.io containerd runc

# Reinstall Docker repository
sudo mkdir -p /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
sudo chmod a+r /etc/apt/keyrings/docker.gpg  # apt must be able to read the key

echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null

# Install Docker
sudo apt update
sudo apt install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin

Docker Service Issues

Symptoms:

  • Docker daemon won't start
  • Socket connection errors
  • Service failure on boot

Diagnosis:

# Check service status
systemctl status docker
journalctl -u docker.service -f

# Check system resources
df -h
free -h

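# Test daemon manually (stop the service first; a second dockerd instance will not start)
sudo systemctl stop docker docker.socket
sudo dockerd --debug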

Solutions:

# Restart Docker service
sudo systemctl stop docker
sudo systemctl start docker
sudo systemctl enable docker

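# Clear corrupted Docker data
# (run the glob in a root shell; an unprivileged shell cannot read /var/lib/docker to expand it)
sudo systemctl stop docker
sudo sh -c 'rm -rf /var/lib/docker/tmp/*'
sudo systemctl start docker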

# Reset Docker configuration
sudo mv /etc/docker/daemon.json /etc/docker/daemon.json.bak 2>/dev/null || true
sudo systemctl restart docker

Docker Permission and Access Issues

Permission Denied Errors

Symptoms:

  • Must use sudo for Docker commands
  • "Permission denied" when accessing Docker socket
  • User not in docker group

Diagnosis:

# Check user groups
groups
groups cal
getent group docker

# Check Docker socket permissions
ls -la /var/run/docker.sock

# Verify Docker service is running
systemctl status docker

Solutions:

# Create docker group if missing, then add the user
sudo groupadd docker 2>/dev/null || true
sudo usermod -aG docker cal

# Apply group membership (requires logout/login or):
newgrp docker

# Fix socket permissions (660 is the Docker default)
sudo chown root:docker /var/run/docker.sock
sudo chmod 660 /var/run/docker.sock
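A common source of confusion is that `usermod` updates the group database immediately, while an existing login session keeps its old group list. The sketch below, using a hypothetical helper `has_group`, lets both views be compared.

```shell
# Succeed if the group named in $1 appears in a space-separated group
# list read from stdin (the format printed by `id -nG`).
has_group() {
    tr ' ' '\n' | grep -qxF -- "$1"
}
```

Usage: `id -nG cal | has_group docker` checks the group database, while `id -nG | has_group docker` (no user argument) checks the current session. If the first succeeds and the second fails, log out and back in or run `newgrp docker`.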

Network Configuration Problems

IP Address and Connectivity Issues

Incorrect IP Configuration

Symptoms:

  • VM has wrong IP address
  • No network connectivity
  • Cannot reach default gateway

Diagnosis:

# Check network configuration
ip addr show
ip route show
cat /etc/netplan/*.yaml

# Test connectivity
ping $(ip route | grep default | awk '{print $3}')  # Gateway
ping 8.8.8.8  # External connectivity

Solutions:

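# Fix netplan configuration
# (gateway4 is deprecated in newer netplan releases in favour of a default route under routes:)
sudo tee /etc/netplan/00-installer-config.yaml << 'EOF'
network:
  version: 2
  ethernets:
    ens18:
      dhcp4: false
      addresses: [10.10.0.200/24]
      gateway4: 10.10.0.1
      nameservers:
        addresses: [10.10.0.16, 8.8.8.8]
EOF
sudo chmod 600 /etc/netplan/00-installer-config.yaml  # netplan warns about world-readable configs

# Apply network configuration
# (from the console, `sudo netplan try` applies the change and rolls back unless confirmed)
sudo netplan apply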

DNS Resolution Problems

Symptoms:

  • Cannot resolve domain names
  • Package downloads fail
  • Host lookup failures

Diagnosis:

# Check DNS configuration
cat /etc/resolv.conf
resolvectl status  # older systemd releases: systemd-resolve --status

# Test DNS resolution
nslookup google.com
dig google.com @8.8.8.8

Solutions:

# Fix DNS in netplan (see above example)
sudo netplan apply

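# Temporary DNS fix
# (where systemd-resolved manages /etc/resolv.conf, this is overwritten on the next update or restart)
echo "nameserver 8.8.8.8" | sudo tee /etc/resolv.conf

# Restart DNS services
sudo systemctl restart systemd-resolved
# On ifupdown systems (no netplan) also: sudo systemctl restart networking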

System Maintenance Issues

Package Management Problems

Update Failures

Symptoms:

  • apt update fails
  • Repository signature errors
  • Dependency conflicts

Diagnosis:

# Check repository status
sudo apt update
apt-cache policy

# Check disk space
df -h /
df -h /var

# Check for held packages
apt-mark showhold

Solutions:

# Fix broken packages
sudo apt --fix-broken install
sudo dpkg --configure -a

# Clean package cache
sudo apt clean
sudo apt autoclean
sudo apt autoremove

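# Reset problematic repositories
# (apt-key is deprecated; where possible, store the key under /etc/apt/keyrings and reference it with signed-by=)
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys <KEYID>
sudo apt update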

Storage and Disk Space Issues

Disk Space Exhaustion

Symptoms:

  • Cannot install packages
  • Docker operations fail
  • System becomes unresponsive

Diagnosis:

# Check disk usage
df -h
du -sh /home/* /var/* /opt/* 2>/dev/null

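# Find large files (-xdev stays on the root filesystem; sudo avoids missing unreadable directories)
sudo find / -xdev -type f -size +100M 2>/dev/null | head -20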
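To flag only the filesystems that are actually close to full, the `df` output can be filtered by usage. A minimal sketch with a hypothetical function `full_filesystems`; the 90% default threshold is an assumption, and mount points containing spaces are not handled.

```shell
# Print "<mount point> <use%>" for every filesystem at or above a usage
# threshold (default 90), reading `df -P` output from stdin.
full_filesystems() {
    awk -v limit="${1:-90}" 'NR > 1 {
        use = $5
        sub(/%/, "", use)
        if (use + 0 >= limit + 0) print $6 " " $5
    }'
}
```

Usage: `df -P | full_filesystems 85`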

Solutions:

# Clean system files
sudo apt clean
sudo apt autoremove
sudo journalctl --vacuum-time=7d

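# Clean Docker data
# (CAUTION: removes all unused images and unused volumes, including any data stored in those volumes)
docker system prune -a -f
docker volume prune -f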

# Extend disk (Proxmox GUI: Hardware -> Resize)
# Then extend filesystem:
sudo growpart /dev/sda 1
sudo resize2fs /dev/sda1

Emergency Recovery Procedures

SSH Access Recovery

Complete SSH Lockout

Recovery Steps:

  1. Use Proxmox console for direct VM access
  2. Reset SSH configuration:
    # Via console
    sudo cp /etc/ssh/sshd_config.backup /etc/ssh/sshd_config 2>/dev/null || true
    sudo systemctl restart ssh
    
  3. Re-enable emergency access:
    # Temporary password access for recovery (revert once key access works again)
    sudo passwd cal
    # The hardening drop-in under sshd_config.d also sets this, so update both locations
    sudo sed -i 's/PasswordAuthentication no/PasswordAuthentication yes/' /etc/ssh/sshd_config /etc/ssh/sshd_config.d/*.conf
    sudo systemctl restart ssh
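Temporary password access is easy to forget to turn off. The sketch below wraps the toggle in a function so the same step can be reversed once key access works again. The function name `set_password_auth` is hypothetical. It only rewrites active (uncommented) `PasswordAuthentication` lines and leaves the file unchanged if none exist.

```shell
# Set PasswordAuthentication to "yes" or "no" in one sshd config file.
# Only active (uncommented) directives are rewritten; file ownership and
# mode are preserved because the file is overwritten in place.
set_password_auth() {
    file=$1
    value=$2
    case "$value" in
        yes|no) ;;
        *) echo "value must be yes or no" >&2; return 2 ;;
    esac
    tmp=$(mktemp) || return 1
    sed -E "s/^[[:space:]]*PasswordAuthentication[[:space:]].*/PasswordAuthentication $value/" "$file" > "$tmp" &&
        cat "$tmp" > "$file"
    rc=$?
    rm -f "$tmp"
    return $rc
}
```

Usage (from a root shell on the console): `set_password_auth /etc/ssh/sshd_config yes` to open recovery access, then the same call with `no` afterwards. Restart the SSH service after each change.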
    

Emergency SSH Key Deployment

If primary keys fail:

# Use emergency key
ssh -i ~/.ssh/emergency_homelab_rsa cal@<vm-ip>

# Or deploy keys via console
mkdir -p ~/.ssh
chmod 700 ~/.ssh
cat > ~/.ssh/authorized_keys << 'EOF'
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQC... # primary key
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQD... # emergency key
EOF
chmod 600 ~/.ssh/authorized_keys

VM Recovery and Rebuild

Corrupt VM Recovery

Steps:

  1. Create snapshot before attempting recovery
  2. Export VM data:
    # Backup important data
    rsync -av cal@<vm-ip>:/home/cal/ ./vm-backup/
    
  3. Restore from template:
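    # Stop, then delete the corrupt VM (irreversible; confirm the backup from step 2 first)
    pvesh create /nodes/pve/qemu/<vmid>/status/stop
    pvesh delete /nodes/pve/qemu/<vmid>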
    
    # Clone from template
    pvesh create /nodes/pve/qemu/<template-id>/clone -newid <vmid> -name <vm-name>
    

Post-Install Script Recovery

If automation fails:

# Run in debug mode
bash -x ./scripts/vm-management/vm-post-install.sh <vm-ip> <user>

# Manual step execution
ssh cal@<vm-ip> 'sudo apt update && sudo apt upgrade -y'
ssh cal@<vm-ip> 'curl -fsSL https://get.docker.com | sh'
ssh cal@<vm-ip> 'sudo usermod -aG docker cal'

Prevention and Monitoring

Pre-Deployment Validation

# Verify prerequisites
ls -la ~/.ssh/homelab_rsa*
ls -la ~/.ssh/emergency_homelab_rsa*
ping 10.10.0.1

# Test cloud-init YAML
python3 -c "import yaml; yaml.safe_load(open('cloud-init-user-data.yaml'))"

Health Monitoring Script

#!/bin/bash
# vm-health-check.sh
VM_IPS="10.10.0.200 10.10.0.201 10.10.0.202"
SSH_OPTS="-o ConnectTimeout=5 -o BatchMode=yes"

for ip in $VM_IPS; do
    # $SSH_OPTS is left unquoted on purpose so it splits into separate options
    if ssh $SSH_OPTS "cal@$ip" 'uptime' >/dev/null 2>&1; then
        echo "✅ $ip: SSH OK"
        # Check Docker (same timeout and batch options, so a hung VM cannot stall the loop)
        if ssh $SSH_OPTS "cal@$ip" 'docker info >/dev/null 2>&1'; then
            echo "✅ $ip: Docker OK"
        else
            echo "❌ $ip: Docker FAILED"
        fi
    else
        echo "❌ $ip: SSH FAILED"
    fi
done

Automated Backup

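#!/bin/bash
# vm-backup.sh
# Schedule in crontab: 0 2 * * * /path/to/vm-backup.sh
# Note: cron starts jobs in the user's home directory, so ./backups resolves there
for vm_ip in 10.10.0.{200..210}; do
    if ping -c1 -W2 "$vm_ip" >/dev/null 2>&1; then
        mkdir -p "./backups/$vm_ip"
        rsync -av --exclude='.cache' "cal@$vm_ip:/home/cal/" "./backups/$vm_ip/"
    fi
done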

Quick Reference Commands

Essential VM Management

# VM control via Proxmox
pvesh get /nodes/pve/qemu/<vmid>/status/current
pvesh create /nodes/pve/qemu/<vmid>/status/start
pvesh create /nodes/pve/qemu/<vmid>/status/stop  # hard stop; use status/shutdown for a clean guest shutdown

# SSH with alternative keys
ssh -i ~/.ssh/emergency_homelab_rsa cal@<vm-ip>

# System health checks
free -h && df -h && systemctl status docker
docker system info && docker system df

Recovery Resources

  • SSH Keys Backup: /mnt/NV2/ssh-keys/backup-*/
  • Proxmox Console: Direct VM access when SSH fails
  • Emergency Contact: Use Discord notifications for critical issues

This guide covers diagnosis and recovery procedures for VM management issues in the home lab environment.