claude-home/vm-management/examples/troubleshooting.md
Cal Corum 4b7eca8a46
2026-03-12 09:00:44 -05:00


---
title: VM Troubleshooting Examples
description: Detailed troubleshooting examples for Proxmox VM provisioning failures, SSH connectivity issues, Docker installation problems, network/DNS configuration, disk space, and emergency recovery procedures.
type: troubleshooting
domain: vm-management
tags: [proxmox, troubleshooting, ssh, docker, cloud-init, networking, recovery]
---

VM Management Troubleshooting Guide

Complete troubleshooting guide for Proxmox VM provisioning, SSH connectivity, Docker installation, and common configuration issues.

Common Issues and Solutions

1. VM Provisioning Failures

Cloud-Init Not Working

Symptoms:

  • VM starts but cloud-init configuration not applied
  • User account not created
  • SSH keys not installed

Diagnosis:

# Check cloud-init status
ssh root@<vm-ip> 'cloud-init status --long'

# View cloud-init logs
ssh root@<vm-ip> 'cat /var/log/cloud-init.log'
ssh root@<vm-ip> 'cat /var/log/cloud-init-output.log'

# Check cloud-init configuration
ssh root@<vm-ip> 'cloud-init query userdata'

Solutions:

# Re-run cloud-init (if safe to do so)
ssh root@<vm-ip> 'cloud-init clean --logs'
ssh root@<vm-ip> 'cloud-init init --local'
ssh root@<vm-ip> 'cloud-init init'
ssh root@<vm-ip> 'cloud-init modules --mode=config'
ssh root@<vm-ip> 'cloud-init modules --mode=final'

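# Force user creation if missing
# (useradd fails if a listed group is absent; the docker group only exists once Docker is installed)
ssh root@<vm-ip> 'useradd -m -s /bin/bash -G sudo,docker cal'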

# Fix YAML syntax in cloud-init if needed
# Common issues: incorrect indentation, missing quotes
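
The YAML problems mentioned above can often be caught before the VM boots. The following is a minimal sketch, assuming the user-data is kept in a local file; the function name `check_user_data` is hypothetical. It tests only two things: that the first line is the `#cloud-config` header and that the file contains no tab characters, which YAML does not accept as indentation. Where cloud-init is installed, `cloud-init schema --config-file <file>` should give a fuller validation on recent releases.

```shell
#!/bin/bash
# check_user_data FILE   (hypothetical helper, not part of cloud-init)
# Returns 0 when FILE passes both basic checks, 1 otherwise.
check_user_data() {
    local file rc
    file=$1
    rc=0
    if [ ! -r "$file" ]; then
        echo "cannot read $file"
        return 1
    fi
    # The first line must be exactly the cloud-config header
    if [ "$(head -n 1 "$file")" != "#cloud-config" ]; then
        echo "first line is not #cloud-config"
        rc=1
    fi
    # Flag any tab character; YAML forbids tabs as indentation
    if grep -q "$(printf '\t')" "$file"; then
        echo "tab character found (use spaces)"
        rc=1
    fi
    return $rc
}

# Usage: check_user_data ./user-data.yaml && echo "basic checks passed"
```

A file can pass both checks and still be invalid, so this is a quick first filter rather than a replacement for schema validation.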

VM Won't Start

Symptoms:

  • VM fails to boot
  • Kernel panic or boot errors
  • Hangs during startup

Diagnosis:

# Check VM configuration in Proxmox
pvesh get /nodes/pve/qemu/<vmid>/config

# View console output
# Use Proxmox web interface Console tab

# Check VM resource allocation
pvesh get /nodes/pve/qemu/<vmid>/status/current

Solutions:

# Increase memory if low
pvesh set /nodes/pve/qemu/<vmid>/config -memory 2048

# Check disk space and format
pvesh get /nodes/pve/storage

# Reset to safe configuration
pvesh set /nodes/pve/qemu/<vmid>/config -cpu host -cores 2

2. SSH Connection Issues

Cannot Connect to VM

Symptoms:

  • Connection timeout
  • Connection refused
  • Host unreachable

Diagnosis:

# Test network connectivity
ping <vm-ip>

# Check SSH port
nc -zv <vm-ip> 22
nmap -p 22 <vm-ip>

# Check from Proxmox console
# Use Proxmox web interface -> VM -> Console
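systemctl status ssh   # the unit is named ssh on Ubuntu/Debian, sshd on RHEL-family systems
ss -tlnp | grep :22    # netstat needs the net-tools package, which may not be installed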
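
A timeout right after provisioning may only mean the VM has not finished booting. The sketch below retries the port check a few times before giving up. It assumes `bash` and the coreutils `timeout` command are available on the workstation; the function name `wait_for_port` and its default values are illustrative.

```shell
#!/bin/bash
# wait_for_port HOST PORT [ATTEMPTS] [DELAY_SECONDS]   (hypothetical helper)
# Returns 0 as soon as a TCP connection succeeds, 1 if every attempt fails.
wait_for_port() {
    local host port attempts delay i
    host=$1
    port=$2
    attempts=${3:-10}
    delay=${4:-3}
    i=1
    while [ "$i" -le "$attempts" ]; do
        # bash opens a TCP connection when a redirection targets /dev/tcp/HOST/PORT
        if timeout 2 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
            echo "port $port on $host is open (attempt $i)"
            return 0
        fi
        # Pause between attempts, but not after the last one
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    echo "port $port on $host still closed after $attempts attempts"
    return 1
}

# Usage: wait_for_port <vm-ip> 22 20 5 && ssh cal@<vm-ip>
```

An open port only shows that something is listening; authentication problems still need the verbose SSH output.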

Solutions:

# Start SSH service (via console)
# On Ubuntu/Debian the unit is ssh; sshd is only an alias there and cannot be enabled by that name
systemctl start ssh
systemctl enable ssh

# Check firewall (via console)
ufw status
# If active and blocking SSH:
ufw allow ssh

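# Reset network configuration
ip addr show
sudo netplan apply   # re-applies /etc/netplan/*.yaml on netplan-managed systems
# Without netplan: dhclient (if using DHCP) or systemctl restart networking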

SSH Key Authentication Fails

Symptoms:

  • Password prompts despite keys being installed
  • Permission denied (publickey)
  • "No more authentication methods to try"

Diagnosis:

# Verbose SSH connection
ssh -vvv cal@<vm-ip>

# Check authorized_keys file (via console or password auth)
ls -la ~/.ssh/
cat ~/.ssh/authorized_keys

Solutions:

# Fix file permissions
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys

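# Verify key content
wc -l < ~/.ssh/authorized_keys  # Should print 2 (one line per key)

# Re-deploy keys manually (run on the VM after copying the .pub files there;
# from the workstation, ssh-copy-id -i ~/.ssh/homelab_rsa.pub cal@<vm-ip> does the same over password auth)
cat ~/.ssh/homelab_rsa.pub >> ~/.ssh/authorized_keys
cat ~/.ssh/emergency_homelab_rsa.pub >> ~/.ssh/authorized_keys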

# Check SSH configuration (main file and drop-ins)
sudo grep -rE "(PubkeyAuth|PasswordAuth)" /etc/ssh/sshd_config /etc/ssh/sshd_config.d/
sudo systemctl restart ssh
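
With `StrictModes` enabled, which is the OpenSSH default, sshd ignores keys when the relevant files have loose permissions, so checking the modes is a quick first step. The sketch below compares them against the 700/600 values used above. The function name `check_ssh_perms` is hypothetical, and `stat -c` assumes GNU coreutils, as found on Ubuntu.

```shell
#!/bin/bash
# check_ssh_perms [DIR]   (hypothetical helper; DIR defaults to ~/.ssh)
# Prints each mismatch and returns 1 if any mode differs from the expected value.
check_ssh_perms() {
    local dir mode rc
    dir=${1:-$HOME/.ssh}
    rc=0
    # %a prints the permission bits in octal, for example 700
    mode=$(stat -c '%a' "$dir") || return 1
    if [ "$mode" != "700" ]; then
        echo "$dir has mode $mode, expected 700"
        rc=1
    fi
    mode=$(stat -c '%a' "$dir/authorized_keys") || return 1
    if [ "$mode" != "600" ]; then
        echo "$dir/authorized_keys has mode $mode, expected 600"
        rc=1
    fi
    return $rc
}

# Usage (on the VM): check_ssh_perms || echo "fix permissions before retrying"
```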

SSH Configuration Problems

Symptoms:

  • SSH works but with wrong settings
  • Root access when it should be disabled
  • Password authentication enabled

Diagnosis:

# Check effective SSH configuration
sudo sshd -T | grep -E "(passwordauth|pubkeyauth|permitroot)"

# View SSH configuration files
cat /etc/ssh/sshd_config
ls /etc/ssh/sshd_config.d/

Solutions:

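# Apply security hardening manually
# sshd keeps the first value it reads for each keyword, so an earlier drop-in
# (for example 50-cloud-init.conf, if present) takes precedence over this file
sudo tee /etc/ssh/sshd_config.d/99-homelab-security.conf << 'EOF'
PasswordAuthentication no
PubkeyAuthentication yes
PermitRootLogin no
AllowUsers cal
ClientAliveInterval 300
ClientAliveCountMax 2
MaxAuthTries 3
EOF

# Validate the configuration before restarting (sshd -t prints nothing on success)
sudo sshd -t && sudo systemctl restart ssh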
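
Because settings can come from several files, it helps to check the effective values rather than any one file. `sshd -T` prints the effective configuration with lowercase keywords. The sketch below reads that output on standard input and reports the three settings this section cares about; the function name `audit_sshd_config` is hypothetical.

```shell
#!/bin/bash
# audit_sshd_config   (hypothetical helper)
# Reads `sshd -T` output on stdin; returns 1 if a setting differs or is missing.
audit_sshd_config() {
    awk '
        BEGIN {
            want["passwordauthentication"] = "no"
            want["pubkeyauthentication"]   = "yes"
            want["permitrootlogin"]        = "no"
        }
        ($1 in want) {
            seen[$1] = 1
            if ($2 != want[$1]) {
                printf "%s is %s, expected %s\n", $1, $2, want[$1]
                bad = 1
            }
        }
        END {
            # Report keywords that never appeared in the input
            for (k in want) {
                if (!(k in seen)) {
                    printf "%s not found in input\n", k
                    bad = 1
                }
            }
            exit (bad ? 1 : 0)
        }'
}

# Usage (on the VM): sudo sshd -T | audit_sshd_config && echo "hardening in effect"
```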

3. Docker Installation Issues

Docker Installation Fails

Symptoms:

  • Docker packages not found
  • GPG key verification fails
  • Permission denied errors

Diagnosis:

# Check internet connectivity
ping google.com
curl -I https://download.docker.com

# Check repository configuration
cat /etc/apt/sources.list.d/docker.list
apt-cache policy docker-ce

# Check for conflicting packages
dpkg -l | grep docker

Solutions:

# Remove conflicting packages
sudo apt remove -y docker docker-engine docker.io containerd runc

# Re-add Docker repository
sudo mkdir -p /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list

# Install Docker
sudo apt update
sudo apt install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin

Docker Permission Issues

Symptoms:

  • "Permission denied" when running docker commands
  • Must use sudo for docker commands
  • User not in docker group

Diagnosis:

# Check user groups
groups
groups cal

# Check docker group exists
getent group docker

# Check docker service
systemctl status docker

Solutions:

# Create docker group if missing
getent group docker >/dev/null || sudo groupadd docker

# Add user to docker group
sudo usermod -aG docker cal

# Apply group membership (log out and back in, or start a subshell with the new group)
newgrp docker

# Fix socket ownership and permissions (the packaged default is root:docker, mode 660)
sudo chown root:docker /var/run/docker.sock
sudo chmod 660 /var/run/docker.sock
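
A common source of confusion is that `id -nG cal` reads the group database, while a shell that was already open keeps its old group list until the user logs in again. The sketch below checks the database side; the function name `user_in_group` is hypothetical.

```shell
#!/bin/bash
# user_in_group USER GROUP   (hypothetical helper)
# Returns 0 if the group database lists USER as a member of GROUP.
user_in_group() {
    # id -nG prints the group names separated by spaces; compare whole names only
    id -nG "$1" 2>/dev/null | tr ' ' '\n' | grep -qx -- "$2"
}

# Usage (on the VM):
#   user_in_group cal docker && echo "cal is in docker (re-login may still be needed)"
```

If this succeeds but `docker ps` still reports a permission error, the current session most likely predates the group change.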

Docker Service Won't Start

Symptoms:

  • Docker daemon not running
  • Socket connection errors
  • systemctl shows failed status

Diagnosis:

# Check service status
systemctl status docker
journalctl -u docker.service -f

# Check daemon logs (foreground debug run; stop the service first with: sudo systemctl stop docker)
sudo dockerd --debug

# Check system resources
df -h
free -h

Solutions:

# Restart Docker service
sudo systemctl restart docker
sudo systemctl enable docker

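# Clear Docker temp data if corrupted
# (the glob must expand in a root shell; an unprivileged shell cannot list /var/lib/docker)
sudo systemctl stop docker
sudo sh -c 'rm -rf /var/lib/docker/tmp/*'
sudo systemctl start docker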

# Reset Docker configuration
sudo systemctl stop docker
sudo mv /etc/docker/daemon.json /etc/docker/daemon.json.bak
sudo systemctl start docker

4. System Update Issues

Package Update Failures

Symptoms:

  • apt update fails
  • Repository errors
  • Dependency conflicts

Diagnosis:

# Check repository status
sudo apt update
cat /etc/apt/sources.list
ls /etc/apt/sources.list.d/

# Check disk space
df -h /
df -h /var

Solutions:

# Fix broken packages
sudo apt --fix-broken install
sudo dpkg --configure -a

# Clean package cache
sudo apt clean
sudo apt autoclean
sudo apt autoremove

# Reset sources if needed
sudo cp /etc/apt/sources.list /etc/apt/sources.list.backup
# Manually edit to use main Ubuntu repositories

5. Network Configuration Problems

IP Configuration Issues

Symptoms:

  • VM has wrong IP address
  • No network connectivity
  • DNS resolution fails

Diagnosis:

# Check network configuration
ip addr show
ip route show
sudo cat /etc/netplan/*.yaml   # netplan files are often readable by root only

# Test connectivity
ping -c 3 10.10.0.1  # Gateway
ping -c 3 8.8.8.8    # External DNS
nslookup google.com

Solutions:

# Fix netplan configuration
sudo nano /etc/netplan/00-installer-config.yaml

# Example correct configuration:
network:
  version: 2
  ethernets:
    ens18:
      dhcp4: false
      addresses: [10.10.0.200/24]
      routes:   # replaces the deprecated gateway4 key
        - to: default
          via: 10.10.0.1
      nameservers:
        addresses: [10.10.0.16, 8.8.8.8]

# Apply configuration
sudo netplan apply

DNS Resolution Problems

Symptoms:

  • Cannot resolve domain names
  • Package installation fails
  • Hostname lookups fail

Diagnosis:

# Check DNS configuration
cat /etc/resolv.conf
resolvectl status  # systemd-resolve --status on older releases

# Test DNS
nslookup google.com
dig google.com

Solutions:

# Fix DNS in netplan
sudo nano /etc/netplan/00-installer-config.yaml
# Add nameservers section as shown above

# Temporary DNS fix (systemd-resolved or the DHCP client may overwrite this file again)
echo "nameserver 8.8.8.8" | sudo tee /etc/resolv.conf

# Restart networking
sudo netplan apply
sudo systemctl restart systemd-resolved
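
On systems that use systemd-resolved, `/etc/resolv.conf` usually lists only the local stub address `127.0.0.53`, and the real upstream servers appear in `resolvectl status` instead. The sketch below lists the nameserver entries from a resolv.conf-style file and reports whether only the stub is configured. Both function names are hypothetical.

```shell
#!/bin/bash
# list_nameservers [FILE]   (hypothetical helper; FILE defaults to /etc/resolv.conf)
# Prints one nameserver address per line.
list_nameservers() {
    awk '$1 == "nameserver" { print $2 }' "${1:-/etc/resolv.conf}"
}

# uses_stub_resolver [FILE]   (hypothetical helper)
# Returns 0 when the only nameserver entry is the systemd-resolved stub.
uses_stub_resolver() {
    [ "$(list_nameservers "${1:-}")" = "127.0.0.53" ]
}

# Usage (on the VM):
#   if uses_stub_resolver; then resolvectl status; else list_nameservers; fi
```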

6. Storage and Disk Issues

Disk Space Problems

Symptoms:

  • VM runs out of disk space
  • Cannot install packages
  • Docker images won't download

Diagnosis:

# Check disk usage
df -h
du -sh /home/*
du -sh /var/*

# Check for large files (-xdev stays on the root filesystem; sudo avoids unreadable directories)
sudo find / -xdev -type f -size +100M 2>/dev/null | head -10

Solutions:

# Clean system
sudo apt clean
sudo apt autoremove
docker system prune -a

# Extend disk in Proxmox (if needed)
# Use Proxmox web interface: VM -> Hardware -> Hard Disk -> Resize

# Extend filesystem after disk resize
# (confirm the device and partition with lsblk first; resize2fs applies to ext2/3/4 filesystems)
sudo growpart /dev/sda 1
sudo resize2fs /dev/sda1
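
Disk pressure is easier to handle when it is noticed before the disk is full. The sketch below reads `df -P` output on standard input and lists every mount point at or above a given usage percentage. The function name `over_threshold` and the 90 percent figure in the usage line are illustrative, and mount points that contain spaces are not handled.

```shell
#!/bin/bash
# over_threshold LIMIT   (hypothetical helper)
# Reads `df -P` output on stdin and prints "MOUNTPOINT USE%" for every
# filesystem at or above LIMIT percent. Returns 1 if any were printed.
over_threshold() {
    awk -v limit="$1" '
        NR > 1 {                      # skip the header line
            use = $5
            sub(/%/, "", use)         # "95%" -> "95"
            if (use + 0 >= limit + 0) {
                print $6 " " $5
                found = 1
            }
        }
        END { exit (found ? 1 : 0) }'
}

# Usage (on the VM): df -P | over_threshold 90 || echo "clean up or extend the disk"
```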

Advanced Troubleshooting

Post-Install Script Debug Mode

# Run script with debug output
bash -x ./scripts/vm-management/vm-post-install.sh <vm-ip> <user>

# Check specific steps manually
ssh cal@<vm-ip> 'docker --version'
ssh cal@<vm-ip> 'sudo systemctl status ssh'
ssh cal@<vm-ip> 'wc -l < ~/.ssh/authorized_keys'

Recovery Procedures

Emergency SSH Access

# If primary SSH key fails, use emergency key
ssh -i ~/.ssh/emergency_homelab_rsa cal@<vm-ip>

# If all SSH access fails, use Proxmox console
# VM -> Console in Proxmox web interface

# Reset SSH configuration (assumes a backup was saved earlier as /etc/ssh/sshd_config.backup)
sudo cp /etc/ssh/sshd_config.backup /etc/ssh/sshd_config
sudo systemctl restart ssh

Complete VM Reset

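# If VM is completely broken, restore from template
# (destructive: the VM must be stopped before deletion, and deleting it removes its disks)
pvesh create /nodes/pve/qemu/<vmid>/status/stop
pvesh delete /nodes/pve/qemu/<vmid>
pvesh create /nodes/pve/qemu/<template-id>/clone -newid <vmid> -name <new-name>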

# Or re-run cloud-init provisioning
# Delete VM and recreate with same cloud-init configuration

Prevention Best Practices

Pre-Deployment Checks

# Verify SSH keys exist
ls -la ~/.ssh/homelab_rsa*
ls -la ~/.ssh/emergency_homelab_rsa*

# Test network connectivity to target subnet
ping 10.10.0.1

# Verify Proxmox storage space
pvesh get /nodes/pve/storage

Monitoring and Alerts

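# Create health check script
# (save the lines from #!/bin/bash onward as vm-health-monitor.sh; the shebang must be the first line of the file)
#!/bin/bash
# vm-health-monitor.sh
for ip in 10.10.0.{200..210}; do
    # BatchMode makes ssh fail instead of prompting, so a cron run cannot hang
    if ssh -o BatchMode=yes -o ConnectTimeout=5 "cal@$ip" 'uptime' >/dev/null 2>&1; then
        echo "✅ $ip: OK"
    else
        echo "❌ $ip: FAILED"
    fi
done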

# Schedule regular checks
# Add to crontab: */15 * * * * /path/to/vm-health-monitor.sh

Emergency Contacts and Resources

Recovery Information

  • SSH Keys Location: /mnt/NV2/ssh-keys/backup-*/
  • Emergency Access: Use Proxmox console for direct VM access
  • Backup Strategy: VM snapshots before major changes
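
The snapshot habit above is easier to keep when the name is generated automatically. Proxmox restricts which characters a snapshot name may contain, so the sketch below builds a name from a free-text label and a UTC timestamp, using only letters, digits, `_` and `-`. The function name `snapshot_name`, the `pre_` prefix, and the 20-character label cut are assumptions; the cut keeps the full name at 40 characters or fewer, an assumed limit worth checking against the Proxmox API documentation.

```shell
#!/bin/bash
# snapshot_name LABEL   (hypothetical helper)
# Prints a name such as pre_<label>_<YYYYMMDD>_<HHMMSS>.
snapshot_name() {
    local label
    # Replace every character outside the allowed set with "_", then cap the length
    label=$(printf '%s' "$1" | tr -c 'A-Za-z0-9_-' '_' | cut -c1-20)
    printf 'pre_%s_%s\n' "$label" "$(date -u +%Y%m%d_%H%M%S)"
}

# Usage (on the Proxmox host):
#   pvesh create /nodes/pve/qemu/<vmid>/snapshot --snapname "$(snapshot_name 'docker upgrade')"
```

Printing the name first, before running the `pvesh` command, makes it easy to confirm what will be created.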

Quick Reference Commands

# VM Status
pvesh get /nodes/pve/qemu/<vmid>/status/current

# Start/Stop VM (stop is an immediate power-off; status/shutdown requests a clean guest shutdown)
pvesh create /nodes/pve/qemu/<vmid>/status/start
pvesh create /nodes/pve/qemu/<vmid>/status/stop

# SSH with different key
ssh -i ~/.ssh/emergency_homelab_rsa cal@<vm-ip>

# Docker system info
docker system info
docker system df