claude-home/vm-management/troubleshooting.md
Cal Corum 4b7eca8a46
docs: add YAML frontmatter to all 151 markdown files
Adds title, description, type, domain, and tags frontmatter to every
doc for improved KB semantic search. The description field is prepended
to every search chunk, and domain/type/tags enable filtered queries.

Type values: context, guide, runbook, reference, troubleshooting
Domain values match directory structure (networking, docker, etc.)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 09:00:44 -05:00


---
title: "VM Management Troubleshooting"
description: "Troubleshooting guide for Proxmox VM issues: cloud-init failures, SSH access problems, Docker installation errors, network configuration, disk space, and emergency recovery procedures."
type: troubleshooting
domain: vm-management
tags: [proxmox, vm, ssh, docker, cloud-init, networking, recovery]
---
# Virtual Machine Management Troubleshooting Guide
## VM Provisioning Issues
### Cloud-Init Configuration Problems
#### Cloud-Init Not Executing
**Symptoms**:
- VM starts but user accounts not created
- SSH keys not deployed
- Packages not installed
- Configuration not applied
**Diagnosis**:
```bash
# Check cloud-init status and logs
ssh root@<vm-ip> 'cloud-init status --long'
ssh root@<vm-ip> 'cat /var/log/cloud-init.log'
ssh root@<vm-ip> 'cat /var/log/cloud-init-output.log'
# Verify cloud-init configuration
ssh root@<vm-ip> 'cloud-init query userdata'
# Check for YAML/schema errors (on cloud-init older than 22.2, use `cloud-init devel schema` instead)
ssh root@<vm-ip> 'cloud-init schema --config-file /var/lib/cloud/instance/user-data.txt'
```
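When checking several VMs, the overall state can be pulled out of the `cloud-init status --long` output with a small filter. The following is a minimal sketch: `cloud_init_state` is a hypothetical helper name, and it assumes the output contains a line of the form `status: <value>`.

```bash
# Hypothetical helper: read `cloud-init status --long` output on stdin
# and print the value of the "status:" line (for example done, running, error).
cloud_init_state() {
    awk -F': *' '$1 == "status" { print $2; exit }'
}
```

Possible usage: `ssh root@<vm-ip> 'cloud-init status --long' | cloud_init_state`.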
**Solutions**:
```bash
# Re-run cloud-init (CAUTION: may overwrite changes)
ssh root@<vm-ip> 'cloud-init clean --logs'
ssh root@<vm-ip> 'cloud-init init --local'
ssh root@<vm-ip> 'cloud-init init'
ssh root@<vm-ip> 'cloud-init modules --mode=config'
ssh root@<vm-ip> 'cloud-init modules --mode=final'
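# Manual user creation if cloud-init fails
# (useradd fails if a listed group does not exist, so create docker first)
ssh root@<vm-ip> 'groupadd -f docker'
ssh root@<vm-ip> 'useradd -m -s /bin/bash -G sudo,docker cal'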
ssh root@<vm-ip> 'mkdir -p /home/cal/.ssh'
ssh root@<vm-ip> 'chown cal:cal /home/cal/.ssh'
ssh root@<vm-ip> 'chmod 700 /home/cal/.ssh'
```
#### Invalid Cloud-Init YAML
**Symptoms**:
- Cloud-init fails with syntax errors
- Parser errors in cloud-init logs
- Partial configuration application
**Common YAML Issues**:
```yaml
# ❌ Incorrect indentation
users:
  - name: cal
  groups: [sudo, docker] # Wrong indentation
# ✅ Correct indentation
users:
  - name: cal
    groups: [sudo, docker] # Proper indentation
# ❌ Missing quotes for special characters
ssh_authorized_keys:
  - ssh-rsa AAAAB3NzaC1... user@host # May fail with special chars
# ✅ Quoted strings
ssh_authorized_keys:
  - "ssh-rsa AAAAB3NzaC1... user@host"
```
### VM Boot and Startup Issues
#### VM Won't Start
**Symptoms**:
- VM fails to boot from Proxmox
- Kernel panic messages
- Boot loop or hanging
**Diagnosis**:
```bash
# Check VM configuration
pvesh get /nodes/pve/qemu/<vmid>/config
# Check resource allocation
pvesh get /nodes/pve/qemu/<vmid>/status/current
# Review VM logs via Proxmox console
# Use Proxmox web interface -> VM -> Console
# Check Proxmox host resources
pvesh get /nodes/pve/status
```
**Solutions**:
```bash
# Increase memory allocation
pvesh set /nodes/pve/qemu/<vmid>/config -memory 4096
# Reset CPU configuration
pvesh set /nodes/pve/qemu/<vmid>/config -cpu host -cores 2
# Check and repair disk
# Stop VM, then:
pvesh get /nodes/pve/qemu/<vmid>/config | grep scsi0
# Use fsck on the disk image if needed
```
#### Resource Constraints
**Symptoms**:
- VM extremely slow performance
- Out-of-memory kills
- Disk I/O bottlenecks
**Diagnosis**:
```bash
# Inside VM resource check
free -h
df -h
iostat 1 5
vmstat 1 5
# Proxmox host resource check
pvesh get /nodes/pve/status
cat /proc/meminfo
df -h /var/lib/vz
```
**Solutions**:
```bash
# Increase VM resources via Proxmox
pvesh set /nodes/pve/qemu/<vmid>/config -memory 8192
pvesh set /nodes/pve/qemu/<vmid>/config -cores 4
# Resize VM disk
# Proxmox GUI: Hardware -> Hard Disk -> Resize
# Then extend filesystem:
sudo growpart /dev/sda 1
sudo resize2fs /dev/sda1
```
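The `growpart` call above hardcodes `/dev/sda` and partition `1`. On VMs with a different disk name, the disk and partition number can be derived from the partition path. This is a sketch only: `split_partition` is a hypothetical helper, and it assumes a plain partition (not LVM or a device-mapper volume).

```bash
# Hypothetical helper: split a partition path into "<disk> <partition-number>".
# Handles /dev/sda1 style names and /dev/nvme0n1p1 or /dev/mmcblk0p1 style names.
split_partition() {
    part=$1
    num=${part##*[!0-9]}            # trailing digits = partition number
    disk=${part%"$num"}             # everything before the trailing digits
    case $disk in
        *[0-9]p) disk=${disk%p} ;;  # nvme0n1p -> nvme0n1
    esac
    printf '%s %s\n' "$disk" "$num"
}
```

For example, `split_partition "$(findmnt -n -o SOURCE /)"` prints the two arguments that `growpart` expects for the root filesystem.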
## SSH Access Issues
### SSH Connection Failures
#### Cannot Connect to VM
**Symptoms**:
- Connection timeout
- Connection refused
- Host unreachable
**Diagnosis**:
```bash
# Network connectivity tests
ping <vm-ip>
traceroute <vm-ip>
# SSH service tests
nc -zv <vm-ip> 22
nmap -p 22 <vm-ip>
# From Proxmox console, check SSH service
systemctl status sshd
ss -tlnp | grep :22
```
**Solutions**:
```bash
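# Via Proxmox console - start SSH and enable it at boot
# (on Debian/Ubuntu the unit is named ssh; sshd is only an alias there and cannot be enabled)
systemctl start ssh
systemctl enable ssh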
# Check and configure firewall
ufw status
# If blocking SSH:
ufw allow ssh
ufw allow 22/tcp
# Network configuration reset
ip addr show
dhclient # For DHCP
netplan apply # Ubuntu with netplan; on ifupdown systems use: systemctl restart networking
```
#### SSH Key Authentication Failures
**Symptoms**:
- Password prompts despite key installation
- "Permission denied (publickey)"
- "No more authentication methods"
**Diagnosis**:
```bash
# Verbose SSH debugging
ssh -vvv cal@<vm-ip>
# Check key files locally
ls -la ~/.ssh/homelab_rsa*
ls -la ~/.ssh/emergency_homelab_rsa*
# Via console or password auth, check VM
ls -la ~/.ssh/
cat ~/.ssh/authorized_keys
```
**Solutions**:
```bash
# Fix SSH directory permissions
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys
chown -R cal:cal ~/.ssh
# Re-deploy SSH keys
cat > ~/.ssh/authorized_keys << 'EOF'
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQC... # primary key
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQD... # emergency key
EOF
# Verify SSH server configuration
sudo sshd -T | grep -E "(passwordauth|pubkeyauth|permitroot)"
```
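Because wrong modes on `~/.ssh` or `authorized_keys` are a common cause of silent key rejection, the permissions can be checked before re-deploying keys. A minimal sketch, assuming GNU `stat` (as on Ubuntu); `check_ssh_perms` is a hypothetical name and the expected modes are the ones used in this guide.

```bash
# Hypothetical helper: verify the modes this guide recommends
# (700 on the directory, 600 on authorized_keys). Returns non-zero on mismatch.
check_ssh_perms() {
    dir=${1:-$HOME/.ssh}
    rc=0
    [ "$(stat -c '%a' "$dir")" = "700" ] \
        || { echo "unexpected mode on $dir"; rc=1; }
    [ "$(stat -c '%a' "$dir/authorized_keys")" = "600" ] \
        || { echo "unexpected mode on $dir/authorized_keys"; rc=1; }
    return $rc
}
```

Run it as the affected user on the VM: `check_ssh_perms` with no argument checks `~/.ssh`.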
#### SSH Security Configuration Issues
**Symptoms**:
- Password authentication still enabled
- Root login allowed
- Insecure SSH settings
**Diagnosis**:
```bash
# Check effective SSH configuration
sudo sshd -T | grep -E "(passwordauth|pubkeyauth|permitroot|allowusers)"
# Review SSH config files
cat /etc/ssh/sshd_config
ls /etc/ssh/sshd_config.d/
```
**Solutions**:
```bash
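# Apply security hardening
# Note: sshd uses the first value it reads for each keyword, so a setting in an
# earlier-sorted file in sshd_config.d (e.g. a cloud-init drop-in) wins over this one
sudo tee /etc/ssh/sshd_config.d/99-homelab-security.conf << 'EOF'
PasswordAuthentication no
PubkeyAuthentication yes
PermitRootLogin no
AllowUsers cal
ClientAliveInterval 300
ClientAliveCountMax 2
MaxAuthTries 3
X11Forwarding no
EOF
# Validate the configuration before restarting to avoid a lockout
sudo sshd -t && sudo systemctl restart sshd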
```
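After hardening, it helps to confirm that the settings are actually in effect rather than only present in a file. The sketch below filters `sshd -T` output for three of the values set above; `check_sshd_hardening` is a hypothetical helper, and the list of expected values is an assumption taken from this guide's configuration.

```bash
# Hypothetical helper: read `sshd -T` output on stdin and report any
# expected setting that is not in effect. Exits non-zero on a mismatch.
check_sshd_hardening() {
    awk '
        BEGIN {
            want["passwordauthentication"] = "no"
            want["pubkeyauthentication"]   = "yes"
            want["permitrootlogin"]        = "no"
        }
        ($1 in want) { seen[$1] = $2 }
        END {
            bad = 0
            for (k in want)
                if (seen[k] != want[k]) {
                    printf "MISMATCH %s: got \"%s\", want \"%s\"\n", k, seen[k], want[k]
                    bad = 1
                }
            exit bad
        }
    '
}
```

Possible usage: `sudo sshd -T | check_sshd_hardening && echo "hardening in effect"`.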
## Docker Installation and Configuration Issues
### Docker Installation Failures
#### Package Installation Fails
**Symptoms**:
- Docker packages not found
- GPG key verification errors
- Repository access failures
**Diagnosis**:
```bash
# Test internet connectivity
ping google.com
curl -I https://download.docker.com
# Check repository configuration
cat /etc/apt/sources.list.d/docker.list
apt-cache policy docker-ce
# Check for package conflicts
dpkg -l | grep docker
```
**Solutions**:
```bash
# Remove conflicting packages
sudo apt remove -y docker docker-engine docker.io containerd runc
# Reinstall Docker repository
sudo mkdir -p /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list
# Install Docker
sudo apt update
sudo apt install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
```
#### Docker Service Issues
**Symptoms**:
- Docker daemon won't start
- Socket connection errors
- Service failure on boot
**Diagnosis**:
```bash
# Check service status
systemctl status docker
journalctl -u docker.service -f
# Check system resources
df -h
free -h
# Test daemon manually
sudo dockerd --debug
```
**Solutions**:
```bash
# Restart Docker service
sudo systemctl stop docker
sudo systemctl start docker
sudo systemctl enable docker
# Clear corrupted Docker data
sudo systemctl stop docker
# Expand the glob inside a root shell: /var/lib/docker is not readable by regular users
sudo sh -c 'rm -rf /var/lib/docker/tmp/*'
sudo systemctl start docker
# Reset Docker configuration
sudo mv /etc/docker/daemon.json /etc/docker/daemon.json.bak 2>/dev/null || true
sudo systemctl restart docker
```
### Docker Permission and Access Issues
#### Permission Denied Errors
**Symptoms**:
- Must use sudo for Docker commands
- "Permission denied" when accessing Docker socket
- User not in docker group
**Diagnosis**:
```bash
# Check user groups
groups
groups cal
getent group docker
# Check Docker socket permissions
ls -la /var/run/docker.sock
# Verify Docker service is running
systemctl status docker
```
**Solutions**:
```bash
# Create docker group if missing
sudo groupadd docker 2>/dev/null || true
# Add user to docker group
sudo usermod -aG docker cal
# Apply group membership (requires logout/login or):
newgrp docker
# Fix socket permissions (the default is root:docker, mode 660)
sudo chown root:docker /var/run/docker.sock
sudo chmod 660 /var/run/docker.sock
```
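A frequent source of confusion is that `id -nG cal` reads the group database, while `id -nG` with no argument shows the groups of the current session, which does not change until the next login. A small filter makes the comparison explicit. This is a sketch; `has_group` is a hypothetical helper name.

```bash
# Hypothetical helper: read a space-separated group list on stdin
# (the output of `id -nG`) and succeed if the named group is present.
has_group() {
    tr ' ' '\n' | grep -qxF "$1"
}
```

If `id -nG cal | has_group docker` succeeds but `id -nG | has_group docker` fails, the membership exists and the session only needs a fresh login (or `newgrp docker`).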
## Network Configuration Problems
### IP Address and Connectivity Issues
#### Incorrect IP Configuration
**Symptoms**:
- VM has wrong IP address
- No network connectivity
- Cannot reach default gateway
**Diagnosis**:
```bash
# Check network configuration
ip addr show
ip route show
cat /etc/netplan/*.yaml
# Test connectivity
ping $(ip route | grep default | awk '{print $3}') # Gateway
ping 8.8.8.8 # External connectivity
```
**Solutions**:
```bash
# Fix netplan configuration
# (newer netplan releases deprecate gateway4 in favor of a routes: entry with "to: default")
sudo tee /etc/netplan/00-installer-config.yaml << 'EOF'
network:
  version: 2
  ethernets:
    ens18:
      dhcp4: false
      addresses: [10.10.0.200/24]
      gateway4: 10.10.0.1
      nameservers:
        addresses: [10.10.0.16, 8.8.8.8]
EOF
# Apply network configuration
sudo netplan apply
```
#### DNS Resolution Problems
**Symptoms**:
- Cannot resolve domain names
- Package downloads fail
- Host lookup failures
**Diagnosis**:
```bash
# Check DNS configuration
cat /etc/resolv.conf
resolvectl status # older releases: systemd-resolve --status
# Test DNS resolution
nslookup google.com
dig google.com @8.8.8.8
```
**Solutions**:
```bash
# Fix DNS in netplan (see above example)
sudo netplan apply
# Temporary DNS fix
echo "nameserver 8.8.8.8" | sudo tee /etc/resolv.conf
# Restart DNS and network services
sudo systemctl restart systemd-resolved
sudo systemctl restart systemd-networkd # ifupdown systems: systemctl restart networking
```
## System Maintenance Issues
### Package Management Problems
#### Update Failures
**Symptoms**:
- apt update fails
- Repository signature errors
- Dependency conflicts
**Diagnosis**:
```bash
# Check repository status
sudo apt update
apt-cache policy
# Check disk space
df -h /
df -h /var
# Check for held packages
apt-mark showhold
```
**Solutions**:
```bash
# Fix broken packages
sudo apt --fix-broken install
sudo dpkg --configure -a
# Clean package cache
sudo apt clean
sudo apt autoclean
sudo apt autoremove
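# Re-import a repository signing key
# (apt-key is deprecated; on current releases store the key under /etc/apt/keyrings
# and reference it with signed-by=, as in the Docker repository setup)
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys <KEYID>
sudo apt update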
```
### Storage and Disk Space Issues
#### Disk Space Exhaustion
**Symptoms**:
- Cannot install packages
- Docker operations fail
- System becomes unresponsive
**Diagnosis**:
```bash
# Check disk usage
df -h
du -sh /home/* /var/* /opt/* 2>/dev/null
# Find large files (stay on the root filesystem; skips /proc and other mounts)
sudo find / -xdev -type f -size +100M 2>/dev/null | head -20
```
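To spot the affected filesystem quickly, the `df` output can be filtered by usage percentage. The following is a minimal sketch: `full_filesystems` is a hypothetical helper, the 90% default is an arbitrary assumption, and mount points containing spaces are not handled.

```bash
# Hypothetical helper: read `df -P` output on stdin and print mount points
# at or above a usage threshold (default 90, as a percentage).
full_filesystems() {
    awk -v limit="${1:-90}" '
        NR > 1 {
            use = $5
            sub(/%/, "", use)               # "95%" -> "95"
            if (use + 0 >= limit + 0) print $6 " " $5
        }
    '
}
```

Possible usage: `df -P | full_filesystems 85`.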
**Solutions**:
```bash
# Clean system files
sudo apt clean
sudo apt autoremove
sudo journalctl --vacuum-time=7d
# Clean Docker data (CAUTION: deletes stopped containers, unused images, and unused volumes)
docker system prune -a -f
docker volume prune -f
# Extend disk (Proxmox GUI: Hardware -> Resize)
# Then extend filesystem:
sudo growpart /dev/sda 1
sudo resize2fs /dev/sda1
```
## Emergency Recovery Procedures
### SSH Access Recovery
#### Complete SSH Lockout
**Recovery Steps**:
1. **Use Proxmox console** for direct VM access
2. **Reset SSH configuration**:
```bash
# Via console
sudo cp /etc/ssh/sshd_config.backup /etc/ssh/sshd_config 2>/dev/null || true
sudo systemctl restart sshd
```
3. **Re-enable emergency access**:
```bash
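# Temporary password access for recovery
sudo passwd cal
# Flip the setting in the main config and in any drop-in (such as 99-homelab-security.conf)
sudo sed -i 's/PasswordAuthentication no/PasswordAuthentication yes/' /etc/ssh/sshd_config /etc/ssh/sshd_config.d/*.conf
sudo systemctl restart sshd
# Set PasswordAuthentication back to "no" once key access works again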
```
#### Emergency SSH Key Deployment
**If primary keys fail**:
```bash
# Use emergency key
ssh -i ~/.ssh/emergency_homelab_rsa cal@<vm-ip>
# Or deploy keys via console
mkdir -p ~/.ssh
chmod 700 ~/.ssh
cat > ~/.ssh/authorized_keys << 'EOF'
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQC... # primary key
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQD... # emergency key
EOF
chmod 600 ~/.ssh/authorized_keys
```
### VM Recovery and Rebuild
#### Corrupt VM Recovery
**Steps**:
1. **Create snapshot** before attempting recovery
2. **Export VM data**:
```bash
# Backup important data
rsync -av cal@<vm-ip>:/home/cal/ ./vm-backup/
```
3. **Restore from template**:
```bash
# Delete corrupt VM
pvesh delete /nodes/pve/qemu/<vmid>
# Clone from template
pvesh create /nodes/pve/qemu/<template-id>/clone -newid <vmid> -name <vm-name>
```
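Since deleting the VM is irreversible, a guard that confirms the exported data actually exists can be placed in front of the delete step. This is a sketch under the assumption that the backup lives in `./vm-backup/` as in step 2; `backup_present` is a hypothetical helper name.

```bash
# Hypothetical helper: succeed only if the given directory exists
# and contains at least one entry.
backup_present() {
    [ -d "$1" ] && [ -n "$(ls -A "$1" 2>/dev/null)" ]
}
```

Possible usage: `backup_present ./vm-backup && pvesh delete /nodes/pve/qemu/<vmid>`. A non-empty directory is only a coarse check; it does not prove the backup is complete.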
#### Post-Install Script Recovery
**If automation fails**:
```bash
# Run in debug mode
bash -x ./scripts/vm-management/vm-post-install.sh <vm-ip> <user>
# Manual step execution
ssh cal@<vm-ip> 'sudo apt update && sudo apt upgrade -y'
ssh cal@<vm-ip> 'curl -fsSL https://get.docker.com | sh'
ssh cal@<vm-ip> 'sudo usermod -aG docker cal'
```
## Prevention and Monitoring
### Pre-Deployment Validation
```bash
# Verify prerequisites
ls -la ~/.ssh/homelab_rsa*
ls -la ~/.ssh/emergency_homelab_rsa*
ping 10.10.0.1
# Test cloud-init YAML
python3 -c "import yaml; yaml.safe_load(open('cloud-init-user-data.yaml'))"
```
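A file can be valid YAML and still be ignored by cloud-init if it does not start with the `#cloud-config` header line. The check below is a minimal sketch; `has_cloud_config_header` is a hypothetical helper, and the file name in the usage line is the one used above.

```bash
# Hypothetical helper: succeed if the first line of the file is exactly "#cloud-config".
has_cloud_config_header() {
    [ "$(head -n 1 "$1")" = "#cloud-config" ]
}
```

Possible usage: `has_cloud_config_header cloud-init-user-data.yaml || echo "missing #cloud-config header"`.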
### Health Monitoring Script
```bash
#!/bin/bash
# vm-health-check.sh
VM_IPS="10.10.0.200 10.10.0.201 10.10.0.202"
for ip in $VM_IPS; do
    if ssh -o ConnectTimeout=5 -o BatchMode=yes "cal@$ip" 'uptime' >/dev/null 2>&1; then
        echo "✅ $ip: SSH OK"
        # Check Docker
        if ssh "cal@$ip" 'docker info >/dev/null 2>&1'; then
            echo "✅ $ip: Docker OK"
        else
            echo "❌ $ip: Docker FAILED"
        fi
    else
        echo "❌ $ip: SSH FAILED"
    fi
done
```
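The script above always exits with status 0, so cron or a notification hook cannot tell whether a check failed. One possible approach is to route each check through a small wrapper that counts failures. This is a sketch: `check` and `failures` are hypothetical names, not part of the existing script.

```bash
# Hypothetical wrapper: run a command quietly, print a result line,
# and count failures so the caller can exit non-zero.
failures=0
check() {
    label=$1
    shift
    if "$@" >/dev/null 2>&1; then
        echo "OK: $label"
    else
        echo "FAILED: $label"
        failures=$((failures + 1))
    fi
}
```

Inside the loop this could be called as `check "$ip ssh" ssh -o ConnectTimeout=5 -o BatchMode=yes "cal@$ip" uptime`, with `[ "$failures" -eq 0 ]` as the last line of the script so that the exit status reflects the result.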
### Automated Backup
```bash
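#!/bin/bash
# vm-backup.sh
# Schedule in crontab: 0 2 * * * /path/to/vm-backup.sh
# Note: cron starts in the user's home directory, so ./backups resolves there
for vm_ip in 10.10.0.{200..210}; do
    if ping -c1 "$vm_ip" >/dev/null 2>&1; then
        mkdir -p "./backups/$vm_ip"
        rsync -av --exclude='.cache' "cal@$vm_ip:/home/cal/" "./backups/$vm_ip/"
    fi
done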
```
## Quick Reference Commands
### Essential VM Management
```bash
# VM control via Proxmox
pvesh get /nodes/pve/qemu/<vmid>/status/current
pvesh create /nodes/pve/qemu/<vmid>/status/start
pvesh create /nodes/pve/qemu/<vmid>/status/stop
# SSH with alternative keys
ssh -i ~/.ssh/emergency_homelab_rsa cal@<vm-ip>
# System health checks
free -h && df -h && systemctl status docker
docker system info && docker system df
```
### Recovery Resources
- **SSH Keys Backup**: `/mnt/NV2/ssh-keys/backup-*/`
- **Proxmox Console**: Direct VM access when SSH fails
- **Emergency Contact**: Use Discord notifications for critical issues
This guide covers diagnosis and recovery procedures for VM management issues in a home lab environment.