claude-home/vm-management/examples/troubleshooting.md
Cal Corum 4b7eca8a46
2026-03-12 09:00:44 -05:00

---
title: "VM Troubleshooting Examples"
description: "Detailed troubleshooting examples for Proxmox VM provisioning failures, SSH connectivity issues, Docker installation problems, network/DNS configuration, disk space, and emergency recovery procedures."
type: troubleshooting
domain: vm-management
tags: [proxmox, troubleshooting, ssh, docker, cloud-init, networking, recovery]
---
# VM Management Troubleshooting Guide
Complete troubleshooting guide for Proxmox VM provisioning, SSH connectivity, Docker installation, and common configuration issues.
## Common Issues and Solutions
### 1. VM Provisioning Failures
#### Cloud-Init Not Working
**Symptoms:**
- VM starts but cloud-init configuration not applied
- User account not created
- SSH keys not installed
**Diagnosis:**
```bash
# Check cloud-init status
ssh root@<vm-ip> 'cloud-init status --long'
# View cloud-init logs
ssh root@<vm-ip> 'cat /var/log/cloud-init.log'
ssh root@<vm-ip> 'cat /var/log/cloud-init-output.log'
# Check cloud-init configuration
ssh root@<vm-ip> 'cloud-init query userdata'
```
**Solutions:**
```bash
# Re-run cloud-init (if safe to do so)
ssh root@<vm-ip> 'cloud-init clean --logs'
ssh root@<vm-ip> 'cloud-init init --local'
ssh root@<vm-ip> 'cloud-init init'
ssh root@<vm-ip> 'cloud-init modules --mode=config'
ssh root@<vm-ip> 'cloud-init modules --mode=final'
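# Force user creation if missing
# (useradd fails if a listed group does not exist; add the docker group with usermod once Docker is installed)
ssh root@<vm-ip> 'useradd -m -s /bin/bash -G sudo cal'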
# Fix YAML syntax in cloud-init if needed
# Common issues: incorrect indentation, missing quotes
```
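Two of the YAML problems mentioned above can be caught before the VM boots. The sketch below assumes custom user-data is kept in a local file; `check_cloud_config` is a hypothetical helper, not part of cloud-init. It only checks that the first line is `#cloud-config` and that the file contains no tab characters, so it is a quick pre-flight check rather than full schema validation.

```bash
# check_cloud_config FILE
# Hypothetical pre-flight check for a cloud-config user-data file.
# Returns 0 if the basic checks pass, 1 otherwise.
check_cloud_config() {
    file=$1
    status=0
    # cloud-init only treats user-data as cloud-config if it starts with this header
    if [ "$(head -n 1 "$file")" != "#cloud-config" ]; then
        echo "FAIL: first line of $file is not '#cloud-config'"
        status=1
    fi
    # YAML indentation must use spaces; tabs are a common copy/paste error
    tab=$(printf '\t')
    if grep -n "$tab" "$file" >/dev/null; then
        echo "FAIL: $file contains tab characters on these lines:"
        grep -n "$tab" "$file"
        status=1
    fi
    if [ "$status" -eq 0 ]; then
        echo "OK: $file passed basic checks"
    fi
    return "$status"
}

# Usage (file name is an example):
#   check_cloud_config ./user-data.yaml
```

On recent cloud-init releases, running `cloud-init schema --system` inside the VM should give a more thorough validation of the user-data that was actually delivered.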
#### VM Won't Start
**Symptoms:**
- VM fails to boot
- Kernel panic or boot errors
- Hangs during startup
**Diagnosis:**
```bash
# Check VM configuration in Proxmox
pvesh get /nodes/pve/qemu/<vmid>/config
# View console output
# Use Proxmox web interface Console tab
# Check VM resource allocation
pvesh get /nodes/pve/qemu/<vmid>/status/current
```
**Solutions:**
```bash
# Increase memory if low
pvesh set /nodes/pve/qemu/<vmid>/config -memory 2048
# Check disk space and format
pvesh get /nodes/pve/storage
# Reset to safe configuration
pvesh set /nodes/pve/qemu/<vmid>/config -cpu host -cores 2
```
### 2. SSH Connection Issues
#### Cannot Connect to VM
**Symptoms:**
- Connection timeout
- Connection refused
- Host unreachable
**Diagnosis:**
```bash
# Test network connectivity
ping <vm-ip>
# Check SSH port
nc -zv <vm-ip> 22
nmap -p 22 <vm-ip>
# Check from Proxmox console
# Use Proxmox web interface -> VM -> Console
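systemctl status ssh    # the unit is named "ssh" on Ubuntu/Debian; "sshd" is an alias
ss -tlnp | grep ':22'   # ss ships with iproute2; netstat needs the net-tools package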
```
**Solutions:**
```bash
# Start SSH service (via console)
# On Ubuntu/Debian the unit is "ssh"; "sshd" is an alias, which systemctl enable rejects
systemctl start ssh
systemctl enable ssh
# Check firewall (via console)
ufw status
# If active and blocking SSH:
ufw allow ssh
# Reset network configuration
ip addr show
dhclient        # If using DHCP
netplan apply   # On ifupdown-based systems use: systemctl restart networking
```
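The three symptoms listed for this issue point to different layers, and the error text printed by `ssh` usually tells them apart. As a rough sketch, the hypothetical helper `classify_ssh_error` below maps that text to the most likely cause; the hints are assumptions based on common cases, not an exhaustive diagnosis.

```bash
# classify_ssh_error "ERROR TEXT"
# Hypothetical helper: maps an ssh client error message to a likely cause.
classify_ssh_error() {
    case "$1" in
        *"Connection timed out"*|*"Operation timed out"*)
            echo "timeout: packets are being dropped; check that the VM is running, the IP is right, and no firewall blocks port 22" ;;
        *"Connection refused"*)
            echo "refused: the host is up but nothing listens on port 22; check the ssh service via the console" ;;
        *"No route to host"*|*"Network is unreachable"*)
            echo "unreachable: routing or addressing problem; check the VM network configuration and bridge" ;;
        *"Permission denied"*)
            echo "auth: the network path works; check keys and sshd configuration" ;;
        *)
            echo "unknown: rerun with ssh -vvv for details" ;;
    esac
}

# Usage (BatchMode avoids an interactive password prompt):
#   classify_ssh_error "$(ssh -o BatchMode=yes -o ConnectTimeout=5 cal@<vm-ip> true 2>&1)"
```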
#### SSH Key Authentication Fails
**Symptoms:**
- Password prompts despite keys being installed
- Permission denied (publickey)
- "No more authentication methods to try"
**Diagnosis:**
```bash
# Verbose SSH connection
ssh -vvv cal@<vm-ip>
# Check authorized_keys file (via console or password auth)
ls -la ~/.ssh/
cat ~/.ssh/authorized_keys
```
**Solutions:**
```bash
# Fix file permissions
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys
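# Verify key content
wc -l < ~/.ssh/authorized_keys  # Should print 2 (one line per key)
# Re-deploy keys manually (the .pub files must already be on the VM;
# from a workstation that still has access, ssh-copy-id -i <key>.pub cal@<vm-ip> does the same)
cat ~/.ssh/homelab_rsa.pub >> ~/.ssh/authorized_keys
cat ~/.ssh/emergency_homelab_rsa.pub >> ~/.ssh/authorized_keys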
# Check SSH configuration
sudo grep -E "(PubkeyAuth|PasswordAuth)" /etc/ssh/sshd_config
sudo systemctl restart sshd
```
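The permission fixes above can be verified in one step. The following is a minimal sketch of a hypothetical `check_ssh_perms` helper that compares the modes of the `.ssh` directory and `authorized_keys` against the 700/600 values used in this guide. It relies on GNU `stat -c`, which is an assumption that holds on Ubuntu but not on BSD or macOS.

```bash
# check_ssh_perms [SSH_DIR]
# Hypothetical helper: verifies the modes recommended in this guide.
# Returns 0 when both modes match, 1 otherwise.
check_ssh_perms() {
    ssh_dir=${1:-$HOME/.ssh}
    key_file="$ssh_dir/authorized_keys"
    status=0
    if [ ! -f "$key_file" ]; then
        echo "FAIL: $key_file does not exist"
        return 1
    fi
    # stat -c '%a' prints the octal mode (GNU coreutils)
    dir_mode=$(stat -c '%a' "$ssh_dir")
    file_mode=$(stat -c '%a' "$key_file")
    if [ "$dir_mode" != "700" ]; then
        echo "FAIL: $ssh_dir has mode $dir_mode, expected 700"
        status=1
    fi
    if [ "$file_mode" != "600" ]; then
        echo "FAIL: $key_file has mode $file_mode, expected 600"
        status=1
    fi
    if [ "$status" -eq 0 ]; then
        echo "OK: permissions on $ssh_dir look correct"
    fi
    return "$status"
}

# Usage (run as the affected user on the VM):
#   check_ssh_perms
```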
#### SSH Configuration Problems
**Symptoms:**
- SSH works but with wrong settings
- Root access when it should be disabled
- Password authentication enabled
**Diagnosis:**
```bash
# Check effective SSH configuration
sudo sshd -T | grep -E "(passwordauth|pubkeyauth|permitroot)"
# View SSH configuration files
cat /etc/ssh/sshd_config
ls /etc/ssh/sshd_config.d/
```
**Solutions:**
```bash
# Apply security hardening manually
sudo tee /etc/ssh/sshd_config.d/99-homelab-security.conf << 'EOF'
PasswordAuthentication no
PubkeyAuthentication yes
PermitRootLogin no
AllowUsers cal
ClientAliveInterval 300
ClientAliveCountMax 2
MaxAuthTries 3
EOF
# Validate the configuration before restarting so a typo cannot lock you out
sudo sshd -t && sudo systemctl restart sshd
```
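After applying the drop-in, it is worth confirming that the settings actually took effect. sshd uses the first value it reads for each keyword, so an earlier file in `/etc/ssh/sshd_config.d/` (cloud images often ship one) can override a `99-` file. The sketch below assumes the three settings from the hardening block are the ones that matter; `verify_sshd_settings` is a hypothetical helper that reads `sshd -T` output on stdin.

```bash
# verify_sshd_settings
# Hypothetical helper: reads `sshd -T` output on stdin and checks three
# effective settings. Returns 0 if all match, 1 otherwise.
verify_sshd_settings() {
    config=$(cat)
    status=0
    for expected in \
        "passwordauthentication no" \
        "pubkeyauthentication yes" \
        "permitrootlogin no"
    do
        # -x matches the whole line, -i ignores case
        if printf '%s\n' "$config" | grep -qix "$expected"; then
            echo "OK: $expected"
        else
            echo "MISMATCH: expected '$expected'"
            status=1
        fi
    done
    return "$status"
}

# Usage:
#   sudo sshd -T | verify_sshd_settings
```

If a mismatch is reported, `grep -ri passwordauthentication /etc/ssh/sshd_config /etc/ssh/sshd_config.d/` shows which file sets the conflicting value.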
### 3. Docker Installation Issues
#### Docker Installation Fails
**Symptoms:**
- Docker packages not found
- GPG key verification fails
- Permission denied errors
**Diagnosis:**
```bash
# Check internet connectivity
ping google.com
curl -I https://download.docker.com
# Check repository configuration
cat /etc/apt/sources.list.d/docker.list
apt-cache policy docker-ce
# Check for conflicting packages
dpkg -l | grep docker
```
**Solutions:**
```bash
# Remove conflicting packages
sudo apt remove -y docker docker-engine docker.io containerd runc
# Re-add Docker repository
sudo mkdir -p /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list
# Install Docker
sudo apt update
sudo apt install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
```
#### Docker Permission Issues
**Symptoms:**
- "Permission denied" when running docker commands
- Must use sudo for docker commands
- User not in docker group
**Diagnosis:**
```bash
# Check user groups
groups
groups cal
# Check docker group exists
getent group docker
# Check docker service
systemctl status docker
```
**Solutions:**
```bash
# Create docker group if missing (-f: succeed if it already exists)
sudo groupadd -f docker
# Add user to docker group
sudo usermod -aG docker cal
# Apply group membership: log out and back in, or start a subshell with the new group
newgrp docker
# Fix socket permissions (Docker's default is root:docker, mode 660)
sudo chown root:docker /var/run/docker.sock
sudo chmod 660 /var/run/docker.sock
```
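A frequent source of confusion is that `usermod` changes the group database immediately, while already-running sessions keep their old group list. `id -nG <user>` reads the database and `id -nG` with no argument reports the current process, so comparing the two shows which case applies. The sketch below illustrates that comparison; `docker_group_state` is a hypothetical helper.

```bash
# docker_group_state "DB_GROUPS" "SESSION_GROUPS"
# Hypothetical helper: both arguments are space-separated group lists.
# Prints one of:
#   missing  - the user is not in the docker group at all
#   relogin  - membership exists but this session predates it
#   ok       - this session already has the docker group
docker_group_state() {
    db_groups=" $1 "
    session_groups=" $2 "
    case "$db_groups" in
        *" docker "*) ;;
        *) echo "missing"; return 0 ;;
    esac
    case "$session_groups" in
        *" docker "*) echo "ok" ;;
        *) echo "relogin" ;;
    esac
}

# Usage (run as the affected user):
#   docker_group_state "$(id -nG "$USER")" "$(id -nG)"
```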
#### Docker Service Won't Start
**Symptoms:**
- Docker daemon not running
- Socket connection errors
- systemctl shows failed status
**Diagnosis:**
```bash
# Check service status
systemctl status docker
journalctl -u docker.service -f
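# Run the daemon in the foreground with debug output
# (only while the docker service is stopped, otherwise it conflicts with the running daemon)
sudo dockerd --debug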
# Check system resources
df -h
free -h
```
**Solutions:**
```bash
# Restart Docker service
sudo systemctl restart docker
sudo systemctl enable docker
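# Clear Docker temporary data if corrupted
sudo systemctl stop docker
# Run the glob as root: /var/lib/docker is not readable by regular users,
# so an unquoted * would not expand and nothing would be deleted
sudo sh -c 'rm -rf /var/lib/docker/tmp/*'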
sudo systemctl start docker
# Reset Docker configuration
sudo systemctl stop docker
sudo mv /etc/docker/daemon.json /etc/docker/daemon.json.bak
sudo systemctl start docker
```
### 4. System Update Issues
#### Package Update Failures
**Symptoms:**
- apt update fails
- Repository errors
- Dependency conflicts
**Diagnosis:**
```bash
# Check repository status
sudo apt update
cat /etc/apt/sources.list
ls /etc/apt/sources.list.d/
# Check disk space
df -h /
df -h /var
```
**Solutions:**
```bash
# Fix broken packages
sudo apt --fix-broken install
sudo dpkg --configure -a
# Clean package cache
sudo apt clean
sudo apt autoclean
sudo apt autoremove
# Reset sources if needed
sudo cp /etc/apt/sources.list /etc/apt/sources.list.backup
# Manually edit to use main Ubuntu repositories
```
### 5. Network Configuration Problems
#### IP Configuration Issues
**Symptoms:**
- VM has wrong IP address
- No network connectivity
- DNS resolution fails
**Diagnosis:**
```bash
# Check network configuration
ip addr show
ip route show
cat /etc/netplan/*.yaml
# Test connectivity
ping 10.10.0.1 # Gateway
ping 8.8.8.8 # External DNS
nslookup google.com
```
**Solutions:**
```bash
# Fix netplan configuration
sudo nano /etc/netplan/00-installer-config.yaml
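# Example correct configuration (indentation is significant; use spaces, not tabs):
network:
  version: 2
  ethernets:
    ens18:
      dhcp4: false
      addresses: [10.10.0.200/24]
      # gateway4 still works but is deprecated in newer netplan releases
      gateway4: 10.10.0.1
      nameservers:
        addresses: [10.10.0.16, 8.8.8.8]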
# Apply configuration
sudo netplan apply
```
#### DNS Resolution Problems
**Symptoms:**
- Cannot resolve domain names
- Package installation fails
- Hostname lookups fail
**Diagnosis:**
```bash
# Check DNS configuration
cat /etc/resolv.conf
resolvectl status  # on older releases: systemd-resolve --status
# Test DNS
nslookup google.com
dig google.com
```
**Solutions:**
```bash
# Fix DNS in netplan
sudo nano /etc/netplan/00-installer-config.yaml
# Add nameservers section as shown above
# Temporary DNS fix
echo "nameserver 8.8.8.8" | sudo tee /etc/resolv.conf
# Restart networking
sudo netplan apply
sudo systemctl restart systemd-resolved
```
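On Ubuntu with systemd-resolved, `/etc/resolv.conf` often lists only the local stub address `127.0.0.53`, which can look like a misconfiguration even when DNS is healthy. In that case the real upstream servers appear in `resolvectl status`. The sketch below shows one way to distinguish the cases; `resolver_summary` is a hypothetical helper and the default path is an assumption.

```bash
# resolver_summary [FILE]
# Hypothetical helper: summarises the nameserver lines in a resolv.conf file.
# Prints one of:
#   none            - no nameserver lines at all
#   stub            - only the systemd-resolved stub (127.0.0.53)
#   direct: A B ... - explicit upstream servers
resolver_summary() {
    file=${1:-/etc/resolv.conf}
    servers=$(awk '$1 == "nameserver" { print $2 }' "$file")
    if [ -z "$servers" ]; then
        echo "none"
    elif [ "$servers" = "127.0.0.53" ]; then
        echo "stub"
    else
        # Unquoted on purpose: joins the one-per-line list with spaces
        echo "direct:" $servers
    fi
}

# Usage:
#   resolver_summary
```

A result of `stub` means the nameservers configured in netplan are the ones to inspect, while `none` suggests the file was truncated or overwritten.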
### 6. Storage and Disk Issues
#### Disk Space Problems
**Symptoms:**
- VM runs out of disk space
- Cannot install packages
- Docker images won't download
**Diagnosis:**
```bash
# Check disk usage
df -h
sudo du -sh /home/* /var/* | sort -h
# Check for large files (-xdev stays on the root filesystem)
sudo find / -xdev -type f -size +100M 2>/dev/null | head -10
```
**Solutions:**
```bash
# Clean system
sudo apt clean
sudo apt autoremove
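docker system prune -a  # Removes ALL unused images, not only dangling ones
# Extend disk in Proxmox (if needed)
# Use Proxmox web interface: VM -> Hardware -> Hard Disk -> Resize
# Extend filesystem after disk resize (confirm the device and partition with lsblk first)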
sudo growpart /dev/sda 1
sudo resize2fs /dev/sda1
```
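To spot a filling disk before it causes the symptoms above, the `df` output can be filtered by usage percentage. This is a minimal sketch: `report_full_filesystems` is a hypothetical helper, the 90% default is an arbitrary assumption, and mount points containing spaces are not handled.

```bash
# report_full_filesystems [THRESHOLD]
# Hypothetical helper: reads `df -P` output on stdin and prints every
# filesystem whose usage is at or above THRESHOLD percent (default 90).
# Returns 1 if any filesystem was reported, 0 otherwise.
report_full_filesystems() {
    threshold=${1:-90}
    awk -v limit="$threshold" '
        NR > 1 {
            use = $5
            sub(/%/, "", use)          # "95%" -> "95"
            if (use + 0 >= limit + 0) {
                printf "%s is %s%% full (mounted on %s)\n", $1, use, $6
                found = 1
            }
        }
        END { exit (found ? 1 : 0) }
    '
}

# Usage (-P forces the one-line-per-filesystem POSIX format):
#   df -P | report_full_filesystems 85
```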
## Advanced Troubleshooting
### Post-Install Script Debug Mode
```bash
# Run script with debug output
bash -x ./scripts/vm-management/vm-post-install.sh <vm-ip> <user>
# Check specific steps manually
ssh cal@<vm-ip> 'docker --version'
ssh cal@<vm-ip> 'sudo systemctl status sshd'
ssh cal@<vm-ip> 'cat ~/.ssh/authorized_keys | wc -l'
```
### Recovery Procedures
#### Emergency SSH Access
```bash
# If primary SSH key fails, use emergency key
ssh -i ~/.ssh/emergency_homelab_rsa cal@<vm-ip>
# If all SSH access fails, use Proxmox console
# VM -> Console in Proxmox web interface
# Reset SSH configuration
sudo cp /etc/ssh/sshd_config.backup /etc/ssh/sshd_config
sudo systemctl restart sshd
```
#### Complete VM Reset
```bash
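# If VM is completely broken, restore from template
# WARNING: delete permanently destroys the VM and its disks; the VM must be stopped first
pvesh create /nodes/pve/qemu/<vmid>/status/stop
pvesh delete /nodes/pve/qemu/<vmid>
pvesh create /nodes/pve/qemu/<template-id>/clone -newid <vmid> -name <new-name>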
# Or re-run cloud-init provisioning
# Delete VM and recreate with same cloud-init configuration
```
## Prevention Best Practices
### Pre-Deployment Checks
```bash
# Verify SSH keys exist
ls -la ~/.ssh/homelab_rsa*
ls -la ~/.ssh/emergency_homelab_rsa*
# Test network connectivity to target subnet
ping 10.10.0.1
# Verify Proxmox storage space
pvesh get /nodes/pve/storage
```
### Monitoring and Alerts
```bash
# Create health check script
#!/bin/bash
# vm-health-monitor.sh
for ip in 10.10.0.{200..210}; do
    # BatchMode keeps ssh from hanging on a password prompt when run from cron
    if ssh -o BatchMode=yes -o ConnectTimeout=5 "cal@$ip" 'uptime' >/dev/null 2>&1; then
        echo "✅ $ip: OK"
    else
        echo "❌ $ip: FAILED"
    fi
done
# Schedule regular checks
# Add to crontab: */15 * * * * /path/to/vm-health-monitor.sh
```
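For cron use, it helps if the health check also returns a non-zero exit status when any host fails, so a wrapper can alert on it. The variant below is a sketch under that assumption. `check_hosts` and `ssh_probe` are hypothetical names, and the probe is passed in as a parameter so the loop can be exercised without real VMs.

```bash
# check_hosts PROBE HOST...
# Hypothetical helper: runs PROBE (a command or function name) against each
# host, prints one status line per host, and returns non-zero if any failed.
check_hosts() {
    probe=$1
    shift
    failed=0
    for host in "$@"; do
        if "$probe" "$host" >/dev/null 2>&1; then
            echo "OK $host"
        else
            echo "FAILED $host"
            failed=$((failed + 1))
        fi
    done
    [ "$failed" -eq 0 ]
}

# Probe used in production: same ssh test as the script above
ssh_probe() {
    ssh -o BatchMode=yes -o ConnectTimeout=5 "cal@$1" uptime
}

# Usage:
#   check_hosts ssh_probe 10.10.0.200 10.10.0.201 || echo "at least one VM is down"
```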
## Emergency Contacts and Resources
### Documentation Links
- **Proxmox Documentation**: https://pve.proxmox.com/wiki/
- **Cloud-Init Documentation**: https://cloud-init.readthedocs.io/
- **Docker Installation Guide**: https://docs.docker.com/engine/install/ubuntu/
### Recovery Information
- **SSH Keys Location**: `/mnt/NV2/ssh-keys/backup-*/`
- **Emergency Access**: Use Proxmox console for direct VM access
- **Backup Strategy**: VM snapshots before major changes
### Quick Reference Commands
```bash
# VM Status
pvesh get /nodes/pve/qemu/<vmid>/status/current
# Start/Stop VM
pvesh create /nodes/pve/qemu/<vmid>/status/start
pvesh create /nodes/pve/qemu/<vmid>/status/stop
# SSH with different key
ssh -i ~/.ssh/emergency_homelab_rsa cal@<vm-ip>
# Docker system info
docker system info
docker system df
```