---
title: "VM Management Troubleshooting"
description: "Troubleshooting guide for Proxmox VM issues: cloud-init failures, SSH access problems, Docker installation errors, network configuration, disk space, and emergency recovery procedures."
type: troubleshooting
domain: vm-management
tags: [proxmox, vm, ssh, docker, cloud-init, networking, recovery]
---

# Virtual Machine Management Troubleshooting Guide

## VM Provisioning Issues

### Cloud-Init Configuration Problems

#### Cloud-Init Not Executing
**Symptoms**:
- VM starts but user accounts are not created
- SSH keys not deployed
- Packages not installed
- Configuration not applied

**Diagnosis**:
```bash
# Check cloud-init status and logs
ssh root@<vm-ip> 'cloud-init status --long'
ssh root@<vm-ip> 'cat /var/log/cloud-init.log'
ssh root@<vm-ip> 'cat /var/log/cloud-init-output.log'

# Verify cloud-init configuration
ssh root@<vm-ip> 'cloud-init query userdata'

# Check for YAML syntax errors
# (on newer cloud-init releases the subcommand is 'cloud-init schema')
ssh root@<vm-ip> 'cloud-init devel schema --config-file /var/lib/cloud/instance/user-data.txt'
```

**Solutions**:
```bash
# Re-run cloud-init (CAUTION: may overwrite changes)
ssh root@<vm-ip> 'cloud-init clean --logs'
ssh root@<vm-ip> 'cloud-init init --local'
ssh root@<vm-ip> 'cloud-init init'
ssh root@<vm-ip> 'cloud-init modules --mode=config'
ssh root@<vm-ip> 'cloud-init modules --mode=final'

# Manual user creation if cloud-init fails
# (useradd fails if the docker group does not exist yet)
ssh root@<vm-ip> 'useradd -m -s /bin/bash -G sudo,docker cal'
ssh root@<vm-ip> 'mkdir -p /home/cal/.ssh'
ssh root@<vm-ip> 'chown cal:cal /home/cal/.ssh'
ssh root@<vm-ip> 'chmod 700 /home/cal/.ssh'
```

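When triaging several VMs, the `status:` line printed by `cloud-init status --long` can be mapped to a next step. The sketch below is illustrative only: `cloud_init_next_step` is a hypothetical helper name, and it assumes the status values `done`, `running`, `error`, and `disabled`; anything else is reported as `unknown`.

```bash
# Hypothetical helper: reads `cloud-init status --long` output on stdin
# and prints a suggested next step.
cloud_init_next_step() {
  local status
  # "status: error" -> "error" (first status line only)
  status=$(awk -F': *' '/^status:/ {print $2; exit}')
  case "$status" in
    "done")     echo "ok" ;;            # finished; check module output if config is still missing
    "running")  echo "wait" ;;          # still executing; retry later
    "error")    echo "inspect logs" ;;  # see /var/log/cloud-init.log and cloud-init-output.log
    "disabled") echo "disabled" ;;      # cloud-init is turned off on this image
    *)          echo "unknown" ;;
  esac
}
```

Example use: `ssh root@<vm-ip> 'cloud-init status --long' | cloud_init_next_step`.
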
#### Invalid Cloud-Init YAML
**Symptoms**:
- Cloud-init fails with syntax errors
- Parser errors in cloud-init logs
- Partial configuration application

**Common YAML Issues**:
```yaml
# ❌ Incorrect indentation
users:
- name: cal
groups: [sudo, docker]  # Wrong indentation

# ✅ Correct indentation
users:
  - name: cal
    groups: [sudo, docker]  # Proper indentation

# ❌ Missing quotes for special characters
ssh_authorized_keys:
  - ssh-rsa AAAAB3NzaC1... user@host  # May fail with special chars

# ✅ Quoted strings
ssh_authorized_keys:
  - "ssh-rsa AAAAB3NzaC1... user@host"
```

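Two mistakes can be caught before a file ever reaches the VM: a missing `#cloud-config` first line (without it cloud-init does not treat the file as cloud-config) and tab characters used for indentation, which YAML forbids. The following is a minimal sketch; `check_user_data` is a hypothetical name, and it does not replace a real YAML parser or the cloud-init schema check.

```bash
# Hypothetical pre-flight check for a cloud-config user-data file.
# Returns 0 when both checks pass, 1 otherwise.
check_user_data() {
  local file=$1 rc=0
  # Line 1 must be exactly "#cloud-config"
  if [ "$(head -n 1 "$file")" != "#cloud-config" ]; then
    echo "missing '#cloud-config' header on line 1"
    rc=1
  fi
  # Flag any line whose leading whitespace contains a tab
  if grep -q $'^[ ]*\t' "$file"; then
    echo "tab character used for indentation"
    rc=1
  fi
  return "$rc"
}
```

Example use: `check_user_data cloud-init-user-data.yaml && echo "basic checks passed"`.
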
### VM Boot and Startup Issues

#### VM Won't Start
**Symptoms**:
- VM fails to boot from Proxmox
- Kernel panic messages
- Boot loop or hanging

**Diagnosis**:
```bash
# Check VM configuration
pvesh get /nodes/pve/qemu/<vmid>/config

# Check resource allocation
pvesh get /nodes/pve/qemu/<vmid>/status/current

# Review VM logs via Proxmox console
# Use Proxmox web interface -> VM -> Console

# Check Proxmox host resources
pvesh get /nodes/pve/status
```

**Solutions**:
```bash
# Increase memory allocation
pvesh set /nodes/pve/qemu/<vmid>/config -memory 4096

# Reset CPU configuration
pvesh set /nodes/pve/qemu/<vmid>/config -cpu host -cores 2

# Check and repair disk
# Stop VM, then:
pvesh get /nodes/pve/qemu/<vmid>/config | grep scsi0
# Use fsck on the disk image if needed
```

#### Resource Constraints
**Symptoms**:
- Extremely slow VM performance
- Out-of-memory kills
- Disk I/O bottlenecks

**Diagnosis**:
```bash
# Inside VM resource check
free -h
df -h
iostat 1 5   # provided by the sysstat package
vmstat 1 5

# Proxmox host resource check
pvesh get /nodes/pve/status
cat /proc/meminfo
df -h /var/lib/vz
```

**Solutions**:
```bash
# Increase VM resources via Proxmox
pvesh set /nodes/pve/qemu/<vmid>/config -memory 8192
pvesh set /nodes/pve/qemu/<vmid>/config -cores 4

# Resize VM disk
# Proxmox GUI: Hardware -> Hard Disk -> Resize
# Then extend the filesystem inside the VM
# (assumes an ext4 root filesystem on /dev/sda1):
sudo growpart /dev/sda 1
sudo resize2fs /dev/sda1
```

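To turn the `/proc/meminfo` check into a single number, the percentage of memory still available can be computed from the `MemTotal` and `MemAvailable` fields. This is a small sketch; `mem_available_pct` is a hypothetical helper name, and what counts as "too low" is left to the reader.

```bash
# Hypothetical helper: reads /proc/meminfo-formatted text on stdin and
# prints MemAvailable as a whole-number percentage of MemTotal.
mem_available_pct() {
  awk '
    /^MemTotal:/     { total = $2 }
    /^MemAvailable:/ { avail = $2 }
    END {
      if (total > 0) printf "%d\n", avail * 100 / total
      else exit 1   # no MemTotal line found
    }'
}
```

Example use inside the VM or on the Proxmox host: `mem_available_pct < /proc/meminfo`.
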
## SSH Access Issues

### SSH Connection Failures

#### Cannot Connect to VM
**Symptoms**:
- Connection timeout
- Connection refused
- Host unreachable

**Diagnosis**:
```bash
# Network connectivity tests
ping -c 3 <vm-ip>
traceroute <vm-ip>

# SSH service tests
nc -zv <vm-ip> 22
nmap -p 22 <vm-ip>

# From Proxmox console, check SSH service
# (on Debian/Ubuntu the unit is named 'ssh'; 'sshd' is usually an alias)
systemctl status sshd
ss -tlnp | grep :22
```

**Solutions**:
```bash
# Via Proxmox console - restart SSH
systemctl restart sshd
systemctl enable sshd

# Check and configure firewall
ufw status
# If blocking SSH:
ufw allow ssh
ufw allow 22/tcp

# Network configuration reset
ip addr show
dhclient  # For DHCP
systemctl restart networking  # ifupdown systems; with netplan use 'netplan apply'
```

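The three symptoms above usually point at different layers, so the error text from a failed `ssh` attempt is a reasonable first triage signal. The sketch below maps common OpenSSH client messages to a likely cause; `classify_ssh_failure` is a hypothetical helper, and the suggested causes are rules of thumb rather than a definitive diagnosis.

```bash
# Hypothetical helper: takes the ssh error message as $1 and prints
# "<category>: <likely cause>".
classify_ssh_failure() {
  case "$1" in
    *"Connection timed out"*)
      echo "timeout: VM down, wrong IP, or a firewall silently dropping packets" ;;
    *"Connection refused"*)
      echo "refused: host reachable, but nothing is listening on port 22" ;;
    *"No route to host"*|*"Network is unreachable"*)
      echo "unreachable: routing or interface problem, or a firewall rejecting traffic" ;;
    *"Permission denied"*)
      echo "auth: network path is fine; see SSH key authentication checks" ;;
    *)
      echo "other: rerun with ssh -vvv for details" ;;
  esac
}
```

Example use: `classify_ssh_failure "$(ssh -o ConnectTimeout=5 -o BatchMode=yes cal@<vm-ip> true 2>&1)"`.
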
#### SSH Key Authentication Failures
**Symptoms**:
- Password prompts despite key installation
- "Permission denied (publickey)"
- "No more authentication methods"

**Diagnosis**:
```bash
# Verbose SSH debugging
ssh -vvv cal@<vm-ip>

# Check key files locally
ls -la ~/.ssh/homelab_rsa*
ls -la ~/.ssh/emergency_homelab_rsa*

# Via console or password auth, check VM
ls -la ~/.ssh/
cat ~/.ssh/authorized_keys
```

**Solutions**:
```bash
# Fix SSH directory permissions
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys
chown -R cal:cal ~/.ssh

# Re-deploy SSH keys (overwrites the existing authorized_keys)
cat > ~/.ssh/authorized_keys << 'EOF'
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQC... # primary key
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQD... # emergency key
EOF

# Verify SSH server configuration
sudo sshd -T | grep -E "(passwordauth|pubkeyauth|permitroot)"
```

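The permission fix can be verified rather than assumed. The sketch below compares a `.ssh` directory against the 700/600 modes recommended above; `check_ssh_perms` is a hypothetical helper, it relies on GNU `stat -c` (an assumption that holds on typical Linux VMs), and sshd's own `StrictModes` check is somewhat more lenient than this.

```bash
# Hypothetical helper: check_ssh_perms <ssh-dir>
# Returns 0 if the directory is 700 and authorized_keys is 600.
check_ssh_perms() {
  local dir=$1 rc=0 mode
  mode=$(stat -c '%a' "$dir")
  if [ "$mode" != "700" ]; then
    echo "$dir has mode $mode, expected 700"
    rc=1
  fi
  if [ -f "$dir/authorized_keys" ]; then
    mode=$(stat -c '%a' "$dir/authorized_keys")
    if [ "$mode" != "600" ]; then
      echo "$dir/authorized_keys has mode $mode, expected 600"
      rc=1
    fi
  else
    echo "$dir/authorized_keys is missing"
    rc=1
  fi
  return "$rc"
}
```

Example use on the VM: `check_ssh_perms ~/.ssh || echo "fix permissions before retrying key auth"`.
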
#### SSH Security Configuration Issues
**Symptoms**:
- Password authentication still enabled
- Root login allowed
- Insecure SSH settings

**Diagnosis**:
```bash
# Check effective SSH configuration
sudo sshd -T | grep -E "(passwordauth|pubkeyauth|permitroot|allowusers)"

# Review SSH config files
cat /etc/ssh/sshd_config
ls /etc/ssh/sshd_config.d/
```

**Solutions**:
```bash
# Apply security hardening
# NOTE: sshd keeps the first value it reads for each keyword, so a
# lower-numbered drop-in (e.g. 50-cloud-init.conf) overrides this file.
# 'Protocol 2' is omitted: current OpenSSH only speaks protocol 2 and
# reports the option as deprecated.
sudo tee /etc/ssh/sshd_config.d/99-homelab-security.conf << 'EOF'
PasswordAuthentication no
PubkeyAuthentication yes
PermitRootLogin no
AllowUsers cal
ClientAliveInterval 300
ClientAliveCountMax 2
MaxAuthTries 3
X11Forwarding no
EOF

# Validate the configuration before restarting to avoid a lockout
sudo sshd -t && sudo systemctl restart sshd
```

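Because drop-in ordering can silently override a setting, it helps to confirm the effective values after a restart. The following sketch checks `sshd -T` output for the three settings the symptoms above refer to; `verify_sshd_hardening` is a hypothetical helper name, and the expected values are the ones from the hardening file.

```bash
# Hypothetical helper: reads `sudo sshd -T` output on stdin and reports
# any expected setting that is not in effect. Returns 1 on any mismatch.
verify_sshd_hardening() {
  local effective rc=0 want
  effective=$(cat)
  for want in "passwordauthentication no" \
              "pubkeyauthentication yes" \
              "permitrootlogin no"; do
    # -x: whole-line match, -i: sshd -T prints keywords in lowercase
    if ! grep -qix "$want" <<<"$effective"; then
      echo "not in effect: $want"
      rc=1
    fi
  done
  return "$rc"
}
```

Example use: `sudo sshd -T | verify_sshd_hardening && echo "hardening active"`.
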
## Docker Installation and Configuration Issues

### Docker Installation Failures

#### Package Installation Fails
**Symptoms**:
- Docker packages not found
- GPG key verification errors
- Repository access failures

**Diagnosis**:
```bash
# Test internet connectivity
ping -c 3 google.com
curl -I https://download.docker.com

# Check repository configuration
cat /etc/apt/sources.list.d/docker.list
apt-cache policy docker-ce

# Check for package conflicts
dpkg -l | grep docker
```

**Solutions**:
```bash
# Remove conflicting packages
sudo apt remove -y docker docker-engine docker.io containerd runc

# Reinstall Docker repository
sudo mkdir -p /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg

echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list

# Install Docker
sudo apt update
sudo apt install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
```

#### Docker Service Issues
**Symptoms**:
- Docker daemon won't start
- Socket connection errors
- Service failure on boot

**Diagnosis**:
```bash
# Check service status
systemctl status docker
journalctl -u docker.service -f

# Check system resources
df -h
free -h

# Test daemon manually
sudo dockerd --debug
```

**Solutions**:
```bash
# Restart Docker service
sudo systemctl stop docker
sudo systemctl start docker
sudo systemctl enable docker

# Clear corrupted Docker data
sudo systemctl stop docker
sudo rm -rf /var/lib/docker/tmp/*
sudo systemctl start docker

# Reset Docker configuration
sudo mv /etc/docker/daemon.json /etc/docker/daemon.json.bak 2>/dev/null || true
sudo systemctl restart docker
```

### Docker Permission and Access Issues

#### Permission Denied Errors
**Symptoms**:
- Must use sudo for Docker commands
- "Permission denied" when accessing Docker socket
- User not in docker group

**Diagnosis**:
```bash
# Check user groups
groups
groups cal
getent group docker

# Check Docker socket permissions
ls -la /var/run/docker.sock

# Verify Docker service is running
systemctl status docker
```

**Solutions**:
```bash
# Create docker group if missing, then add the user
sudo groupadd docker 2>/dev/null || true
sudo usermod -aG docker cal

# Apply group membership (requires logout/login or):
newgrp docker

# Fix socket permissions (root:docker with mode 660 is Docker's default)
sudo chown root:docker /var/run/docker.sock
sudo chmod 660 /var/run/docker.sock
```

## Network Configuration Problems

### IP Address and Connectivity Issues

#### Incorrect IP Configuration
**Symptoms**:
- VM has wrong IP address
- No network connectivity
- Cannot reach default gateway

**Diagnosis**:
```bash
# Check network configuration
ip addr show
ip route show
cat /etc/netplan/*.yaml

# Test connectivity
ping -c 3 $(ip route | grep default | awk '{print $3}')  # Gateway
ping -c 3 8.8.8.8  # External connectivity
```

**Solutions**:
```bash
# Fix netplan configuration
# (gateway4 still works but is deprecated in newer netplan releases,
#  which prefer a default route under 'routes:')
sudo tee /etc/netplan/00-installer-config.yaml << 'EOF'
network:
  version: 2
  ethernets:
    ens18:
      dhcp4: false
      addresses: [10.10.0.200/24]
      gateway4: 10.10.0.1
      nameservers:
        addresses: [10.10.0.16, 8.8.8.8]
EOF

# Apply network configuration
sudo netplan apply
```

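The gateway test depends on a default route actually existing; if there is none, the command substitution is empty and `ping` fails with a confusing usage error. A minimal sketch that separates the two cases follows; `default_gateway` is a hypothetical helper that parses `ip route show` output from stdin.

```bash
# Hypothetical helper: prints the gateway of the first default route
# found in `ip route show` output (stdin); prints nothing if none exists.
default_gateway() {
  awk '$1 == "default" && $2 == "via" { print $3; exit }'
}
```

Example use: `gw=$(ip route show | default_gateway); [ -n "$gw" ] && ping -c 3 "$gw" || echo "no default route configured"`.
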
#### DNS Resolution Problems
**Symptoms**:
- Cannot resolve domain names
- Package downloads fail
- Host lookup failures

**Diagnosis**:
```bash
# Check DNS configuration
cat /etc/resolv.conf
resolvectl status  # 'systemd-resolve --status' on older releases

# Test DNS resolution
nslookup google.com
dig google.com @8.8.8.8
```

**Solutions**:
```bash
# Fix DNS in netplan (see above example)
sudo netplan apply

# Temporary DNS fix
echo "nameserver 8.8.8.8" | sudo tee /etc/resolv.conf

# Restart DNS services
sudo systemctl restart systemd-resolved
sudo systemctl restart networking  # ifupdown systems only; not needed with netplan
```

## System Maintenance Issues

### Package Management Problems

#### Update Failures
**Symptoms**:
- apt update fails
- Repository signature errors
- Dependency conflicts

**Diagnosis**:
```bash
# Check repository status
sudo apt update
apt-cache policy

# Check disk space
df -h /
df -h /var

# Check for held packages
apt-mark showhold
```

**Solutions**:
```bash
# Fix broken packages
sudo apt --fix-broken install
sudo dpkg --configure -a

# Clean package cache
sudo apt clean
sudo apt autoclean
sudo apt autoremove

# Reset problematic repositories
# (apt-key is deprecated on current Debian/Ubuntu releases; where it is
#  unavailable, install the key under /etc/apt/keyrings with 'signed-by',
#  as in the Docker repository setup)
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys <KEYID>
sudo apt update
```

### Storage and Disk Space Issues

#### Disk Space Exhaustion
**Symptoms**:
- Cannot install packages
- Docker operations fail
- System becomes unresponsive

**Diagnosis**:
```bash
# Check disk usage
df -h
du -sh /home/* /var/* /opt/* 2>/dev/null

# Find large files (stay on the root filesystem, regular files only)
find / -xdev -type f -size +100M 2>/dev/null | head -20
```

**Solutions**:
```bash
# Clean system files
sudo apt clean
sudo apt autoremove
sudo journalctl --vacuum-time=7d

# Clean Docker data
# (CAUTION: removes stopped containers, all unused images, and unused volumes)
docker system prune -a -f
docker volume prune -f

# Extend disk (Proxmox GUI: Hardware -> Resize)
# Then extend filesystem:
sudo growpart /dev/sda 1
sudo resize2fs /dev/sda1
```

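Exhaustion is easier to handle before it happens, so the `df` check can be reduced to a list of filesystems above a threshold. The sketch below is one way to do that; `full_filesystems` is a hypothetical helper, the default limit of 90% is an arbitrary assumption, and it expects POSIX-format output (`df -P`) on stdin with mount points that contain no spaces.

```bash
# Hypothetical helper: full_filesystems [limit-percent]
# Reads `df -P` output on stdin and prints "<mountpoint> <use%>" for
# every filesystem at or above the limit (default 90).
full_filesystems() {
  local limit=${1:-90}
  awk -v limit="$limit" '
    NR > 1 {                 # skip the header row
      use = $5               # Capacity column, e.g. "93%"
      sub(/%/, "", use)
      if (use + 0 >= limit) print $6 " " $5
    }'
}
```

Example use: `df -P | full_filesystems 85`.
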
## Emergency Recovery Procedures

### SSH Access Recovery

#### Complete SSH Lockout
**Recovery Steps**:
1. **Use Proxmox console** for direct VM access
2. **Reset SSH configuration**:
   ```bash
   # Via console
   sudo cp /etc/ssh/sshd_config.backup /etc/ssh/sshd_config 2>/dev/null || true
   sudo systemctl restart sshd
   ```
3. **Re-enable emergency access**:
   ```bash
   # Temporary password access for recovery
   sudo passwd cal
   # Cover the hardening drop-in as well as the main config file
   sudo sed -i 's/PasswordAuthentication no/PasswordAuthentication yes/' \
     /etc/ssh/sshd_config /etc/ssh/sshd_config.d/*.conf
   sudo systemctl restart sshd
   # Switch back to key-only authentication once access is restored
   ```

#### Emergency SSH Key Deployment
**If primary keys fail**:
```bash
# Use emergency key
ssh -i ~/.ssh/emergency_homelab_rsa cal@<vm-ip>

# Or deploy keys via console
mkdir -p ~/.ssh
chmod 700 ~/.ssh
cat > ~/.ssh/authorized_keys << 'EOF'
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQC... # primary key
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQD... # emergency key
EOF
chmod 600 ~/.ssh/authorized_keys
```

### VM Recovery and Rebuild

#### Corrupt VM Recovery
**Steps**:
1. **Create snapshot** before attempting recovery
2. **Export VM data**:
   ```bash
   # Backup important data
   rsync -av cal@<vm-ip>:/home/cal/ ./vm-backup/
   ```
3. **Restore from template**:
   ```bash
   # Delete corrupt VM (it must be stopped first; this is irreversible)
   pvesh create /nodes/pve/qemu/<vmid>/status/stop
   pvesh delete /nodes/pve/qemu/<vmid>

   # Clone from template
   pvesh create /nodes/pve/qemu/<template-id>/clone -newid <vmid> -name <vm-name>
   ```

#### Post-Install Script Recovery
**If automation fails**:
```bash
# Run in debug mode
bash -x ./scripts/vm-management/vm-post-install.sh <vm-ip> <user>

# Manual step execution
ssh cal@<vm-ip> 'sudo apt update && sudo apt upgrade -y'
ssh cal@<vm-ip> 'curl -fsSL https://get.docker.com | sh'
ssh cal@<vm-ip> 'sudo usermod -aG docker cal'
```

## Prevention and Monitoring

### Pre-Deployment Validation
```bash
# Verify prerequisites
ls -la ~/.ssh/homelab_rsa*
ls -la ~/.ssh/emergency_homelab_rsa*
ping -c 3 10.10.0.1

# Test cloud-init YAML (requires the PyYAML package)
python3 -c "import yaml; yaml.safe_load(open('cloud-init-user-data.yaml'))"
```

### Health Monitoring Script
```bash
#!/bin/bash
# vm-health-check.sh
VM_IPS="10.10.0.200 10.10.0.201 10.10.0.202"

# $VM_IPS is intentionally unquoted so it splits into one IP per iteration
for ip in $VM_IPS; do
    if ssh -o ConnectTimeout=5 -o BatchMode=yes "cal@$ip" 'uptime' >/dev/null 2>&1; then
        echo "✅ $ip: SSH OK"
        # Check Docker
        if ssh -o ConnectTimeout=5 -o BatchMode=yes "cal@$ip" 'docker info >/dev/null 2>&1'; then
            echo "✅ $ip: Docker OK"
        else
            echo "❌ $ip: Docker FAILED"
        fi
    else
        echo "❌ $ip: SSH FAILED"
    fi
done
```

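If the health check runs from cron or another scheduler, a non-zero exit status is more useful than printed lines alone. The sketch below counts failing hosts; `count_failures`, `ssh_check`, and the `CHECK_CMD` hook are hypothetical names, and the hook exists only so the loop can be exercised without real VMs.

```bash
# Hypothetical variant of the loop above that counts failures.
# CHECK_CMD names the function used to probe one host (default: ssh_check).
CHECK_CMD=${CHECK_CMD:-ssh_check}

# Same SSH probe as the health check script
ssh_check() {
  ssh -o ConnectTimeout=5 -o BatchMode=yes "cal@$1" 'uptime' >/dev/null 2>&1
}

# count_failures <ip>... prints how many hosts failed the probe
count_failures() {
  local failed=0 ip
  for ip in "$@"; do
    "$CHECK_CMD" "$ip" || failed=$((failed + 1))
  done
  echo "$failed"
}
```

Example use at the end of a script: `[ "$(count_failures 10.10.0.200 10.10.0.201 10.10.0.202)" -eq 0 ] || exit 1`.
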
### Automated Backup
```bash
#!/bin/bash
# Schedule in crontab: 0 2 * * * /path/to/vm-backup.sh
# Note: ./backups is relative to the directory the script is started from
for vm_ip in 10.10.0.{200..210}; do
    if ping -c1 "$vm_ip" >/dev/null 2>&1; then
        mkdir -p "./backups/$vm_ip"
        rsync -av --exclude='.cache' "cal@$vm_ip:/home/cal/" "./backups/$vm_ip/"
    fi
done
```

## Quick Reference Commands

### Essential VM Management
```bash
# VM control via Proxmox
pvesh get /nodes/pve/qemu/<vmid>/status/current
pvesh create /nodes/pve/qemu/<vmid>/status/start
pvesh create /nodes/pve/qemu/<vmid>/status/stop

# SSH with alternative keys
ssh -i ~/.ssh/emergency_homelab_rsa cal@<vm-ip>

# System health checks
free -h && df -h && systemctl status docker
docker system info && docker system df
```

### Recovery Resources
- **SSH Keys Backup**: `/mnt/NV2/ssh-keys/backup-*/`
- **Proxmox Console**: Direct VM access when SSH fails
- **Emergency Contact**: Use Discord notifications for critical issues

This guide covers troubleshooting and recovery procedures for VM management issues in home lab environments.