---
title: "VM Troubleshooting Examples"
description: "Detailed troubleshooting examples for Proxmox VM provisioning failures, SSH connectivity issues, Docker installation problems, network/DNS configuration, disk space, and emergency recovery procedures."
type: troubleshooting
domain: vm-management
tags: [proxmox, troubleshooting, ssh, docker, cloud-init, networking, recovery]
---

# VM Management Troubleshooting Guide

Complete troubleshooting guide for Proxmox VM provisioning, SSH connectivity, Docker installation, and common configuration issues.

## Common Issues and Solutions

### 1. VM Provisioning Failures

#### Cloud-Init Not Working
**Symptoms:**
- VM starts but cloud-init configuration not applied
- User account not created
- SSH keys not installed

**Diagnosis:**
```bash
# Check cloud-init status
ssh root@<vm-ip> 'cloud-init status --long'

# View cloud-init logs
ssh root@<vm-ip> 'cat /var/log/cloud-init.log'
ssh root@<vm-ip> 'cat /var/log/cloud-init-output.log'

# Check cloud-init configuration
ssh root@<vm-ip> 'cloud-init query userdata'
```

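The status check above can be turned into a quick triage step. The following is a minimal sketch, not part of the original tooling: the function names `cloud_init_state` and `cloud_init_next_step` are hypothetical, and the suggested next steps are illustrative wording only. It assumes the `status: <state>` line format that `cloud-init status` prints.

```bash
# Extract the state word from `cloud-init status` output supplied on stdin.
# Only the line whose key is exactly "status" is considered.
cloud_init_state() {
    awk -F': *' '$1 == "status" { print $2; exit }'
}

# Map a state to a suggested next step (hypothetical helper, wording is illustrative).
cloud_init_next_step() {
    case "$1" in
        done)     echo "finished: if settings are still missing, inspect the user-data" ;;
        running)  echo "still running: wait, or use 'cloud-init status --wait'" ;;
        error)    echo "failed: read /var/log/cloud-init-output.log" ;;
        disabled) echo "disabled: confirm the VM has a cloud-init drive attached" ;;
        *)        echo "unrecognized state: '$1'" ;;
    esac
}
```

A possible invocation would be `cloud_init_next_step "$(ssh root@<vm-ip> 'cloud-init status' | cloud_init_state)"`. Reading from stdin keeps the parsing testable without a live VM.
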
**Solutions:**
```bash
# Re-run cloud-init (if safe to do so)
# 'clean' removes cached state, so the stages below run as if on first boot
ssh root@<vm-ip> 'cloud-init clean --logs'
ssh root@<vm-ip> 'cloud-init init --local'
ssh root@<vm-ip> 'cloud-init init'
ssh root@<vm-ip> 'cloud-init modules --mode=config'
ssh root@<vm-ip> 'cloud-init modules --mode=final'

# Force user creation if missing
# (useradd fails if a listed group does not exist, so create docker first)
ssh root@<vm-ip> 'groupadd -f docker && useradd -m -s /bin/bash -G sudo,docker cal'

# Fix YAML syntax in cloud-init if needed
# Common issues: incorrect indentation, missing quotes
```

#### VM Won't Start
**Symptoms:**
- VM fails to boot
- Kernel panic or boot errors
- Hangs during startup

**Diagnosis:**
```bash
# Check VM configuration in Proxmox
pvesh get /nodes/pve/qemu/<vmid>/config

# View console output
# Use Proxmox web interface Console tab

# Check VM resource allocation
pvesh get /nodes/pve/qemu/<vmid>/status/current
```

**Solutions:**
```bash
# Increase memory if low
pvesh set /nodes/pve/qemu/<vmid>/config -memory 2048

# Check disk space and format
pvesh get /nodes/pve/storage

# Reset to safe configuration
pvesh set /nodes/pve/qemu/<vmid>/config -cpu host -cores 2
```

### 2. SSH Connection Issues

#### Cannot Connect to VM
**Symptoms:**
- Connection timeout
- Connection refused
- Host unreachable

**Diagnosis:**
```bash
# Test network connectivity
ping -c 4 <vm-ip>

# Check SSH port
nc -zv <vm-ip> 22
nmap -p 22 <vm-ip>

# Check from Proxmox console
# Use Proxmox web interface -> VM -> Console
systemctl status ssh     # unit is "ssh" on Ubuntu/Debian, "sshd" on RHEL-family systems
ss -tlnp | grep ':22'    # netstat works too, but needs the net-tools package
```

**Solutions:**
```bash
# Start SSH service (via console)
systemctl start ssh
systemctl enable ssh

# Check firewall (via console)
ufw status
# If active and blocking SSH:
ufw allow ssh

# Reset network configuration
ip addr show
dhclient        # If using DHCP
netplan apply   # On ifupdown-based systems: systemctl restart networking
```

#### SSH Key Authentication Fails
**Symptoms:**
- Password prompts despite keys being installed
- Permission denied (publickey)
- "No more authentication methods to try"

**Diagnosis:**
```bash
# Verbose SSH connection
ssh -vvv cal@<vm-ip>

# Check authorized_keys file (via console or password auth)
ls -la ~/.ssh/
cat ~/.ssh/authorized_keys
```

**Solutions:**
```bash
# Fix file permissions
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys

# Verify key content
wc -l < ~/.ssh/authorized_keys  # Should print 2 (one line per key)

# Re-deploy keys manually
# (the .pub files live on the workstation: copy them to the VM first, or paste them via the console)
cat ~/.ssh/homelab_rsa.pub >> ~/.ssh/authorized_keys
cat ~/.ssh/emergency_homelab_rsa.pub >> ~/.ssh/authorized_keys

# Check SSH configuration
sudo grep -E "(PubkeyAuth|PasswordAuth)" /etc/ssh/sshd_config
sudo systemctl restart ssh
```

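The permission fixes above can be checked in one pass before retrying the connection. This is a rough sketch under stated assumptions: `check_ssh_perms` is a hypothetical helper name, it relies on GNU `stat` (as shipped on Ubuntu), and it enforces the exact 700/600 modes this guide recommends, which is stricter than what sshd itself requires.

```bash
# Report permission problems under an SSH directory (default: ~/.ssh).
# Requires GNU stat. Returns 1 if anything differs from the 700/600 modes used in this guide.
check_ssh_perms() {
    local dir="${1:-$HOME/.ssh}"
    local keys="$dir/authorized_keys"
    local mode rc=0

    if [ ! -d "$dir" ]; then
        echo "missing directory: $dir"
        return 1
    fi

    mode="$(stat -c '%a' "$dir")"
    if [ "$mode" != "700" ]; then
        echo "$dir has mode $mode, expected 700"
        rc=1
    fi

    if [ ! -f "$keys" ]; then
        echo "missing file: $keys"
        rc=1
    else
        mode="$(stat -c '%a' "$keys")"
        if [ "$mode" != "600" ]; then
            echo "$keys has mode $mode, expected 600"
            rc=1
        fi
    fi

    return "$rc"
}
```

Run on the VM as the affected user, `check_ssh_perms` with no arguments inspects `~/.ssh`; a silent exit with status 0 means the modes match.
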
#### SSH Configuration Problems
**Symptoms:**
- SSH works but with wrong settings
- Root access when it should be disabled
- Password authentication enabled

**Diagnosis:**
```bash
# Check effective SSH configuration
sudo sshd -T | grep -E "(passwordauth|pubkeyauth|permitroot)"

# View SSH configuration files
cat /etc/ssh/sshd_config
ls /etc/ssh/sshd_config.d/
```

**Solutions:**
```bash
# Apply security hardening manually
# (sshd keeps the first value it reads for each keyword, so a lower-numbered
#  file in sshd_config.d can override this one; confirm the result with sshd -T)
sudo tee /etc/ssh/sshd_config.d/99-homelab-security.conf << 'EOF'
PasswordAuthentication no
PubkeyAuthentication yes
PermitRootLogin no
AllowUsers cal
ClientAliveInterval 300
ClientAliveCountMax 2
MaxAuthTries 3
EOF

# Validate the configuration before restarting, to avoid a lockout
sudo sshd -t && sudo systemctl restart ssh
```

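To confirm that the hardening values above are the ones sshd actually uses, the effective configuration can be compared against them. The following is a sketch only: `audit_sshd_settings` is a hypothetical helper, and it assumes the lowercase `keyword value` format that `sshd -T` prints.

```bash
# Compare `sshd -T` output (stdin) against the three hardening values from this guide.
# Prints each mismatch or missing keyword; exit status is non-zero if any was found.
audit_sshd_settings() {
    awk '
        BEGIN {
            want["passwordauthentication"] = "no"
            want["pubkeyauthentication"]   = "yes"
            want["permitrootlogin"]        = "no"
        }
        ($1 in want) {
            seen[$1] = 1
            if ($2 != want[$1]) {
                printf "%s is %s, expected %s\n", $1, $2, want[$1]
                bad = 1
            }
        }
        END {
            for (k in want) {
                if (!(k in seen)) {
                    printf "%s not reported\n", k
                    bad = 1
                }
            }
            exit bad + 0
        }
    '
}
```

On the VM this would be run as `sudo sshd -T | audit_sshd_settings`. Any printed line points at a setting that another configuration file is probably overriding.
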
### 3. Docker Installation Issues

#### Docker Installation Fails
**Symptoms:**
- Docker packages not found
- GPG key verification fails
- Permission denied errors

**Diagnosis:**
```bash
# Check internet connectivity
ping -c 4 google.com
curl -I https://download.docker.com

# Check repository configuration
cat /etc/apt/sources.list.d/docker.list
apt-cache policy docker-ce

# Check for conflicting packages
dpkg -l | grep docker
```

**Solutions:**
```bash
# Remove conflicting packages
sudo apt remove -y docker docker-engine docker.io containerd runc

# Re-add Docker repository
sudo mkdir -p /etc/apt/keyrings
# --yes overwrites an existing keyring instead of failing on the prompt
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --yes --dearmor -o /etc/apt/keyrings/docker.gpg
sudo chmod a+r /etc/apt/keyrings/docker.gpg  # apt must be able to read the key
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list

# Install Docker
sudo apt update
sudo apt install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
```

#### Docker Permission Issues
**Symptoms:**
- "Permission denied" when running docker commands
- Must use sudo for docker commands
- User not in docker group

**Diagnosis:**
```bash
# Check user groups
groups
groups cal

# Check docker group exists
getent group docker

# Check docker service
systemctl status docker
```

**Solutions:**
```bash
# Create docker group if missing (-f: no error if it already exists)
sudo groupadd -f docker

# Add user to docker group
sudo usermod -aG docker cal

# Apply group membership (log out and back in, or)
newgrp docker

# Fix socket permissions (660 is the default: root and the docker group only)
sudo chown root:docker /var/run/docker.sock
sudo chmod 660 /var/run/docker.sock
```

#### Docker Service Won't Start
**Symptoms:**
- Docker daemon not running
- Socket connection errors
- systemctl shows failed status

**Diagnosis:**
```bash
# Check service status
systemctl status docker
journalctl -u docker.service -f   # follows the log; Ctrl+C to stop

# Check daemon logs
# (runs the daemon in the foreground; stop the docker service first, Ctrl+C to exit)
sudo dockerd --debug

# Check system resources
df -h
free -h
```

**Solutions:**
```bash
# Restart Docker service
sudo systemctl restart docker
sudo systemctl enable docker

# Clear Docker data if corrupted
sudo systemctl stop docker
# The glob must expand as root, because /var/lib/docker is not readable by normal users
sudo sh -c 'rm -rf /var/lib/docker/tmp/*'
sudo systemctl start docker

# Reset Docker configuration
sudo systemctl stop docker
sudo mv /etc/docker/daemon.json /etc/docker/daemon.json.bak
sudo systemctl start docker
```

### 4. System Update Issues

#### Package Update Failures
**Symptoms:**
- apt update fails
- Repository errors
- Dependency conflicts

**Diagnosis:**
```bash
# Check repository status
sudo apt update
cat /etc/apt/sources.list
ls /etc/apt/sources.list.d/

# Check disk space
df -h /
df -h /var
```

**Solutions:**
```bash
# Fix broken packages
sudo apt --fix-broken install
sudo dpkg --configure -a

# Clean package cache
sudo apt clean
sudo apt autoclean
sudo apt autoremove

# Reset sources if needed
sudo cp /etc/apt/sources.list /etc/apt/sources.list.backup
# Manually edit to use main Ubuntu repositories
```

### 5. Network Configuration Problems

#### IP Configuration Issues
**Symptoms:**
- VM has wrong IP address
- No network connectivity
- DNS resolution fails

**Diagnosis:**
```bash
# Check network configuration
ip addr show
ip route show
cat /etc/netplan/*.yaml

# Test connectivity
ping -c 4 10.10.0.1  # Gateway
ping -c 4 8.8.8.8    # External DNS
nslookup google.com
```

**Solutions:**
```bash
# Fix netplan configuration
sudo nano /etc/netplan/00-installer-config.yaml

# Example correct configuration (YAML; indentation is significant):
network:
  version: 2
  ethernets:
    ens18:
      dhcp4: false
      addresses: [10.10.0.200/24]
      gateway4: 10.10.0.1  # deprecated in newer netplan releases in favor of a default route under "routes:"
      nameservers:
        addresses: [10.10.0.16, 8.8.8.8]

# Apply configuration
# (sudo netplan try applies the change and rolls back unless confirmed)
sudo netplan apply
```

#### DNS Resolution Problems
**Symptoms:**
- Cannot resolve domain names
- Package installation fails
- Hostname lookups fail

**Diagnosis:**
```bash
# Check DNS configuration
cat /etc/resolv.conf
resolvectl status   # replaces systemd-resolve --status on current systemd releases

# Test DNS
nslookup google.com
dig google.com
```

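When reading `/etc/resolv.conf`, it helps to know whether the listed server is a real upstream resolver or the local systemd-resolved stub at `127.0.0.53`. A small sketch follows; the helper names `list_nameservers` and `uses_resolved_stub` are hypothetical, and both read resolv.conf-style text from stdin.

```bash
# Print each nameserver address found in resolv.conf-style input (stdin).
list_nameservers() {
    awk '$1 == "nameserver" { print $2 }'
}

# Succeed if the input points at the systemd-resolved stub listener (127.0.0.53).
uses_resolved_stub() {
    list_nameservers | grep -qx '127\.0\.0\.53'
}
```

For example, `uses_resolved_stub < /etc/resolv.conf && resolvectl status` would show the real upstream servers only when the stub is in use. In that case the nameservers come from the netplan configuration rather than from `/etc/resolv.conf` itself.
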
**Solutions:**
```bash
# Fix DNS in netplan
sudo nano /etc/netplan/00-installer-config.yaml
# Add nameservers section as shown above

# Temporary DNS fix
# (with systemd-resolved, /etc/resolv.conf is a managed symlink and this change may be overwritten)
echo "nameserver 8.8.8.8" | sudo tee /etc/resolv.conf

# Restart networking
sudo netplan apply
sudo systemctl restart systemd-resolved
```

### 6. Storage and Disk Issues

#### Disk Space Problems
**Symptoms:**
- VM runs out of disk space
- Cannot install packages
- Docker images won't download

**Diagnosis:**
```bash
# Check disk usage
df -h
sudo du -sh /home/*
sudo du -sh /var/*

# Check for large files (-xdev stays on the root filesystem)
sudo find / -xdev -type f -size +100M 2>/dev/null | head -10
```

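The disk usage check above can be reduced to "which filesystems are nearly full". This is a minimal sketch, assuming POSIX-format `df -P` output and mount points without spaces; the function name `fs_over_threshold` and the default threshold of 90 are illustrative choices, not values from this guide.

```bash
# List mount points whose usage meets or exceeds a percentage threshold.
# Reads `df -P` output on stdin so the parsing can be exercised offline.
fs_over_threshold() {
    local threshold="${1:-90}"
    awk -v limit="$threshold" '
        NR > 1 {                      # skip the header row
            use = $5                  # Capacity column, e.g. "95%"
            sub(/%$/, "", use)
            if (use + 0 >= limit + 0) printf "%s %s%%\n", $6, use
        }
    '
}
```

A typical call would be `df -P | fs_over_threshold 80`, which prints one `mountpoint usage%` line per filesystem at or above 80 percent and nothing when all are below it.
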
**Solutions:**
```bash
# Clean system
sudo apt clean
sudo apt autoremove
docker system prune -a   # removes ALL unused images, not only dangling ones

# Extend disk in Proxmox (if needed)
# Use Proxmox web interface: VM -> Hardware -> Hard Disk -> Resize

# Extend filesystem after disk resize
# (confirm the device and partition with lsblk first; resize2fs applies to ext2/3/4)
sudo growpart /dev/sda 1
sudo resize2fs /dev/sda1
```

## Advanced Troubleshooting

### Post-Install Script Debug Mode
```bash
# Run script with debug output
bash -x ./scripts/vm-management/vm-post-install.sh <vm-ip> <user>

# Check specific steps manually
ssh cal@<vm-ip> 'docker --version'
ssh cal@<vm-ip> 'sudo systemctl status ssh'
ssh cal@<vm-ip> 'wc -l < ~/.ssh/authorized_keys'
```

### Recovery Procedures

#### Emergency SSH Access
```bash
# If primary SSH key fails, use emergency key
ssh -i ~/.ssh/emergency_homelab_rsa cal@<vm-ip>

# If all SSH access fails, use Proxmox console
# VM -> Console in Proxmox web interface

# Reset SSH configuration
# (assumes a backup was saved earlier as /etc/ssh/sshd_config.backup)
sudo cp /etc/ssh/sshd_config.backup /etc/ssh/sshd_config
sudo systemctl restart ssh
```

#### Complete VM Reset
```bash
# If VM is completely broken, restore from template
# (a running VM cannot be deleted, so stop it first)
pvesh create /nodes/pve/qemu/<vmid>/status/stop
pvesh delete /nodes/pve/qemu/<vmid>
pvesh create /nodes/pve/qemu/<template-id>/clone -newid <vmid> -name <new-name>

# Or re-run cloud-init provisioning
# Delete VM and recreate with same cloud-init configuration
```

## Prevention Best Practices

### Pre-Deployment Checks
```bash
# Verify SSH keys exist
ls -la ~/.ssh/homelab_rsa*
ls -la ~/.ssh/emergency_homelab_rsa*

# Test network connectivity to target subnet
ping -c 4 10.10.0.1

# Verify Proxmox storage space
pvesh get /nodes/pve/storage
```

### Monitoring and Alerts
```bash
# Create health check script (the shebang must be the first line of the saved file)
#!/bin/bash
# vm-health-monitor.sh
for ip in 10.10.0.{200..210}; do
    # BatchMode=yes makes ssh fail instead of prompting, so cron runs never hang
    if ssh -o BatchMode=yes -o ConnectTimeout=5 "cal@$ip" 'uptime' >/dev/null 2>&1; then
        echo "✅ $ip: OK"
    else
        echo "❌ $ip: FAILED"
    fi
done

# Schedule regular checks
# Add to crontab: */15 * * * * /path/to/vm-health-monitor.sh
```

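For alerting, it can be useful for the health check to return a failing exit status when any host is down, so a wrapper can react to it. The sketch below is one possible restructuring, not the original script: `probe_host` and `check_hosts` are hypothetical names, and the probe is passed in by name so that it can be swapped or stubbed.

```bash
# Probe one host over SSH; replace this function to change what "healthy" means.
probe_host() {
    ssh -o BatchMode=yes -o ConnectTimeout=5 "cal@$1" 'uptime' >/dev/null 2>&1
}

# Run a probe command against each host, print one line per host plus a summary,
# and return non-zero if at least one probe failed.
check_hosts() {
    local probe="$1"
    shift
    local host failed=0
    for host in "$@"; do
        if "$probe" "$host"; then
            echo "OK     $host"
        else
            echo "FAILED $host"
            failed=$((failed + 1))
        fi
    done
    echo "$failed of $# hosts failed"
    [ "$failed" -eq 0 ]
}
```

In a bash script this might be called as `check_hosts probe_host 10.10.0.{200..210} || exit 1`. What happens on failure (mail, webhook, log entry) is left open here, since the original does not specify an alerting mechanism.
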
## Emergency Contacts and Resources

### Documentation Links
- **Proxmox Documentation**: https://pve.proxmox.com/wiki/
- **Cloud-Init Documentation**: https://cloud-init.readthedocs.io/
- **Docker Installation Guide**: https://docs.docker.com/engine/install/ubuntu/

### Recovery Information
- **SSH Keys Location**: `/mnt/NV2/ssh-keys/backup-*/`
- **Emergency Access**: Use Proxmox console for direct VM access
- **Backup Strategy**: VM snapshots before major changes

### Quick Reference Commands
```bash
# VM Status
pvesh get /nodes/pve/qemu/<vmid>/status/current

# Start/Stop VM
pvesh create /nodes/pve/qemu/<vmid>/status/start
pvesh create /nodes/pve/qemu/<vmid>/status/stop

# SSH with different key
ssh -i ~/.ssh/emergency_homelab_rsa cal@<vm-ip>

# Docker system info
docker system info
docker system df
```