# VM Management Troubleshooting Guide

Complete troubleshooting guide for Proxmox VM provisioning, SSH connectivity, Docker installation, and common configuration issues.

## Common Issues and Solutions

### 1. VM Provisioning Failures

#### Cloud-Init Not Working
**Symptoms:**
- VM starts, but the cloud-init configuration is not applied
- User account is not created
- SSH keys are not installed

**Diagnosis:**
```bash
# Check cloud-init status
ssh root@<vm-ip> 'cloud-init status --long'

# View cloud-init logs
ssh root@<vm-ip> 'cat /var/log/cloud-init.log'
ssh root@<vm-ip> 'cat /var/log/cloud-init-output.log'

# Check cloud-init configuration
ssh root@<vm-ip> 'cloud-init query userdata'
```

**Solutions:**
```bash
# Re-run cloud-init (if safe to do so)
ssh root@<vm-ip> 'cloud-init clean --logs'
ssh root@<vm-ip> 'cloud-init init --local'
ssh root@<vm-ip> 'cloud-init init'
ssh root@<vm-ip> 'cloud-init modules --mode=config'
ssh root@<vm-ip> 'cloud-init modules --mode=final'

# Force user creation if missing
# (useradd fails if the docker group does not exist yet; drop it from -G in that case)
ssh root@<vm-ip> 'useradd -m -s /bin/bash -G sudo,docker cal'

# Fix YAML syntax in the cloud-init user-data if needed
# Common issues: incorrect indentation, missing quotes
```
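
The YAML problems mentioned above can often be caught before a VM is booted. The following is a minimal sketch, assuming the user-data is kept in a local file; the helper name `check_user_data` is hypothetical and not part of cloud-init or this repository. It only checks for the `#cloud-config` header line and for tab indentation, which YAML does not allow.

```bash
# check_user_data FILE: basic sanity checks on a cloud-init user-data file.
# Hypothetical helper for illustration; it does not replace full validation.
check_user_data() {
    file="$1"
    [ -r "$file" ] || { echo "cannot read $file" >&2; return 2; }

    # cloud-config user-data must start with this exact header line
    if [ "$(head -n 1 "$file")" != "#cloud-config" ]; then
        echo "$file: first line is not '#cloud-config'" >&2
        return 1
    fi

    # YAML forbids tab characters in indentation
    if grep -q "$(printf '^ *\t')" "$file"; then
        echo "$file: tab character used for indentation" >&2
        return 1
    fi
    return 0
}
```

A call such as `check_user_data user-data.yaml` returns non-zero when either check fails. On machines with a recent cloud-init release installed, `cloud-init schema --config-file user-data.yaml` performs a fuller schema validation.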

#### VM Won't Start
**Symptoms:**
- VM fails to boot
- Kernel panic or boot errors
- Hangs during startup

**Diagnosis:**
```bash
# Check VM configuration in Proxmox
pvesh get /nodes/pve/qemu/<vmid>/config

# View console output
# Use Proxmox web interface Console tab

# Check VM resource allocation
pvesh get /nodes/pve/qemu/<vmid>/status/current
```

**Solutions:**
```bash
# Increase memory if low
pvesh set /nodes/pve/qemu/<vmid>/config -memory 2048

# Check disk space and format
pvesh get /nodes/pve/storage

# Reset to safe configuration
pvesh set /nodes/pve/qemu/<vmid>/config -cpu host -cores 2
```

### 2. SSH Connection Issues

#### Cannot Connect to VM
**Symptoms:**
- Connection timeout
- Connection refused
- Host unreachable

**Diagnosis:**
```bash
# Test network connectivity
ping -c 3 <vm-ip>

# Check SSH port
nc -zv <vm-ip> 22
nmap -p 22 <vm-ip>

# Check from inside the VM
# Use Proxmox web interface -> VM -> Console
# (the unit is named ssh on Ubuntu/Debian and sshd on RHEL-family systems)
systemctl status ssh
ss -tlnp | grep :22
```

**Solutions:**
```bash
# Start SSH service (via console)
systemctl start ssh
systemctl enable ssh

# Check firewall (via console)
ufw status
# If active and blocking SSH:
ufw allow ssh

# Reset network configuration
ip addr show
dhclient # If using DHCP
netplan apply # On ifupdown-based systems use: systemctl restart networking
```
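
After restarting the SSH service or the network from the console, the VM can take a few seconds to become reachable again. The following is a minimal sketch of a generic retry loop for that situation; the function name `retry` and its argument order are assumptions made for illustration.

```bash
# retry ATTEMPTS DELAY CMD...: run CMD until it succeeds or ATTEMPTS runs out.
# Hypothetical helper; DELAY is the pause in seconds between attempts.
retry() {
    attempts="$1"; delay="$2"; shift 2
    n=1
    while [ "$n" -le "$attempts" ]; do
        # Stop as soon as the command succeeds
        "$@" && return 0
        # Sleep only if another attempt will follow
        [ "$n" -lt "$attempts" ] && sleep "$delay"
        n=$((n + 1))
    done
    return 1
}
```

For example, `retry 10 3 nc -z <vm-ip> 22` probes the SSH port up to ten times, three seconds apart, and returns non-zero if it never opens.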

#### SSH Key Authentication Fails
**Symptoms:**
- Password prompts despite keys being installed
- Permission denied (publickey)
- "No more authentication methods to try"

**Diagnosis:**
```bash
# Verbose SSH connection
ssh -vvv cal@<vm-ip>

# Check authorized_keys file (via console or password auth)
ls -la ~/.ssh/
cat ~/.ssh/authorized_keys
```

**Solutions:**
```bash
# Fix file permissions
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys

# Verify key content
wc -l < ~/.ssh/authorized_keys # Should show 2 keys

# Re-deploy keys manually
# (run on the VM; assumes the .pub files have been copied there first)
cat ~/.ssh/homelab_rsa.pub >> ~/.ssh/authorized_keys
cat ~/.ssh/emergency_homelab_rsa.pub >> ~/.ssh/authorized_keys

# Check SSH configuration
sudo grep -E "(PubkeyAuth|PasswordAuth)" /etc/ssh/sshd_config
sudo systemctl restart ssh # unit is named sshd on RHEL-family systems
```
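
The permission fix above can be turned into a repeatable check. This is a minimal sketch, assuming GNU `stat` (as shipped with Ubuntu); the function name `check_ssh_perms` is hypothetical. It compares the directory and `authorized_keys` modes against the 700 and 600 values used in this guide.

```bash
# check_ssh_perms [DIR]: verify the modes recommended in this guide.
# Hypothetical helper; DIR defaults to the current user's ~/.ssh.
check_ssh_perms() {
    dir="${1:-$HOME/.ssh}"
    rc=0
    # stat -c %a prints the octal permission bits (GNU coreutils)
    if [ "$(stat -c %a "$dir")" != "700" ]; then
        echo "$dir should have mode 700" >&2
        rc=1
    fi
    if [ "$(stat -c %a "$dir/authorized_keys")" != "600" ]; then
        echo "$dir/authorized_keys should have mode 600" >&2
        rc=1
    fi
    return "$rc"
}
```

Running `check_ssh_perms` on the VM as the affected user prints one line per mismatch and returns non-zero if anything needs fixing.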

#### SSH Configuration Problems
**Symptoms:**
- SSH works but with wrong settings
- Root access when it should be disabled
- Password authentication enabled

**Diagnosis:**
```bash
# Check effective SSH configuration
sudo sshd -T | grep -E "(passwordauth|pubkeyauth|permitroot)"

# View SSH configuration files
cat /etc/ssh/sshd_config
ls /etc/ssh/sshd_config.d/
```

**Solutions:**
```bash
# Apply security hardening manually
# Note: sshd uses the first value it reads for each keyword, so an earlier
# drop-in (for example one written by cloud-init) can override this file.
# Confirm the effective settings with the sshd -T command shown above.
sudo tee /etc/ssh/sshd_config.d/99-homelab-security.conf << 'EOF'
PasswordAuthentication no
PubkeyAuthentication yes
PermitRootLogin no
AllowUsers cal
ClientAliveInterval 300
ClientAliveCountMax 2
MaxAuthTries 3
EOF

# Validate the configuration before restarting, to avoid a lockout
sudo sshd -t
sudo systemctl restart ssh # unit is named sshd on RHEL-family systems
```
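
Because several files can contribute to the effective configuration, it helps to verify the result rather than the file that was written. The following is a minimal sketch that reads `sshd -T` output on standard input and checks three of the hardening settings above; the function name `verify_sshd_hardening` is hypothetical.

```bash
# verify_sshd_hardening: read `sshd -T` output on stdin and compare it
# with the expected hardening values. Hypothetical helper for illustration.
verify_sshd_hardening() {
    config="$(cat)"
    rc=0
    for want in "passwordauthentication no" \
                "pubkeyauthentication yes" \
                "permitrootlogin no"; do
        # -x matches the whole line, -i ignores case
        if ! printf '%s\n' "$config" | grep -qix "$want"; then
            echo "MISMATCH: expected '$want'" >&2
            rc=1
        fi
    done
    return "$rc"
}
```

Typical use on the VM would be `sudo sshd -T | verify_sshd_hardening`, which prints one line per setting that differs and returns non-zero in that case.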

### 3. Docker Installation Issues

#### Docker Installation Fails
**Symptoms:**
- Docker packages not found
- GPG key verification fails
- Permission denied errors

**Diagnosis:**
```bash
# Check internet connectivity
ping -c 3 google.com
curl -I https://download.docker.com

# Check repository configuration
cat /etc/apt/sources.list.d/docker.list
apt-cache policy docker-ce

# Check for conflicting packages
dpkg -l | grep docker
```

**Solutions:**
```bash
# Remove conflicting packages
sudo apt remove -y docker docker-engine docker.io containerd runc

# Re-add Docker repository
sudo mkdir -p /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
sudo chmod a+r /etc/apt/keyrings/docker.gpg # apt must be able to read the key
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list

# Install Docker
sudo apt update
sudo apt install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
```

#### Docker Permission Issues
**Symptoms:**
- "Permission denied" when running docker commands
- Must use sudo for docker commands
- User not in docker group

**Diagnosis:**
```bash
# Check user groups
groups
groups cal

# Check docker group exists
getent group docker

# Check docker service
systemctl status docker
```

**Solutions:**
```bash
# Add user to docker group
sudo usermod -aG docker cal

# Create docker group if missing
sudo groupadd docker
sudo usermod -aG docker cal

# Apply group membership (log out and back in, or start a new group shell)
newgrp docker

# Fix socket permissions (the default is root:docker with mode 660)
sudo chown root:docker /var/run/docker.sock
sudo chmod 660 /var/run/docker.sock
```
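
A common source of confusion is that `usermod` updates the group database immediately, while an already running login session keeps its old group list. The following is a minimal sketch for telling the two cases apart; the helper name `has_group` is hypothetical.

```bash
# has_group "LIST" GROUP: succeed if GROUP is one of the space-separated
# names in LIST. Hypothetical helper for illustration.
has_group() {
    for g in $1; do
        [ "$g" = "$2" ] && return 0
    done
    return 1
}

# Example use (left commented out so that sourcing this file has no side effects):
#   has_group "$(id -nG cal)" docker || echo "cal is not in the docker group"
#   has_group "$(id -nG)" docker     || echo "this session predates the group change"
```

`id -nG cal` reads the group database, while `id -nG` without a user name reports the groups of the current session, so a difference between the two suggests that a fresh login is still needed.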

#### Docker Service Won't Start
**Symptoms:**
- Docker daemon not running
- Socket connection errors
- systemctl shows failed status

**Diagnosis:**
```bash
# Check service status
systemctl status docker
journalctl -u docker.service -n 100 --no-pager # add -f to follow live output

# Check daemon logs
# (runs the daemon in the foreground; stop the service first, Ctrl+C to exit)
sudo dockerd --debug

# Check system resources
df -h
free -h
```

**Solutions:**
```bash
# Restart Docker service
sudo systemctl restart docker
sudo systemctl enable docker

# Clear Docker data if corrupted
sudo systemctl stop docker
# The glob must be expanded by a root shell, because /var/lib/docker is not
# readable by regular users
sudo sh -c 'rm -rf /var/lib/docker/tmp/*'
sudo systemctl start docker

# Reset Docker configuration
sudo systemctl stop docker
sudo mv /etc/docker/daemon.json /etc/docker/daemon.json.bak
sudo systemctl start docker
```

### 4. System Update Issues

#### Package Update Failures
**Symptoms:**
- apt update fails
- Repository errors
- Dependency conflicts

**Diagnosis:**
```bash
# Check repository status
sudo apt update
cat /etc/apt/sources.list
ls /etc/apt/sources.list.d/

# Check disk space
df -h /
df -h /var
```

**Solutions:**
```bash
# Fix broken packages
sudo apt --fix-broken install
sudo dpkg --configure -a

# Clean package cache
sudo apt clean
sudo apt autoclean
sudo apt autoremove

# Reset sources if needed
sudo cp /etc/apt/sources.list /etc/apt/sources.list.backup
# Manually edit to use main Ubuntu repositories
```

### 5. Network Configuration Problems

#### IP Configuration Issues
**Symptoms:**
- VM has wrong IP address
- No network connectivity
- DNS resolution fails

**Diagnosis:**
```bash
# Check network configuration
ip addr show
ip route show
sudo cat /etc/netplan/*.yaml

# Test connectivity
ping -c 3 10.10.0.1 # Gateway
ping -c 3 8.8.8.8 # External DNS
nslookup google.com
```

**Solutions:**
```bash
# Fix netplan configuration
sudo nano /etc/netplan/00-installer-config.yaml

# Example correct configuration (YAML; indentation is significant).
# A default route replaces the deprecated gateway4 key.
network:
  version: 2
  ethernets:
    ens18:
      dhcp4: false
      addresses: [10.10.0.200/24]
      routes:
        - to: default
          via: 10.10.0.1
      nameservers:
        addresses: [10.10.0.16, 8.8.8.8]

# Apply configuration
sudo netplan apply
```

#### DNS Resolution Problems
**Symptoms:**
- Cannot resolve domain names
- Package installation fails
- Hostname lookups fail

**Diagnosis:**
```bash
# Check DNS configuration
cat /etc/resolv.conf
resolvectl status # replaces the older systemd-resolve --status

# Test DNS
nslookup google.com
dig google.com
```

**Solutions:**
```bash
# Fix DNS in netplan
sudo nano /etc/netplan/00-installer-config.yaml
# Add nameservers section as shown above

# Temporary DNS fix (systemd-resolved may overwrite this file again)
echo "nameserver 8.8.8.8" | sudo tee /etc/resolv.conf

# Restart networking
sudo netplan apply
sudo systemctl restart systemd-resolved
```

### 6. Storage and Disk Issues

#### Disk Space Problems
**Symptoms:**
- VM runs out of disk space
- Cannot install packages
- Docker images won't download

**Diagnosis:**
```bash
# Check disk usage
df -h
sudo du -sh /home/*
sudo du -sh /var/*

# Check for large files (stay on the root filesystem)
sudo find / -xdev -type f -size +100M 2>/dev/null | head -10
```

**Solutions:**
```bash
# Clean system
sudo apt clean
sudo apt autoremove
docker system prune -a # removes all unused images, not only dangling ones

# Extend disk in Proxmox (if needed)
# Use Proxmox web interface: VM -> Hardware -> Hard Disk -> Resize

# Extend filesystem after disk resize
# (assumes an ext4 root filesystem on /dev/sda1; confirm with lsblk first)
sudo growpart /dev/sda 1
sudo resize2fs /dev/sda1
```
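
Disk space problems are easier to handle when they are noticed early. The following is a minimal sketch that reads `df -P` output on standard input and reports mount points at or above a usage limit; the function name `over_threshold` is hypothetical, and mount points containing spaces are not handled.

```bash
# over_threshold LIMIT: read `df -P` output on stdin, print each mount point
# whose usage is at or above LIMIT percent, and return 1 if any was found.
# Hypothetical helper for illustration.
over_threshold() {
    awk -v limit="$1" '
        NR > 1 {
            # Column 5 is the capacity (for example "85%"), column 6 the mount point
            use = $5
            sub(/%/, "", use)
            if (use + 0 >= limit + 0) {
                print $6 " " use "%"
                found = 1
            }
        }
        END { exit found + 0 }
    '
}
```

For example, `df -P | over_threshold 90` prints nothing and returns 0 while every filesystem is below 90 percent.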

## Advanced Troubleshooting

### Post-Install Script Debug Mode
```bash
# Run script with debug output
bash -x ./scripts/vm-management/vm-post-install.sh <vm-ip> <user>

# Check specific steps manually
ssh cal@<vm-ip> 'docker --version'
ssh cal@<vm-ip> 'sudo systemctl status ssh'
ssh cal@<vm-ip> 'wc -l < ~/.ssh/authorized_keys'
```

### Recovery Procedures

#### Emergency SSH Access
```bash
# If primary SSH key fails, use emergency key
ssh -i ~/.ssh/emergency_homelab_rsa cal@<vm-ip>

# If all SSH access fails, use Proxmox console
# VM -> Console in Proxmox web interface

# Reset SSH configuration
# (assumes a backup was saved earlier as /etc/ssh/sshd_config.backup)
sudo cp /etc/ssh/sshd_config.backup /etc/ssh/sshd_config
sudo systemctl restart ssh
```

#### Complete VM Reset
```bash
# If VM is completely broken, restore from template
# (the VM must be stopped first; deleting it permanently destroys its disks)
pvesh delete /nodes/pve/qemu/<vmid>
pvesh create /nodes/pve/qemu/<template-id>/clone -newid <vmid> -name <new-name>

# Or re-run cloud-init provisioning
# Delete VM and recreate with same cloud-init configuration
```

## Prevention Best Practices

### Pre-Deployment Checks
```bash
# Verify SSH keys exist
ls -la ~/.ssh/homelab_rsa*
ls -la ~/.ssh/emergency_homelab_rsa*

# Test network connectivity to target subnet
ping -c 3 10.10.0.1

# Verify Proxmox storage space
pvesh get /nodes/pve/storage
```
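
The key check above can be scripted so that a provisioning run stops early when a key is missing. This is a minimal sketch; the function name `check_keys` is hypothetical, and it only verifies that each private key and its `.pub` file exist and are non-empty.

```bash
# check_keys DIR NAME...: verify that DIR/NAME and DIR/NAME.pub exist and are
# non-empty for every NAME. Hypothetical helper for illustration.
check_keys() {
    dir="$1"; shift
    rc=0
    for name in "$@"; do
        for f in "$dir/$name" "$dir/$name.pub"; do
            # -s is true only for existing files with a size greater than zero
            [ -s "$f" ] || { echo "missing or empty: $f" >&2; rc=1; }
        done
    done
    return "$rc"
}
```

With the key names used in this guide, the call would be `check_keys ~/.ssh homelab_rsa emergency_homelab_rsa`.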

### Monitoring and Alerts
```bash
# Create health check script (save everything from the shebang line to "done")
#!/bin/bash
# vm-health-monitor.sh
for ip in 10.10.0.{200..210}; do
    # BatchMode=yes makes ssh fail instead of prompting, so cron runs cannot hang
    if ssh -o ConnectTimeout=5 -o BatchMode=yes "cal@$ip" 'uptime' >/dev/null 2>&1; then
        echo "✅ $ip: OK"
    else
        echo "❌ $ip: FAILED"
    fi
done

# Schedule regular checks
# Add to crontab: */15 * * * * /path/to/vm-health-monitor.sh
```
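
For use from cron it can be convenient if the check stays silent on success and returns a non-zero status on failure. The following is a minimal sketch of that variant; the function names `probe` and `check_hosts` are hypothetical, and the probe reuses the SSH test from the script above.

```bash
# probe HOST: succeed if a non-interactive SSH login as cal works.
# Hypothetical helper; redefine it to use a different check.
probe() {
    ssh -o ConnectTimeout=5 -o BatchMode=yes "cal@$1" 'uptime' >/dev/null 2>&1
}

# check_hosts HOST...: probe every host, print only the failures, and
# return 1 if at least one host failed.
check_hosts() {
    failed=0
    for host in "$@"; do
        if ! probe "$host"; then
            echo "FAILED: $host"
            failed=$((failed + 1))
        fi
    done
    [ "$failed" -eq 0 ]
}
```

A call such as `check_hosts 10.10.0.200 10.10.0.201` prints one line per unreachable host. Since cron mails any output a job produces when mail delivery is configured, printing only failures keeps successful runs quiet.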

## Emergency Contacts and Resources

### Documentation Links
- **Proxmox Documentation**: https://pve.proxmox.com/wiki/
- **Cloud-Init Documentation**: https://cloud-init.readthedocs.io/
- **Docker Installation Guide**: https://docs.docker.com/engine/install/ubuntu/

### Recovery Information
- **SSH Keys Location**: `/mnt/NV2/ssh-keys/backup-*/`
- **Emergency Access**: Use Proxmox console for direct VM access
- **Backup Strategy**: VM snapshots before major changes

### Quick Reference Commands
```bash
# VM Status
pvesh get /nodes/pve/qemu/<vmid>/status/current

# Start/Stop VM
pvesh create /nodes/pve/qemu/<vmid>/status/start
pvesh create /nodes/pve/qemu/<vmid>/status/stop

# SSH with different key
ssh -i ~/.ssh/emergency_homelab_rsa cal@<vm-ip>

# Docker system info
docker system info
docker system df
```