claude-home/reference/vm-management/troubleshooting.md
Cal Corum 7edb4a3a9c CLAUDE: Update VM management patterns and Tdarr operational scripts
2025-08-12 12:18:43 -05:00


# VM Management Troubleshooting Guide
Complete troubleshooting guide for Proxmox VM provisioning, SSH connectivity, Docker installation, and common configuration issues.
## Common Issues and Solutions
### 1. VM Provisioning Failures
#### Cloud-Init Not Working
**Symptoms:**
- VM starts, but the cloud-init configuration is not applied
- User account is not created
- SSH keys are not installed

**Diagnosis:**
```bash
# Check cloud-init status
ssh root@<vm-ip> 'cloud-init status --long'
# View cloud-init logs
ssh root@<vm-ip> 'cat /var/log/cloud-init.log'
ssh root@<vm-ip> 'cat /var/log/cloud-init-output.log'
# Check cloud-init configuration
ssh root@<vm-ip> 'cloud-init query userdata'
```
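
The `status:` line of `cloud-init status --long` is usually the quickest signal. The following is a minimal sketch that maps that line to a next step. The function name `explain_cloud_init_status` is hypothetical, and only the `done`, `running`, `error`, and `disabled` values are handled explicitly.

```bash
# Hypothetical helper: read `cloud-init status` output on stdin and
# print a suggested next step based on the "status:" line.
explain_cloud_init_status() {
    local status
    # Extract the value after "status:" from the first matching line.
    status=$(sed -n 's/^status:[[:space:]]*//p' | head -n 1)
    case "$status" in
        done)     echo "done: cloud-init finished; review user-data if settings are missing" ;;
        running)  echo "running: still in progress; retry with 'cloud-init status --wait'" ;;
        error)    echo "error: inspect /var/log/cloud-init.log and cloud-init-output.log" ;;
        disabled) echo "disabled: cloud-init will not run on this VM" ;;
        *)        echo "unknown: unrecognized status '${status}'" ;;
    esac
}
```

One possible use is `ssh root@<vm-ip> 'cloud-init status --long' | explain_cloud_init_status`.
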
**Solutions:**
```bash
# Re-run cloud-init (if safe to do so)
ssh root@<vm-ip> 'cloud-init clean --logs'
ssh root@<vm-ip> 'cloud-init init --local'
ssh root@<vm-ip> 'cloud-init init'
ssh root@<vm-ip> 'cloud-init modules --mode=config'
ssh root@<vm-ip> 'cloud-init modules --mode=final'
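# Force user creation if missing (useradd fails if a listed group does not
# exist, so add the docker group membership after Docker is installed)
ssh root@<vm-ip> 'useradd -m -s /bin/bash -G sudo cal'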
# Fix YAML syntax in cloud-init if needed
# Common issues: incorrect indentation, missing quotes
```
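
Two of the most common user-data mistakes can be caught before the VM is ever booted. The sketch below assumes the user-data is available as a local file; the function name `check_user_data` is hypothetical. It checks that the file begins with `#cloud-config` and that no line is indented with a tab, which YAML does not allow.

```bash
# Hypothetical pre-flight check for a cloud-init user-data file.
# Returns 0 if the file passes both checks, 1 otherwise.
check_user_data() {
    local file="$1" rc=0
    # cloud-init only parses the file as cloud-config when the first
    # line begins with "#cloud-config".
    case "$(head -n 1 "$file")" in
        '#cloud-config'*) ;;
        *) echo "first line is not '#cloud-config'"; rc=1 ;;
    esac
    # YAML forbids tab characters for indentation.
    if grep -q "$(printf '^[ ]*\t')" "$file"; then
        echo "tab used for indentation"
        rc=1
    fi
    return "$rc"
}
```

Recent cloud-init releases also ship a `cloud-init schema` subcommand for fuller validation; whether it is available depends on the installed version.
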
#### VM Won't Start
**Symptoms:**
- VM fails to boot
- Kernel panic or boot errors
- Hangs during startup

**Diagnosis:**
```bash
# Check VM configuration in Proxmox
pvesh get /nodes/pve/qemu/<vmid>/config
# View console output
# Use Proxmox web interface Console tab
# Check VM resource allocation
pvesh get /nodes/pve/qemu/<vmid>/status/current
```
**Solutions:**
```bash
# Increase memory if low
pvesh set /nodes/pve/qemu/<vmid>/config -memory 2048
# Check disk space and format
pvesh get /nodes/pve/storage
# Reset to safe configuration
pvesh set /nodes/pve/qemu/<vmid>/config -cpu host -cores 2
```
### 2. SSH Connection Issues
#### Cannot Connect to VM
**Symptoms:**
- Connection timeout
- Connection refused
- Host unreachable

**Diagnosis:**
```bash
# Test network connectivity
ping <vm-ip>
# Check SSH port
nc -zv <vm-ip> 22
nmap -p 22 <vm-ip>
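# Check from Proxmox console
# Use Proxmox web interface -> VM -> Console
systemctl status ssh   # unit is "ssh" on Debian/Ubuntu, "sshd" on RHEL-family
ss -tlnp | grep ':22'  # netstat -tlnp also works if net-tools is installed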
```
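
The three symptoms listed above point to different causes, and the error text from `ssh` or `nc` is enough to tell them apart. A minimal sketch, with a hypothetical function name and suggested next steps that are a rule of thumb rather than a definitive diagnosis:

```bash
# Hypothetical helper: map an ssh/nc error message to the likely cause.
classify_connect_error() {
    case "$1" in
        *"Connection refused"*)
            echo "refused: host is reachable but nothing listens on the port; check the SSH service" ;;
        *"timed out"*)
            echo "timeout: packets are being dropped; check firewall rules and the VM IP address" ;;
        *"No route to host"*|*"unreachable"*)
            echo "unreachable: check that the VM is running and attached to the expected network" ;;
        *)
            echo "unknown: rerun with 'ssh -vvv' for more detail" ;;
    esac
}
```

For example, `classify_connect_error "$(ssh -o ConnectTimeout=5 cal@<vm-ip> true 2>&1)"` prints the category for the current failure.
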
**Solutions:**
```bash
# Start SSH service (via console); the unit is "ssh" on Debian/Ubuntu
systemctl start ssh
systemctl enable ssh
# Check firewall (via console)
ufw status
# If active and blocking SSH:
ufw allow ssh
# Reset network configuration
ip addr show
dhclient # If using DHCP
systemctl restart networking # Debian ifupdown; on netplan-based Ubuntu use: netplan apply
```
#### SSH Key Authentication Fails
**Symptoms:**
- Password prompts despite keys being installed
- Permission denied (publickey)
- "No more authentication methods to try"

**Diagnosis:**
```bash
# Verbose SSH connection
ssh -vvv cal@<vm-ip>
# Check authorized_keys file (via console or password auth)
ls -la ~/.ssh/
cat ~/.ssh/authorized_keys
```
**Solutions:**
```bash
# Fix file permissions
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys
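# Verify key content
wc -l < ~/.ssh/authorized_keys # Should show 2 keys
# Re-deploy keys manually (the .pub files must be present on the VM,
# for example pasted in via the console; re-running appends duplicates)
cat ~/.ssh/homelab_rsa.pub >> ~/.ssh/authorized_keys
cat ~/.ssh/emergency_homelab_rsa.pub >> ~/.ssh/authorized_keys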
# Check SSH configuration
sudo grep -E "(PubkeyAuth|PasswordAuth)" /etc/ssh/sshd_config
sudo systemctl restart sshd
```
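
With `StrictModes` enabled (the OpenSSH default), sshd ignores keys whose files or directories are writable by other users, so a permissions check is worth scripting. The sketch below tests for the exact modes set above; the function name `check_ssh_perms` is hypothetical, and it relies on GNU `stat -c`, so it would need adjusting on BSD or macOS.

```bash
# Hypothetical helper: confirm ~/.ssh is 700 and authorized_keys is 600.
# Uses GNU stat (-c); prints each deviation and returns 1 if any is found.
check_ssh_perms() {
    local dir="${1:-$HOME/.ssh}" rc=0 mode
    mode=$(stat -c '%a' "$dir") || return 1
    if [ "$mode" != "700" ]; then
        echo "$dir has mode $mode, expected 700"
        rc=1
    fi
    mode=$(stat -c '%a' "$dir/authorized_keys") || return 1
    if [ "$mode" != "600" ]; then
        echo "$dir/authorized_keys has mode $mode, expected 600"
        rc=1
    fi
    return "$rc"
}
```

Run it on the VM as the affected user, for example `check_ssh_perms` or `check_ssh_perms /home/cal/.ssh`.
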
#### SSH Configuration Problems
**Symptoms:**
- SSH works but with wrong settings
- Root login allowed when it should be disabled
- Password authentication still enabled

**Diagnosis:**
```bash
# Check effective SSH configuration
sudo sshd -T | grep -E "(passwordauth|pubkeyauth|permitroot)"
# View SSH configuration files
cat /etc/ssh/sshd_config
ls /etc/ssh/sshd_config.d/
```
**Solutions:**
```bash
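# Apply security hardening manually
# Note: sshd keeps the first value it reads for each keyword, so a
# lower-numbered drop-in (for example 50-cloud-init.conf) takes precedence
# over this file; confirm the effective result with `sudo sshd -T`.
sudo tee /etc/ssh/sshd_config.d/99-homelab-security.conf << 'EOF'
PasswordAuthentication no
PubkeyAuthentication yes
PermitRootLogin no
AllowUsers cal
ClientAliveInterval 300
ClientAliveCountMax 2
MaxAuthTries 3
EOF
# Validate the configuration before restarting to avoid a lockout
sudo sshd -t && sudo systemctl restart sshd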
```
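
Because several files can set the same keyword, it helps to verify the effective configuration rather than the file that was just written. A minimal sketch, assuming the three authentication settings from the hardening block are the ones to enforce; the function name `verify_sshd_hardening` is hypothetical.

```bash
# Hypothetical helper: read `sshd -T` output on stdin and report any
# hardening setting that differs from the expected value.
verify_sshd_hardening() {
    local effective rc=0 want
    effective=$(cat)
    for want in "passwordauthentication no" \
                "pubkeyauthentication yes" \
                "permitrootlogin no"; do
        # sshd -T prints lowercase keywords, one "keyword value" per line.
        if ! printf '%s\n' "$effective" | grep -qx "$want"; then
            echo "MISMATCH: expected '$want'"
            rc=1
        fi
    done
    return "$rc"
}
```

A typical invocation would be `sudo sshd -T | verify_sshd_hardening`; a non-zero exit status means at least one setting is being overridden elsewhere.
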
### 3. Docker Installation Issues
#### Docker Installation Fails
**Symptoms:**
- Docker packages not found
- GPG key verification fails
- Permission denied errors

**Diagnosis:**
```bash
# Check internet connectivity
ping google.com
curl -I https://download.docker.com
# Check repository configuration
cat /etc/apt/sources.list.d/docker.list
apt-cache policy docker-ce
# Check for conflicting packages
dpkg -l | grep docker
```
**Solutions:**
```bash
# Remove conflicting packages
sudo apt remove -y docker docker-engine docker.io containerd runc
# Re-add Docker repository
sudo mkdir -p /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list
# Install Docker
sudo apt update
sudo apt install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
```
#### Docker Permission Issues
**Symptoms:**
- "Permission denied" when running docker commands
- Must use sudo for docker commands
- User not in docker group

**Diagnosis:**
```bash
# Check user groups
groups
groups cal
# Check docker group exists
getent group docker
# Check docker service
systemctl status docker
```
**Solutions:**
```bash
# Create docker group if missing (-f succeeds if it already exists)
sudo groupadd -f docker
# Add user to docker group
sudo usermod -aG docker cal
# Apply group membership (logout/login or)
newgrp docker
# Restore the default socket ownership and mode
sudo chown root:docker /var/run/docker.sock
sudo chmod 660 /var/run/docker.sock
```
#### Docker Service Won't Start
**Symptoms:**
- Docker daemon not running
- Socket connection errors
- systemctl shows failed status

**Diagnosis:**
```bash
# Check service status
systemctl status docker
journalctl -u docker.service -n 50 --no-pager
# Run the daemon in the foreground with debug output (stop the service first)
sudo systemctl stop docker
sudo dockerd --debug
# Check system resources
df -h
free -h
```
**Solutions:**
```bash
# Restart Docker service
sudo systemctl restart docker
sudo systemctl enable docker
# Clear Docker data if corrupted
sudo systemctl stop docker
sudo rm -rf /var/lib/docker/tmp/*
sudo systemctl start docker
# Reset Docker configuration
sudo systemctl stop docker
sudo mv /etc/docker/daemon.json /etc/docker/daemon.json.bak
sudo systemctl start docker
```
### 4. System Update Issues
#### Package Update Failures
**Symptoms:**
- apt update fails
- Repository errors
- Dependency conflicts

**Diagnosis:**
```bash
# Check repository status
sudo apt update
cat /etc/apt/sources.list
ls /etc/apt/sources.list.d/
# Check disk space
df -h /
df -h /var
```
**Solutions:**
```bash
# Fix broken packages
sudo apt --fix-broken install
sudo dpkg --configure -a
# Clean package cache
sudo apt clean
sudo apt autoclean
sudo apt autoremove
# Reset sources if needed
sudo cp /etc/apt/sources.list /etc/apt/sources.list.backup
# Manually edit to use main Ubuntu repositories
```
### 5. Network Configuration Problems
#### IP Configuration Issues
**Symptoms:**
- VM has wrong IP address
- No network connectivity
- DNS resolution fails

**Diagnosis:**
```bash
# Check network configuration
ip addr show
ip route show
cat /etc/netplan/*.yaml
# Test connectivity
ping 10.10.0.1 # Gateway
ping 8.8.8.8 # External DNS
nslookup google.com
```
**Solutions:**
```bash
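# Fix netplan configuration
sudo nano /etc/netplan/00-installer-config.yaml
# Example correct configuration (YAML; indentation is significant):
network:
  version: 2
  ethernets:
    ens18:
      dhcp4: false
      addresses: [10.10.0.200/24]
      gateway4: 10.10.0.1  # deprecated in newer netplan releases; use a default route under "routes:" there
      nameservers:
        addresses: [10.10.0.16, 8.8.8.8]
# Test with automatic rollback, then apply
sudo netplan try
sudo netplan apply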
```
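
A static address that sits outside the gateway's subnet produces exactly the "no network connectivity" symptom, so it can help to check the pair before applying. The sketch below is one way to do that for IPv4; `ip_to_int` and `gateway_in_subnet` are hypothetical helper names, and octets with leading zeros are not handled.

```bash
# Hypothetical helper: convert a dotted-quad IPv4 address to an integer.
ip_to_int() {
    local a b c d
    IFS=. read -r a b c d <<EOF
$1
EOF
    echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

# Return 0 if the gateway lies inside the subnet given as address/prefix.
gateway_in_subnet() {
    local cidr="$1" gateway="$2" addr prefix mask
    addr="${cidr%/*}"      # part before the slash
    prefix="${cidr#*/}"    # prefix length after the slash
    mask=$(( (0xFFFFFFFF << (32 - prefix)) & 0xFFFFFFFF ))
    [ $(( $(ip_to_int "$addr") & mask )) -eq $(( $(ip_to_int "$gateway") & mask )) ]
}
```

With the example values above, `gateway_in_subnet 10.10.0.200/24 10.10.0.1` succeeds, while a gateway of `10.10.1.1` would fail the check.
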
#### DNS Resolution Problems
**Symptoms:**
- Cannot resolve domain names
- Package installation fails
- Hostname lookups fail

**Diagnosis:**
```bash
# Check DNS configuration
cat /etc/resolv.conf
resolvectl status # systemd-resolve --status on older releases
# Test DNS
nslookup google.com
dig google.com
```
**Solutions:**
```bash
# Fix DNS in netplan
sudo nano /etc/netplan/00-installer-config.yaml
# Add nameservers section as shown above
# Temporary DNS fix
echo "nameserver 8.8.8.8" | sudo tee /etc/resolv.conf
# Restart networking
sudo netplan apply
sudo systemctl restart systemd-resolved
```
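
When `/etc/resolv.conf` lists only `127.0.0.53`, the VM is using the systemd-resolved stub and the real upstream servers have to be read from `resolvectl status` instead. A small sketch for telling the two cases apart; both function names are hypothetical.

```bash
# Hypothetical helper: print the nameserver addresses from a
# resolv.conf-style file, one per line.
list_nameservers() {
    awk '$1 == "nameserver" { print $2 }' "${1:-/etc/resolv.conf}"
}

# Return 0 if the file points only at the local systemd-resolved stub.
uses_resolved_stub() {
    [ "$(list_nameservers "$1")" = "127.0.0.53" ]
}
```

For example, `uses_resolved_stub && resolvectl status` shows the upstream servers only when the stub is in use.
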
### 6. Storage and Disk Issues
#### Disk Space Problems
**Symptoms:**
- VM runs out of disk space
- Cannot install packages
- Docker images won't download

**Diagnosis:**
```bash
# Check disk usage
df -h
du -sh /home/*
du -sh /var/*
# Check for large files (-xdev stays on the root filesystem)
sudo find / -xdev -type f -size +100M 2>/dev/null | head -10
```
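
The `df` check can be turned into a threshold test so that nearly full filesystems stand out. A minimal sketch, assuming POSIX-format output from `df -P` and mount points without spaces; the function name `full_filesystems` and the default threshold of 90 percent are illustrative choices.

```bash
# Hypothetical helper: read `df -P` output on stdin and print every
# mount point whose usage is at or above the given percentage.
full_filesystems() {
    local threshold="${1:-90}"
    awk -v limit="$threshold" 'NR > 1 {
        use = $5
        sub(/%/, "", use)                    # "94%" -> "94"
        if (use + 0 >= limit + 0) print $6, $5
    }'
}
```

One possible use is `df -P | full_filesystems 85`, which prints nothing when every filesystem is below the limit.
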
**Solutions:**
```bash
# Clean system
sudo apt clean
sudo apt autoremove
docker system prune -a
# Extend disk in Proxmox (if needed)
# Use Proxmox web interface: VM -> Hardware -> Hard Disk -> Resize
# Extend filesystem after disk resize
# (assumes an ext4 root on /dev/sda1 without LVM; confirm with lsblk first)
sudo growpart /dev/sda 1
sudo resize2fs /dev/sda1
```
## Advanced Troubleshooting
### Post-Install Script Debug Mode
```bash
# Run script with debug output
bash -x ./scripts/vm-management/vm-post-install.sh <vm-ip> <user>
# Check specific steps manually
ssh cal@<vm-ip> 'docker --version'
ssh cal@<vm-ip> 'sudo systemctl status sshd'
ssh cal@<vm-ip> 'cat ~/.ssh/authorized_keys | wc -l'
```
### Recovery Procedures
#### Emergency SSH Access
```bash
# If primary SSH key fails, use emergency key
ssh -i ~/.ssh/emergency_homelab_rsa cal@<vm-ip>
# If all SSH access fails, use Proxmox console
# VM -> Console in Proxmox web interface
# Reset SSH configuration
sudo cp /etc/ssh/sshd_config.backup /etc/ssh/sshd_config
sudo systemctl restart sshd
```
#### Complete VM Reset
```bash
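# If VM is completely broken, restore from template
# (a running VM must be stopped before it can be deleted)
pvesh create /nodes/pve/qemu/<vmid>/status/stop
pvesh delete /nodes/pve/qemu/<vmid>
pvesh create /nodes/pve/qemu/<template-id>/clone -newid <vmid> -name <new-name>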
# Or re-run cloud-init provisioning
# Delete VM and recreate with same cloud-init configuration
```
## Prevention Best Practices
### Pre-Deployment Checks
```bash
# Verify SSH keys exist
ls -la ~/.ssh/homelab_rsa*
ls -la ~/.ssh/emergency_homelab_rsa*
# Test network connectivity to target subnet
ping 10.10.0.1
# Verify Proxmox storage space
pvesh get /nodes/pve/storage
```
### Monitoring and Alerts
```bash
#!/bin/bash
# vm-health-monitor.sh: report SSH reachability for each VM
for ip in 10.10.0.{200..210}; do
    # BatchMode stops ssh from hanging on a password prompt under cron
    if ssh -o BatchMode=yes -o ConnectTimeout=5 "cal@$ip" 'uptime' >/dev/null 2>&1; then
        echo "✅ $ip: OK"
    else
        echo "❌ $ip: FAILED"
    fi
done
# Schedule regular checks
# Add to crontab: */15 * * * * /path/to/vm-health-monitor.sh
```
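
For alerting, the health check above is more useful if it exits non-zero when any VM fails. The following is a sketch of that variant, not the script shipped in the repository: `check_hosts` and `ssh_probe` are hypothetical names, and the probe is passed in as an argument so it can be replaced by a different check.

```bash
# Hypothetical variant of the health check. The probe command is passed
# in as the first argument and receives each IP as its only argument.
check_hosts() {
    local probe="$1" failed=0 ip
    shift
    for ip in "$@"; do
        if "$probe" "$ip" >/dev/null 2>&1; then
            echo "OK $ip"
        else
            echo "FAILED $ip"
            failed=$((failed + 1))
        fi
    done
    # Non-zero exit lets cron or a wrapper script trigger an alert.
    [ "$failed" -eq 0 ]
}

# SSH-based probe: BatchMode stops ssh from waiting on a password
# prompt when key authentication fails.
ssh_probe() {
    ssh -o BatchMode=yes -o ConnectTimeout=5 "cal@$1" uptime
}
```

A call such as `check_hosts ssh_probe 10.10.0.200 10.10.0.201 || echo "one or more VMs are down"` prints one line per host and runs the fallback only when at least one probe fails.
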
## Emergency Contacts and Resources
### Documentation Links
- **Proxmox Documentation**: https://pve.proxmox.com/wiki/
- **Cloud-Init Documentation**: https://cloud-init.readthedocs.io/
- **Docker Installation Guide**: https://docs.docker.com/engine/install/ubuntu/
### Recovery Information
- **SSH Keys Location**: `/mnt/NV2/ssh-keys/backup-*/`
- **Emergency Access**: Use Proxmox console for direct VM access
- **Backup Strategy**: VM snapshots before major changes
### Quick Reference Commands
```bash
# VM Status
pvesh get /nodes/pve/qemu/<vmid>/status/current
# Start/Stop VM
pvesh create /nodes/pve/qemu/<vmid>/status/start
pvesh create /nodes/pve/qemu/<vmid>/status/stop
# SSH with different key
ssh -i ~/.ssh/emergency_homelab_rsa cal@<vm-ip>
# Docker system info
docker system info
docker system df
```