claude-home/reference/vm-management/troubleshooting.md
Cal Corum 7edb4a3a9c CLAUDE: Update VM management patterns and Tdarr operational scripts
2025-08-12 12:18:43 -05:00

VM Management Troubleshooting Guide

Troubleshooting guide for Proxmox VM provisioning, SSH connectivity, Docker installation, and common configuration issues. The commands assume a Proxmox node named pve, Ubuntu guests, and the user cal; substitute your own values.

Common Issues and Solutions

1. VM Provisioning Failures

Cloud-Init Not Working

Symptoms:

  • VM starts but cloud-init configuration not applied
  • User account not created
  • SSH keys not installed

Diagnosis:

# Check cloud-init status
ssh root@<vm-ip> 'cloud-init status --long'

# View cloud-init logs
ssh root@<vm-ip> 'cat /var/log/cloud-init.log'
ssh root@<vm-ip> 'cat /var/log/cloud-init-output.log'

# Check cloud-init configuration
ssh root@<vm-ip> 'cloud-init query userdata'

Solutions:

# Re-run cloud-init (if safe to do so)
ssh root@<vm-ip> 'cloud-init clean --logs'
ssh root@<vm-ip> 'cloud-init init --local'
ssh root@<vm-ip> 'cloud-init init'
ssh root@<vm-ip> 'cloud-init modules --mode=config'
ssh root@<vm-ip> 'cloud-init modules --mode=final'

# Force user creation if missing
# (useradd fails if a listed group does not exist, so create docker first)
ssh root@<vm-ip> 'groupadd -f docker && useradd -m -s /bin/bash -G sudo,docker cal'

# Fix YAML syntax in cloud-init if needed
# Common issues: incorrect indentation, missing quotes
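
Two of the most common user-data mistakes can be caught before the VM is ever booted. The sketch below is a hypothetical helper (the name `check_user_data` is not part of this repository) that checks for the `#cloud-config` header that cloud-config user-data must carry on its first line, and for tab characters used as indentation, which YAML does not allow.

```bash
#!/usr/bin/env bash
# Hypothetical helper: lint a cloud-init user-data file for two common mistakes.
check_user_data() {
    local file="$1" status=0 first_line=""

    # cloud-config user-data must start with this exact header line.
    IFS= read -r first_line < "$file" || true
    if [ "$first_line" != "#cloud-config" ]; then
        echo "missing '#cloud-config' header on line 1"
        status=1
    fi

    # YAML forbids tab characters for indentation.
    if grep -q $'^[ ]*\t' "$file"; then
        echo "tab character used for indentation"
        status=1
    fi

    return "$status"
}

# Example: check_user_data ./user-data.yaml && echo "basic checks passed"
```

This covers only the two checks named above. For full schema validation, recent cloud-init releases provide `cloud-init schema --config-file <file>`, assuming the installed version supports it.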

VM Won't Start

Symptoms:

  • VM fails to boot
  • Kernel panic or boot errors
  • Hangs during startup

Diagnosis:

# Check VM configuration in Proxmox
pvesh get /nodes/pve/qemu/<vmid>/config

# View console output
# Use Proxmox web interface Console tab

# Check VM resource allocation
pvesh get /nodes/pve/qemu/<vmid>/status/current

Solutions:

# Increase memory if low
pvesh set /nodes/pve/qemu/<vmid>/config -memory 2048

# Check disk space and format
pvesh get /nodes/pve/storage

# Reset to safe configuration
pvesh set /nodes/pve/qemu/<vmid>/config -cpu host -cores 2

2. SSH Connection Issues

Cannot Connect to VM

Symptoms:

  • Connection timeout
  • Connection refused
  • Host unreachable

Diagnosis:

# Test network connectivity
ping <vm-ip>

# Check SSH port
nc -zv <vm-ip> 22
nmap -p 22 <vm-ip>

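# Check from Proxmox console
# Use Proxmox web interface -> VM -> Console
systemctl status ssh   # unit is "ssh" on Ubuntu/Debian, "sshd" on RHEL-family
ss -tlnp | grep ':22'  # netstat needs the net-tools package, often not installed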

Solutions:

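# Start SSH service (via console)
# On Ubuntu/Debian the unit is "ssh"; "sshd" is typically only an alias, which `systemctl enable` rejects
systemctl start ssh
systemctl enable ssh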

# Check firewall (via console)
ufw status
# If active and blocking SSH:
ufw allow ssh

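# Reset network configuration (via console)
ip addr show
dhclient  # If using DHCP and dhclient is installed
netplan apply                 # Ubuntu with netplan (as used later in this guide)
systemctl restart networking  # Only on ifupdown-based systems such as Debian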
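
The three symptoms above point to different layers, so it helps to read the exact error text before choosing a fix. The following sketch is a hypothetical helper (`classify_ssh_failure` is an illustrative name) that maps the message printed by `ssh` or `nc` to the most likely cause. The mapping is a rule of thumb, not a definitive diagnosis.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map an ssh/nc error message to the most likely cause.
classify_ssh_failure() {
    local msg="${1,,}"   # lowercase for case-insensitive matching (bash 4+)
    case "$msg" in
        *"connection refused"*)
            echo "host is up but nothing is listening on the port: check the SSH service" ;;
        *"timed out"*|*"timeout"*)
            echo "packets are dropped or the host is down: check VM state, IP address, and firewall" ;;
        *"no route to host"*|*"unreachable"*)
            echo "no network path: check the VM's IP configuration, bridge, and subnet" ;;
        *"permission denied"*)
            echo "sshd is reachable but rejected the credentials: check keys and sshd_config" ;;
        *)
            echo "unrecognized error: retry with ssh -vvv" ;;
    esac
}

# Example:
#   classify_ssh_failure "$(ssh -o ConnectTimeout=5 cal@<vm-ip> true 2>&1)"
```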

SSH Key Authentication Fails

Symptoms:

  • Password prompts despite keys being installed
  • Permission denied (publickey)
  • "No more authentication methods to try"

Diagnosis:

# Verbose SSH connection
ssh -vvv cal@<vm-ip>

# Check authorized_keys file (via console or password auth)
ls -la ~/.ssh/
cat ~/.ssh/authorized_keys

Solutions:

# Fix file permissions
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys

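# Verify key content
wc -l < ~/.ssh/authorized_keys  # Should show 2 (primary and emergency keys)

# Re-deploy keys manually
# (run on the VM; the .pub files must first be copied there, e.g. pasted via the console.
#  If password auth still works, ssh-copy-id -i <key>.pub cal@<vm-ip> from the workstation does the same.)
cat ~/.ssh/homelab_rsa.pub >> ~/.ssh/authorized_keys
cat ~/.ssh/emergency_homelab_rsa.pub >> ~/.ssh/authorized_keys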

# Check SSH configuration
sudo grep -E "(PubkeyAuth|PasswordAuth)" /etc/ssh/sshd_config
sudo systemctl restart sshd
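
The permission fixes above can be turned into a check that reports what is wrong before anything is changed. This is a minimal sketch: `check_ssh_perms` is a hypothetical name, the expected modes are the 700/600 values used in this guide, and `stat -c` assumes GNU coreutils (as on Ubuntu).

```bash
#!/usr/bin/env bash
# Hypothetical helper: report deviations from the 700/600 modes used in this guide.
check_ssh_perms() {
    local dir="${1:-$HOME/.ssh}" status=0 mode

    mode=$(stat -c '%a' "$dir")            # GNU stat: octal permission bits
    if [ "$mode" != "700" ]; then
        echo "$dir has mode $mode, expected 700"
        status=1
    fi

    if [ -f "$dir/authorized_keys" ]; then
        mode=$(stat -c '%a' "$dir/authorized_keys")
        if [ "$mode" != "600" ]; then
            echo "$dir/authorized_keys has mode $mode, expected 600"
            status=1
        fi
    else
        echo "$dir/authorized_keys is missing"
        status=1
    fi

    return "$status"
}

# Example (on the VM): check_ssh_perms ~/.ssh || echo "fix permissions as shown above"
```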

SSH Configuration Problems

Symptoms:

  • SSH works but with wrong settings
  • Root access when it should be disabled
  • Password authentication enabled

Diagnosis:

# Check effective SSH configuration
sudo sshd -T | grep -E "(passwordauth|pubkeyauth|permitroot)"

# View SSH configuration files
cat /etc/ssh/sshd_config
ls /etc/ssh/sshd_config.d/

Solutions:

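# Apply security hardening manually
# Note: sshd keeps the first value it reads per keyword, so an earlier drop-in can override this file
sudo tee /etc/ssh/sshd_config.d/99-homelab-security.conf << 'EOF'
PasswordAuthentication no
PubkeyAuthentication yes
PermitRootLogin no
AllowUsers cal
ClientAliveInterval 300
ClientAliveCountMax 2
MaxAuthTries 3
EOF

# Validate the configuration before restarting; a syntax error can lock you out
sudo sshd -t && sudo systemctl restart sshd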
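
If `sshd -T` still reports the old value after adding a drop-in file, another file is probably setting the keyword first. sshd uses the first value it obtains for each keyword, and files in `sshd_config.d` are read in lexical order. On cloud images a file such as `50-cloud-init.conf` may exist, in which case it takes precedence over a `99-` file. The sketch below is a hypothetical helper (`first_sshd_setting` is an illustrative name) that shows which drop-in sets a keyword first. It only inspects the drop-in directory and ignores `Match` blocks, so `sshd -T` remains the authoritative check.

```bash
#!/usr/bin/env bash
# Hypothetical helper: find the first drop-in file that sets a given sshd keyword.
first_sshd_setting() {
    local keyword="$1" dir="${2:-/etc/ssh/sshd_config.d}" file line

    # The shell expands the glob in sorted order, matching the order sshd reads the files.
    for file in "$dir"/*.conf; do
        [ -f "$file" ] || continue
        # First uncommented line that starts with the keyword (case-insensitive).
        line=$(grep -i -m1 -E "^[[:space:]]*${keyword}[[:space:]=]" "$file") || continue
        echo "${file##*/}: ${line}"
        return 0
    done
    return 1
}

# Example: first_sshd_setting PasswordAuthentication
```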

3. Docker Installation Issues

Docker Installation Fails

Symptoms:

  • Docker packages not found
  • GPG key verification fails
  • Permission denied errors

Diagnosis:

# Check internet connectivity
ping google.com
curl -I https://download.docker.com

# Check repository configuration
cat /etc/apt/sources.list.d/docker.list
apt-cache policy docker-ce

# Check for conflicting packages
dpkg -l | grep docker

Solutions:

# Remove conflicting packages
sudo apt remove -y docker docker-engine docker.io containerd runc

# Re-add Docker repository
sudo install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --yes --dearmor -o /etc/apt/keyrings/docker.gpg
sudo chmod a+r /etc/apt/keyrings/docker.gpg  # apt must be able to read the key, or verification fails
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list

# Install Docker
sudo apt update
sudo apt install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin

Docker Permission Issues

Symptoms:

  • "Permission denied" when running docker commands
  • Must use sudo for docker commands
  • User not in docker group

Diagnosis:

# Check user groups
groups
groups cal

# Check docker group exists
getent group docker

# Check docker service
systemctl status docker

Solutions:

# Create docker group if missing (-f succeeds if it already exists)
sudo groupadd -f docker

# Add user to docker group
sudo usermod -aG docker cal

# Apply group membership (log out and back in, or start a subshell with the new group)
newgrp docker

# Fix socket permissions (default is root:docker, mode 660)
sudo chown root:docker /var/run/docker.sock
sudo chmod 660 /var/run/docker.sock
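
A frequent source of confusion is that `usermod -aG` changes the group database immediately, while sessions that are already logged in keep their old group list. The sketch below separates the two cases: `id -nG USER` reads the database, and plain `id -nG` reports the current session. The function names are hypothetical, and the `active` value is only meaningful when the function is run as the user being checked.

```bash
#!/usr/bin/env bash
# Hypothetical helpers: distinguish "group configured" from "group active in this session".

# Return 0 if the first argument appears as a word in the space-separated second argument.
has_word() {
    local needle="$1" word
    for word in $2; do
        [ "$word" = "$needle" ] && return 0
    done
    return 1
}

docker_group_state() {
    local user="${1:-$(id -un)}" group="${2:-docker}"
    local configured=no active=no

    # Group database (what a new login session will get).
    has_word "$group" "$(id -nG "$user" 2>/dev/null)" && configured=yes
    # Current session (what docker commands run right now will use).
    has_word "$group" "$(id -nG)" && active=yes

    echo "configured=$configured active=$active"
}

# Example: docker_group_state cal docker
```

A result of `configured=yes active=no` means the group change is in place and only a new login (or `newgrp docker`) is missing.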

Docker Service Won't Start

Symptoms:

  • Docker daemon not running
  • Socket connection errors
  • systemctl shows failed status

Diagnosis:

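# Check service status
systemctl status docker
journalctl -u docker.service -n 50 --no-pager  # add -f to follow live

# Run the daemon in the foreground for debug output (stop the service first)
sudo systemctl stop docker docker.socket
sudo dockerd --debug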

# Check system resources
df -h
free -h

Solutions:

# Restart Docker service
sudo systemctl restart docker
sudo systemctl enable docker

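# Clear Docker temporary data if corrupted
sudo systemctl stop docker
# find runs as root; a glob after sudo would be expanded by the unprivileged shell and match nothing
sudo find /var/lib/docker/tmp -mindepth 1 -delete
sudo systemctl start docker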

# Reset Docker configuration
sudo systemctl stop docker
sudo mv /etc/docker/daemon.json /etc/docker/daemon.json.bak
sudo systemctl start docker

4. System Update Issues

Package Update Failures

Symptoms:

  • apt update fails
  • Repository errors
  • Dependency conflicts

Diagnosis:

# Check repository status
sudo apt update
cat /etc/apt/sources.list
ls /etc/apt/sources.list.d/

# Check disk space
df -h /
df -h /var

Solutions:

# Fix broken packages
sudo apt --fix-broken install
sudo dpkg --configure -a

# Clean package cache
sudo apt clean
sudo apt autoclean
sudo apt autoremove

# Reset sources if needed
sudo cp /etc/apt/sources.list /etc/apt/sources.list.backup
# Manually edit to use main Ubuntu repositories

5. Network Configuration Problems

IP Configuration Issues

Symptoms:

  • VM has wrong IP address
  • No network connectivity
  • DNS resolution fails

Diagnosis:

# Check network configuration
ip addr show
ip route show
cat /etc/netplan/*.yaml

# Test connectivity
ping 10.10.0.1  # Gateway
ping 8.8.8.8    # External DNS
nslookup google.com

Solutions:

# Fix netplan configuration
sudo nano /etc/netplan/00-installer-config.yaml

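# Example correct configuration:
network:
  version: 2
  ethernets:
    ens18:
      dhcp4: false
      addresses: [10.10.0.200/24]
      # gateway4 is deprecated in current netplan; older releases may still need gateway4: 10.10.0.1
      routes:
        - to: default
          via: 10.10.0.1
      nameservers:
        addresses: [10.10.0.16, 8.8.8.8]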

# Apply configuration
sudo netplan apply

DNS Resolution Problems

Symptoms:

  • Cannot resolve domain names
  • Package installation fails
  • Hostname lookups fail

Diagnosis:

# Check DNS configuration
cat /etc/resolv.conf
resolvectl status  # replaces systemd-resolve --status, which newer releases no longer ship

# Test DNS
nslookup google.com
dig google.com

Solutions:

# Fix DNS in netplan
sudo nano /etc/netplan/00-installer-config.yaml
# Add nameservers section as shown above

# Temporary DNS fix (lost when systemd-resolved regenerates the file, e.g. on restart or reboot)
echo "nameserver 8.8.8.8" | sudo tee /etc/resolv.conf

# Restart networking
sudo netplan apply
sudo systemctl restart systemd-resolved
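
When reading `/etc/resolv.conf` on a systemd-resolved system, a single `nameserver 127.0.0.53` entry is the local stub resolver, not the upstream server. In that case the real upstream servers are shown by `resolvectl status`. The sketch below uses hypothetical helper names to extract the configured nameservers and flag the stub case.

```bash
#!/usr/bin/env bash
# Hypothetical helpers: inspect the nameservers listed in a resolv.conf-style file.

# Print one nameserver address per line.
list_nameservers() {
    local file="${1:-/etc/resolv.conf}"
    awk '$1 == "nameserver" { print $2 }' "$file"
}

# Return 0 if the file points at the systemd-resolved stub listener.
uses_stub_resolver() {
    list_nameservers "$1" | grep -qx '127\.0\.0\.53'
}

# Example:
#   if uses_stub_resolver; then resolvectl status; else list_nameservers; fi
```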

6. Storage and Disk Issues

Disk Space Problems

Symptoms:

  • VM runs out of disk space
  • Cannot install packages
  • Docker images won't download

Diagnosis:

# Check disk usage
df -h
du -sh /home/*
du -sh /var/*

# Check for large files (-xdev stays on the root filesystem and skips /proc, /sys, and mounts)
sudo find / -xdev -type f -size +100M 2>/dev/null | head -10

Solutions:

# Clean system
sudo apt clean
sudo apt autoremove
docker system prune -a  # removes ALL unused images, not only dangling ones

# Extend disk in Proxmox (if needed)
# Use Proxmox web interface: VM -> Hardware -> Hard Disk -> Resize

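# Extend filesystem after disk resize
# (confirm the device and partition with lsblk first; VirtIO disks appear as /dev/vda)
sudo growpart /dev/sda 1
sudo resize2fs /dev/sda1  # ext4; for XFS use xfs_growfs / instead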
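
To catch a filling disk before packages or image pulls start failing, the `df` check can be wrapped in a threshold test. This is a minimal sketch: the function names and the 90% default limit are assumptions, not values taken from this repository.

```bash
#!/usr/bin/env bash
# Hypothetical helpers: warn when a filesystem crosses a usage threshold.

# Print the used percentage (integer, no % sign) of the filesystem holding the path.
disk_usage_pct() {
    # -P forces the portable one-line-per-filesystem format so the columns are stable.
    df -P "${1:-/}" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Print a status line; return non-zero when usage is at or above the limit.
check_disk() {
    local path="${1:-/}" limit="${2:-90}" used
    used=$(disk_usage_pct "$path")
    if [ "$used" -ge "$limit" ]; then
        echo "WARN $path is ${used}% full (limit ${limit}%)"
        return 1
    fi
    echo "OK $path is ${used}% full"
}

# Example: check_disk / 90 || echo "clean up or extend the disk as shown above"
```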

Advanced Troubleshooting

Post-Install Script Debug Mode

# Run script with debug output
bash -x ./scripts/vm-management/vm-post-install.sh <vm-ip> <user>

# Check specific steps manually
ssh cal@<vm-ip> 'docker --version'
ssh cal@<vm-ip> 'sudo systemctl status sshd'
ssh cal@<vm-ip> 'cat ~/.ssh/authorized_keys | wc -l'

Recovery Procedures

Emergency SSH Access

# If primary SSH key fails, use emergency key
ssh -i ~/.ssh/emergency_homelab_rsa cal@<vm-ip>

# If all SSH access fails, use Proxmox console
# VM -> Console in Proxmox web interface

# Reset SSH configuration
sudo cp /etc/ssh/sshd_config.backup /etc/ssh/sshd_config
sudo systemctl restart sshd

Complete VM Reset

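# If VM is completely broken, restore from template
# WARNING: deleting the VM also removes its disks
pvesh create /nodes/pve/qemu/<vmid>/status/stop   # a running VM cannot be deleted
pvesh delete /nodes/pve/qemu/<vmid>
pvesh create /nodes/pve/qemu/<template-id>/clone -newid <vmid> -name <new-name>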

# Or re-run cloud-init provisioning
# Delete VM and recreate with same cloud-init configuration

Prevention Best Practices

Pre-Deployment Checks

# Verify SSH keys exist
ls -la ~/.ssh/homelab_rsa*
ls -la ~/.ssh/emergency_homelab_rsa*

# Test network connectivity to target subnet
ping -c 4 10.10.0.1

# Verify Proxmox storage space
pvesh get /nodes/pve/storage

Monitoring and Alerts

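# Create health check script
# Save as vm-health-monitor.sh; the #!/bin/bash line must be the first line of the file
#!/bin/bash
# vm-health-monitor.sh
for ip in 10.10.0.{200..210}; do
    # BatchMode=yes makes ssh fail instead of waiting at a prompt when run from cron
    if ssh -o BatchMode=yes -o ConnectTimeout=5 "cal@$ip" 'uptime' >/dev/null 2>&1; then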
        echo "✅ $ip: OK"
    else
        echo "❌ $ip: FAILED"
    fi
done

# Schedule regular checks
# Add to crontab: */15 * * * * /path/to/vm-health-monitor.sh
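
For alerting, it is useful if the health check also exits non-zero when any host fails, so that a cron wrapper or monitoring tool can react to the exit code rather than parse the output. The variant below is a sketch under that assumption. `check_hosts` and `ssh_probe` are hypothetical names, and the probe is passed in as a function so the loop can be exercised without real VMs.

```bash
#!/usr/bin/env bash
# Hypothetical variant of the health check that reports failure through its exit code.

# Usage: check_hosts PROBE_COMMAND host...
# Runs PROBE_COMMAND with each host as its argument; returns non-zero if any probe fails.
check_hosts() {
    local probe="$1" failed=0 host
    shift
    for host in "$@"; do
        if "$probe" "$host" >/dev/null 2>&1; then
            echo "OK $host"
        else
            echo "FAILED $host"
            failed=$((failed + 1))
        fi
    done
    [ "$failed" -eq 0 ]
}

# Same SSH probe as the script above; BatchMode avoids hanging on prompts under cron.
ssh_probe() {
    ssh -o BatchMode=yes -o ConnectTimeout=5 "cal@$1" uptime
}

# Example: check_hosts ssh_probe 10.10.0.{200..210} || echo "at least one VM is unreachable"
```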

Emergency Contacts and Resources

Recovery Information

  • SSH Keys Location: /mnt/NV2/ssh-keys/backup-*/
  • Emergency Access: Use Proxmox console for direct VM access
  • Backup Strategy: VM snapshots before major changes

Quick Reference Commands

# VM Status
pvesh get /nodes/pve/qemu/<vmid>/status/current

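# Start/Stop VM
pvesh create /nodes/pve/qemu/<vmid>/status/start
pvesh create /nodes/pve/qemu/<vmid>/status/stop      # hard stop, like pulling the plug
pvesh create /nodes/pve/qemu/<vmid>/status/shutdown  # graceful ACPI shutdown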

# SSH with different key
ssh -i ~/.ssh/emergency_homelab_rsa cal@<vm-ip>

# Docker system info
docker system info
docker system df