claude-home/vm-management/troubleshooting.md
Cal Corum 10c9e0d854 CLAUDE: Migrate to technology-first documentation architecture
2025-08-12 23:20:15 -05:00


Virtual Machine Management Troubleshooting Guide

VM Provisioning Issues

Cloud-Init Configuration Problems

Cloud-Init Not Executing

Symptoms:

  • VM starts but user accounts not created
  • SSH keys not deployed
  • Packages not installed
  • Configuration not applied

Diagnosis:

# Check cloud-init status and logs
ssh root@<vm-ip> 'cloud-init status --long'
ssh root@<vm-ip> 'cat /var/log/cloud-init.log'
ssh root@<vm-ip> 'cat /var/log/cloud-init-output.log'

# Verify cloud-init configuration
ssh root@<vm-ip> 'cloud-init query userdata'

# Check for YAML syntax errors
# (newer cloud-init releases expose this as 'cloud-init schema', without 'devel')
ssh root@<vm-ip> 'cloud-init devel schema --config-file /var/lib/cloud/instance/user-data.txt'
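
The status check above prints several lines; when scripting around it, only the overall state usually matters. A minimal sketch, assuming the output contains a `status: <state>` line as `cloud-init status --long` prints; `cloud_init_state` is a hypothetical helper name, not part of cloud-init:

```shell
# Hypothetical helper: extract the overall state from `cloud-init status --long`.
# Reads the status text on stdin and prints the value of the "status:" line.
# Exits non-zero if no such line is present.
cloud_init_state() {
    awk -F': *' '
        $1 == "status" { print $2; found = 1; exit }
        END { if (!found) exit 1 }
    '
}
```

Possible usage: `ssh root@<vm-ip> 'cloud-init status --long' | cloud_init_state`, then branch on values such as `done`, `running`, or `error`.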

Solutions:

# Re-run cloud-init (CAUTION: may overwrite changes)
ssh root@<vm-ip> 'cloud-init clean --logs'
ssh root@<vm-ip> 'cloud-init init --local'
ssh root@<vm-ip> 'cloud-init init'
ssh root@<vm-ip> 'cloud-init modules --mode=config'
ssh root@<vm-ip> 'cloud-init modules --mode=final'

# Manual user creation if cloud-init fails
# (useradd aborts if a listed group does not exist, so create docker first)
ssh root@<vm-ip> 'groupadd -f docker'
ssh root@<vm-ip> 'useradd -m -s /bin/bash -G sudo,docker cal'
ssh root@<vm-ip> 'mkdir -p /home/cal/.ssh'
ssh root@<vm-ip> 'chown cal:cal /home/cal/.ssh'
ssh root@<vm-ip> 'chmod 700 /home/cal/.ssh'

Invalid Cloud-Init YAML

Symptoms:

  • Cloud-init fails with syntax errors
  • Parser errors in cloud-init logs
  • Partial configuration application

Common YAML Issues:

# ❌ Incorrect indentation
users:
- name: cal
groups: [sudo, docker]  # Wrong indentation

# ✅ Correct indentation  
users:
  - name: cal
    groups: [sudo, docker]  # Proper indentation

# ❌ Missing quotes for special characters
ssh_authorized_keys:
  - ssh-rsa AAAAB3NzaC1... user@host  # May fail with special chars

# ✅ Quoted strings
ssh_authorized_keys:
  - "ssh-rsa AAAAB3NzaC1... user@host"

VM Boot and Startup Issues

VM Won't Start

Symptoms:

  • VM fails to boot from Proxmox
  • Kernel panic messages
  • Boot loop or hanging

Diagnosis:

# Check VM configuration
pvesh get /nodes/pve/qemu/<vmid>/config

# Check resource allocation
pvesh get /nodes/pve/qemu/<vmid>/status/current

# Review VM logs via Proxmox console
# Use Proxmox web interface -> VM -> Console

# Check Proxmox host resources
pvesh get /nodes/pve/status

Solutions:

# Increase memory allocation
pvesh set /nodes/pve/qemu/<vmid>/config -memory 4096

# Reset CPU configuration
pvesh set /nodes/pve/qemu/<vmid>/config -cpu host -cores 2

# Check and repair disk
# Stop VM, then:
pvesh get /nodes/pve/qemu/<vmid>/config | grep scsi0
# Use fsck on the disk image if needed

Resource Constraints

Symptoms:

  • VM extremely slow performance
  • Out-of-memory kills
  • Disk I/O bottlenecks

Diagnosis:

# Inside VM resource check
free -h
df -h
iostat 1 5
vmstat 1 5

# Proxmox host resource check
pvesh get /nodes/pve/status
cat /proc/meminfo
df -h /var/lib/vz

Solutions:

# Increase VM resources via Proxmox
pvesh set /nodes/pve/qemu/<vmid>/config -memory 8192
pvesh set /nodes/pve/qemu/<vmid>/config -cores 4

# Resize VM disk
# Proxmox GUI: Hardware -> Hard Disk -> Resize
# Then extend filesystem:
sudo growpart /dev/sda 1
sudo resize2fs /dev/sda1
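
The `growpart` call above takes the disk and the partition number as separate arguments, and the naming differs between SCSI/virtio disks (`/dev/sda1`) and NVMe or MMC devices (`/dev/nvme0n1p1`). A small sketch of splitting a partition path accordingly; `split_partition` is a hypothetical helper, and it deliberately rejects names it does not recognize (LVM volumes, for example):

```shell
# Hypothetical helper: split a partition path into "<disk> <partition-number>".
split_partition() {
    case "$1" in
        /dev/nvme*n*p[0-9]*|/dev/mmcblk*p[0-9]*)
            # NVMe/MMC partitions use a "p" separator: /dev/nvme0n1p2
            printf '%s %s\n' "${1%p[0-9]*}" "${1##*p}" ;;
        /dev/sd[a-z]*[0-9]|/dev/vd[a-z]*[0-9]|/dev/xvd[a-z]*[0-9])
            # SCSI/virtio/Xen partitions append the number directly: /dev/sda1
            disk=$(printf '%s' "$1" | sed 's/[0-9]*$//')
            printf '%s %s\n' "$disk" "${1#"$disk"}" ;;
        *)
            echo "unrecognized partition name: $1" >&2
            return 1 ;;
    esac
}
```

For example, `split_partition /dev/sda1` prints `/dev/sda 1`, which matches the argument order `growpart` expects. Note that `resize2fs` only applies to ext2/3/4 filesystems.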

SSH Access Issues

SSH Connection Failures

Cannot Connect to VM

Symptoms:

  • Connection timeout
  • Connection refused
  • Host unreachable

Diagnosis:

# Network connectivity tests
ping -c 4 <vm-ip>
traceroute <vm-ip>

# SSH service tests
nc -zv <vm-ip> 22
nmap -p 22 <vm-ip>

# From Proxmox console, check SSH service
systemctl status sshd
ss -tlnp | grep :22

Solutions:

# Via Proxmox console - restart SSH
# (the unit is named "ssh" on Debian/Ubuntu and "sshd" on RHEL-family systems)
systemctl restart ssh
systemctl enable ssh

# Check and configure firewall
ufw status
# If blocking SSH:
ufw allow ssh
ufw allow 22/tcp

# Network configuration reset
ip addr show
dhclient  # For DHCP
netplan apply  # Re-apply netplan config (Ubuntu has no "networking" service by default)
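
The three symptoms listed for this issue point at different layers, so the SSH client's error text is a useful first triage step. A rough sketch of that mapping; `classify_ssh_error` is a hypothetical helper, and the hints are starting points rather than a definitive diagnosis (a firewall REJECT rule also produces "Connection refused", for instance):

```shell
# Hypothetical helper: map an SSH client error message to a likely cause.
classify_ssh_error() {
    case "$1" in
        *"Connection refused"*)
            echo "refused: host is up but nothing accepts connections on the port (check the SSH service, or a firewall reject rule)" ;;
        *"timed out"*)
            echo "timeout: packets are being dropped (check firewall, VM power state, IP address)" ;;
        *"No route to host"*|*"unreachable"*)
            echo "routing: no path to the VM (check IP configuration, gateway, VLAN/bridge)" ;;
        *"Permission denied"*)
            echo "auth: network path is fine, authentication failed (check keys and sshd settings)" ;;
        *)
            echo "unknown: no match for this message"
            return 1 ;;
    esac
}
```

One way to feed it: `classify_ssh_error "$(ssh -o ConnectTimeout=5 -o BatchMode=yes cal@<vm-ip> true 2>&1)"`.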

SSH Key Authentication Failures

Symptoms:

  • Password prompts despite key installation
  • "Permission denied (publickey)"
  • "No more authentication methods"

Diagnosis:

# Verbose SSH debugging
ssh -vvv cal@<vm-ip>

# Check key files locally
ls -la ~/.ssh/homelab_rsa*
ls -la ~/.ssh/emergency_homelab_rsa*

# Via console or password auth, check VM
ls -la ~/.ssh/
cat ~/.ssh/authorized_keys

Solutions:

# Fix SSH directory permissions
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys
chown -R cal:cal ~/.ssh

# Re-deploy SSH keys
cat > ~/.ssh/authorized_keys << 'EOF'
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQC... # primary key
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQD... # emergency key  
EOF

# Verify SSH server configuration
sudo sshd -T | grep -E "(passwordauth|pubkeyauth|permitroot)"
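
With sshd's default `StrictModes yes`, keys are ignored when `~/.ssh` or `authorized_keys` is writable by group or others, which is why the permission fix above comes first. A minimal sketch that checks for those write bits; `ssh_perms_ok` is a hypothetical helper and it does not check ownership or the home directory itself:

```shell
# Hypothetical helper: verify that an .ssh directory and its authorized_keys
# file are not writable by group or others.
ssh_perms_ok() {
    dir="$1"
    for path in "$dir" "$dir/authorized_keys"; do
        if [ ! -e "$path" ]; then
            echo "missing: $path"
            return 1
        fi
        # Characters 6 and 9 of the ls mode string are the group/other write bits
        bits=$(ls -ld "$path" | cut -c6,9)
        if [ "$bits" != "--" ]; then
            echo "group/other writable: $path"
            return 1
        fi
    done
    return 0
}
```

Run it on the VM as `ssh_perms_ok ~/.ssh`; a non-zero exit status means the `chmod` commands above are needed.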

SSH Security Configuration Issues

Symptoms:

  • Password authentication still enabled
  • Root login allowed
  • Insecure SSH settings

Diagnosis:

# Check effective SSH configuration
sudo sshd -T | grep -E "(passwordauth|pubkeyauth|permitroot|allowusers)"

# Review SSH config files
cat /etc/ssh/sshd_config
ls /etc/ssh/sshd_config.d/

Solutions:

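# Apply security hardening
# Note: sshd keeps the first value it reads for each keyword, and drop-ins load in
# lexical order, so a lower-numbered file (e.g. 50-cloud-init.conf) can override this one.
sudo tee /etc/ssh/sshd_config.d/99-homelab-security.conf << 'EOF'
PasswordAuthentication no
PubkeyAuthentication yes
PermitRootLogin no
AllowUsers cal
ClientAliveInterval 300
ClientAliveCountMax 2
MaxAuthTries 3
X11Forwarding no
EOF

# Validate the configuration before restarting to avoid a lockout
sudo sshd -t && sudo systemctl restart sshd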
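
After a restart it is worth confirming that the effective settings match what was intended, since another config file may take precedence. A sketch that audits `sshd -T` output (which prints lowercase keywords) against the three key settings from this guide; `audit_sshd` is a hypothetical helper name:

```shell
# Hypothetical helper: compare `sshd -T` output (on stdin) with expected values.
# Prints each mismatch and exits non-zero if any setting is wrong or absent.
audit_sshd() {
    awk '
        BEGIN {
            want["passwordauthentication"] = "no"
            want["pubkeyauthentication"]   = "yes"
            want["permitrootlogin"]        = "no"
        }
        ($1 in want) {
            seen[$1] = 1
            if ($2 != want[$1]) {
                printf "%s is %s, expected %s\n", $1, $2, want[$1]
                bad = 1
            }
        }
        END {
            for (k in want) {
                if (!(k in seen)) {
                    printf "%s not reported\n", k
                    bad = 1
                }
            }
            exit (bad ? 1 : 0)
        }
    '
}
```

Possible usage: `sudo sshd -T | audit_sshd && echo "hardening active"`.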

Docker Installation and Configuration Issues

Docker Installation Failures

Package Installation Fails

Symptoms:

  • Docker packages not found
  • GPG key verification errors
  • Repository access failures

Diagnosis:

# Test internet connectivity
ping -c 3 google.com
curl -I https://download.docker.com

# Check repository configuration
cat /etc/apt/sources.list.d/docker.list
apt-cache policy docker-ce

# Check for package conflicts
dpkg -l | grep docker

Solutions:

# Remove conflicting packages
sudo apt remove -y docker docker-engine docker.io containerd runc

# Reinstall Docker repository
sudo mkdir -p /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg

echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list

# Install Docker
sudo apt update
sudo apt install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin

Docker Service Issues

Symptoms:

  • Docker daemon won't start
  • Socket connection errors
  • Service failure on boot

Diagnosis:

# Check service status
systemctl status docker
journalctl -u docker.service -f

# Check system resources
df -h
free -h

# Test daemon manually
sudo dockerd --debug

Solutions:

# Restart Docker service
sudo systemctl stop docker
sudo systemctl start docker
sudo systemctl enable docker

# Clear corrupted Docker data
# (the glob must expand as root; /var/lib/docker is not readable by regular users)
sudo systemctl stop docker
sudo sh -c 'rm -rf /var/lib/docker/tmp/*'
sudo systemctl start docker

# Reset Docker configuration
sudo mv /etc/docker/daemon.json /etc/docker/daemon.json.bak 2>/dev/null || true
sudo systemctl restart docker

Docker Permission and Access Issues

Permission Denied Errors

Symptoms:

  • Must use sudo for Docker commands
  • "Permission denied" when accessing Docker socket
  • User not in docker group

Diagnosis:

# Check user groups
groups
groups cal
getent group docker

# Check Docker socket permissions
ls -la /var/run/docker.sock

# Verify Docker service is running
systemctl status docker

Solutions:

# Create docker group if missing, then add user to it
sudo groupadd docker 2>/dev/null || true
sudo usermod -aG docker cal

# Apply group membership (requires logout/login or):
newgrp docker

# Fix socket permissions (Docker's default is root:docker, mode 660)
sudo chown root:docker /var/run/docker.sock
sudo chmod 660 /var/run/docker.sock
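
A common source of confusion here is that `id -nG cal` reads the group database, while plain `id -nG` reports the groups of the current login session, which does not change until the next login. A small sketch for checking either list; `user_in_group` is a hypothetical helper:

```shell
# Hypothetical helper: succeed if a group name appears in a space-separated list.
# $1 = group list (for example the output of `id -nG cal`), $2 = group to look for
user_in_group() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *)        return 1 ;;
    esac
}
```

For example, `user_in_group "$(id -nG cal)" docker` succeeding while `user_in_group "$(id -nG)" docker` fails suggests the membership exists but the session predates it.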

Network Configuration Problems

IP Address and Connectivity Issues

Incorrect IP Configuration

Symptoms:

  • VM has wrong IP address
  • No network connectivity
  • Cannot reach default gateway

Diagnosis:

# Check network configuration
ip addr show
ip route show
cat /etc/netplan/*.yaml

# Test connectivity
ping -c 3 $(ip route | grep default | awk '{print $3}')  # Gateway
ping -c 3 8.8.8.8  # External connectivity

Solutions:

# Fix netplan configuration
sudo tee /etc/netplan/00-installer-config.yaml << 'EOF'
network:
  version: 2
  ethernets:
    ens18:
      dhcp4: false
      addresses: [10.10.0.200/24]
      # gateway4 is deprecated in current netplan; use a default route instead
      routes:
        - to: default
          via: 10.10.0.1
      nameservers:
        addresses: [10.10.0.16, 8.8.8.8]
EOF

# Apply network configuration
sudo netplan apply
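
A static configuration only reaches its gateway if both addresses fall in the same subnet, which is easy to get wrong when editing by hand. A sketch of that check for IPv4, using plain shell arithmetic; `ip_to_int` and `same_subnet` are hypothetical helpers, and octets with leading zeros are not handled:

```shell
# Hypothetical helper: convert a dotted-quad IPv4 address to an integer.
ip_to_int() {
    local old_ifs="$IFS"
    IFS=.
    set -- $1            # split the address into its four octets
    IFS="$old_ifs"
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Hypothetical helper: succeed if the gateway lies in the address's subnet.
# $1 = address in CIDR form (e.g. 10.10.0.200/24), $2 = gateway address
same_subnet() {
    local addr="${1%/*}" prefix="${1#*/}" gw="$2" mask
    mask=$(( (0xFFFFFFFF << (32 - prefix)) & 0xFFFFFFFF ))
    [ $(( $(ip_to_int "$addr") & mask )) -eq $(( $(ip_to_int "$gw") & mask )) ]
}
```

With the values from the netplan example, `same_subnet 10.10.0.200/24 10.10.0.1` succeeds, while a gateway of `10.10.1.1` would fail the check.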

DNS Resolution Problems

Symptoms:

  • Cannot resolve domain names
  • Package downloads fail
  • Host lookup failures

Diagnosis:

# Check DNS configuration
cat /etc/resolv.conf
resolvectl status  # Replaces the older 'systemd-resolve --status'

# Test DNS resolution
nslookup google.com
dig google.com @8.8.8.8

Solutions:

# Fix DNS in netplan (see above example)
sudo netplan apply

# Temporary DNS fix
echo "nameserver 8.8.8.8" | sudo tee /etc/resolv.conf

# Restart DNS services
sudo systemctl restart systemd-resolved
sudo systemctl restart systemd-networkd  # netplan's default backend on Ubuntu Server
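
On systems running systemd-resolved, `/etc/resolv.conf` often lists only the local stub resolver `127.0.0.53`, in which case the real upstream servers have to be read from the resolver's own status output instead. A sketch for telling the two cases apart; both function names are hypothetical:

```shell
# Hypothetical helper: print the nameserver addresses from a resolv.conf file.
# $1 = path to the file (defaults to /etc/resolv.conf)
list_nameservers() {
    awk '$1 == "nameserver" { print $2 }' "${1:-/etc/resolv.conf}"
}

# Hypothetical helper: succeed if the only entry is the systemd-resolved stub.
only_stub_resolver() {
    [ "$(list_nameservers "$1")" = "127.0.0.53" ]
}
```

If `only_stub_resolver` succeeds, the file itself is fine and the upstream DNS servers configured in netplan are the place to look.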

System Maintenance Issues

Package Management Problems

Update Failures

Symptoms:

  • apt update fails
  • Repository signature errors
  • Dependency conflicts

Diagnosis:

# Check repository status
sudo apt update
apt-cache policy

# Check disk space
df -h /
df -h /var

# Check for held packages
apt-mark showhold

Solutions:

# Fix broken packages
sudo apt --fix-broken install
sudo dpkg --configure -a

# Clean package cache
sudo apt clean
sudo apt autoclean
sudo apt autoremove

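# Reset problematic repositories
# (apt-key is deprecated; on newer releases, store the key under /etc/apt/keyrings/
# and reference it with signed-by=, as in the Docker repository setup)
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys <KEYID>
sudo apt update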

Storage and Disk Space Issues

Disk Space Exhaustion

Symptoms:

  • Cannot install packages
  • Docker operations fail
  • System becomes unresponsive

Diagnosis:

# Check disk usage
df -h
du -sh /home/* /var/* /opt/* 2>/dev/null

# Find large files (stay on the root filesystem, regular files only)
sudo find / -xdev -type f -size +100M 2>/dev/null | head -20

Solutions:

# Clean system files
sudo apt clean
sudo apt autoremove
sudo journalctl --vacuum-time=7d

# Clean Docker data
# (CAUTION: deletes stopped containers, unused images, and unused volumes)
docker system prune -a -f
docker volume prune -f

# Extend disk (Proxmox GUI: Hardware -> Resize)
# Then extend filesystem:
sudo growpart /dev/sda 1
sudo resize2fs /dev/sda1
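
To catch exhaustion before the system becomes unresponsive, the `df` output can be filtered for filesystems above a usage threshold. A minimal sketch that parses POSIX-format `df -P` output; `full_filesystems` is a hypothetical helper, and mount points containing spaces are not handled:

```shell
# Hypothetical helper: list mount points at or above a usage threshold.
# $1 = threshold in percent; reads `df -P` output on stdin
full_filesystems() {
    awk -v limit="$1" '
        NR > 1 {
            use = $5
            sub(/%/, "", use)          # strip the percent sign from "Capacity"
            if (use + 0 >= limit + 0) print $6, use "%"
        }
    '
}
```

Possible usage: `df -P | full_filesystems 90`, which prints nothing when every filesystem is below 90% and could be wired into the health check or a notification.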

Emergency Recovery Procedures

SSH Access Recovery

Complete SSH Lockout

Recovery Steps:

  1. Use Proxmox console for direct VM access
  2. Reset SSH configuration:
    # Via console
    sudo cp /etc/ssh/sshd_config.backup /etc/ssh/sshd_config 2>/dev/null || true
    sudo systemctl restart sshd
    
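  3. Re-enable emergency access:
    # Temporary password access for recovery
    sudo passwd cal
    # On Ubuntu, drop-ins in sshd_config.d are read before the main file, so update both
    sudo sed -i 's/PasswordAuthentication no/PasswordAuthentication yes/' /etc/ssh/sshd_config /etc/ssh/sshd_config.d/*.conf
    sudo systemctl restart sshd
    # Set PasswordAuthentication back to "no" once key access is restored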
    

Emergency SSH Key Deployment

If primary keys fail:

# Use emergency key
ssh -i ~/.ssh/emergency_homelab_rsa cal@<vm-ip>

# Or deploy keys via console
mkdir -p ~/.ssh
chmod 700 ~/.ssh
cat > ~/.ssh/authorized_keys << 'EOF'
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQC... # primary key
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQD... # emergency key
EOF
chmod 600 ~/.ssh/authorized_keys

VM Recovery and Rebuild

Corrupt VM Recovery

Steps:

  1. Create snapshot before attempting recovery
  2. Export VM data:
    # Backup important data
    rsync -av cal@<vm-ip>:/home/cal/ ./vm-backup/
    
  3. Restore from template:
    # Delete corrupt VM (it must be stopped first)
    pvesh delete /nodes/pve/qemu/<vmid>
    
    # Clone from template
    pvesh create /nodes/pve/qemu/<template-id>/clone -newid <vmid> -name <vm-name>
    

Post-Install Script Recovery

If automation fails:

# Run in debug mode
bash -x ./scripts/vm-management/vm-post-install.sh <vm-ip> <user>

# Manual step execution
ssh cal@<vm-ip> 'sudo apt update && sudo apt upgrade -y'
ssh cal@<vm-ip> 'curl -fsSL https://get.docker.com | sh'
ssh cal@<vm-ip> 'sudo usermod -aG docker cal'

Prevention and Monitoring

Pre-Deployment Validation

# Verify prerequisites
ls -la ~/.ssh/homelab_rsa*
ls -la ~/.ssh/emergency_homelab_rsa*
ping -c 3 10.10.0.1

# Test cloud-init YAML
python3 -c "import yaml; yaml.safe_load(open('cloud-init-user-data.yaml'))"
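
The parse check above confirms the file is valid YAML, but cloud-init also requires `#cloud-config` as the first line before it treats user-data as cloud-config, and YAML does not allow tabs for indentation. A small pre-flight sketch covering both; `check_user_data` is a hypothetical helper and the file name passed to it is whatever your deployment uses:

```shell
# Hypothetical helper: basic pre-flight checks on a cloud-init user-data file.
check_user_data() {
    file="$1"
    tab=$(printf '\t')
    # cloud-init treats the file as cloud-config only if line 1 is "#cloud-config"
    if [ "$(head -n 1 "$file")" != "#cloud-config" ]; then
        echo "missing '#cloud-config' header on line 1"
        return 1
    fi
    # YAML forbids tab characters in indentation
    if grep -q "^ *${tab}" "$file"; then
        echo "tab character used for indentation"
        return 1
    fi
    return 0
}
```

For example, `check_user_data cloud-init-user-data.yaml && echo "header and indentation look fine"`.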

Health Monitoring Script

#!/bin/bash
# vm-health-check.sh
VM_IPS="10.10.0.200 10.10.0.201 10.10.0.202"
# Fail fast and never prompt, so an unreachable VM cannot stall the loop
SSH_OPTS="-o ConnectTimeout=5 -o BatchMode=yes"

for ip in $VM_IPS; do
    if ssh $SSH_OPTS "cal@$ip" 'uptime' >/dev/null 2>&1; then
        echo "✅ $ip: SSH OK"
        # Check Docker
        if ssh $SSH_OPTS "cal@$ip" 'docker info >/dev/null 2>&1'; then
            echo "✅ $ip: Docker OK"
        else
            echo "❌ $ip: Docker FAILED"
        fi
    else
        echo "❌ $ip: SSH FAILED"
    fi
done

Automated Backup

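#!/bin/bash
# vm-backup.sh
# Schedule in crontab: 0 2 * * * /path/to/vm-backup.sh
# cron starts jobs in the user's home directory, so use an absolute path (placeholder below)
BACKUP_ROOT="/path/to/backups"

for vm_ip in 10.10.0.{200..210}; do
    if ping -c1 -W2 "$vm_ip" >/dev/null 2>&1; then
        mkdir -p "$BACKUP_ROOT/$vm_ip"
        rsync -av --exclude='.cache' "cal@$vm_ip:/home/cal/" "$BACKUP_ROOT/$vm_ip/"
    fi
done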

Quick Reference Commands

Essential VM Management

# VM control via Proxmox
pvesh get /nodes/pve/qemu/<vmid>/status/current
pvesh create /nodes/pve/qemu/<vmid>/status/start
pvesh create /nodes/pve/qemu/<vmid>/status/stop

# SSH with alternative keys
ssh -i ~/.ssh/emergency_homelab_rsa cal@<vm-ip>

# System health checks
free -h && df -h && systemctl status docker
docker system info && docker system df

Recovery Resources

  • SSH Keys Backup: /mnt/NV2/ssh-keys/backup-*/
  • Proxmox Console: Direct VM access when SSH fails
  • Emergency Contact: Use Discord notifications for critical issues

This guide covers diagnosis and recovery procedures for VM management issues in a home lab environment.