claude-home/vm-management/troubleshooting.md
Cal Corum 10c9e0d854 CLAUDE: Migrate to technology-first documentation architecture
2025-08-12 23:20:15 -05:00


# Virtual Machine Management Troubleshooting Guide
## VM Provisioning Issues
### Cloud-Init Configuration Problems
#### Cloud-Init Not Executing
**Symptoms**:
- VM starts but user accounts not created
- SSH keys not deployed
- Packages not installed
- Configuration not applied
**Diagnosis**:
```bash
# Check cloud-init status and logs
ssh root@<vm-ip> 'cloud-init status --long'
ssh root@<vm-ip> 'cat /var/log/cloud-init.log'
ssh root@<vm-ip> 'cat /var/log/cloud-init-output.log'
# Verify cloud-init configuration
ssh root@<vm-ip> 'cloud-init query userdata'
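# Check for schema or YAML syntax errors
# (newer cloud-init releases spell this `cloud-init schema`; `devel schema` is the older form)
ssh root@<vm-ip> 'cloud-init devel schema --config-file /var/lib/cloud/instance/user-data.txt'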
```
**Solutions**:
```bash
# Re-run cloud-init (CAUTION: may overwrite changes)
ssh root@<vm-ip> 'cloud-init clean --logs'
ssh root@<vm-ip> 'cloud-init init --local'
ssh root@<vm-ip> 'cloud-init init'
ssh root@<vm-ip> 'cloud-init modules --mode=config'
ssh root@<vm-ip> 'cloud-init modules --mode=final'
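# Manual user creation if cloud-init fails
# (useradd aborts if a listed group is missing, so make sure docker exists first)
ssh root@<vm-ip> 'groupadd -f docker'
ssh root@<vm-ip> 'useradd -m -s /bin/bash -G sudo,docker cal'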
ssh root@<vm-ip> 'mkdir -p /home/cal/.ssh'
ssh root@<vm-ip> 'chown cal:cal /home/cal/.ssh'
ssh root@<vm-ip> 'chmod 700 /home/cal/.ssh'
```
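Whether to wait, re-run cloud-init, or fall back to manual setup depends on what `cloud-init status` reports. The following is a minimal sketch, assuming the output contains a line of the form `status: <state>`. The function name `cloud_init_next_step` and the suggested actions are illustrative and not part of cloud-init itself.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map `cloud-init status` output (read from stdin)
# to a suggested next step. States not listed here fall through to
# the "inspect" branch.
cloud_init_next_step() {
    local state
    # Take the text after "status:" on the first matching line.
    state=$(sed -n 's/^status:[[:space:]]*//p' | head -n 1)
    case "$state" in
        done)     echo "ok: cloud-init finished" ;;
        running)  echo "wait: cloud-init still running" ;;
        error)    echo "fix: check /var/log/cloud-init.log, then re-run" ;;
        disabled) echo "enable: cloud-init is disabled on this image" ;;
        *)        echo "inspect: unrecognized state '${state}'" ;;
    esac
}
```

It can be fed directly from the diagnosis command, for example `ssh root@<vm-ip> 'cloud-init status --long' | cloud_init_next_step`.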
#### Invalid Cloud-Init YAML
**Symptoms**:
- Cloud-init fails with syntax errors
- Parser errors in cloud-init logs
- Partial configuration application
**Common YAML Issues**:
```yaml
# ❌ Incorrect indentation
users:
- name: cal
groups: [sudo, docker] # Wrong: "groups" is parsed as a top-level key
# ✅ Correct indentation
users:
  - name: cal
    groups: [sudo, docker] # Nested under the list item
# ❌ Unquoted value (breaks if the key comment contains ": " or " #")
ssh_authorized_keys:
  - ssh-rsa AAAAB3NzaC1... user@host
# ✅ Quoted strings
ssh_authorized_keys:
  - "ssh-rsa AAAAB3NzaC1... user@host"
```
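Two structural problems can be caught before the file reaches a VM: a missing `#cloud-config` header line, which cloud-init uses to recognize cloud-config user-data, and tab characters in indentation, which YAML does not allow. The sketch below checks only those two things. The function name `check_user_data` is hypothetical, and the check does not replace a full schema validation.

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight check for a cloud-config user-data file.
# Prints one line per problem; returns 0 if none found, 1 otherwise.
check_user_data() {
    local file="$1" rc=0
    # cloud-config user-data is expected to start with this header line.
    if [ "$(head -n 1 "$file")" != "#cloud-config" ]; then
        echo "first line is not #cloud-config"
        rc=1
    fi
    # Flag any line whose leading whitespace contains a tab.
    if grep -q "^[[:space:]]*$(printf '\t')" "$file"; then
        echo "tab character used for indentation"
        rc=1
    fi
    return "$rc"
}
```

For example, `check_user_data cloud-init-user-data.yaml && echo "looks sane"` could run as part of a deployment script.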
### VM Boot and Startup Issues
#### VM Won't Start
**Symptoms**:
- VM fails to boot from Proxmox
- Kernel panic messages
- Boot loop or hanging
**Diagnosis**:
```bash
# Check VM configuration
pvesh get /nodes/pve/qemu/<vmid>/config
# Check resource allocation
pvesh get /nodes/pve/qemu/<vmid>/status/current
# Review VM logs via Proxmox console
# Use Proxmox web interface -> VM -> Console
# Check Proxmox host resources
pvesh get /nodes/pve/status
```
**Solutions**:
```bash
# Increase memory allocation
pvesh set /nodes/pve/qemu/<vmid>/config -memory 4096
# Reset CPU configuration
pvesh set /nodes/pve/qemu/<vmid>/config -cpu host -cores 2
# Check and repair disk
# Stop VM, then:
pvesh get /nodes/pve/qemu/<vmid>/config | grep scsi0
# Use fsck on the disk image if needed
```
#### Resource Constraints
**Symptoms**:
- VM extremely slow performance
- Out-of-memory kills
- Disk I/O bottlenecks
**Diagnosis**:
```bash
# Inside VM resource check
free -h
df -h
iostat 1 5
vmstat 1 5
# Proxmox host resource check
pvesh get /nodes/pve/status
cat /proc/meminfo
df -h /var/lib/vz
```
**Solutions**:
```bash
# Increase VM resources via Proxmox
pvesh set /nodes/pve/qemu/<vmid>/config -memory 8192
pvesh set /nodes/pve/qemu/<vmid>/config -cores 4
# Resize VM disk
# Proxmox GUI: Hardware -> Hard Disk -> Resize
# Then extend filesystem:
sudo growpart /dev/sda 1
sudo resize2fs /dev/sda1
```
## SSH Access Issues
### SSH Connection Failures
#### Cannot Connect to VM
**Symptoms**:
- Connection timeout
- Connection refused
- Host unreachable
**Diagnosis**:
```bash
# Network connectivity tests
ping <vm-ip>
traceroute <vm-ip>
# SSH service tests
nc -zv <vm-ip> 22
nmap -p 22 <vm-ip>
# From Proxmox console, check SSH service
systemctl status sshd
ss -tlnp | grep :22
```
**Solutions**:
```bash
# Via Proxmox console - start SSH and enable it at boot
# (the unit is "ssh" on Debian/Ubuntu, "sshd" on RHEL-family systems)
systemctl start ssh
systemctl enable ssh
# Check and configure firewall
ufw status
# If blocking SSH:
ufw allow ssh
ufw allow 22/tcp
# Network configuration reset
ip addr show
dhclient # For DHCP
netplan apply # Netplan systems; use `systemctl restart networking` with ifupdown
```
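The three symptoms listed for this issue usually point at different layers, so the SSH client's error text is a useful first triage signal. The sketch below maps common OpenSSH client messages to the layer most likely at fault. The function name `classify_ssh_error` and the wording of the hints are illustrative, and the mapping is a heuristic rather than a definitive diagnosis (a firewall REJECT rule, for instance, also produces "Connection refused").

```bash
#!/usr/bin/env bash
# Hypothetical helper: map an SSH client error message to the layer
# that most likely needs attention.
classify_ssh_error() {
    case "$1" in
        *"Connection refused"*)   echo "service: host is up, nothing accepting on port 22" ;;
        *"Connection timed out"*) echo "network: packets dropped (firewall or wrong IP)" ;;
        *"No route to host"*)     echo "network: host unreachable (VM down or routing)" ;;
        *"Permission denied"*)    echo "auth: sshd reachable, credentials rejected" ;;
        *)                        echo "unknown: re-run with ssh -vvv" ;;
    esac
}
```

One way to use it is to capture the client's stderr and pass it in, for example `classify_ssh_error "$(ssh -o BatchMode=yes cal@<vm-ip> true 2>&1)"`.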
#### SSH Key Authentication Failures
**Symptoms**:
- Password prompts despite key installation
- "Permission denied (publickey)"
- "No more authentication methods"
**Diagnosis**:
```bash
# Verbose SSH debugging
ssh -vvv cal@<vm-ip>
# Check key files locally
ls -la ~/.ssh/homelab_rsa*
ls -la ~/.ssh/emergency_homelab_rsa*
# Via console or password auth, check VM
ls -la ~/.ssh/
cat ~/.ssh/authorized_keys
```
**Solutions**:
```bash
# Fix SSH directory permissions
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys
chown -R cal:cal ~/.ssh
# Re-deploy SSH keys
cat > ~/.ssh/authorized_keys << 'EOF'
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQC... # primary key
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQD... # emergency key
EOF
# Verify SSH server configuration
sudo sshd -T | grep -E "(passwordauth|pubkeyauth|permitroot)"
```
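Permission problems are a frequent cause of silently ignored keys, so it can help to check the modes before changing anything. The sketch below compares a directory and its `authorized_keys` file against the modes this guide sets (700 and 600). The function name `check_ssh_perms` is hypothetical, and `stat -c` assumes GNU coreutils, as found on typical Linux VMs.

```bash
#!/usr/bin/env bash
# Hypothetical check: compare an SSH directory and its authorized_keys
# file against the modes used in this guide (700 and 600).
# Prints one line per mismatch; returns 0 if all match, 1 otherwise.
check_ssh_perms() {
    local dir="$1" rc=0 mode
    mode=$(stat -c '%a' "$dir")
    if [ "$mode" != "700" ]; then
        echo "$dir: mode $mode, expected 700"
        rc=1
    fi
    if [ -f "$dir/authorized_keys" ]; then
        mode=$(stat -c '%a' "$dir/authorized_keys")
        if [ "$mode" != "600" ]; then
            echo "$dir/authorized_keys: mode $mode, expected 600"
            rc=1
        fi
    fi
    return "$rc"
}
```

On the VM this would be called as `check_ssh_perms ~/.ssh`. Note that it is stricter than sshd itself, which mainly objects to group- or world-writable paths.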
#### SSH Security Configuration Issues
**Symptoms**:
- Password authentication still enabled
- Root login allowed
- Insecure SSH settings
**Diagnosis**:
```bash
# Check effective SSH configuration
sudo sshd -T | grep -E "(passwordauth|pubkeyauth|permitroot|allowusers)"
# Review SSH config files
cat /etc/ssh/sshd_config
ls /etc/ssh/sshd_config.d/
```
**Solutions**:
```bash
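# Apply security hardening
# Note: sshd uses the first value it reads for each keyword, so a
# lower-numbered file in sshd_config.d (e.g. one written by cloud-init)
# can override this one. Confirm the result with `sshd -T` afterwards.
sudo tee /etc/ssh/sshd_config.d/99-homelab-security.conf << 'EOF'
PasswordAuthentication no
PubkeyAuthentication yes
PermitRootLogin no
AllowUsers cal
ClientAliveInterval 300
ClientAliveCountMax 2
MaxAuthTries 3
X11Forwarding no
EOF
# Validate the configuration before restarting to avoid a lockout
sudo sshd -t && sudo systemctl restart sshd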
```
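After applying hardening, the effective configuration is what matters, not the contents of any single file. A small audit of `sshd -T` output can confirm it. The sketch below assumes the lowercase `keyword value` format that `sshd -T` prints. The function name `audit_sshd_config` and the choice of three audited settings are illustrative.

```bash
#!/usr/bin/env bash
# Hypothetical audit: read `sshd -T` output on stdin and report every
# audited setting whose effective value differs from the expected one.
# Returns 0 if all settings match, 1 otherwise.
audit_sshd_config() {
    awk '
        BEGIN {
            want["passwordauthentication"] = "no"
            want["pubkeyauthentication"]   = "yes"
            want["permitrootlogin"]        = "no"
        }
        # Remember the effective value of each audited keyword.
        ($1 in want) { seen[$1] = $2 }
        END {
            bad = 0
            for (k in want) {
                if (seen[k] != want[k]) {
                    printf "%s: got \"%s\", want \"%s\"\n", k, seen[k], want[k]
                    bad = 1
                }
            }
            exit bad
        }
    '
}
```

A typical invocation would be `sudo sshd -T | audit_sshd_config`. A keyword missing from the input is reported with an empty "got" value.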
## Docker Installation and Configuration Issues
### Docker Installation Failures
#### Package Installation Fails
**Symptoms**:
- Docker packages not found
- GPG key verification errors
- Repository access failures
**Diagnosis**:
```bash
# Test internet connectivity
ping google.com
curl -I https://download.docker.com
# Check repository configuration
cat /etc/apt/sources.list.d/docker.list
apt-cache policy docker-ce
# Check for package conflicts
dpkg -l | grep docker
```
**Solutions**:
```bash
# Remove conflicting packages
sudo apt remove -y docker docker-engine docker.io containerd runc
# Reinstall Docker repository
sudo mkdir -p /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list
# Install Docker
sudo apt update
sudo apt install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
```
#### Docker Service Issues
**Symptoms**:
- Docker daemon won't start
- Socket connection errors
- Service failure on boot
**Diagnosis**:
```bash
# Check service status
systemctl status docker
journalctl -u docker.service -f
# Check system resources
df -h
free -h
# Test daemon manually
sudo dockerd --debug
```
**Solutions**:
```bash
# Restart Docker service
sudo systemctl stop docker
sudo systemctl start docker
sudo systemctl enable docker
# Clear corrupted Docker data
sudo systemctl stop docker
# (the glob must expand as root: /var/lib/docker is not readable by regular users)
sudo sh -c 'rm -rf /var/lib/docker/tmp/*'
sudo systemctl start docker
# Reset Docker configuration
sudo mv /etc/docker/daemon.json /etc/docker/daemon.json.bak 2>/dev/null || true
sudo systemctl restart docker
```
### Docker Permission and Access Issues
#### Permission Denied Errors
**Symptoms**:
- Must use sudo for Docker commands
- "Permission denied" when accessing Docker socket
- User not in docker group
**Diagnosis**:
```bash
# Check user groups
groups
groups cal
getent group docker
# Check Docker socket permissions
ls -la /var/run/docker.sock
# Verify Docker service is running
systemctl status docker
```
**Solutions**:
```bash
# Create docker group if missing, then add the user
sudo groupadd docker 2>/dev/null || true
sudo usermod -aG docker cal
# Apply group membership (requires logout/login or):
newgrp docker
# Fix socket permissions (660 is the default; avoid granting world access)
sudo chown root:docker /var/run/docker.sock
sudo chmod 660 /var/run/docker.sock
```
## Network Configuration Problems
### IP Address and Connectivity Issues
#### Incorrect IP Configuration
**Symptoms**:
- VM has wrong IP address
- No network connectivity
- Cannot reach default gateway
**Diagnosis**:
```bash
# Check network configuration
ip addr show
ip route show
cat /etc/netplan/*.yaml
# Test connectivity
ping $(ip route | grep default | awk '{print $3}') # Gateway
ping 8.8.8.8 # External connectivity
```
**Solutions**:
```bash
# Fix netplan configuration
sudo tee /etc/netplan/00-installer-config.yaml << 'EOF'
network:
  version: 2
  ethernets:
    ens18:
      dhcp4: false
      addresses: [10.10.0.200/24]
      gateway4: 10.10.0.1 # deprecated in newer netplan; "routes" with "to: default" replaces it
      nameservers:
        addresses: [10.10.0.16, 8.8.8.8]
EOF
# Apply network configuration (`netplan try` rolls back automatically if connectivity is lost)
sudo netplan apply
```
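A static configuration only works if the gateway lies inside the subnet defined by the address and prefix length, which is easy to get wrong when editing by hand. The sketch below checks that for IPv4. The helper names `ip_to_int` and `gateway_in_subnet` are hypothetical, and the code assumes well-formed dotted-quad input without leading zeros.

```bash
#!/usr/bin/env bash
# Hypothetical helpers: verify that a gateway lies inside the subnet of
# an address given in CIDR notation (IPv4 only).

# Convert a dotted-quad address to a 32-bit integer.
ip_to_int() {
    local IFS=.
    # Intentionally unquoted: split the address on dots into $1..$4.
    set -- $1
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# Usage: gateway_in_subnet <address/prefix> <gateway>
# Returns 0 when both addresses share the same network part.
gateway_in_subnet() {
    local cidr="$1" gw="$2"
    local addr="${cidr%/*}" prefix="${cidr#*/}"
    # Build the netmask from the prefix length, e.g. /24 -> 0xFFFFFF00.
    local mask=$(( (0xFFFFFFFF << (32 - prefix)) & 0xFFFFFFFF ))
    [ $(( $(ip_to_int "$addr") & mask )) -eq $(( $(ip_to_int "$gw") & mask )) ]
}
```

With the values from the example configuration, `gateway_in_subnet 10.10.0.200/24 10.10.0.1` succeeds, while a gateway of `10.10.1.1` would fail the check.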
#### DNS Resolution Problems
**Symptoms**:
- Cannot resolve domain names
- Package downloads fail
- Host lookup failures
**Diagnosis**:
```bash
# Check DNS configuration
cat /etc/resolv.conf
resolvectl status # `systemd-resolve --status` on older releases
# Test DNS resolution
nslookup google.com
dig google.com @8.8.8.8
```
**Solutions**:
```bash
# Fix DNS in netplan (see above example)
sudo netplan apply
# Temporary DNS fix
echo "nameserver 8.8.8.8" | sudo tee /etc/resolv.conf
# Restart DNS services
sudo systemctl restart systemd-resolved
sudo netplan apply # or `sudo systemctl restart networking` on ifupdown systems
```
## System Maintenance Issues
### Package Management Problems
#### Update Failures
**Symptoms**:
- apt update fails
- Repository signature errors
- Dependency conflicts
**Diagnosis**:
```bash
# Check repository status
sudo apt update
apt-cache policy
# Check disk space
df -h /
df -h /var
# Check for held packages
apt-mark showhold
```
**Solutions**:
```bash
# Fix broken packages
sudo apt --fix-broken install
sudo dpkg --configure -a
# Clean package cache
sudo apt clean
sudo apt autoclean
sudo apt autoremove
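# Reset problematic repositories
# (apt-key is deprecated and absent from newer releases; there, store the key under
# /etc/apt/keyrings and reference it with signed-by, as in the Docker repository setup)
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys <KEYID>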
sudo apt update
```
### Storage and Disk Space Issues
#### Disk Space Exhaustion
**Symptoms**:
- Cannot install packages
- Docker operations fail
- System becomes unresponsive
**Diagnosis**:
```bash
# Check disk usage
df -h
du -sh /home/* /var/* /opt/* 2>/dev/null
# Find large files on the root filesystem (-xdev skips /proc and other mounts)
find / -xdev -type f -size +100M 2>/dev/null | head -20
```
**Solutions**:
```bash
# Clean system files
sudo apt clean
sudo apt autoremove
sudo journalctl --vacuum-time=7d
# Clean Docker data (CAUTION: removes all unused images, stopped containers, and unused volumes)
docker system prune -a -f
docker volume prune -f
# Extend disk (Proxmox GUI: Hardware -> Resize)
# Then extend filesystem:
sudo growpart /dev/sda 1
sudo resize2fs /dev/sda1
```
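Disk exhaustion is easier to handle when it is noticed early, so a threshold check on `df` output can run alongside other health checks. The sketch below reads POSIX-format output (`df -P`) and lists mount points at or above a given usage percentage. The function name `disks_over_threshold` and the default of 90 are assumptions, and mount points containing spaces are not handled.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read `df -P` output on stdin and print every
# mount point whose usage is at or above the threshold (percent).
# Usage: df -P | disks_over_threshold [limit]   (limit defaults to 90)
disks_over_threshold() {
    local limit="${1:-90}"
    awk -v limit="$limit" '
        NR > 1 {                      # skip the header line
            use = $5
            sub(/%/, "", use)         # "93%" -> "93"
            if (use + 0 >= limit + 0) print $6, use "%"
        }
    '
}
```

For example, `df -P | disks_over_threshold 85` prints nothing when every filesystem is below 85% usage.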
## Emergency Recovery Procedures
### SSH Access Recovery
#### Complete SSH Lockout
**Recovery Steps**:
1. **Use Proxmox console** for direct VM access
2. **Reset SSH configuration**:
```bash
# Via console
sudo cp /etc/ssh/sshd_config.backup /etc/ssh/sshd_config 2>/dev/null || true
sudo systemctl restart sshd
```
3. **Re-enable emergency access**:
```bash
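# Temporary password access for recovery (revert once key access works)
sudo passwd cal
# Drop-ins in sshd_config.d usually take precedence over sshd_config, so update both
sudo sed -i 's/PasswordAuthentication no/PasswordAuthentication yes/' /etc/ssh/sshd_config /etc/ssh/sshd_config.d/*.conf
sudo systemctl restart sshd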
```
#### Emergency SSH Key Deployment
**If primary keys fail**:
```bash
# Use emergency key
ssh -i ~/.ssh/emergency_homelab_rsa cal@<vm-ip>
# Or deploy keys via console
mkdir -p ~/.ssh
chmod 700 ~/.ssh
cat > ~/.ssh/authorized_keys << 'EOF'
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQC... # primary key
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQD... # emergency key
EOF
chmod 600 ~/.ssh/authorized_keys
```
### VM Recovery and Rebuild
#### Corrupt VM Recovery
**Steps**:
1. **Create snapshot** before attempting recovery
2. **Export VM data**:
```bash
# Backup important data
rsync -av cal@<vm-ip>:/home/cal/ ./vm-backup/
```
3. **Restore from template**:
```bash
# Delete corrupt VM (must be stopped first; this destroys its disks)
pvesh delete /nodes/pve/qemu/<vmid>
# Clone from template
pvesh create /nodes/pve/qemu/<template-id>/clone -newid <vmid> -name <vm-name>
```
#### Post-Install Script Recovery
**If automation fails**:
```bash
# Run in debug mode
bash -x ./scripts/vm-management/vm-post-install.sh <vm-ip> <user>
# Manual step execution
ssh cal@<vm-ip> 'sudo apt update && sudo apt upgrade -y'
ssh cal@<vm-ip> 'curl -fsSL https://get.docker.com | sh'
ssh cal@<vm-ip> 'sudo usermod -aG docker cal'
```
## Prevention and Monitoring
### Pre-Deployment Validation
```bash
# Verify prerequisites
ls -la ~/.ssh/homelab_rsa*
ls -la ~/.ssh/emergency_homelab_rsa*
ping 10.10.0.1
# Test cloud-init YAML
python3 -c "import yaml; yaml.safe_load(open('cloud-init-user-data.yaml'))"
```
### Health Monitoring Script
```bash
#!/bin/bash
# vm-health-check.sh
VM_IPS="10.10.0.200 10.10.0.201 10.10.0.202"
SSH_OPTS="-o ConnectTimeout=5 -o BatchMode=yes"
for ip in $VM_IPS; do
    # $SSH_OPTS is intentionally unquoted so it splits into separate options
    if ssh $SSH_OPTS "cal@$ip" 'uptime' >/dev/null 2>&1; then
        echo "✅ $ip: SSH OK"
        # Check Docker
        if ssh $SSH_OPTS "cal@$ip" 'docker info >/dev/null 2>&1'; then
            echo "✅ $ip: Docker OK"
        else
            echo "❌ $ip: Docker FAILED"
        fi
    else
        echo "❌ $ip: SSH FAILED"
    fi
done
```
### Automated Backup
```bash
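#!/bin/bash
# vm-backup.sh - schedule in crontab: 0 2 * * * /path/to/vm-backup.sh
# Backups are written relative to the working directory; under cron that is
# the user's home directory unless the script changes it.
for vm_ip in 10.10.0.{200..210}; do
    if ping -c1 -W2 "$vm_ip" >/dev/null 2>&1; then
        mkdir -p "./backups/$vm_ip"
        rsync -av --exclude='.cache' "cal@$vm_ip:/home/cal/" "./backups/$vm_ip/"
    fi
done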
```
## Quick Reference Commands
### Essential VM Management
```bash
# VM control via Proxmox
pvesh get /nodes/pve/qemu/<vmid>/status/current
pvesh create /nodes/pve/qemu/<vmid>/status/start
pvesh create /nodes/pve/qemu/<vmid>/status/stop
# SSH with alternative keys
ssh -i ~/.ssh/emergency_homelab_rsa cal@<vm-ip>
# System health checks
free -h && df -h && systemctl status docker
docker system info && docker system df
```
### Recovery Resources
- **SSH Keys Backup**: `/mnt/NV2/ssh-keys/backup-*/`
- **Proxmox Console**: Direct VM access when SSH fails
- **Emergency Contact**: Use Discord notifications for critical issues
This guide covers diagnosis and recovery procedures for VM management issues in a home lab environment.