Virtual Machine Management Troubleshooting Guide
VM Provisioning Issues
Cloud-Init Configuration Problems
Cloud-Init Not Executing
Symptoms:
- VM starts but user accounts not created
- SSH keys not deployed
- Packages not installed
- Configuration not applied
Diagnosis:
# Check cloud-init status and logs
ssh root@<vm-ip> 'cloud-init status --long'
ssh root@<vm-ip> 'cat /var/log/cloud-init.log'
ssh root@<vm-ip> 'cat /var/log/cloud-init-output.log'
# Verify cloud-init configuration
ssh root@<vm-ip> 'cloud-init query userdata'
# Check for YAML syntax errors
# (newer cloud-init releases expose this as "cloud-init schema"; "devel schema" is the older form)
ssh root@<vm-ip> 'cloud-init devel schema --config-file /var/lib/cloud/instance/user-data.txt'
Solutions:
# Re-run cloud-init (CAUTION: may overwrite changes)
ssh root@<vm-ip> 'cloud-init clean --logs'
ssh root@<vm-ip> 'cloud-init init --local'
ssh root@<vm-ip> 'cloud-init init'
ssh root@<vm-ip> 'cloud-init modules --mode=config'
ssh root@<vm-ip> 'cloud-init modules --mode=final'
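# Manual user creation if cloud-init fails
# (useradd aborts if a listed group does not exist; drop docker from -G if Docker is not installed yet)
ssh root@<vm-ip> 'useradd -m -s /bin/bash -G sudo,docker cal'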
ssh root@<vm-ip> 'mkdir -p /home/cal/.ssh'
ssh root@<vm-ip> 'chown cal:cal /home/cal/.ssh'
ssh root@<vm-ip> 'chmod 700 /home/cal/.ssh'
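When working through the diagnosis commands above, the `status:` line is usually the first thing worth reading. The following is a minimal sketch, assuming the output of `cloud-init status --long` is piped in on stdin; the function name `cloud_init_hint` and the wording of the hints are illustrative and not part of cloud-init itself.

```shell
# cloud_init_hint: read `cloud-init status --long` output on stdin and
# print a suggested next step. Name and messages are illustrative only.
cloud_init_hint() {
    # Take the value of the first line that starts with "status:"
    status=$(sed -n 's/^status:[[:space:]]*//p' | head -n 1)
    case "$status" in
        "done")     echo "done: finished; review /var/log/cloud-init-output.log for module warnings" ;;
        "running")  echo "running: still in progress; wait with: cloud-init status --wait" ;;
        "error")    echo "error: look for the failing module in /var/log/cloud-init.log" ;;
        "disabled") echo "disabled: cloud-init is turned off on this image" ;;
        *)          echo "unknown: no recognised status line; read the logs directly" ;;
    esac
}
```

Typical use would be `ssh root@<vm-ip> 'cloud-init status --long' | cloud_init_hint`.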
Invalid Cloud-Init YAML
Symptoms:
- Cloud-init fails with syntax errors
- Parser errors in cloud-init logs
- Partial configuration application
Common YAML Issues:
# ❌ Incorrect indentation
users:
  - name: cal
  groups: [sudo, docker] # Wrong indentation
# ✅ Correct indentation
users:
  - name: cal
    groups: [sudo, docker] # Proper indentation
# ❌ Missing quotes for special characters
ssh_authorized_keys:
  - ssh-rsa AAAAB3NzaC1... user@host # May fail with special chars
# ✅ Quoted strings
ssh_authorized_keys:
  - "ssh-rsa AAAAB3NzaC1... user@host"
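Two problems that a YAML parser will not always report clearly are a missing `#cloud-config` first line (cloud-init then does not treat the file as cloud-config at all) and tab characters used for indentation (YAML forbids them). A small pre-flight check can catch both before a VM is created. This is a sketch under those two assumptions only; `lint_user_data` is a hypothetical helper name, and it does not replace full schema validation.

```shell
# lint_user_data FILE: print each problem found and return non-zero if any.
# Hypothetical helper; checks only the header line and tab indentation.
lint_user_data() {
    file=$1
    problems=0
    # cloud-config user data must begin with the literal line "#cloud-config"
    if [ "$(head -n 1 "$file" | tr -d '\r')" != "#cloud-config" ]; then
        echo "first line is not #cloud-config"
        problems=$((problems + 1))
    fi
    # YAML does not allow tab characters in indentation
    tab=$(printf '\t')
    if grep -q "^[ ]*${tab}" "$file"; then
        echo "tab character used for indentation"
        problems=$((problems + 1))
    fi
    [ "$problems" -eq 0 ]
}
```

For example, `lint_user_data cloud-init-user-data.yaml && echo "header and indentation look fine"`.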
VM Boot and Startup Issues
VM Won't Start
Symptoms:
- VM fails to boot from Proxmox
- Kernel panic messages
- Boot loop or hanging
Diagnosis:
# Check VM configuration
pvesh get /nodes/pve/qemu/<vmid>/config
# Check resource allocation
pvesh get /nodes/pve/qemu/<vmid>/status/current
# Review VM logs via Proxmox console
# Use Proxmox web interface -> VM -> Console
# Check Proxmox host resources
pvesh get /nodes/pve/status
Solutions:
# Increase memory allocation
pvesh set /nodes/pve/qemu/<vmid>/config -memory 4096
# Reset CPU configuration
pvesh set /nodes/pve/qemu/<vmid>/config -cpu host -cores 2
# Check and repair disk
# Stop VM, then:
pvesh get /nodes/pve/qemu/<vmid>/config | grep scsi0
# Use fsck on the disk image if needed
Resource Constraints
Symptoms:
- VM extremely slow performance
- Out-of-memory kills
- Disk I/O bottlenecks
Diagnosis:
# Inside VM resource check
free -h
df -h
iostat 1 5
vmstat 1 5
# Proxmox host resource check
pvesh get /nodes/pve/status
cat /proc/meminfo
df -h /var/lib/vz
Solutions:
# Increase VM resources via Proxmox
pvesh set /nodes/pve/qemu/<vmid>/config -memory 8192
pvesh set /nodes/pve/qemu/<vmid>/config -cores 4
# Resize VM disk
# Proxmox GUI: Hardware -> Hard Disk -> Resize
# Then extend filesystem:
sudo growpart /dev/sda 1
sudo resize2fs /dev/sda1
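To decide whether a memory increase is actually needed, it can help to reduce `/proc/meminfo` to a single number. The sketch below assumes the standard `MemTotal` and `MemAvailable` fields are present; the function name `mem_available_pct` is illustrative.

```shell
# mem_available_pct: read /proc/meminfo-style text on stdin and print
# MemAvailable as a whole-number percentage of MemTotal.
mem_available_pct() {
    awk '
        /^MemTotal:/     { total = $2 }
        /^MemAvailable:/ { avail = $2 }
        END {
            if (total > 0) printf "%d\n", avail * 100 / total
            else exit 1   # no MemTotal line found
        }
    '
}
```

Run it inside the VM as `mem_available_pct < /proc/meminfo`; a consistently low value alongside out-of-memory kills suggests the allocation is too small.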
SSH Access Issues
SSH Connection Failures
Cannot Connect to VM
Symptoms:
- Connection timeout
- Connection refused
- Host unreachable
Diagnosis:
# Network connectivity tests
ping -c 3 <vm-ip>
traceroute <vm-ip>
# SSH service tests
nc -zv <vm-ip> 22
nmap -p 22 <vm-ip>
# From Proxmox console, check SSH service
systemctl status sshd
ss -tlnp | grep :22
Solutions:
# Via Proxmox console - restart SSH
# (the unit is named "ssh" on Debian/Ubuntu and "sshd" on most other distributions)
systemctl start sshd
systemctl enable sshd
# Check and configure firewall
ufw status
# If blocking SSH:
ufw allow ssh
ufw allow 22/tcp
# Network configuration reset
ip addr show
dhclient # For DHCP
systemctl restart networking # ifupdown systems only; on netplan-based Ubuntu use: netplan apply
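The three symptoms listed for this issue point to different layers, so the exact error text from the SSH client is a useful first filter. The sketch below maps common OpenSSH client messages to a next step. The function name and the advice strings are illustrative, and the mapping is a rule of thumb (a firewall that rejects rather than drops packets also produces "Connection refused").

```shell
# classify_ssh_failure MESSAGE: map an ssh client error message to a
# likely cause. Heuristic only; name and wording are illustrative.
classify_ssh_failure() {
    case "$1" in
        *"Connection timed out"*|*"Operation timed out"*)
            echo "timeout: packets are being dropped; check the VM IP, routing, and firewall" ;;
        *"Connection refused"*)
            echo "refused: host is reachable but the port is closed; check the SSH service" ;;
        *"No route to host"*|*"Host is unreachable"*|*"Network is unreachable"*)
            echo "unreachable: no path to the VM; check that it is running and on the right network" ;;
        *)
            echo "other: rerun with ssh -vvv for details" ;;
    esac
}
```

For example: `classify_ssh_failure "$(ssh -o ConnectTimeout=5 -o BatchMode=yes cal@<vm-ip> true 2>&1)"`.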
SSH Key Authentication Failures
Symptoms:
- Password prompts despite key installation
- "Permission denied (publickey)"
- "No more authentication methods"
Diagnosis:
# Verbose SSH debugging
ssh -vvv cal@<vm-ip>
# Check key files locally
ls -la ~/.ssh/homelab_rsa*
ls -la ~/.ssh/emergency_homelab_rsa*
# Via console or password auth, check VM
ls -la ~/.ssh/
cat ~/.ssh/authorized_keys
Solutions:
# Fix SSH directory permissions
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys
chown -R cal:cal ~/.ssh
# Re-deploy SSH keys
cat > ~/.ssh/authorized_keys << 'EOF'
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQC... # primary key
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQD... # emergency key
EOF
# Verify SSH server configuration
sudo sshd -T | grep -E "(passwordauth|pubkeyauth|permitroot)"
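The permission fixes above can be verified with a short check that compares the actual modes against the values this guide uses (700 for the directory, 600 for `authorized_keys`). This is a sketch assuming GNU `stat` (the `-c '%a'` form); `check_ssh_perms` is a hypothetical helper name.

```shell
# check_ssh_perms DIR: report deviations from the modes used in this guide.
# Assumes GNU coreutils stat; returns non-zero if anything is off.
check_ssh_perms() {
    dir=$1
    rc=0
    [ "$(stat -c '%a' "$dir")" = "700" ] || { echo "$dir should be mode 700"; rc=1; }
    if [ -f "$dir/authorized_keys" ]; then
        [ "$(stat -c '%a' "$dir/authorized_keys")" = "600" ] \
            || { echo "$dir/authorized_keys should be mode 600"; rc=1; }
    else
        echo "$dir/authorized_keys is missing"
        rc=1
    fi
    return "$rc"
}
```

Run it on the VM as `check_ssh_perms ~/.ssh`. Ownership is not checked here, so `ls -la ~/.ssh/` remains useful alongside it.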
SSH Security Configuration Issues
Symptoms:
- Password authentication still enabled
- Root login allowed
- Insecure SSH settings
Diagnosis:
# Check effective SSH configuration
sudo sshd -T | grep -E "(passwordauth|pubkeyauth|permitroot|allowusers)"
# Review SSH config files
cat /etc/ssh/sshd_config
ls /etc/ssh/sshd_config.d/
Solutions:
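# Apply security hardening
# Note: sshd keeps the first value it reads for each option, so make sure no
# earlier file in /etc/ssh/sshd_config.d/ (or sshd_config itself) sets these differently
sudo tee /etc/ssh/sshd_config.d/99-homelab-security.conf << 'EOF'
PasswordAuthentication no
PubkeyAuthentication yes
PermitRootLogin no
AllowUsers cal
ClientAliveInterval 300
ClientAliveCountMax 2
MaxAuthTries 3
X11Forwarding no
EOF
# Validate the configuration first so a typo cannot lock you out
sudo sshd -t && sudo systemctl restart sshd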
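After applying the hardening, the effective configuration can be checked mechanically rather than by eye. The sketch below reads `sshd -T` output on stdin and compares three options against the values set above. `audit_sshd` is a hypothetical name, and the list of options is an assumption that can be extended.

```shell
# audit_sshd: read `sshd -T` output on stdin; print any option that
# differs from the expected hardened value and exit non-zero if so.
audit_sshd() {
    awk '
        BEGIN {
            want["passwordauthentication"] = "no"
            want["pubkeyauthentication"]   = "yes"
            want["permitrootlogin"]        = "no"
        }
        ($1 in want) {
            seen[$1] = 1
            if ($2 != want[$1]) {
                printf "%s is %s, expected %s\n", $1, $2, want[$1]
                bad = 1
            }
        }
        END {
            for (k in want) if (!(k in seen)) { printf "%s not reported\n", k; bad = 1 }
            exit (bad ? 1 : 0)
        }
    '
}
```

Usage on the VM: `sudo sshd -T | audit_sshd && echo "hardening is in effect"`.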
Docker Installation and Configuration Issues
Docker Installation Failures
Package Installation Fails
Symptoms:
- Docker packages not found
- GPG key verification errors
- Repository access failures
Diagnosis:
# Test internet connectivity
ping -c 3 google.com
curl -I https://download.docker.com
# Check repository configuration
cat /etc/apt/sources.list.d/docker.list
apt-cache policy docker-ce
# Check for package conflicts
dpkg -l | grep docker
Solutions:
# Remove conflicting packages
sudo apt remove -y docker docker-engine docker.io containerd runc
# Reinstall Docker repository
sudo mkdir -p /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list
# Install Docker
sudo apt update
sudo apt install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
Docker Service Issues
Symptoms:
- Docker daemon won't start
- Socket connection errors
- Service failure on boot
Diagnosis:
# Check service status
systemctl status docker
journalctl -u docker.service -f
# Check system resources
df -h
free -h
# Test daemon manually
sudo dockerd --debug
Solutions:
# Restart Docker service
sudo systemctl stop docker
sudo systemctl start docker
sudo systemctl enable docker
# Clear corrupted Docker data
sudo systemctl stop docker
sudo rm -rf /var/lib/docker/tmp/*
sudo systemctl start docker
# Reset Docker configuration
sudo mv /etc/docker/daemon.json /etc/docker/daemon.json.bak 2>/dev/null || true
sudo systemctl restart docker
Docker Permission and Access Issues
Permission Denied Errors
Symptoms:
- Must use sudo for Docker commands
- "Permission denied" when accessing Docker socket
- User not in docker group
Diagnosis:
# Check user groups
groups
groups cal
getent group docker
# Check Docker socket permissions
ls -la /var/run/docker.sock
# Verify Docker service is running
systemctl status docker
Solutions:
# Create docker group if missing
sudo groupadd docker 2>/dev/null || true
# Add user to docker group
sudo usermod -aG docker cal
# Apply group membership (requires logout/login or):
newgrp docker
# Fix socket permissions (Docker's default is root:docker, mode 660)
sudo chown root:docker /var/run/docker.sock
sudo chmod 660 /var/run/docker.sock
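A common source of confusion after `usermod -aG` is that the group database and the current login session disagree. The sketch below is a tiny filter for the output of `id -nG`; the name `has_group` is illustrative.

```shell
# has_group GROUP: read a space-separated group list (as printed by
# `id -nG`) on stdin and succeed if GROUP is in it.
has_group() {
    tr ' ' '\n' | grep -qxF -- "$1"
}
```

`id -nG cal | has_group docker` checks the group database, while `id -nG | has_group docker` (no user name) checks the groups of the current session. If the first succeeds and the second fails, the membership exists but the session predates it, which is the case the logout/login or `newgrp docker` step addresses.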
Network Configuration Problems
IP Address and Connectivity Issues
Incorrect IP Configuration
Symptoms:
- VM has wrong IP address
- No network connectivity
- Cannot reach default gateway
Diagnosis:
# Check network configuration
ip addr show
ip route show
cat /etc/netplan/*.yaml
# Test connectivity
ping -c 3 "$(ip route | awk '/^default/ {print $3; exit}')" # Gateway
ping -c 3 8.8.8.8 # External connectivity
Solutions:
# Fix netplan configuration
sudo tee /etc/netplan/00-installer-config.yaml << 'EOF'
network:
  version: 2
  ethernets:
    ens18:
      dhcp4: false
      addresses: [10.10.0.200/24]
      gateway4: 10.10.0.1
      nameservers:
        addresses: [10.10.0.16, 8.8.8.8]
EOF
# Apply network configuration
sudo netplan apply
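After applying the configuration, the gateway test from the diagnosis step is worth repeating. The sketch below extracts the gateway from `ip route` output so the check behaves predictably when there is no default route or more than one. `default_gateway` is a hypothetical helper name.

```shell
# default_gateway: read `ip route` output on stdin and print the first
# default gateway address; exit non-zero if there is no default route.
default_gateway() {
    awk '
        $1 == "default" && $2 == "via" { print $3; found = 1; exit }
        END { exit (found ? 0 : 1) }
    '
}
```

For example: `gw=$(ip route | default_gateway) && ping -c 3 "$gw" || echo "no default route configured"`.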
DNS Resolution Problems
Symptoms:
- Cannot resolve domain names
- Package downloads fail
- Host lookup failures
Diagnosis:
# Check DNS configuration
cat /etc/resolv.conf
resolvectl status # "systemd-resolve --status" on older releases
# Test DNS resolution
nslookup google.com
dig google.com @8.8.8.8
Solutions:
# Fix DNS in netplan (see above example)
sudo netplan apply
# Temporary DNS fix
echo "nameserver 8.8.8.8" | sudo tee /etc/resolv.conf
# Restart DNS services
sudo systemctl restart systemd-resolved
sudo systemctl restart networking # ifupdown systems only; not present on netplan-based Ubuntu
System Maintenance Issues
Package Management Problems
Update Failures
Symptoms:
- apt update fails
- Repository signature errors
- Dependency conflicts
Diagnosis:
# Check repository status
sudo apt update
apt-cache policy
# Check disk space
df -h /
df -h /var
# Check for held packages
apt-mark showhold
Solutions:
# Fix broken packages
sudo apt --fix-broken install
sudo dpkg --configure -a
# Clean package cache
sudo apt clean
sudo apt autoclean
sudo apt autoremove
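# Reset problematic repositories
# (apt-key is deprecated on current Debian/Ubuntu releases; prefer a keyring in
# /etc/apt/keyrings referenced with signed-by, as in the Docker repository setup)
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys <KEYID>
sudo apt update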
Storage and Disk Space Issues
Disk Space Exhaustion
Symptoms:
- Cannot install packages
- Docker operations fail
- System becomes unresponsive
Diagnosis:
# Check disk usage
df -h
du -sh /home/* /var/* /opt/* 2>/dev/null
# Find large files
find / -size +100M 2>/dev/null | head -20
Solutions:
# Clean system files
sudo apt clean
sudo apt autoremove
sudo journalctl --vacuum-time=7d
# Clean Docker data
docker system prune -a -f
docker volume prune -f
# Extend disk (Proxmox GUI: Hardware -> Resize)
# Then extend filesystem:
sudo growpart /dev/sda 1
sudo resize2fs /dev/sda1
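To catch disk exhaustion before it causes failures, the `df` check can be reduced to a list of filesystems above a threshold. This is a minimal sketch, assuming POSIX-format output from `df -P` and mount points without spaces. The name `full_filesystems` and the default of 90 percent are illustrative choices.

```shell
# full_filesystems [PERCENT]: read `df -P` output on stdin and print
# "mountpoint usage" for every filesystem at or above PERCENT (default 90).
full_filesystems() {
    threshold=${1:-90}
    awk -v limit="$threshold" '
        NR > 1 {
            use = $5
            sub(/%/, "", use)           # strip the trailing percent sign
            if (use + 0 >= limit + 0) print $6, $5
        }
    '
}
```

Usage: `df -P | full_filesystems 85`. Empty output means nothing is above the threshold.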
Emergency Recovery Procedures
SSH Access Recovery
Complete SSH Lockout
Recovery Steps:
- Use Proxmox console for direct VM access
- Reset SSH configuration:
# Via console
sudo cp /etc/ssh/sshd_config.backup /etc/ssh/sshd_config 2>/dev/null || true
sudo systemctl restart sshd
- Re-enable emergency access:
# Temporary password access for recovery
# (if hardening was applied through /etc/ssh/sshd_config.d/99-homelab-security.conf, make the same change in that file)
sudo passwd cal
sudo sed -i 's/PasswordAuthentication no/PasswordAuthentication yes/' /etc/ssh/sshd_config
sudo systemctl restart sshd
Emergency SSH Key Deployment
If primary keys fail:
# Use emergency key
ssh -i ~/.ssh/emergency_homelab_rsa cal@<vm-ip>
# Or deploy keys via console
mkdir -p ~/.ssh
chmod 700 ~/.ssh
cat > ~/.ssh/authorized_keys << 'EOF'
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQC... # primary key
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQD... # emergency key
EOF
chmod 600 ~/.ssh/authorized_keys
VM Recovery and Rebuild
Corrupt VM Recovery
Steps:
- Create snapshot before attempting recovery
- Export VM data:
# Backup important data
rsync -av cal@<vm-ip>:/home/cal/ ./vm-backup/
- Restore from template:
# Delete corrupt VM
pvesh delete /nodes/pve/qemu/<vmid>
# Clone from template
pvesh create /nodes/pve/qemu/<template-id>/clone -newid <vmid> -name <vm-name>
Post-Install Script Recovery
If automation fails:
# Run in debug mode
bash -x ./scripts/vm-management/vm-post-install.sh <vm-ip> <user>
# Manual step execution
ssh cal@<vm-ip> 'sudo apt update && sudo apt upgrade -y'
ssh cal@<vm-ip> 'curl -fsSL https://get.docker.com | sh'
ssh cal@<vm-ip> 'sudo usermod -aG docker cal'
Prevention and Monitoring
Pre-Deployment Validation
# Verify prerequisites
ls -la ~/.ssh/homelab_rsa*
ls -la ~/.ssh/emergency_homelab_rsa*
ping -c 3 10.10.0.1
# Test cloud-init YAML
python3 -c "import yaml; yaml.safe_load(open('cloud-init-user-data.yaml'))"
Health Monitoring Script
#!/bin/bash
# vm-health-check.sh
VM_IPS="10.10.0.200 10.10.0.201 10.10.0.202"
for ip in $VM_IPS; do
    if ssh -o ConnectTimeout=5 -o BatchMode=yes "cal@$ip" 'uptime' >/dev/null 2>&1; then
        echo "✅ $ip: SSH OK"
        # Check Docker
        if ssh "cal@$ip" 'docker info >/dev/null 2>&1'; then
            echo "✅ $ip: Docker OK"
        else
            echo "❌ $ip: Docker FAILED"
        fi
    else
        echo "❌ $ip: SSH FAILED"
    fi
done
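The health check script prints one line per check but always exits with status 0, which makes it awkward to use from cron or a notification hook. One possible approach is to pipe its output through a small summarizer. The sketch below assumes the "FAILED" marker used in the script's messages; `health_summary` is a hypothetical name.

```shell
# health_summary: read health-check output on stdin, print the number of
# lines containing "FAILED", and return non-zero if there were any.
health_summary() {
    # grep -c exits 1 when the count is 0, so guard it for `set -e` callers
    failures=$(grep -c 'FAILED' || true)
    echo "failures: ${failures}"
    [ "$failures" -eq 0 ]
}
```

For example: `./vm-health-check.sh | health_summary || echo "at least one VM needs attention"`.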
Automated Backup
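#!/bin/bash
# Schedule in crontab: 0 2 * * * /path/to/vm-backup.sh
# Note: ./backups is relative to the working directory (cron starts jobs in $HOME)
for vm_ip in 10.10.0.{200..210}; do
    if ping -c1 "$vm_ip" >/dev/null 2>&1; then
        # rsync does not create missing parent directories
        mkdir -p "./backups/$vm_ip"
        rsync -av --exclude='.cache' "cal@$vm_ip:/home/cal/" "./backups/$vm_ip/"
    fi
done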
Quick Reference Commands
Essential VM Management
# VM control via Proxmox
pvesh get /nodes/pve/qemu/<vmid>/status/current
pvesh create /nodes/pve/qemu/<vmid>/status/start
pvesh create /nodes/pve/qemu/<vmid>/status/stop
# SSH with alternative keys
ssh -i ~/.ssh/emergency_homelab_rsa cal@<vm-ip>
# System health checks
free -h && df -h && systemctl status docker
docker system info && docker system df
Recovery Resources
- SSH Keys Backup: /mnt/NV2/ssh-keys/backup-*/
- Proxmox Console: Direct VM access when SSH fails
- Emergency Contact: Use Discord notifications for critical issues
This guide covers diagnosis and recovery procedures for common VM management issues in home lab environments.