# Virtual Machine Management Troubleshooting Guide

## VM Provisioning Issues

### Cloud-Init Configuration Problems

#### Cloud-Init Not Executing
**Symptoms**:
- VM starts but user accounts not created
- SSH keys not deployed
- Packages not installed
- Configuration not applied

**Diagnosis**:
```bash
# Check cloud-init status and logs
ssh root@<vm-ip> 'cloud-init status --long'
ssh root@<vm-ip> 'cat /var/log/cloud-init.log'
ssh root@<vm-ip> 'cat /var/log/cloud-init-output.log'

# Verify cloud-init configuration
ssh root@<vm-ip> 'cloud-init query userdata'

# Check for YAML syntax errors
# (newer cloud-init releases expose this as 'cloud-init schema'; the 'devel' form is deprecated)
ssh root@<vm-ip> 'cloud-init devel schema --config-file /var/lib/cloud/instance/user-data.txt'
```
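
When the status check is scripted, the `status:` line can be pulled out of the command output and mapped to a next step. The sketch below assumes the usual `status: <state>` line printed by `cloud-init status`; the function names `cloud_init_state` and `cloud_init_next_step` and the advice strings are hypothetical, not part of cloud-init.

```bash
#!/usr/bin/env bash
# Hypothetical helper: extract the "status:" value from `cloud-init status`
# output supplied on stdin.
cloud_init_state() {
    awk -F': *' '$1 == "status" { print $2; exit }'
}

# Hypothetical helper: map a state to a suggested next step.
# The wording of the suggestions is illustrative only.
cloud_init_next_step() {
    case "$1" in
        done)     echo "cloud-init finished; review user-data if config is still missing" ;;
        running)  echo "cloud-init still running; wait or use 'cloud-init status --wait'" ;;
        error)    echo "cloud-init failed; read /var/log/cloud-init.log" ;;
        disabled) echo "cloud-init is disabled on this image" ;;
        *)        echo "unknown state: ${1:-<empty>}" ;;
    esac
}
```

A possible invocation: `cloud_init_next_step "$(ssh root@<vm-ip> 'cloud-init status --long' | cloud_init_state)"`.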

**Solutions**:
```bash
# Re-run cloud-init (CAUTION: may overwrite changes)
ssh root@<vm-ip> 'cloud-init clean --logs'
ssh root@<vm-ip> 'cloud-init init --local'
ssh root@<vm-ip> 'cloud-init init'
ssh root@<vm-ip> 'cloud-init modules --mode=config'
ssh root@<vm-ip> 'cloud-init modules --mode=final'

# Manual user creation if cloud-init fails
ssh root@<vm-ip> 'useradd -m -s /bin/bash -G sudo,docker cal'
ssh root@<vm-ip> 'mkdir -p /home/cal/.ssh'
ssh root@<vm-ip> 'chown cal:cal /home/cal/.ssh'
ssh root@<vm-ip> 'chmod 700 /home/cal/.ssh'
```

#### Invalid Cloud-Init YAML
**Symptoms**:
- Cloud-init fails with syntax errors
- Parser errors in cloud-init logs
- Partial configuration application

**Common YAML Issues**:
```yaml
# ❌ Incorrect indentation
users:
- name: cal
groups: [sudo, docker]  # Wrong indentation: not nested under the list item

# ✅ Correct indentation
users:
  - name: cal
    groups: [sudo, docker]  # Proper indentation

# ❌ Missing quotes for special characters
ssh_authorized_keys:
  - ssh-rsa AAAAB3NzaC1... user@host  # May fail with special chars

# ✅ Quoted strings
ssh_authorized_keys:
  - "ssh-rsa AAAAB3NzaC1... user@host"
```

### VM Boot and Startup Issues

#### VM Won't Start
**Symptoms**:
- VM fails to boot from Proxmox
- Kernel panic messages
- Boot loop or hanging

**Diagnosis**:
```bash
# Check VM configuration
pvesh get /nodes/pve/qemu/<vmid>/config

# Check resource allocation
pvesh get /nodes/pve/qemu/<vmid>/status/current

# Review VM logs via Proxmox console
# Use Proxmox web interface -> VM -> Console

# Check Proxmox host resources
pvesh get /nodes/pve/status
```

**Solutions**:
```bash
# Increase memory allocation
pvesh set /nodes/pve/qemu/<vmid>/config -memory 4096

# Reset CPU configuration
pvesh set /nodes/pve/qemu/<vmid>/config -cpu host -cores 2

# Check and repair disk
# Stop VM, then:
pvesh get /nodes/pve/qemu/<vmid>/config | grep scsi0
# Use fsck on the disk image if needed
```

#### Resource Constraints
**Symptoms**:
- VM extremely slow performance
- Out-of-memory kills
- Disk I/O bottlenecks

**Diagnosis**:
```bash
# Inside VM resource check
free -h
df -h
iostat 1 5
vmstat 1 5

# Proxmox host resource check
pvesh get /nodes/pve/status
cat /proc/meminfo
df -h /var/lib/vz
```

**Solutions**:
```bash
# Increase VM resources via Proxmox
pvesh set /nodes/pve/qemu/<vmid>/config -memory 8192
pvesh set /nodes/pve/qemu/<vmid>/config -cores 4

# Resize VM disk
# Proxmox GUI: Hardware -> Hard Disk -> Resize
# Then extend filesystem:
sudo growpart /dev/sda 1
sudo resize2fs /dev/sda1
```

## SSH Access Issues

### SSH Connection Failures

#### Cannot Connect to VM
**Symptoms**:
- Connection timeout
- Connection refused
- Host unreachable

**Diagnosis**:
```bash
# Network connectivity tests
ping <vm-ip>
traceroute <vm-ip>

# SSH service tests
nc -zv <vm-ip> 22
nmap -p 22 <vm-ip>

# From Proxmox console, check SSH service
systemctl status sshd
ss -tlnp | grep :22
```

**Solutions**:
```bash
# Via Proxmox console - start SSH and enable it at boot
# (on Debian/Ubuntu the unit is named "ssh"; "sshd" is only an alias and cannot be enabled)
systemctl start ssh
systemctl enable ssh

# Check and configure firewall
ufw status
# If blocking SSH:
ufw allow ssh
ufw allow 22/tcp

# Network configuration reset
ip addr show
dhclient  # For DHCP
systemctl restart networking  # ifupdown systems; on netplan systems use 'netplan apply'
```

#### SSH Key Authentication Failures
**Symptoms**:
- Password prompts despite key installation
- "Permission denied (publickey)"
- "No more authentication methods"

**Diagnosis**:
```bash
# Verbose SSH debugging
ssh -vvv cal@<vm-ip>

# Check key files locally
ls -la ~/.ssh/homelab_rsa*
ls -la ~/.ssh/emergency_homelab_rsa*

# Via console or password auth, check VM
ls -la ~/.ssh/
cat ~/.ssh/authorized_keys
```

**Solutions**:
```bash
# Fix SSH directory permissions
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys
chown -R cal:cal ~/.ssh

# Re-deploy SSH keys
cat > ~/.ssh/authorized_keys << 'EOF'
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQC... # primary key
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQD... # emergency key
EOF

# Verify SSH server configuration
sudo sshd -T | grep -E "(passwordauth|pubkeyauth|permitroot)"
```
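
The permission fixes above can be verified with a small check that reports anything that does not match the modes this guide applies (700 for the directory, 600 for `authorized_keys`). This is a minimal sketch: the function name `check_ssh_perms` is hypothetical, and it assumes GNU coreutils `stat` with the `-c` option, as shipped on Ubuntu and Debian.

```bash
#!/usr/bin/env bash
# Hypothetical helper: report permission problems in an SSH directory.
# Usage: check_ssh_perms [dir]   (defaults to ~/.ssh)
# Returns 0 if all modes match, 1 on a mismatch, 2 if the directory is missing.
check_ssh_perms() {
    local dir="${1:-$HOME/.ssh}" rc=0 mode

    # GNU stat: %a prints the octal permission bits
    mode="$(stat -c '%a' "$dir" 2>/dev/null)" || { echo "missing: $dir"; return 2; }
    if [ "$mode" != "700" ]; then
        echo "bad mode $mode on $dir (expected 700)"
        rc=1
    fi

    if [ -f "$dir/authorized_keys" ]; then
        mode="$(stat -c '%a' "$dir/authorized_keys")"
        if [ "$mode" != "600" ]; then
            echo "bad mode $mode on $dir/authorized_keys (expected 600)"
            rc=1
        fi
    else
        echo "missing: $dir/authorized_keys"
        rc=1
    fi
    return "$rc"
}
```

Run it on the VM as the affected user, for example `check_ssh_perms /home/cal/.ssh`, before and after applying the `chmod` commands.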

#### SSH Security Configuration Issues
**Symptoms**:
- Password authentication still enabled
- Root login allowed
- Insecure SSH settings

**Diagnosis**:
```bash
# Check effective SSH configuration
sudo sshd -T | grep -E "(passwordauth|pubkeyauth|permitroot|allowusers)"

# Review SSH config files
cat /etc/ssh/sshd_config
ls /etc/ssh/sshd_config.d/
```

**Solutions**:
```bash
# Apply security hardening
# ("Protocol 2" is omitted: the option is deprecated and modern OpenSSH only speaks protocol 2)
# sshd keeps the first value it reads for each keyword, so an earlier drop-in
# can override this file; re-check with 'sshd -T' afterwards
sudo tee /etc/ssh/sshd_config.d/99-homelab-security.conf << 'EOF'
PasswordAuthentication no
PubkeyAuthentication yes
PermitRootLogin no
AllowUsers cal
ClientAliveInterval 300
ClientAliveCountMax 2
MaxAuthTries 3
X11Forwarding no
EOF

# Validate the configuration before restarting to avoid a lockout
sudo sshd -t && sudo systemctl restart sshd
```
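
To confirm that the hardening actually took effect, the output of `sshd -T` can be compared against the expected values. The sketch below reads that output on stdin so it can be run locally or over SSH; the function name `check_sshd_hardening` is hypothetical, and the keyword list is limited to four of the settings used in this guide.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read `sshd -T` output on stdin and list settings that
# differ from the hardening values used in this guide.
# Prints one line per mismatch; returns 0 only when everything matches.
check_sshd_hardening() {
    awk '
        BEGIN {
            # sshd -T prints keywords in lowercase
            want["passwordauthentication"] = "no"
            want["pubkeyauthentication"]   = "yes"
            want["permitrootlogin"]        = "no"
            want["x11forwarding"]          = "no"
        }
        ($1 in want) {
            seen[$1] = 1
            if ($2 != want[$1]) {
                printf "%s is %s (expected %s)\n", $1, $2, want[$1]
                bad = 1
            }
        }
        END {
            # A keyword missing from the input is also treated as a failure
            for (k in want) {
                if (!(k in seen)) {
                    printf "%s not reported\n", k
                    bad = 1
                }
            }
            exit bad + 0
        }
    '
}
```

Typical use: `sudo sshd -T | check_sshd_hardening`. Reading from stdin keeps the check independent of how the configuration text is obtained.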

## Docker Installation and Configuration Issues

### Docker Installation Failures

#### Package Installation Fails
**Symptoms**:
- Docker packages not found
- GPG key verification errors
- Repository access failures

**Diagnosis**:
```bash
# Test internet connectivity
ping google.com
curl -I https://download.docker.com

# Check repository configuration
cat /etc/apt/sources.list.d/docker.list
apt-cache policy docker-ce

# Check for package conflicts
dpkg -l | grep docker
```

**Solutions**:
```bash
# Remove conflicting packages
sudo apt remove -y docker docker-engine docker.io containerd runc

# Reinstall Docker repository
sudo mkdir -p /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg

echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list

# Install Docker
sudo apt update
sudo apt install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
```

#### Docker Service Issues
**Symptoms**:
- Docker daemon won't start
- Socket connection errors
- Service failure on boot

**Diagnosis**:
```bash
# Check service status
systemctl status docker
journalctl -u docker.service -f

# Check system resources
df -h
free -h

# Test daemon manually
sudo dockerd --debug
```

**Solutions**:
```bash
# Restart Docker service
sudo systemctl stop docker
sudo systemctl start docker
sudo systemctl enable docker

# Clear corrupted Docker data
sudo systemctl stop docker
sudo rm -rf /var/lib/docker/tmp/*
sudo systemctl start docker

# Reset Docker configuration
sudo mv /etc/docker/daemon.json /etc/docker/daemon.json.bak 2>/dev/null || true
sudo systemctl restart docker
```

### Docker Permission and Access Issues

#### Permission Denied Errors
**Symptoms**:
- Must use sudo for Docker commands
- "Permission denied" when accessing Docker socket
- User not in docker group

**Diagnosis**:
```bash
# Check user groups
groups
groups cal
getent group docker

# Check Docker socket permissions
ls -la /var/run/docker.sock

# Verify Docker service is running
systemctl status docker
```

**Solutions**:
```bash
# Add user to docker group
sudo usermod -aG docker cal

# Create docker group if missing
sudo groupadd docker 2>/dev/null || true
sudo usermod -aG docker cal

# Apply group membership (requires logout/login or):
newgrp docker

# Fix socket permissions (660 is Docker's default; avoid world-readable modes)
sudo chown root:docker /var/run/docker.sock
sudo chmod 660 /var/run/docker.sock
```

## Network Configuration Problems

### IP Address and Connectivity Issues

#### Incorrect IP Configuration
**Symptoms**:
- VM has wrong IP address
- No network connectivity
- Cannot reach default gateway

**Diagnosis**:
```bash
# Check network configuration
ip addr show
ip route show
cat /etc/netplan/*.yaml

# Test connectivity
ping $(ip route | grep default | awk '{print $3}')  # Gateway
ping 8.8.8.8  # External connectivity
```

**Solutions**:
```bash
# Fix netplan configuration
sudo tee /etc/netplan/00-installer-config.yaml << 'EOF'
network:
  version: 2
  ethernets:
    ens18:
      dhcp4: false
      addresses: [10.10.0.200/24]
      # gateway4 is deprecated in newer netplan; use a "routes:" entry with "to: default" there
      gateway4: 10.10.0.1
      nameservers:
        addresses: [10.10.0.16, 8.8.8.8]
EOF

# Apply network configuration
# ('sudo netplan try' rolls back automatically if connectivity is lost)
sudo netplan apply
```

#### DNS Resolution Problems
**Symptoms**:
- Cannot resolve domain names
- Package downloads fail
- Host lookup failures

**Diagnosis**:
```bash
# Check DNS configuration
cat /etc/resolv.conf
resolvectl status  # older releases: systemd-resolve --status

# Test DNS resolution
nslookup google.com
dig google.com @8.8.8.8
```

**Solutions**:
```bash
# Fix DNS in netplan (see above example)
sudo netplan apply

# Temporary DNS fix
echo "nameserver 8.8.8.8" | sudo tee /etc/resolv.conf

# Restart DNS services
sudo systemctl restart systemd-resolved
sudo systemctl restart networking  # ifupdown systems only
```

## System Maintenance Issues

### Package Management Problems

#### Update Failures
**Symptoms**:
- apt update fails
- Repository signature errors
- Dependency conflicts

**Diagnosis**:
```bash
# Check repository status
sudo apt update
apt-cache policy

# Check disk space
df -h /
df -h /var

# Check for held packages
apt-mark showhold
```

**Solutions**:
```bash
# Fix broken packages
sudo apt --fix-broken install
sudo dpkg --configure -a

# Clean package cache
sudo apt clean
sudo apt autoclean
sudo apt autoremove

# Reset problematic repositories
# (apt-key is deprecated on current Ubuntu releases; prefer a keyring under
# /etc/apt/keyrings referenced with signed-by, as in the Docker repository setup)
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys <KEYID>
sudo apt update
```

### Storage and Disk Space Issues

#### Disk Space Exhaustion
**Symptoms**:
- Cannot install packages
- Docker operations fail
- System becomes unresponsive

**Diagnosis**:
```bash
# Check disk usage
df -h
du -sh /home/* /var/* /opt/* 2>/dev/null

# Find large files
find / -size +100M 2>/dev/null | head -20
```

**Solutions**:
```bash
# Clean system files
sudo apt clean
sudo apt autoremove
sudo journalctl --vacuum-time=7d

# Clean Docker data (CAUTION: removes unused images, stopped containers, and unused volumes)
docker system prune -a -f
docker volume prune -f

# Extend disk (Proxmox GUI: Hardware -> Resize)
# Then extend filesystem:
sudo growpart /dev/sda 1
sudo resize2fs /dev/sda1
```
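
Disk exhaustion is easier to handle when it is caught early. The following sketch wraps the `df` check in a threshold test; the function names `disk_usage_pct` and `check_disk` are hypothetical, and the default limit of 85% is an arbitrary illustrative value, not a recommendation from this guide.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the usage percentage (integer, no % sign) of the
# filesystem that holds the given path, parsed from POSIX-format `df -P` output.
disk_usage_pct() {
    df -P "${1:-/}" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Hypothetical helper: warn and return 1 when usage is at or above the limit.
# Returns 2 if usage could not be determined.
check_disk() {
    local path="${1:-/}" limit="${2:-85}" pct
    pct="$(disk_usage_pct "$path")"
    [ -n "$pct" ] || { echo "cannot read usage for $path"; return 2; }
    if [ "$pct" -ge "$limit" ]; then
        echo "WARNING: $path is ${pct}% full (limit ${limit}%)"
        return 1
    fi
    echo "OK: $path is ${pct}% full"
}
```

For example, `check_disk / 90 || docker system df` shows Docker's disk usage only when the root filesystem crosses 90%.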

## Emergency Recovery Procedures

### SSH Access Recovery

#### Complete SSH Lockout
**Recovery Steps**:
1. **Use Proxmox console** for direct VM access
2. **Reset SSH configuration**:
   ```bash
   # Via console
   sudo cp /etc/ssh/sshd_config.backup /etc/ssh/sshd_config 2>/dev/null || true
   sudo systemctl restart sshd
   ```
3. **Re-enable emergency access**:
   ```bash
   # Temporary password access for recovery
   sudo passwd cal
   sudo sed -i 's/PasswordAuthentication no/PasswordAuthentication yes/' /etc/ssh/sshd_config
   # On Ubuntu the drop-in directory is read before the rest of sshd_config,
   # so the hardening file from this guide must be updated as well
   sudo sed -i 's/PasswordAuthentication no/PasswordAuthentication yes/' /etc/ssh/sshd_config.d/99-homelab-security.conf 2>/dev/null || true
   sudo systemctl restart sshd
   ```

#### Emergency SSH Key Deployment
**If primary keys fail**:
```bash
# Use emergency key
ssh -i ~/.ssh/emergency_homelab_rsa cal@<vm-ip>

# Or deploy keys via console
mkdir -p ~/.ssh
chmod 700 ~/.ssh
cat > ~/.ssh/authorized_keys << 'EOF'
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQC... # primary key
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQD... # emergency key
EOF
chmod 600 ~/.ssh/authorized_keys
```

### VM Recovery and Rebuild

#### Corrupt VM Recovery
**Steps**:
1. **Create snapshot** before attempting recovery
2. **Export VM data**:
   ```bash
   # Backup important data
   rsync -av cal@<vm-ip>:/home/cal/ ./vm-backup/
   ```
3. **Restore from template**:
   ```bash
   # Delete corrupt VM
   pvesh delete /nodes/pve/qemu/<vmid>

   # Clone from template
   pvesh create /nodes/pve/qemu/<template-id>/clone -newid <vmid> -name <vm-name>
   ```

#### Post-Install Script Recovery
**If automation fails**:
```bash
# Run in debug mode
bash -x ./scripts/vm-management/vm-post-install.sh <vm-ip> <user>

# Manual step execution
ssh cal@<vm-ip> 'sudo apt update && sudo apt upgrade -y'
ssh cal@<vm-ip> 'curl -fsSL https://get.docker.com | sh'
ssh cal@<vm-ip> 'sudo usermod -aG docker cal'
```

## Prevention and Monitoring

### Pre-Deployment Validation
```bash
# Verify prerequisites
ls -la ~/.ssh/homelab_rsa*
ls -la ~/.ssh/emergency_homelab_rsa*
ping 10.10.0.1

# Test cloud-init YAML
python3 -c "import yaml; yaml.safe_load(open('cloud-init-user-data.yaml'))"
```

### Health Monitoring Script
```bash
#!/bin/bash
# vm-health-check.sh
VM_IPS="10.10.0.200 10.10.0.201 10.10.0.202"

for ip in $VM_IPS; do
    if ssh -o ConnectTimeout=5 -o BatchMode=yes "cal@$ip" 'uptime' >/dev/null 2>&1; then
        echo "✅ $ip: SSH OK"
        # Check Docker
        if ssh "cal@$ip" 'docker info >/dev/null 2>&1'; then
            echo "✅ $ip: Docker OK"
        else
            echo "❌ $ip: Docker FAILED"
        fi
    else
        echo "❌ $ip: SSH FAILED"
    fi
done
```

### Automated Backup
```bash
#!/bin/bash
# Schedule in crontab: 0 2 * * * /path/to/vm-backup.sh
for vm_ip in 10.10.0.{200..210}; do
    if ping -c1 "$vm_ip" >/dev/null 2>&1; then
        # rsync only creates the last path component, so make the target first
        mkdir -p "./backups/$vm_ip"
        rsync -av --exclude='.cache' "cal@$vm_ip:/home/cal/" "./backups/$vm_ip/"
    fi
done
```

## Quick Reference Commands

### Essential VM Management
```bash
# VM control via Proxmox
pvesh get /nodes/pve/qemu/<vmid>/status/current
pvesh create /nodes/pve/qemu/<vmid>/status/start
pvesh create /nodes/pve/qemu/<vmid>/status/stop

# SSH with alternative keys
ssh -i ~/.ssh/emergency_homelab_rsa cal@<vm-ip>

# System health checks
free -h && df -h && systemctl status docker
docker system info && docker system df
```

### Recovery Resources
- **SSH Keys Backup**: `/mnt/NV2/ssh-keys/backup-*/`
- **Proxmox Console**: Direct VM access when SSH fails
- **Emergency Contact**: Use Discord notifications for critical issues

This troubleshooting guide covers comprehensive recovery procedures for VM management issues in home lab environments.