cal/claude-home

Cal Corum 646991e1a9

docs: sync KB — troubleshooting.md

2026-03-25 00:00:43 -05:00

Virtual Machine Management Troubleshooting Guide

VM Provisioning Issues

Cloud-Init Configuration Problems

Cloud-Init Not Executing

Symptoms:

VM starts but user accounts not created
SSH keys not deployed
Packages not installed
Configuration not applied

Diagnosis:

# Check cloud-init status and logs
ssh root@<vm-ip> 'cloud-init status --long'
ssh root@<vm-ip> 'cat /var/log/cloud-init.log'
ssh root@<vm-ip> 'cat /var/log/cloud-init-output.log'

# Verify cloud-init configuration
ssh root@<vm-ip> 'cloud-init query userdata'

# Check for YAML syntax and schema errors
# (older cloud-init releases use 'cloud-init devel schema' instead)
ssh root@<vm-ip> 'cloud-init schema --config-file /var/lib/cloud/instance/user-data.txt'

Solutions:

# Re-run cloud-init (CAUTION: may overwrite changes)
ssh root@<vm-ip> 'cloud-init clean --logs'
ssh root@<vm-ip> 'cloud-init init --local'
ssh root@<vm-ip> 'cloud-init init'
ssh root@<vm-ip> 'cloud-init modules --mode=config'
ssh root@<vm-ip> 'cloud-init modules --mode=final'

# Manual user creation if cloud-init fails
ssh root@<vm-ip> 'useradd -m -s /bin/bash -G sudo,docker cal'
ssh root@<vm-ip> 'mkdir -p /home/cal/.ssh'
ssh root@<vm-ip> 'chown cal:cal /home/cal/.ssh'
ssh root@<vm-ip> 'chmod 700 /home/cal/.ssh'

Invalid Cloud-Init YAML

Symptoms:

Cloud-init fails with syntax errors
Parser errors in cloud-init logs
Partial configuration application

Common YAML Issues:

# ❌ Incorrect indentation
users:
- name: cal
groups: [sudo, docker]  # Wrong indentation

# ✅ Correct indentation  
users:
  - name: cal
    groups: [sudo, docker]  # Proper indentation

# ❌ Missing quotes for special characters
ssh_authorized_keys:
  - ssh-rsa AAAAB3NzaC1... user@host  # May fail with special chars

# ✅ Quoted strings
ssh_authorized_keys:
  - "ssh-rsa AAAAB3NzaC1... user@host"
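
Two cheap checks catch a large share of these failures before a VM is ever booted: cloud-init only treats user-data as cloud-config when the first line is `#cloud-config`, and YAML does not allow tabs for indentation. The following is a minimal sketch of such a check; the function name `check_cloud_config` is hypothetical and not part of cloud-init.

```shell
# Hypothetical helper: quick sanity checks on a cloud-config file.
# Returns 0 if both checks pass, 1 otherwise (reasons go to stderr).
check_cloud_config() {
    file="$1"
    status=0
    # cloud-init needs "#cloud-config" on line 1 to parse the file as cloud-config.
    if [ "$(head -n 1 "$file")" != "#cloud-config" ]; then
        echo "missing '#cloud-config' on line 1" >&2
        status=1
    fi
    # YAML forbids tab characters in indentation.
    if grep -q "$(printf '^[ ]*\t')" "$file"; then
        echo "tab used for indentation" >&2
        status=1
    fi
    return "$status"
}
```

Run it as `check_cloud_config cloud-init-user-data.yaml` before attaching the file to a VM. It does not replace a real YAML parse or schema validation; it only rules out two mistakes that are easy to miss by eye.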

VM Boot and Startup Issues

VM Won't Start

Symptoms:

VM fails to boot from Proxmox
Kernel panic messages
Boot loop or hanging

Diagnosis:

# Check VM configuration
pvesh get /nodes/pve/qemu/<vmid>/config

# Check resource allocation
pvesh get /nodes/pve/qemu/<vmid>/status/current

# Review VM logs via Proxmox console
# Use Proxmox web interface -> VM -> Console

# Check Proxmox host resources
pvesh get /nodes/pve/status

Solutions:

# Increase memory allocation
pvesh set /nodes/pve/qemu/<vmid>/config -memory 4096

# Reset CPU configuration
pvesh set /nodes/pve/qemu/<vmid>/config -cpu host -cores 2

# Check and repair disk
# Stop VM, then:
pvesh get /nodes/pve/qemu/<vmid>/config | grep scsi0
# Use fsck on the disk image if needed

Resource Constraints

Symptoms:

VM extremely slow performance
Out-of-memory kills
Disk I/O bottlenecks

Diagnosis:

# Inside VM resource check
free -h
df -h
iostat 1 5
vmstat 1 5

# Proxmox host resource check
pvesh get /nodes/pve/status
cat /proc/meminfo
df -h /var/lib/vz

Solutions:

# Increase VM resources via Proxmox
pvesh set /nodes/pve/qemu/<vmid>/config -memory 8192
pvesh set /nodes/pve/qemu/<vmid>/config -cores 4

# Resize VM disk
# Proxmox GUI: Hardware -> Hard Disk -> Resize
# Then extend filesystem:
sudo growpart /dev/sda 1
sudo resize2fs /dev/sda1

SSH Access Issues

SSH Connection Failures

Cannot Connect to VM

Symptoms:

Connection timeout
Connection refused
Host unreachable

Diagnosis:

# Network connectivity tests
ping <vm-ip>
traceroute <vm-ip>

# SSH service tests
nc -zv <vm-ip> 22
nmap -p 22 <vm-ip>

# From Proxmox console, check SSH service (the unit is named 'ssh' on Debian/Ubuntu)
systemctl status ssh
ss -tlnp | grep :22

Solutions:

# Via Proxmox console - start SSH now and on every boot
# ('systemctl enable sshd' fails on Debian/Ubuntu because sshd is only an alias)
systemctl start ssh
systemctl enable ssh

# Check and configure firewall
ufw status
# If blocking SSH:
ufw allow ssh
ufw allow 22/tcp

# Network configuration reset
ip addr show
dhclient  # For DHCP
netplan apply  # Netplan-based Ubuntu has no 'networking' service to restart

SSH Key Authentication Failures

Symptoms:

Password prompts despite key installation
"Permission denied (publickey)"
"No more authentication methods"

Diagnosis:

# Verbose SSH debugging
ssh -vvv cal@<vm-ip>

# Check key files locally
ls -la ~/.ssh/homelab_rsa*
ls -la ~/.ssh/emergency_homelab_rsa*

# Via console or password auth, check VM
ls -la ~/.ssh/
cat ~/.ssh/authorized_keys

Solutions:

# Fix SSH directory permissions
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys
chown -R cal:cal ~/.ssh

# Re-deploy SSH keys
cat > ~/.ssh/authorized_keys << 'EOF'
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQC... # primary key
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQD... # emergency key  
EOF

# Verify SSH server configuration
sudo sshd -T | grep -E "(passwordauth|pubkeyauth|permitroot)"
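
Before changing anything, it can help to confirm that permissions are actually the problem. The sketch below checks the two modes this section sets; `has_mode` and `check_ssh_perms` are hypothetical helper names, and `stat -c` assumes GNU coreutils as shipped on Ubuntu.

```shell
# Hypothetical helper: succeed only if the path has exactly the expected octal mode.
has_mode() {
    got=$(stat -c '%a' "$1" 2>/dev/null) || return 2
    [ "$got" = "$2" ]
}

# Print the chmod command needed for each path that deviates; return 1 if any did.
# The directory defaults to ~/.ssh.
check_ssh_perms() {
    dir="${1:-$HOME/.ssh}"
    rc=0
    has_mode "$dir" 700 || { echo "fix: chmod 700 $dir"; rc=1; }
    has_mode "$dir/authorized_keys" 600 || { echo "fix: chmod 600 $dir/authorized_keys"; rc=1; }
    return "$rc"
}
```

Running `check_ssh_perms` on the VM as the affected user prints nothing when both modes match. A missing `authorized_keys` file is reported the same way as a wrong mode, so check that the file exists if the suggested `chmod` fails.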

SSH Security Configuration Issues

Symptoms:

Password authentication still enabled
Root login allowed
Insecure SSH settings

Diagnosis:

# Check effective SSH configuration
sudo sshd -T | grep -E "(passwordauth|pubkeyauth|permitroot|allowusers)"

# Review SSH config files
cat /etc/ssh/sshd_config
ls /etc/ssh/sshd_config.d/

Solutions:

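# Apply security hardening
# sshd keeps the first value it reads for each keyword, so an earlier file in
# sshd_config.d/ (for example 50-cloud-init.conf) can override this one
sudo tee /etc/ssh/sshd_config.d/99-homelab-security.conf << 'EOF'
PasswordAuthentication no
PubkeyAuthentication yes
PermitRootLogin no
AllowUsers cal
ClientAliveInterval 300
ClientAliveCountMax 2
MaxAuthTries 3
X11Forwarding no
EOF

# Validate the configuration before restarting so a typo cannot lock you out
sudo sshd -t && sudo systemctl restart ssh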
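
Because several files can set the same keyword, the effective configuration is worth checking after the restart rather than trusting the drop-in. As a rough sketch, the hypothetical `audit_sshd` function below reads `sshd -T` output (lowercase `keyword value` lines) on stdin and flags the three settings this section cares about.

```shell
# Hypothetical helper: audit effective sshd settings read from stdin.
# Prints one line per deviation and exits non-zero if any were found.
audit_sshd() {
    awk '
        $1 == "passwordauthentication" && $2 != "no"  { print "passwordauthentication should be no";  bad = 1 }
        $1 == "permitrootlogin"        && $2 != "no"  { print "permitrootlogin should be no";         bad = 1 }
        $1 == "pubkeyauthentication"   && $2 != "yes" { print "pubkeyauthentication should be yes";   bad = 1 }
        END { exit bad }
    '
}
```

Typical use is `sudo sshd -T | audit_sshd`. Any output means another configuration file is still winning over the hardening file.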

Docker Installation and Configuration Issues

Docker Installation Failures

Package Installation Fails

Symptoms:

Docker packages not found
GPG key verification errors
Repository access failures

Diagnosis:

# Test internet connectivity
ping google.com
curl -I https://download.docker.com

# Check repository configuration
cat /etc/apt/sources.list.d/docker.list
apt-cache policy docker-ce

# Check for package conflicts
dpkg -l | grep docker

Solutions:

# Remove conflicting packages
sudo apt remove -y docker docker-engine docker.io containerd runc

# Reinstall Docker repository
sudo mkdir -p /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg

echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list

# Install Docker
sudo apt update
sudo apt install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin

Docker Service Issues

Symptoms:

Docker daemon won't start
Socket connection errors
Service failure on boot

Diagnosis:

# Check service status
systemctl status docker
journalctl -u docker.service -f

# Check system resources
df -h
free -h

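# Test daemon manually (stop the service first so the two instances do not conflict)
sudo systemctl stop docker docker.socket
sudo dockerd --debug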

Solutions:

# Restart Docker service
sudo systemctl stop docker
sudo systemctl start docker
sudo systemctl enable docker

# Clear corrupted Docker data
sudo systemctl stop docker
sudo rm -rf /var/lib/docker/tmp/*
sudo systemctl start docker

# Reset Docker configuration
sudo mv /etc/docker/daemon.json /etc/docker/daemon.json.bak 2>/dev/null || true
sudo systemctl restart docker

Docker Permission and Access Issues

Permission Denied Errors

Symptoms:

Must use sudo for Docker commands
"Permission denied" when accessing Docker socket
User not in docker group

Diagnosis:

# Check user groups
groups
groups cal
getent group docker

# Check Docker socket permissions
ls -la /var/run/docker.sock

# Verify Docker service is running
systemctl status docker

Solutions:

# Add user to docker group
sudo usermod -aG docker cal

# Create docker group if missing
sudo groupadd docker 2>/dev/null || true
sudo usermod -aG docker cal

# Apply group membership (requires logout/login or):
newgrp docker

# Fix socket permissions (the default is root:docker with mode 660)
sudo chown root:docker /var/run/docker.sock
sudo chmod 660 /var/run/docker.sock

Network Configuration Problems

IP Address and Connectivity Issues

Incorrect IP Configuration

Symptoms:

VM has wrong IP address
No network connectivity
Cannot reach default gateway

Diagnosis:

# Check network configuration
ip addr show
ip route show
cat /etc/netplan/*.yaml

# Test connectivity
ping $(ip route | grep default | awk '{print $3}')  # Gateway
ping 8.8.8.8  # External connectivity
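
The gateway one-liner above prints one address per default route, so on a VM with two default routes `ping` receives two arguments and fails. A slightly more defensive variant, sketched here with a hypothetical function name, reads `ip route` output on stdin and prints only the first gateway.

```shell
# Hypothetical helper: print the gateway of the first default route on stdin.
# Default routes without a "via" hop (for example point-to-point links) are skipped.
first_default_gateway() {
    awk '$1 == "default" {
        for (i = 2; i < NF; i++)
            if ($i == "via") { print $(i + 1); exit }
    }'
}
```

Usage would be `ping -c 3 "$(ip route show | first_default_gateway)"`. If the function prints nothing, the VM has no default route with a gateway, which is itself the diagnosis.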

Solutions:

# Fix netplan configuration
sudo tee /etc/netplan/00-installer-config.yaml << 'EOF'
network:
  version: 2
  ethernets:
    ens18:
      dhcp4: false
      addresses: [10.10.0.200/24]
      routes:  # replaces the deprecated gateway4 key
        - to: default
          via: 10.10.0.1
      nameservers:
        addresses: [10.10.0.16, 8.8.8.8]
EOF

# Apply network configuration
sudo netplan apply

DNS Resolution Problems

Symptoms:

Cannot resolve domain names
Package downloads fail
Host lookup failures

Diagnosis:

# Check DNS configuration
cat /etc/resolv.conf
resolvectl status  # 'systemd-resolve --status' on older releases

# Test DNS resolution
nslookup google.com
dig google.com @8.8.8.8

Solutions:

# Fix DNS in netplan (see above example)
sudo netplan apply

# Temporary DNS fix
echo "nameserver 8.8.8.8" | sudo tee /etc/resolv.conf

# Restart DNS services
sudo systemctl restart systemd-resolved
sudo systemctl restart systemd-networkd  # Netplan's default backend on Ubuntu Server

System Maintenance Issues

Package Management Problems

Update Failures

Symptoms:

apt update fails
Repository signature errors
Dependency conflicts

Diagnosis:

# Check repository status
sudo apt update
apt-cache policy

# Check disk space
df -h /
df -h /var

# Check for held packages
apt-mark showhold

Solutions:

# Fix broken packages
sudo apt --fix-broken install
sudo dpkg --configure -a

# Clean package cache
sudo apt clean
sudo apt autoclean
sudo apt autoremove

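# Reset problematic repositories
# apt-key is deprecated; for new repositories prefer a keyring under
# /etc/apt/keyrings referenced with signed-by, as in the Docker setup above
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys <KEYID>
sudo apt update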

Storage and Disk Space Issues

Disk Space Exhaustion

Symptoms:

Cannot install packages
Docker operations fail
System becomes unresponsive

Diagnosis:

# Check disk usage
df -h
du -sh /home/* /var/* /opt/* 2>/dev/null

# Find large files (regular files only, without crossing into other filesystems)
sudo find / -xdev -type f -size +100M 2>/dev/null | head -20

Solutions:

# Clean system files
sudo apt clean
sudo apt autoremove
sudo journalctl --vacuum-time=7d

# Clean Docker data (CAUTION: deletes all unused images and the data in unused volumes)
docker system prune -a -f
docker volume prune -f

# Extend disk (Proxmox GUI: Hardware -> Resize)
# Then extend filesystem:
sudo growpart /dev/sda 1
sudo resize2fs /dev/sda1
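
Exhaustion is easier to prevent than to clean up, so the `df` check from the diagnosis step can be turned into a threshold test. This is a sketch under a few assumptions: the function name `over_threshold` is hypothetical, input is `df -P` output on stdin, and mount points do not contain spaces.

```shell
# Hypothetical helper: print "mountpoint usage%" for every filesystem at or
# above the given percentage (default 90). Expects `df -P` output on stdin.
over_threshold() {
    limit="${1:-90}"
    awk -v limit="$limit" 'NR > 1 {
        use = $5
        sub(/%/, "", use)          # "95%" -> "95"
        if (use + 0 >= limit) print $6 " " $5
    }'
}
```

For example, `df -P | over_threshold 85` lists only the filesystems that need attention and prints nothing when all are below the limit, which makes it easy to call from a cron job or from the health check script.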

Emergency Recovery Procedures

SSH Access Recovery

Complete SSH Lockout

Recovery Steps:

Use Proxmox console for direct VM access

Reset SSH configuration:

# Via console
sudo cp /etc/ssh/sshd_config.backup /etc/ssh/sshd_config 2>/dev/null || true
sudo systemctl restart ssh

Re-enable emergency access:

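# Temporary password access for recovery (revert once key access works again)
sudo passwd cal
# Flip the setting in the main config and in any drop-in, such as the hardening file
sudo sed -i 's/^PasswordAuthentication no/PasswordAuthentication yes/' /etc/ssh/sshd_config /etc/ssh/sshd_config.d/*.conf
sudo systemctl restart ssh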

Emergency SSH Key Deployment

If primary keys fail:

# Use emergency key
ssh -i ~/.ssh/emergency_homelab_rsa cal@<vm-ip>

# Or deploy keys via console
mkdir -p ~/.ssh
chmod 700 ~/.ssh
cat > ~/.ssh/authorized_keys << 'EOF'
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQC... # primary key
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQD... # emergency key
EOF
chmod 600 ~/.ssh/authorized_keys

VM Recovery and Rebuild

Corrupt VM Recovery

Steps:

Create snapshot before attempting recovery

Export VM data:

# Backup important data
rsync -av cal@<vm-ip>:/home/cal/ ./vm-backup/

Restore from template:

# Delete corrupt VM
pvesh delete /nodes/pve/qemu/<vmid>

# Clone from template
pvesh create /nodes/pve/qemu/<template-id>/clone -newid <vmid> -name <vm-name>

Post-Install Script Recovery

If automation fails:

# Run in debug mode
bash -x ./scripts/vm-management/vm-post-install.sh <vm-ip> <user>

# Manual step execution
ssh cal@<vm-ip> 'sudo apt update && sudo apt upgrade -y'
ssh cal@<vm-ip> 'curl -fsSL https://get.docker.com | sh'
ssh cal@<vm-ip> 'sudo usermod -aG docker cal'

Prevention and Monitoring

Pre-Deployment Validation

# Verify prerequisites
ls -la ~/.ssh/homelab_rsa*
ls -la ~/.ssh/emergency_homelab_rsa*
ping 10.10.0.1

# Test cloud-init YAML
python3 -c "import yaml; yaml.safe_load(open('cloud-init-user-data.yaml'))"

Health Monitoring Script

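#!/bin/bash
# vm-health-check.sh
VM_IPS="10.10.0.200 10.10.0.201 10.10.0.202"
# Same options for every call so no check can hang or prompt for a password
SSH_OPTS="-o ConnectTimeout=5 -o BatchMode=yes"

for ip in $VM_IPS; do
    # $SSH_OPTS is left unquoted on purpose so it splits into separate options
    if ssh $SSH_OPTS "cal@$ip" 'uptime' >/dev/null 2>&1; then
        echo "✅ $ip: SSH OK"
        # Check Docker
        if ssh $SSH_OPTS "cal@$ip" 'docker info >/dev/null 2>&1'; then
            echo "✅ $ip: Docker OK"
        else
            echo "❌ $ip: Docker FAILED"
        fi
    else
        echo "❌ $ip: SSH FAILED"
    fi
done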

Automated Backup

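#!/bin/bash
# vm-backup.sh
# Schedule in crontab: 0 2 * * * /path/to/vm-backup.sh
for vm_ip in 10.10.0.{200..210}; do
    if ping -c1 -W2 "$vm_ip" >/dev/null 2>&1; then
        # Create the per-VM target first; rsync does not create missing parents
        mkdir -p "./backups/$vm_ip"
        rsync -av --exclude='.cache' "cal@$vm_ip:/home/cal/" "./backups/$vm_ip/"
    fi
done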

Quick Reference Commands

Essential VM Management

# VM control via Proxmox
pvesh get /nodes/pve/qemu/<vmid>/status/current
pvesh create /nodes/pve/qemu/<vmid>/status/start
pvesh create /nodes/pve/qemu/<vmid>/status/stop

# SSH with alternative keys
ssh -i ~/.ssh/emergency_homelab_rsa cal@<vm-ip>

# System health checks
free -h && df -h && systemctl status docker
docker system info && docker system df

Recovery Resources

SSH Keys Backup: /mnt/NV2/ssh-keys/backup-*/
Proxmox Console: Direct VM access when SSH fails
Emergency Contact: Use Discord notifications for critical issues

This troubleshooting guide covers comprehensive recovery procedures for VM management issues in home lab environments.

ubuntu-manticore crash recovery — initramfs fsck on wrong device (2026-03-24)

Severity: High — server unbootable, all services down (pihole DNS, jellyfin, tdarr, kb-rag)

Problem: After a physical server crash, ubuntu-manticore dropped to initramfs shell. Boot fsck targeted /dev/nvme0n1p2 (an NTFS data partition labeled "2TB 970") instead of the actual ext4 root on /dev/nvme1n1p2. The generic busybox fsck -y wrapper didn't invoke the ext4 backend.

Root Cause: Two issues compounded: (1) The crash corrupted the ext4 root filesystem (block/inode count mismatches across ~15 groups). (2) The initramfs resolved the root device UUID to the wrong NVMe drive (nvme0n1p2 instead of nvme1n1p2). NVMe device enumeration order can shift between boots; fstab references the root filesystem by UUID, but the initramfs still selected the wrong device on this boot.

Fix: Ran /usr/sbin/fsck.ext4 -y /dev/nvme1n1p2 directly from initramfs (identified correct partition via blkid). After exit, boot completed normally and all 9 Docker containers came up automatically via restart policies.

Crash cause investigation:

Kernel panic: BUG: unable to handle page fault for address: fffffb2320041d50 — supervisor write to not-present page
PCIe AER correctable errors (Data Link Layer Timeout) on port 0000:00:01.2 (AMD X470/B450 root port) logged on Mar 19
Nvidia proprietary driver loaded, kernel tainted — common source of page faults
AMD Zen1 DIV0 bug flagged at boot (Ryzen 5 2600)
SMART data: both Samsung 970s healthy (PASSED), but nvme1 (250GB root drive, 22k hours) has 1 Media and Data Integrity Error — monitor for growth
nvme0 (2TB data): 0 media errors, 2% used, 1,571 hours — clean
Most likely cause: Nvidia driver panic or PCIe timeout on NVMe controller

Remediation: Upgraded Nvidia driver 570.211.01 → 580.126.09. The 570 packages were in a broken state (partial upgrade left metapackage pinned at .24.04.1 while deps moved to .24.04.2), requiring explicit removal of nvidia-driver-570 nvidia-dkms-570 nvidia-kernel-source-570 nvidia-kernel-common-570 libnvidia-common-570 libnvidia-gl-570 with --allow-change-held-packages before 580 could install cleanly. Note: 590 drivers reported unstable — avoid.

Lesson:

Always use blkid in initramfs to confirm the actual root partition before running fsck — NVMe device ordering is not stable across boots
Use /usr/sbin/fsck.ext4 -y directly rather than the busybox fsck wrapper, which may not invoke the correct backend
Docker containers with restart policies recovered without intervention — validates that approach
Install smartmontools on bare-metal servers proactively — wasn't available during initial investigation
Monitor nvme1 media integrity error count; if it increments, plan replacement
When upgrading Nvidia driver major versions on Ubuntu, apt often can't resolve conflicts automatically — explicitly remove the old driver packages with --allow-change-held-packages first
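
The first lesson can be made mechanical by filtering `blkid` output by filesystem type instead of trusting device names. The sketch below assumes the usual `device: KEY="value" ...` line format and that `grep` and `cut` are available in the initramfs shell; the function name `devices_of_type` is hypothetical.

```shell
# Hypothetical helper: list devices that blkid reports with the given TYPE.
# Reads blkid output on stdin. The leading space in the pattern keeps
# SEC_TYPE="..." fields from matching.
devices_of_type() {
    grep " TYPE=\"$1\"" | cut -d: -f1
}
```

In the incident above, `blkid | devices_of_type ext4` would have listed /dev/nvme1n1p2 and left out the NTFS data partition, giving the argument for `/usr/sbin/fsck.ext4 -y`. If more than one ext4 partition is listed, compare the UUIDs against the one in fstab before running fsck.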
