---
title: Proxmox 7-to-9 Upgrade Plan
description: Two-phase Proxmox upgrade plan (7→8→9) with Phase 1 completed (2026-02-19). Covers backup procedures, upgrade execution, service startup order, lessons learned, rollback procedures, and Phase 2 planning.
type: runbook
domain: vm-management
tags:
  - proxmox
  - upgrade
  - pve
  - backup
  - rollback
  - infrastructure
---
# Proxmox VE Upgrade Plan: 7.1-7 → 9.1

## Executive Summary

- **Current State:** Proxmox VE 8.4.16 (kernel 6.8.12-18-pve) — Phase 1 complete
- **Target State:** Proxmox VE 9.1 (latest)
- **Upgrade Path:** Two-phase upgrade (7→8→9); a direct 7→9 upgrade is not supported
- **Total Timeline:** 3-4 weeks (including stabilization periods)
- **Total Downtime:** ~4 hours (~2 hours per phase)

### Phase 1 Status: COMPLETED (2026-02-19)

- Upgraded from PVE 7.4-20 → PVE 8.4.16
- Kernel: 5.13.19-2-pve → 6.8.12-18-pve
- Total downtime: ~45 minutes (upgrade + reboot + service startup)
- All services validated and running
- Stabilization period: monitoring through early March 2026

## Infrastructure Overview

**Production Services (7 LXC + 7 VMs) — cleaned up 2026-02-19:**

- **Critical:** Paper Dynasty/Major Domo (VM 115), Discord bots (VM 110), Gitea (LXC 225), n8n (LXC 210), Home Assistant (VM 109), Databases (VM 112), docker-home/Pi-hole 1 (VM 106)
- **Important:** Claude Discord Coordinator (LXC 301), arr-stack (LXC 221), Uptime Kuma (LXC 227), Foundry VTT (LXC 223), Memos (LXC 222)
- **Decommission Candidate:** docker-home-servers (VM 116) — Jellyfin-only after 2026-04-03 cleanup; watchstate removed (duplicate of manticore); see issue #31
- **Removed (2026-02-19):** 108 (ansible), 224 (openclaw), 300 (openclaw-migrated), 101/102/104/111/211 (game servers), 107 (plex), 113 (tdarr - moved to .226), 114 (duplicate arr-stack), 117 (unused), 100/103 (old templates), 105 (docker-vpn - decommissioned 2026-04)

**Key Constraints:**

- Home Assistant VM 109 requires dual network (vmbr1 for Matter support)
- All production Discord bots must minimize downtime
- Gitea mirrored to GitHub provides a backup
- TrueNAS backup mount at 10.10.0.35

## Phase 1: Proxmox 7.1 → 8.4 Upgrade

### Pre-Upgrade Preparation (1-2 days)

#### 1. Comprehensive Backups

All production guests (14 total after cleanup):

```bash
# Backup all to TrueNAS (PVE storage: home-truenas, mount: /mnt/pve/home-truenas)
# VMs
vzdump 106 --mode snapshot --storage home-truenas --compress zstd  # docker-home (pihole1, NPM)
vzdump 109 --mode snapshot --storage home-truenas --compress zstd  # homeassistant
vzdump 110 --mode snapshot --storage home-truenas --compress zstd  # discord-bots
vzdump 112 --mode snapshot --storage home-truenas --compress zstd  # databases
vzdump 115 --mode snapshot --storage home-truenas --compress zstd  # docker-sba (Paper Dynasty)
# LXCs
vzdump 210 --mode snapshot --storage home-truenas --compress zstd  # n8n
vzdump 221 --mode snapshot --storage home-truenas --compress zstd  # arr-stack
vzdump 222 --mode snapshot --storage home-truenas --compress zstd  # memos
vzdump 223 --mode snapshot --storage home-truenas --compress zstd  # foundry
vzdump 225 --mode snapshot --storage home-truenas --compress zstd  # gitea
vzdump 227 --mode snapshot --storage home-truenas --compress zstd  # uptime-kuma
vzdump 301 --mode snapshot --storage home-truenas --compress zstd  # claude-discord-coordinator
# Optional (stopped/investigate)
# vzdump 105 --mode snapshot --storage home-truenas --compress zstd  # docker-vpn (decommissioning)
# vzdump 116 --mode snapshot --storage home-truenas --compress zstd  # docker-home-servers (investigate)
```

**Backup Proxmox Configuration:**

```bash
# Already completed 2026-02-19 — refresh before upgrade
tar -czf /mnt/pve/home-truenas/dump/pve-config/pve-config-$(date +%Y%m%d).tar.gz /etc/pve/
cp /etc/network/interfaces /mnt/pve/home-truenas/dump/pve-config/interfaces.backup.$(date +%Y%m%d)
```

**Expected:** 2-4 hours, ~500GB-1TB of storage required
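Before moving on, it's worth confirming every dump actually landed. A minimal sketch, assuming the dumps land under `/mnt/pve/home-truenas/dump` (per the storage mount above); the `newest_dump` helper is hypothetical:

```bash
#!/usr/bin/env bash
# Confirm a vzdump archive exists for each production guest.
# DUMP_DIR is an assumption based on the storage mount used above.
DUMP_DIR="${DUMP_DIR:-/mnt/pve/home-truenas/dump}"
GUESTS="106 109 110 112 115 210 221 222 223 225 227 301"

newest_dump() {  # print the newest archive for a guest ID, empty if none
    ls -t "$DUMP_DIR"/vzdump-*-"$1"-*.zst 2>/dev/null | head -1
}

missing=0
for id in $GUESTS; do
    f=$(newest_dump "$id")
    if [ -z "$f" ]; then
        echo "MISSING: no backup found for guest $id"
        missing=$((missing + 1))
    else
        echo "OK: guest $id -> $(basename "$f")"
    fi
done
echo "$missing guest(s) without a backup"
```

Run it from the PVE host after the vzdump batch; anything reported `MISSING` should be re-backed-up before proceeding.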

#### 2. Pre-Upgrade Validation

```bash
# Run the Proxmox 7-to-8 checker
pve7to8 --full

# Update to latest PVE 7.4
apt update && apt dist-upgrade -y

# Verify minimum version
pveversion  # Must show 7.4-15 or higher

# Document current state
pvesh get /cluster/resources --type vm --output-format yaml > /mnt/truenas/proxmox/vm-inventory-pre-upgrade.yaml
```

#### 3. Maintenance Window Planning

- **Recommended Timing:** Overnight or early-morning weekend
- **Estimated Downtime:** 1.5-2.5 hours
- **Notifications Required:** Discord bot users, game server players

### Upgrade Execution (2-4 hours including downtime)

#### 1. Update to Latest PVE 7.4

```bash
apt update && apt dist-upgrade -y
pveversion  # Verify 7.4-XX
reboot
```

#### 2. Configure PVE 8 Repositories

```bash
# Backup current config
cp /etc/apt/sources.list /etc/apt/sources.list.pve7-backup
cp -a /etc/apt/sources.list.d/ /etc/apt/sources.list.d.pve7-backup/

# Update repositories (Bullseye → Bookworm)
sed -i 's/bullseye/bookworm/g' /etc/apt/sources.list
echo "deb http://download.proxmox.com/debian/pve bookworm pve-no-subscription" > /etc/apt/sources.list.d/pve-install-repo.list
sed -i 's/^deb/# deb/' /etc/apt/sources.list.d/pve-enterprise.list 2>/dev/null || true

apt update
```

#### 3. Execute Distribution Upgrade

```bash
apt dist-upgrade
# Duration: 15-45 minutes
# Accept new versions of /etc/issue
# Keep current versions of customized configs

reboot
```

#### 4. Verify PVE 8 Installation

```bash
pveversion  # Should show pve-manager/8.4-X
uname -r    # Should show 6.8.X-X-pve

# Verify services
systemctl status pve-cluster pvedaemon pveproxy pvestatd
pvesm status
```

### Post-Upgrade Validation

**Start services in dependency order** (stagger with 30s delays per Phase 1 lessons):

```bash
# Databases first
pvesh create /nodes/proxmox/qemu/112/status/start  # databases-bots
sleep 30

# Infrastructure + DNS
pvesh create /nodes/proxmox/qemu/106/status/start  # docker-home (pihole1, NPM)
pvesh create /nodes/proxmox/lxc/225/status/start   # gitea
pvesh create /nodes/proxmox/lxc/210/status/start   # n8n
pvesh create /nodes/proxmox/lxc/227/status/start   # uptime-kuma
sleep 30

# Applications
pvesh create /nodes/proxmox/qemu/115/status/start  # docker-sba (Paper Dynasty)
pvesh create /nodes/proxmox/qemu/110/status/start  # discord-bots
pvesh create /nodes/proxmox/lxc/301/status/start   # claude-discord-coordinator
sleep 30

# Restart Pi-hole container proactively (UDP DNS fix from Phase 1)
qm guest exec 106 -- docker restart pihole
sleep 10

# Media & Others
pvesh create /nodes/proxmox/qemu/109/status/start  # homeassistant
pvesh create /nodes/proxmox/lxc/221/status/start   # arr-stack
pvesh create /nodes/proxmox/lxc/222/status/start   # memos
pvesh create /nodes/proxmox/lxc/223/status/start   # foundry-lxc
```
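After the staggered start, a quick status sweep confirms every guest actually reached `running`. A sketch, assuming the single node name `proxmox` used in the startup commands; the `parse_status` helper is hypothetical:

```bash
#!/usr/bin/env bash
# Sweep all production guests and report their current status.
VMS="106 109 110 112 115"
CTS="210 221 222 223 225 227 301"

parse_status() {  # pull the "status" field out of pvesh JSON output
    grep -o '"status":"[a-z]*"' | cut -d'"' -f4
}

check() {  # check <qemu|lxc> <id>
    local st
    st=$(pvesh get "/nodes/proxmox/$1/$2/status/current" --output-format json | parse_status)
    echo "$1/$2: ${st:-unknown}"
}

for id in $VMS; do check qemu "$id"; done
for id in $CTS; do check lxc "$id"; done
```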

**Service Validation Checklist:**

- Discord bots responding in Discord
- Database connections working
- n8n workflows executing
- Gitea accessible at git.manticorum.com
- Home Assistant automations running
- Media servers streaming (Plex/Jellyfin)
- Web UI accessible and functional
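The HTTP items on this checklist can be scripted. A sketch: the expected codes mirror the Phase 1 results (200 for most services, 302 for Jellyfin and Uptime Kuma), but every address except the Gitea hostname is a placeholder to replace with the real service endpoints:

```bash
#!/usr/bin/env bash
# Probe each web service and compare the HTTP status against the expected one.
check_http() {  # check_http <url> <expected-code>
    local code
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$1")
    if [ "$code" = "$2" ]; then
        echo "OK   $1 ($code)"
    else
        echo "FAIL $1 (got $code, expected $2)"
    fi
}

check_http "https://git.manticorum.com" 200           # Gitea
check_http "http://n8n.example.lan:5678" 200          # placeholder URL
check_http "http://jellyfin.example.lan:8096" 302     # placeholder URL
check_http "http://uptime-kuma.example.lan:3001" 302  # placeholder URL
```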

### Stabilization Period

Wait 1-2 weeks before the PVE 9 upgrade.

Monitor for:

- VM/LXC stability
- Performance issues
- Service uptime
- Error logs

## Phase 1 Lessons Learned (2026-02-19)

**Issues encountered:**

1. **I/O storm on boot:** All 15 guests starting simultaneously caused massive I/O delay (~50% for several minutes). Stagger guest startup with delays between groups.
2. **Pi-hole 1 UDP DNS failed after boot:** Docker iptables NAT rules weren't fully set up; a container restart was required. TCP DNS worked immediately — only UDP was affected.
3. **Home Assistant IP changed:** HA on VM 109 got a new DHCP address (10.10.0.215 instead of its previous one). A DHCP reservation is needed to prevent this.
4. **Local machine DNS failover:** The desktop was configured with only one Pi-hole DNS server (10.10.0.226). When the Proxmox guests were shut down, the Pi-hole on the physical server at .226 should have kept working but did not resolve initially. Both Pi-holes are now configured as DNS servers.
5. **Some VMs ignored ACPI shutdown:** VMs 105 and 112 required the `--forceStop` flag.
6. **Several guests had `onboot=1`:** Many guests auto-started before they could be brought up in dependency order. Not harmful, but unexpected.

**What went well:**

- The `pve7to8 --full` checker caught everything — zero surprises during the upgrade
- `DEBIAN_FRONTEND=noninteractive apt dist-upgrade -y -o Dpkg::Options::='--force-confnew'` worked cleanly
- Reboot took ~4 minutes (longer than expected, but it completed without issues)
- All backups on TrueNAS were intact and accessible post-upgrade
- Local disk usage dropped from 57% to 14% after the upgrade (old kernels/packages cleaned up)

**Recommendations for Phase 2:**

- Stagger guest startup: add `sleep 30` between dependency groups
- Restart the Pi-hole Docker container proactively after boot
- Set a DHCP reservation for the HA VM before Phase 2
- Switch local DNS to public resolvers (1.1.1.1) before shutting down guests
- Disable `onboot` for all guests before the upgrade; re-enable after validation
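The last recommendation can be scripted so the previous `onboot` flags are recorded and restorable afterwards. A single-node sketch; the `save_and_disable` helper and the state-file path are assumptions:

```bash
#!/usr/bin/env bash
# Record each guest's onboot flag, then disable it so nothing auto-starts
# during the upgrade window. The state file lets you replay the old values.
STATE="${STATE:-/root/onboot-state-$(date +%Y%m%d).txt}"
: > "$STATE"

save_and_disable() {  # save_and_disable <qm|pct> <id>
    local flag
    flag=$("$1" config "$2" | awk -F': ' '/^onboot:/ {print $2}')
    echo "$1 $2 ${flag:-0}" >> "$STATE"
    "$1" set "$2" --onboot 0
}

for id in $(qm list 2>/dev/null | awk 'NR>1 {print $1}');  do save_and_disable qm  "$id"; done
for id in $(pct list 2>/dev/null | awk 'NR>1 {print $1}'); do save_and_disable pct "$id"; done

# After validation, re-enable with the saved values:
#   while read -r tool id flag; do "$tool" set "$id" --onboot "$flag"; done < "$STATE"
```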

## Phase 2: Proxmox 8.4 → 9.1 Upgrade

### Pre-Upgrade Preparation (1 day)

#### 1. LXC Compatibility Check (CRITICAL)

```bash
# Verify the systemd version in each LXC (must be > 230)
# (List reflects the current containers; 108/211/224/300 were removed in the cleanup)
for ct in 210 221 222 223 225 227 301; do
    echo "=== LXC $ct ==="
    pct exec $ct -- systemd --version | head -1
done
```

Pre-verified 2026-02-19 (all pass, updated after cleanup):

| LXC | Name                 | systemd | Status |
|-----|----------------------|---------|--------|
| 210 | n8n                  | 245     | Pass   |
| 221 | arr-stack            | 245     | Pass   |
| 222 | memos                | 245     | Pass   |
| 223 | foundry              | 245     | Pass   |
| 225 | gitea                | 245     | Pass   |
| 227 | uptime-kuma          | 249     | Pass   |
| 301 | claude-discord-coord | 249     | Pass   |

Expected: all compatible. Re-verify before Phase 2 in case any LXC OS has changed.
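For an unattended re-check, the same loop can be made pass/fail. A sketch; `version_ok` is a hypothetical helper and 230 is the threshold stated above:

```bash
#!/usr/bin/env bash
# Print pass/FAIL per container instead of eyeballing raw version strings.
version_ok() {  # version_ok <number> -> pass if strictly greater than 230
    if [ "$1" -gt 230 ] 2>/dev/null; then echo pass; else echo FAIL; fi
}

for ct in 210 221 222 223 225 227 301; do
    v=$(pct exec "$ct" -- systemd --version 2>/dev/null | awk 'NR==1 {print $2}')
    echo "LXC $ct: systemd ${v:-?} -> $(version_ok "${v:-0}")"
done
```

Non-numeric or missing output (e.g. a container that is stopped) is reported as `FAIL` rather than silently passing.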

#### 2. Fresh Backup Set

```bash
vzdump --all --mode snapshot --storage home-truenas --compress zstd
tar -czf /mnt/pve/home-truenas/dump/pve-config/pve8-config-$(date +%Y%m%d).tar.gz /etc/pve/
```

#### 3. Run the PVE 8-to-9 Checker

```bash
pve8to9 --full
```

### Upgrade Execution (2-4 hours including downtime)

#### 1. Configure PVE 9 Repositories

```bash
# Backup PVE 8 config
cp /etc/apt/sources.list /etc/apt/sources.list.pve8-backup
cp -a /etc/apt/sources.list.d/ /etc/apt/sources.list.d.pve8-backup/

# Update repositories (Bookworm → Trixie)
sed -i 's/bookworm/trixie/g' /etc/apt/sources.list
echo "deb http://download.proxmox.com/debian/pve trixie pve-no-subscription" > /etc/apt/sources.list.d/pve-install-repo.list
sed -i 's/^deb/# deb/' /etc/apt/sources.list.d/pve-enterprise.list 2>/dev/null || true

apt update
```

#### 2. Execute Distribution Upgrade

```bash
apt dist-upgrade
# Duration: 20-60 minutes

reboot
```

#### 3. Verify PVE 9 Installation

```bash
pveversion  # Should show pve-manager/9.1-X
uname -r    # Should show 6.14.X-X-pve

# Verify cgroup v2 (PVE 9 requirement)
mount | grep cgroup2

# Verify services
systemctl status pve-cluster pvedaemon pveproxy pvestatd
pvesm status
```

### Post-Upgrade Validation

Start and validate services using the same procedure as the PVE 8 upgrade.

**Additional PVE 9 checks:**

- Web UI with a cleared browser cache (Ctrl+Shift+R)
- Memory reporting (PVE 9 includes overhead in reported VM memory)
- Storage performance validation

## Rollback Procedures

### If the PVE 8 Upgrade Fails

**During dist-upgrade:**

```bash
apt --fix-broken install
dpkg --configure -a

# If unrecoverable:
cp /etc/apt/sources.list.pve7-backup /etc/apt/sources.list
cp -a /etc/apt/sources.list.d.pve7-backup/* /etc/apt/sources.list.d/
apt update && apt install pve-manager/7.4
```

**After a reboot to an unstable system:**

- Boot to the previous kernel via GRUB → Advanced options
- Roll back repositories as above

### If the PVE 9 Upgrade Fails

```bash
cp /etc/apt/sources.list.pve8-backup /etc/apt/sources.list
cp -a /etc/apt/sources.list.d.pve8-backup/* /etc/apt/sources.list.d/
apt update && apt dist-upgrade
reboot
```

### If a VM/LXC Won't Start

Restore from backup:

```bash
# LXC
pct restore <CTID> /mnt/truenas/proxmox/vzdump-lxc-<CTID>-*.tar.zst --storage local-lvm

# VM
qmrestore /mnt/truenas/proxmox/vzdump-qemu-<VMID>-*.vma.zst <VMID>
```
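Since the globs above match every dated archive for a guest, a small helper that picks the newest one keeps the restore command unambiguous. A sketch; `latest_backup` and the `BACKUP_DIR` default are assumptions following the paths above:

```bash
#!/usr/bin/env bash
# Resolve the newest vzdump archive for a guest before restoring it.
BACKUP_DIR="${BACKUP_DIR:-/mnt/truenas/proxmox}"

latest_backup() {  # latest_backup <vmid-or-ctid>
    ls -t "$BACKUP_DIR"/vzdump-*-"$1"-*.zst 2>/dev/null | head -1
}

# Example usage (VM 112):
#   qmrestore "$(latest_backup 112)" 112
```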

### Complete Reinstallation (Last Resort)

1. Reinstall Proxmox VE 9 from the ISO
2. Restore configs from /mnt/truenas/proxmox/pve-config-*/
3. Restore VMs/LXCs from backups
4. Reconfigure networking if needed

## Risk Assessment

| Component                  | Risk   | Impact                | Mitigation                                |
|----------------------------|--------|-----------------------|-------------------------------------------|
| Production bots (115, 110) | HIGH   | Service downtime      | Backup instance ready, notify users       |
| Databases (112)            | HIGH   | Data loss             | Multiple backups, test restore            |
| LXC systemd compatibility  | MEDIUM | Container won't start | Pre-verify versions, upgrade OS if needed |
| Network config             | MEDIUM | Connectivity loss     | Document config, console access           |
| n8n workflows (210)        | MEDIUM | Automation failures   | Export workflow configs                   |

**Low Risk:** Game servers, templates, unused services


## Post-Upgrade Tasks

### 1. Update Documentation

- Record upgrade completion in /mnt/NV2/Development/claude-home/vm-management/
- Update Proxmox version references
- Document issues encountered

### 2. Performance Validation

```bash
pvesh get /cluster/resources
```

### 3. Long-Term Monitoring

- Daily health checks
- Resource utilization trends
- Plan the next upgrade (PVE 9.x updates)

## Timeline Summary

| Phase               | Duration  | Downtime      | Activity                        |
|---------------------|-----------|---------------|---------------------------------|
| Pre-PVE8 prep       | 1-2 days  | None          | Backups, validation             |
| PVE 7→8 upgrade     | 2-4 hours | 1.5-2.5 hours | Repository update, upgrade      |
| PVE 8 stabilization | 1-2 weeks | None          | Monitor, validate               |
| Pre-PVE9 prep       | 1 day     | None          | LXC validation, backups         |
| PVE 8→9 upgrade     | 2-4 hours | 1.5-2.5 hours | Repository update, upgrade      |
| Post-upgrade        | 1-2 days  | None          | Documentation, optimization     |
| **TOTAL**           | 3-4 weeks | ~4 hours      | Full upgrade with stabilization |

## Critical Files

- `/etc/pve/qemu-server/*.conf` - VM configurations (backup critical)
- `/etc/pve/lxc/*.conf` - LXC configurations (backup critical)
- `/etc/network/interfaces` - Network config (document before changes)
- `/etc/apt/sources.list` - Repository config (will be modified)
- `/etc/apt/sources.list.d/pve-*.list` - Proxmox repos (will be modified)

## Verification Checklist

### Phase 1 (PVE 7→8) — Completed 2026-02-19

- [x] Proxmox version correct: pve-manager/8.4.16
- [x] Kernel version updated: 6.8.12-18-pve
- [x] All PVE services running (pve-cluster, pvedaemon, pveproxy, pvestatd)
- [x] Storage accessible: local, local-lvm, home-truenas all active
- [x] Network functional
- [x] All VMs/LXCs visible in UI
- [x] Critical VMs/LXCs started successfully
- [x] Discord bots responding (confirmed on .88)
- [x] Databases accessible (VM 112 running)
- [x] n8n workflows — HTTP 200
- [x] Gitea accessible — HTTP 200
- [x] Home Assistant functional — HTTP 200 (new IP: 10.10.0.215)
- [x] Jellyfin streaming — HTTP 302
- [x] Uptime Kuma — HTTP 302
- [x] Pi-hole 1 DNS resolving (after container restart)
- [x] Pi-hole 2 DNS resolving
- [x] Web UI functional — HTTP 200

### Phase 2 (PVE 8→9) — Pending

- [ ] Proxmox version correct (pveversion)
- [ ] Kernel version updated (uname -r)
- [ ] All services running (systemctl status pve-*)
- [ ] Storage accessible (pvesm status)
- [ ] Network functional (ip addr, ip route)
- [ ] All VMs/LXCs visible in UI
- [ ] Critical VMs/LXCs started successfully
- [ ] Discord bots responding
- [ ] Databases accessible
- [ ] n8n workflows running
- [ ] Gitea accessible
- [ ] Home Assistant functional
- [ ] Media streaming working
- [ ] Web UI functional (clear cache first)
