---
title: Proxmox 7-to-9 Upgrade Plan
description: Two-phase Proxmox upgrade plan (7→8→9) with Phase 1 completed (2026-02-19). Covers backup procedures, upgrade execution, service startup order, lessons learned, rollback procedures, and Phase 2 planning.
type: runbook
domain: vm-management
tags:
---
# Proxmox VE Upgrade Plan: 7.1-7 → 9.1

## Executive Summary

- **Current State:** Proxmox VE 8.4.16 (kernel 6.8.12-18-pve) — Phase 1 complete
- **Target State:** Proxmox VE 9.1 (latest)
- **Upgrade Path:** Two-phase upgrade (7→8→9) - direct upgrade not supported
- **Total Timeline:** 3-4 weeks (including stabilization periods)
- **Total Downtime:** ~4 hours (2 hours per phase)
### Phase 1 Status: COMPLETED (2026-02-19)
- Upgraded from PVE 7.4-20 → PVE 8.4.16
- Kernel: 5.13.19-2-pve → 6.8.12-18-pve
- Total downtime: ~45 minutes (upgrade + reboot + service startup)
- All services validated and running
- Stabilization period: monitoring through early March 2026
## Infrastructure Overview

**Production Services (7 LXC + 7 VMs) — cleaned up 2026-02-19:**
- Critical: Paper Dynasty/Major Domo (VM 115), Discord bots (VM 110), Gitea (LXC 225), n8n (LXC 210), Home Assistant (VM 109), Databases (VM 112), docker-home/Pi-hole 1 (VM 106)
- Important: Claude Discord Coordinator (LXC 301), arr-stack (LXC 221), Uptime Kuma (LXC 227), Foundry VTT (LXC 223), Memos (LXC 222)
- Decommission Candidate: docker-home-servers (VM 116) — Jellyfin-only after 2026-04-03 cleanup; watchstate removed (duplicate of manticore); see issue #31
- Removed (2026-02-19): 108 (ansible), 224 (openclaw), 300 (openclaw-migrated), 101/102/104/111/211 (game servers), 107 (plex), 113 (tdarr - moved to .226), 114 (duplicate arr-stack), 117 (unused), 100/103 (old templates), 105 (docker-vpn - decommissioned 2026-04)
**Key Constraints:**
- Home Assistant VM 109 requires dual network (vmbr1 for Matter support)
- All production Discord bots must minimize downtime
- Gitea mirrored to GitHub provides backup
- TrueNAS backup mount at 10.10.0.35
## Phase 1: Proxmox 7.1 → 8.4 Upgrade

### Pre-Upgrade Preparation (1-2 days)

#### 1. Comprehensive Backups
**All production guests (14 total after cleanup):**

```shell
# Backup all to TrueNAS (PVE storage: home-truenas, mount: /mnt/pve/home-truenas)
# VMs
vzdump 106 --mode snapshot --storage home-truenas --compress zstd # docker-home (pihole1, NPM)
vzdump 109 --mode snapshot --storage home-truenas --compress zstd # homeassistant
vzdump 110 --mode snapshot --storage home-truenas --compress zstd # discord-bots
vzdump 112 --mode snapshot --storage home-truenas --compress zstd # databases
vzdump 115 --mode snapshot --storage home-truenas --compress zstd # docker-sba (Paper Dynasty)
# LXCs
vzdump 210 --mode snapshot --storage home-truenas --compress zstd # n8n
vzdump 221 --mode snapshot --storage home-truenas --compress zstd # arr-stack
vzdump 222 --mode snapshot --storage home-truenas --compress zstd # memos
vzdump 223 --mode snapshot --storage home-truenas --compress zstd # foundry
vzdump 225 --mode snapshot --storage home-truenas --compress zstd # gitea
vzdump 227 --mode snapshot --storage home-truenas --compress zstd # uptime-kuma
vzdump 301 --mode snapshot --storage home-truenas --compress zstd # claude-discord-coordinator
# Optional (stopped/investigate)
# vzdump 105 --mode snapshot --storage home-truenas --compress zstd # docker-vpn (decommissioning)
# vzdump 116 --mode snapshot --storage home-truenas --compress zstd # docker-home-servers (investigate)
```
**Backup Proxmox Configuration:**

```shell
# Already completed 2026-02-19 — refresh before upgrade
tar -czf /mnt/pve/home-truenas/dump/pve-config/pve-config-$(date +%Y%m%d).tar.gz /etc/pve/
cp /etc/network/interfaces /mnt/pve/home-truenas/dump/pve-config/interfaces.backup.$(date +%Y%m%d)
```

**Expected:** 2-4 hours, ~500GB-1TB storage required
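As a quick sanity check after the vzdump runs, a sketch like the following confirms that each guest has a dump file and that it is non-empty. The `DUMP_DIR` default is an assumption based on the home-truenas mount noted above; `zstd -t <file>` gives a stronger integrity check where the zstd CLI is installed.

```shell
# Sanity-check: the newest dump file per guest ID must exist and be non-empty.
# DUMP_DIR is an assumption based on the home-truenas mount above.
DUMP_DIR=${DUMP_DIR:-/mnt/pve/home-truenas/dump}

verify_dump() {
  id=$1
  # Newest matching archive for this guest ID (VM or LXC)
  f=$(ls -t "$DUMP_DIR"/vzdump-*-"$id"-* 2>/dev/null | head -1)
  if [ -s "$f" ]; then
    echo "OK $id $f"
  else
    echo "MISSING $id"
  fi
}

for id in 106 109 110 112 115 210 221 222 223 225 227 301; do
  verify_dump "$id"
done
```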
#### 2. Pre-Upgrade Validation

```shell
# Run Proxmox 7-to-8 checker
pve7to8 --full
# Update to latest PVE 7.4
apt update && apt dist-upgrade -y
# Verify minimum version
pveversion # Must show 7.4-15 or higher
# Document current state
pvesh get /cluster/resources --type vm --output-format yaml > /mnt/truenas/proxmox/vm-inventory-pre-upgrade.yaml
```
#### 3. Maintenance Window Planning

- **Recommended Timing:** Overnight or early morning weekend
- **Estimated Downtime:** 1.5-2.5 hours
- **Notifications Required:** Discord bot users, game server players
### Upgrade Execution (2-4 hours including downtime)

#### 1. Update to Latest PVE 7.4

```shell
apt update && apt dist-upgrade -y
pveversion # Verify 7.4-XX
reboot
```
#### 2. Configure PVE 8 Repositories

```shell
# Backup current config
cp /etc/apt/sources.list /etc/apt/sources.list.pve7-backup
cp -a /etc/apt/sources.list.d/ /etc/apt/sources.list.d.pve7-backup/
# Update repositories (Bullseye → Bookworm)
sed -i 's/bullseye/bookworm/g' /etc/apt/sources.list
echo "deb http://download.proxmox.com/debian/pve bookworm pve-no-subscription" > /etc/apt/sources.list.d/pve-install-repo.list
sed -i 's/^deb/# deb/' /etc/apt/sources.list.d/pve-enterprise.list 2>/dev/null || true
apt update
```
#### 3. Execute Distribution Upgrade

```shell
apt dist-upgrade
# Duration: 15-45 minutes
# Accept new versions of /etc/issue
# Keep current versions of customized configs
reboot
```
#### 4. Verify PVE 8 Installation

```shell
pveversion # Should show pve-manager/8.4-X
uname -r # Should show 6.8.X-X-pve
# Verify services
systemctl status pve-cluster pvedaemon pveproxy pvestatd
pvesm status
```
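The service checks can be condensed into a small pass/fail loop; this is a sketch rather than part of the original procedure, and "OK" simply means `systemctl is-active` reports the unit as active.

```shell
# Print a one-line OK/DOWN summary per core PVE service.
svc_status() {
  if systemctl is-active --quiet "$1" 2>/dev/null; then
    echo "OK $1"
  else
    echo "DOWN $1"
  fi
}

for svc in pve-cluster pvedaemon pveproxy pvestatd; do
  svc_status "$svc"
done
```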
### Post-Upgrade Validation

**Start Services in Dependency Order** (stagger with 30s delays per Phase 1 lessons):
```shell
# Databases first
pvesh create /nodes/proxmox/qemu/112/status/start # databases-bots
sleep 30
# Infrastructure + DNS
pvesh create /nodes/proxmox/qemu/106/status/start # docker-home (pihole1, NPM)
pvesh create /nodes/proxmox/lxc/225/status/start # gitea
pvesh create /nodes/proxmox/lxc/210/status/start # n8n
pvesh create /nodes/proxmox/lxc/227/status/start # uptime-kuma
sleep 30
# Applications
pvesh create /nodes/proxmox/qemu/115/status/start # docker-sba (Paper Dynasty)
pvesh create /nodes/proxmox/qemu/110/status/start # discord-bots
pvesh create /nodes/proxmox/lxc/301/status/start # claude-discord-coordinator
sleep 30
# Restart Pi-hole container proactively (UDP DNS fix from Phase 1)
qm guest exec 106 -- docker restart pihole
sleep 10
# Media & Others
pvesh create /nodes/proxmox/qemu/109/status/start # homeassistant
pvesh create /nodes/proxmox/lxc/221/status/start # arr-stack
pvesh create /nodes/proxmox/lxc/222/status/start # memos
pvesh create /nodes/proxmox/lxc/223/status/start # foundry-lxc
```
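The same sequence can be written as a reusable loop. This is a sketch: `DRY_RUN=1` (the default here) only echoes the commands so it can be rehearsed safely off-host, and the group membership simply mirrors the explicit commands above.

```shell
# Start guests in dependency groups, sleeping between groups.
# DRY_RUN=1 (default) only echoes commands; set DRY_RUN=0 on the PVE host.
DRY_RUN=${DRY_RUN:-1}
DELAY=$([ "$DRY_RUN" = 1 ] && echo 0 || echo 30)

run() { if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi; }

start_group() {
  for entry in "$@"; do
    kind=${entry%%:*}   # qemu or lxc
    id=${entry#*:}
    run pvesh create "/nodes/proxmox/$kind/$id/status/start"
  done
  sleep "$DELAY"
}

start_group qemu:112                             # databases first
start_group qemu:106 lxc:225 lxc:210 lxc:227     # infrastructure + DNS
start_group qemu:115 qemu:110 lxc:301            # applications
run qm guest exec 106 -- docker restart pihole   # proactive Pi-hole UDP fix
start_group qemu:109 lxc:221 lxc:222 lxc:223     # media & others
```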
**Service Validation Checklist:**
- Discord bots responding in Discord
- Database connections working
- n8n workflows executing
- Gitea accessible at git.manticorum.com
- Home Assistant automations running
- Media servers streaming (Plex/Jellyfin)
- Web UI accessible and functional
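Parts of this checklist can be probed automatically with curl. This is a sketch: the endpoints and expected codes below are assumptions (only git.manticorum.com appears in the plan itself; the Home Assistant URL uses the post-upgrade IP and the default port), so substitute the real internal endpoints.

```shell
# Probe a URL and compare the HTTP status code to the expected one.
check() {
  url=$1; want=$2
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$url")
  if [ "$code" = "$want" ]; then
    echo "OK $url ($code)"
  else
    echo "FAIL $url (got $code, want $want)"
  fi
}

# Hypothetical endpoint list — adjust to the real services.
check https://git.manticorum.com 200   # Gitea
check http://10.10.0.215:8123 200      # Home Assistant (post-upgrade IP)
```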
### Stabilization Period

**Wait 1-2 weeks before the PVE 9 upgrade.** Monitor for:
- VM/LXC stability
- Performance issues
- Service uptime
- Error logs
## Phase 1 Lessons Learned (2026-02-19)

**Issues encountered:**
- I/O storm on boot: All 15 guests starting simultaneously caused massive I/O delay (~50% for several minutes). Consider staggering guest startup with delays between groups.
- Pi-hole 1 UDP DNS failed after boot: Docker iptables NAT rules weren't fully set up. Required container restart. TCP DNS worked immediately — only UDP was affected.
- Home Assistant IP changed: HA on VM 109 got a new DHCP address (10.10.0.215 instead of previous). Need DHCP reservation to prevent this.
- Local machine DNS failover: Desktop was configured with only one Pi-hole DNS server (10.10.0.226). When Proxmox guests were shut down, Pi-hole on physical server at .226 should have kept working but didn't resolve initially. Added both Pi-holes as DNS servers.
- Some VMs ignored ACPI shutdown: VMs 105 and 112 required the `--forceStop` flag.
- Several guests had onboot=1: many guests auto-started before we could bring them up in dependency order. Not harmful but unexpected.
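Given the UDP-only DNS failure above, a sketch like the following distinguishes the two transports when checking a resolver. The dig flags are standard; the resolver IP is this site's Pi-hole, and "OK" only means the server answered at all (NXDOMAIN still counts as a reply).

```shell
# Query a resolver over UDP, then TCP, to spot one-transport failures
# like the post-boot Pi-hole issue (TCP worked, UDP did not).
check_dns() {
  server=$1; name=${2:-example.com}
  if dig @"$server" "$name" +short +time=2 +tries=1 >/dev/null 2>&1; then
    echo "udp OK $server"
  else
    echo "udp FAIL $server"
  fi
  if dig @"$server" "$name" +tcp +short +time=2 +tries=1 >/dev/null 2>&1; then
    echo "tcp OK $server"
  else
    echo "tcp FAIL $server"
  fi
}

check_dns 10.10.0.226   # Pi-hole on the physical host
```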
**What went well:**

- `pve7to8 --full` checker caught everything — zero surprises during upgrade
- `DEBIAN_FRONTEND=noninteractive apt dist-upgrade -y -o Dpkg::Options::='--force-confnew'` worked cleanly
- Reboot took ~4 minutes (longer than expected but completed without issues)
- All backups on TrueNAS were intact and accessible post-upgrade
- Local disk space dropped from 57% to 14% after upgrade (old kernel/packages cleaned up)
**Recommendations for Phase 2:**

- Stagger guest startup: add `sleep 30` between dependency groups
- Restart Pi-hole Docker container proactively after boot
- Set DHCP reservation for HA VM before Phase 2
- Switch local DNS to public resolvers (1.1.1.1) before shutting down guests
- Disable onboot for all guests before upgrade, re-enable after validation
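The onboot recommendation can be sketched as a small loop. The guest IDs mirror the production list above; `DRY_RUN=1` (the default here) only prints the commands, and running again with `ONBOOT=1` re-enables autostart after validation.

```shell
# Disable autostart on all production guests before the upgrade window.
# DRY_RUN=1 (default) prints commands; run with DRY_RUN=0 on the PVE host.
# Re-enable afterwards by running again with ONBOOT=1.
DRY_RUN=${DRY_RUN:-1}
ONBOOT=${ONBOOT:-0}

run() { if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi; }

for id in 106 109 110 112 115; do           # VMs
  run qm set "$id" --onboot "$ONBOOT"
done
for id in 210 221 222 223 225 227 301; do   # LXCs
  run pct set "$id" --onboot "$ONBOOT"
done
```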
## Phase 2: Proxmox 8.4 → 9.1 Upgrade

### Pre-Upgrade Preparation (1 day)

#### 1. LXC Compatibility Check (CRITICAL)

```shell
# Verify systemd version in each surviving LXC (must be > 230)
for ct in 210 221 222 223 225 227 301; do
  echo "=== LXC $ct ==="
  pct exec $ct -- systemd --version | head -1
done
```
**Pre-verified 2026-02-19** (all pass, updated after cleanup):
| LXC | Name | systemd | Status |
|---|---|---|---|
| 210 | n8n | 245 | Pass |
| 221 | arr-stack | 245 | Pass |
| 222 | memos | 245 | Pass |
| 223 | foundry | 245 | Pass |
| 225 | gitea | 245 | Pass |
| 227 | uptime-kuma | 249 | Pass |
| 301 | claude-discord-coord | 249 | Pass |
Expected: All compatible. Re-verify before Phase 2 in case any LXC OS was changed.
#### 2. Fresh Backup Set

```shell
vzdump --all --mode snapshot --storage home-truenas --compress zstd
tar -czf /mnt/pve/home-truenas/dump/pve-config/pve8-config-$(date +%Y%m%d).tar.gz /etc/pve/
```

#### 3. Run PVE 8-to-9 Checker

```shell
pve8to9 --full
```
### Upgrade Execution (2-4 hours including downtime)

#### 1. Configure PVE 9 Repositories

```shell
# Backup PVE 8 config
cp /etc/apt/sources.list /etc/apt/sources.list.pve8-backup
cp -a /etc/apt/sources.list.d/ /etc/apt/sources.list.d.pve8-backup/
# Update repositories (Bookworm → Trixie)
sed -i 's/bookworm/trixie/g' /etc/apt/sources.list
echo "deb http://download.proxmox.com/debian/pve trixie pve-no-subscription" > /etc/apt/sources.list.d/pve-install-repo.list
sed -i 's/^deb/# deb/' /etc/apt/sources.list.d/pve-enterprise.list 2>/dev/null || true
apt update
```
#### 2. Execute Distribution Upgrade

```shell
apt dist-upgrade
# Duration: 20-60 minutes
reboot
```
#### 3. Verify PVE 9 Installation

```shell
pveversion # Should show pve-manager/9.1-X
uname -r # Should show 6.14.X-X-pve
# Verify cgroupv2 (PVE 9 requirement)
mount | grep cgroup2
# Verify services
systemctl status pve-cluster pvedaemon pveproxy pvestatd
pvesm status
```
### Post-Upgrade Validation

Start and validate services using the same procedure as the PVE 8 upgrade.

**Additional PVE 9 Checks:**
- Web UI with cleared browser cache (Ctrl+Shift+R)
- Memory reporting (PVE 9 includes overhead in VM memory)
- Storage performance validation
## Rollback Procedures

### If PVE 8 Upgrade Fails

**During dist-upgrade:**

```shell
apt --fix-broken install
dpkg --configure -a
# If unrecoverable:
cp /etc/apt/sources.list.pve7-backup /etc/apt/sources.list
cp -a /etc/apt/sources.list.d.pve7-backup/* /etc/apt/sources.list.d/
apt update && apt install pve-manager/7.4
```
**After reboot to unstable system:**
- Boot to previous kernel via GRUB → Advanced options
- Rollback repositories as above
### If PVE 9 Upgrade Fails

```shell
cp /etc/apt/sources.list.pve8-backup /etc/apt/sources.list
cp -a /etc/apt/sources.list.d.pve8-backup/* /etc/apt/sources.list.d/
apt update && apt dist-upgrade
reboot
```
### If VM/LXC Won't Start

**Restore from backup:**

```shell
# LXC
pct restore <CTID> /mnt/truenas/proxmox/vzdump-lxc-<CTID>-*.tar.zst --storage local-lvm
# VM
qmrestore /mnt/truenas/proxmox/vzdump-qemu-<VMID>-*.vma.zst <VMID>
```
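To resolve the `vzdump-...-*` glob to a concrete file, a small helper can pick the newest archive for a guest; the default directory is an assumption and can be overridden with a second argument.

```shell
# Print the newest vzdump archive for a guest ID (VM or LXC).
latest_backup() {
  id=$1
  dir=${2:-/mnt/pve/home-truenas/dump}   # assumed dump path; override as needed
  ls -t "$dir"/vzdump-*-"$id"-* 2>/dev/null | head -1
}

latest_backup 225   # e.g. feed the result to pct restore / qmrestore
```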
### Complete Reinstallation (Last Resort)

- Reinstall Proxmox VE 9 from ISO
- Restore configs from `/mnt/truenas/proxmox/pve-config-*/`
- Restore VMs/LXCs from backups
- Reconfigure networking if needed
## Risk Assessment
| Component | Risk | Impact | Mitigation |
|---|---|---|---|
| Production Bots (115, 110) | HIGH | Service downtime | Backup instance ready, notify users |
| Databases (112) | HIGH | Data loss | Multiple backups, test restore |
| LXC systemd compatibility | MEDIUM | Container won't start | Pre-verify versions, upgrade OS if needed |
| Network config | MEDIUM | Connectivity loss | Document config, console access |
| n8n workflows (210) | MEDIUM | Automation failures | Export workflow configs |
**Low Risk:** Game servers, templates, unused services
## Post-Upgrade Tasks

### 1. Update Documentation

- Record upgrade completion in `/mnt/NV2/Development/claude-home/vm-management/`
- Update Proxmox version references
- Document issues encountered
### 2. Performance Validation

```shell
pvesh get /cluster/resources
```
### 3. Long-Term Monitoring
- Daily health checks
- Resource utilization trends
- Plan next upgrade (PVE 9.x updates)
## Timeline Summary
| Phase | Duration | Downtime | Activity |
|---|---|---|---|
| Pre-PVE8 Prep | 1-2 days | None | Backups, validation |
| PVE 7→8 Upgrade | 2-4 hours | 1.5-2.5 hours | Repository update, upgrade |
| PVE 8 Stabilization | 1-2 weeks | None | Monitor, validate |
| Pre-PVE9 Prep | 1 day | None | LXC validation, backups |
| PVE 8→9 Upgrade | 2-4 hours | 1.5-2.5 hours | Repository update, upgrade |
| Post-Upgrade | 1-2 days | None | Documentation, optimization |
| TOTAL | 3-4 weeks | ~4 hours | Full upgrade with stabilization |
## Critical Files

- `/etc/pve/qemu-server/*.conf` - VM configurations (backup critical)
- `/etc/pve/lxc/*.conf` - LXC configurations (backup critical)
- `/etc/network/interfaces` - Network config (document before changes)
- `/etc/apt/sources.list` - Repository config (will be modified)
- `/etc/apt/sources.list.d/pve-*.list` - Proxmox repos (will be modified)
## Verification Checklist

### Phase 1 (PVE 7→8) — Completed 2026-02-19
- Proxmox version correct: pve-manager/8.4.16
- Kernel version updated: 6.8.12-18-pve
- All PVE services running (pve-cluster, pvedaemon, pveproxy, pvestatd)
- Storage accessible: local, local-lvm, home-truenas all active
- Network functional
- All VMs/LXCs visible in UI
- Critical VMs/LXCs started successfully
- Discord bots responding (confirmed on .88)
- Databases accessible (VM 112 running)
- n8n workflows — HTTP 200
- Gitea accessible — HTTP 200
- Home Assistant functional — HTTP 200 (new IP: 10.10.0.215)
- Jellyfin streaming — HTTP 302
- Uptime Kuma — HTTP 302
- Pi-hole 1 DNS resolving (after container restart)
- Pi-hole 2 DNS resolving
- Web UI functional — HTTP 200
### Phase 2 (PVE 8→9) — Pending
- Proxmox version correct (`pveversion`)
- Kernel version updated (`uname -r`)
- All services running (`systemctl status pve-*`)
- Storage accessible (`pvesm status`)
- Network functional (`ip addr`, `ip route`)
- All VMs/LXCs visible in UI
- Critical VMs/LXCs started successfully
- Discord bots responding
- Databases accessible
- n8n workflows running
- Gitea accessible
- Home Assistant functional
- Media streaming working
- Web UI functional (clear cache first)