---
title: Proxmox 7-to-9 Upgrade Plan
description: Two-phase Proxmox upgrade plan (7→8→9) with Phase 1 completed (2026-02-19). Covers backup procedures, upgrade execution, service startup order, lessons learned, rollback procedures, and Phase 2 planning.
type: runbook
domain: vm-management
tags:
  - proxmox
  - upgrade
  - pve
  - backup
  - rollback
  - infrastructure
---
# Proxmox VE Upgrade Plan: 7.1-7 → 9.1

## Executive Summary

- **Current State:** Proxmox VE 8.4.16 (kernel 6.8.12-18-pve) — Phase 1 complete
- **Target State:** Proxmox VE 9.1 (latest)
- **Upgrade Path:** Two-phase upgrade (7→8→9); a direct 7→9 upgrade is not supported
- **Total Timeline:** 3-4 weeks (including stabilization periods)
- **Total Downtime:** ~4 hours (~2 hours per phase)

### Phase 1 Status: COMPLETED (2026-02-19)

- Upgraded from PVE 7.4-20 → PVE 8.4.16
- Kernel: 5.13.19-2-pve → 6.8.12-18-pve
- Total downtime: ~45 minutes (upgrade + reboot + service startup)
- All services validated and running
- Stabilization period: monitoring through early March 2026

## Infrastructure Overview

**Production Services (7 LXC + 7 VMs) — cleaned up 2026-02-19:**

- **Critical:** Paper Dynasty/Major Domo (VM 115), Discord bots (VM 110), Gitea (LXC 225), n8n (LXC 210), Home Assistant (VM 109), Databases (VM 112), docker-home/Pi-hole 1 (VM 106)
- **Important:** Claude Discord Coordinator (LXC 301), arr-stack (LXC 221), Uptime Kuma (LXC 227), Foundry VTT (LXC 223), Memos (LXC 222)
- **Decommission Candidate:** docker-home-servers (VM 116) — Jellyfin-only after 2026-04-03 cleanup; watchstate removed (duplicate of manticore); see issue #31
- **Removed (2026-02-19):** 108 (ansible), 224 (openclaw), 300 (openclaw-migrated), 101/102/104/111/211 (game servers), 107 (plex), 113 (tdarr - moved to .226), 114 (duplicate arr-stack), 117 (unused), 100/103 (old templates), 105 (docker-vpn - decommissioned 2026-04)

**Key Constraints:**

- Home Assistant VM 109 requires dual network (vmbr1 for Matter support)
- All production Discord bots must minimize downtime
- Gitea mirrored to GitHub provides a backup
- TrueNAS backup mount at 10.10.0.35

## Phase 1: Proxmox 7.1 → 8.4 Upgrade

### Pre-Upgrade Preparation (1-2 days)

#### 1. Comprehensive Backups

All production guests (14 total after cleanup):

```bash
# Backup all to TrueNAS (PVE storage: home-truenas, mount: /mnt/pve/home-truenas)
# VMs
vzdump 106 --mode snapshot --storage home-truenas --compress zstd  # docker-home (pihole1, NPM)
vzdump 109 --mode snapshot --storage home-truenas --compress zstd  # homeassistant
vzdump 110 --mode snapshot --storage home-truenas --compress zstd  # discord-bots
vzdump 112 --mode snapshot --storage home-truenas --compress zstd  # databases
vzdump 115 --mode snapshot --storage home-truenas --compress zstd  # docker-sba (Paper Dynasty)
# LXCs
vzdump 210 --mode snapshot --storage home-truenas --compress zstd  # n8n
vzdump 221 --mode snapshot --storage home-truenas --compress zstd  # arr-stack
vzdump 222 --mode snapshot --storage home-truenas --compress zstd  # memos
vzdump 223 --mode snapshot --storage home-truenas --compress zstd  # foundry
vzdump 225 --mode snapshot --storage home-truenas --compress zstd  # gitea
vzdump 227 --mode snapshot --storage home-truenas --compress zstd  # uptime-kuma
vzdump 301 --mode snapshot --storage home-truenas --compress zstd  # claude-discord-coordinator
# Optional (stopped/investigate)
# vzdump 105 --mode snapshot --storage home-truenas --compress zstd  # docker-vpn (decommissioning)
# vzdump 116 --mode snapshot --storage home-truenas --compress zstd  # docker-home-servers (investigate)
```

**Backup Proxmox Configuration:**

```bash
# Already completed 2026-02-19 — refresh before upgrade
tar -czf /mnt/pve/home-truenas/dump/pve-config/pve-config-$(date +%Y%m%d).tar.gz /etc/pve/
cp /etc/network/interfaces /mnt/pve/home-truenas/dump/pve-config/interfaces.backup.$(date +%Y%m%d)
```

**Expected:** 2-4 hours, ~500GB-1TB of storage required
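Before moving on, it's worth confirming every dump actually landed. A minimal sketch, assuming the dumps land under `/mnt/pve/home-truenas/dump` (per the storage mount above); the `newest_dump` helper is hypothetical:

```bash
#!/usr/bin/env bash
# Confirm a vzdump archive exists for each production guest.
# DUMP_DIR is an assumption based on the storage mount used above.
DUMP_DIR="${DUMP_DIR:-/mnt/pve/home-truenas/dump}"
GUESTS="106 109 110 112 115 210 221 222 223 225 227 301"

newest_dump() {  # print the newest archive for a guest ID, empty if none
    ls -t "$DUMP_DIR"/vzdump-*-"$1"-*.zst 2>/dev/null | head -1
}

missing=0
for id in $GUESTS; do
    f=$(newest_dump "$id")
    if [ -z "$f" ]; then
        echo "MISSING: no backup found for guest $id"
        missing=$((missing + 1))
    else
        echo "OK: guest $id -> $(basename "$f")"
    fi
done
echo "$missing guest(s) without a backup"
```

Run it from the PVE host after the vzdump batch; anything reported `MISSING` should be re-backed-up before proceeding.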

#### 2. Pre-Upgrade Validation

```bash
# Run the Proxmox 7-to-8 checker
pve7to8 --full

# Update to latest PVE 7.4
apt update && apt dist-upgrade -y

# Verify minimum version
pveversion  # Must show 7.4-15 or higher

# Document current state
pvesh get /cluster/resources --type vm --output-format yaml > /mnt/truenas/proxmox/vm-inventory-pre-upgrade.yaml
```

#### 3. Maintenance Window Planning

- **Recommended Timing:** Overnight or early-morning weekend
- **Estimated Downtime:** 1.5-2.5 hours
- **Notifications Required:** Discord bot users, game server players

### Upgrade Execution (2-4 hours including downtime)

#### 1. Update to Latest PVE 7.4

```bash
apt update && apt dist-upgrade -y
pveversion  # Verify 7.4-XX
reboot
```

#### 2. Configure PVE 8 Repositories

```bash
# Backup current config
cp /etc/apt/sources.list /etc/apt/sources.list.pve7-backup
cp -a /etc/apt/sources.list.d/ /etc/apt/sources.list.d.pve7-backup/

# Update repositories (Bullseye → Bookworm)
sed -i 's/bullseye/bookworm/g' /etc/apt/sources.list
echo "deb http://download.proxmox.com/debian/pve bookworm pve-no-subscription" > /etc/apt/sources.list.d/pve-install-repo.list
sed -i 's/^deb/# deb/' /etc/apt/sources.list.d/pve-enterprise.list 2>/dev/null || true

apt update
```

#### 3. Execute Distribution Upgrade

```bash
apt dist-upgrade
# Duration: 15-45 minutes
# Accept new versions of /etc/issue
# Keep current versions of customized configs

reboot
```

#### 4. Verify PVE 8 Installation

```bash
pveversion  # Should show pve-manager/8.4-X
uname -r    # Should show 6.8.X-X-pve

# Verify services
systemctl status pve-cluster pvedaemon pveproxy pvestatd
pvesm status
```

### Post-Upgrade Validation

**Start services in dependency order** (stagger with 30s delays per Phase 1 lessons):

```bash
# Databases first
pvesh create /nodes/proxmox/qemu/112/status/start  # databases-bots
sleep 30

# Infrastructure + DNS
pvesh create /nodes/proxmox/qemu/106/status/start  # docker-home (pihole1, NPM)
pvesh create /nodes/proxmox/lxc/225/status/start   # gitea
pvesh create /nodes/proxmox/lxc/210/status/start   # n8n
pvesh create /nodes/proxmox/lxc/227/status/start   # uptime-kuma
sleep 30

# Applications
pvesh create /nodes/proxmox/qemu/115/status/start  # docker-sba (Paper Dynasty)
pvesh create /nodes/proxmox/qemu/110/status/start  # discord-bots
pvesh create /nodes/proxmox/lxc/301/status/start   # claude-discord-coordinator
sleep 30

# Restart Pi-hole container proactively (UDP DNS fix from Phase 1)
qm guest exec 106 -- docker restart pihole
sleep 10

# Media & Others
pvesh create /nodes/proxmox/qemu/109/status/start  # homeassistant
pvesh create /nodes/proxmox/lxc/221/status/start   # arr-stack
pvesh create /nodes/proxmox/lxc/222/status/start   # memos
pvesh create /nodes/proxmox/lxc/223/status/start   # foundry-lxc
```
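After the staggered start, a quick status sweep confirms every guest actually reached `running`. A sketch, assuming the single node name `proxmox` used in the startup commands; the `parse_status` helper is hypothetical:

```bash
#!/usr/bin/env bash
# Sweep all production guests and report their current status.
VMS="106 109 110 112 115"
CTS="210 221 222 223 225 227 301"

parse_status() {  # pull the "status" field out of pvesh JSON output
    grep -o '"status":"[a-z]*"' | cut -d'"' -f4
}

check() {  # check <qemu|lxc> <id>
    local st
    st=$(pvesh get "/nodes/proxmox/$1/$2/status/current" --output-format json | parse_status)
    echo "$1/$2: ${st:-unknown}"
}

for id in $VMS; do check qemu "$id"; done
for id in $CTS; do check lxc "$id"; done
```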

**Service Validation Checklist:**

- Discord bots responding in Discord
- Database connections working
- n8n workflows executing
- Gitea accessible at git.manticorum.com
- Home Assistant automations running
- Media servers streaming (Plex/Jellyfin)
- Web UI accessible and functional
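The HTTP items on this checklist can be scripted. A sketch: the expected codes mirror the Phase 1 results (200 for most services, 302 for Jellyfin and Uptime Kuma), but every address except the Gitea hostname is a placeholder to replace with the real service endpoints:

```bash
#!/usr/bin/env bash
# Probe each web service and compare the HTTP status against the expected one.
check_http() {  # check_http <url> <expected-code>
    local code
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$1")
    if [ "$code" = "$2" ]; then
        echo "OK   $1 ($code)"
    else
        echo "FAIL $1 (got $code, expected $2)"
    fi
}

check_http "https://git.manticorum.com" 200           # Gitea
check_http "http://n8n.example.lan:5678" 200          # placeholder URL
check_http "http://jellyfin.example.lan:8096" 302     # placeholder URL
check_http "http://uptime-kuma.example.lan:3001" 302  # placeholder URL
```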

### Stabilization Period

Wait 1-2 weeks before the PVE 9 upgrade.

Monitor for:

- VM/LXC stability
- Performance issues
- Service uptime
- Error logs

## Phase 1 Lessons Learned (2026-02-19)

**Issues encountered:**

1. **I/O storm on boot:** All 15 guests starting simultaneously caused massive I/O delay (~50% for several minutes). Stagger guest startup with delays between groups.
2. **Pi-hole 1 UDP DNS failed after boot:** Docker iptables NAT rules weren't fully set up; a container restart was required. TCP DNS worked immediately — only UDP was affected.
3. **Home Assistant IP changed:** HA on VM 109 got a new DHCP address (10.10.0.215 instead of its previous one). A DHCP reservation is needed to prevent this.
4. **Local machine DNS failover:** The desktop was configured with only one Pi-hole DNS server (10.10.0.226). When the Proxmox guests were shut down, the Pi-hole on the physical server at .226 should have kept working but did not resolve initially. Both Pi-holes are now configured as DNS servers.
5. **Some VMs ignored ACPI shutdown:** VMs 105 and 112 required the `--forceStop` flag.
6. **Several guests had `onboot=1`:** Many guests auto-started before they could be brought up in dependency order. Not harmful, but unexpected.

**What went well:**

- The `pve7to8 --full` checker caught everything — zero surprises during the upgrade
- `DEBIAN_FRONTEND=noninteractive apt dist-upgrade -y -o Dpkg::Options::='--force-confnew'` worked cleanly
- Reboot took ~4 minutes (longer than expected, but it completed without issues)
- All backups on TrueNAS were intact and accessible post-upgrade
- Local disk usage dropped from 57% to 14% after the upgrade (old kernels/packages cleaned up)

**Recommendations for Phase 2:**

- Stagger guest startup: add `sleep 30` between dependency groups
- Restart the Pi-hole Docker container proactively after boot
- Set a DHCP reservation for the HA VM before Phase 2
- Switch local DNS to public resolvers (1.1.1.1) before shutting down guests
- Disable `onboot` for all guests before the upgrade; re-enable after validation
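The last recommendation can be scripted so the previous `onboot` flags are recorded and restorable afterwards. A single-node sketch; the `save_and_disable` helper and the state-file path are assumptions:

```bash
#!/usr/bin/env bash
# Record each guest's onboot flag, then disable it so nothing auto-starts
# during the upgrade window. The state file lets you replay the old values.
STATE="${STATE:-/root/onboot-state-$(date +%Y%m%d).txt}"
: > "$STATE"

save_and_disable() {  # save_and_disable <qm|pct> <id>
    local flag
    flag=$("$1" config "$2" | awk -F': ' '/^onboot:/ {print $2}')
    echo "$1 $2 ${flag:-0}" >> "$STATE"
    "$1" set "$2" --onboot 0
}

for id in $(qm list 2>/dev/null | awk 'NR>1 {print $1}');  do save_and_disable qm  "$id"; done
for id in $(pct list 2>/dev/null | awk 'NR>1 {print $1}'); do save_and_disable pct "$id"; done

# After validation, re-enable with the saved values:
#   while read -r tool id flag; do "$tool" set "$id" --onboot "$flag"; done < "$STATE"
```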

## Phase 2: Proxmox 8.4 → 9.1 Upgrade

### Pre-Upgrade Preparation (1 day)

#### 1. LXC Compatibility Check (CRITICAL)

```bash
# Verify the systemd version in each LXC (must be > 230)
# (List reflects the current containers; 108/211/224/300 were removed in the cleanup)
for ct in 210 221 222 223 225 227 301; do
    echo "=== LXC $ct ==="
    pct exec $ct -- systemd --version | head -1
done
```

Pre-verified 2026-02-19 (all pass, updated after cleanup):

| LXC | Name                 | systemd | Status |
|-----|----------------------|---------|--------|
| 210 | n8n                  | 245     | Pass   |
| 221 | arr-stack            | 245     | Pass   |
| 222 | memos                | 245     | Pass   |
| 223 | foundry              | 245     | Pass   |
| 225 | gitea                | 245     | Pass   |
| 227 | uptime-kuma          | 249     | Pass   |
| 301 | claude-discord-coord | 249     | Pass   |

Expected: all compatible. Re-verify before Phase 2 in case any LXC OS has changed.
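For an unattended re-check, the same loop can be made pass/fail. A sketch; `version_ok` is a hypothetical helper and 230 is the threshold stated above:

```bash
#!/usr/bin/env bash
# Print pass/FAIL per container instead of eyeballing raw version strings.
version_ok() {  # version_ok <number> -> pass if strictly greater than 230
    if [ "$1" -gt 230 ] 2>/dev/null; then echo pass; else echo FAIL; fi
}

for ct in 210 221 222 223 225 227 301; do
    v=$(pct exec "$ct" -- systemd --version 2>/dev/null | awk 'NR==1 {print $2}')
    echo "LXC $ct: systemd ${v:-?} -> $(version_ok "${v:-0}")"
done
```

Non-numeric or missing output (e.g. a container that is stopped) is reported as `FAIL` rather than silently passing.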

#### 2. Fresh Backup Set

```bash
vzdump --all --mode snapshot --storage home-truenas --compress zstd
tar -czf /mnt/pve/home-truenas/dump/pve-config/pve8-config-$(date +%Y%m%d).tar.gz /etc/pve/
```

#### 3. Run the PVE 8-to-9 Checker

```bash
pve8to9 --full
```

### Upgrade Execution (2-4 hours including downtime)

#### 1. Configure PVE 9 Repositories

```bash
# Backup PVE 8 config
cp /etc/apt/sources.list /etc/apt/sources.list.pve8-backup
cp -a /etc/apt/sources.list.d/ /etc/apt/sources.list.d.pve8-backup/

# Update repositories (Bookworm → Trixie)
sed -i 's/bookworm/trixie/g' /etc/apt/sources.list
echo "deb http://download.proxmox.com/debian/pve trixie pve-no-subscription" > /etc/apt/sources.list.d/pve-install-repo.list
sed -i 's/^deb/# deb/' /etc/apt/sources.list.d/pve-enterprise.list 2>/dev/null || true

apt update
```

#### 2. Execute Distribution Upgrade

```bash
apt dist-upgrade
# Duration: 20-60 minutes

reboot
```

#### 3. Verify PVE 9 Installation

```bash
pveversion  # Should show pve-manager/9.1-X
uname -r    # Should show 6.14.X-X-pve

# Verify cgroup v2 (PVE 9 requirement)
mount | grep cgroup2

# Verify services
systemctl status pve-cluster pvedaemon pveproxy pvestatd
pvesm status
```

### Post-Upgrade Validation

Start and validate services using the same procedure as the PVE 8 upgrade.

**Additional PVE 9 checks:**

- Web UI with a cleared browser cache (Ctrl+Shift+R)
- Memory reporting (PVE 9 includes overhead in reported VM memory)
- Storage performance validation

## Rollback Procedures

### If the PVE 8 Upgrade Fails

**During dist-upgrade:**

```bash
apt --fix-broken install
dpkg --configure -a

# If unrecoverable:
cp /etc/apt/sources.list.pve7-backup /etc/apt/sources.list
cp -a /etc/apt/sources.list.d.pve7-backup/* /etc/apt/sources.list.d/
apt update && apt install pve-manager/7.4
```

**After a reboot to an unstable system:**

- Boot to the previous kernel via GRUB → Advanced options
- Roll back repositories as above

### If the PVE 9 Upgrade Fails

```bash
cp /etc/apt/sources.list.pve8-backup /etc/apt/sources.list
cp -a /etc/apt/sources.list.d.pve8-backup/* /etc/apt/sources.list.d/
apt update && apt dist-upgrade
reboot
```

### If a VM/LXC Won't Start

Restore from backup:

```bash
# LXC
pct restore <CTID> /mnt/truenas/proxmox/vzdump-lxc-<CTID>-*.tar.zst --storage local-lvm

# VM
qmrestore /mnt/truenas/proxmox/vzdump-qemu-<VMID>-*.vma.zst <VMID>
```
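Since the globs above match every dated archive for a guest, a small helper that picks the newest one keeps the restore command unambiguous. A sketch; `latest_backup` and the `BACKUP_DIR` default are assumptions following the paths above:

```bash
#!/usr/bin/env bash
# Resolve the newest vzdump archive for a guest before restoring it.
BACKUP_DIR="${BACKUP_DIR:-/mnt/truenas/proxmox}"

latest_backup() {  # latest_backup <vmid-or-ctid>
    ls -t "$BACKUP_DIR"/vzdump-*-"$1"-*.zst 2>/dev/null | head -1
}

# Example usage (VM 112):
#   qmrestore "$(latest_backup 112)" 112
```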

### Complete Reinstallation (Last Resort)

1. Reinstall Proxmox VE 9 from the ISO
2. Restore configs from /mnt/truenas/proxmox/pve-config-*/
3. Restore VMs/LXCs from backups
4. Reconfigure networking if needed

## Risk Assessment

| Component                  | Risk   | Impact                | Mitigation                                |
|----------------------------|--------|-----------------------|-------------------------------------------|
| Production bots (115, 110) | HIGH   | Service downtime      | Backup instance ready, notify users       |
| Databases (112)            | HIGH   | Data loss             | Multiple backups, test restore            |
| LXC systemd compatibility  | MEDIUM | Container won't start | Pre-verify versions, upgrade OS if needed |
| Network config             | MEDIUM | Connectivity loss     | Document config, console access           |
| n8n workflows (210)        | MEDIUM | Automation failures   | Export workflow configs                   |

**Low Risk:** Game servers, templates, unused services


## Post-Upgrade Tasks

### 1. Update Documentation

- Record upgrade completion in /mnt/NV2/Development/claude-home/vm-management/
- Update Proxmox version references
- Document issues encountered

### 2. Performance Validation

```bash
pvesh get /cluster/resources
```

### 3. Long-Term Monitoring

- Daily health checks
- Resource utilization trends
- Plan the next upgrade (PVE 9.x updates)

## Timeline Summary

| Phase               | Duration  | Downtime      | Activity                        |
|---------------------|-----------|---------------|---------------------------------|
| Pre-PVE8 prep       | 1-2 days  | None          | Backups, validation             |
| PVE 7→8 upgrade     | 2-4 hours | 1.5-2.5 hours | Repository update, upgrade      |
| PVE 8 stabilization | 1-2 weeks | None          | Monitor, validate               |
| Pre-PVE9 prep       | 1 day     | None          | LXC validation, backups         |
| PVE 8→9 upgrade     | 2-4 hours | 1.5-2.5 hours | Repository update, upgrade      |
| Post-upgrade        | 1-2 days  | None          | Documentation, optimization     |
| **TOTAL**           | 3-4 weeks | ~4 hours      | Full upgrade with stabilization |

## Critical Files

- `/etc/pve/qemu-server/*.conf` - VM configurations (backup critical)
- `/etc/pve/lxc/*.conf` - LXC configurations (backup critical)
- `/etc/network/interfaces` - Network config (document before changes)
- `/etc/apt/sources.list` - Repository config (will be modified)
- `/etc/apt/sources.list.d/pve-*.list` - Proxmox repos (will be modified)

## Verification Checklist

### Phase 1 (PVE 7→8) — Completed 2026-02-19

- [x] Proxmox version correct: pve-manager/8.4.16
- [x] Kernel version updated: 6.8.12-18-pve
- [x] All PVE services running (pve-cluster, pvedaemon, pveproxy, pvestatd)
- [x] Storage accessible: local, local-lvm, home-truenas all active
- [x] Network functional
- [x] All VMs/LXCs visible in UI
- [x] Critical VMs/LXCs started successfully
- [x] Discord bots responding (confirmed on .88)
- [x] Databases accessible (VM 112 running)
- [x] n8n workflows — HTTP 200
- [x] Gitea accessible — HTTP 200
- [x] Home Assistant functional — HTTP 200 (new IP: 10.10.0.215)
- [x] Jellyfin streaming — HTTP 302
- [x] Uptime Kuma — HTTP 302
- [x] Pi-hole 1 DNS resolving (after container restart)
- [x] Pi-hole 2 DNS resolving
- [x] Web UI functional — HTTP 200

### Phase 2 (PVE 8→9) — Pending

- [ ] Proxmox version correct (pveversion)
- [ ] Kernel version updated (uname -r)
- [ ] All services running (systemctl status pve-*)
- [ ] Storage accessible (pvesm status)
- [ ] Network functional (ip addr, ip route)
- [ ] All VMs/LXCs visible in UI
- [ ] Critical VMs/LXCs started successfully
- [ ] Discord bots responding
- [ ] Databases accessible
- [ ] n8n workflows running
- [ ] Gitea accessible
- [ ] Home Assistant functional
- [ ] Media streaming working
- [ ] Web UI functional (clear cache first)
