Cal Corum 29a20fbe06
feat: add monthly Proxmox maintenance reboot automation (#26)
Establishes a first-Sunday-of-the-month maintenance window orchestrated
by Ansible on LXC 304. Split into two playbooks to handle the self-reboot
paradox (the controller is a guest on the host being rebooted):

- monthly-reboot.yml: snapshots, tiered shutdown with per-guest polling,
  fire-and-forget host reboot
- post-reboot-startup.yml: controlled tiered startup with staggered delays,
  Pi-hole UDP DNS fix, validation, and snapshot cleanup

Also fixes onboot:1 on VM 109, LXC 221, LXC 223 and creates a recurring
Google Calendar event for the maintenance window.

Closes #26

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 23:33:59 -05:00


---
title: Proxmox Monthly Maintenance Reboot
description: Runbook for the first-Sunday-of-the-month Proxmox host reboot — dependency-aware shutdown/startup order, validation checklist, and Ansible automation.
type: runbook
domain: server-configs
tags:
  - proxmox
  - maintenance
  - reboot
  - ansible
  - operations
  - systemd
---
# Proxmox Monthly Maintenance Reboot

## Overview

| Detail | Value |
|---|---|
| Schedule | 1st Sunday of every month, 3:00 AM ET (08:00 UTC) |
| Expected downtime | ~15 minutes (host reboot + VM/LXC startup) |
| Orchestration | Ansible on LXC 304 — shutdown playbook → host reboot → post-reboot startup playbook |
| Calendar | Google Calendar recurring event: "Proxmox Monthly Maintenance Reboot" |
| HA DNS | ubuntu-manticore (10.10.0.226) provides Pi-hole 2 during Proxmox downtime |

## Why

- Installed kernel updates do not take effect until the host reboots
- Long uptimes allow memory leaks and process-state drift (e.g., avahi busy-loops)
- Validates that all VMs/LXCs auto-start cleanly with `onboot: 1`

## Architecture

The reboot is split into two playbooks because LXC 304 (the Ansible controller) is itself a guest on the Proxmox host being rebooted:

1. `monthly-reboot.yml` — snapshots all guests, shuts them down in dependency order, issues a fire-and-forget reboot to the Proxmox host, then exits. LXC 304 is killed when the host reboots.
2. `post-reboot-startup.yml` — after the host reboots, LXC 304 auto-starts via `onboot: 1`. A systemd service (`ansible-post-reboot.service`) waits 120 seconds for the Proxmox API to stabilize, then starts all guests in dependency order with staggered delays.

The `onboot: 1` flag on all production guests acts as a safety net: even if the post-reboot playbook fails, Proxmox will start everything (though without controlled ordering).
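The boot-time trigger can be sketched as a systemd unit. This is illustrative only — the deployed `ansible-post-reboot.service` may differ, and the "uptime < 10 min" guard shown here via `ExecCondition` is an assumed implementation:

```ini
# Sketch of ansible-post-reboot.service (assumed layout, not the deployed unit)
[Unit]
Description=Post-reboot tiered guest startup
After=network-online.target
Wants=network-online.target

[Service]
Type=oneshot
# Guard: only run on a fresh boot (uptime < 600s); $$ is systemd's literal-$ escape
ExecCondition=/bin/sh -c 'up=$$(cut -d. -f1 /proc/uptime); [ "$$up" -lt 600 ]'
# Give the Proxmox API time to stabilize after the host reboot
ExecStartPre=/bin/sleep 120
ExecStart=/usr/bin/ansible-playbook /opt/ansible/playbooks/post-reboot-startup.yml

[Install]
WantedBy=multi-user.target
```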

## Prerequisites (Before Maintenance)

- Verify no active Tdarr transcodes on ubuntu-manticore
- Verify no running database backups
- Ensure the workstation has Pi-hole 2 (10.10.0.226) as a fallback DNS server so it fails over automatically during downtime
- Confirm ubuntu-manticore's Pi-hole 2 is healthy: `ssh manticore "docker exec pihole pihole status"`

## onboot Audit

All production VMs and LXCs must have `onboot: 1` so they restart automatically as a safety net.

Check VMs:

```bash
ssh proxmox "for id in \$(qm list | awk 'NR>1{print \$1}'); do \
  name=\$(qm config \$id | grep '^name:' | awk '{print \$2}'); \
  onboot=\$(qm config \$id | grep '^onboot:'); \
  echo \"VM \$id (\$name): \${onboot:-onboot NOT SET}\"; \
done"
```

Check LXCs:

```bash
ssh proxmox "for id in \$(pct list | awk 'NR>1{print \$1}'); do \
  name=\$(pct config \$id | grep '^hostname:' | awk '{print \$2}'); \
  onboot=\$(pct config \$id | grep '^onboot:'); \
  echo \"LXC \$id (\$name): \${onboot:-onboot NOT SET}\"; \
done"
```

Audit results (2026-04-03):

| ID | Name | Type | onboot | Status |
|---|---|---|---|---|
| 106 | docker-home | VM | 1 | OK |
| 109 | homeassistant | VM | 1 | OK (fixed 2026-04-03) |
| 110 | discord-bots | VM | 1 | OK |
| 112 | databases-bots | VM | 1 | OK |
| 115 | docker-sba | VM | 1 | OK |
| 116 | docker-home-servers | VM | 1 | OK |
| 210 | docker-n8n-lxc | LXC | 1 | OK |
| 221 | arr-stack | LXC | 1 | OK (fixed 2026-04-03) |
| 222 | memos | LXC | 1 | OK |
| 223 | foundry-lxc | LXC | 1 | OK (fixed 2026-04-03) |
| 225 | gitea | LXC | 1 | OK |
| 227 | uptime-kuma | LXC | 1 | OK |
| 301 | claude-discord-coordinator | LXC | 1 | OK |
| 302 | claude-runner | LXC | 1 | OK |
| 303 | mcp-gateway | LXC | 0 | Intentional (on-demand) |
| 304 | ansible-controller | LXC | 1 | OK |

If any production guest is missing `onboot: 1`:

```bash
ssh proxmox "qm set <VMID> --onboot 1"    # for VMs
ssh proxmox "pct set <CTID> --onboot 1"   # for LXCs
```

## Shutdown Order (Dependency-Aware)

Reverse of the validated startup sequence: stop consumers before their dependencies. Each tier polls per-guest status rather than using fixed waits.

```text
Tier 4 — Media & Others (no downstream dependents)
  VM 109  homeassistant
  LXC 221 arr-stack
  LXC 222 memos
  LXC 223 foundry-lxc
  LXC 302 claude-runner

Tier 3 — Applications (depend on databases + infra)
  VM 115  docker-sba (Paper Dynasty, Major Domo)
  VM 110  discord-bots
  LXC 301 claude-discord-coordinator

Tier 2 — Infrastructure + DNS (depend on databases)
  VM 106  docker-home (Pi-hole 1, NPM)
  LXC 225 gitea
  LXC 210 docker-n8n-lxc
  LXC 227 uptime-kuma
  VM 116  docker-home-servers

Tier 1 — Databases (no dependencies, shut down last)
  VM 112  databases-bots (force-stop after 90s if ACPI ignored)

→ LXC 304 issues fire-and-forget reboot to Proxmox host, then is killed
```

Known quirks:

- VM 112 (databases-bots) may ignore ACPI shutdown — the playbook force-stops it after 90s
- VM 109 (homeassistant) is self-managed via the HA Supervisor and excluded from the Ansible inventory
- LXC 303 (mcp-gateway) has `onboot: 0` and is operator-managed — it is not included in shutdown/startup. If it was running before maintenance, bring it up manually afterward

## Startup Order (Staggered)

After the Proxmox host reboots, LXC 304 auto-starts and `ansible-post-reboot.service` waits 120s before running the controlled startup:

```text
Tier 1 — Databases first
  VM 112  databases-bots
  → wait 30s for DB to accept connections

Tier 2 — Infrastructure + DNS
  VM 106  docker-home (Pi-hole 1, NPM)
  LXC 225 gitea
  LXC 210 docker-n8n-lxc
  LXC 227 uptime-kuma
  VM 116  docker-home-servers
  → wait 30s

Tier 3 — Applications
  VM 115  docker-sba
  VM 110  discord-bots
  LXC 301 claude-discord-coordinator
  → wait 30s

Pi-hole fix — restart container via SSH to clear UDP DNS bug
  ssh docker-home "docker restart pihole"
  → wait 10s

Tier 4 — Media & Others
  VM 109  homeassistant
  LXC 221 arr-stack
  LXC 222 memos
  LXC 223 foundry-lxc
  LXC 302 claude-runner
```
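The tiered startup above can be sketched in shell. Illustrative only — the real `post-reboot-startup.yml` is an Ansible playbook, and `PVE`/`TIER_DELAY` are hypothetical names for this sketch:

```shell
# Illustrative tiered startup with staggered delays (a sketch; the real
# post-reboot-startup.yml tasks may differ). Guests are "kind:id" pairs.
PVE="${PVE:-ssh proxmox}"
TIER_DELAY="${TIER_DELAY:-30}"

start_tier() {
  for guest in "$@"; do
    $PVE "${guest%%:*} start ${guest##*:}"   # expands to: qm start <id> / pct start <id>
  done
  sleep "$TIER_DELAY"                        # let the tier settle before the next one
}

run_startup() {
  start_tier qm:112                                    # Tier 1: databases
  start_tier qm:106 pct:225 pct:210 pct:227 qm:116     # Tier 2: infrastructure + DNS
  start_tier qm:115 qm:110 pct:301                     # Tier 3: applications
  ssh docker-home "docker restart pihole"; sleep 10    # Pi-hole UDP DNS fix
  start_tier qm:109 pct:221 pct:222 pct:223 pct:302    # Tier 4: media & others
}
```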

## Post-Reboot Validation

- Pi-hole 1 DNS resolving: `ssh docker-home "docker exec pihole dig google.com @127.0.0.1"`
- Gitea accessible: `curl -sf https://git.manticorum.com/api/v1/version`
- n8n healthy: `ssh docker-n8n-lxc "docker ps --filter name=n8n --format '{{.Status}}'"`
- Discord bots responding (check Discord)
- Uptime Kuma dashboard green: `curl -sf http://10.10.0.227:3001/api/status-page/homelab`
- Home Assistant running: `curl -sf http://10.10.0.109:8123/api/ -H 'Authorization: Bearer <token>'`
- Maintenance snapshots cleaned up (automatic, 7-day retention)
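The checklist above can be swept in one pass with a small helper. A sketch using the runbook's own endpoints — `check` and `run_validation` are hypothetical names, and the Home Assistant check is omitted since it needs a token:

```shell
# Illustrative post-reboot validation sweep; prints PASS/FAIL per check.
check() {
  local label="$1"; shift
  if "$@" >/dev/null 2>&1; then echo "PASS: $label"; else echo "FAIL: $label"; fi
}

run_validation() {
  check "Pi-hole 1 DNS"  ssh docker-home "docker exec pihole dig google.com @127.0.0.1"
  check "Gitea API"      curl -sf https://git.manticorum.com/api/v1/version
  check "n8n container"  ssh docker-n8n-lxc "docker ps --filter name=n8n --format '{{.Status}}'"
  check "Uptime Kuma"    curl -sf http://10.10.0.227:3001/api/status-page/homelab
}
```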

## Automation

### Ansible Playbooks

Both are located at `/opt/ansible/playbooks/` on LXC 304.

```bash
# Dry run — shutdown only
ssh ansible "ansible-playbook /opt/ansible/playbooks/monthly-reboot.yml --check"

# Manual full execution — shutdown + reboot
ssh ansible "ansible-playbook /opt/ansible/playbooks/monthly-reboot.yml"

# Manual post-reboot startup (if automatic startup failed)
ssh ansible "ansible-playbook /opt/ansible/playbooks/post-reboot-startup.yml"

# Shutdown only — skip the host reboot
ssh ansible "ansible-playbook /opt/ansible/playbooks/monthly-reboot.yml --tags shutdown"
```

### Systemd Units (on LXC 304)

| Unit | Purpose | Schedule |
|---|---|---|
| `ansible-monthly-reboot.timer` | Triggers the shutdown + reboot playbook | 1st Sunday of month, 08:00 UTC |
| `ansible-monthly-reboot.service` | Runs `monthly-reboot.yml` | Activated by the timer |
| `ansible-post-reboot.service` | Runs `post-reboot-startup.yml` | On boot (multi-user.target), only if uptime < 10 min |
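The timer's schedule maps to a systemd `OnCalendar` expression. A sketch — the deployed timer unit may differ:

```ini
# Sketch of ansible-monthly-reboot.timer (assumed layout, not the deployed unit)
[Unit]
Description=Monthly Proxmox maintenance reboot window

[Timer]
# First Sunday of each month at 08:00 UTC: day-of-month 1..7 constrained to Sunday
OnCalendar=Sun *-*-01..07 08:00:00 UTC
# Do not catch up a missed window at the next boot
Persistent=false

[Install]
WantedBy=timers.target
```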
```bash
# Check timer status
ssh ansible "systemctl status ansible-monthly-reboot.timer"

# Next scheduled run
ssh ansible "systemctl list-timers ansible-monthly-reboot.timer"

# Check post-reboot service status
ssh ansible "systemctl status ansible-post-reboot.service"

# Disable for a month (e.g., during an incident)
ssh ansible "systemctl stop ansible-monthly-reboot.timer"
```

### Deployment (one-time setup on LXC 304)

```bash
# Copy playbooks
scp ansible/playbooks/monthly-reboot.yml ansible:/opt/ansible/playbooks/
scp ansible/playbooks/post-reboot-startup.yml ansible:/opt/ansible/playbooks/

# Copy and enable systemd units
scp ansible/systemd/ansible-monthly-reboot.timer ansible:/etc/systemd/system/
scp ansible/systemd/ansible-monthly-reboot.service ansible:/etc/systemd/system/
scp ansible/systemd/ansible-post-reboot.service ansible:/etc/systemd/system/
ssh ansible "sudo systemctl daemon-reload && \
  sudo systemctl enable --now ansible-monthly-reboot.timer && \
  sudo systemctl enable ansible-post-reboot.service"

# Verify SSH key access from LXC 304 to docker-home (needed for Pi-hole restart)
ssh ansible "ssh -o BatchMode=yes docker-home 'echo ok'"
```

## Rollback

If a guest fails to start after the reboot:

1. Check the Proxmox web UI or `pvesh get /nodes/proxmox/qemu/<VMID>/status/current`
2. Review guest logs: `ssh proxmox "journalctl -u pve-guests -n 50"`
3. Start manually: `ssh proxmox "pvesh create /nodes/proxmox/qemu/<VMID>/status/start"`
4. If the guest is corrupted, restore from the pre-reboot Proxmox snapshot
5. If the post-reboot startup failed entirely, run it manually: `ssh ansible "ansible-playbook /opt/ansible/playbooks/post-reboot-startup.yml"`