---
title: Proxmox Monthly Maintenance Reboot
description: Runbook for the first-Sunday-of-the-month Proxmox host reboot — dependency-aware shutdown/startup order, validation checklist, and Ansible automation.
type: runbook
domain: server-configs
tags:
---
# Proxmox Monthly Maintenance Reboot

## Overview
| Detail | Value |
|---|---|
| Schedule | 1st Sunday of every month, 3:00 AM ET (08:00 UTC) |
| Expected downtime | ~15 minutes (host reboot + VM/LXC startup) |
| Orchestration | Ansible on LXC 304 — shutdown playbook → host reboot → post-reboot startup playbook |
| Calendar | Google Calendar recurring event: "Proxmox Monthly Maintenance Reboot" |
| HA DNS | ubuntu-manticore (10.10.0.226) provides Pi-hole 2 during Proxmox downtime |
## Why

- Kernel updates accumulate without reboot and never take effect
- Long uptimes allow memory leaks and process state drift (e.g., avahi busy-loops)
- Validates that all VMs/LXCs auto-start cleanly with `onboot: 1`
## Architecture
The reboot is split into two playbooks because LXC 304 (the Ansible controller) is itself a guest on the Proxmox host being rebooted:
1. `monthly-reboot.yml` — snapshots all guests, shuts them down in dependency order, issues a fire-and-forget `reboot` to the Proxmox host, then exits. LXC 304 is killed when the host reboots.
2. `post-reboot-startup.yml` — after the host reboots, LXC 304 auto-starts via `onboot: 1`. A systemd service (`ansible-post-reboot.service`) waits 120 seconds for the Proxmox API to stabilize, then starts all guests in dependency order with staggered delays.
The `onboot: 1` flag on all production guests acts as a safety net — even if the post-reboot playbook fails, Proxmox will start everything (though without controlled ordering).
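The fire-and-forget reboot in `monthly-reboot.yml` works by detaching the command, so the SSH session (and LXC 304 itself) can die without blocking the caller. A minimal sketch of the pattern, using a local `sleep` as a stand-in for the real host reboot; the `fire_and_forget` helper is hypothetical, not the playbook's actual code (the playbook presumably uses an async Ansible task):

```shell
# Detach a command so the caller returns immediately instead of waiting
# for it to finish (illustrative helper, not the real playbook code).
fire_and_forget() {
  nohup sh -c "$*" >/dev/null 2>&1 &
}

fire_and_forget "sleep 5"     # stand-in for the host reboot command
echo "returned immediately"   # prints right away, not after 5 seconds
```

The `nohup ... &` detachment is what lets LXC 304 issue the reboot and then be killed by it without the playbook hanging.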
## Prerequisites (Before Maintenance)

- Verify no active Tdarr transcodes on ubuntu-manticore
- Verify no running database backups
- Ensure workstation has Pi-hole 2 (10.10.0.226) as a fallback DNS server so it fails over automatically during downtime
- Confirm ubuntu-manticore Pi-hole 2 is healthy:

  ```bash
  ssh manticore "docker exec pihole pihole status"
  ```
### onboot Audit

All production VMs and LXCs must have `onboot: 1` so they restart automatically as a safety net.

Check VMs:

```bash
ssh proxmox "for id in \$(qm list | awk 'NR>1{print \$1}'); do \
  name=\$(qm config \$id | grep '^name:' | awk '{print \$2}'); \
  onboot=\$(qm config \$id | grep '^onboot:'); \
  echo \"VM \$id (\$name): \${onboot:-onboot NOT SET}\"; \
done"
```

Check LXCs:

```bash
ssh proxmox "for id in \$(pct list | awk 'NR>1{print \$1}'); do \
  name=\$(pct config \$id | grep '^hostname:' | awk '{print \$2}'); \
  onboot=\$(pct config \$id | grep '^onboot:'); \
  echo \"LXC \$id (\$name): \${onboot:-onboot NOT SET}\"; \
done"
```
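To surface only the guests that would not auto-start, filter the audit output for anything that is not explicitly `onboot: 1`. A small sketch using sample lines in the format the loops above produce:

```shell
# sample audit output (same format as the qm/pct loops above)
audit='VM 106 (docker-home): onboot: 1
VM 109 (homeassistant): onboot NOT SET
LXC 303 (mcp-gateway): onboot: 0'

# keep only lines that are NOT "onboot: 1"; these need attention
printf '%s\n' "$audit" | grep -v 'onboot: 1$'
```

Piping the real audit through the same `grep -v` yields the fix list directly. Note that LXC 303 is expected to appear, since its `onboot: 0` is intentional.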
Audit results (2026-04-03):

| ID | Name | Type | onboot | Status |
|---|---|---|---|---|
| 106 | docker-home | VM | 1 | OK |
| 109 | homeassistant | VM | 1 | OK (fixed 2026-04-03) |
| 110 | discord-bots | VM | 1 | OK |
| 112 | databases-bots | VM | 1 | OK |
| 115 | docker-sba | VM | 1 | OK |
| 116 | docker-home-servers | VM | 1 | OK |
| 210 | docker-n8n-lxc | LXC | 1 | OK |
| 221 | arr-stack | LXC | 1 | OK (fixed 2026-04-03) |
| 222 | memos | LXC | 1 | OK |
| 223 | foundry-lxc | LXC | 1 | OK (fixed 2026-04-03) |
| 225 | gitea | LXC | 1 | OK |
| 227 | uptime-kuma | LXC | 1 | OK |
| 301 | claude-discord-coordinator | LXC | 1 | OK |
| 302 | claude-runner | LXC | 1 | OK |
| 303 | mcp-gateway | LXC | 0 | Intentional (on-demand) |
| 304 | ansible-controller | LXC | 1 | OK |
If any production guest is missing `onboot: 1`:

```bash
ssh proxmox "qm set <VMID> --onboot 1"   # for VMs
ssh proxmox "pct set <CTID> --onboot 1"  # for LXCs
```
## Shutdown Order (Dependency-Aware)
Reverse of the validated startup sequence. Stop consumers before their dependencies. Each tier polls per-guest status rather than using fixed waits.
**Tier 4 — Media & Others (no downstream dependents)**

- VM 109 homeassistant
- LXC 221 arr-stack
- LXC 222 memos
- LXC 223 foundry-lxc
- LXC 302 claude-runner

**Tier 3 — Applications (depend on databases + infra)**

- VM 115 docker-sba (Paper Dynasty, Major Domo)
- VM 110 discord-bots
- LXC 301 claude-discord-coordinator

**Tier 2 — Infrastructure + DNS (depend on databases)**

- VM 106 docker-home (Pi-hole 1, NPM)
- LXC 225 gitea
- LXC 210 docker-n8n-lxc
- LXC 227 uptime-kuma
- VM 116 docker-home-servers

**Tier 1 — Databases (no dependencies, shut down last)**

- VM 112 databases-bots (force-stop after 90s if ACPI ignored)

→ LXC 304 issues a fire-and-forget reboot to the Proxmox host, then is killed
Known quirks:

- VM 112 (databases-bots) may ignore ACPI shutdown — the playbook force-stops it after 90s
- VM 109 (homeassistant) is self-managed via HA Supervisor, excluded from the Ansible inventory
- LXC 303 (mcp-gateway) has `onboot: 0` and is operator-managed — not included in shutdown/startup. If it was running before maintenance, bring it up manually afterward
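The per-guest polling that replaces fixed waits can be sketched as a generic loop. Illustrative only: a plain `echo` stands in here for the real `ssh proxmox "qm status <id>"` check, and the playbook itself presumably implements this as an Ansible retry loop rather than raw shell:

```shell
# Poll a status command until it reports the desired state, or time out.
wait_for_status() {            # wait_for_status <want> <timeout-s> <check-cmd...>
  want=$1; timeout=$2; shift 2
  elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    [ "$("$@")" = "$want" ] && return 0
    sleep 1
    elapsed=$((elapsed + 1))
  done
  return 1   # timed out; caller escalates (e.g., force-stop VM 112 after 90s)
}

# demo with a stub in place of the real status check:
wait_for_status "status: stopped" 5 echo "status: stopped" && echo "guest stopped"
```

Polling per guest lets fast guests finish a tier quickly while still bounding the slow ones, instead of padding every tier with a worst-case sleep.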
## Startup Order (Staggered)

After the Proxmox host reboots, LXC 304 auto-starts and `ansible-post-reboot.service` waits 120s before running the controlled startup:
**Tier 1 — Databases first**

- VM 112 databases-bots

→ wait 30s for the DB to accept connections

**Tier 2 — Infrastructure + DNS**

- VM 106 docker-home (Pi-hole 1, NPM)
- LXC 225 gitea
- LXC 210 docker-n8n-lxc
- LXC 227 uptime-kuma
- VM 116 docker-home-servers

→ wait 30s

**Tier 3 — Applications**

- VM 115 docker-sba
- VM 110 discord-bots
- LXC 301 claude-discord-coordinator

→ wait 30s

**Pi-hole fix** — restart the container via SSH to clear the UDP DNS bug:

```bash
ssh docker-home "docker restart pihole"
```

→ wait 10s
**Tier 4 — Media & Others**

- VM 109 homeassistant
- LXC 221 arr-stack
- LXC 222 memos
- LXC 223 foundry-lxc
- LXC 302 claude-runner
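The staggered startup above reduces to a data-driven loop. A sketch with the guest IDs from this runbook; `start_guest` and `tier` are hypothetical stand-ins for the playbook's real `qm start`/`pct start` tasks, and the delays are printed rather than slept:

```shell
start_guest() { echo "starting guest $1"; }   # stand-in for qm/pct start via SSH

tier() {                     # tier "<guest ids>" <delay-seconds>
  for id in $1; do start_guest "$id"; done
  echo "waiting ${2}s"       # the real playbook sleeps here
}

tier "112" 30                      # Tier 1 (databases)
tier "106 225 210 227 116" 30      # Tier 2 (infrastructure + DNS)
tier "115 110 301" 30              # Tier 3 (applications)
echo "restarting pihole"           # UDP DNS fix, then a short settle
tier "109 221 222 223 302" 0       # Tier 4 (media & others)
```

Keeping the tiers as data makes adding or decommissioning a guest a one-line change instead of a playbook restructure.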
## Post-Reboot Validation

- Pi-hole 1 DNS resolving: `ssh docker-home "docker exec pihole dig google.com @127.0.0.1"`
- Gitea accessible: `curl -sf https://git.manticorum.com/api/v1/version`
- n8n healthy: `ssh docker-n8n-lxc "docker ps --filter name=n8n --format '{{.Status}}'"`
- Discord bots responding (check Discord)
- Uptime Kuma dashboard green: `curl -sf http://10.10.0.227:3001/api/status-page/homelab`
- Home Assistant running: `curl -sf http://10.10.0.109:8123/api/ -H 'Authorization: Bearer <token>'`
- Maintenance snapshots cleaned up (auto, 7-day retention)
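The checklist can be scripted so one failing probe doesn't hide the rest. A hypothetical helper (the real probes are the `ssh`/`curl` commands above; `true`/`false` stand in for them here):

```shell
# Run a named check, report PASS/FAIL, and keep going.
run_check() {
  name=$1; shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS: $name"
  else
    echo "FAIL: $name"
  fi
}

# demo with stand-in commands:
run_check "probe that succeeds" true
run_check "probe that fails" false
```

In practice each checklist line becomes a `run_check "gitea" curl -sf https://git.manticorum.com/api/v1/version` style call, and a `grep -c '^FAIL'` over the output decides whether the maintenance window is done.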
## Automation

### Ansible Playbooks

Both playbooks are located at `/opt/ansible/playbooks/` on LXC 304.

```bash
# Dry run — shutdown only
ssh ansible "ansible-playbook /opt/ansible/playbooks/monthly-reboot.yml --check"

# Manual full execution — shutdown + reboot
ssh ansible "ansible-playbook /opt/ansible/playbooks/monthly-reboot.yml"

# Manual post-reboot startup (if automatic startup failed)
ssh ansible "ansible-playbook /opt/ansible/playbooks/post-reboot-startup.yml"

# Shutdown only — skip the host reboot
ssh ansible "ansible-playbook /opt/ansible/playbooks/monthly-reboot.yml --tags shutdown"
```
### Systemd Units (on LXC 304)

| Unit | Purpose | Schedule |
|---|---|---|
| `ansible-monthly-reboot.timer` | Triggers shutdown + reboot playbook | 1st Sunday of month, 08:00 UTC |
| `ansible-monthly-reboot.service` | Runs `monthly-reboot.yml` | Activated by timer |
| `ansible-post-reboot.service` | Runs `post-reboot-startup.yml` | On boot (`multi-user.target`), only if uptime < 10 min |
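The "only if uptime < 10 min" gate is not a built-in systemd condition; one common way to implement it is an `ExecCondition` that inspects `/proc/uptime`. A sketch of what `ansible-post-reboot.service` could look like under that assumption; this is an illustrative layout, not necessarily the unit actually deployed on LXC 304:

```ini
[Unit]
Description=Post-reboot guest startup (post-reboot-startup.yml)
After=network-online.target
Wants=network-online.target

[Service]
Type=oneshot
# Skip if the system has been up 10+ minutes, so restarting this unit
# later does not re-run the startup sequence (assumed implementation).
ExecCondition=/bin/sh -c 'test $(cut -d. -f1 /proc/uptime) -lt 600'
# Wait for the Proxmox API to stabilize before orchestrating guests.
ExecStartPre=/bin/sleep 120
ExecStart=/usr/bin/ansible-playbook /opt/ansible/playbooks/post-reboot-startup.yml

[Install]
WantedBy=multi-user.target
```

The timer's first-Sunday schedule is expressible in systemd calendar syntax as `OnCalendar=Sun *-*-01..07 08:00:00 UTC`, since a Sunday falling within the first seven days of a month is necessarily the first Sunday.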
```bash
# Check timer status
ssh ansible "systemctl status ansible-monthly-reboot.timer"

# Next scheduled run
ssh ansible "systemctl list-timers ansible-monthly-reboot.timer"

# Check post-reboot service status
ssh ansible "systemctl status ansible-post-reboot.service"

# Disable for a month (e.g., during an incident)
ssh ansible "systemctl stop ansible-monthly-reboot.timer"
```
### Deployment (one-time setup on LXC 304)

```bash
# Copy playbooks
scp ansible/playbooks/monthly-reboot.yml ansible:/opt/ansible/playbooks/
scp ansible/playbooks/post-reboot-startup.yml ansible:/opt/ansible/playbooks/

# Copy and enable systemd units
scp ansible/systemd/ansible-monthly-reboot.timer ansible:/etc/systemd/system/
scp ansible/systemd/ansible-monthly-reboot.service ansible:/etc/systemd/system/
scp ansible/systemd/ansible-post-reboot.service ansible:/etc/systemd/system/
ssh ansible "sudo systemctl daemon-reload && \
  sudo systemctl enable --now ansible-monthly-reboot.timer && \
  sudo systemctl enable ansible-post-reboot.service"

# Verify SSH key access from LXC 304 to docker-home (needed for Pi-hole restart)
ssh ansible "ssh -o BatchMode=yes docker-home 'echo ok'"
```
## Rollback

If a guest fails to start after reboot:

- Check the Proxmox web UI or `pvesh get /nodes/proxmox/qemu/<VMID>/status/current`
- Review guest logs: `ssh proxmox "journalctl -u pve-guests -n 50"`
- Manual start: `ssh proxmox "pvesh create /nodes/proxmox/qemu/<VMID>/status/start"`
- If the guest is corrupted, restore from the pre-reboot Proxmox snapshot
- If post-reboot startup failed entirely, run it manually: `ssh ansible "ansible-playbook /opt/ansible/playbooks/post-reboot-startup.yml"`
## Related Documentation
- Ansible Controller Setup — LXC 304 details and inventory
- Proxmox 7→9 Upgrade Plan — original startup order and Phase 1 lessons
- VM Decommission Runbook — removing VMs from the rotation