From 64f299aa1af05d75c2bfa2e6b453a9350a42a01e Mon Sep 17 00:00:00 2001
From: Cal Corum
Date: Fri, 3 Apr 2026 16:00:22 -0500
Subject: [PATCH] =?UTF-8?q?docs:=20sync=20KB=20=E2=80=94=20maintenance-reb?=
 =?UTF-8?q?oot.md?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
---
 server-configs/proxmox/maintenance-reboot.md | 210 +++++++++++++++++++
 1 file changed, 210 insertions(+)
 create mode 100644 server-configs/proxmox/maintenance-reboot.md

diff --git a/server-configs/proxmox/maintenance-reboot.md b/server-configs/proxmox/maintenance-reboot.md
new file mode 100644
index 0000000..0c72d5a
--- /dev/null
+++ b/server-configs/proxmox/maintenance-reboot.md
@@ -0,0 +1,210 @@
---
title: "Proxmox Monthly Maintenance Reboot"
description: "Runbook for the first-Sunday-of-the-month Proxmox host reboot — dependency-aware shutdown/startup order, validation checklist, and Ansible automation."
type: runbook
domain: server-configs
tags: [proxmox, maintenance, reboot, ansible, operations, systemd]
---

# Proxmox Monthly Maintenance Reboot

## Overview

| Detail | Value |
|--------|-------|
| **Schedule** | 1st Sunday of every month, 3:00 AM ET (08:00 UTC) |
| **Expected downtime** | ~15 minutes (host reboot + VM/LXC startup) |
| **Orchestration** | Ansible playbook on LXC 304 (ansible-controller) |
| **Calendar** | Google Calendar recurring event: "Proxmox Monthly Maintenance Reboot" |
| **HA DNS** | ubuntu-manticore (10.10.0.226) provides Pi-hole 2 during Proxmox downtime |

## Why

- Kernel updates accumulate but never take effect until the host reboots
- Long uptimes allow memory leaks and process state drift (e.g., avahi busy-loops)
- Validates that all VMs/LXCs auto-start cleanly with `onboot: 1`

## Prerequisites (Before Maintenance)

- [ ] Verify no active Tdarr transcodes on ubuntu-manticore
- [ ] Verify no running database backups
- [ ] Switch workstation DNS to `1.1.1.1` (Pi-hole 1 on VM 106 will be offline)
- [ ] Confirm ubuntu-manticore Pi-hole 2 is healthy: `ssh manticore "docker exec pihole pihole status"`

## `onboot` Audit

All production VMs and LXCs must have `onboot: 1` so they restart automatically if the playbook fails mid-sequence.

**Check VMs:**
```bash
ssh proxmox "for id in \$(qm list | awk 'NR>1{print \$1}'); do \
  name=\$(qm config \$id | grep '^name:' | awk '{print \$2}'); \
  onboot=\$(qm config \$id | grep '^onboot:'); \
  echo \"VM \$id (\$name): \${onboot:-onboot NOT SET}\"; \
done"
```

**Check LXCs:**
```bash
ssh proxmox "for id in \$(pct list | awk 'NR>1{print \$1}'); do \
  name=\$(pct config \$id | grep '^hostname:' | awk '{print \$2}'); \
  onboot=\$(pct config \$id | grep '^onboot:'); \
  echo \"LXC \$id (\$name): \${onboot:-onboot NOT SET}\"; \
done"
```

**Audit results (2026-04-03):**

| ID | Name | Type | `onboot` | Action needed |
|----|------|------|----------|---------------|
| 106 | docker-home | VM | 1 | OK |
| 109 | homeassistant | VM | NOT SET | **Add `onboot: 1`** |
| 110 | discord-bots | VM | 1 | OK |
| 112 | databases-bots | VM | 1 | OK |
| 115 | docker-sba | VM | 1 | OK |
| 116 | docker-home-servers | VM | 1 | OK |
| 210 | docker-n8n-lxc | LXC | 1 | OK |
| 221 | arr-stack | LXC | NOT SET | **Add `onboot: 1`** |
| 222 | memos | LXC | 1 | OK |
| 223 | foundry-lxc | LXC | NOT SET | **Add `onboot: 1`** |
| 225 | gitea | LXC | 1 | OK |
| 227 | uptime-kuma | LXC | 1 | OK |
| 301 | claude-discord-coordinator | LXC | 1 | OK |
| 302 | claude-runner | LXC | 1 | OK |
| 303 | mcp-gateway | LXC | 0 | Intentional (on-demand) |
| 304 | ansible-controller | LXC | 1 | OK |

**Fix missing `onboot`:**
```bash
ssh proxmox "qm set 109 --onboot 1"
ssh proxmox "pct set 221 --onboot 1"
ssh proxmox "pct set 223 --onboot 1"
```

## Shutdown Order (Dependency-Aware)

Reverse of the validated startup sequence. Stop consumers before their dependencies.
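In shell terms this ordering is just a tier-by-tier loop. The following dry-run sketch only prints the shutdown commands rather than running them; the guest IDs mirror the audit table, while the 120 s ACPI timeout and the decision to leave out LXC 303 and the controller are assumptions, not the playbook's actual logic:

```bash
#!/usr/bin/env bash
# Dry run: print the dependency-aware shutdown plan without touching any guest.
# v: = VM (qm), c: = LXC (pct). LXC 303 is usually stopped already, and
# LXC 304 (the controller) shuts itself down after the loop.
set -euo pipefail

tiers=(
  "v:109 c:221 c:222 c:223 c:302"   # Tier 4 - media & others
  "v:115 v:110 c:301"               # Tier 3 - applications
  "v:106 c:225 c:210 c:227 v:116"   # Tier 2 - infrastructure + DNS
  "v:112"                           # Tier 1 - databases, last
)

for tier in "${tiers[@]}"; do
  for guest in $tier; do            # unquoted on purpose: split into IDs
    id=${guest#*:}
    case $guest in
      v:*) echo "qm shutdown $id --timeout 120" ;;  # ACPI; forceStop is a manual fallback
      c:*) echo "pct shutdown $id" ;;
    esac
  done
done
```

Swapping the `echo`s for the real commands (or piping the output to `bash` on the host) would execute the plan; the Ansible playbook below remains the supported path.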
+ +``` +Tier 4 — Media & Others (no downstream dependents) + VM 109 homeassistant + LXC 221 arr-stack + LXC 222 memos + LXC 223 foundry-lxc + LXC 302 claude-runner + LXC 303 mcp-gateway (if running) + +Tier 3 — Applications (depend on databases + infra) + VM 115 docker-sba (Paper Dynasty, Major Domo) + VM 110 discord-bots + LXC 301 claude-discord-coordinator + +Tier 2 — Infrastructure + DNS (depend on databases) + VM 106 docker-home (Pi-hole 1, NPM) + LXC 225 gitea + LXC 210 docker-n8n-lxc + LXC 227 uptime-kuma + VM 116 docker-home-servers + +Tier 1 — Databases (no dependencies, shut down last) + VM 112 databases-bots + +Tier 0 — Ansible controller shuts itself down last + LXC 304 ansible-controller + +→ Proxmox host reboots +``` + +**Known quirks:** +- VM 112 (databases-bots) may ignore ACPI shutdown — use `--forceStop` after timeout +- VM 109 (homeassistant) is self-managed via HA Supervisor, excluded from Ansible inventory + +## Startup Order (Staggered) + +After the Proxmox host reboots, guests with `onboot: 1` will auto-start. 
The Ansible playbook overrides this with a controlled sequence:

```
Tier 1 — Databases first
  VM 112  databases-bots
  → wait 30s for DB to accept connections

Tier 2 — Infrastructure + DNS
  VM 106  docker-home (Pi-hole 1, NPM)
  LXC 225 gitea
  LXC 210 docker-n8n-lxc
  LXC 227 uptime-kuma
  VM 116  docker-home-servers
  → wait 30s

Tier 3 — Applications
  VM 115  docker-sba
  VM 110  discord-bots
  LXC 301 claude-discord-coordinator
  → wait 30s

Pi-hole fix — restart container to clear UDP DNS bug
  qm guest exec 106 -- docker restart pihole
  → wait 10s

Tier 4 — Media & Others
  VM 109  homeassistant
  LXC 221 arr-stack
  LXC 222 memos
  LXC 223 foundry-lxc
```

## Post-Reboot Validation

- [ ] Pi-hole 1 DNS resolving: `ssh docker-home "docker exec pihole dig google.com @127.0.0.1"`
- [ ] Gitea accessible: `curl -sf https://git.manticorum.com/api/v1/version`
- [ ] n8n healthy: `ssh docker-n8n-lxc "docker ps --filter name=n8n --format '{{.Status}}'"`
- [ ] Discord bots responding (check Discord)
- [ ] Uptime Kuma dashboard green: `curl -sf http://10.10.0.227:3001/api/status-page/homelab`
- [ ] Home Assistant running: `curl -sf http://10.10.0.109:8123/api/ -H 'Authorization: Bearer '`
- [ ] Switch workstation DNS back from `1.1.1.1` to Pi-hole

## Automation

### Ansible Playbook

Located at `/opt/ansible/playbooks/monthly-reboot.yml` on LXC 304.
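The production playbook is not reproduced here; as a rough sketch of its shape, the staggered startup could be expressed with plain `command` tasks and `pause` steps. Task names, the use of `ansible.builtin.command` instead of dedicated Proxmox modules, and the tier grouping shown are assumptions:

```yaml
# Illustrative fragment only; the real monthly-reboot.yml on LXC 304 is authoritative.
- name: Staggered guest startup (sketch)
  hosts: proxmox
  gather_facts: false
  tasks:
    - name: "Tier 1: start the database VM"
      ansible.builtin.command: qm start 112

    - name: Wait for the database to accept connections
      ansible.builtin.pause:
        seconds: 30

    - name: "Tier 2: start infrastructure and DNS guests"
      ansible.builtin.command: "{{ item }}"
      loop:
        - qm start 106
        - pct start 225
        - pct start 210
        - pct start 227
        - qm start 116

    - name: Wait before the application tier
      ansible.builtin.pause:
        seconds: 30

    - name: Restart Pi-hole inside VM 106 to clear the UDP DNS bug
      ansible.builtin.command: qm guest exec 106 -- docker restart pihole
```

The shutdown half mirrors this in reverse order, which is why tagging tasks (see `--tags shutdown` below) keeps the two phases independently runnable.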

```bash
# Dry run (check mode)
ssh ansible "ansible-playbook /opt/ansible/playbooks/monthly-reboot.yml --check"

# Manual execution
ssh ansible "ansible-playbook /opt/ansible/playbooks/monthly-reboot.yml"

# Shutdown phase only (skip the reboot)
ssh ansible "ansible-playbook /opt/ansible/playbooks/monthly-reboot.yml --tags shutdown"
```

### Systemd Timer

The playbook runs automatically via a systemd timer on LXC 304:

```bash
# Check timer status
ssh ansible "systemctl status ansible-monthly-reboot.timer"

# Next scheduled run
ssh ansible "systemctl list-timers ansible-monthly-reboot.timer"

# Skip the next run (e.g., during an incident). Note that a stopped timer
# starts again when the LXC reboots; use "systemctl disable --now" to keep it off.
ssh ansible "systemctl stop ansible-monthly-reboot.timer"
```

## Rollback

If a guest fails to start after the reboot:

1. Check the Proxmox web UI or `pvesh get /nodes/proxmox/qemu/<vmid>/status/current`
2. Review guest logs: `ssh proxmox "journalctl -u pve-guests -n 50"`
3. Manual start: `ssh proxmox "pvesh create /nodes/proxmox/qemu/<vmid>/status/start"`
4. If the guest is corrupted, restore from the pre-reboot Proxmox snapshot

## Related Documentation

- [Ansible Controller Setup](../../vm-management/ansible-controller-setup.md) — LXC 304 details and inventory
- [Proxmox 7→9 Upgrade Plan](../../vm-management/proxmox-upgrades/proxmox-7-to-9-upgrade-plan.md) — original startup order and Phase 1 lessons
- [VM Decommission Runbook](../../vm-management/vm-decommission-runbook.md) — removing VMs from the rotation
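
## Appendix: Scripted Post-Reboot Checks

The Post-Reboot Validation checklist lends itself to a small wrapper script. A sketch, using hosts and URLs from the checklist; the PASS/FAIL harness and the 5-second timeouts are assumptions:

```bash
#!/usr/bin/env bash
# Run a subset of the post-reboot checks and print PASS/FAIL per check.
# Check commands come from the validation checklist; the wrapper is illustrative.
set -uo pipefail   # no -e: keep going so every check gets reported

declare -A checks=(
  [pihole-dns]="ssh -o BatchMode=yes -o ConnectTimeout=5 docker-home 'docker exec pihole dig google.com @127.0.0.1'"
  [gitea]="curl -sf -m 5 https://git.manticorum.com/api/v1/version"
  [uptime-kuma]="curl -sf -m 5 http://10.10.0.227:3001/api/status-page/homelab"
)

failed=0
run_checks() {
  local name
  for name in "${!checks[@]}"; do
    if bash -c "${checks[$name]}" >/dev/null 2>&1; then
      echo "PASS  $name"
    else
      echo "FAIL  $name"
      failed=1
    fi
  done
}

run_checks
[ "$failed" -eq 0 ] || echo "one or more checks failed" >&2
```

This only covers the scriptable checks; "Discord bots responding" and switching workstation DNS back remain manual steps.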