docs: sync KB — maintenance-reboot.md

2026-04-03 16:00:22 -05:00 · 2026-04-03 16:00:22 -05:00 · 64f299aa1a
commit 64f299aa1a
parent a9a778f53c
1 changed file with 210 additions and 0 deletions
--- a/server-configs/proxmox/maintenance-reboot.md
+++ b/server-configs/proxmox/maintenance-reboot.md
@@ -0,0 +1,210 @@
+---
+title: "Proxmox Monthly Maintenance Reboot"
+description: "Runbook for the first-Sunday-of-the-month Proxmox host reboot — dependency-aware shutdown/startup order, validation checklist, and Ansible automation."
+type: runbook
+domain: server-configs
+tags: [proxmox, maintenance, reboot, ansible, operations, systemd]
+---
+
+# Proxmox Monthly Maintenance Reboot
+
+## Overview
+
+| Detail | Value |
+|--------|-------|
+| **Schedule** | 1st Sunday of every month, 3:00 AM ET (07:00 UTC during EDT, 08:00 UTC during EST) |
+| **Expected downtime** | ~15 minutes (host reboot + VM/LXC startup) |
+| **Orchestration** | Ansible playbook on LXC 304 (ansible-controller) |
+| **Calendar** | Google Calendar recurring event: "Proxmox Monthly Maintenance Reboot" |
+| **HA DNS** | ubuntu-manticore (10.10.0.226) provides Pi-hole 2 during Proxmox downtime |
+
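The "1st Sunday of every month" rule can be checked in a few lines of shell, which may be useful as a guard at the top of a wrapper script. This is a minimal sketch and not part of the existing playbook: `is_first_sunday` is a hypothetical helper name, and it assumes a `date` that accepts `-d` (GNU coreutils or BusyBox).

```shell
# Hypothetical helper: succeed only if the given date (YYYY-MM-DD) is the
# first Sunday of its month. Assumes `date -d` is available.
is_first_sunday() {
  fs_dow=$(date -u -d "$1" +%u) || return 2   # 1 = Monday ... 7 = Sunday
  fs_dom=$(date -u -d "$1" +%d) || return 2   # day of month, zero-padded
  fs_dom=${fs_dom#0}                          # strip the zero pad
  # The first Sunday is the only Sunday that falls on day 1 through 7.
  [ "$fs_dow" -eq 7 ] && [ "$fs_dom" -le 7 ]
}
```

For example, `is_first_sunday "$(date -u +%F)" || exit 0` near the top of a script would skip the run on any other day.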
+## Why
+
+- Installed kernel updates take effect only after a reboot; without one they just accumulate
+- Long uptimes let memory leaks and process state drift build up (e.g., avahi busy-loops)
+- Each reboot validates that all VMs/LXCs with `onboot: 1` auto-start cleanly
+
+## Prerequisites (Before Maintenance)
+
+- [ ] Verify no active Tdarr transcodes on ubuntu-manticore
+- [ ] Verify no running database backups
+- [ ] Switch workstation DNS to `1.1.1.1` (Pi-hole 1 on VM 106 will be offline)
+- [ ] Confirm ubuntu-manticore Pi-hole 2 is healthy: `ssh manticore "docker exec pihole pihole status"`
+
+## `onboot` Audit
+
+All production VMs and LXCs must have `onboot: 1` so they restart automatically if the playbook fails mid-sequence.
+
+**Check VMs:**
+```bash
+ssh proxmox "for id in \$(qm list | awk 'NR>1{print \$1}'); do \
+  name=\$(qm config \$id | grep '^name:' | awk '{print \$2}'); \
+  onboot=\$(qm config \$id | grep '^onboot:'); \
+  echo \"VM \$id (\$name): \${onboot:-onboot NOT SET}\"; \
+done"
+```
+
+**Check LXCs:**
+```bash
+ssh proxmox "for id in \$(pct list | awk 'NR>1{print \$1}'); do \
+  name=\$(pct config \$id | grep '^hostname:' | awk '{print \$2}'); \
+  onboot=\$(pct config \$id | grep '^onboot:'); \
+  echo \"LXC \$id (\$name): \${onboot:-onboot NOT SET}\"; \
+done"
+```
+
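The `grep '^onboot:'` step used in both loops can be factored into one small filter. The sketch below is an illustration only: `onboot_status` is a hypothetical name, and it assumes the `key: value` line format that `qm config` and `pct config` print.

```shell
# Hypothetical filter: read `qm config` / `pct config` output on stdin and
# print the onboot value, or "NOT SET" when the key is absent.
onboot_status() {
  awk -F': *' '
    $1 == "onboot" { value = $2 }
    END { print (value == "" ? "NOT SET" : value) }
  '
}
```

It could be used as `ssh proxmox "qm config 109" | onboot_status`, which keeps the parsing on the workstation and avoids the escaped `\$` variables.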
+**Audit results (2026-04-03):**
+
+| ID | Name | Type | `onboot` | Action needed |
+|----|------|------|----------|---------------|
+| 106 | docker-home | VM | 1 | OK |
+| 109 | homeassistant | VM | NOT SET | **Add `onboot: 1`** |
+| 110 | discord-bots | VM | 1 | OK |
+| 112 | databases-bots | VM | 1 | OK |
+| 115 | docker-sba | VM | 1 | OK |
+| 116 | docker-home-servers | VM | 1 | OK |
+| 210 | docker-n8n-lxc | LXC | 1 | OK |
+| 221 | arr-stack | LXC | NOT SET | **Add `onboot: 1`** |
+| 222 | memos | LXC | 1 | OK |
+| 223 | foundry-lxc | LXC | NOT SET | **Add `onboot: 1`** |
+| 225 | gitea | LXC | 1 | OK |
+| 227 | uptime-kuma | LXC | 1 | OK |
+| 301 | claude-discord-coordinator | LXC | 1 | OK |
+| 302 | claude-runner | LXC | 1 | OK |
+| 303 | mcp-gateway | LXC | 0 | Intentional (on-demand) |
+| 304 | ansible-controller | LXC | 1 | OK |
+
+**Fix missing `onboot`:**
+```bash
+ssh proxmox "qm set 109 --onboot 1"
+ssh proxmox "pct set 221 --onboot 1"
+ssh proxmox "pct set 223 --onboot 1"
+```
+
+## Shutdown Order (Dependency-Aware)
+
+Reverse of the validated startup sequence. Stop consumers before their dependencies.
+
+```
+Tier 4 — Media & Others (no downstream dependents)
+  VM 109  homeassistant
+  LXC 221 arr-stack
+  LXC 222 memos
+  LXC 223 foundry-lxc
+  LXC 302 claude-runner
+  LXC 303 mcp-gateway (if running)
+
+Tier 3 — Applications (depend on databases + infra)
+  VM 115  docker-sba (Paper Dynasty, Major Domo)
+  VM 110  discord-bots
+  LXC 301 claude-discord-coordinator
+
+Tier 2 — Infrastructure + DNS (depend on databases)
+  VM 106  docker-home (Pi-hole 1, NPM)
+  LXC 225 gitea
+  LXC 210 docker-n8n-lxc
+  LXC 227 uptime-kuma
+  VM 116  docker-home-servers
+
+Tier 1 — Databases (no dependencies, shut down last)
+  VM 112  databases-bots
+
+Tier 0 — Ansible controller shuts itself down last
+  LXC 304 ansible-controller
+
+→ Proxmox host reboots
+```
+
+**Known quirks:**
+- VM 112 (databases-bots) may ignore ACPI shutdown; run `qm shutdown` with `--timeout` and `--forceStop 1` so the VM is stopped hard once the timeout expires
+- VM 109 (homeassistant) is self-managed via HA Supervisor, excluded from Ansible inventory
+
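One way to handle the VM 112 quirk is to let Proxmox do the escalation, since both `qm shutdown` and `pct shutdown` accept `--timeout` and `--forceStop`. The helper below only builds the command line so it can be reviewed or logged before it runs. `shutdown_cmd` and the 120-second default are assumptions, not values taken from the playbook.

```shell
# Hypothetical helper: print the shutdown command for one guest.
# usage: shutdown_cmd <vm|lxc> <id> [timeout-seconds]
shutdown_cmd() {
  sc_kind=$1; sc_id=$2; sc_timeout=${3:-120}   # 120s default is an assumption
  case "$sc_kind" in
    vm)  sc_tool=qm ;;
    lxc) sc_tool=pct ;;
    *)   echo "unknown guest type: $sc_kind" >&2; return 1 ;;
  esac
  # --forceStop 1 makes Proxmox stop the guest hard if the timeout expires.
  echo "$sc_tool shutdown $sc_id --timeout $sc_timeout --forceStop 1"
}
```

A call such as `ssh proxmox "$(shutdown_cmd vm 112 180)"` would then run the generated command on the host.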
+## Startup Order (Staggered)
+
+After the Proxmox host reboots, guests with `onboot: 1` will auto-start. The Ansible playbook overrides this with a controlled sequence:
+
+```
+Tier 1 — Databases first
+  VM 112  databases-bots
+  → wait 30s for DB to accept connections
+
+Tier 2 — Infrastructure + DNS
+  VM 106  docker-home (Pi-hole 1, NPM)
+  LXC 225 gitea
+  LXC 210 docker-n8n-lxc
+  LXC 227 uptime-kuma
+  VM 116  docker-home-servers
+  → wait 30s
+
+Tier 3 — Applications
+  VM 115  docker-sba
+  VM 110  discord-bots
+  LXC 301 claude-discord-coordinator
+  → wait 30s
+
+Pi-hole fix — restart container to clear UDP DNS bug
+  qm guest exec 106 -- docker restart pihole
+  → wait 10s
+
+Tier 4 — Media & Others
+  VM 109  homeassistant
+  LXC 221 arr-stack
+  LXC 222 memos
+  LXC 223 foundry-lxc
+```
+
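The fixed `wait 30s` pauses could be replaced or backed up by polling for readiness. The following is a sketch of that idea, not a description of what the playbook does today: `wait_until` is a hypothetical helper that retries a command until it succeeds or a time limit is reached.

```shell
# Hypothetical helper: retry a command until it succeeds or the limit is hit.
# usage: wait_until <limit-seconds> <interval-seconds> <command...>
wait_until() {
  wu_limit=$1; wu_interval=$2; shift 2
  [ "$wu_interval" -gt 0 ] || return 2   # a zero interval would never reach the limit
  wu_elapsed=0
  until "$@"; do
    wu_elapsed=$((wu_elapsed + wu_interval))
    # Give up once the next sleep would reach the limit.
    [ "$wu_elapsed" -ge "$wu_limit" ] && return 1
    sleep "$wu_interval"
  done
}
```

For the database tier this might look like `wait_until 60 5 nc -z <db-host> <db-port>`, where the host and port are placeholders because the runbook does not state them.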
+## Post-Reboot Validation
+
+- [ ] Pi-hole 1 DNS resolving: `ssh docker-home "docker exec pihole dig google.com @127.0.0.1"`
+- [ ] Gitea accessible: `curl -sf https://git.manticorum.com/api/v1/version`
+- [ ] n8n healthy: `ssh docker-n8n-lxc "docker ps --filter name=n8n --format '{{.Status}}'"`
+- [ ] Discord bots responding (check Discord)
+- [ ] Uptime Kuma dashboard green: `curl -sf http://10.10.0.227:3001/api/status-page/homelab`
+- [ ] Home Assistant running: `curl -sf http://10.10.0.109:8123/api/ -H 'Authorization: Bearer <token>'`
+- [ ] Switch workstation DNS back from `1.1.1.1` to Pi-hole
+
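The checklist items that are plain commands lend themselves to a small pass/fail wrapper. This is a sketch under assumptions: `run_check` and the `failures` counter are hypothetical names, and the manual items (checking Discord, switching DNS back) still need a human.

```shell
# Hypothetical wrapper: run one validation command and record the result.
failures=0

# usage: run_check <label> <command...>
run_check() {
  rc_label=$1; shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS  $rc_label"
  else
    echo "FAIL  $rc_label"
    failures=$((failures + 1))   # counted so a script can exit non-zero at the end
  fi
}
```

A line such as `run_check "Gitea API" curl -sf https://git.manticorum.com/api/v1/version` reuses the command from the checklist, and ending the script with `[ "$failures" -eq 0 ]` gives a single exit status for the whole validation.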
+## Automation
+
+### Ansible Playbook
+
+Located at `/opt/ansible/playbooks/monthly-reboot.yml` on LXC 304.
+
+```bash
+# Dry run (check mode)
+ssh ansible "ansible-playbook /opt/ansible/playbooks/monthly-reboot.yml --check"
+
+# Manual execution
+ssh ansible "ansible-playbook /opt/ansible/playbooks/monthly-reboot.yml"
+
+# Limit to shutdown only (skip reboot)
+ssh ansible "ansible-playbook /opt/ansible/playbooks/monthly-reboot.yml --tags shutdown"
+```
+
+### Systemd Timer
+
+The playbook runs automatically via systemd timer on LXC 304:
+
+```bash
+# Check timer status
+ssh ansible "systemctl status ansible-monthly-reboot.timer"
+
+# Next scheduled run
+ssh ansible "systemctl list-timers ansible-monthly-reboot.timer"
+
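+# Disable for a month (e.g., during an incident); a plain `stop` is undone
+# the next time LXC 304 boots if the timer is still enabled
+ssh ansible "systemctl disable --now ansible-monthly-reboot.timer"
+
+# Re-enable afterwards
+ssh ansible "systemctl enable --now ansible-monthly-reboot.timer"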
+```
+
+## Rollback
+
+If a guest fails to start after reboot:
+1. Check the Proxmox web UI or run `ssh proxmox "pvesh get /nodes/proxmox/qemu/<VMID>/status/current"` (for an LXC, replace `qemu` with `lxc`)
+2. Review guest logs: `ssh proxmox "journalctl -u pve-guests -n 50"`
+3. Manual start: `ssh proxmox "pvesh create /nodes/proxmox/qemu/<VMID>/status/start"` (`lxc` instead of `qemu` for containers)
+4. If guest is corrupted, restore from the pre-reboot Proxmox snapshot
+
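The `pvesh` paths in steps 1 and 3 differ between VMs and containers in only one segment (`qemu` versus `lxc`). A tiny helper can make that explicit. `guest_api_path` is a hypothetical name, and the node name `proxmox` is the one used in the commands above.

```shell
# Hypothetical helper: build the Proxmox API path for a guest status call.
# usage: guest_api_path <node> <vm|lxc> <id> <current|start>
guest_api_path() {
  case "$2" in
    vm)  gp_kind=qemu ;;   # VMs live under /qemu
    lxc) gp_kind=lxc ;;    # containers live under /lxc
    *)   echo "unknown guest type: $2" >&2; return 1 ;;
  esac
  echo "/nodes/$1/$gp_kind/$3/status/$4"
}
```

For example, `ssh proxmox "pvesh create $(guest_api_path proxmox lxc 221 start)"` would start the arr-stack container by hand.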
+## Related Documentation
+
+- [Ansible Controller Setup](../../vm-management/ansible-controller-setup.md) — LXC 304 details and inventory
+- [Proxmox 7→9 Upgrade Plan](../../vm-management/proxmox-upgrades/proxmox-7-to-9-upgrade-plan.md) — original startup order and Phase 1 lessons
+- [VM Decommission Runbook](../../vm-management/vm-decommission-runbook.md) — removing VMs from the rotation