
---
title: Proxmox Monthly Maintenance Reboot
description: "Runbook for the first-Sunday-of-the-month Proxmox host reboot — dependency-aware shutdown/startup order, validation checklist, and Ansible automation."
type: runbook
domain: server-configs
tags:
  - proxmox
  - maintenance
  - reboot
  - ansible
  - operations
  - systemd
---
# Proxmox Monthly Maintenance Reboot

## Overview

| Detail | Value |
|--------|-------|
| Schedule | 1st Sunday of every month, 3:00 AM ET (08:00 UTC) |
| Expected downtime | ~15 minutes (host reboot + VM/LXC startup) |
| Orchestration | Ansible playbook on LXC 304 (ansible-controller) |
| Calendar | Google Calendar recurring event: "Proxmox Monthly Maintenance Reboot" |
| HA DNS | ubuntu-manticore (10.10.0.226) provides Pi-hole 2 during Proxmox downtime |

## Why

- Kernel updates accumulate and never take effect until the host reboots
- Long uptimes allow memory leaks and process-state drift (e.g., avahi busy-loops)
- Validates that all VMs/LXCs auto-start cleanly with `onboot: 1`

## Prerequisites (Before Maintenance)

- Verify no active Tdarr transcodes on ubuntu-manticore
- Verify no running database backups
- Switch workstation DNS to 1.1.1.1 (Pi-hole 1 on VM 106 will be offline)
- Confirm ubuntu-manticore Pi-hole 2 is healthy: `ssh manticore "docker exec pihole pihole status"`
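The two scriptable checks in this list can be bundled into one pre-flight function. A sketch, using the `manticore` ssh alias from this runbook; detecting Tdarr transcodes by looking for a running ffmpeg process is a stand-in assumption (the runbook does not say how to check):

```shell
#!/usr/bin/env bash
# Hedged pre-flight sketch. Fails (nonzero) if Pi-hole 2 is unhealthy or a
# transcode appears to be running; the ffmpeg check is an assumption.

preflight() {
  local rc=0

  # Pi-hole 2 must be healthy before Pi-hole 1 (VM 106) goes down with the host.
  if ! ssh manticore "docker exec pihole pihole status" | grep -qi enabled; then
    echo "FAIL: Pi-hole 2 on ubuntu-manticore is not healthy"
    rc=1
  fi

  # Stand-in for "no active Tdarr transcodes": any ffmpeg process on manticore.
  if ssh manticore "pgrep -x ffmpeg" >/dev/null; then
    echo "FAIL: active transcode (ffmpeg) running on ubuntu-manticore"
    rc=1
  fi

  return $rc
}
```

Run `preflight` before handing control to the playbook; the database-backup and workstation-DNS steps stay manual.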

## `onboot` Audit

All production VMs and LXCs must have `onboot: 1` so they restart automatically if the playbook fails mid-sequence.

Check VMs:

```bash
ssh proxmox "for id in \$(qm list | awk 'NR>1{print \$1}'); do \
  name=\$(qm config \$id | grep '^name:' | awk '{print \$2}'); \
  onboot=\$(qm config \$id | grep '^onboot:'); \
  echo \"VM \$id (\$name): \${onboot:-onboot NOT SET}\"; \
done"
```

Check LXCs:

```bash
ssh proxmox "for id in \$(pct list | awk 'NR>1{print \$1}'); do \
  name=\$(pct config \$id | grep '^hostname:' | awk '{print \$2}'); \
  onboot=\$(pct config \$id | grep '^onboot:'); \
  echo \"LXC \$id (\$name): \${onboot:-onboot NOT SET}\"; \
done"
```
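Both loops can be folded into one function that prints only the guests needing attention. A sketch meant to run on the Proxmox host itself (where the `qm`/`pct` CLIs live), rather than through the ssh wrappers above:

```shell
# Hedged sketch: report only guests missing "onboot: 1" (unset or 0).
audit_onboot() {
  local tool=$1 id              # "qm" for VMs, "pct" for LXCs
  "$tool" list | awk 'NR>1{print $1}' | while read -r id; do
    "$tool" config "$id" | grep -q '^onboot: 1' ||
      echo "$tool $id: onboot missing or 0"
  done
}
```

On the host: `audit_onboot qm; audit_onboot pct`. Note this also flags intentional `onboot: 0` guests such as LXC 303, so review the output against the table below rather than fixing blindly.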

Audit results (2026-04-03):

| ID | Name | Type | onboot | Action needed |
|----|------|------|--------|---------------|
| 106 | docker-home | VM | 1 | OK |
| 109 | homeassistant | VM | NOT SET | Add `onboot: 1` |
| 110 | discord-bots | VM | 1 | OK |
| 112 | databases-bots | VM | 1 | OK |
| 115 | docker-sba | VM | 1 | OK |
| 116 | docker-home-servers | VM | 1 | OK |
| 210 | docker-n8n-lxc | LXC | 1 | OK |
| 221 | arr-stack | LXC | NOT SET | Add `onboot: 1` |
| 222 | memos | LXC | 1 | OK |
| 223 | foundry-lxc | LXC | NOT SET | Add `onboot: 1` |
| 225 | gitea | LXC | 1 | OK |
| 227 | uptime-kuma | LXC | 1 | OK |
| 301 | claude-discord-coordinator | LXC | 1 | OK |
| 302 | claude-runner | LXC | 1 | OK |
| 303 | mcp-gateway | LXC | 0 | Intentional (on-demand) |
| 304 | ansible-controller | LXC | 1 | OK |

Fix missing `onboot`:

```bash
ssh proxmox "qm set 109 --onboot 1"
ssh proxmox "pct set 221 --onboot 1"
ssh proxmox "pct set 223 --onboot 1"
```

## Shutdown Order (Dependency-Aware)

Reverse of the validated startup sequence: stop consumers before their dependencies.

```text
Tier 4 — Media & Others (no downstream dependents)
  VM 109  homeassistant
  LXC 221 arr-stack
  LXC 222 memos
  LXC 223 foundry-lxc
  LXC 302 claude-runner
  LXC 303 mcp-gateway (if running)

Tier 3 — Applications (depend on databases + infra)
  VM 115  docker-sba (Paper Dynasty, Major Domo)
  VM 110  discord-bots
  LXC 301 claude-discord-coordinator

Tier 2 — Infrastructure + DNS (depend on databases)
  VM 106  docker-home (Pi-hole 1, NPM)
  LXC 225 gitea
  LXC 210 docker-n8n-lxc
  LXC 227 uptime-kuma
  VM 116  docker-home-servers

Tier 1 — Databases (no dependencies, shut down last)
  VM 112  databases-bots

Tier 0 — Ansible controller shuts itself down last
  LXC 304 ansible-controller

→ Proxmox host reboots
```

Known quirks:

- VM 112 (databases-bots) may ignore ACPI shutdown — use `--forceStop` after the timeout
- VM 109 (homeassistant) is self-managed via the HA Supervisor and excluded from the Ansible inventory
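The tier sequence above can be driven from data with a single loop. A shell sketch of what the playbook's shutdown phase does; guest IDs and tier order come from this runbook, while the 120-second timeout and applying the `--forceStop` fallback to every guest (not just VM 112) are simplifying assumptions:

```shell
#!/usr/bin/env bash
# Hedged sketch: tiered shutdown as data + one loop.

# "qm:<id>" for VMs, "pct:<id>" for LXCs, ordered tier 4 -> tier 1.
TIERS=(
  "qm:109 pct:221 pct:222 pct:223 pct:302 pct:303"  # Tier 4 - media & others
  "qm:115 qm:110 pct:301"                           # Tier 3 - applications
  "qm:106 pct:225 pct:210 pct:227 qm:116"           # Tier 2 - infra + DNS
  "qm:112"                                          # Tier 1 - databases, last
)

shutdown_guest() {
  local tool=${1%%:*} id=${1##*:}
  # forceStop kicks in after the timeout, covering VM 112's ACPI quirk.
  "$tool" shutdown "$id" --timeout 120 --forceStop 1
}

shutdown_all() {
  local tier guest
  for tier in "${TIERS[@]}"; do
    for guest in $tier; do
      shutdown_guest "$guest"
    done
  done
}
```

Encoding the tiers as data keeps the dependency order in one place, so the startup phase can reuse the same arrays reversed.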

## Startup Order (Staggered)

After the Proxmox host reboots, guests with `onboot: 1` will auto-start. The Ansible playbook overrides this with a controlled sequence:

```text
Tier 1 — Databases first
  VM 112  databases-bots
  → wait 30s for DB to accept connections

Tier 2 — Infrastructure + DNS
  VM 106  docker-home (Pi-hole 1, NPM)
  LXC 225 gitea
  LXC 210 docker-n8n-lxc
  LXC 227 uptime-kuma
  VM 116  docker-home-servers
  → wait 30s

Tier 3 — Applications
  VM 115  docker-sba
  VM 110  discord-bots
  LXC 301 claude-discord-coordinator
  → wait 30s

Pi-hole fix — restart container to clear UDP DNS bug
  qm guest exec 106 -- docker restart pihole
  → wait 10s

Tier 4 — Media & Others
  VM 109  homeassistant
  LXC 221 arr-stack
  LXC 222 memos
  LXC 223 foundry-lxc
```
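The fixed "wait 30s" after Tier 1 can be replaced by actually polling the database port. A sketch using bash's `/dev/tcp` pseudo-device; the address and port in the usage line (10.10.0.112:5432) are assumptions, since this runbook does not state them:

```shell
# Hedged sketch: wait until a TCP port accepts connections, up to a deadline.
wait_for_port() {
  local host=$1 port=$2 deadline=$((SECONDS + ${3:-30}))
  while (( SECONDS < deadline )); do
    # bash-only /dev/tcp redirect: succeeds once the port accepts a connection
    if (exec 3<>"/dev/tcp/$host/$port") 2>/dev/null; then
      return 0
    fi
    sleep 1
  done
  return 1
}
```

Usage (addresses assumed): `wait_for_port 10.10.0.112 5432 60 || echo "DB not ready"`. Polling bounds the wait without padding every run with a full 30 seconds.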

## Post-Reboot Validation

- Pi-hole 1 DNS resolving: `ssh docker-home "docker exec pihole dig google.com @127.0.0.1"`
- Gitea accessible: `curl -sf https://git.manticorum.com/api/v1/version`
- n8n healthy: `ssh docker-n8n-lxc "docker ps --filter name=n8n --format '{{.Status}}'"`
- Discord bots responding (check Discord)
- Uptime Kuma dashboard green: `curl -sf http://10.10.0.227:3001/api/status-page/homelab`
- Home Assistant running: `curl -sf http://10.10.0.109:8123/api/ -H 'Authorization: Bearer <token>'`
- Switch workstation DNS back from 1.1.1.1 to Pi-hole
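The scriptable checks in this list can run as one pass with a summary. A sketch; the commands are copied from the checklist above, and the Home Assistant check is omitted because it needs a bearer token filled in first:

```shell
#!/usr/bin/env bash
# Hedged sketch: run the scriptable post-reboot checks and summarize pass/fail.
checks=(
  "pihole-dns|ssh docker-home 'docker exec pihole dig google.com @127.0.0.1'"
  "gitea|curl -sf https://git.manticorum.com/api/v1/version"
  "uptime-kuma|curl -sf http://10.10.0.227:3001/api/status-page/homelab"
)

validate() {
  local entry name cmd rc=0
  for entry in "${checks[@]}"; do
    name=${entry%%|*}
    cmd=${entry#*|}
    if eval "$cmd" >/dev/null 2>&1; then
      echo "OK   $name"
    else
      echo "FAIL $name"
      rc=1
    fi
  done
  return $rc
}
```

`validate` exits nonzero if any check fails, which makes it usable as the playbook's final task or a cron-style post-check.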

## Automation

### Ansible Playbook

Located at `/opt/ansible/playbooks/monthly-reboot.yml` on LXC 304.

```bash
# Dry run (check mode)
ssh ansible "ansible-playbook /opt/ansible/playbooks/monthly-reboot.yml --check"

# Manual execution
ssh ansible "ansible-playbook /opt/ansible/playbooks/monthly-reboot.yml"

# Limit to shutdown only (skip reboot)
ssh ansible "ansible-playbook /opt/ansible/playbooks/monthly-reboot.yml --tags shutdown"
```

### Systemd Timer

The playbook runs automatically via a systemd timer on LXC 304:

```bash
# Check timer status
ssh ansible "systemctl status ansible-monthly-reboot.timer"

# Next scheduled run
ssh ansible "systemctl list-timers ansible-monthly-reboot.timer"

# Skip the next run (e.g., during an incident); note that a plain "stop" does
# not survive a reboot of LXC 304 — the timer starts again on its next boot
ssh ansible "systemctl stop ansible-monthly-reboot.timer"
```
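For reference, a hedged sketch of what the unit pair behind these commands could look like. Only the unit name and playbook path come from this runbook; every other option is an assumption, not copied from LXC 304. In systemd calendar syntax, a Sunday whose day-of-month is 1–7 is by definition the first Sunday:

```ini
# /etc/systemd/system/ansible-monthly-reboot.timer (sketch, options assumed)
[Unit]
Description=Monthly Proxmox maintenance reboot

[Timer]
# Fires at 03:00 local time on the first Sunday; assumes the controller's
# timezone is set to ET.
OnCalendar=Sun *-*-01..07 03:00:00

[Install]
WantedBy=timers.target

# /etc/systemd/system/ansible-monthly-reboot.service (sketch, options assumed)
[Unit]
Description=Run the monthly reboot playbook

[Service]
Type=oneshot
ExecStart=/usr/bin/ansible-playbook /opt/ansible/playbooks/monthly-reboot.yml
```

`systemd-analyze calendar "Sun *-*-01..07 03:00:00"` will print the next elapse time if you want to sanity-check the expression.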

## Rollback

If a guest fails to start after the reboot:

1. Check the Proxmox web UI, or `pvesh get /nodes/proxmox/qemu/<VMID>/status/current` (use `/lxc/<ID>/...` for containers)
2. Review guest logs: `ssh proxmox "journalctl -u pve-guests -n 50"`
3. Manual start: `ssh proxmox "pvesh create /nodes/proxmox/qemu/<VMID>/status/start"`
4. If the guest is corrupted, restore from the pre-reboot Proxmox snapshot