Set up monthly Proxmox maintenance reboot schedule #26

cal · 2026-04-03T01:09:24Z

Context

The Proxmox host has been up 42 days with no maintenance reboot. Kernel updates accumulate, memory leaks go uncleared, and process state drifts (as evidenced by the avahi busy-loops and stuck processes found in this audit).

Plan

Establish a recurring first-Sunday-of-the-month maintenance window.
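
Standard cron has no direct way to say "first Sunday": when both the day-of-month and day-of-week fields are restricted, the job runs when either one matches. A common workaround is to schedule the job every Sunday and let the script exit unless the date falls on days 1-7. A minimal sketch of that guard, where the function name `is_first_sunday` is hypothetical and not part of any existing tooling:

```shell
# is_first_sunday DAY_OF_MONTH ISO_WEEKDAY
# Succeeds (exit 0) only when the date is a Sunday that falls on days 1-7.
# ISO_WEEKDAY follows `date +%u`: 1 = Monday ... 7 = Sunday.
is_first_sunday() {
    dom=${1#0}   # strip one leading zero so "05" and "5" are treated alike
    dow=$2
    [ "$dow" -eq 7 ] && [ "$dom" -ge 1 ] && [ "$dom" -le 7 ]
}
```

At the top of a maintenance script this could be called as `is_first_sunday "$(date +%d)" "$(date +%u)" || exit 0`, so that every other Sunday is a no-op.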

Tasks

- [ ] Document the VM/CT shutdown order (dependency-aware):
  1. Stop services that depend on databases first (discord bots, web apps)
  2. Stop database VMs
  3. Stop infrastructure (NPM, Pi-hole; manticore provides HA DNS)
  4. Stop remaining VMs/CTs
  5. Reboot Proxmox
  6. Verify all VMs/CTs auto-start (check `onboot: 1` on all)
  7. Validate services come back
- [ ] Verify all VMs have `onboot: 1` set: `ssh proxmox "for id in \$(qm list | awk 'NR>1{print \$1}'); do echo \$id: \$(qm config \$id | grep onboot); done"`
- [ ] Same for LXCs: `ssh proxmox "for id in \$(pct list | awk 'NR>1{print \$1}'); do echo \$id: \$(pct config \$id | grep onboot); done"`
- [ ] Create a Google Calendar recurring event for the maintenance window
- [ ] Consider: can Ansible (LXC 304) orchestrate the shutdown/startup sequence?
- [ ] Document the procedure in the KB: `server-configs/proxmox/maintenance-reboot.md`
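
The first three steps of the shutdown order could be sketched as a tiered script like the one below. This is only an illustration under stated assumptions: every guest ID and tier assignment is a placeholder rather than the real inventory, the variable and function names are hypothetical, and the script defaults to a dry run that prints commands instead of running them. It relies on `qm shutdown` and `pct shutdown` with their `--timeout` option.

```shell
# Hypothetical tiers: every ID here is a placeholder, not the real inventory.
# Each entry is "<tool>:<id>", where tool is qm (VM) or pct (container).
TIER_1_APPS="pct:301 pct:302"
TIER_2_DATABASES="qm:110"
TIER_3_INFRA="pct:200 pct:201"

DRY_RUN=${DRY_RUN:-1}                      # 1 = print commands only
SHUTDOWN_TIMEOUT=${SHUTDOWN_TIMEOUT:-120}  # seconds to wait per guest

# Shut down one guest, or print the command when DRY_RUN is 1.
stop_guest() {
    tool=${1%%:*}   # text before the colon: qm or pct
    id=${1##*:}     # text after the colon: the guest ID
    case $tool in
        qm|pct) ;;
        *) echo "unknown guest type: $1" >&2; return 1 ;;
    esac
    if [ "$DRY_RUN" = 1 ]; then
        echo "$tool shutdown $id --timeout $SHUTDOWN_TIMEOUT"
    else
        "$tool" shutdown "$id" --timeout "$SHUTDOWN_TIMEOUT"
    fi
}

# Stop every guest in one tier, aborting on the first failure.
stop_tier() {
    for guest in "$@"; do
        stop_guest "$guest" || return 1
    done
}

# Walk the tiers in dependency order: apps, then databases, then infrastructure.
# The tier variables are left unquoted on purpose so they split into entries.
shutdown_in_order() {
    stop_tier $TIER_1_APPS &&
    stop_tier $TIER_2_DATABASES &&
    stop_tier $TIER_3_INFRA
}
```

Running `shutdown_in_order` with the default `DRY_RUN=1` prints the commands in the order they would execute, which makes the sequence easy to review before setting `DRY_RUN=0` on the Proxmox host. Stopping the remaining guests and the reboot itself are left out of the sketch.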

SRE Notes

- Live migration is not available (single Proxmox node)
- Total downtime should be ~15 minutes (reboot + VM startup)
- Manticore provides HA DNS during Proxmox downtime
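
To confirm that guests actually came back within the expected downtime, the output of `qm list` and `pct list` can be filtered for anything not in the `running` state. The helper below is a small sketch; the name `not_running` is hypothetical, and the column positions (status in column 3 for `qm list`, column 2 for `pct list`) are an assumption about the usual output layout that should be checked against the installed version.

```shell
# not_running STATUS_COLUMN
# Reads `qm list` or `pct list` output on stdin, skips the header row, and
# prints the ID (column 1) of every guest whose status is not "running".
not_running() {
    awk -v col="$1" 'NR > 1 && $col != "running" { print $1 }'
}
```

For example, `ssh proxmox qm list | not_running 3` and `ssh proxmox pct list | not_running 2` would each print the IDs that still need attention. Guests that are intentionally kept stopped will also appear, so the result is a prompt for a manual check rather than a pass/fail signal.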

Labels

infra-audit, operations, proxmox

cal added the infra-audit, operations, proxmox labels 2026-04-03 01:10:20 +00:00

cal referenced this issue from a commit

2026-04-03 21:33:31 +00:00

feat: add monthly Proxmox maintenance reboot automation (#26)

~~cal referenced this issue 2026-04-03 21:58:18 +00:00~~

chore: add --hosts test coverage and right-size VM 115 socket config #46

cal referenced this issue from a commit

2026-04-04 04:34:39 +00:00