Set up monthly Proxmox maintenance reboot schedule #26

Closed
opened 2026-04-03 01:09:24 +00:00 by cal · 0 comments
Owner

Context

The Proxmox host has been up 42 days with no maintenance reboot. Kernel updates accumulate, memory leaks go uncleared, and process state drifts (as evidenced by the avahi busy-loops and stuck processes found in this audit).

Plan

Establish a recurring first-Sunday-of-the-month maintenance window.
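
If the window is ever enforced by a script rather than only by a calendar reminder, the "first Sunday" rule can be checked as in the minimal sketch below. It assumes GNU `date` (the `-d` flag), as found on Debian-based hosts. The function name `is_first_sunday` is illustrative and not part of any existing tooling.

```shell
#!/bin/sh
# is_first_sunday [YYYY-MM-DD]
# Returns 0 if the given date (default: today) is the first Sunday of its month.
# Assumption: GNU date is available (the -d flag is a GNU extension).
is_first_sunday() {
    d="${1:-$(date +%F)}"
    dow=$(date -d "$d" +%u)    # ISO weekday: 1 = Monday ... 7 = Sunday
    dom=$(date -d "$d" +%d)    # zero-padded day of month: 01..31
    # The first Sunday always falls on day 1-7 of the month.
    [ "$dow" -eq 7 ] && [ "${dom#0}" -le 7 ]
}
```

The weekday test lives in the script because cron treats the day-of-month and day-of-week fields as an OR when both are restricted, so a crontab entry alone cannot express "first Sunday". A common pattern is to schedule the job for days 1-7 and let a guard like this one decide whether to proceed.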

Tasks

  • Document the dependency-aware VM/CT shutdown, reboot, and verification sequence:
    1. Stop services that depend on databases first (discord bots, web apps)
    2. Stop database VMs
    3. Stop infrastructure (NPM, Pi-hole — manticore provides HA DNS)
    4. Stop remaining VMs/CTs
    5. Reboot Proxmox
    6. Verify all VMs/CTs auto-start (check onboot: 1 on all)
    7. Validate services come back
  • Verify all VMs have onboot: 1 set: ssh proxmox "for id in \$(qm list | awk 'NR>1{print \$1}'); do echo \$id: \$(qm config \$id | grep onboot); done"
  • Same for LXCs: ssh proxmox "for id in \$(pct list | awk 'NR>1{print \$1}'); do echo \$id: \$(pct config \$id | grep onboot); done"
  • Create a Google Calendar recurring event for the maintenance window
  • Consider: can Ansible (LXC 304) orchestrate the shutdown/startup sequence?
  • Document the procedure in the KB: server-configs/proxmox/maintenance-reboot.md
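
Steps 1-4 of the shutdown order above could be sketched as a small script with a dry-run mode, whether or not Ansible ends up driving it. This is a sketch under stated assumptions, not the documented procedure. Every guest ID below is a hypothetical placeholder, the tier membership must be filled in from the real inventory, and the 120-second timeout is an assumed default. It uses only `qm shutdown` and `pct shutdown` with `--timeout`.

```shell
#!/bin/sh
# Ordered guest shutdown, tiers 1-4. Guest IDs are HYPOTHETICAL placeholders.
DRY_RUN="${DRY_RUN:-1}"      # 1 = print the commands instead of running them
TIMEOUT="${TIMEOUT:-120}"    # seconds allowed per clean shutdown (assumed value)

# stop_tier <label> <tool:id>...
# tool is "qm" for a VM or "pct" for a container.
stop_tier() {
    label="$1"; shift
    echo "== $label =="
    for guest in "$@"; do
        tool="${guest%%:*}"  # text before the colon: qm or pct
        id="${guest#*:}"     # text after the colon: numeric guest ID
        if [ "$DRY_RUN" = "1" ]; then
            echo "$tool shutdown $id --timeout $TIMEOUT"
        else
            # Abort the sequence if a guest fails to shut down cleanly.
            "$tool" shutdown "$id" --timeout "$TIMEOUT" || return 1
        fi
    done
}

# Tiers run strictly in order; a failure in one tier stops the rest.
run_shutdown_sequence() {
    stop_tier "1. database-dependent services" pct:901 pct:902 &&
    stop_tier "2. database VMs"                qm:911 &&
    stop_tier "3. infrastructure"              pct:921 pct:922 &&
    stop_tier "4. remaining guests"            qm:931
}
```

Calling `run_shutdown_sequence` with the default `DRY_RUN=1` only prints the plan, which makes the order easy to review before a real window. The host reboot itself (step 5) is deliberately left out so that it remains a manual decision.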

SRE Notes

  • Live migration is not available (single Proxmox node)
  • Total downtime should be ~15 minutes (reboot + VM startup)
  • Manticore provides HA DNS during Proxmox downtime
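
To check the ~15 minute estimate after a reboot, one option is to list the guests that are not yet running and repeat until the list is empty. The filter below is a minimal sketch. It assumes the status is the third column of `qm list` output and the second column of `pct list` output, and the function name `not_running` is illustrative.

```shell
#!/bin/sh
# not_running <status_column>
# Reads `qm list` or `pct list` output on stdin, skips the header row, and
# prints the ID of every guest whose status is not "running".
# Assumption: status is column 3 for `qm list` and column 2 for `pct list`.
not_running() {
    awk -v col="$1" 'NR > 1 && $col != "running" { print $1 }'
}

# Intended use on the Proxmox host after the reboot (not executed here):
#   qm list  | not_running 3
#   pct list | not_running 2
```

Guests that are intentionally left stopped will also appear in the output, so the result is a list to review rather than a pass/fail signal.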

Labels

infra-audit, operations, proxmox

cal added the infra-audit, operations, and proxmox labels 2026-04-03 01:10:20 +00:00
cal closed this issue 2026-04-04 04:34:39 +00:00
Reference: cal/claude-home#26