feat: add backup recency, cert expiry, and I/O wait checks (#25) #36
No reviewers
Labels
No Label
ai-changes-requested
ai-failed
ai-pr-opened
ai-reviewed
ai-reviewing
ai-working
infra-audit
monitoring
operations
proxmox
script
No Milestone
No project
No Assignees
1 Participants
Notifications
Due Date
No due date set.
Dependencies
No dependencies set.
Reference: cal/claude-home#36
Loading…
Reference in New Issue
Block a user
No description provided.
Delete Branch "issue/25-homelab-audit-sh-add-backup-recency-and-certificat"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Closes #25
Summary
Adds four new checks to
homelab-audit.shto close the audit gaps identified in the SRE review:1. Proxmox backup recency (
check_backup_recency)pvesh get /nodes/proxmox/tasks --typefilter vzdumpfor the last 50 vzdump tasks2. Certificate expiry (
check_cert_expiry)parse_and_reportfor each discovered VM/LXCopenssl s_clientwith a 10s timeout3. Disk I/O wait (
io_wait_pctin COLLECTOR_SCRIPT)vmstat 1 2to take a 1-second I/O wait sample on each remote hostIO_WAIT=<pct>alongside existing metrics4. OOM kill history
oom_events()collector (journalctl kernel log, 7-day window)Files changed
monitoring/scripts/homelab-audit.shAI Code Review
Files Reviewed
monitoring/scripts/homelab-audit.sh(modified)Findings
Correctness
1. Hardcoded Proxmox node name
proxmoxincheck_backup_recencyThe Proxmox default node name is
pve, notproxmox. If this node is named anything other thanproxmox, thepveshcall returns an error and the function falls through to:This silently kills the entire backup recency check with a misleading warning. Fix: resolve the node name dynamically or make it a configurable variable:
2. LXC containers mislabeled as
vm-{id}in Python outputrunning_idsis built by merging LXC IDs (frompct list) and VM IDs (fromqm list) into a single flat array with no type tag. The Python inline script then labels every entry asproxmox/vm-{vmid}:LXC container 301 would appear as
proxmox/vm-301, inconsistent with the rest of the audit which useslxc-301. The array needs to carry type metadata (e.g.lxc:301,vm:100) so the Python script can emit the correct prefix.Security
addrincheck_cert_expiryoriginates frompct list/qm list(trusted local Proxmox commands), not user input.Style & Conventions
json.load(open(tasks_file))in the inline Python script leaks a file handle. Usewith open(tasks_file) as f: tasks = json.load(f). Not a runtime bug in a short-lived subprocess, but inconsistent with Python best practices.set -uo pipefail, error guard patterns,|| continuein cert check, CRIT/WARN thresholds) matches the existing script's conventions cleanly.Suggestions
vmstat 1 2adds a mandatory 1-second delay per audited host. On a large inventory this accumulates. Consider documenting the timing impact or making the sample count configurable.Verdict: REQUEST_CHANGES (posted as COMMENT — Gitea prevents self-review)
Two correctness issues should be fixed before merge:
Hardcoded
proxmoxnode name inpvesh get /nodes/proxmox/tasks— will silently fail on any Proxmox install not namedproxmox(installer default ispve). UsePROXMOX_NODE="${PROXMOX_NODE:-$(hostname -s)}"and reference that variable.LXC containers labeled
vm-{id}in backup recency output — breaks label consistency. Track LXC and VM IDs separately so the Python script can printlxc-{id}vsvm-{id}.The cert expiry logic, IO wait collection, graceful SSH failure handling, and Python JSON processing are all solid.
Automated review by Claude PR Reviewer