homelab-audit.sh: Add backup recency and certificate checks #25

Closed
opened 2026-04-03 01:09:13 +00:00 by cal · 1 comment
Owner

Context

SRE review identified two critical audit gaps: no backup validation and no TLS certificate expiry checks. These are predictable failure modes that the audit should surface.

New Checks to Add

1. Proxmox backup recency

  • Query vzdump backup job logs: ssh proxmox "pvesh get /nodes/proxmox/tasks --typefilter vzdump --limit 50 --output-format json"
  • For each running VM/CT, check if a successful backup exists within the last 7 days
  • Flag VMs/CTs with no successful backup in the last 7 days as WARN, and those with no backup at all as CRIT
  • Show last backup date per VM in the inventory table
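
The classification above could be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the function name `backup_status` and the `running_vmids` argument are hypothetical, and it assumes each task entry in the `pvesh` JSON carries `type`, `id` (the VMID), `status` (`"OK"` on success) and `endtime` (epoch seconds).

```python
import json
import time

SEVEN_DAYS = 7 * 24 * 3600  # recency window from the issue, in seconds


def backup_status(tasks_json, running_vmids, now=None):
    """Map each running VMID to (level, last_successful_backup_epoch_or_None).

    tasks_json is the raw JSON text from the pvesh tasks query; the field
    names used here (type, id, status, endtime) are assumptions.
    """
    now = time.time() if now is None else now
    last_ok = {}
    for task in json.loads(tasks_json):
        # Only successful vzdump tasks count as a backup.
        if task.get("type") != "vzdump" or task.get("status") != "OK":
            continue
        vmid = str(task.get("id", ""))
        end = task.get("endtime")
        if not vmid or end is None:
            continue
        last_ok[vmid] = max(last_ok.get(vmid, 0), int(end))

    result = {}
    for vmid in map(str, running_vmids):
        seen = last_ok.get(vmid)
        if seen is None:
            result[vmid] = ("CRIT", None)   # no successful backup in the fetched tasks
        elif now - seen > SEVEN_DAYS:
            result[vmid] = ("WARN", seen)   # backup exists but is older than 7 days
        else:
            result[vmid] = ("OK", seen)
    return result
```

One caveat with this approach: because the query uses `--limit 50`, a VM missing from the result may only mean its last backup fell outside the fetched window, so CRIT here really means "no backup among the tasks returned".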

2. Certificate expiration

  • For hosts with listening HTTPS ports (443, 8443, etc.), check cert expiry
  • Can be done from the audit host: echo | openssl s_client -connect $ip:443 2>/dev/null | openssl x509 -noout -enddate
  • Flag certs expiring within 14 days as WARN, within 7 days as CRIT
  • Primarily relevant for NPM-managed domains and Gitea
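
As a rough sketch of the threshold logic, the `notAfter=` line printed by `openssl x509 -noout -enddate` can be converted to a days-remaining figure and classified. The function name `cert_level` is hypothetical; `ssl.cert_time_to_seconds` is the standard-library helper for this date format. An already expired certificate is treated as CRIT here, which is an assumption the issue does not spell out.

```python
import ssl
import time

DAY = 86400  # seconds per day


def cert_level(enddate_line, now=None):
    """Classify one line of `openssl x509 -noout -enddate` output.

    Returns (level, days_left). WARN at 14 days or less, CRIT at 7 days
    or less (including already expired certificates).
    """
    now = time.time() if now is None else now
    # The line looks like "notAfter=Jun  1 12:00:00 2026 GMT".
    _, _, not_after = enddate_line.strip().partition("=")
    expires = ssl.cert_time_to_seconds(not_after)
    days_left = (expires - now) / DAY
    if days_left <= 7:
        return "CRIT", days_left
    if days_left <= 14:
        return "WARN", days_left
    return "OK", days_left
```

The CRIT test comes first so that a certificate inside both windows gets the more severe level.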

3. OOM kill history

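  • Check dmesg | grep -i "oom-kill" on each host (dmesg only covers the current boot and its ring buffer can wrap, so journalctl -k --since "7 days ago" is a better fit for the 7-day window)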
  • Flag any OOM events in the last 7 days as WARN
  • Particularly relevant for manticore (978 MB swap, 10 zombies)
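
A minimal sketch of the matching step, assuming the kernel log text has already been limited to the last 7 days (for example with `journalctl -k --since "7 days ago" --no-pager`). The function name `oom_level` and the exact pattern list are illustrative assumptions, not taken from the script.

```python
import re

# Typical kernel OOM messages; this pattern list is an assumption and
# may need adjusting per kernel version.
OOM_PATTERN = re.compile(r"oom-kill|invoked oom-killer|Out of memory", re.IGNORECASE)


def oom_level(kernel_log_text):
    """Return (level, matching_lines) for pre-filtered kernel log text."""
    hits = [line for line in kernel_log_text.splitlines() if OOM_PATTERN.search(line)]
    return ("WARN" if hits else "OK"), hits
```

A single OOM event usually produces several matching lines, so the length of the returned list is a line count rather than an event count.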

4. Disk I/O check

  • Sample /proc/diskstats twice and take the delta, or simply run vmstat 1 2, to detect I/O wait
  • Flag I/O wait > 20% as WARN
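
For the `vmstat 1 2` variant, only the second sample is useful, since the first row reports averages since boot. The sketch below locates the `wa` column by name from the header row instead of hard-coding its position, because newer procps releases append extra CPU columns. The names `parse_iowait` and `io_level` are hypothetical.

```python
def parse_iowait(vmstat_output):
    """Return the I/O wait percentage from the last sample of `vmstat 1 2`."""
    rows = [ln.split() for ln in vmstat_output.strip().splitlines() if ln.strip()]
    # The column-name row is the one containing a standalone "wa" token.
    header = next(cols for cols in rows if "wa" in cols)
    last = rows[-1]  # second sample: activity over the 1-second interval
    return int(last[header.index("wa")])


def io_level(vmstat_output, threshold=20):
    """WARN when I/O wait exceeds the threshold (20% in the issue)."""
    return "WARN" if parse_iowait(vmstat_output) > threshold else "OK"
```

A one-second sample is noisy, so a single WARN from this check is a hint to look closer, not proof of a sustained problem.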

Files

  • monitoring/scripts/homelab-audit.sh

Labels

infra-audit, script, monitoring

cal added the
infra-audit
monitoring
script
labels 2026-04-03 01:10:19 +00:00
Claude added the
ai-working
label 2026-04-03 02:00:44 +00:00
Claude removed the
ai-working
label 2026-04-03 02:07:04 +00:00
Collaborator

Implemented in PR #36.

Approach:

  • check_backup_recency(): runs pvesh get /nodes/proxmox/tasks --typefilter vzdump locally, uses Python to parse JSON and find the most recent successful backup per VM/CT — CRIT for no backup ever, WARN for no backup in 7 days
  • check_cert_expiry(): called per-host after SSH collection; probes ports 443 and 8443 via openssl s_client, skips silently if no HTTPS listener — WARN ≤14 days, CRIT ≤7 days
  • io_wait_pct(): added to the remote COLLECTOR_SCRIPT using vmstat 1 2; flagged WARN if I/O wait > 20%
  • OOM kill history was already implemented via oom_events() (journalctl kernel log, 7-day window) — no changes needed
Claude added the
ai-pr-opened
label 2026-04-03 02:07:13 +00:00
cal closed this issue 2026-04-03 02:15:43 +00:00
Reference: cal/claude-home#25