feat: add backup recency, cert expiry, and I/O wait checks (#25) #36

Merged
cal merged 1 commits from issue/25-homelab-audit-sh-add-backup-recency-and-certificat into main 2026-04-03 02:15:42 +00:00
Collaborator

Closes #25

Summary

Adds four new checks to homelab-audit.sh to close the audit gaps identified in the SRE review:

1. Proxmox backup recency (check_backup_recency)

  • Queries pvesh get /nodes/proxmox/tasks --typefilter vzdump for the last 50 vzdump tasks
  • For each running VM/CT, finds the most recent successful backup
  • CRIT if no backup found in task history at all
  • WARN if most recent backup is older than 7 days
  • Runs locally on the Proxmox host before the per-host SSH loop

2. Certificate expiry (check_cert_expiry)

  • Called per-host after parse_and_report for each discovered VM/LXC
  • Probes ports 443 and 8443 via openssl s_client with a 10s timeout
  • CRIT if cert expires within 7 days
  • WARN if cert expires within 14 days
  • Silently skips hosts with no HTTPS listener on those ports

3. Disk I/O wait (io_wait_pct in COLLECTOR_SCRIPT)

  • Uses vmstat 1 2 to take a 1-second I/O wait sample on each remote host
  • Emits IO_WAIT=<pct> alongside existing metrics
  • WARN if I/O wait > 20%

4. OOM kill history

  • Already present in the existing oom_events() collector (journalctl kernel log, 7-day window)
  • No changes needed — the check was already implemented

Files changed

  • monitoring/scripts/homelab-audit.sh
Closes #25 ## Summary Adds four new checks to `homelab-audit.sh` to close the audit gaps identified in the SRE review: ### 1. Proxmox backup recency (`check_backup_recency`) - Queries `pvesh get /nodes/proxmox/tasks --typefilter vzdump` for the last 50 vzdump tasks - For each running VM/CT, finds the most recent successful backup - **CRIT** if no backup found in task history at all - **WARN** if most recent backup is older than 7 days - Runs locally on the Proxmox host before the per-host SSH loop ### 2. Certificate expiry (`check_cert_expiry`) - Called per-host after `parse_and_report` for each discovered VM/LXC - Probes ports 443 and 8443 via `openssl s_client` with a 10s timeout - **CRIT** if cert expires within 7 days - **WARN** if cert expires within 14 days - Silently skips hosts with no HTTPS listener on those ports ### 3. Disk I/O wait (`io_wait_pct` in COLLECTOR_SCRIPT) - Uses `vmstat 1 2` to take a 1-second I/O wait sample on each remote host - Emits `IO_WAIT=<pct>` alongside existing metrics - **WARN** if I/O wait > 20% ### 4. OOM kill history - Already present in the existing `oom_events()` collector (journalctl kernel log, 7-day window) - No changes needed — the check was already implemented ## Files changed - `monitoring/scripts/homelab-audit.sh`
Claude added 1 commit 2026-04-03 02:06:59 +00:00
feat: add backup recency, cert expiry, OOM, and I/O wait checks (#25)
All checks were successful
Auto-merge docs-only PRs / auto-merge-docs (pull_request) Successful in 2s
ae5da035f6
Closes #25

- check_backup_recency(): queries pvesh vzdump task history; flags VMs
  with no backup (CRIT) or no backup in 7 days (WARN)
- check_cert_expiry(): probes ports 443/8443 per host via openssl;
  flags certs expiring ≤14 days (WARN) or ≤7 days (CRIT)
- io_wait_pct() in COLLECTOR_SCRIPT: uses vmstat 1 2 to sample I/O
  wait; flagged as WARN when > 20%
- OOM kill history was already collected via journalctl; no changes needed

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude added the
ai-reviewing
label 2026-04-03 02:15:09 +00:00
cal merged commit 1db2c2b168 into main 2026-04-03 02:15:42 +00:00
cal deleted branch issue/25-homelab-audit-sh-add-backup-recency-and-certificat 2026-04-03 02:15:43 +00:00
Claude reviewed 2026-04-03 02:18:20 +00:00
Claude left a comment
Author
Collaborator

AI Code Review

Files Reviewed

  • monitoring/scripts/homelab-audit.sh (modified)

Findings

Correctness

1. Hardcoded Proxmox node name proxmox in check_backup_recency

pvesh get /nodes/proxmox/tasks --typefilter vzdump --limit 50 --output-format json

The Proxmox default node name is pve, not proxmox. If this node is named anything other than proxmox, the pvesh call returns an error and the function falls through to:

WARN  proxmox: failed to query vzdump task history

This silently kills the entire backup recency check with a misleading warning. Fix: resolve the node name dynamically or make it a configurable variable:

PROXMOX_NODE="${PROXMOX_NODE:-$(hostname -s)}"
# then use: pvesh get /nodes/"$PROXMOX_NODE"/tasks ...

2. LXC containers mislabeled as vm-{id} in Python output

running_ids is built by merging LXC IDs (from pct list) and VM IDs (from qm list) into a single flat array with no type tag. The Python inline script then labels every entry as proxmox/vm-{vmid}:

print(f"WARN  proxmox/vm-{vmid}: last backup {dt} is older than 7 days")
print(f"CRIT  proxmox/vm-{vmid}: no backup found in task history")

LXC container 301 would appear as proxmox/vm-301, inconsistent with the rest of the audit which uses lxc-301. The array needs to carry type metadata (e.g. lxc:301, vm:100) so the Python script can emit the correct prefix.

Security

  • No issues. addr in check_cert_expiry originates from pct list/qm list (trusted local Proxmox commands), not user input.
  • No hardcoded credentials.

Style & Conventions

  • json.load(open(tasks_file)) in the inline Python script leaks a file handle. Use with open(tasks_file) as f: tasks = json.load(f). Not a runtime bug in a short-lived subprocess, but inconsistent with Python best practices.
  • Everything else (quoting, set -uo pipefail, error guard patterns, || continue in cert check, CRIT/WARN thresholds) matches the existing script's conventions cleanly.

Suggestions

  • vmstat 1 2 adds a mandatory 1-second delay per audited host. On a large inventory this accumulates. Consider documenting the timing impact or making the sample count configurable.

Verdict: REQUEST_CHANGES (posted as COMMENT — Gitea prevents self-review)

Two correctness issues should be fixed before merge:

  1. Hardcoded proxmox node name in pvesh get /nodes/proxmox/tasks — will silently fail on any Proxmox install not named proxmox (installer default is pve). Use PROXMOX_NODE="${PROXMOX_NODE:-$(hostname -s)}" and reference that variable.

  2. LXC containers labeled vm-{id} in backup recency output — breaks label consistency. Track LXC and VM IDs separately so the Python script can print lxc-{id} vs vm-{id}.

The cert expiry logic, IO wait collection, graceful SSH failure handling, and Python JSON processing are all solid.


Automated review by Claude PR Reviewer

## AI Code Review ### Files Reviewed - `monitoring/scripts/homelab-audit.sh` (modified) ### Findings #### Correctness **1. Hardcoded Proxmox node name `proxmox` in `check_backup_recency`** ```bash pvesh get /nodes/proxmox/tasks --typefilter vzdump --limit 50 --output-format json ``` The Proxmox default node name is `pve`, not `proxmox`. If this node is named anything other than `proxmox`, the `pvesh` call returns an error and the function falls through to: ``` WARN proxmox: failed to query vzdump task history ``` This silently kills the entire backup recency check with a misleading warning. Fix: resolve the node name dynamically or make it a configurable variable: ```bash PROXMOX_NODE="${PROXMOX_NODE:-$(hostname -s)}" # then use: pvesh get /nodes/"$PROXMOX_NODE"/tasks ... ``` **2. LXC containers mislabeled as `vm-{id}` in Python output** `running_ids` is built by merging LXC IDs (from `pct list`) and VM IDs (from `qm list`) into a single flat array with no type tag. The Python inline script then labels every entry as `proxmox/vm-{vmid}`: ```python print(f"WARN proxmox/vm-{vmid}: last backup {dt} is older than 7 days") print(f"CRIT proxmox/vm-{vmid}: no backup found in task history") ``` LXC container 301 would appear as `proxmox/vm-301`, inconsistent with the rest of the audit which uses `lxc-301`. The array needs to carry type metadata (e.g. `lxc:301`, `vm:100`) so the Python script can emit the correct prefix. #### Security - No issues. `addr` in `check_cert_expiry` originates from `pct list`/`qm list` (trusted local Proxmox commands), not user input. - No hardcoded credentials. #### Style & Conventions - `json.load(open(tasks_file))` in the inline Python script leaks a file handle. Use `with open(tasks_file) as f: tasks = json.load(f)`. Not a runtime bug in a short-lived subprocess, but inconsistent with Python best practices. - Everything else (quoting, `set -uo pipefail`, error guard patterns, `|| continue` in cert check, CRIT/WARN thresholds) matches the existing script's conventions cleanly. #### Suggestions - `vmstat 1 2` adds a mandatory 1-second delay per audited host. On a large inventory this accumulates. Consider documenting the timing impact or making the sample count configurable. ### Verdict: REQUEST_CHANGES (posted as COMMENT — Gitea prevents self-review) Two correctness issues should be fixed before merge: 1. **Hardcoded `proxmox` node name** in `pvesh get /nodes/proxmox/tasks` — will silently fail on any Proxmox install not named `proxmox` (installer default is `pve`). Use `PROXMOX_NODE="${PROXMOX_NODE:-$(hostname -s)}"` and reference that variable. 2. **LXC containers labeled `vm-{id}`** in backup recency output — breaks label consistency. Track LXC and VM IDs separately so the Python script can print `lxc-{id}` vs `vm-{id}`. The cert expiry logic, IO wait collection, graceful SSH failure handling, and Python JSON processing are all solid. --- *Automated review by Claude PR Reviewer*
Claude removed the
ai-reviewing
label 2026-04-03 02:18:27 +00:00
Sign in to join this conversation.
No description provided.