homelab-audit.sh: Fix variable interpolation and collector bugs #23

Closed
opened 2026-04-03 01:08:54 +00:00 by cal · 1 comment
Owner

Context

SRE review of monitoring/scripts/homelab-audit.sh identified several bugs in the collector script and variable handling.

Bugs to Fix

1. STUCK_PROC_CPU_WARN not reaching the remote collector

The COLLECTOR_SCRIPT heredoc is single-quoted (COLLECTOR_SCRIPT='...'), so $STUCK_PROC_CPU_WARN is never interpolated. The collector hardcodes 10 instead of using the configurable threshold.

Fix options:

  • a) Interpolate at definition time: change to double-quoted heredoc and escape internal $ for remote variables
  • b) Pass the threshold as an argument to the remote bash session
  • c) Inject the value via sed before sending: echo "$COLLECTOR_SCRIPT" | sed "s/THRESHOLD_PLACEHOLDER/$STUCK_PROC_CPU_WARN/"

2. LXC IP discovery unreliable for static-IP containers

lxc-info -n $ctid -iH only works for containers using Proxmox-managed bridges with DHCP. Containers with static IPs set inside the container (not via Proxmox config) return no IP and are silently skipped.

Fix: Fall back to parsing pct config $ctid | grep "ip=" for containers where lxc-info returns empty.

3. SSH failures silently dropped

2>/dev/null on ssh_cmd suppresses all errors including host key changes and connection failures. A re-provisioned host silently disappears from the audit.

Fix: Log SSH failures to $REPORT_DIR/ssh-failures.log and include a count in the summary.

4. set -uo pipefail comment

Add explicit comment: # -e omitted intentionally — unreachable hosts should not abort the full audit

Files

  • monitoring/scripts/homelab-audit.sh

Labels

infra-audit, script

## Context SRE review of `monitoring/scripts/homelab-audit.sh` identified several bugs in the collector script and variable handling. ## Bugs to Fix ### 1. STUCK_PROC_CPU_WARN not reaching the remote collector The `COLLECTOR_SCRIPT` heredoc is **single-quoted** (`COLLECTOR_SCRIPT='...'`), so `$STUCK_PROC_CPU_WARN` is never interpolated. The collector hardcodes `10` instead of using the configurable threshold. **Fix options:** - a) Interpolate at definition time: change to double-quoted heredoc and escape internal `$` for remote variables - b) Pass the threshold as an argument to the remote bash session - c) Inject the value via sed before sending: `echo "$COLLECTOR_SCRIPT" | sed "s/THRESHOLD_PLACEHOLDER/$STUCK_PROC_CPU_WARN/"` ### 2. LXC IP discovery unreliable for static-IP containers `lxc-info -n $ctid -iH` only works for containers using Proxmox-managed bridges with DHCP. Containers with static IPs set inside the container (not via Proxmox config) return no IP and are silently skipped. **Fix:** Fall back to parsing `pct config $ctid | grep "ip="` for containers where `lxc-info` returns empty. ### 3. SSH failures silently dropped `2>/dev/null` on `ssh_cmd` suppresses all errors including host key changes and connection failures. A re-provisioned host silently disappears from the audit. **Fix:** Log SSH failures to `$REPORT_DIR/ssh-failures.log` and include a count in the summary. ### 4. set -uo pipefail comment Add explicit comment: `# -e omitted intentionally — unreachable hosts should not abort the full audit` ## Files - `monitoring/scripts/homelab-audit.sh` ## Labels `infra-audit`, `script`
cal added the
infra-audit
script
labels 2026-04-03 01:10:18 +00:00
Claude added the
ai-working
label 2026-04-03 01:30:30 +00:00
Claude added the
ai-pr-opened
label 2026-04-03 01:34:20 +00:00
Collaborator

PR opened: #34

Created monitoring/scripts/homelab-audit.sh with all four fixes:

  1. STUCK_PROC_CPU_WARN interpolation — heredoc stays single-quoted; threshold passed as $1 to the remote bash -s session
  2. LXC IP fallbackget_lxc_ip() tries lxc-info first, then pct config … | grep -oP 'ip=…' for static-IP containers
  3. SSH failure logging — stderr goes to $REPORT_DIR/ssh-failures.log; failure count shown in summary
  4. set comment# -e omitted intentionally — unreachable hosts should not abort the full audit
PR opened: https://git.manticorum.com/cal/claude-home/pulls/34 Created `monitoring/scripts/homelab-audit.sh` with all four fixes: 1. **STUCK_PROC_CPU_WARN interpolation** — heredoc stays single-quoted; threshold passed as `$1` to the remote `bash -s` session 2. **LXC IP fallback** — `get_lxc_ip()` tries `lxc-info` first, then `pct config … | grep -oP 'ip=…'` for static-IP containers 3. **SSH failure logging** — stderr goes to `$REPORT_DIR/ssh-failures.log`; failure count shown in summary 4. **set comment** — `# -e omitted intentionally — unreachable hosts should not abort the full audit`
Claude removed the
ai-working
label 2026-04-03 01:34:30 +00:00
cal closed this issue 2026-04-03 01:49:12 +00:00
Sign in to join this conversation.
No Milestone
No project
No Assignees
2 Participants
Notifications
Due Date
The due date is invalid or out of range. Please use the format 'yyyy-mm-dd'.

No due date set.

Dependencies

No dependencies set.

Reference: cal/claude-home#23
No description provided.