fix: homelab-audit.sh variable interpolation and collector bugs (#23) #34
No reviewers
Labels
No Label
ai-changes-requested
ai-failed
ai-pr-opened
ai-reviewed
ai-reviewing
ai-working
infra-audit
monitoring
operations
proxmox
script
No Milestone
No project
No Assignees
2 Participants
Notifications
Due Date
No due date set.
Dependencies
No dependencies set.
Reference: cal/claude-home#34
Loading…
Reference in New Issue
Block a user
No description provided.
Delete Branch "issue/23-homelab-audit-sh-fix-variable-interpolation-and-co"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Closes #23
Changes
Creates
monitoring/scripts/homelab-audit.shwith all four bugs from the SRE review addressed:1. STUCK_PROC_CPU_WARN threshold now reaches the remote collector
COLLECTOR_SCRIPTheredoc is kept single-quoted (no local variable leakage). Instead,STUCK_PROC_CPU_WARNis passed as$1when invoking the remote bash session:The collector receives it as
STUCK_PROC_CPU_WARN="${1:-10}".2. LXC IP discovery with static-IP fallback
get_lxc_ip()first trieslxc-info -n "$ctid" -iH. If that returns empty, it falls back to:3. SSH failures logged, not silently dropped
ssh_cmd()redirects stderr to$REPORT_DIR/ssh-failures.loginstead of/dev/null. SSH failures are counted and printed in the audit summary. ASSH_FAILUREsentinel line is written per failed host.4. Explicit set options comment
Files Changed
monitoring/scripts/homelab-audit.sh(new file, 268 lines)AI Code Review
Files Reviewed
monitoring/scripts/homelab-audit.sh(added, 268 lines)Second-Round Fixes Confirmed
Both additions from the first review round are correctly implemented:
--output-dirguard: the$# -lt 2check correctly detects a missing argument beforeshift 2would underrun the argument count.ZOMBIE_WARN=1default and((zombies >= ZOMBIE_WARN))comparison: correct substitution of the hardcoded> 0— the configurable threshold is now respected.Findings
Correctness
Bug:
awkfield references inside single-quoted heredoc use unquoted$— they will not reach the remote shell correctly.Inside
COLLECTOR_SCRIPT(single-quoted), thestuck_procs()function contains:Because the heredoc is single-quoted, the literal text
$1,$2,$3is transmitted to the remote shell verbatim, which is the desired behaviour for the awk positional fields. However the awk program string itself is passed as a double-quoted shell argument on the remote side, meaning the remote shell will try to expand$1,$2,$3as shell positional parameters before awk sees them — and since the remote bash invocation has only$1=STUCK_PROC_CPU_WARN value,$2and$3will expand to empty strings, making the awk filter always produce no output.The fix is to single-quote the awk program on the remote side:
But because
COLLECTOR_SCRIPTis itself single-quoted, inner single quotes cannot be used directly. The standard workaround is'\''(end single quote, literal', resume single quote) or using$'...'quoting for the outer variable. This is a functional bug:stuck_procs()will always return empty output.Minor:
host_countincrement uses|| trueto suppress errexit, butset -eis already omitted.((host_count++)) || true— the|| trueguard is necessary whenset -eis active because((expr))exits with code 1 when the expression evaluates to 0 (i.e. on the first increment from 0). Sinceset -eis intentionally omitted here, the guard is harmless but slightly misleading. Not a bug, just worth noting.Minor:
generate_summarysorts$FINDINGS_FILEdirectly — if the file is large, this is fine, butsortis called without-oso it reads from the file while potentially still being written to by a concurrentparse_and_reportcall. In practicemain()calls these sequentially so there is no race, but the ordering assumption is implicit.Security
SSH_OPTSusesStrictHostKeyChecking=accept-new, which silently accepts new host keys but will refuse changed ones. This is a reasonable posture for a homelab audit tool against a known-stable inventory, but it is worth documenting the trade-off (an attacker who can intercept a first-time connection could MITM it). No change required — just an observation.SSH_USERdefaults toroot, which is documented.python3snippet insidecollect_inventoryusesjson.load(sys.stdin)on data fromqm guest cmd— this is trusted Proxmox tooling output, not user-supplied input, so injection risk is negligible.REPORT_DIRis created withmkdir -pwithout achmod; it inherits the umask. SSH failure logs and findings may contain hostnames and IPs. For a root-run audit script this is acceptable, but worth knowing.Style & Conventions
#!/usr/bin/env bash��� correct per convention.set -uo pipefailwith an explanatory comment for the omitted-e— good practice, matches the bash-CONTEXT.md guidance.DISK_WARN,DISK_CRIT, etc.) are grouped at the top.shellcheck disable=SC2086is used narrowly and correctly for the intentionally unquoted$SSH_OPTS.COLLECTOR_SCRIPTsingle-quoting strategy and rationale are clearly commented.parse_and_reportusesIFS= read -r linecorrectly.Summary of Required Fix
The
awkprogram string instuck_procs()insideCOLLECTOR_SCRIPTuses double-quoted shell argument syntax on the remote, causing$1/$2/$3to be expanded as shell positionals instead of awk field references.stuck_procs()will always produce empty output. The awk program needs to be single-quoted on the remote side; this requires escaping within the outer single-quoted variable (e.g.,'\''quoting or restructuring).Verdict: REQUEST_CHANGES
One functional bug blocks approval: the
awkfield references instuck_procs()are subject to remote shell expansion, causing the D-state process detector to silently produce no output regardless of actual stuck processes. This directly undermines the fix described in PR item #1 (theSTUCK_PROC_CPU_WARNthreshold). All other aspects of the implementation are solid.Automated review by Claude PR Reviewer
@ -0,0 +71,4 @@disk_usage() {df --output=pcent,target -x tmpfs -x devtmpfs 2>/dev/null | tail -n +2 | \while read -r pct mnt; do echo "${pct%%%} $mnt"; done}The awk program is double-quoted on the remote side. When
bash -s -- "$STUCK_PROC_CPU_WARN"runs on the remote host, the remote shell will expand$1,$2,$3as its own positional parameters before passing the string to awk. Since only$1has a value (the CPU threshold),$2and$3expand to empty string — the awk stat/pcpu/comm field references vanish and the filter never matches anything.Fix: single-quote the awk program on the remote side. Since
COLLECTOR_SCRIPTis already outer-single-quoted, use'\''to embed a literal single quote:AI Code Review (Round 3)
Files Reviewed
monitoring/scripts/homelab-audit.sh(added, 273 lines)Verification of Round 1-2-3 Fixes
Fix 1 —
--output-dirguard (Round 1)Correct. Lines 35-38 check
$# -lt 2immediately after matching--output-dir, before any shift. Callinghomelab-audit.sh --output-dirwith no argument correctly prints the error and exits 1.Fix 2 —
ZOMBIE_WARNconfigurable (Round 2)Correct per the design intent.
ZOMBIE_WARN=1on line 30 is a named constant, consistent withDISK_WARN=80,DISK_CRIT=90,LOAD_WARN=2.0, andMEM_WARN=85— none of which use the env-override pattern. The three variables documented in the header (STUCK_PROC_CPU_WARN,REPORT_DIR,SSH_USER) are the only runtime env overrides. Using a named constant is the right call here; it avoids cluttering the env-override interface with thresholds that rarely need per-run tuning.Fix 3 —
awksingle-quoting instuck_procs()(Round 3)Correct. The
'\''escape sequence inside the single-quotedCOLLECTOR_SCRIPTstring produces a literal single quote in the stored script text. The remote shell therefore receives:The awk program is single-quoted, so awk field references
$1,$2,$3are never interpolated by the remote shell. The-v t=binding correctly passesSTUCK_PROC_CPU_WARNfrom the remote variable (set from$1at the top of the collector script). This is the right fix.Findings
Correctness
One minor issue remains:
$SSH_FAILURES_LOGis never truncated at startup, but$FINDINGS_FILEis (line 261:>"$FINDINGS_FILE"). If--output-dirpoints to a reused directory, stale SSH failure sentinel lines from a prior run will accumulate inssh-failures.log. The summary'sssh_failure_countwill therefore overcount on reruns. The default REPORT_DIR uses a timestamp so this only affects the--output-dirpath, and the impact is limited to the summary counter being wrong. This is not a blocker, but worth noting.The awk programs in
cpu_load()andmem_pct()use double-quoted programs with\$N(e.g.,"{print \$2}"). Inside the single-quotedCOLLECTOR_SCRIPT, these literal characters reach the remote bash verbatim. On the remote, the double-quoted awk program has\$2which bash reduces to a literal$2before passing it to awk. This is correct behavior.The
awk "BEGIN{exit !($load > $LOAD_WARN)}"on line 175 interpolates$load(which comes from remote host output) directly into an awk program. A compromised host could inject awk code via a craftedCPU_LOAD=line. For a homelab auditing trusted hosts this is an acceptable risk, but worth noting for anyone adapting this script to untrusted networks.Security
No secrets, hardcoded credentials, or data exposure issues.
SSH_OPTSusesBatchMode=yes(prevents password prompts),StrictHostKeyChecking=accept-new(first-connect trust, no TOFU bypass), andConnectTimeout=10(avoids hangs). SSH stderr goes to a log file rather than being silently swallowed — this was the Round 2 fix and it's implemented correctly.Style and Conventions
Matches project bash conventions:
set -uo pipefail, functions withlocalvariables, consistent>>"$FINDINGS_FILE"append patterns,((expr)) || truefor arithmetic in non--econtext, and thorough section comments. The# shellcheck disable=SC2086on line 101 is justified by the intentional word-splitting of$SSH_OPTS.Suggestions
>"$SSH_FAILURES_LOG"alongside>"$FINDINGS_FILE"inmain()to ensure reruns with--output-dirproduce a clean log. One line fix.[--output-dir DIR]but does not mention thatZOMBIE_WARN,DISK_WARN,DISK_CRIT,LOAD_WARN, andMEM_WARNcan be changed by editing the constants block. A short comment above that block (e.g.,# Thresholds — edit to adjust sensitivity) would make it discoverable.Verdict: APPROVED
All three fixes from prior review rounds are present and correct. The remaining items (stale ssh-failures.log on reuse, awk injection from trusted-host output) are low-severity and the latter is an accepted risk for a homelab context. The script is ready to merge.
Automated review by Claude PR Reviewer
AI Code Review
Files Reviewed
monitoring/scripts/homelab-audit.sh(added, 268 lines)⚠️ Process Concern: Duplicate PR / Merge Conflict
This PR is currently marked not mergeable and a competing PR #33 addresses the same issue (#23) with virtually identical changes. One of these two PRs needs to be closed before the other can land. The code review below covers the code itself.
Findings
Correctness
All four claimed bug fixes are implemented correctly:
STUCK_PROC_CPU_WARN threshold propagation —
COLLECTOR_SCRIPTis single-quoted (no local interpolation) and the threshold is passed as positional arg$1viabash -s -- "$STUCK_PROC_CPU_WARN". The collector receives it asSTUCK_PROC_CPU_WARN="${1:-10}". Correct.LXC IP fallback —
get_lxc_ip()trieslxc-info -iHfirst, then falls back topct config | grep -oP '(?<=ip=)[^/,]+'. The lookbehind/character-class regex correctly strips the subnet mask. Correct.SSH failure logging —
ssh_cmd()redirects stderr to$SSH_FAILURES_LOG.parse_and_report()writes aSSH_FAILUREsentinel on non-zero exit and a human-readable line to$FINDINGS_FILE.generate_summary()countsSSH_FAILURElines and lists them. Correct.set -eomission comment — clearly documents the intentional deviation at the top of the script. Correct.The floating-point load comparison (
awk "BEGIN{exit !($load > $LOAD_WARN)}") and empty-guard patterns ([[ -n "$load" ]],[[ -n "$mem" ]]) are both correct.((host_count++)) || truecorrectly avoids an exit on zero-increment without-e.Security
SSH_USERdefaults toroot, expected for a Proxmox admin tool.STUCK_PROC_CPU_WARNis passed as a positional arg and consumed via awk's-v t=— awk variables don't execute shell code, so no injection path.# shellcheck disable=SC2086on$SSH_OPTSis the standard accepted pattern for intentional word-splitting of options.Style & Conventions
#!/usr/bin/env bashused; project convention (development/bash-CONTEXT.md) says#!/bin/bash. Minor and portable.set -euo pipefail: Intentional omission of-eis documented inline. Justified deviation.echo "$COLLECTOR_SCRIPT":printf '%s\n'is more robust thanechofor content starting with-n. The fixed literal here has no such tokens, so safe in practice. Low priority.collect_inventory()callspython3 -c "..."on the Proxmox host for VM IP parsing. Not listed in the script header's dependency notes.Suggestions
python3as a dependency in the script header, or replacing withgrep/awkto avoid the assumption.pct listandqm listloops use positionalreadfields — Proxmox names with spaces would silently break field assignment (not a real risk in practice, but worth noting).Verdict: APPROVED (code quality)
The code is correct, secure, and all four stated bugs are properly fixed. The implementation is clean with good function decomposition and consistent bash practices. Resolve the duplicate PR #33 and the merge conflict before landing.
Automated review by Claude PR Reviewer
e15f88975fto1ed911e61b