fix: resolve variable interpolation and collector bugs in homelab-audit.sh #33

Closed
cal wants to merge 16 commits from bug/23-fix-audit-variable-interpolation into main
Owner

Summary

  • Pass STUCK_PROC_CPU_WARN as env var to remote bash sessions so the configurable threshold actually reaches the collector (was hardcoded 10)
  • Add pct config fallback for LXC IP discovery when lxc-info returns empty (static-IP containers)
  • Log SSH failures to $REPORT_DIR/ssh-failures.log instead of /dev/null, with a count in the audit summary
  • Add comment explaining intentional omission of set -e

Test plan

  • bash -n syntax check passes
  • Run audit on homelab and verify stuck process threshold is respected
  • Verify static-IP LXC containers appear in audit output
  • Verify SSH failures are logged and summarized

Fixes #23

🤖 Generated with Claude Code

## Summary - Pass `STUCK_PROC_CPU_WARN` as env var to remote bash sessions so the configurable threshold actually reaches the collector (was hardcoded `10`) - Add `pct config` fallback for LXC IP discovery when `lxc-info` returns empty (static-IP containers) - Log SSH failures to `$REPORT_DIR/ssh-failures.log` instead of `/dev/null`, with a count in the audit summary - Add comment explaining intentional omission of `set -e` ## Test plan - [x] `bash -n` syntax check passes - [ ] Run audit on homelab and verify stuck process threshold is respected - [ ] Verify static-IP LXC containers appear in audit output - [ ] Verify SSH failures are logged and summarized Fixes #23 🤖 Generated with [Claude Code](https://claude.com/claude-code)
cal added 16 commits 2026-04-03 01:26:26 +00:00
fix: resolve variable interpolation and collector bugs in homelab-audit.sh
All checks were successful
Auto-merge docs-only PRs / auto-merge-docs (pull_request) Successful in 6s
daee881987
- Pass STUCK_PROC_CPU_WARN as env var to remote bash sessions so the
  configurable threshold actually reaches the collector (was hardcoded 10)
- Add pct config fallback for LXC IP discovery when lxc-info returns
  empty (static-IP containers)
- Log SSH failures to $REPORT_DIR/ssh-failures.log instead of /dev/null,
  with a count in the audit summary
- Add comment explaining intentional omission of set -e

Fixes #23

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
cal reviewed 2026-04-03 01:34:59 +00:00
cal left a comment
Author
Owner

AI Code Review

Files Reviewed

  • monitoring/scripts/homelab-audit.sh (added — 483 lines)
  • development/ace-step-local-network.md (added)
  • development/subagent-write-permission-blocked.md (added)
  • gaming/release-2026.4.02.md, gaming/troubleshooting.md, and 10 other .md files (added — docs-only, no issues)

The markdown files are documentation updates with no issues. All findings below are specific to homelab-audit.sh.


Findings

Correctness

Bug: --output with missing argument causes unbound variable error
Line 46-47: When --output is the last argument with no filename following, $2 is unbound. With set -uo pipefail active, the script crashes with a nondescript unbound variable error instead of a helpful usage message. Suggested fix:

    --output)
      [[ -z "${2:-}" ]] && { echo "Error: --output requires a filename"; exit 1; }
      OUTPUT_FILE="$2"
      shift 2
      ;;

--json / OUTPUT_FORMAT is declared and parsed but never used
Lines 22 and 41-44: OUTPUT_FORMAT is set to "json" when --json is passed, but parse_and_report and all output functions always produce text. The --json flag silently has no effect. Either implement JSON output or remove the flag from the argument parser and the usage comment in the header.

ZOMBIE_WARN threshold is defined but the comparison ignores it
Line 35 defines ZOMBIE_WARN=1. Line 317 compares zombie_count against hardcoded 0 instead of $ZOMBIE_WARN. If you want a configurable threshold:

if [[ "${zombie_count:-0}" -gt "$ZOMBIE_WARN" ]]; then

If the intent is to always warn on any zombie, remove ZOMBIE_WARN from the config block to avoid confusion.

now=$(date +%s) is dead code in the collector script
Line 151 of COLLECTOR_SCRIPT: now is assigned but never referenced anywhere. This is leftover from an earlier version of the stuck-process check that used now - start_time before switching to etimes. Should be removed.

wc -l may under-report multi-line SSH errors
Line 438: SSH connection errors can span multiple lines per failure event, so wc -l can over-count. The summary message "$ssh_fail_count error(s)" is potentially misleading. Simplest fix: just note that failures occurred and point to the log file, rather than showing a line count as an error count.

Security

StrictHostKeyChecking=accept-new in both SSH helpers
Lines 63 and 70: This mode auto-accepts any new host key on first connection (TOFU). In a stable homelab with pre-configured SSH config aliases this is acceptable, but noted for completeness. If known_hosts is pre-populated for all homelab hosts, StrictHostKeyChecking=yes would be more secure.

STUCK_PROC_CPU_WARN injected into remote env — value is unquoted in the SSH command string
Lines 171 and 370: "STUCK_PROC_CPU_WARN=$STUCK_PROC_CPU_WARN bash -s" — the value is unquoted. Since this value is always an integer set at the top of the script, there is no real injection risk here. Quoting defensively ("STUCK_PROC_CPU_WARN='$STUCK_PROC_CPU_WARN' bash -s") costs nothing.

Style & Conventions

Core fixes for issue #23 are correctly implemented:

  • STUCK_PROC_CPU_WARN=$STUCK_PROC_CPU_WARN bash -s correctly injects the threshold into the remote bash environment, fixing the hardcoded-10 bug. The ${STUCK_PROC_CPU_WARN:-10} fallback inside the single-quoted heredoc then reads that injected env var. This is the correct pattern.
  • The pct config fallback for LXC IPs at line 419 correctly handles static-IP containers where lxc-info returns empty. The grep -oP 'ip=\K[0-9.]+' correctly extracts the IP portion from a CIDR string like ip=10.10.0.5/24.
  • SSH failures now log to $REPORT_DIR/ssh-failures.log and are surfaced in the summary. Correct.
  • # -e omitted intentionally comment at line 15 is a good addition.

local -n requires bash 4.3+ (line 382) — Fine for all Linux homelab hosts, worth noting if portability ever matters.

grep -oP (PCRE) at line 419 — GNU grep only, fine on Linux.

Suggestions

  • Add a --help / -h flag that prints the usage line from the header comment.
  • The generate_summary recommendation list includes "Adjust monitoring thresholds to use per-core load metrics" (item 7), but the script already does per-core load analysis. That recommendation is stale.

Verdict: REQUEST_CHANGES (for informational purposes — Gitea prevented a formal review since this is a self-review)

Three correctness issues should be fixed before merge:

  1. The --output FILE unbound variable crash when the filename argument is omitted
  2. --json flag silently has no effect — implement it or remove it from the interface
  3. ZOMBIE_WARN is defined in the config block but the comparison uses hardcoded 0, making the threshold non-functional

The dead now variable is a minor cleanup. Security observations are informational and are not blocking for a homelab script.


Automated review by Claude PR Reviewer

## AI Code Review ### Files Reviewed - `monitoring/scripts/homelab-audit.sh` (added — 483 lines) - `development/ace-step-local-network.md` (added) - `development/subagent-write-permission-blocked.md` (added) - `gaming/release-2026.4.02.md`, `gaming/troubleshooting.md`, and 10 other `.md` files (added — docs-only, no issues) The markdown files are documentation updates with no issues. All findings below are specific to `homelab-audit.sh`. --- ### Findings #### Correctness **Bug: `--output` with missing argument causes unbound variable error** Line 46-47: When `--output` is the last argument with no filename following, `$2` is unbound. With `set -uo pipefail` active, the script crashes with a nondescript unbound variable error instead of a helpful usage message. Suggested fix: ```bash --output) [[ -z "${2:-}" ]] && { echo "Error: --output requires a filename"; exit 1; } OUTPUT_FILE="$2" shift 2 ;; ``` **`--json` / `OUTPUT_FORMAT` is declared and parsed but never used** Lines 22 and 41-44: `OUTPUT_FORMAT` is set to `"json"` when `--json` is passed, but `parse_and_report` and all output functions always produce text. The `--json` flag silently has no effect. Either implement JSON output or remove the flag from the argument parser and the usage comment in the header. **`ZOMBIE_WARN` threshold is defined but the comparison ignores it** Line 35 defines `ZOMBIE_WARN=1`. Line 317 compares `zombie_count` against hardcoded `0` instead of `$ZOMBIE_WARN`. If you want a configurable threshold: ```bash if [[ "${zombie_count:-0}" -gt "$ZOMBIE_WARN" ]]; then ``` If the intent is to always warn on any zombie, remove `ZOMBIE_WARN` from the config block to avoid confusion. **`now=$(date +%s)` is dead code in the collector script** Line 151 of `COLLECTOR_SCRIPT`: `now` is assigned but never referenced anywhere. This is leftover from an earlier version of the stuck-process check that used `now - start_time` before switching to `etimes`. Should be removed. **`wc -l` may under-report multi-line SSH errors** Line 438: SSH connection errors can span multiple lines per failure event, so `wc -l` can over-count. The summary message `"$ssh_fail_count error(s)"` is potentially misleading. Simplest fix: just note that failures occurred and point to the log file, rather than showing a line count as an error count. #### Security **`StrictHostKeyChecking=accept-new` in both SSH helpers** Lines 63 and 70: This mode auto-accepts any new host key on first connection (TOFU). In a stable homelab with pre-configured SSH config aliases this is acceptable, but noted for completeness. If `known_hosts` is pre-populated for all homelab hosts, `StrictHostKeyChecking=yes` would be more secure. **`STUCK_PROC_CPU_WARN` injected into remote env — value is unquoted in the SSH command string** Lines 171 and 370: `"STUCK_PROC_CPU_WARN=$STUCK_PROC_CPU_WARN bash -s"` — the value is unquoted. Since this value is always an integer set at the top of the script, there is no real injection risk here. Quoting defensively (`"STUCK_PROC_CPU_WARN='$STUCK_PROC_CPU_WARN' bash -s"`) costs nothing. #### Style & Conventions **Core fixes for issue #23 are correctly implemented:** - `STUCK_PROC_CPU_WARN=$STUCK_PROC_CPU_WARN bash -s` correctly injects the threshold into the remote bash environment, fixing the hardcoded-`10` bug. The `${STUCK_PROC_CPU_WARN:-10}` fallback inside the single-quoted heredoc then reads that injected env var. This is the correct pattern. - The `pct config` fallback for LXC IPs at line 419 correctly handles static-IP containers where `lxc-info` returns empty. The `grep -oP 'ip=\K[0-9.]+'` correctly extracts the IP portion from a CIDR string like `ip=10.10.0.5/24`. - SSH failures now log to `$REPORT_DIR/ssh-failures.log` and are surfaced in the summary. Correct. - `# -e omitted intentionally` comment at line 15 is a good addition. **`local -n` requires bash 4.3+** (line 382) — Fine for all Linux homelab hosts, worth noting if portability ever matters. **`grep -oP` (PCRE)** at line 419 — GNU grep only, fine on Linux. #### Suggestions - Add a `--help` / `-h` flag that prints the usage line from the header comment. - The `generate_summary` recommendation list includes "Adjust monitoring thresholds to use per-core load metrics" (item 7), but the script already does per-core load analysis. That recommendation is stale. --- ### Verdict: REQUEST_CHANGES (for informational purposes — Gitea prevented a formal review since this is a self-review) Three correctness issues should be fixed before merge: 1. The `--output FILE` unbound variable crash when the filename argument is omitted 2. `--json` flag silently has no effect — implement it or remove it from the interface 3. `ZOMBIE_WARN` is defined in the config block but the comparison uses hardcoded `0`, making the threshold non-functional The dead `now` variable is a minor cleanup. Security observations are informational and are not blocking for a homelab script. --- *Automated review by Claude PR Reviewer*
Claude added the
ai-reviewing
label 2026-04-03 01:47:59 +00:00
cal closed this pull request 2026-04-03 01:49:29 +00:00
Claude reviewed 2026-04-03 01:50:37 +00:00
Claude left a comment
Collaborator

AI Code Review

Files Reviewed

  • monitoring/scripts/homelab-audit.sh (added, 483 lines)

Findings

Correctness

All four fixes described in the PR body are correctly implemented:

  1. STUCK_PROC_CPU_WARN propagation — The COLLECTOR_SCRIPT heredoc is correctly single-quoted, so the local variable doesn't expand into it. The threshold is passed as an environment variable when invoking remote bash sessions (STUCK_PROC_CPU_WARN=$STUCK_PROC_CPU_WARN bash -s), and the collector uses ${STUCK_PROC_CPU_WARN:-10} to consume it with a sensible default. Correct.

  2. LXC IP discovery fallbacklxc-info is tried first; if empty, falls back to pct config ... | grep -oP 'ip=\K[0-9.]+'. The regex correctly stops at / so CIDR notation is handled. Correct.

  3. SSH failure logging — Both ssh_cmd and ssh_stdin redirect stderr to $REPORT_DIR/ssh-failures.log. generate_summary counts lines in that file and emits a warning. Correct.

  4. set -e comment — The explanation (# -e omitted intentionally — unreachable hosts should not abort the full audit) is clear and justified for this use case. Correct.

One logic note: ssh_fail_count=$(wc -l <"$REPORT_DIR/ssh-failures.log") counts lines, not events. A single SSH connection failure can produce multiple error lines, so the count may overstate actual failures. The label "error(s)" in the output is reasonable but slightly imprecise.

Security

  • No hardcoded credentials.
  • STUCK_PROC_CPU_WARN is a script-level integer constant, so passing it as STUCK_PROC_CPU_WARN=$STUCK_PROC_CPU_WARN bash -s is safe (no injection risk).
  • $ctid and $vmid sourced from pct list/qm list on a trusted Proxmox host — acceptable for homelab scope.
  • StrictHostKeyChecking=accept-new is a reasonable tradeoff for automation on a private homelab.

No security issues found.

Style & Conventions

  • bash-CONTEXT.md recommends #!/bin/bash as shebang; the script uses #!/usr/bin/env bash. Functionally equivalent, minor divergence.
  • bash-CONTEXT.md recommends set -euo pipefail; the script uses set -uo pipefail with the -e omission explicitly justified in a comment. This is the correct call for a multi-host audit script.
  • Otherwise well-organized: clear section comments, consistent helper naming, clean separation between collection (COLLECTOR_SCRIPT), parsing (parse_and_report), and presentation.

Suggestions

  • SSH failure count precision: Consider counting sentinel patterns rather than raw lines if you want an accurate per-host failure count (e.g., grep for ssh: or Permission denied patterns), or write one sentinel line per host in the collect_guest failure branch.
  • python3 dependency on Proxmox host: The VM IP discovery uses python3 -c for JSON parsing. This works on Proxmox (Debian-based), but it's an undocumented dependency. A comment noting it would help.
  • AUDIT_START/AUDIT_END markers are emitted by the collector but never validated by parse_and_report. If a remote script partially fails, the parse would silently proceed on incomplete data. Not a bug for the current scope, but worth a TODO comment.

Verdict: COMMENT

The code is correct and well-written — the four bug fixes all land as described. However, the PR is currently not mergeable (mergeable: false from Gitea) due to a conflict with PR #34, which also creates monitoring/scripts/homelab-audit.sh as a fix for the same issue #23. One of these PRs should be closed before proceeding. Once the conflict is resolved, this code is ready to merge.


Automated review by Claude PR Reviewer

## AI Code Review ### Files Reviewed - `monitoring/scripts/homelab-audit.sh` (added, 483 lines) ### Findings #### Correctness All four fixes described in the PR body are correctly implemented: 1. **`STUCK_PROC_CPU_WARN` propagation** — The `COLLECTOR_SCRIPT` heredoc is correctly single-quoted, so the local variable doesn't expand into it. The threshold is passed as an environment variable when invoking remote bash sessions (`STUCK_PROC_CPU_WARN=$STUCK_PROC_CPU_WARN bash -s`), and the collector uses `${STUCK_PROC_CPU_WARN:-10}` to consume it with a sensible default. Correct. 2. **LXC IP discovery fallback** — `lxc-info` is tried first; if empty, falls back to `pct config ... | grep -oP 'ip=\K[0-9.]+'`. The regex correctly stops at `/` so CIDR notation is handled. Correct. 3. **SSH failure logging** — Both `ssh_cmd` and `ssh_stdin` redirect stderr to `$REPORT_DIR/ssh-failures.log`. `generate_summary` counts lines in that file and emits a warning. Correct. 4. **`set -e` comment** — The explanation (`# -e omitted intentionally — unreachable hosts should not abort the full audit`) is clear and justified for this use case. Correct. **One logic note**: `ssh_fail_count=$(wc -l <"$REPORT_DIR/ssh-failures.log")` counts *lines*, not *events*. A single SSH connection failure can produce multiple error lines, so the count may overstate actual failures. The label "error(s)" in the output is reasonable but slightly imprecise. #### Security - No hardcoded credentials. - `STUCK_PROC_CPU_WARN` is a script-level integer constant, so passing it as `STUCK_PROC_CPU_WARN=$STUCK_PROC_CPU_WARN bash -s` is safe (no injection risk). - `$ctid` and `$vmid` sourced from `pct list`/`qm list` on a trusted Proxmox host — acceptable for homelab scope. - `StrictHostKeyChecking=accept-new` is a reasonable tradeoff for automation on a private homelab. No security issues found. #### Style & Conventions - `bash-CONTEXT.md` recommends `#!/bin/bash` as shebang; the script uses `#!/usr/bin/env bash`. Functionally equivalent, minor divergence. - `bash-CONTEXT.md` recommends `set -euo pipefail`; the script uses `set -uo pipefail` with the `-e` omission explicitly justified in a comment. This is the correct call for a multi-host audit script. - Otherwise well-organized: clear section comments, consistent helper naming, clean separation between collection (`COLLECTOR_SCRIPT`), parsing (`parse_and_report`), and presentation. #### Suggestions - **SSH failure count precision**: Consider counting sentinel patterns rather than raw lines if you want an accurate per-host failure count (e.g., grep for `ssh:` or `Permission denied` patterns), or write one sentinel line per host in the `collect_guest` failure branch. - **`python3` dependency on Proxmox host**: The VM IP discovery uses `python3 -c` for JSON parsing. This works on Proxmox (Debian-based), but it's an undocumented dependency. A comment noting it would help. - **`AUDIT_START`/`AUDIT_END` markers** are emitted by the collector but never validated by `parse_and_report`. If a remote script partially fails, the parse would silently proceed on incomplete data. Not a bug for the current scope, but worth a TODO comment. ### Verdict: COMMENT **The code is correct and well-written — the four bug fixes all land as described.** However, the PR is currently **not mergeable** (`mergeable: false` from Gitea) due to a conflict with PR #34, which also creates `monitoring/scripts/homelab-audit.sh` as a fix for the same issue #23. One of these PRs should be closed before proceeding. Once the conflict is resolved, this code is ready to merge. --- *Automated review by Claude PR Reviewer*
Claude added
ai-reviewed
and removed
ai-reviewing
labels 2026-04-03 01:50:42 +00:00
All checks were successful
Auto-merge docs-only PRs / auto-merge-docs (pull_request) Successful in 6s

Pull request closed

Sign in to join this conversation.
No description provided.