fix: resolve variable interpolation and collector bugs in homelab-audit.sh #33

cal · 2026-04-03T01:26:24Z

cal commented

2026-04-03 01:26:24 +00:00

Summary

Pass STUCK_PROC_CPU_WARN as env var to remote bash sessions so the configurable threshold actually reaches the collector (was hardcoded 10)
Add pct config fallback for LXC IP discovery when lxc-info returns empty (static-IP containers)
Log SSH failures to $REPORT_DIR/ssh-failures.log instead of /dev/null, with a count in the audit summary
Add comment explaining intentional omission of set -e

Test plan

bash -n syntax check passes
Run audit on homelab and verify stuck process threshold is respected
Verify static-IP LXC containers appear in audit output
Verify SSH failures are logged and summarized

Fixes #23

🤖 Generated with Claude Code

## Summary - Pass `STUCK_PROC_CPU_WARN` as env var to remote bash sessions so the configurable threshold actually reaches the collector (was hardcoded `10`) - Add `pct config` fallback for LXC IP discovery when `lxc-info` returns empty (static-IP containers) - Log SSH failures to `$REPORT_DIR/ssh-failures.log` instead of `/dev/null`, with a count in the audit summary - Add comment explaining intentional omission of `set -e` ## Test plan - [x] `bash -n` syntax check passes - [ ] Run audit on homelab and verify stuck process threshold is respected - [ ] Verify static-IP LXC containers appear in audit output - [ ] Verify SSH failures are logged and summarized Fixes #23 🤖 Generated with [Claude Code](https://claude.com/claude-code)

cal added 16 commits 2026-04-03 01:26:26 +00:00

docs: sync KB — open-packs-checkin-crash.md 57a2618d44

docs: sync KB — codex-agents-marketplace.md 55e7f2604d

docs: sync KB — subagent-write-permission-blocked.md,release-2026.3.28.md 21dc20b9ab

docs: sync KB — 2026-03-30.md b057c5b4a6

docs: sync KB — release-2026.3.31.md afa179c6b7

docs: sync KB — release-2026.3.31.md b97d5ede2e

docs: sync KB — release-2026.3.31-2.md bdd500bcdf

docs: sync KB — troubleshooting.md 02bda7bdd5

docs: sync KB — troubleshooting.md dcefc0f3d1

docs: sync KB — release-2026.3.31.md,release-2026.4.01.md 0f59e77cf2

docs: sync KB — troubleshooting-gunicorn-worker-timeouts.md 43421e2cf1

docs: sync KB — database-release-2026.4.1.md 0c44dec1f0

docs: sync KB — release-2026.4.02.md b43cc36ff1

docs: sync KB — ace-step-local-network.md e3c1f20abf

docs: sync KB — troubleshooting.md 5ee48af40f

fix: resolve variable interpolation and collector bugs in homelab-audit.sh

Auto-merge docs-only PRs / auto-merge-docs (pull_request) Successful in 6s

Details

daee881987

- Pass STUCK_PROC_CPU_WARN as env var to remote bash sessions so the
  configurable threshold actually reaches the collector (was hardcoded 10)
- Add pct config fallback for LXC IP discovery when lxc-info returns
  empty (static-IP containers)
- Log SSH failures to $REPORT_DIR/ssh-failures.log instead of /dev/null,
  with a count in the audit summary
- Add comment explaining intentional omission of set -e

Fixes #23

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

cal reviewed 2026-04-03 01:34:59 +00:00

cal left a comment

AI Code Review

Files Reviewed

monitoring/scripts/homelab-audit.sh (added — 483 lines)
development/ace-step-local-network.md (added)
development/subagent-write-permission-blocked.md (added)
gaming/release-2026.4.02.md, gaming/troubleshooting.md, and 10 other .md files (added — docs-only, no issues)

The markdown files are documentation updates with no issues. All findings below are specific to homelab-audit.sh.

Findings

Correctness

Bug: --output with missing argument causes unbound variable error
Line 46-47: When --output is the last argument with no filename following, $2 is unbound. With set -uo pipefail active, the script crashes with a nondescript unbound variable error instead of a helpful usage message. Suggested fix:

    --output)
      [[ -z "${2:-}" ]] && { echo "Error: --output requires a filename"; exit 1; }
      OUTPUT_FILE="$2"
      shift 2
      ;;

--json / OUTPUT_FORMAT is declared and parsed but never used
Lines 22 and 41-44: OUTPUT_FORMAT is set to "json" when --json is passed, but parse_and_report and all output functions always produce text. The --json flag silently has no effect. Either implement JSON output or remove the flag from the argument parser and the usage comment in the header.

ZOMBIE_WARN threshold is defined but the comparison ignores it
Line 35 defines ZOMBIE_WARN=1. Line 317 compares zombie_count against hardcoded 0 instead of $ZOMBIE_WARN. If you want a configurable threshold:

if [[ "${zombie_count:-0}" -gt "$ZOMBIE_WARN" ]]; then

If the intent is to always warn on any zombie, remove ZOMBIE_WARN from the config block to avoid confusion.

now=$(date +%s) is dead code in the collector script
Line 151 of COLLECTOR_SCRIPT: now is assigned but never referenced anywhere. This is leftover from an earlier version of the stuck-process check that used now - start_time before switching to etimes. Should be removed.

wc -l may under-report multi-line SSH errors
Line 438: SSH connection errors can span multiple lines per failure event, so wc -l can over-count. The summary message "$ssh_fail_count error(s)" is potentially misleading. Simplest fix: just note that failures occurred and point to the log file, rather than showing a line count as an error count.

Security

StrictHostKeyChecking=accept-new in both SSH helpers
Lines 63 and 70: This mode auto-accepts any new host key on first connection (TOFU). In a stable homelab with pre-configured SSH config aliases this is acceptable, but noted for completeness. If known_hosts is pre-populated for all homelab hosts, StrictHostKeyChecking=yes would be more secure.

STUCK_PROC_CPU_WARN injected into remote env — value is unquoted in the SSH command string
Lines 171 and 370: "STUCK_PROC_CPU_WARN=$STUCK_PROC_CPU_WARN bash -s" — the value is unquoted. Since this value is always an integer set at the top of the script, there is no real injection risk here. Quoting defensively ("STUCK_PROC_CPU_WARN='$STUCK_PROC_CPU_WARN' bash -s") costs nothing.

Style & Conventions

Core fixes for issue #23 are correctly implemented:

STUCK_PROC_CPU_WARN=$STUCK_PROC_CPU_WARN bash -s correctly injects the threshold into the remote bash environment, fixing the hardcoded-10 bug. The ${STUCK_PROC_CPU_WARN:-10} fallback inside the single-quoted heredoc then reads that injected env var. This is the correct pattern.
The pct config fallback for LXC IPs at line 419 correctly handles static-IP containers where lxc-info returns empty. The grep -oP 'ip=\K[0-9.]+' correctly extracts the IP portion from a CIDR string like ip=10.10.0.5/24.
SSH failures now log to $REPORT_DIR/ssh-failures.log and are surfaced in the summary. Correct.
# -e omitted intentionally comment at line 15 is a good addition.

local -n requires bash 4.3+ (line 382) — Fine for all Linux homelab hosts, worth noting if portability ever matters.

grep -oP (PCRE) at line 419 — GNU grep only, fine on Linux.

Suggestions

Add a --help / -h flag that prints the usage line from the header comment.
The generate_summary recommendation list includes "Adjust monitoring thresholds to use per-core load metrics" (item 7), but the script already does per-core load analysis. That recommendation is stale.

Verdict: REQUEST_CHANGES (for informational purposes — Gitea prevented a formal review since this is a self-review)

Three correctness issues should be fixed before merge:

The --output FILE unbound variable crash when the filename argument is omitted
--json flag silently has no effect — implement it or remove it from the interface
ZOMBIE_WARN is defined in the config block but the comparison uses hardcoded 0, making the threshold non-functional

The dead now variable is a minor cleanup. Security observations are informational and are not blocking for a homelab script.

Automated review by Claude PR Reviewer

## AI Code Review ### Files Reviewed - `monitoring/scripts/homelab-audit.sh` (added — 483 lines) - `development/ace-step-local-network.md` (added) - `development/subagent-write-permission-blocked.md` (added) - `gaming/release-2026.4.02.md`, `gaming/troubleshooting.md`, and 10 other `.md` files (added — docs-only, no issues) The markdown files are documentation updates with no issues. All findings below are specific to `homelab-audit.sh`. --- ### Findings #### Correctness **Bug: `--output` with missing argument causes unbound variable error** Line 46-47: When `--output` is the last argument with no filename following, `$2` is unbound. With `set -uo pipefail` active, the script crashes with a nondescript unbound variable error instead of a helpful usage message. Suggested fix: ```bash --output) [[ -z "${2:-}" ]] && { echo "Error: --output requires a filename"; exit 1; } OUTPUT_FILE="$2" shift 2 ;; ``` **`--json` / `OUTPUT_FORMAT` is declared and parsed but never used** Lines 22 and 41-44: `OUTPUT_FORMAT` is set to `"json"` when `--json` is passed, but `parse_and_report` and all output functions always produce text. The `--json` flag silently has no effect. Either implement JSON output or remove the flag from the argument parser and the usage comment in the header. **`ZOMBIE_WARN` threshold is defined but the comparison ignores it** Line 35 defines `ZOMBIE_WARN=1`. Line 317 compares `zombie_count` against hardcoded `0` instead of `$ZOMBIE_WARN`. If you want a configurable threshold: ```bash if [[ "${zombie_count:-0}" -gt "$ZOMBIE_WARN" ]]; then ``` If the intent is to always warn on any zombie, remove `ZOMBIE_WARN` from the config block to avoid confusion. **`now=$(date +%s)` is dead code in the collector script** Line 151 of `COLLECTOR_SCRIPT`: `now` is assigned but never referenced anywhere. This is leftover from an earlier version of the stuck-process check that used `now - start_time` before switching to `etimes`. Should be removed. **`wc -l` may under-report multi-line SSH errors** Line 438: SSH connection errors can span multiple lines per failure event, so `wc -l` can over-count. The summary message `"$ssh_fail_count error(s)"` is potentially misleading. Simplest fix: just note that failures occurred and point to the log file, rather than showing a line count as an error count. #### Security **`StrictHostKeyChecking=accept-new` in both SSH helpers** Lines 63 and 70: This mode auto-accepts any new host key on first connection (TOFU). In a stable homelab with pre-configured SSH config aliases this is acceptable, but noted for completeness. If `known_hosts` is pre-populated for all homelab hosts, `StrictHostKeyChecking=yes` would be more secure. **`STUCK_PROC_CPU_WARN` injected into remote env — value is unquoted in the SSH command string** Lines 171 and 370: `"STUCK_PROC_CPU_WARN=$STUCK_PROC_CPU_WARN bash -s"` — the value is unquoted. Since this value is always an integer set at the top of the script, there is no real injection risk here. Quoting defensively (`"STUCK_PROC_CPU_WARN='$STUCK_PROC_CPU_WARN' bash -s"`) costs nothing. #### Style & Conventions **Core fixes for issue #23 are correctly implemented:** - `STUCK_PROC_CPU_WARN=$STUCK_PROC_CPU_WARN bash -s` correctly injects the threshold into the remote bash environment, fixing the hardcoded-`10` bug. The `${STUCK_PROC_CPU_WARN:-10}` fallback inside the single-quoted heredoc then reads that injected env var. This is the correct pattern. - The `pct config` fallback for LXC IPs at line 419 correctly handles static-IP containers where `lxc-info` returns empty. The `grep -oP 'ip=\K[0-9.]+'` correctly extracts the IP portion from a CIDR string like `ip=10.10.0.5/24`. - SSH failures now log to `$REPORT_DIR/ssh-failures.log` and are surfaced in the summary. Correct. - `# -e omitted intentionally` comment at line 15 is a good addition. **`local -n` requires bash 4.3+** (line 382) — Fine for all Linux homelab hosts, worth noting if portability ever matters. **`grep -oP` (PCRE)** at line 419 — GNU grep only, fine on Linux. #### Suggestions - Add a `--help` / `-h` flag that prints the usage line from the header comment. - The `generate_summary` recommendation list includes "Adjust monitoring thresholds to use per-core load metrics" (item 7), but the script already does per-core load analysis. That recommendation is stale. --- ### Verdict: REQUEST_CHANGES (for informational purposes — Gitea prevented a formal review since this is a self-review) Three correctness issues should be fixed before merge: 1. The `--output FILE` unbound variable crash when the filename argument is omitted 2. `--json` flag silently has no effect — implement it or remove it from the interface 3. `ZOMBIE_WARN` is defined in the config block but the comparison uses hardcoded `0`, making the threshold non-functional The dead `now` variable is a minor cleanup. Security observations are informational and are not blocking for a homelab script. --- *Automated review by Claude PR Reviewer*

Claude added the

ai-reviewing

label 2026-04-03 01:47:59 +00:00

cal closed this pull request

2026-04-03 01:49:29 +00:00

Claude reviewed 2026-04-03 01:50:37 +00:00

Claude left a comment

AI Code Review

Files Reviewed

monitoring/scripts/homelab-audit.sh (added, 483 lines)

Findings

Correctness

All four fixes described in the PR body are correctly implemented:

STUCK_PROC_CPU_WARN propagation — The COLLECTOR_SCRIPT heredoc is correctly single-quoted, so the local variable doesn't expand into it. The threshold is passed as an environment variable when invoking remote bash sessions (STUCK_PROC_CPU_WARN=$STUCK_PROC_CPU_WARN bash -s), and the collector uses ${STUCK_PROC_CPU_WARN:-10} to consume it with a sensible default. Correct.
LXC IP discovery fallback — lxc-info is tried first; if empty, falls back to pct config ... | grep -oP 'ip=\K[0-9.]+'. The regex correctly stops at / so CIDR notation is handled. Correct.
SSH failure logging — Both ssh_cmd and ssh_stdin redirect stderr to $REPORT_DIR/ssh-failures.log. generate_summary counts lines in that file and emits a warning. Correct.
set -e comment — The explanation (# -e omitted intentionally — unreachable hosts should not abort the full audit) is clear and justified for this use case. Correct.

One logic note: ssh_fail_count=$(wc -l <"$REPORT_DIR/ssh-failures.log") counts lines, not events. A single SSH connection failure can produce multiple error lines, so the count may overstate actual failures. The label "error(s)" in the output is reasonable but slightly imprecise.

Security

No hardcoded credentials.
STUCK_PROC_CPU_WARN is a script-level integer constant, so passing it as STUCK_PROC_CPU_WARN=$STUCK_PROC_CPU_WARN bash -s is safe (no injection risk).
$ctid and $vmid sourced from pct list/qm list on a trusted Proxmox host — acceptable for homelab scope.
StrictHostKeyChecking=accept-new is a reasonable tradeoff for automation on a private homelab.

No security issues found.

Style & Conventions

bash-CONTEXT.md recommends #!/bin/bash as shebang; the script uses #!/usr/bin/env bash. Functionally equivalent, minor divergence.
bash-CONTEXT.md recommends set -euo pipefail; the script uses set -uo pipefail with the -e omission explicitly justified in a comment. This is the correct call for a multi-host audit script.
Otherwise well-organized: clear section comments, consistent helper naming, clean separation between collection (COLLECTOR_SCRIPT), parsing (parse_and_report), and presentation.

Suggestions

SSH failure count precision: Consider counting sentinel patterns rather than raw lines if you want an accurate per-host failure count (e.g., grep for ssh: or Permission denied patterns), or write one sentinel line per host in the collect_guest failure branch.
python3 dependency on Proxmox host: The VM IP discovery uses python3 -c for JSON parsing. This works on Proxmox (Debian-based), but it's an undocumented dependency. A comment noting it would help.
AUDIT_START/AUDIT_END markers are emitted by the collector but never validated by parse_and_report. If a remote script partially fails, the parse would silently proceed on incomplete data. Not a bug for the current scope, but worth a TODO comment.

Verdict: COMMENT

The code is correct and well-written — the four bug fixes all land as described. However, the PR is currently not mergeable (mergeable: false from Gitea) due to a conflict with PR #34, which also creates monitoring/scripts/homelab-audit.sh as a fix for the same issue #23. One of these PRs should be closed before proceeding. Once the conflict is resolved, this code is ready to merge.

Automated review by Claude PR Reviewer

## AI Code Review ### Files Reviewed - `monitoring/scripts/homelab-audit.sh` (added, 483 lines) ### Findings #### Correctness All four fixes described in the PR body are correctly implemented: 1. **`STUCK_PROC_CPU_WARN` propagation** — The `COLLECTOR_SCRIPT` heredoc is correctly single-quoted, so the local variable doesn't expand into it. The threshold is passed as an environment variable when invoking remote bash sessions (`STUCK_PROC_CPU_WARN=$STUCK_PROC_CPU_WARN bash -s`), and the collector uses `${STUCK_PROC_CPU_WARN:-10}` to consume it with a sensible default. Correct. 2. **LXC IP discovery fallback** — `lxc-info` is tried first; if empty, falls back to `pct config ... | grep -oP 'ip=\K[0-9.]+'`. The regex correctly stops at `/` so CIDR notation is handled. Correct. 3. **SSH failure logging** — Both `ssh_cmd` and `ssh_stdin` redirect stderr to `$REPORT_DIR/ssh-failures.log`. `generate_summary` counts lines in that file and emits a warning. Correct. 4. **`set -e` comment** — The explanation (`# -e omitted intentionally — unreachable hosts should not abort the full audit`) is clear and justified for this use case. Correct. **One logic note**: `ssh_fail_count=$(wc -l <"$REPORT_DIR/ssh-failures.log")` counts *lines*, not *events*. A single SSH connection failure can produce multiple error lines, so the count may overstate actual failures. The label "error(s)" in the output is reasonable but slightly imprecise. #### Security - No hardcoded credentials. - `STUCK_PROC_CPU_WARN` is a script-level integer constant, so passing it as `STUCK_PROC_CPU_WARN=$STUCK_PROC_CPU_WARN bash -s` is safe (no injection risk). - `$ctid` and `$vmid` sourced from `pct list`/`qm list` on a trusted Proxmox host — acceptable for homelab scope. - `StrictHostKeyChecking=accept-new` is a reasonable tradeoff for automation on a private homelab. No security issues found. #### Style & Conventions - `bash-CONTEXT.md` recommends `#!/bin/bash` as shebang; the script uses `#!/usr/bin/env bash`. Functionally equivalent, minor divergence. - `bash-CONTEXT.md` recommends `set -euo pipefail`; the script uses `set -uo pipefail` with the `-e` omission explicitly justified in a comment. This is the correct call for a multi-host audit script. - Otherwise well-organized: clear section comments, consistent helper naming, clean separation between collection (`COLLECTOR_SCRIPT`), parsing (`parse_and_report`), and presentation. #### Suggestions - **SSH failure count precision**: Consider counting sentinel patterns rather than raw lines if you want an accurate per-host failure count (e.g., grep for `ssh:` or `Permission denied` patterns), or write one sentinel line per host in the `collect_guest` failure branch. - **`python3` dependency on Proxmox host**: The VM IP discovery uses `python3 -c` for JSON parsing. This works on Proxmox (Debian-based), but it's an undocumented dependency. A comment noting it would help. - **`AUDIT_START`/`AUDIT_END` markers** are emitted by the collector but never validated by `parse_and_report`. If a remote script partially fails, the parse would silently proceed on incomplete data. Not a bug for the current scope, but worth a TODO comment. ### Verdict: COMMENT **The code is correct and well-written — the four bug fixes all land as described.** However, the PR is currently **not mergeable** (`mergeable: false` from Gitea) due to a conflict with PR #34, which also creates `monitoring/scripts/homelab-audit.sh` as a fix for the same issue #23. One of these PRs should be closed before proceeding. Once the conflict is resolved, this code is ready to merge. --- *Automated review by Claude PR Reviewer*

Claude added

ai-reviewed

and removed

ai-reviewing

labels 2026-04-03 01:50:42 +00:00

Auto-merge docs-only PRs / auto-merge-docs (pull_request) Successful in 6s

Details