feat: zombie parent, swap, and OOM metrics + Tdarr hardening #35

Merged
cal merged 2 commits from chore/30-investigate-manticore-zombies-swap into main 2026-04-03 02:05:47 +00:00
Owner

Summary

  • Audit script: Added zombie_parents(), swap_mb(), and oom_events() collector functions to homelab-audit.sh with corresponding parse cases — now identifies which process spawns zombies, flags swap usage > 512 MB, and reports OOM kills from the last 7 days
  • Tdarr docker-compose: Added init: true to both services (tini reaps orphaned ffmpeg children), and mem_limit: 28g / memswap_limit: 30g to tdarr-node to prevent unbounded memory use
  • Test script: New test-audit-collectors.sh validates output format of all 8 collector functions
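As a rough illustration of the "flags swap usage > 512 MB" behaviour, the check on the parse side might look like the sketch below. The function name `check_swap` and the message format are assumptions made for this example; only the 512 MB threshold and the empty-value guard come from this PR.

```bash
#!/usr/bin/env bash
# Hypothetical parse-side check. Only the 512 MB threshold is taken from the
# PR summary; the function name and the message wording are assumptions.
SWAP_WARN_MB=512

check_swap() {
  local swap="$1"
  # An empty value means the collector produced no output: skip the check.
  [[ -n "$swap" ]] || return 0
  if (( swap > SWAP_WARN_MB )); then
    echo "WARN: swap usage ${swap} MB exceeds ${SWAP_WARN_MB} MB"
  fi
}

check_swap 600   # prints a warning
check_swap 100   # prints nothing
check_swap ""    # prints nothing
```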

Test plan

  • bash -n homelab-audit.sh — syntax check passes
  • docker compose config --quiet — compose validation passes
  • test-audit-collectors.sh — all 8 collector format tests pass
  • Run audit on Proxmox host and confirm new metrics appear in findings
  • Restart Tdarr containers on manticore with updated compose

Closes #30

🤖 Generated with Claude Code

cal added 1 commit 2026-04-03 02:02:23 +00:00
feat: add zombie parent, swap, and OOM metrics to audit; harden Tdarr containers
All checks were successful
Auto-merge docs-only PRs / auto-merge-docs (pull_request) Successful in 3s
f28dfeb4bf
Extend homelab-audit.sh collector with zombie_parents(), swap_mb(), and
oom_events() functions so the audit identifies which process spawns zombies,
flags high swap usage, and reports recent OOM kills. Add init: true to both
Tdarr docker-compose services so tini reaps orphaned ffmpeg children, and
cap tdarr-node at 28g RAM / 30g total to prevent unbounded memory use.

Closes #30

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
cal reviewed 2026-04-03 02:04:25 +00:00
cal left a comment
Author
Owner

AI Code Review

Files Reviewed

  • monitoring/scripts/homelab-audit.sh (modified)
  • monitoring/scripts/test-audit-collectors.sh (added)
  • server-configs/ubuntu-manticore/docker-compose/tdarr/docker-compose.yml (modified)

Findings

Correctness

[BLOCKER] mem_limit / memswap_limit are Compose file format v2 syntax and will be silently ignored

docker-compose.yml lines 29–30:

mem_limit: 28g
memswap_limit: 30g

This file uses version: "3.8", which maps to the Compose Spec. Under that format, the service-level keys mem_limit and memswap_limit come from the version 2 file format, and docker compose (the Compose v2 CLI) silently ignores them rather than erroring. The memory limits will not be enforced. You can verify this with docker inspect tdarr-node | jq '.[].HostConfig.Memory' after bring-up: a value of 0 means no memory limit is in effect.
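The verification step can be wrapped in a small helper so the result is unambiguous. This is a minimal sketch: `describe_mem_limit` is a hypothetical name, and the only assumption about Docker is that `HostConfig.Memory` holds the limit in bytes, with 0 meaning unlimited.

```bash
#!/usr/bin/env bash
# Hypothetical helper: turn the byte count reported in HostConfig.Memory
# into a readable verdict. A value of 0 means no limit is applied.
describe_mem_limit() {
  local bytes="$1"
  if (( bytes == 0 )); then
    echo "unlimited"
  else
    echo "$(( bytes / 1024 / 1024 / 1024 )) GiB"
  fi
}

# Usage against a running container (requires docker):
#   describe_mem_limit "$(docker inspect --format '{{.HostConfig.Memory}}' tdarr-node)"
```

With a 28g limit in effect the helper would print `28 GiB`; `unlimited` means the setting did not reach the container.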

The correct v3 syntax places limits under deploy.resources.limits, alongside the existing GPU reservation:

deploy:
  resources:
    limits:
      memory: 28g
    reservations:
      devices:
        - driver: nvidia
          count: all
          capabilities: [gpu]

Note on memswap_limit: the deploy block has no equivalent of memswap_limit; deploy.resources.limits covers memory but not swap. For a standalone host, the most reliable approach is to set memory under deploy.resources.limits and accept that swap is not separately capped (or disable swap at the host level if that is a concern).


swap_mb() can return empty string inside LXCs with no swap

homelab-audit.sh line 93:

free | awk '/^Swap:/ {printf "%.0f", $3/1024}'

On some minimal unprivileged LXC containers, the Swap line may be absent from free output entirely, producing an empty string. The parse_and_report caller guards with [[ -n "$swap" ]] before the arithmetic comparison so it will not crash, but the metric is silently missing from the report. A safer implementation:

swap_mb() {
  free | awk '/^Swap:/ {printf "%.0f", $3/1024; found=1} END {if (!found) print "0"}'
}
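The difference between a host with and without a Swap line can be checked without touching a real container. The sketch below runs the same awk program against canned `free`-style input; `swap_mb_from` is a hypothetical stdin-reading variant introduced only for this demonstration, and the sample numbers are made up.

```bash
#!/usr/bin/env bash
# Same awk program as the suggested swap_mb(), but reading from stdin so it
# can be exercised with canned input. swap_mb_from is a hypothetical name.
swap_mb_from() {
  awk '/^Swap:/ {printf "%.0f", $3/1024; found=1} END {if (!found) print "0"}'
}

# Made-up sample output in the layout of `free` (values in KiB).
with_swap='              total        used        free
Mem:       32768000    16384000    16384000
Swap:       8388608     1048576     7340032'

without_swap='              total        used        free
Mem:       32768000    16384000    16384000'

echo "with swap:    $(printf '%s\n' "$with_swap" | swap_mb_from) MB"
echo "without swap: $(printf '%s\n' "$without_swap" | swap_mb_from) MB"
```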

Security

No issues found. No user-controlled input reaches shell expansion without quoting. journalctl output is only consumed as a count. The xargs -I{} pattern with ps -p is safe since PPIDs are always numeric.


Style & Conventions

Test script header comment is inaccurate

test-audit-collectors.sh lines 4–5 say the script "Extracts each collector function from the audit script's COLLECTOR_SCRIPT heredoc and runs it locally." The script actually re-implements the collector logic inline — it does not extract from the heredoc. This is a hidden maintenance hazard: if the collector functions change in homelab-audit.sh, these tests will continue passing while testing the old behavior. The comment should be corrected to accurately describe what the script does.
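If the tests were ever changed to exercise the real collectors instead of inline copies, one possible approach is to pull each function definition out of the audit script and `eval` it. The sketch below is an assumption about how that could be done, not what the test script does: `extract_fn` is a hypothetical helper, and it only handles multi-line functions that open with `name() {` and close with a line containing just `}`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the definition of shell function $1 from file $2.
# Assumes the definition opens with "name() {" and ends with a line that
# contains only "}" (optionally surrounded by whitespace).
extract_fn() {
  local name="$1" file="$2"
  awk -v name="$name" '
    $0 ~ ("^[ \t]*" name "[(][)][ \t]*[{]") { printing = 1 }
    printing { print }
    printing && /^[ \t]*[}][ \t]*$/ { exit }
  ' "$file"
}

# Usage (path as listed under Files Reviewed):
#   eval "$(extract_fn swap_mb monitoring/scripts/homelab-audit.sh)"
#   swap_mb
```

Because the definition is read from the audit script at test time, a change to the collector would be picked up by the next test run.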

oom_events test block: subtle interaction with set -euo pipefail

test-audit-collectors.sh lines 70–71:

result=$(journalctl -k --since "7 days ago" 2>/dev/null | grep -ci "out of memory") || true
result="${result:-0}"

When grep -ci matches zero lines it prints 0 but exits 1, making the command substitution exit 1. The || true prevents an abort under set -e. The assignment still captures grep's output, so result is "0" in that case; line 71's ${result:-0} is a defensive fallback that only matters if nothing is captured at all (for example, if grep itself could not run). Functionally correct but subtle, so a brief inline comment would help future readers.
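The pattern can be tried in isolation under the same shell options. In this sketch `count_matches` is a hypothetical wrapper that reads from stdin instead of `journalctl`, so it runs anywhere; the sample log lines are made up.

```bash
#!/usr/bin/env bash
set -euo pipefail

# count_matches is a hypothetical wrapper around the pattern quoted above:
# count case-insensitive matches of $1 on stdin without tripping set -e.
count_matches() {
  local pattern="$1" result
  # grep exits 1 when nothing matches; "|| true" keeps errexit/pipefail
  # from aborting the script on that non-error condition.
  result=$(grep -ci -- "$pattern") || true
  # Fall back to 0 if nothing was captured at all.
  echo "${result:-0}"
}

# One matching line, then no matching lines; the script survives both.
printf 'boot ok\nOut of memory: Killed process 123 (ffmpeg)\n' | count_matches "out of memory"
printf 'boot ok\n' | count_matches "out of memory"
```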


Edge Cases

zombie_parents() empty-input behavior confirmed safe

When there are no zombie processes, xargs -I{} receives empty stdin and runs zero iterations. paste -sd, on empty input produces an empty string. The [[ -n "$zparents" ]] guard in parse_and_report handles this correctly.
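The empty-input case is easy to reproduce with canned data. The real `zombie_parents()` is not shown in this thread, so the sketch below is a hypothetical stand-in: `zombie_ppids` reads "PPID STAT" pairs from stdin and prints parent PIDs only, leaving the name lookup via `ps -p` to the usage comment.

```bash
#!/usr/bin/env bash
# Hypothetical stand-in for illustration; not the collector from the PR.
# Reads "PPID STAT" pairs on stdin and prints the unique parent PIDs of
# zombie (Z-state) processes as a comma-separated list.
zombie_ppids() {
  awk '$2 ~ /^Z/ {print $1}' | sort -un | paste -sd, -
}

# Usage on a live system:
#   ps -eo ppid=,stat= | zombie_ppids
# Resolving a PID to a command name afterwards:
#   ps -o comm= -p <pid>
```

With no Z-state rows in the input, the captured output is an empty string, which is the case the `[[ -n "$zparents" ]]` guard covers.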

Memory limit sizing

28g on a 32 GB server leaves roughly 4 GB for the OS, tdarr-server, and other containers. During active multi-worker transcoding jobs this is tight but intentional. The 2 GB of implied swap headroom (memswap_limit: 30g minus mem_limit: 28g) provides a small buffer. Reasonable for the stated goal — noted here for awareness only.
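The sizing arithmetic above, spelled out. The variable names are illustrative; the figures are the ones quoted in this review, and the swap figure relies on memswap_limit being the combined memory-plus-swap ceiling.

```bash
#!/usr/bin/env bash
# Figures quoted in the review; variable names are illustrative only.
total_gb=32          # physical RAM on the host
mem_limit_gb=28      # mem_limit for tdarr-node
memswap_limit_gb=30  # memswap_limit: RAM plus swap combined

echo "RAM left for everything else: $(( total_gb - mem_limit_gb )) GB"
echo "Swap available to tdarr-node: $(( memswap_limit_gb - mem_limit_gb )) GB"
```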


Verdict: REQUEST_CHANGES (advisory — one blocker)

One blocker: mem_limit / memswap_limit as top-level service keys are silently ignored by the Compose v2 CLI under version: "3.8". The primary stated goal of the docker-compose change — preventing unbounded memory use — will not be achieved as written. The limits need to move under deploy.resources.limits.

The new shell collector functions (zombie_parents, swap_mb, oom_events) are well-structured, follow existing patterns, and handle failure modes correctly. The swap_mb empty-output edge case on swap-less containers is a minor hardening suggestion.


Automated review by Claude PR Reviewer

cal added 1 commit 2026-04-03 02:05:20 +00:00
fix: address PR review — move memory limits to deploy block, handle swap-less hosts
All checks were successful
Auto-merge docs-only PRs / auto-merge-docs (pull_request) Successful in 2s
e58c5b8cc1
Move mem_limit/memswap_limit to deploy.resources.limits.memory so the
constraint is actually enforced under Compose v3. Add END clause to
swap_mb() so hosts without a Swap line report 0 instead of empty output.
Fix test script header comment accuracy.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
cal merged commit 3e3d2ada31 into main 2026-04-03 02:05:47 +00:00
cal deleted branch chore/30-investigate-manticore-zombies-swap 2026-04-03 02:05:47 +00:00