feat: zombie parent, swap, and OOM metrics + Tdarr hardening #35
Reference: cal/claude-home#35
Summary
- Add `zombie_parents()`, `swap_mb()`, and `oom_events()` collector functions to `homelab-audit.sh` with corresponding parse cases. The audit now identifies which process spawns zombies, flags swap usage > 512 MB, and reports OOM kills from the last 7 days.
- Add `init: true` to both services (tini reaps orphaned ffmpeg children), and `mem_limit: 28g` / `memswap_limit: 30g` to tdarr-node to prevent unbounded memory use.
- `test-audit-collectors.sh` validates the output format of all 8 collector functions.

Test plan
- `bash -n homelab-audit.sh`: syntax check passes
- `docker compose config --quiet`: compose validation passes
- `test-audit-collectors.sh`: all 8 collector format tests pass

Closes #30
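The test plan validates the compose file but does not confirm that a memory limit is actually applied to the running container. A minimal sketch of such a check follows. The helper name `mem_limit_matches` is hypothetical and not part of this PR; the sketch assumes the container is named `tdarr-node` and that `28g` means 28 GiB, and it compares the byte count Docker reports in `HostConfig.Memory` against that value.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the PR): succeed when the byte count
# reported by Docker equals the expected limit, given in GiB.
mem_limit_matches() {
    local actual_bytes="$1" expected_gib="$2"
    local expected_bytes=$(( expected_gib * 1024 * 1024 * 1024 ))
    # Anything that is not a plain number is treated as an error (status 2).
    [[ "$actual_bytes" =~ ^[0-9]+$ ]] || return 2
    (( actual_bytes == expected_bytes ))
}

# Example use after bring-up (assumes a container named tdarr-node):
#   bytes=$(docker inspect --format '{{.HostConfig.Memory}}' tdarr-node)
#   if mem_limit_matches "$bytes" 28; then
#       echo "limit applied"
#   else
#       echo "limit NOT applied (HostConfig.Memory=$bytes)"
#   fi
```

A reported value of `0` means Docker has no memory limit set for the container, so the check fails in that case.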
🤖 Generated with Claude Code
AI Code Review
Files Reviewed
- `monitoring/scripts/homelab-audit.sh` (modified)
- `monitoring/scripts/test-audit-collectors.sh` (added)
- `server-configs/ubuntu-manticore/docker-compose/tdarr/docker-compose.yml` (modified)

Findings
Correctness
[BLOCKER]
`mem_limit` / `memswap_limit` are Compose v2 syntax and will be silently ignored

`docker-compose.yml`, lines 29–30.

This file uses `version: "3.8"`, which maps to the Compose Spec. In the Compose Spec (v3+), the top-level service keys `mem_limit` and `memswap_limit` are v2 API, and `docker compose` (the Compose v2 CLI) silently ignores them rather than erroring. The memory limits will not be enforced. You can verify this with `docker inspect tdarr-node | jq '.[].HostConfig.Memory'` after bring-up: it will read `0`.

The correct v3 syntax places limits under `deploy.resources.limits`, alongside the existing GPU reservation.

Note on `memswap_limit`: Docker Swarm's `deploy` block does not support `memswap_limit` for standalone containers in the Compose Spec. For a standalone host, the most reliable approach is to set `memory` under `deploy.resources.limits` and accept that swap is not separately capped (or disable swap at the host level if that is a concern).

`swap_mb()` can return an empty string inside LXCs with no swap

`homelab-audit.sh`, line 93.

On some minimal unprivileged LXC containers, the Swap line may be absent from `free` output entirely, producing an empty string. The `parse_and_report` caller guards with `[[ -n "$swap" ]]` before the arithmetic comparison, so it will not crash, but the metric is silently missing from the report. A safer implementation would guarantee a non-empty value.

Security
No issues found. No user-controlled input reaches shell expansion without quoting.
`journalctl` output is only consumed as a count. The `xargs -I{}` pattern with `ps -p` is safe since PPIDs are always numeric.

Style & Conventions
Test script header comment is inaccurate
`test-audit-collectors.sh` lines 4–5 say the script "Extracts each collector function from the audit script's COLLECTOR_SCRIPT heredoc and runs it locally." The script actually re-implements the collector logic inline and does not extract anything from the heredoc. This is a hidden maintenance hazard: if the collector functions change in `homelab-audit.sh`, these tests will continue passing while testing the old behavior. The comment should be corrected to accurately describe what the script does.

`oom_events` test block: subtle interaction with `set -euo pipefail`

`test-audit-collectors.sh`, lines 70–71.

When `grep -ci` matches zero lines it exits 1, making the command substitution exit 1. The `|| true` prevents the abort. Note that `grep -c` still prints `0` on no match and bash assigns the captured output regardless of exit status, so `result` should normally hold `"0"` rather than an empty string; line 71's `${result:-0}` is a fallback for the case where the pipeline prints nothing at all. Functionally correct but subtle, so a brief inline comment would help future readers.

Edge Cases
`zombie_parents()` empty-input behavior confirmed safe

When there are no zombie processes, `xargs -I{}` receives empty stdin and runs zero iterations. `paste -sd,` on empty input produces an empty string. The `[[ -n "$zparents" ]]` guard in `parse_and_report` handles this correctly.

Memory limit sizing
28g on a 32 GB server leaves roughly 4 GB for the OS, tdarr-server, and other containers. During active multi-worker transcoding jobs this is tight but intentional. The 2 GB of implied swap headroom (`memswap_limit: 30g` minus `mem_limit: 28g`) provides a small buffer. Reasonable for the stated goal; noted here for awareness only.

Verdict: REQUEST_CHANGES (advisory, one blocker)
One blocker: `mem_limit` / `memswap_limit` as top-level service keys are silently ignored by the Compose v2 CLI under `version: "3.8"`. The primary stated goal of the docker-compose change, preventing unbounded memory use, will not be achieved as written. The limits need to move under `deploy.resources.limits`.

The new shell collector functions (`zombie_parents`, `swap_mb`, `oom_events`) are well-structured, follow existing patterns, and handle failure modes correctly. The `swap_mb` empty-output edge case on swap-less containers is a minor hardening suggestion.

Automated review by Claude PR Reviewer
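For the `swap_mb` hardening suggestion, here is a minimal sketch of a collector that always prints a number. It assumes the collector reports used swap from `free -m` (the third column of the `Swap:` line) and falls back to `0` when that line is missing; the actual function in `homelab-audit.sh` may be written differently.

```shell
#!/usr/bin/env bash
# Sketch only: print used swap in MB, or 0 when `free` shows no Swap line
# (as can happen in minimal unprivileged LXC containers).
swap_mb() {
    free -m 2>/dev/null | awk '
        /^Swap:/ { used = $3; found = 1 }      # remember the "used" column
        END      { print (found ? used : 0) }  # always emit a number
    '
}
```

Because the function never prints an empty string, the caller can compare the value against the 512 MB threshold directly, and a swap-less host shows up in the report as `0` instead of being omitted.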