166 Commits

Author SHA1 Message Date
Cal Corum
193ae68f96 docs: document per-core load threshold policy for server health monitoring (#22)
All checks were successful
Auto-merge docs-only PRs / auto-merge-docs (pull_request) Successful in 5s
Closes #22

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 13:35:23 -05:00
Cal Corum
7c9c96eb52 docs: sync KB — troubleshooting.md
All checks were successful
Reindex Knowledge Base / reindex (push) Successful in 3s
2026-04-03 12:00:22 -05:00
cal
a8c85a8d91 Merge pull request 'chore: decommission VM 105 (docker-vpn) — repo cleanup' (#40) from chore/20-decommission-vm-105-docker-vpn into main
Some checks failed
Reindex Knowledge Base / reindex (push) Failing after 17s
2026-04-03 12:56:43 +00:00
Cal Corum
9e8346a8ab chore: decommission VM 105 (docker-vpn) — repo cleanup (#20)
All checks were successful
Auto-merge docs-only PRs / auto-merge-docs (pull_request) Successful in 2s
VM 105 was already destroyed on Proxmox. This removes stale references:
- Delete server-configs/proxmox/qemu/105.conf
- Comment out docker-vpn entries in example SSH config and server inventory
- Move VM 105 from Stopped/Investigate to Removed in upgrade plan
- Check off decommission task in wave2 migration results

Closes #20

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 23:57:55 -05:00
Cal Corum
4234351cfa feat: add Ansible playbook to mask avahi-daemon on all Ubuntu VMs (#28)
All checks were successful
Auto-merge docs-only PRs / auto-merge-docs (pull_request) Successful in 2s
Closes #28

Adds mask-avahi.yml targeting the vms:physical inventory groups (all
Ubuntu QEMU VMs + ubuntu-manticore). Also adds avahi masking to the
cloud-init template so future VMs are hardened from first boot.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-02 23:32:47 -05:00
Cal Corum
a97f443f60 docs: sync KB — vm-decommission-runbook.md
All checks were successful
Reindex Knowledge Base / reindex (push) Successful in 3s
2026-04-02 22:00:04 -05:00
cal
1db2c2b168 Merge pull request 'feat: add backup recency, cert expiry, and I/O wait checks (#25)' (#36) from issue/25-homelab-audit-sh-add-backup-recency-and-certificat into main
2026-04-03 02:15:41 +00:00
Cal Corum
ae5da035f6 feat: add backup recency, cert expiry, OOM, and I/O wait checks (#25)
All checks were successful
Auto-merge docs-only PRs / auto-merge-docs (pull_request) Successful in 2s
Closes #25

- check_backup_recency(): queries pvesh vzdump task history; flags VMs
  with no backup (CRIT) or no backup in 7 days (WARN)
- check_cert_expiry(): probes ports 443/8443 per host via openssl;
  flags certs expiring within 14 days (WARN) or within 7 days (CRIT)
- io_wait_pct() in COLLECTOR_SCRIPT: uses vmstat 1 2 to sample I/O
  wait; flagged as WARN when > 20%
- OOM kill history was already collected via journalctl; no changes needed
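
The certificate thresholds above could be applied roughly as follows. This is a minimal sketch, not the code from homelab-audit.sh: the function name classify_cert_days and the two variable names are hypothetical, and only the 7-day and 14-day limits come from the commit message.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map days-until-expiry to a severity label.
# Only the thresholds (<=7 CRIT, <=14 WARN) are taken from the commit message.
CERT_CRIT_DAYS=7
CERT_WARN_DAYS=14

classify_cert_days() {
    # $1: whole days until the certificate expires (may be negative if expired)
    if [ "$1" -le "$CERT_CRIT_DAYS" ]; then
        echo "CRIT"
    elif [ "$1" -le "$CERT_WARN_DAYS" ]; then
        echo "WARN"
    else
        echo "OK"
    fi
}

classify_cert_days 5     # prints CRIT
classify_cert_days 10    # prints WARN
classify_cert_days 30    # prints OK
```

The CRIT comparison has to come first, because any value that is at most 7 is also at most 14 and would otherwise be reported as WARN.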

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-02 21:06:44 -05:00
cal
3e3d2ada31 Merge pull request 'feat: zombie parent, swap, and OOM metrics + Tdarr hardening' (#35) from chore/30-investigate-manticore-zombies-swap into main
2026-04-03 02:05:46 +00:00
Cal Corum
e58c5b8cc1 fix: address PR review — move memory limits to deploy block, handle swap-less hosts
All checks were successful
Auto-merge docs-only PRs / auto-merge-docs (pull_request) Successful in 2s
Move mem_limit/memswap_limit to deploy.resources.limits.memory so the
constraint is actually enforced under Compose v3. Add END clause to
swap_mb() so hosts without a Swap line report 0 instead of empty output.
Fix test script header comment accuracy.
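
The END-clause fix for swap-less hosts can be illustrated with a small awk filter. This is a sketch under assumptions: it reads `free -m`-style text on stdin and takes the "used" column, which may not match what the real swap_mb() parses.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for swap_mb(): expects `free -m`-style lines on stdin.
# Assumption: the third field of the "Swap:" line is the used amount in MB.
swap_mb() {
    awk '/^Swap:/ { print $3; found = 1 }
         END      { if (!found) print 0 }'
}

# Host with swap configured: prints 512
printf 'Mem: 32000 9000 23000\nSwap: 2048 512 1536\n' | swap_mb

# Host with no Swap line at all: prints 0 instead of nothing
printf 'Mem: 32000 9000 23000\n' | swap_mb
```

Without the END block the second call would print nothing, and a numeric comparison on the empty result would fail downstream.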

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 21:05:12 -05:00
Cal Corum
f28dfeb4bf feat: add zombie parent, swap, and OOM metrics to audit; harden Tdarr containers
All checks were successful
Auto-merge docs-only PRs / auto-merge-docs (pull_request) Successful in 3s
Extend homelab-audit.sh collector with zombie_parents(), swap_mb(), and
oom_events() functions so the audit identifies which process spawns zombies,
flags high swap usage, and reports recent OOM kills. Add init: true to both
Tdarr docker-compose services so tini reaps orphaned ffmpeg children, and
cap tdarr-node at 28g RAM / 30g total to prevent unbounded memory use.

Closes #30

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 21:02:05 -05:00
Cal Corum
1ed911e61b fix: single-quote awk program in stuck_procs() collector
All checks were successful
Auto-merge docs-only PRs / auto-merge-docs (pull_request) Successful in 3s
Reindex Knowledge Base / reindex (push) Successful in 3s
The awk program was double-quoted inside the single-quoted
COLLECTOR_SCRIPT, causing $1/$2/$3 to be expanded by the remote
shell as empty positional parameters instead of awk field references.
This made the D-state process filter silently match nothing.
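
The quoting problem can be reproduced locally. In this sketch a quoted heredoc stands in for the single-quoted COLLECTOR_SCRIPT and `bash -c` stands in for the remote shell; the sample input and variable names are made up for illustration.

```shell
#!/usr/bin/env bash
# Both scripts are stored literally (quoted heredoc), like a single-quoted string.
broken_script=$(cat <<'EOF'
printf 'a b c\n' | awk "{ print $2 }"
EOF
)

fixed_script=$(cat <<'EOF'
printf 'a b c\n' | awk '{ print $2 }'
EOF
)

# The executing shell expands $2 inside double quotes before awk runs.
# With no positional parameters it is empty, so awk receives "{ print  }".
broken_out=$(bash -c "$broken_script")

# Single quotes keep $2 intact, so awk sees a real field reference.
fixed_out=$(bash -c "$fixed_script")

echo "broken: $broken_out"   # prints: broken: a b c
echo "fixed:  $fixed_out"    # prints: fixed:  b
```

In this toy case the broken form prints the whole line. In a filter that compares fields against a threshold, the same expansion leaves empty operands, which is consistent with the silent no-match described above.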

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 20:48:56 -05:00
Cal Corum
7c801f6c3b fix: guard --output-dir arg and use configurable ZOMBIE_WARN threshold
- Validate --output-dir has a following argument before accessing $2
  (prevents unbound variable crash under set -u)
- Add ZOMBIE_WARN config variable (default: 1) and use it in the zombie
  check instead of hardcoding 0
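
A guard of the kind described for --output-dir might look like the sketch below. The function name parse_args, the OUTPUT_DIR variable, and the error messages are assumptions for illustration, not the script's actual code.

```shell
#!/usr/bin/env bash
# Hypothetical argument parser. Under `set -u`, reading $2 when only one
# argument remains aborts the script, so the count is checked first.
parse_args() {
    OUTPUT_DIR=""
    while [ "$#" -gt 0 ]; do
        case "$1" in
            --output-dir)
                if [ "$#" -lt 2 ]; then
                    echo "error: --output-dir requires an argument" >&2
                    return 1
                fi
                OUTPUT_DIR=$2
                shift 2
                ;;
            *)
                echo "error: unknown option: $1" >&2
                return 1
                ;;
        esac
    done
}

parse_args --output-dir /tmp/audit && echo "output dir: $OUTPUT_DIR"
parse_args --output-dir 2>/dev/null || echo "missing value rejected"
```

Checking `$#` before touching `$2` turns an unbound-variable crash into an explicit usage error.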

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 20:48:56 -05:00
Cal Corum
9a39abd64c fix: add homelab-audit.sh with variable interpolation and collector fixes (#23)
Closes #23

- Fix STUCK_PROC_CPU_WARN not reaching remote collector: COLLECTOR_SCRIPT
  heredoc stays single-quoted; threshold is passed as $1 to the remote
  bash session so it is evaluated correctly on the collecting host
- Fix LXC IP discovery for static-IP containers: lxc-info result now falls
  back to parsing pct config when lxc-info returns empty
- Fix SSH failures silently dropped: stderr redirected to
  $REPORT_DIR/ssh-failures.log; SSH_FAILURE entries counted and printed
  in the summary
- Add explicit comment explaining why -e is omitted from set options
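
The pattern of keeping the collector literal and passing the threshold as $1 can be sketched locally. Here a local `bash -s` stands in for the SSH session, and the script body, the run_collector helper, and the value 10 are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Value chosen for the example only.
STUCK_PROC_CPU_WARN=10

# Quoted heredoc: nothing is expanded on the controlling host.
COLLECTOR_SCRIPT=$(cat <<'EOF'
threshold=$1
echo "cpu_warn=$threshold"
EOF
)

# Local stand-in for the remote session. Over SSH the same idea would
# typically be: ssh "$host" bash -s -- "$STUCK_PROC_CPU_WARN", script on stdin.
run_collector() {
    printf '%s\n' "$COLLECTOR_SCRIPT" | bash -s -- "$1"
}

run_collector "$STUCK_PROC_CPU_WARN"   # prints cpu_warn=10
```

Passing the value as a positional parameter avoids splicing it into the script text, so the script can stay fully quoted and the awk and shell variables inside it are left alone.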

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-02 20:48:56 -05:00
Cal Corum
def437f0cb docs: sync KB — troubleshooting.md
2026-04-02 20:48:39 -05:00
Cal Corum
2e86864e94 docs: sync KB — ace-step-local-network.md
2026-04-02 20:48:06 -05:00
Cal Corum
016683cc35 docs: sync KB — release-2026.4.02.md
2026-04-02 20:48:06 -05:00
Cal Corum
51389c612a docs: sync KB — database-release-2026.4.1.md
2026-04-02 20:48:06 -05:00
Cal Corum
98c69617ff docs: sync KB — troubleshooting-gunicorn-worker-timeouts.md
2026-04-02 20:48:06 -05:00
Cal Corum
50125d8b39 docs: sync KB — release-2026.3.31.md,release-2026.4.01.md
2026-04-02 20:48:06 -05:00
Cal Corum
7bdaa0e002 docs: sync KB — troubleshooting.md
2026-04-02 20:48:06 -05:00
Cal Corum
2cb1ced842 docs: sync KB — troubleshooting.md
2026-04-02 20:48:06 -05:00
Cal Corum
ad6adf7a4c docs: sync KB — release-2026.3.31-2.md
2026-04-02 20:48:06 -05:00
Cal Corum
acb1a35170 docs: sync KB — release-2026.3.31.md
2026-04-02 20:48:06 -05:00
Cal Corum
1d85ed26b9 docs: sync KB — release-2026.3.31.md
2026-04-02 20:48:06 -05:00
Cal Corum
1e7f99269e docs: sync KB — 2026-03-30.md
2026-04-02 20:48:06 -05:00
Cal Corum
f5eab93f7b docs: sync KB — subagent-write-permission-blocked.md,release-2026.3.28.md
2026-04-02 20:48:06 -05:00
Cal Corum
bf4b7dc8b7 docs: sync KB — codex-agents-marketplace.md
2026-04-02 20:48:06 -05:00
Cal Corum
3ac33d0046 docs: sync KB — open-packs-checkin-crash.md
2026-04-02 20:48:06 -05:00
cal
d730ea28bc Merge pull request 'docs: Roku WiFi buffering fix in troubleshooting' (#17) from docs/roku-wifi-buffering-fix into main
All checks were successful
Reindex Knowledge Base / reindex (push) Successful in 3s
2026-03-26 11:55:30 +00:00
Cal Corum
b6e00ca33a docs: sync KB — Roku WiFi buffering fix in troubleshooting.md
All checks were successful
Auto-merge docs-only PRs / auto-merge-docs (pull_request) Successful in 13s
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 06:55:04 -05:00
cal
43e72fc1b6 Merge pull request 'docs: add Ansible controller LXC setup guide' (#16) from docs/ansible-controller-setup into main
All checks were successful
Reindex Knowledge Base / reindex (push) Successful in 3s
2026-03-26 03:27:29 +00:00
Cal Corum
93d6093d45 docs: add Ansible controller LXC setup guide and update VM context
All checks were successful
Auto-merge docs-only PRs / auto-merge-docs (pull_request) Successful in 6s
New KB doc covering LXC 304 (ansible-controller) at 10.10.0.232 with
full inventory, update playbooks, snapshot rollback, and systemd timer.
Updated CONTEXT.md to reference the new controller.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-25 22:26:55 -05:00
Cal Corum
b7ed0f8435 docs: sync KB — claude-code-multi-account.md
All checks were successful
Reindex Knowledge Base / reindex (push) Successful in 4s
2026-03-25 16:00:43 -05:00
Cal Corum
4ecf93a3e2 docs: sync KB — kb-rag-mcp-oauth-fix.md
All checks were successful
Reindex Knowledge Base / reindex (push) Successful in 3s
2026-03-25 10:00:07 -05:00
Cal Corum
646991e1a9 docs: sync KB — troubleshooting.md
All checks were successful
Reindex Knowledge Base / reindex (push) Successful in 3s
2026-03-25 00:00:43 -05:00
Cal Corum
08a9dcd6eb docs: sync KB — troubleshooting.md
All checks were successful
Reindex Knowledge Base / reindex (push) Successful in 7s
2026-03-24 22:00:43 -05:00
Cal Corum
cedb056bce docs: sync KB — database-deployment-guide.md
All checks were successful
Reindex Knowledge Base / reindex (push) Successful in 5s
2026-03-24 08:00:43 -05:00
Cal Corum
18e69b3c43 docs: sync KB — 2026-03-24.md
All checks were successful
Reindex Knowledge Base / reindex (push) Successful in 3s
2026-03-24 02:00:43 -05:00
Cal Corum
36aa78e591 docs: sync KB — docker-buildx-cache-400-error.md
All checks were successful
Reindex Knowledge Base / reindex (push) Successful in 4s
2026-03-23 22:00:43 -05:00
Cal Corum
7bea39b39b docs: sync KB — docker-buildx-cache-400-error.md
All checks were successful
Reindex Knowledge Base / reindex (push) Successful in 4s
2026-03-23 18:00:43 -05:00
Cal Corum
cc7617cbaa docs: sync KB — 2026-03-23.md
All checks were successful
Reindex Knowledge Base / reindex (push) Successful in 5s
2026-03-23 14:00:43 -05:00
Cal Corum
cfde1c4d0c docs: sync KB — pd-plan-release-1.0.0.md
All checks were successful
Reindex Knowledge Base / reindex (push) Successful in 4s
2026-03-23 02:00:43 -05:00
Cal Corum
b2117416f6 docs: sync KB — 2026-03-22-ecosystem-organization.md
All checks were successful
Reindex Knowledge Base / reindex (push) Successful in 3s
2026-03-23 00:00:43 -05:00
Cal Corum
15851c7417 docs: sync KB — release-2026.3.21-cookbook.md
All checks were successful
Reindex Knowledge Base / reindex (push) Successful in 3s
2026-03-21 18:00:08 -05:00
Cal Corum
cd57645dd0 docs: sync KB — 2026-03-20.md,release-2026.3.20.md
All checks were successful
Reindex Knowledge Base / reindex (push) Successful in 3s
2026-03-20 22:00:43 -05:00
Cal Corum
b1fed02219 docs: sync KB — tag-triggered-release-deploy.md,release-2026.3.20.md,claude-code-config.md
All checks were successful
Reindex Knowledge Base / reindex (push) Successful in 3s
2026-03-20 14:00:43 -05:00
Cal Corum
730f100619 docs: sync KB — claude-plugins-marketplace.md
All checks were successful
Reindex Knowledge Base / reindex (push) Successful in 4s
2026-03-19 16:00:43 -05:00
Cal Corum
15c1c97d9d docs: sync KB — evolution-phase1-implementation.md
All checks were successful
Reindex Knowledge Base / reindex (push) Successful in 2s
2026-03-19 10:00:43 -05:00
cal
dcee66978b Merge pull request 'ci: remove approval step from auto-merge workflow' (#15) from ci/cleanup-auto-merge into main
Reviewed-on: #15
2026-03-19 04:43:38 +00:00