Commit Graph

14 Commits

Author SHA1 Message Date
Cal Corum
1a3785f01a feat: dynamic summary, --hosts filter, and --json output (#24)
All checks were successful
Auto-merge docs-only PRs / auto-merge-docs (pull_request) Successful in 2s
Closes #24

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 20:08:07 +00:00
Cal Corum
ae5da035f6 feat: add backup recency, cert expiry, OOM, and I/O wait checks (#25)
All checks were successful
Auto-merge docs-only PRs / auto-merge-docs (pull_request) Successful in 2s
Closes #25

- check_backup_recency(): queries pvesh vzdump task history; flags VMs
  with no backup (CRIT) or no backup in 7 days (WARN)
- check_cert_expiry(): probes ports 443/8443 per host via openssl;
  flags certs expiring ≤14 days (WARN) or ≤7 days (CRIT)
- io_wait_pct() in COLLECTOR_SCRIPT: uses vmstat 1 2 to sample I/O
  wait; flagged as WARN when > 20%
- OOM kill history was already collected via journalctl; no changes needed

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-02 21:06:44 -05:00
Cal Corum
e58c5b8cc1 fix: address PR review — move memory limits to deploy block, handle swap-less hosts
All checks were successful
Auto-merge docs-only PRs / auto-merge-docs (pull_request) Successful in 2s
Move mem_limit/memswap_limit to deploy.resources.limits.memory so the
constraint is actually enforced under Compose v3. Add END clause to
swap_mb() so hosts without a Swap line report 0 instead of empty output.
Fix test script header comment accuracy.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 21:05:12 -05:00
Cal Corum
f28dfeb4bf feat: add zombie parent, swap, and OOM metrics to audit; harden Tdarr containers
All checks were successful
Auto-merge docs-only PRs / auto-merge-docs (pull_request) Successful in 3s
Extend homelab-audit.sh collector with zombie_parents(), swap_mb(), and
oom_events() functions so the audit identifies which process spawns zombies,
flags high swap usage, and reports recent OOM kills. Add init: true to both
Tdarr docker-compose services so tini reaps orphaned ffmpeg children, and
cap tdarr-node at 28g RAM / 30g total to prevent unbounded memory use.

Closes #30

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 21:02:05 -05:00
Cal Corum
1ed911e61b fix: single-quote awk program in stuck_procs() collector
All checks were successful
Auto-merge docs-only PRs / auto-merge-docs (pull_request) Successful in 3s
Reindex Knowledge Base / reindex (push) Successful in 3s
The awk program was double-quoted inside the single-quoted
COLLECTOR_SCRIPT, causing $1/$2/$3 to be expanded by the remote
shell as empty positional parameters instead of awk field references.
This made the D-state process filter silently match nothing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 20:48:56 -05:00
Cal Corum
7c801f6c3b fix: guard --output-dir arg and use configurable ZOMBIE_WARN threshold
- Validate --output-dir has a following argument before accessing $2
  (prevents unbound variable crash under set -u)
- Add ZOMBIE_WARN config variable (default: 1) and use it in the zombie
  check instead of hardcoding 0

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 20:48:56 -05:00
Cal Corum
9a39abd64c fix: add homelab-audit.sh with variable interpolation and collector fixes (#23)
Closes #23

- Fix STUCK_PROC_CPU_WARN not reaching remote collector: COLLECTOR_SCRIPT
  heredoc stays single-quoted; threshold is passed as $1 to the remote
  bash session so it is evaluated correctly on the collecting host
- Fix LXC IP discovery for static-IP containers: lxc-info result now falls
  back to parsing pct config when lxc-info returns empty
- Fix SSH failures silently dropped: stderr redirected to
  $REPORT_DIR/ssh-failures.log; SSH_FAILURE entries counted and printed
  in the summary
- Add explicit comment explaining why -e is omitted from set options

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-02 20:48:56 -05:00
Cal Corum
4b7eca8a46 docs: add YAML frontmatter to all 151 markdown files
All checks were successful
Reindex Knowledge Base / reindex (push) Successful in 3s
Adds title, description, type, domain, and tags frontmatter to every
doc for improved KB semantic search. The description field is prepended
to every search chunk, and domain/type/tags enable filtered queries.

Type values: context, guide, runbook, reference, troubleshooting
Domain values match directory structure (networking, docker, etc.)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 09:00:44 -05:00
Cal Corum
3737c7dda5 docs: expand monitoring coverage, update Proxmox upgrade plan, remove decommissioned tdarr scripts
- Update monitoring CONTEXT.md with 6-server inventory table, per-server
  SSH user support, and pre-escalation Discord notification docs
- Remove tdarr local monitoring scripts (decommissioned per prior decision)
- Update Proxmox upgrade plan with Phase 1 completion and Phase 2 prep
- Update vm-management CONTEXT.md with current PVE 8 state
- CLAUDE.md: auto-run /save-memories at 25% context instead of asking

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-20 11:08:48 -06:00
Cal Corum
d0dbe86fba Add NVIDIA update checker and monitoring scripts documentation
Add nvidia_update_checker.py for weekly driver update monitoring with
Discord alerts. Add scripts CONTEXT.md and update README.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-07 22:21:00 -06:00
Cal Corum
3112b3d6fe CLAUDE: Add Jellyfin GPU health monitor with auto-restart
- Created jellyfin_gpu_monitor.py for detecting lost GPU access
- Sends Discord alerts when GPU access fails
- Auto-restarts container to restore GPU binding
- Runs every 5 minutes via cron on ubuntu-manticore
- Documents FFmpeg exit code 187 (NVENC failure) in troubleshooting

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-28 22:57:04 -06:00
Cal Corum
0ecac96703 CLAUDE: Add Tdarr file monitoring scripts
- Add tdarr_file_monitor.py for API-based monitoring
- Add cron wrapper script for scheduled execution

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-07 00:48:10 -06:00
Cal Corum
edc78c2dd6 CLAUDE: Add comprehensive gaming-aware Tdarr management system
- Created complete gaming detection and priority system
- Added gaming schedule configuration and enforcement
- Implemented Steam library monitoring with auto-detection
- Built comprehensive game process detection for multiple platforms
- Added gaming-aware Tdarr worker management with priority controls
- Created emergency gaming mode for immediate worker shutdown
- Integrated Discord notifications for gaming state changes
- Replaced old bash monitoring with enhanced Python monitoring system
- Added persistent state management and memory tracking
- Implemented configurable gaming time windows and schedules
- Updated .gitignore to exclude logs directories

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-08-14 15:17:52 -05:00
Cal Corum
10c9e0d854 CLAUDE: Migrate to technology-first documentation architecture
Complete restructure from patterns/examples/reference to technology-focused directories:

• Created technology-specific directories with comprehensive documentation:
  - /tdarr/ - Transcoding automation with gaming-aware scheduling
  - /docker/ - Container management with GPU acceleration patterns
  - /vm-management/ - Virtual machine automation and cloud-init
  - /networking/ - SSH infrastructure, reverse proxy, and security
  - /monitoring/ - System health checks and Discord notifications
  - /databases/ - Database patterns and troubleshooting
  - /development/ - Programming language patterns (bash, nodejs, python, vuejs)

• Enhanced CLAUDE.md with intelligent context loading:
  - Technology-first loading rules for automatic context provision
  - Troubleshooting keyword triggers for emergency scenarios
  - Documentation maintenance protocols with automated reminders
  - Context window management for optimal documentation updates

• Preserved valuable content from .claude/tmp/:
  - SSH security improvements and server inventory
  - Tdarr CIFS troubleshooting and Docker iptables solutions
  - Operational scripts with proper technology classification

• Benefits achieved:
  - Self-contained technology directories with complete context
  - Automatic loading of relevant documentation based on keywords
  - Emergency-ready troubleshooting with comprehensive guides
  - Scalable structure for future technology additions
  - Eliminated context bloat through targeted loading

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-08-12 23:20:15 -05:00