Closes #25
- check_backup_recency(): queries pvesh vzdump task history; flags VMs
with no backup at all (CRIT) or none in the last 7 days (WARN)
- check_cert_expiry(): probes ports 443/8443 per host via openssl;
flags certs expiring ≤14 days (WARN) or ≤7 days (CRIT)
- io_wait_pct() in COLLECTOR_SCRIPT: samples I/O wait with vmstat 1 2;
flags WARN when above 20%
- OOM kill history was already collected via journalctl; no changes needed
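
As a rough illustration of the I/O wait sampling described above, the
sketch below reads the wa column from the last sample of vmstat 1 2
output. The first sample reports averages since boot, so only the second
reflects current load. The names parse_io_wait, io_wait_status, and
IO_WAIT_WARN are hypothetical; the real io_wait_pct() may be written
differently.

```shell
# Hypothetical helper: read vmstat output on stdin and print the "wa"
# value of the last sample row.
parse_io_wait() {
  awk '
    # Until the header row is found, look for the column labelled "wa".
    !col { for (i = 1; i <= NF; i++) if ($i == "wa") col = i; next }
    # Every later row is a sample; overwrite so the last one wins.
    { wa = $col }
    END { print wa + 0 }
  '
}

IO_WAIT_WARN=20   # assumed variable name; 20% is the threshold from the text

# Hypothetical classifier: WARN when the value exceeds the threshold.
io_wait_status() {
  if [ "$1" -gt "$IO_WAIT_WARN" ]; then echo WARN; else echo OK; fi
}

# Usage on a live host (not run here):
#   io_wait_status "$(vmstat 1 2 | parse_io_wait)"
```

Locating the column by its header name rather than by position avoids
depending on a particular vmstat column layout.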
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Move mem_limit/memswap_limit to deploy.resources.limits.memory so the
constraint is actually enforced under Compose v3. Add END clause to
swap_mb() so hosts without a Swap line report 0 instead of empty output.
Correct the test script's header comment.
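
The END clause fix can be pictured with the sketch below. It assumes
swap_mb() parses free -m style output for a Swap: line, which the text
above implies but does not show; the name swap_mb_from is hypothetical.

```shell
# Hypothetical stand-in for swap_mb(): read free -m style output on stdin
# and print the used swap in MB.
# Without the END block, awk would print nothing at all when no line
# starts with "Swap:"; printing used + 0 in END always yields a number.
swap_mb_from() {
  awk '/^Swap:/ { used = $3 } END { print used + 0 }'
}

# Usage on a live host (not run here):
#   free -m | swap_mb_from
```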
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Extend homelab-audit.sh collector with zombie_parents(), swap_mb(), and
oom_events() functions so the audit identifies which process spawns zombies,
flags high swap usage, and reports recent OOM kills. Add init: true to both
Tdarr docker-compose services so tini reaps orphaned ffmpeg children, and
cap tdarr-node at 28g RAM / 30g total to prevent unbounded memory use.
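
One way to identify which process spawns zombies is to count Z-state
processes per parent PID. The sketch below is an assumption about the
approach, not the actual zombie_parents() body; the function name
count_zombie_parents and its input format are hypothetical.

```shell
# Hypothetical helper: stdin is "STAT PPID" pairs, one process per line,
# for example from: ps -eo stat=,ppid=
# Output is "PPID COUNT", with the parent holding the most zombies first.
count_zombie_parents() {
  awk '$1 ~ /^Z/ { count[$2]++ } END { for (p in count) print p, count[p] }' |
    sort -k2,2nr -k1,1n
}

# Usage on a live host (not run here):
#   ps -eo stat=,ppid= | count_zombie_parents
```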
Closes #30
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The awk program was double-quoted inside the single-quoted
COLLECTOR_SCRIPT, causing $1/$2/$3 to be expanded by the remote
shell as empty positional parameters instead of awk field references.
This made the D-state process filter silently match nothing.
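
A minimal reproduction of this class of bug, using made-up one-line
scripts rather than the real COLLECTOR_SCRIPT. Here the empty expansion
makes awk print the whole line instead of a field; in the actual filter
the same effect caused the pattern to match nothing.

```shell
# Both variables hold a script as a single-quoted string, the way
# COLLECTOR_SCRIPT does. The variable names are illustrative only.

# Broken: awk program in double quotes. The shell that runs the script
# expands $2 (an unset positional parameter) to an empty string before
# awk sees it, so awk receives "{ print  }".
broken='echo "x y" | awk "{ print $2 }"'

# Fixed: awk program in single quotes. The sequence '\'' closes the outer
# string, adds a literal quote, and reopens it, so $2 reaches awk intact.
fixed='echo "x y" | awk '\''{ print $2 }'\'

bash -c "$broken"   # awk prints the whole line
bash -c "$fixed"    # awk prints the second field
```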
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Validate --output-dir has a following argument before accessing $2
(prevents unbound variable crash under set -u)
- Add ZOMBIE_WARN config variable (default: 1) and use it in the zombie
check instead of hardcoding 0
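
A sketch of the two changes above, assuming a conventional while/case
argument loop. The function name parse_args and the OUTPUT_DIR variable
are hypothetical; only --output-dir and ZOMBIE_WARN come from the text.

```shell
set -u

# Configurable threshold with the default named above.
ZOMBIE_WARN="${ZOMBIE_WARN:-1}"

# Hypothetical argument parser: checks the argument count before touching
# $2, so a trailing --output-dir fails with a message instead of an
# unbound variable error.
parse_args() {
  OUTPUT_DIR=""
  while [ "$#" -gt 0 ]; do
    case "$1" in
      --output-dir)
        if [ "$#" -lt 2 ]; then
          echo "error: --output-dir requires an argument" >&2
          return 2
        fi
        OUTPUT_DIR="$2"
        shift 2
        ;;
      *)
        shift
        ;;
    esac
  done
}
```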
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Closes #23
- Fix STUCK_PROC_CPU_WARN not reaching remote collector: COLLECTOR_SCRIPT
heredoc stays single-quoted; threshold is passed as $1 to the remote
bash session so it is evaluated correctly on the collecting host
- Fix LXC IP discovery for static-IP containers: the lookup now falls
back to parsing pct config when lxc-info returns nothing
- Fix SSH failures being silently dropped: stderr is redirected to
$REPORT_DIR/ssh-failures.log; SSH_FAILURE entries are counted and printed
in the summary
- Add explicit comment explaining why -e is omitted from set options
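
The threshold hand-off in the first item can be sketched as follows. The
script body and the run_collector name are illustrative assumptions, and
a local bash stands in for the SSH session so the example runs anywhere.

```shell
STUCK_PROC_CPU_WARN="${STUCK_PROC_CPU_WARN:-10}"   # default here is made up

# The script stays single-quoted, so nothing is expanded locally.
# The receiving shell sees the threshold as its first positional parameter.
COLLECTOR_SCRIPT='
threshold="$1"
echo "cpu_warn=$threshold"
'

# Hypothetical runner: bash -s reads the script from stdin, and everything
# after -- becomes $1, $2, ... inside it. Against a real host this would
# be roughly: ssh "$host" bash -s -- "$STUCK_PROC_CPU_WARN"
run_collector() {
  printf '%s\n' "$COLLECTOR_SCRIPT" | bash -s -- "$STUCK_PROC_CPU_WARN"
}

run_collector
```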
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Removed cognitive-memory MCP, timers, and symlink system references.
Replaced with kb-search MCP and /save-doc skill workflow.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds title, description, type, domain, and tags frontmatter to every
doc for improved KB semantic search. The description field is prepended
to every search chunk, and domain/type/tags enable filtered queries.
Type values: context, guide, runbook, reference, troubleshooting
Domain values match directory structure (networking, docker, etc.)
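
A small check for the five fields could look like the sketch below. It
assumes the frontmatter is a block delimited by --- lines at the top of
each doc, which the text above does not state; the function name
missing_frontmatter_keys is hypothetical.

```shell
# Hypothetical checker: read a doc on stdin and print any of the five
# required keys that are absent from its leading frontmatter block.
missing_frontmatter_keys() {
  awk '
    NR == 1 && $0 == "---" { in_fm = 1; next }   # opening delimiter
    in_fm && $0 == "---"   { exit }              # closing delimiter, jump to END
    in_fm { split($0, kv, ":"); seen[kv[1]] = 1 }
    END {
      n = split("title description type domain tags", req, " ")
      for (i = 1; i <= n; i++)
        if (!(req[i] in seen)) out = out (out ? " " : "") req[i]
      print out
    }
  '
}

# Usage (the path is a placeholder, not run here):
#   missing_frontmatter_keys < some-doc.md
```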
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Groups Claude Discord Coordinator, Claude Runner, and MCP Gateway
under a shared section. Documents new CT 303 MCP Gateway with n8n
and Gitea MCP server configuration details.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Update monitoring CONTEXT.md with 6-server inventory table, per-server
SSH user support, and pre-escalation Discord notification docs
- Remove tdarr local monitoring scripts (decommissioned per prior decision)
- Update Proxmox upgrade plan with Phase 1 completion and Phase 2 prep
- Update vm-management CONTEXT.md with current PVE 8 state
- CLAUDE.md: auto-run /save-memories at 25% context instead of asking
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add server table with all 6 monitored hosts, per-server SSH user
docs, updated workflow server list, and pre-escalation Discord
notification documentation.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Documents the claude-runner SSH alias, HTTPS token auth method,
and notes that SSH git remotes don't work from CT 302.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Document all 20 active monitors with targets and tags, Discord
notification configuration, and API access details for programmatic
management via uptime-kuma-api.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add nvidia_update_checker.py for weekly driver update monitoring with
Discord alerts. Add scripts CONTEXT.md and update README.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Deploy Uptime Kuma for centralized service uptime monitoring at
https://status.manticorum.com. Proxmox LXC 227 (10.10.0.227) running
Ubuntu 22.04 with Docker. Updated monitoring documentation, CLAUDE.md
context loading rules, and server-configs host inventory.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Created jellyfin_gpu_monitor.py for detecting lost GPU access
- Sends Discord alerts when GPU access fails
- Auto-restarts container to restore GPU binding
- Runs every 5 minutes via cron on ubuntu-manticore
- Documents FFmpeg exit code 187 (NVENC failure) in troubleshooting
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add tdarr_file_monitor.py for API-based monitoring
- Add cron wrapper script for scheduled execution
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Created complete gaming detection and priority system
- Added gaming schedule configuration and enforcement
- Implemented Steam library monitoring with auto-detection
- Built comprehensive game process detection for multiple platforms
- Added gaming-aware Tdarr worker management with priority controls
- Created emergency gaming mode for immediate worker shutdown
- Integrated Discord notifications for gaming state changes
- Replaced old bash monitoring with enhanced Python monitoring system
- Added persistent state management and memory tracking
- Implemented configurable gaming time windows and schedules
- Updated .gitignore to exclude logs directories
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>