Commit Graph

24 Commits

Author SHA1 Message Date
Cal Corum
1a3785f01a feat: dynamic summary, --hosts filter, and --json output (#24)
Closes #24

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 20:08:07 +00:00
Cal Corum
193ae68f96 docs: document per-core load threshold policy for server health monitoring (#22)
Closes #22

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 13:35:23 -05:00
Cal Corum
ae5da035f6 feat: add backup recency, cert expiry, OOM, and I/O wait checks (#25)
Closes #25

- check_backup_recency(): queries pvesh vzdump task history; flags VMs
  with no backup (CRIT) or no backup in 7 days (WARN)
- check_cert_expiry(): probes ports 443/8443 per host via openssl;
  flags certs expiring ≤14 days (WARN) or ≤7 days (CRIT)
- io_wait_pct() in COLLECTOR_SCRIPT: uses vmstat 1 2 to sample I/O
  wait; flagged as WARN when > 20%
- OOM kill history was already collected via journalctl; no changes needed

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-02 21:06:44 -05:00
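
The thresholds listed in ae5da035f6 can be sketched as two small classifiers. This is a minimal illustration, not the repository's code: the function and variable names are hypothetical, and only the numeric limits (7 and 14 days, 20% I/O wait) come from the commit message. The openssl probe and the vmstat sampling are not reproduced here.

```shell
# Limits taken from the commit message; the variable names are assumptions.
CERT_CRIT_DAYS=7
CERT_WARN_DAYS=14
IO_WAIT_WARN_PCT=20

# Classify a certificate by whole days remaining until expiry.
# (Hypothetical helper; the real check_cert_expiry() is not shown in the log.)
classify_cert_days() {
  if [ "$1" -le "$CERT_CRIT_DAYS" ]; then
    echo "CRIT"
  elif [ "$1" -le "$CERT_WARN_DAYS" ]; then
    echo "WARN"
  else
    echo "OK"
  fi
}

# Classify an integer I/O wait percentage; only values above the limit warn.
classify_io_wait() {
  if [ "$1" -gt "$IO_WAIT_WARN_PCT" ]; then
    echo "WARN"
  else
    echo "OK"
  fi
}

classify_cert_days 5    # prints CRIT
classify_cert_days 10   # prints WARN
classify_cert_days 30   # prints OK
classify_io_wait 21     # prints WARN
```

Testing the CRIT limit first matters: a certificate with 5 days left also satisfies the 14-day condition, so the stricter branch has to win.
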
Cal Corum
e58c5b8cc1 fix: address PR review — move memory limits to deploy block, handle swap-less hosts
Move mem_limit/memswap_limit to deploy.resources.limits.memory so the
constraint is actually enforced under Compose v3. Add END clause to
swap_mb() so hosts without a Swap line report 0 instead of empty output.
Correct the header comment in the test script.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 21:05:12 -05:00
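
The END clause described in e58c5b8cc1 might look roughly like the sketch below. It assumes the collector parses `free -m`-style output, which the log does not confirm; `swap_mb_from` is a hypothetical stand-in that reads that text on stdin so it can be tried without a real host.

```shell
# Hypothetical stand-in for swap_mb(): prints used swap in MB from
# `free -m`-style text on stdin, or 0 when no "Swap:" line is present.
swap_mb_from() {
  awk '/^Swap:/ { print $3; found = 1 }
       END      { if (!found) print 0 }'
}

# Host with swap: the third field of the Swap line is the used amount.
printf 'Mem: 7976 2100 5876\nSwap: 2047 512 1535\n' | swap_mb_from   # prints 512

# Swap-less host: without the END clause this would print nothing.
printf 'Mem: 7976 2100 5876\n' | swap_mb_from                        # prints 0
```

Emitting an explicit 0 keeps downstream numeric comparisons from receiving an empty string.
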
Cal Corum
f28dfeb4bf feat: add zombie parent, swap, and OOM metrics to audit; harden Tdarr containers
Extend homelab-audit.sh collector with zombie_parents(), swap_mb(), and
oom_events() functions so the audit identifies which process spawns zombies,
flags high swap usage, and reports recent OOM kills. Add init: true to both
Tdarr docker-compose services so tini reaps orphaned ffmpeg children, and
cap tdarr-node at 28g RAM / 30g total to prevent unbounded memory use.

Closes #30

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 21:02:05 -05:00
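
One way a collector could identify which process spawns zombies, as f28dfeb4bf describes, is to count zombie children per parent PID. The sketch below is an assumption about the approach, not the actual zombie_parents() body; `zombie_parents_from` is a hypothetical name, and it reads `ps -eo stat=,ppid=`-style lines on stdin so it can be exercised with sample data.

```shell
# Hypothetical helper: input lines are "<STAT> <PPID>", as produced by
# `ps -eo stat=,ppid=` on Linux. Output is "<PPID> <zombie count>",
# sorted by parent PID.
zombie_parents_from() {
  awk '$1 ~ /^Z/ { count[$2]++ }
       END       { for (p in count) print p, count[p] }' | sort -n
}

printf 'Z 412\nS 1\nZ+ 412\nZ 977\n' | zombie_parents_from
# prints:
# 412 2
# 977 1
```

On a live host the same function could be fed with `ps -eo stat=,ppid= | zombie_parents_from`.
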
Cal Corum
1ed911e61b fix: single-quote awk program in stuck_procs() collector
The awk program was double-quoted inside the single-quoted
COLLECTOR_SCRIPT, causing $1/$2/$3 to be expanded by the remote
shell as empty positional parameters instead of awk field references.
This made the D-state process filter silently match nothing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 20:48:56 -05:00
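
The quoting mistake fixed in 1ed911e61b can be reproduced in miniature. In the sketch below, `sh -c` stands in for the remote shell and the one-line scripts are hypothetical, far simpler than the real stuck_procs() collector. Here the failure shows up as wrong output rather than an empty match, but the mechanism is the same: the shell consumes `$2` before awk ever sees it.

```shell
# Both scripts are kept literal by a quoted heredoc, as a collector script
# would be. Only the quoting around the awk program differs.
broken=$(cat <<'EOF'
printf 'D 101 ffmpeg\n' | awk "{ print $2 }"
EOF
)
fixed=$(cat <<'EOF'
printf 'D 101 ffmpeg\n' | awk '{ print $2 }'
EOF
)

# sh -c plays the remote shell; it has no positional parameters set.
broken_out=$(sh -c "$broken")
fixed_out=$(sh -c "$fixed")

echo "broken: $broken_out"   # broken: D 101 ffmpeg  ($2 expanded to nothing)
echo "fixed:  $fixed_out"    # fixed:  101
```

In the broken variant awk receives the program `{ print  }`, which prints the whole line instead of the second field.
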
Cal Corum
7c801f6c3b fix: guard --output-dir arg and use configurable ZOMBIE_WARN threshold
- Validate --output-dir has a following argument before accessing $2
  (prevents unbound variable crash under set -u)
- Add ZOMBIE_WARN config variable (default: 1) and use it in the zombie
  check instead of hardcoding 0

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 20:48:56 -05:00
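
The two fixes in 7c801f6c3b can be sketched as follows. This is an illustration under stated assumptions: `parse_args` and `zombie_status` are hypothetical names, and the `>=` comparison against ZOMBIE_WARN is a guess, since the log only says the variable replaces a hardcoded 0.

```shell
set -u

# Default from the commit message; an existing environment value wins.
ZOMBIE_WARN=${ZOMBIE_WARN:-1}

# Hypothetical argument parser showing the guard.
parse_args() {
  OUTPUT_DIR=""
  while [ $# -gt 0 ]; do
    case "$1" in
      --output-dir)
        # Check the argument count before reading $2; under set -u an
        # unset $2 would abort the script with "unbound variable".
        if [ $# -lt 2 ]; then
          echo "error: --output-dir requires a value" >&2
          return 1
        fi
        OUTPUT_DIR=$2
        shift 2
        ;;
      *)
        shift
        ;;
    esac
  done
}

# Assumption: warn once the zombie count reaches the threshold.
zombie_status() {
  if [ "$1" -ge "$ZOMBIE_WARN" ]; then echo "WARN"; else echo "OK"; fi
}

parse_args --output-dir /tmp/audit && echo "output dir: $OUTPUT_DIR"
parse_args --output-dir || echo "rejected: missing value"
zombie_status 0   # prints OK with the default threshold
```
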
Cal Corum
9a39abd64c fix: add homelab-audit.sh with variable interpolation and collector fixes (#23)
Closes #23

- Fix STUCK_PROC_CPU_WARN not reaching remote collector: COLLECTOR_SCRIPT
  heredoc stays single-quoted; threshold is passed as $1 to the remote
  bash session so it is evaluated correctly on the collecting host
- Fix LXC IP discovery for static-IP containers: discovery now falls
  back to parsing pct config when lxc-info returns nothing
- Fix silently dropped SSH failures: stderr is now redirected to
  $REPORT_DIR/ssh-failures.log, and SSH_FAILURE entries are counted and
  printed in the summary
- Add explicit comment explaining why -e is omitted from set options

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-02 20:48:56 -05:00
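
The first fix in 9a39abd64c, keeping the heredoc quoted and handing the threshold over as `$1`, can be sketched locally. The script body and the value 10 are illustrative assumptions; a local `bash -s` stands in for the SSH session.

```shell
STUCK_PROC_CPU_WARN=10   # illustrative value; the real default is not in the log

# The quoted delimiter keeps $1 literal here, so nothing is expanded on
# the machine that builds the script.
COLLECTOR_SCRIPT=$(cat <<'EOF'
threshold=$1
echo "threshold on collecting host: $threshold"
EOF
)

# Over SSH this might look like the following (an assumption):
#   printf '%s\n' "$COLLECTOR_SCRIPT" | ssh "$host" bash -s -- "$STUCK_PROC_CPU_WARN"
# Locally, bash -s reads the script from stdin and receives the value as $1.
out=$(printf '%s\n' "$COLLECTOR_SCRIPT" | bash -s -- "$STUCK_PROC_CPU_WARN")
echo "$out"   # prints: threshold on collecting host: 10
```

Passing the value as a positional parameter avoids switching to an unquoted heredoc, which would force every other `$` in the script to be escaped.
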
Cal Corum
fcecde0de4 docs: decommission cognitive memory references from KB
Removed cognitive-memory MCP, timers, and symlink system references.
Replaced with kb-search MCP and /save-doc skill workflow.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-17 23:02:56 -05:00
Cal Corum
4b7eca8a46 docs: add YAML frontmatter to all 151 markdown files
Adds title, description, type, domain, and tags frontmatter to every
doc for improved KB semantic search. The description field is prepended
to every search chunk, and domain/type/tags enable filtered queries.

Type values: context, guide, runbook, reference, troubleshooting
Domain values match directory structure (networking, docker, etc.)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 09:00:44 -05:00
Cal Corum
28abde7c9f chore: add recovered CT 302 configs, archive tdarr scripts, clean up repo
- Add recovered LXC 300/302 server-diagnostics configs as reference
  (headless Claude permission patterns, health check client)
- Archive decommissioned tdarr monitoring scripts
- Gitignore rpg-art/ directory
- Delete stray temp files and swarm-test/

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-01 00:41:41 -06:00
Cal Corum
5ff94a9d20 docs: remove decommissioned MCP Gateway (CT 303) from monitoring inventory
Migrated MCP servers back to local stdio config, shut down LXC 303.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-25 12:39:53 -06:00
Cal Corum
df553e5142 docs: add AI infrastructure LXCs (301-303) to monitoring server inventory
Groups Claude Discord Coordinator, Claude Runner, and MCP Gateway
under a shared section. Documents new CT 303 MCP Gateway with n8n
and Gitea MCP server configuration details.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-23 19:58:19 -06:00
Cal Corum
28851a9012 docs: add pihole1, sba-bots, foundry to monitoring server inventory
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-20 14:15:43 -06:00
Cal Corum
3737c7dda5 docs: expand monitoring coverage, update Proxmox upgrade plan, remove decommissioned tdarr scripts
- Update monitoring CONTEXT.md with 6-server inventory table, per-server
  SSH user support, and pre-escalation Discord notification docs
- Remove tdarr local monitoring scripts (decommissioned per prior decision)
- Update Proxmox upgrade plan with Phase 1 completion and Phase 2 prep
- Update vm-management CONTEXT.md with current PVE 8 state
- CLAUDE.md: auto-run /save-memories at 25% context instead of asking

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-20 11:08:48 -06:00
Cal Corum
f20e221090 docs: update monitoring CONTEXT.md with expanded server inventory
Add server table with all 6 monitored hosts, per-server SSH user
docs, updated workflow server list, and pre-escalation Discord
notification documentation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-20 11:05:40 -06:00
Cal Corum
ed16fee9f7 docs: add CT 302 SSH alias and git auth details to server-diagnostics
Documents the claude-runner SSH alias, HTTPS token auth method,
and notes that SSH git remotes don't work from CT 302.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-20 09:04:33 -06:00
Cal Corum
3b2e031f45 Update monitoring docs with Uptime Kuma monitors and Discord alerts
Document all 20 active monitors with targets and tags, Discord
notification configuration, and API access details for programmatic
management via uptime-kuma-api.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-07 23:02:49 -06:00
Cal Corum
d0dbe86fba Add NVIDIA update checker and monitoring scripts documentation
Add nvidia_update_checker.py for weekly driver update monitoring with
Discord alerts. Add scripts CONTEXT.md and update README.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-07 22:21:00 -06:00
Cal Corum
a35891b565 Add Uptime Kuma service monitoring on LXC 227
Deploy Uptime Kuma for centralized service uptime monitoring at
https://status.manticorum.com. Proxmox LXC 227 (10.10.0.227) running
Ubuntu 22.04 with Docker. Updated monitoring documentation, CLAUDE.md
context loading rules, and server-configs host inventory.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-07 22:18:51 -06:00
Cal Corum
3112b3d6fe CLAUDE: Add Jellyfin GPU health monitor with auto-restart
- Created jellyfin_gpu_monitor.py for detecting lost GPU access
- Sends Discord alerts when GPU access fails
- Auto-restarts container to restore GPU binding
- Runs every 5 minutes via cron on ubuntu-manticore
- Documents FFmpeg exit code 187 (NVENC failure) in troubleshooting

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-28 22:57:04 -06:00
Cal Corum
0ecac96703 CLAUDE: Add Tdarr file monitoring scripts
- Add tdarr_file_monitor.py for API-based monitoring
- Add cron wrapper script for scheduled execution

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-07 00:48:10 -06:00
Cal Corum
edc78c2dd6 CLAUDE: Add comprehensive gaming-aware Tdarr management system
- Created complete gaming detection and priority system
- Added gaming schedule configuration and enforcement
- Implemented Steam library monitoring with auto-detection
- Built comprehensive game process detection for multiple platforms
- Added gaming-aware Tdarr worker management with priority controls
- Created emergency gaming mode for immediate worker shutdown
- Integrated Discord notifications for gaming state changes
- Replaced old bash monitoring with enhanced Python monitoring system
- Added persistent state management and memory tracking
- Implemented configurable gaming time windows and schedules
- Updated .gitignore to exclude logs directories

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-08-14 15:17:52 -05:00
Cal Corum
10c9e0d854 CLAUDE: Migrate to technology-first documentation architecture
Complete restructure from patterns/examples/reference to technology-focused directories:

• Created technology-specific directories with comprehensive documentation:
  - /tdarr/ - Transcoding automation with gaming-aware scheduling
  - /docker/ - Container management with GPU acceleration patterns
  - /vm-management/ - Virtual machine automation and cloud-init
  - /networking/ - SSH infrastructure, reverse proxy, and security
  - /monitoring/ - System health checks and Discord notifications
  - /databases/ - Database patterns and troubleshooting
  - /development/ - Programming language patterns (bash, nodejs, python, vuejs)

• Enhanced CLAUDE.md with intelligent context loading:
  - Technology-first loading rules for automatic context provision
  - Troubleshooting keyword triggers for emergency scenarios
  - Documentation maintenance protocols with automated reminders
  - Context window management for optimal documentation updates

• Preserved valuable content from .claude/tmp/:
  - SSH security improvements and server inventory
  - Tdarr CIFS troubleshooting and Docker iptables solutions
  - Operational scripts with proper technology classification

• Benefits achieved:
  - Self-contained technology directories with complete context
  - Automatic loading of relevant documentation based on keywords
  - Emergency-ready troubleshooting with comprehensive guides
  - Scalable structure for future technology additions
  - Eliminated context bloat through targeted loading

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-08-12 23:20:15 -05:00