
---
title: Monitoring and Alerting Overview
description: Architecture overview of the homelab monitoring system including Uptime Kuma, Claude Runner (CT 302) two-tier health checks, Tdarr monitoring, Windows desktop monitoring, and n8n master/sub-workflow orchestration.
type: context
domain: monitoring
tags:
  - uptime-kuma
  - claude-runner
  - n8n
  - discord
  - healthcheck
  - tdarr
  - windows
  - infrastructure
---

# System Monitoring and Alerting - Technology Context

## Overview

Comprehensive monitoring and alerting system for home lab infrastructure, focused on automated health checks, Discord notifications, and proactive system maintenance.

## Architecture Patterns

### Distributed Monitoring Strategy

**Pattern:** Service-specific monitoring with centralized alerting

- **Uptime Kuma:** Centralized service uptime and health monitoring (status page)
- **Claude Runner (CT 302):** SSH-based server diagnostics with two-tier auto-remediation
- **Tdarr Monitoring:** API-based transcoding health checks
- **Windows Desktop Monitoring:** Reboot detection and system events
- **Network Monitoring:** Connectivity and service availability
- **Container Monitoring:** Docker/Podman health and resource usage

## AI Infrastructure LXCs (301–302)

### Claude Discord Coordinator — CT 301 (10.10.0.147)

- **Purpose:** Discord bot coordination with read-only cognitive memory MCP access
- **SSH alias:** claude-discord-coordinator

### Claude Runner — CT 302 (10.10.0.148)

- **Purpose:** Automated server health monitoring with AI-escalated remediation
- **Repo:** cal/claude-runner-monitoring on Gitea (cloned to /root/.claude on CT 302)
- **Docs:** monitoring/server-diagnostics/CONTEXT.md

**Two-tier system:**

- **Tier 1 (`health_check.py`):** Pure Python, runs every 5 min via n8n. Checks containers, systemd services, and disk/memory/load. Auto-restarts containers when allowed. Exit 0 = healthy, 1 = auto-fixed, 2 = needs Claude.
- **Tier 2 (`client.py`):** Full diagnostic toolkit used by Claude during escalation sessions.

Monitored servers (dynamic from `config.yaml`):

| Server Key | IP | SSH User | Services | Critical |
|---|---|---|---|---|
| arr-stack | 10.10.0.221 | root | sonarr, radarr, readarr, lidarr, jellyseerr, sabnzbd | Yes |
| gitea | 10.10.0.225 | root | gitea (systemd), gitea-runner (docker) | Yes |
| uptime-kuma | 10.10.0.227 | root | uptime-kuma | Yes |
| n8n | 10.10.0.210 | root | n8n (no restart), n8n-postgres, omni-tools, termix | Yes |
| ubuntu-manticore | 10.10.0.226 | cal | jellyfin, tdarr-server, tdarr-node, pihole, watchstate, orbital-sync | Yes |
| strat-database | 10.10.0.42 | cal | sba_postgres, sba_redis, sba_db_api, dev_pd_database, sba_adminer | No (dev) |
| pihole1 | 10.10.0.16 | cal | pihole (primary DNS), nginx-proxy-manager, portainer | Yes |
| sba-bots | 10.10.0.88 | cal | paper-dynasty bot, paper-dynasty DB (prod), PD adminer, sba-website, ghost | Yes |
| foundry | 10.10.0.223 | root | foundry-foundry-1 (Foundry VTT) | No |

- **Per-server SSH user:** `health_check.py` supports a per-server `ssh_user` override in `config.yaml` (default: root). Used by ubuntu-manticore, strat-database, pihole1, and sba-bots, which require the `cal` user.
- **SSH keys:** n8n uses `n8n_runner_key` → CT 302; CT 302 uses `homelab_rsa` → target servers
- **Helper script:** `/root/.claude/skills/server-diagnostics/list_servers.sh` — extracts server keys from `config.yaml` as a JSON array
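The key-extraction idea behind the helper can be approximated with a short sketch. The sample `config.yaml` below is hypothetical (the real file lives on CT 302), and the assumption is that server keys sit as two-space-indented entries under a top-level `servers:` mapping:

```bash
# Hypothetical sample config mirroring the assumed config.yaml layout
cat > /tmp/sample-config.yaml <<'EOF'
servers:
  arr-stack:
    ip: 10.10.0.221
  gitea:
    ip: 10.10.0.225
EOF

# Extract the two-space-indented keys under `servers:` and emit a JSON
# array, the shape the n8n master workflow consumes.
keys=$(awk '/^servers:/{in_s=1; next}
            /^[^ ]/{in_s=0}
            in_s && /^  [A-Za-z0-9_-]+:/{gsub(/[: ]/,""); print}' /tmp/sample-config.yaml)
json="[$(printf '"%s",' $keys | sed 's/,$//')]"
echo "$json"   # → ["arr-stack","gitea"]
```

The production script may well use a proper YAML parser instead; the point is only the config-to-JSON-array contract.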

## n8n Workflow Architecture (Master + Sub-workflow)

The monitoring uses a master/sub-workflow pattern in n8n. Adding or removing servers only requires editing `config.yaml` on CT 302 — no n8n changes needed.

### Master: "Server Health Monitor - Claude Code" (p7XmW23SgCs3hEkY, active)

```text
Schedule (every 5 min)
  → SSH to CT 302: list_servers.sh → ["arr-stack", "gitea", "uptime-kuma", "n8n", "ubuntu-manticore", "strat-database"]
  → Code: split JSON array into one item per server_key
  → Execute Sub-workflow (mode: "each") → "Server Health Check"
  → Code: aggregate results (healthy/remediated/escalated counts)
  → If any escalations → Discord summary embed
```

### Sub-workflow: "Server Health Check" (BhzYmWr6NcIDoioy, active)

```text
Execute Workflow Trigger (receives { server_key: "arr-stack" })
  → SSH to CT 302: health_check.py --server {server_key}
  → Code: parse JSON output (status, exit_code, issues, escalations)
  → If exit_code == 2 → SSH: remediate.sh (escalation JSON)
  → Return results to parent (server_key, status, issues, remediation_output)
```
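The master's split/aggregate steps can be mimicked outside n8n. This sketch uses illustrative status words (one per server) rather than the real JSON payloads the sub-workflow returns:

```bash
# Simulated per-server results in the shape "server:status"
# (assumption: real payloads are JSON objects, simplified here)
results="arr-stack:healthy gitea:healthy n8n:escalated"

healthy=0; remediated=0; escalated=0
for entry in $results; do
  case "${entry#*:}" in            # strip "server:" prefix, keep status
    healthy)    healthy=$((healthy+1)) ;;
    remediated) remediated=$((remediated+1)) ;;
    escalated)  escalated=$((escalated+1)) ;;
  esac
done

summary="healthy=$healthy remediated=$remediated escalated=$escalated"
echo "$summary"
# Only escalations would trigger the Discord summary embed:
[ "$escalated" -gt 0 ] && echo "would send Discord summary"
```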

**Exit code behavior:**

- **0 (healthy):** No action; aggregated in summary
- **1 (auto-remediated):** Script already handled it and sent Discord via notifier.py — n8n takes no action
- **2 (needs escalation):** Sub-workflow runs remediate.sh; master sends Discord summary
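The branching that n8n performs on these exit codes can be sketched in shell. The health check is stubbed with a fake function here, and `remediate.sh` is only echoed, never invoked:

```bash
# Stub standing in for: ssh CT302 'health_check.py --server <key>'
# (assumption: the real script exits 0/1/2 per the contract above)
fake_health_check() { return 2; }

handle_server() {
  local server="$1"
  fake_health_check "$server"
  case $? in
    0) echo "$server: healthy" ;;
    1) echo "$server: auto-remediated (Discord already notified by notifier.py)" ;;
    2) echo "$server: escalating -> would run remediate.sh" ;;
    *) echo "$server: unknown exit code" ;;
  esac
}

result=$(handle_server arr-stack)
echo "$result"   # → arr-stack: escalating -> would run remediate.sh
```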

Pre-escalation notification: remediate.sh sends a Discord warning embed ("Claude API Escalation Triggered") via notifier.py before invoking the Claude CLI, so Cal gets a heads-up that API charges are about to be incurred.

- **SSH credential:** SSH Private Key account (id: QkbHQ8JmYimUoTcM)
- **Discord webhook:** Homelab Alerts channel

## Alert Management

**Pattern:** Structured notifications with actionable information

```bash
# Discord webhook integration
curl -X POST "$DISCORD_WEBHOOK" \
  -H "Content-Type: application/json" \
  -d '{
    "content": "**System Alert**\n```\nService: Tdarr\nIssue: Staging timeout\nAction: Automatic cleanup performed\n```\n<@user_id>"
  }'
```
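Hand-written JSON payloads break as soon as a message contains quotes or newlines. A safer variant escapes the message first; this is a minimal sketch (the `json_escape` helper is illustrative, `$DISCORD_WEBHOOK` must be set, and the curl call is left commented out):

```bash
json_escape() {
  # Escape backslashes and double quotes, then fold real newlines to \n
  printf '%s' "$1" \
    | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g' \
    | awk '{printf "%s\\n", $0}' \
    | sed 's/\\n$//'
}

msg='Service: Tdarr
Issue: "staging" timeout'

payload="{\"content\": \"$(json_escape "$msg")\"}"
echo "$payload"
# curl -X POST "$DISCORD_WEBHOOK" -H "Content-Type: application/json" -d "$payload"
```

Tools like `jq -n --arg` do the same job more robustly where jq is available.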

## Core Monitoring Components

### Tdarr System Monitoring

- **Purpose:** Monitor transcoding pipeline health and performance
- **Location:** `scripts/tdarr_monitor.py`

**Key Features:**

- API-based status monitoring with dataclass structures
- Staging section timeout detection and cleanup
- Discord notifications with professional formatting
- Log rotation and retention management
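The monitor itself is Python and reads the Tdarr API, but the core timeout idea (flag items stuck past a threshold) can be sketched generically against the filesystem. The staging directory and 60-minute threshold below are illustrative:

```bash
# Demo staging directory (hypothetical; the real monitor queries Tdarr's API)
staging=/tmp/tdarr-staging-demo
rm -rf "$staging" && mkdir -p "$staging"

touch "$staging/fresh.mkv"
touch -d '2 hours ago' "$staging/stuck.mkv"   # simulate a timed-out item

# Anything untouched for more than 60 minutes counts as timed out
stuck=$(find "$staging" -type f -mmin +60)
count=$(printf '%s\n' "$stuck" | grep -c .)
echo "timed out: $count"   # → timed out: 1
```

The real cleanup step would then remove or requeue each flagged item and send the Discord notification.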

### Windows Desktop Monitoring

- **Purpose:** Track Windows system reboots and power events
- **Location:** `scripts/windows-desktop/`

**Components:**

- PowerShell monitoring script
- Scheduled task automation
- Discord notification integration
- System event correlation

### Uptime Kuma (Centralized Uptime Monitoring)

- **Purpose:** Centralized service uptime, health checks, and status page for all homelab services
- **Location:** LXC 227 (10.10.0.227), Docker container
- **URL:** https://status.manticorum.com (internal: http://10.10.0.227:3001)

**Key Features:**

- HTTP/HTTPS, TCP, DNS, Docker, and ping monitoring
- Discord notification integration (default alert channel for all monitors)
- Public status page at https://status.manticorum.com
- Multi-protocol health checks at 60-second intervals with 3 retries
- Certificate expiration monitoring

**Infrastructure:**

- Proxmox LXC 227, Ubuntu 22.04, 2 cores, 2 GB RAM, 8 GB disk
- Docker with AppArmor unconfined (required for Docker-in-LXC)
- Data persisted via Docker named volume (`uptime-kuma-data`)
- Compose config: `server-configs/uptime-kuma/docker-compose/uptime-kuma/`
- SSH alias: uptime-kuma
- Admin credentials: username `cal`, password in `~/.claude/secrets/kuma_web_password`

**Active Monitors (20):**

| Tag | Monitor | Type | Target |
|---|---|---|---|
| Infrastructure | Proxmox VE | HTTP | https://10.10.0.11:8006 |
| Infrastructure | Home Assistant | HTTP | http://10.0.0.28:8123 |
| DNS | Pi-hole Primary DNS | DNS | 10.10.0.16:53 |
| DNS | Pi-hole Secondary DNS | DNS | 10.10.0.226:53 |
| Media | Jellyfin | HTTP | http://10.10.0.226:8096 |
| Media | Tdarr | HTTP | http://10.10.0.226:8265 |
| Media | Sonarr | HTTP | http://10.10.0.221:8989 |
| Media | Radarr | HTTP | http://10.10.0.221:7878 |
| Media | Jellyseerr | HTTP | http://10.10.0.221:5055 |
| DevOps | Gitea | HTTP | http://10.10.0.225:3000 |
| DevOps | n8n | HTTP | http://10.10.0.210:5678 |
| Networking | NPM Local (Admin) | HTTP | http://10.10.0.16:81 |
| Networking | Pi-hole Primary Web | HTTP | http://10.10.0.16:81/admin |
| Networking | Pi-hole Secondary Web | HTTP | http://10.10.0.226:8053/admin |
| Gaming | Foundry VTT | HTTP | http://10.10.0.223:30000 |
| AI | OpenClaw Gateway | HTTP | http://10.10.0.224:18789 |
| Bots | discord-bots VM | Ping | 10.10.0.33 |
| Bots | sba-bots VM | Ping | 10.10.0.88 |
| Database | PostgreSQL (strat-database) | TCP | 10.10.0.42:5432 |
| External | Akamai NPM | HTTP | http://172.237.147.99 |

**Notifications:**

- Discord webhook: "Discord - Homelab Alerts" (default, applied to all monitors)
- Alerts on service down (after 3 retries at 30 s intervals) and on recovery

**API Access:**

- Python library: `uptime-kuma-api` (pip-installed)
- Connection: `UptimeKumaApi("http://10.10.0.227:3001")`
- Used for programmatic monitor/notification management

### Network and Service Monitoring

**Purpose:** Monitor critical infrastructure availability

**Implementation:**

```bash
# Service health check pattern
SERVICES="https://homelab.local http://nas.homelab.local"
for service in $SERVICES; do
    if curl -sSf --max-time 10 "$service" >/dev/null 2>&1; then
        echo "✅ $service: Available"
    else
        echo "❌ $service: Failed" | send_alert
    fi
done
```

## Automation Patterns

### Cron-Based Scheduling

**Pattern:** Regular health checks with intelligent alerting

```cron
# Monitoring schedule examples
*/20 * * * * /path/to/tdarr-timeout-monitor.sh    # Every 20 minutes
0 */6 * * * /path/to/cleanup-temp-dirs.sh         # Every 6 hours
0 2 * * * /path/to/backup-monitor.sh              # Daily at 2 AM
```

### Event-Driven Monitoring

**Pattern:** Reactive monitoring for critical events

- **System Startup:** Windows boot detection
- **Service Failures:** Container restart alerts
- **Resource Exhaustion:** Disk space warnings
- **Security Events:** Failed login attempts
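A minimal version of the resource-exhaustion check (disk space) looks like the following. The 90% threshold is illustrative, and a real deployment would route the warning through the Discord webhook instead of `echo`:

```bash
threshold=90   # percent used that triggers a warning (illustrative)

# Parse "Use%" for the root filesystem from POSIX df output
used=$(df -P / | awk 'NR==2 {gsub(/%/,""); print $5}')

if [ "$used" -ge "$threshold" ]; then
  msg="disk warning: / at ${used}% (>= ${threshold}%)"
else
  msg="disk ok: / at ${used}%"
fi
echo "$msg"
```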

## Data Collection and Analysis

### Log Management

**Pattern:** Centralized logging with rotation

```bash
# Log rotation configuration
LOG_FILE="/var/log/homelab-monitor.log"
MAX_SIZE_BYTES=$((10*1024*1024))   # 10M
RETENTION_DAYS=30

# Rotate the log when the size threshold is exceeded
if [ "$(stat -c%s "$LOG_FILE")" -gt "$MAX_SIZE_BYTES" ]; then
    mv "$LOG_FILE" "$LOG_FILE.$(date +%Y%m%d)"
    touch "$LOG_FILE"
fi

# Drop rotated logs older than the retention window
find "$(dirname "$LOG_FILE")" -name "$(basename "$LOG_FILE").*" \
    -mtime +"$RETENTION_DAYS" -delete
```

### Metrics Collection

**Pattern:** Time-series data for trend analysis

- **System Metrics:** CPU, memory, disk usage
- **Service Metrics:** Response times, error rates
- **Application Metrics:** Transcoding progress, queue sizes
- **Network Metrics:** Bandwidth usage, latency
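At homelab scale, a flat CSV is often enough time-series storage for trend analysis. This sketch appends one sample of load and memory use per run (the path and field names are illustrative, and `/proc` assumes a Linux host):

```bash
metrics=/tmp/metrics-demo.csv
rm -f "$metrics"
[ -f "$metrics" ] || echo "timestamp,load1,mem_used_pct" > "$metrics"

ts=$(date -u +%Y-%m-%dT%H:%M:%SZ)
load1=$(awk '{print $1}' /proc/loadavg)
# Percent of RAM in use, derived from MemTotal and MemAvailable
mem=$(awk '/MemTotal/{t=$2} /MemAvailable/{a=$2} END{printf "%.0f", (t-a)*100/t}' /proc/meminfo)

echo "$ts,$load1,$mem" >> "$metrics"
tail -n 1 "$metrics"
```

Run from cron, this accumulates rows that spreadsheet tools or a quick awk/gnuplot pass can graph for trends.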

## Alert Integration

### Discord Notification System

**Pattern:** Rich, actionable notifications

```text
# Professional alert format
**🔧 System Maintenance**
Service: Tdarr Transcoding
Issue: 3 files timed out in staging
Resolution: Automatic cleanup completed
Status: System operational

Manual review recommended <@user_id>
```

### Alert Escalation

**Pattern:** Tiered alerting based on severity

1. **Info:** Routine maintenance completed
2. **Warning:** Service degradation detected
3. **Critical:** Service failure requiring immediate attention
4. **Emergency:** System-wide failure requiring manual intervention
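The tiers map naturally onto notification behavior, e.g. pinging a user only on the top tiers. This routing sketch uses the severity names from the list above; the mention logic and output format are illustrative:

```bash
route_alert() {
  local severity="$1" message="$2" mention=""
  case "$severity" in
    critical|emergency) mention=" <@user_id>" ;;   # ping only on top tiers
  esac
  echo "[$severity] $message$mention"
}

out=$(route_alert warning "Service degradation detected")
echo "$out"    # → [warning] Service degradation detected
out2=$(route_alert critical "Service failure")
echo "$out2"   # → [critical] Service failure <@user_id>
```

A fuller version would also pick the Discord embed color per tier.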

## Best Practices Implementation

### Monitoring Strategy

1. **Proactive:** Monitor trends to predict issues
2. **Reactive:** Alert on current failures
3. **Preventive:** Automated cleanup and maintenance
4. **Comprehensive:** Cover all critical services
5. **Actionable:** Provide clear resolution paths

### Performance Optimization

1. **Efficient Polling:** Balance monitoring frequency with resource usage
2. **Smart Alerting:** Avoid alert fatigue with intelligent filtering
3. **Resource Management:** Monitor the monitoring system itself
4. **Scalable Architecture:** Design for growth and additional services
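Smart alerting usually comes down to suppressing repeats. The simplest form is a state file keyed by alert identity with a cooldown window; this sketch uses an illustrative one-hour cooldown and a throwaway state directory:

```bash
state=/tmp/alert-state-demo
rm -rf "$state" && mkdir -p "$state"

should_alert() {
  local key="$1" cooldown=3600 stamp now
  stamp="$state/$key"
  now=$(date +%s)
  if [ -f "$stamp" ] && [ $((now - $(cat "$stamp"))) -lt "$cooldown" ]; then
    return 1   # identical alert fired within the cooldown; stay quiet
  fi
  echo "$now" > "$stamp"
  return 0
}

first=$(should_alert tdarr-staging && echo sent || echo suppressed)
second=$(should_alert tdarr-staging && echo sent || echo suppressed)
echo "$first $second"   # → sent suppressed
```

Clearing the stamp on recovery re-arms the alert, so a flapping service still notifies once per episode rather than once per check.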

This technology context provides the foundation for implementing comprehensive monitoring and alerting in home lab environments.