claude-home/monitoring/CONTEXT.md
Cal Corum 3b2e031f45 Update monitoring docs with Uptime Kuma monitors and Discord alerts
Document all 20 active monitors with targets and tags, Discord
notification configuration, and API access details for programmatic
management via uptime-kuma-api.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-07 23:02:49 -06:00


# System Monitoring and Alerting - Technology Context
## Overview
Monitoring and alerting system for home lab infrastructure, with a focus on automated health checks, Discord notifications, and proactive system maintenance.
## Architecture Patterns
### Distributed Monitoring Strategy
**Pattern**: Service-specific monitoring with centralized alerting
- **Uptime Kuma**: Centralized service uptime and health monitoring (status page)
- **Tdarr Monitoring**: API-based transcoding health checks
- **Windows Desktop Monitoring**: Reboot detection and system events
- **Network Monitoring**: Connectivity and service availability
- **Container Monitoring**: Docker/Podman health and resource usage
### Alert Management
**Pattern**: Structured notifications with actionable information
```bash
# Discord webhook integration
curl -X POST "$DISCORD_WEBHOOK" \
  -H "Content-Type: application/json" \
  -d '{
    "content": "**System Alert**\n```\nService: Tdarr\nIssue: Staging timeout\nAction: Automatic cleanup performed\n```\n<@user_id>"
  }'
```
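For Python-based monitors, the same webhook call can be sketched with the standard library. This is a minimal sketch, not the project's actual implementation: `build_alert_payload`, `post_to_discord`, and the `User-Agent` value are hypothetical names chosen for illustration.

```python
import json
import urllib.request

FENCE = "`" * 3  # Markdown code fence, built here to keep this example readable


def build_alert_payload(service: str, issue: str, action: str, user_id: str = "") -> dict:
    """Build a webhook payload shaped like the curl example above."""
    details = f"Service: {service}\nIssue: {issue}\nAction: {action}"
    content = f"**System Alert**\n{FENCE}\n{details}\n{FENCE}"
    if user_id:
        # Optional mention so the alert pings a specific user
        content += f"\n<@{user_id}>"
    return {"content": content}


def post_to_discord(webhook_url: str, payload: dict, timeout: float = 10.0) -> int:
    """POST the payload as JSON and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumption: an explicit User-Agent; the value is a placeholder
            "User-Agent": "homelab-monitor/1.0",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Keeping payload construction separate from the network call makes the message format easy to check without sending anything.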
## Core Monitoring Components
### Tdarr System Monitoring
**Purpose**: Monitor transcoding pipeline health and performance
**Location**: `scripts/tdarr_monitor.py`
**Key Features**:
- API-based status monitoring with dataclass structures
- Staging section timeout detection and cleanup
- Discord notifications in a consistent, structured format
- Log rotation and retention management
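The staging timeout check can be illustrated with a small dataclass. This is a sketch under stated assumptions: `StagedFile`, `find_timed_out`, and the one-hour default are hypothetical and are not taken from `scripts/tdarr_monitor.py` or the Tdarr API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class StagedFile:
    """Hypothetical record for a file sitting in the Tdarr staging section."""
    path: str
    staged_at: datetime


def find_timed_out(files, now, timeout=timedelta(hours=1)):
    """Return the files that have been staged for longer than `timeout`."""
    return [f for f in files if now - f.staged_at > timeout]
```

Passing `now` in explicitly, rather than calling `datetime.now()` inside the function, keeps the check deterministic and easy to test.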
### Windows Desktop Monitoring
**Purpose**: Track Windows system reboots and power events
**Location**: `scripts/windows-desktop/`
**Components**:
- PowerShell monitoring script
- Scheduled task automation
- Discord notification integration
- System event correlation
### Uptime Kuma (Centralized Uptime Monitoring)
**Purpose**: Centralized service uptime, health checks, and status page for all homelab services
**Location**: LXC 227 (10.10.0.227), Docker container
**URL**: https://status.manticorum.com (internal: http://10.10.0.227:3001)
**Key Features**:
- HTTP/HTTPS, TCP, DNS, Docker, and ping monitoring
- Discord notification integration (default alert channel for all monitors)
- Public status page at https://status.manticorum.com
- Multi-protocol health checks at 60-second intervals with 3 retries
- Certificate expiration monitoring
**Infrastructure**:
- Proxmox LXC 227, Ubuntu 22.04, 2 cores, 2GB RAM, 8GB disk
- Docker with AppArmor unconfined (required for Docker-in-LXC)
- Data persisted via Docker named volume (`uptime-kuma-data`)
- Compose config: `server-configs/uptime-kuma/docker-compose/uptime-kuma/`
- SSH alias: `uptime-kuma`
- Admin credentials: username `cal`, password in `~/.claude/secrets/kuma_web_password`
**Active Monitors (20)**:
| Tag | Monitor | Type | Target |
|-----|---------|------|--------|
| Infrastructure | Proxmox VE | HTTP | https://10.10.0.11:8006 |
| Infrastructure | Home Assistant | HTTP | http://10.0.0.28:8123 |
| DNS | Pi-hole Primary DNS | DNS | 10.10.0.16:53 |
| DNS | Pi-hole Secondary DNS | DNS | 10.10.0.226:53 |
| Media | Jellyfin | HTTP | http://10.10.0.226:8096 |
| Media | Tdarr | HTTP | http://10.10.0.226:8265 |
| Media | Sonarr | HTTP | http://10.10.0.221:8989 |
| Media | Radarr | HTTP | http://10.10.0.221:7878 |
| Media | Jellyseerr | HTTP | http://10.10.0.221:5055 |
| DevOps | Gitea | HTTP | http://10.10.0.225:3000 |
| DevOps | n8n | HTTP | http://10.10.0.210:5678 |
| Networking | NPM Local (Admin) | HTTP | http://10.10.0.16:81 |
| Networking | Pi-hole Primary Web | HTTP | http://10.10.0.16:81/admin |
| Networking | Pi-hole Secondary Web | HTTP | http://10.10.0.226:8053/admin |
| Gaming | Foundry VTT | HTTP | http://10.10.0.223:30000 |
| AI | OpenClaw Gateway | HTTP | http://10.10.0.224:18789 |
| Bots | discord-bots VM | Ping | 10.10.0.33 |
| Bots | sba-bots VM | Ping | 10.10.0.88 |
| Database | PostgreSQL (strat-database) | TCP | 10.10.0.42:5432 |
| External | Akamai NPM | HTTP | http://172.237.147.99 |
**Notifications**:
- Discord webhook: "Discord - Homelab Alerts" (default, applied to all monitors)
- Alerts on service down (after 3 retries at 30s intervals) and on recovery
**API Access**:
- Python library: `uptime-kuma-api` (pip installed)
- Connection: `UptimeKumaApi("http://10.10.0.227:3001")`
- Used for programmatic monitor/notification management
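A minimal sketch of programmatic access, assuming the `uptime-kuma-api` package is installed and the admin credentials listed above. `count_monitors_by_type` and `fetch_monitors` are hypothetical helper names, and the assumption that each monitor dictionary carries a `type` field should be checked against the installed library version.

```python
from collections import Counter
from pathlib import Path

KUMA_URL = "http://10.10.0.227:3001"
PASSWORD_FILE = Path.home() / ".claude/secrets/kuma_web_password"


def count_monitors_by_type(monitors):
    """Tally monitor dictionaries by their "type" field (enum or plain string)."""
    return Counter(getattr(m["type"], "value", m["type"]) for m in monitors)


def fetch_monitors(username="cal"):
    """Log in, fetch all monitors, and always close the connection."""
    # Imported lazily so count_monitors_by_type works without the library
    from uptime_kuma_api import UptimeKumaApi

    api = UptimeKumaApi(KUMA_URL)
    try:
        api.login(username, PASSWORD_FILE.read_text().strip())
        return api.get_monitors()
    finally:
        api.disconnect()
```

Calling `count_monitors_by_type(fetch_monitors())` would give a quick summary to compare against the table above.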
### Network and Service Monitoring
**Purpose**: Monitor critical infrastructure availability
**Implementation**:
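```bash
# Service health check pattern
# send_alert is assumed to be a user-defined function that reads the message from stdin
SERVICES="https://homelab.local http://nas.homelab.local"
for service in $SERVICES; do
    # -f makes curl fail on HTTP errors; --max-time caps each check at 10 seconds
    if curl -sSf --max-time 10 "$service" >/dev/null 2>&1; then
        echo "✅ $service: Available"
    else
        echo "❌ $service: Failed" | send_alert
    fi
done
```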
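The same availability check can be sketched in Python with the standard library. `is_available` is a hypothetical helper, and the `opener` parameter exists only so the check can be exercised without network access.

```python
import urllib.request


def is_available(url, timeout=10.0, opener=urllib.request.urlopen):
    """Return True if the URL responds successfully within `timeout` seconds."""
    try:
        # urlopen raises HTTPError for 4xx/5xx responses, similar to curl -f
        with opener(url, timeout=timeout):
            return True
    except OSError:
        # URLError, HTTPError, and socket timeouts are all OSError subclasses
        return False
```

A monitor would call `is_available` for each service URL and route failures to its alerting function.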
## Automation Patterns
### Cron-Based Scheduling
**Pattern**: Regular health checks with intelligent alerting
```bash
# Monitoring schedule examples
*/20 * * * * /path/to/tdarr-timeout-monitor.sh # Every 20 minutes
0 */6 * * * /path/to/cleanup-temp-dirs.sh # Every 6 hours
0 2 * * * /path/to/backup-monitor.sh # Daily at 2 AM
```
### Event-Driven Monitoring
**Pattern**: Reactive monitoring for critical events
- **System Startup**: Windows boot detection
- **Service Failures**: Container restart alerts
- **Resource Exhaustion**: Disk space warnings
- **Security Events**: Failed login attempts
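As one example of these triggers, a disk space warning can be sketched with `shutil.disk_usage`. The function names and the 90% threshold are assumptions for illustration, not values from this setup.

```python
import shutil


def disk_usage_percent(path="/"):
    """Return the percentage of the filesystem at `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100


def needs_disk_warning(percent_used, threshold=90.0):
    """Return True once usage reaches the warning threshold."""
    return percent_used >= threshold
```

Splitting the measurement from the threshold decision lets the same rule be reused for remote hosts that report usage some other way.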
## Data Collection and Analysis
### Log Management
**Pattern**: Centralized logging with rotation
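```bash
# Log rotation configuration
LOG_FILE="/var/log/homelab-monitor.log"
MAX_SIZE_BYTES=$((10 * 1024 * 1024))  # 10 MiB
RETENTION_DAYS=30
# Rotate logs when size exceeded (timestamp suffix avoids overwriting a same-day rotation)
if [ -f "$LOG_FILE" ] && [ "$(stat -c%s "$LOG_FILE")" -gt "$MAX_SIZE_BYTES" ]; then
    mv "$LOG_FILE" "$LOG_FILE.$(date +%Y%m%d-%H%M%S)"
    touch "$LOG_FILE"
fi
# Delete rotated logs older than the retention window
find "$(dirname "$LOG_FILE")" -name "$(basename "$LOG_FILE").*" -mtime +"$RETENTION_DAYS" -delete
```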
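Python monitors can get size-based rotation from the standard library. This is a sketch, assuming a hypothetical `build_rotating_logger` helper; note that `RotatingFileHandler` retains a fixed number of backups rather than a number of days, so it approximates the retention setting above.

```python
import logging
from logging.handlers import RotatingFileHandler


def build_rotating_logger(log_file, max_bytes=10 * 1024 * 1024, backup_count=5):
    """Return a logger that rotates `log_file` once it exceeds `max_bytes`."""
    logger = logging.getLogger(f"homelab-monitor:{log_file}")
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid attaching duplicate handlers on repeat calls
        handler = RotatingFileHandler(
            log_file, maxBytes=max_bytes, backupCount=backup_count
        )
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger
```

Rotated files are named `<log_file>.1`, `<log_file>.2`, and so on, with the oldest dropped once `backup_count` is reached.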
### Metrics Collection
**Pattern**: Time-series data for trend analysis
- **System Metrics**: CPU, memory, disk usage
- **Service Metrics**: Response times, error rates
- **Application Metrics**: Transcoding progress, queue sizes
- **Network Metrics**: Bandwidth usage, latency
## Alert Integration
### Discord Notification System
**Pattern**: Rich, actionable notifications
```markdown
# Professional alert format
**🔧 System Maintenance**
Service: Tdarr Transcoding
Issue: 3 files timed out in staging
Resolution: Automatic cleanup completed
Status: System operational
Manual review recommended <@user_id>
```
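A message in this format could be assembled by a small helper. `format_alert` and its parameters are hypothetical; the sketch only reproduces the field layout shown above.

```python
def format_alert(title, service, issue, resolution, status, note="", user_id=""):
    """Build a Discord message with one labelled field per line."""
    lines = [
        f"**{title}**",
        f"Service: {service}",
        f"Issue: {issue}",
        f"Resolution: {resolution}",
        f"Status: {status}",
    ]
    # Optional closing line: a short note and/or a user mention
    mention = f"<@{user_id}>" if user_id else ""
    footer = " ".join(part for part in (note, mention) if part)
    if footer:
        lines.append(footer)
    return "\n".join(lines)
```

Using fixed field labels keeps alerts from different monitors visually consistent in the channel.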
### Alert Escalation
**Pattern**: Tiered alerting based on severity
1. **Info**: Routine maintenance completed
2. **Warning**: Service degradation detected
3. **Critical**: Service failure requiring immediate attention
4. **Emergency**: System-wide failure requiring manual intervention
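The four tiers map naturally onto an ordered enum. This is a sketch: the `Severity` type, the helper names, and the rule that mentions start at Critical are assumptions, not documented behavior.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Alert tiers in ascending order of urgency."""
    INFO = 1
    WARNING = 2
    CRITICAL = 3
    EMERGENCY = 4


def should_mention(severity, mention_from=Severity.CRITICAL):
    """Return True when the alert is severe enough to ping a user."""
    return severity >= mention_from


def severity_label(severity):
    """Return a short prefix such as "[WARNING]" for the alert text."""
    return f"[{severity.name}]"
```

Because `IntEnum` members compare as integers, thresholds like "Critical and above" need no lookup table.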
## Best Practices Implementation
### Monitoring Strategy
1. **Proactive**: Monitor trends to predict issues
2. **Reactive**: Alert on current failures
3. **Preventive**: Automated cleanup and maintenance
4. **Comprehensive**: Cover all critical services
5. **Actionable**: Provide clear resolution paths
### Performance Optimization
1. **Efficient Polling**: Balance monitoring frequency with resource usage
2. **Smart Alerting**: Avoid alert fatigue with intelligent filtering
3. **Resource Management**: Monitor the monitoring system itself
4. **Scalable Architecture**: Design for growth and additional services
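One simple way to limit alert fatigue is a per-service cooldown, sketched below. `AlertThrottle` and the 15-minute default are hypothetical; the injectable `clock` is there so the behavior can be tested without waiting.

```python
import time


class AlertThrottle:
    """Suppress repeat alerts for the same key within a cooldown window."""

    def __init__(self, cooldown_seconds=900.0, clock=time.monotonic):
        self._cooldown = cooldown_seconds
        self._clock = clock
        self._last_sent = {}  # key -> time the last alert was allowed

    def should_send(self, key):
        """Return True and record the send time unless the key is cooling down."""
        now = self._clock()
        last = self._last_sent.get(key)
        if last is not None and now - last < self._cooldown:
            return False
        self._last_sent[key] = now
        return True
```

A monitor would call `should_send("tdarr")` before posting, so a flapping service produces one alert per window instead of one per check.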
This technology context provides the foundation for implementing comprehensive monitoring and alerting in home lab environments.