claude-home/monitoring/CONTEXT.md

# System Monitoring and Alerting - Technology Context

## Overview
A monitoring and alerting system for home lab infrastructure, focused on automated health checks, Discord notifications, and proactive system maintenance.

## Architecture Patterns

### Distributed Monitoring Strategy
**Pattern**: Service-specific monitoring with centralized alerting
- **Uptime Kuma**: Centralized service uptime and health monitoring (status page)
- **Tdarr Monitoring**: API-based transcoding health checks
- **Windows Desktop Monitoring**: Reboot detection and system events
- **Network Monitoring**: Connectivity and service availability
- **Container Monitoring**: Docker/Podman health and resource usage

### Alert Management
**Pattern**: Structured notifications with actionable information
```bash
# Discord webhook integration
curl -X POST "$DISCORD_WEBHOOK" \
  -H "Content-Type: application/json" \
  -d '{
    "content": "**System Alert**\n```\nService: Tdarr\nIssue: Staging timeout\nAction: Automatic cleanup performed\n```\n<@user_id>"
  }'
```
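Hand-written JSON like the payload above breaks as soon as a message contains quotes or newlines. The following is a minimal Python sketch of the same webhook call that lets `json.dumps` handle escaping. The function names `build_alert_payload` and `send_alert` and the `User-Agent` value are illustrative assumptions, not taken from the repository scripts.

```python
import json
import urllib.request

FENCE = "`" * 3  # Discord code-block delimiter, built at runtime


def build_alert_payload(service: str, issue: str, action: str, user_id: str = "") -> dict:
    """Build the webhook body in the same layout as the curl example."""
    details = f"Service: {service}\nIssue: {issue}\nAction: {action}"
    content = f"**System Alert**\n{FENCE}\n{details}\n{FENCE}"
    if user_id:
        content += f"\n<@{user_id}>"  # Discord user mention
    return {"content": content}


def send_alert(payload: dict, webhook_url: str) -> int:
    """POST the payload to the webhook and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumption: an explicit agent avoids rejection of urllib's default one.
            "User-Agent": "homelab-monitor/1.0",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


# Usage (assumes the webhook URL is exported as DISCORD_WEBHOOK):
#   import os
#   payload = build_alert_payload("Tdarr", "Staging timeout", "Automatic cleanup performed")
#   send_alert(payload, os.environ["DISCORD_WEBHOOK"])
```

Keeping payload construction separate from the network call makes the message format testable without a live webhook.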

## Core Monitoring Components

### Tdarr System Monitoring
**Purpose**: Monitor transcoding pipeline health and performance
**Location**: `scripts/tdarr_monitor.py`

**Key Features**:
- API-based status monitoring with dataclass structures
- Staging section timeout detection and cleanup
- Discord notifications with professional formatting
- Log rotation and retention management
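
The staging timeout check can be sketched with a small dataclass, assuming each staged file can be reduced to a path and the time it entered staging. `StagedFile`, `find_timed_out`, and the timeout value are hypothetical; they do not reflect the actual contents of `tdarr_monitor.py` or the shape of Tdarr's API responses.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Sequence


@dataclass(frozen=True)
class StagedFile:
    """One file in the staging section (hypothetical structure)."""
    path: str
    staged_at: datetime


def find_timed_out(
    files: Sequence[StagedFile], now: datetime, timeout: timedelta
) -> List[StagedFile]:
    """Return the files that have been in staging longer than the timeout."""
    return [f for f in files if now - f.staged_at > timeout]


# Example with made-up data: only the file staged six hours ago is flagged.
now = datetime(2024, 1, 1, 12, 0)
files = [
    StagedFile("a.mkv", now - timedelta(hours=6)),
    StagedFile("b.mkv", now - timedelta(minutes=10)),
]
stuck = find_timed_out(files, now, timedelta(hours=5))
```

Passing `now` in explicitly keeps the check deterministic; the cleanup and notification steps would act on the returned list.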

### Windows Desktop Monitoring
**Purpose**: Track Windows system reboots and power events
**Location**: `scripts/windows-desktop/`

**Components**:
- PowerShell monitoring script
- Scheduled task automation
- Discord notification integration
- System event correlation

### Uptime Kuma (Centralized Uptime Monitoring)
**Purpose**: Centralized service uptime, health checks, and status page for all homelab services
**Location**: LXC 227 (10.10.0.227), Docker container
**URL**: https://status.manticorum.com (internal: http://10.10.0.227:3001)

**Key Features**:
- HTTP/HTTPS, TCP, DNS, Docker, and ping monitoring
- Discord notification integration (default alert channel for all monitors)
- Public status page at https://status.manticorum.com
- Multi-protocol health checks every 60 seconds, with 3 retries before a monitor is marked down
- Certificate expiration monitoring

**Infrastructure**:
- Proxmox LXC 227, Ubuntu 22.04, 2 cores, 2GB RAM, 8GB disk
- Docker with AppArmor unconfined (required for Docker-in-LXC)
- Data persisted via Docker named volume (`uptime-kuma-data`)
- Compose config: `server-configs/uptime-kuma/docker-compose/uptime-kuma/`
- SSH alias: `uptime-kuma`
- Admin credentials: username `cal`, password in `~/.claude/secrets/kuma_web_password`

**Active Monitors (20)**:

| Tag | Monitor | Type | Target |
|-----|---------|------|--------|
| Infrastructure | Proxmox VE | HTTP | https://10.10.0.11:8006 |
| Infrastructure | Home Assistant | HTTP | http://10.0.0.28:8123 |
| DNS | Pi-hole Primary DNS | DNS | 10.10.0.16:53 |
| DNS | Pi-hole Secondary DNS | DNS | 10.10.0.226:53 |
| Media | Jellyfin | HTTP | http://10.10.0.226:8096 |
| Media | Tdarr | HTTP | http://10.10.0.226:8265 |
| Media | Sonarr | HTTP | http://10.10.0.221:8989 |
| Media | Radarr | HTTP | http://10.10.0.221:7878 |
| Media | Jellyseerr | HTTP | http://10.10.0.221:5055 |
| DevOps | Gitea | HTTP | http://10.10.0.225:3000 |
| DevOps | n8n | HTTP | http://10.10.0.210:5678 |
| Networking | NPM Local (Admin) | HTTP | http://10.10.0.16:81 |
| Networking | Pi-hole Primary Web | HTTP | http://10.10.0.16:81/admin |
| Networking | Pi-hole Secondary Web | HTTP | http://10.10.0.226:8053/admin |
| Gaming | Foundry VTT | HTTP | http://10.10.0.223:30000 |
| AI | OpenClaw Gateway | HTTP | http://10.10.0.224:18789 |
| Bots | discord-bots VM | Ping | 10.10.0.33 |
| Bots | sba-bots VM | Ping | 10.10.0.88 |
| Database | PostgreSQL (strat-database) | TCP | 10.10.0.42:5432 |
| External | Akamai NPM | HTTP | http://172.237.147.99 |

**Notifications**:
- Discord webhook: "Discord - Homelab Alerts" (default, applied to all monitors)
- Alerts when a service goes down (after 3 failed retries, 30 seconds apart) and when it recovers
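
The retry-then-alert behavior described above can be modeled as a small state machine. This is an illustrative sketch of the logic only, not Uptime Kuma's implementation; `MonitorState` and its fields are hypothetical names, and it assumes a monitor is marked down once the initial failed check and all retries have failed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MonitorState:
    max_retries: int = 3
    is_down: bool = False
    failures: int = 0  # consecutive failed checks

    def record(self, check_ok: bool) -> Optional[str]:
        """Feed one check result; return "down", "recovered", or None."""
        if check_ok:
            self.failures = 0
            if self.is_down:
                self.is_down = False
                return "recovered"
            return None
        self.failures += 1
        # The first failure plus max_retries failed retries confirms the outage.
        if not self.is_down and self.failures > self.max_retries:
            self.is_down = True
            return "down"
        return None


# Four failures in a row trigger one "down" event; the next success triggers "recovered".
state = MonitorState()
events = [state.record(ok) for ok in (False, False, False, False, False, True, True)]
```

Only state transitions produce events, so a service that stays down does not generate repeated alerts.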

**API Access**:
- Python library: `uptime-kuma-api` (pip installed)
- Connection: `UptimeKumaApi("http://10.10.0.227:3001")`
- Used for programmatic monitor/notification management
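
A minimal sketch of programmatic access with `uptime-kuma-api`, using the URL, username, and password file listed in this document. The helper names are illustrative, and the `"name"` and `"active"` keys are assumed from the library's documented `get_monitors()` output; verify them against the installed version.

```python
from pathlib import Path
from typing import List


def load_password(path: str = "~/.claude/secrets/kuma_web_password") -> str:
    """Read the admin password from the secrets file."""
    return Path(path).expanduser().read_text().strip()


def inactive_monitor_names(monitors: List[dict]) -> List[str]:
    """Return the sorted names of monitors whose "active" flag is false."""
    return sorted(m["name"] for m in monitors if not m.get("active", True))


def fetch_monitors(url: str = "http://10.10.0.227:3001", username: str = "cal") -> List[dict]:
    """Log in, fetch all monitors, and always close the connection."""
    # Imported lazily so the helpers above work without the library installed.
    from uptime_kuma_api import UptimeKumaApi

    api = UptimeKumaApi(url)
    try:
        api.login(username, load_password())
        return api.get_monitors()
    finally:
        api.disconnect()


# Usage (requires network access to the Uptime Kuma instance):
#   paused = inactive_monitor_names(fetch_monitors())
```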

### Network and Service Monitoring
**Purpose**: Monitor critical infrastructure availability
**Implementation**:
```bash
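# Service health check pattern
SERVICES="https://homelab.local http://nas.homelab.local"

# Minimal send_alert sketch: posts the message read from stdin to Discord.
# Assumes DISCORD_WEBHOOK is set and the message has no double quotes or backslashes.
send_alert() {
    curl -sS -X POST "$DISCORD_WEBHOOK" \
        -H "Content-Type: application/json" \
        -d "{\"content\": \"$(cat)\"}" >/dev/null
}

for service in $SERVICES; do
    if curl -sSf --max-time 10 "$service" >/dev/null 2>&1; then
        echo "✅ $service: Available"
    else
        echo "❌ $service: Failed" | send_alert
    fi
done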
```

## Automation Patterns

### Cron-Based Scheduling
**Pattern**: Regular health checks with intelligent alerting
```bash
# Monitoring schedule examples
*/20 * * * * /path/to/tdarr-timeout-monitor.sh    # Every 20 minutes
0 */6 * * * /path/to/cleanup-temp-dirs.sh         # Every 6 hours
0 2 * * * /path/to/backup-monitor.sh              # Daily at 2 AM
```

### Event-Driven Monitoring
**Pattern**: Reactive monitoring for critical events
- **System Startup**: Windows boot detection
- **Service Failures**: Container restart alerts
- **Resource Exhaustion**: Disk space warnings
- **Security Events**: Failed login attempts
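
As one example of these triggers, a disk space warning can be sketched with the standard library. The 80% and 90% thresholds and the function names are assumptions for illustration, not values from the existing scripts.

```python
import shutil
from typing import Optional


def disk_usage_percent(path: str = "/") -> float:
    """Return the percentage of the filesystem at `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100


def disk_alert(
    percent_used: float, warn_at: float = 80.0, critical_at: float = 90.0
) -> Optional[str]:
    """Map a usage percentage to "warning", "critical", or None."""
    if percent_used >= critical_at:
        return "critical"
    if percent_used >= warn_at:
        return "warning"
    return None


# Usage: level = disk_alert(disk_usage_percent("/"))
```

Separating measurement from classification lets the thresholds be tested and tuned without touching a real filesystem.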

## Data Collection and Analysis

### Log Management
**Pattern**: Centralized logging with rotation
```bash
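# Log rotation configuration
LOG_FILE="/var/log/homelab-monitor.log"
MAX_SIZE_BYTES=$((10 * 1024 * 1024))   # 10 MiB
RETENTION_DAYS=30

# Rotate the log once it exists and exceeds the size limit
if [ -f "$LOG_FILE" ] && [ "$(stat -c%s "$LOG_FILE")" -gt "$MAX_SIZE_BYTES" ]; then
    # The timestamp includes the time so a second rotation on the same day does not overwrite
    mv "$LOG_FILE" "$LOG_FILE.$(date +%Y%m%d-%H%M%S)"
    touch "$LOG_FILE"
fi

# Delete rotated logs older than the retention period
find "$(dirname "$LOG_FILE")" -name "$(basename "$LOG_FILE").*" -mtime +"$RETENTION_DAYS" -delete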
```

### Metrics Collection
**Pattern**: Time-series data for trend analysis
- **System Metrics**: CPU, memory, disk usage
- **Service Metrics**: Response times, error rates
- **Application Metrics**: Transcoding progress, queue sizes
- **Network Metrics**: Bandwidth usage, latency
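
To illustrate trend analysis on collected samples, the sketch below fits a least-squares line to disk usage readings and estimates how many days remain until the disk is full. The `Sample` structure and the linear-growth assumption are illustrative only; no specific metrics store is implied.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Sample:
    day: float           # measurement time, in days since an arbitrary start
    used_percent: float  # disk usage at that time


def slope_per_day(samples: Sequence[Sample]) -> float:
    """Least-squares slope of used_percent over time (percent per day)."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = sum(s.day for s in samples) / n
    mean_y = sum(s.used_percent for s in samples) / n
    denom = sum((s.day - mean_x) ** 2 for s in samples)
    if denom == 0:
        raise ValueError("samples must cover more than one point in time")
    num = sum((s.day - mean_x) * (s.used_percent - mean_y) for s in samples)
    return num / denom


def days_until_full(samples: Sequence[Sample]) -> Optional[float]:
    """Extrapolate the trend to 100%; None if usage is flat or shrinking."""
    slope = slope_per_day(samples)
    if slope <= 0:
        return None
    latest = max(samples, key=lambda s: s.day)
    return (100.0 - latest.used_percent) / slope


# Made-up readings growing 2% per day from 70%: about 13 days of headroom remain.
readings = [Sample(0, 70.0), Sample(1, 72.0), Sample(2, 74.0)]
remaining = days_until_full(readings)
```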

## Alert Integration

### Discord Notification System
**Pattern**: Rich, actionable notifications
```markdown
# Professional alert format
**🔧 System Maintenance**
Service: Tdarr Transcoding
Issue: 3 files timed out in staging
Resolution: Automatic cleanup completed
Status: System operational

Manual review recommended <@user_id>
```

### Alert Escalation
**Pattern**: Tiered alerting based on severity
1. **Info**: Routine maintenance completed
2. **Warning**: Service degradation detected
3. **Critical**: Service failure requiring immediate attention
4. **Emergency**: System-wide failure requiring manual intervention
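
The four tiers above map naturally onto an ordered enum. In this sketch, the policy of mentioning a user only at Critical and above is an assumption for illustration, and `format_alert` is a hypothetical helper rather than part of the existing scripts.

```python
from enum import IntEnum


class Severity(IntEnum):
    INFO = 1
    WARNING = 2
    CRITICAL = 3
    EMERGENCY = 4


def format_alert(severity: Severity, service: str, issue: str, user_id: str = "user_id") -> str:
    """Format an alert; mention the user only for CRITICAL and above (assumed policy)."""
    lines = [f"**[{severity.name}] {service}**", f"Issue: {issue}"]
    if severity >= Severity.CRITICAL:
        lines.append(f"<@{user_id}>")  # Discord user mention
    return "\n".join(lines)


# Usage: format_alert(Severity.CRITICAL, "Tdarr", "Service unreachable", "1234")
```

Using `IntEnum` keeps comparisons such as `severity >= Severity.CRITICAL` readable and makes it easy to add routing rules per tier.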

## Best Practices Implementation

### Monitoring Strategy
1. **Proactive**: Monitor trends to predict issues
2. **Reactive**: Alert on current failures
3. **Preventive**: Automated cleanup and maintenance
4. **Comprehensive**: Cover all critical services
5. **Actionable**: Provide clear resolution paths

### Performance Optimization
1. **Efficient Polling**: Balance monitoring frequency with resource usage
2. **Smart Alerting**: Avoid alert fatigue with intelligent filtering
3. **Resource Management**: Monitor the monitoring system itself
4. **Scalable Architecture**: Design for growth and additional services
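
One simple way to approach the alert filtering described above is a per-alert cooldown that suppresses repeats of the same alert within a time window. `AlertThrottle` and the 15-minute default are hypothetical choices for this sketch.

```python
import time
from typing import Dict, Optional


class AlertThrottle:
    """Suppress repeat alerts for the same key within a cooldown window."""

    def __init__(self, cooldown_seconds: float = 900.0):
        self.cooldown_seconds = cooldown_seconds
        self._last_sent: Dict[str, float] = {}

    def should_send(self, key: str, now: Optional[float] = None) -> bool:
        """Return True and record the send time if the key is outside its cooldown."""
        now = time.monotonic() if now is None else now
        last = self._last_sent.get(key)
        if last is not None and now - last < self.cooldown_seconds:
            return False
        self._last_sent[key] = now
        return True


# Usage:
#   throttle = AlertThrottle(cooldown_seconds=900)
#   if throttle.should_send("tdarr:staging-timeout"):
#       ...  # send the notification
```

The key can combine service and issue type so that distinct problems on the same service still alert independently.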

This technology context provides the foundation for implementing comprehensive monitoring and alerting in home lab environments.