# System Monitoring and Alerting - Technology Context ## Overview Comprehensive monitoring and alerting system for home lab infrastructure with focus on automated health checks, Discord notifications, and proactive system maintenance. ## Architecture Patterns ### Distributed Monitoring Strategy **Pattern**: Service-specific monitoring with centralized alerting - **Tdarr Monitoring**: API-based transcoding health checks - **Windows Desktop Monitoring**: Reboot detection and system events - **Network Monitoring**: Connectivity and service availability - **Container Monitoring**: Docker/Podman health and resource usage ### Alert Management **Pattern**: Structured notifications with actionable information ```bash # Discord webhook integration curl -X POST "$DISCORD_WEBHOOK" \ -H "Content-Type: application/json" \ -d '{ "content": "**System Alert**\n```\nService: Tdarr\nIssue: Staging timeout\nAction: Automatic cleanup performed\n```\n<@user_id>" }' ``` ## Core Monitoring Components ### Tdarr System Monitoring **Purpose**: Monitor transcoding pipeline health and performance **Location**: `scripts/tdarr_monitor.py` **Key Features**: - API-based status monitoring with dataclass structures - Staging section timeout detection and cleanup - Discord notifications with professional formatting - Log rotation and retention management ### Windows Desktop Monitoring **Purpose**: Track Windows system reboots and power events **Location**: `scripts/windows-desktop/` **Components**: - PowerShell monitoring script - Scheduled task automation - Discord notification integration - System event correlation ### Network and Service Monitoring **Purpose**: Monitor critical infrastructure availability **Implementation**: ```bash # Service health check pattern SERVICES="https://homelab.local http://nas.homelab.local" for service in $SERVICES; do if curl -sSf --max-time 10 "$service" >/dev/null 2>&1; then echo "✅ $service: Available" else echo "❌ $service: Failed" | send_alert fi done ``` ## Automation Patterns ### Cron-Based Scheduling **Pattern**: Regular health checks with intelligent alerting ```bash # Monitoring schedule examples */20 * * * * /path/to/tdarr-timeout-monitor.sh # Every 20 minutes 0 */6 * * * /path/to/cleanup-temp-dirs.sh # Every 6 hours 0 2 * * * /path/to/backup-monitor.sh # Daily at 2 AM ``` ### Event-Driven Monitoring **Pattern**: Reactive monitoring for critical events - **System Startup**: Windows boot detection - **Service Failures**: Container restart alerts - **Resource Exhaustion**: Disk space warnings - **Security Events**: Failed login attempts ## Data Collection and Analysis ### Log Management **Pattern**: Centralized logging with rotation ```bash # Log rotation configuration LOG_FILE="/var/log/homelab-monitor.log" MAX_SIZE="10M" RETENTION_DAYS=30 # Rotate logs when size exceeded if [ $(stat -c%s "$LOG_FILE") -gt $((10*1024*1024)) ]; then mv "$LOG_FILE" "$LOG_FILE.$(date +%Y%m%d)" touch "$LOG_FILE" fi ``` ### Metrics Collection **Pattern**: Time-series data for trend analysis - **System Metrics**: CPU, memory, disk usage - **Service Metrics**: Response times, error rates - **Application Metrics**: Transcoding progress, queue sizes - **Network Metrics**: Bandwidth usage, latency ## Alert Integration ### Discord Notification System **Pattern**: Rich, actionable notifications ```markdown # Professional alert format **🔧 System Maintenance** Service: Tdarr Transcoding Issue: 3 files timed out in staging Resolution: Automatic cleanup completed Status: System operational Manual review recommended <@user_id> ``` ### Alert Escalation **Pattern**: Tiered alerting based on severity 1. **Info**: Routine maintenance completed 2. **Warning**: Service degradation detected 3. **Critical**: Service failure requiring immediate attention 4. **Emergency**: System-wide failure requiring manual intervention ## Best Practices Implementation ### Monitoring Strategy 1. **Proactive**: Monitor trends to predict issues 2. **Reactive**: Alert on current failures 3. **Preventive**: Automated cleanup and maintenance 4. **Comprehensive**: Cover all critical services 5. **Actionable**: Provide clear resolution paths ### Performance Optimization 1. **Efficient Polling**: Balance monitoring frequency with resource usage 2. **Smart Alerting**: Avoid alert fatigue with intelligent filtering 3. **Resource Management**: Monitor the monitoring system itself 4. **Scalable Architecture**: Design for growth and additional services This technology context provides the foundation for implementing comprehensive monitoring and alerting in home lab environments.