Complete restructure from patterns/examples/reference to technology-focused directories: • Created technology-specific directories with comprehensive documentation: - /tdarr/ - Transcoding automation with gaming-aware scheduling - /docker/ - Container management with GPU acceleration patterns - /vm-management/ - Virtual machine automation and cloud-init - /networking/ - SSH infrastructure, reverse proxy, and security - /monitoring/ - System health checks and Discord notifications - /databases/ - Database patterns and troubleshooting - /development/ - Programming language patterns (bash, nodejs, python, vuejs) • Enhanced CLAUDE.md with intelligent context loading: - Technology-first loading rules for automatic context provision - Troubleshooting keyword triggers for emergency scenarios - Documentation maintenance protocols with automated reminders - Context window management for optimal documentation updates • Preserved valuable content from .claude/tmp/: - SSH security improvements and server inventory - Tdarr CIFS troubleshooting and Docker iptables solutions - Operational scripts with proper technology classification • Benefits achieved: - Self-contained technology directories with complete context - Automatic loading of relevant documentation based on keywords - Emergency-ready troubleshooting with comprehensive guides - Scalable structure for future technology additions - Eliminated context bloat through targeted loading 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
142 lines
4.6 KiB
Markdown
142 lines
4.6 KiB
Markdown
# System Monitoring and Alerting - Technology Context
|
|
|
|
## Overview
|
|
Comprehensive monitoring and alerting system for home lab infrastructure with focus on automated health checks, Discord notifications, and proactive system maintenance.
|
|
|
|
## Architecture Patterns
|
|
|
|
### Distributed Monitoring Strategy
|
|
**Pattern**: Service-specific monitoring with centralized alerting
|
|
- **Tdarr Monitoring**: API-based transcoding health checks
|
|
- **Windows Desktop Monitoring**: Reboot detection and system events
|
|
- **Network Monitoring**: Connectivity and service availability
|
|
- **Container Monitoring**: Docker/Podman health and resource usage
|
|
|
|
### Alert Management
|
|
**Pattern**: Structured notifications with actionable information
|
|
```bash
|
|
# Discord webhook integration
|
|
curl -X POST "$DISCORD_WEBHOOK" \
|
|
-H "Content-Type: application/json" \
|
|
-d '{
|
|
"content": "**System Alert**\n```\nService: Tdarr\nIssue: Staging timeout\nAction: Automatic cleanup performed\n```\n<@user_id>"
|
|
}'
|
|
```
|
|
|
|
## Core Monitoring Components
|
|
|
|
### Tdarr System Monitoring
|
|
**Purpose**: Monitor transcoding pipeline health and performance
|
|
**Location**: `scripts/tdarr_monitor.py`
|
|
|
|
**Key Features**:
|
|
- API-based status monitoring with dataclass structures
|
|
- Staging section timeout detection and cleanup
|
|
- Discord notifications with professional formatting
|
|
- Log rotation and retention management
|
|
|
|
### Windows Desktop Monitoring
|
|
**Purpose**: Track Windows system reboots and power events
|
|
**Location**: `scripts/windows-desktop/`
|
|
|
|
**Components**:
|
|
- PowerShell monitoring script
|
|
- Scheduled task automation
|
|
- Discord notification integration
|
|
- System event correlation
|
|
|
|
### Network and Service Monitoring
|
|
**Purpose**: Monitor critical infrastructure availability
|
|
**Implementation**:
|
|
```bash
|
|
# Service health check pattern
|
|
SERVICES="https://homelab.local http://nas.homelab.local"
|
|
for service in $SERVICES; do
|
|
if curl -sSf --max-time 10 "$service" >/dev/null 2>&1; then
|
|
echo "✅ $service: Available"
|
|
else
|
|
echo "❌ $service: Failed" | send_alert
|
|
fi
|
|
done
|
|
```
|
|
|
|
## Automation Patterns
|
|
|
|
### Cron-Based Scheduling
|
|
**Pattern**: Regular health checks with intelligent alerting
|
|
```bash
|
|
# Monitoring schedule examples
|
|
*/20 * * * * /path/to/tdarr-timeout-monitor.sh # Every 20 minutes
|
|
0 */6 * * * /path/to/cleanup-temp-dirs.sh # Every 6 hours
|
|
0 2 * * * /path/to/backup-monitor.sh # Daily at 2 AM
|
|
```
|
|
|
|
### Event-Driven Monitoring
|
|
**Pattern**: Reactive monitoring for critical events
|
|
- **System Startup**: Windows boot detection
|
|
- **Service Failures**: Container restart alerts
|
|
- **Resource Exhaustion**: Disk space warnings
|
|
- **Security Events**: Failed login attempts
|
|
|
|
## Data Collection and Analysis
|
|
|
|
### Log Management
|
|
**Pattern**: Centralized logging with rotation
|
|
```bash
|
|
# Log rotation configuration
|
|
LOG_FILE="/var/log/homelab-monitor.log"
|
|
MAX_SIZE="10M"
|
|
RETENTION_DAYS=30
|
|
|
|
# Rotate logs when size exceeded
|
|
if [ $(stat -c%s "$LOG_FILE") -gt $((10*1024*1024)) ]; then
|
|
mv "$LOG_FILE" "$LOG_FILE.$(date +%Y%m%d)"
|
|
touch "$LOG_FILE"
|
|
fi
|
|
```
|
|
|
|
### Metrics Collection
|
|
**Pattern**: Time-series data for trend analysis
|
|
- **System Metrics**: CPU, memory, disk usage
|
|
- **Service Metrics**: Response times, error rates
|
|
- **Application Metrics**: Transcoding progress, queue sizes
|
|
- **Network Metrics**: Bandwidth usage, latency
|
|
|
|
## Alert Integration
|
|
|
|
### Discord Notification System
|
|
**Pattern**: Rich, actionable notifications
|
|
```markdown
|
|
# Professional alert format
|
|
**🔧 System Maintenance**
|
|
Service: Tdarr Transcoding
|
|
Issue: 3 files timed out in staging
|
|
Resolution: Automatic cleanup completed
|
|
Status: System operational
|
|
|
|
Manual review recommended <@user_id>
|
|
```
|
|
|
|
### Alert Escalation
|
|
**Pattern**: Tiered alerting based on severity
|
|
1. **Info**: Routine maintenance completed
|
|
2. **Warning**: Service degradation detected
|
|
3. **Critical**: Service failure requiring immediate attention
|
|
4. **Emergency**: System-wide failure requiring manual intervention
|
|
|
|
## Best Practices Implementation
|
|
|
|
### Monitoring Strategy
|
|
1. **Proactive**: Monitor trends to predict issues
|
|
2. **Reactive**: Alert on current failures
|
|
3. **Preventive**: Automated cleanup and maintenance
|
|
4. **Comprehensive**: Cover all critical services
|
|
5. **Actionable**: Provide clear resolution paths
|
|
|
|
### Performance Optimization
|
|
1. **Efficient Polling**: Balance monitoring frequency with resource usage
|
|
2. **Smart Alerting**: Avoid alert fatigue with intelligent filtering
|
|
3. **Resource Management**: Monitor the monitoring system itself
|
|
4. **Scalable Architecture**: Design for growth and additional services
|
|
|
|
This technology context provides the foundation for implementing comprehensive monitoring and alerting in home lab environments. |