Complete restructure from patterns/examples/reference to technology-focused directories: • Created technology-specific directories with comprehensive documentation: - /tdarr/ - Transcoding automation with gaming-aware scheduling - /docker/ - Container management with GPU acceleration patterns - /vm-management/ - Virtual machine automation and cloud-init - /networking/ - SSH infrastructure, reverse proxy, and security - /monitoring/ - System health checks and Discord notifications - /databases/ - Database patterns and troubleshooting - /development/ - Programming language patterns (bash, nodejs, python, vuejs) • Enhanced CLAUDE.md with intelligent context loading: - Technology-first loading rules for automatic context provision - Troubleshooting keyword triggers for emergency scenarios - Documentation maintenance protocols with automated reminders - Context window management for optimal documentation updates • Preserved valuable content from .claude/tmp/: - SSH security improvements and server inventory - Tdarr CIFS troubleshooting and Docker iptables solutions - Operational scripts with proper technology classification • Benefits achieved: - Self-contained technology directories with complete context - Automatic loading of relevant documentation based on keywords - Emergency-ready troubleshooting with comprehensive guides - Scalable structure for future technology additions - Eliminated context bloat through targeted loading 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
4.6 KiB
System Monitoring and Alerting - Technology Context
Overview
Comprehensive monitoring and alerting system for home lab infrastructure with focus on automated health checks, Discord notifications, and proactive system maintenance.
Architecture Patterns
Distributed Monitoring Strategy
Pattern: Service-specific monitoring with centralized alerting
- Tdarr Monitoring: API-based transcoding health checks
- Windows Desktop Monitoring: Reboot detection and system events
- Network Monitoring: Connectivity and service availability
- Container Monitoring: Docker/Podman health and resource usage
Alert Management
Pattern: Structured notifications with actionable information
# Discord webhook integration
curl -X POST "$DISCORD_WEBHOOK" \
-H "Content-Type: application/json" \
-d '{
"content": "**System Alert**\n```\nService: Tdarr\nIssue: Staging timeout\nAction: Automatic cleanup performed\n```\n<@user_id>"
}'
Core Monitoring Components
Tdarr System Monitoring
Purpose: Monitor transcoding pipeline health and performance
Location: scripts/tdarr_monitor.py
Key Features:
- API-based status monitoring with dataclass structures
- Staging section timeout detection and cleanup
- Discord notifications with professional formatting
- Log rotation and retention management
Windows Desktop Monitoring
Purpose: Track Windows system reboots and power events
Location: scripts/windows-desktop/
Components:
- PowerShell monitoring script
- Scheduled task automation
- Discord notification integration
- System event correlation
Network and Service Monitoring
Purpose: Monitor critical infrastructure availability Implementation:
# Service health check pattern
SERVICES="https://homelab.local http://nas.homelab.local"
for service in $SERVICES; do
if curl -sSf --max-time 10 "$service" >/dev/null 2>&1; then
echo "✅ $service: Available"
else
echo "❌ $service: Failed" | send_alert
fi
done
Automation Patterns
Cron-Based Scheduling
Pattern: Regular health checks with intelligent alerting
# Monitoring schedule examples
*/20 * * * * /path/to/tdarr-timeout-monitor.sh # Every 20 minutes
0 */6 * * * /path/to/cleanup-temp-dirs.sh # Every 6 hours
0 2 * * * /path/to/backup-monitor.sh # Daily at 2 AM
Event-Driven Monitoring
Pattern: Reactive monitoring for critical events
- System Startup: Windows boot detection
- Service Failures: Container restart alerts
- Resource Exhaustion: Disk space warnings
- Security Events: Failed login attempts
Data Collection and Analysis
Log Management
Pattern: Centralized logging with rotation
# Log rotation configuration
LOG_FILE="/var/log/homelab-monitor.log"
MAX_SIZE="10M"
RETENTION_DAYS=30
# Rotate logs when size exceeded
if [ $(stat -c%s "$LOG_FILE") -gt $((10*1024*1024)) ]; then
mv "$LOG_FILE" "$LOG_FILE.$(date +%Y%m%d)"
touch "$LOG_FILE"
fi
Metrics Collection
Pattern: Time-series data for trend analysis
- System Metrics: CPU, memory, disk usage
- Service Metrics: Response times, error rates
- Application Metrics: Transcoding progress, queue sizes
- Network Metrics: Bandwidth usage, latency
Alert Integration
Discord Notification System
Pattern: Rich, actionable notifications
# Professional alert format
**🔧 System Maintenance**
Service: Tdarr Transcoding
Issue: 3 files timed out in staging
Resolution: Automatic cleanup completed
Status: System operational
Manual review recommended <@user_id>
Alert Escalation
Pattern: Tiered alerting based on severity
- Info: Routine maintenance completed
- Warning: Service degradation detected
- Critical: Service failure requiring immediate attention
- Emergency: System-wide failure requiring manual intervention
Best Practices Implementation
Monitoring Strategy
- Proactive: Monitor trends to predict issues
- Reactive: Alert on current failures
- Preventive: Automated cleanup and maintenance
- Comprehensive: Cover all critical services
- Actionable: Provide clear resolution paths
Performance Optimization
- Efficient Polling: Balance monitoring frequency with resource usage
- Smart Alerting: Avoid alert fatigue with intelligent filtering
- Resource Management: Monitor the monitoring system itself
- Scalable Architecture: Design for growth and additional services
This technology context provides the foundation for implementing comprehensive monitoring and alerting in home lab environments.