Tdarr Monitoring Configuration - Discord Integration
Overview
This document describes the active Discord monitoring system for Tdarr worker timeouts, stalls, and operational status across the homelab environment.
System Architecture
Components
- Tdarr Server: ubuntu-ct (10.10.0.43) running tdarr-clean container
- Tdarr Node: nobara-pc (unmapped architecture) running tdarr-node-gpu-unmapped
- Monitoring Script: /scripts/monitoring/tdarr-timeout-monitor.sh
- Discord Integration: Webhook-based notifications to designated channel
Monitoring Targets
✅ Server Limbo Timeouts - Files stuck in staging beyond timeout period
✅ Node Worker Stalls - Workers that hang during transcoding operations
✅ Worker Disconnections - Unexpected worker disconnects
✅ Success Notifications - Optional completion tracking (currently disabled)
Active Configuration
Script Location
/mnt/NV2/Development/claude-home/scripts/monitoring/tdarr-timeout-monitor.sh
Key Settings
- Polling Interval: 15 minutes (900 seconds)
- Discord Webhook: Configured and active
- Server Connection: SSH alias tdarr → ubuntu-ct (10.10.0.43)
- Node Container: tdarr-node-gpu-unmapped via Podman
- Working Directory: /tmp/tdarr-monitor/
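The polling interval and working directory above suggest a simple "is a check due yet?" gate. The following is a minimal sketch of that idea, not the deployed script: the function name check_is_due and the STATE_DIR and POLL_INTERVAL variables are hypothetical, while the last_check.timestamp file name is taken from the Force Monitoring Check commands later in this document.

```shell
#!/usr/bin/env bash
# Sketch only: decide whether the polling interval has elapsed.
# STATE_DIR, POLL_INTERVAL and check_is_due are illustrative names.
STATE_DIR="${STATE_DIR:-/tmp/tdarr-monitor}"
POLL_INTERVAL="${POLL_INTERVAL:-900}"   # 15 minutes, in seconds

# check_is_due NOW_EPOCH
# Returns 0 when at least POLL_INTERVAL seconds have passed since the
# recorded timestamp (or when no timestamp exists yet), 1 otherwise.
check_is_due() {
    local now="$1"
    local stamp_file="$STATE_DIR/last_check.timestamp"
    local last=0
    if [ -r "$stamp_file" ]; then
        last="$(cat "$stamp_file")"
    fi
    last="${last:-0}"   # treat an empty file as "never checked"
    [ $(( now - last )) -ge "$POLL_INTERVAL" ]
}
```

A caller would typically run check_is_due "$(date +%s)" and, when it succeeds, write the current epoch time back to the timestamp file after the check completes.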
Monitored Log Patterns
Server Timeout Events
# Pattern: "has been in limbo for X seconds, removing from staging section"
grep -i "has been in limbo"
Example Alert:
⚠️ 4 file(s) timed out in staging:
TV/Survivor/Season 48/Survivor (2000) - S48E04...
TV/Survivor/Season 48/Survivor (2000) - S48E11...
TV/Survivor/Season 26/Survivor (2000) - S26E05...
TV/Survivor/Season 22/Survivor (2000) - S22E13...
Files were removed from staging and will retry.
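As a rough illustration of how the limbo pattern can be turned into an alert, the sketch below counts matching log lines on standard input and formats a summary. The function names are hypothetical, and the sketch deliberately does not extract file paths, because the exact layout of the server log line beyond the quoted phrase is not shown in this document.

```shell
#!/usr/bin/env bash
# Sketch only: count "limbo" events in log text read from stdin.
count_limbo_events() {
    # grep -c prints 0 and exits non-zero when nothing matches,
    # so the exit status is masked to keep callers simple.
    grep -ci "has been in limbo" || true
}

# format_limbo_alert COUNT
# Prints an alert body for COUNT > 0; returns 1 (and prints nothing) otherwise.
format_limbo_alert() {
    local count="$1"
    [ "$count" -gt 0 ] || return 1
    printf '⚠️ %d file(s) timed out in staging.\n' "$count"
    printf 'Files were removed from staging and will retry.\n'
}
```

For example, piping recent server logs into count_limbo_events and passing the result to format_limbo_alert yields a message only when at least one timeout occurred.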
Node Worker Events
# Patterns: "Worker X has stalled, cancelling" | "Worker X disconnected. Pruning."
grep -i "worker.*stalled\|worker.*disconnected"
Example Alert:
🔴 4 worker stall(s) detected:
Worker eager-eyas
Worker oblong-owl
Worker other-olm
Worker dry-dugong
Workers were cancelled and will restart.
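The worker names in an alert like the one above can be pulled out of the two quoted log phrases. This is a sketch under the assumption that each event appears on its own line and contains the phrase exactly as quoted; the function names are hypothetical and any prefix before "Worker" (timestamps, log levels) is ignored.

```shell
#!/usr/bin/env bash
# Sketch only: extract worker names from node log text read from stdin.

# Names of workers reported as "Worker X has stalled, cancelling".
stalled_workers() {
    sed -nE 's/.*Worker ([^ ]+) has stalled, cancelling.*/\1/p' | sort -u
}

# Names of workers reported as "Worker X disconnected. Pruning."
disconnected_workers() {
    sed -nE 's/.*Worker ([^ ]+) disconnected\. Pruning\..*/\1/p' | sort -u
}
```

The sort -u step deduplicates names, so a worker that stalls several times within one polling window is listed once.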
Historical Context
Problem Timeline
- Initial Issue: Multiple workers stalling due to resource competition
- Worker Reduction: Reduced to 1 CPU + 1 GPU worker each
- Timeout Increase: Extended staging timeout from 5 minutes to 15-20 minutes
- SMB Optimization: Improved file transfer from 61.8 MB/s to 103 MB/s
- Monitoring Implementation: Custom Discord alerts for proactive issue detection
Root Causes Addressed
- Large File Timeouts: TV episodes (3-8GB+) exceeding 5-minute staging timeout
- Worker Competition: Multiple workers competing for GPU resources
- Network Performance: Slow SMB transfers causing download delays
- Node Instability: Workers hanging during complex transcoding flows
Deployment Status
Current State
✅ Script Deployed: Located and executable
✅ Discord Webhook: Configured and tested
✅ Function Order: Fixed (send_discord_notification defined before use)
✅ Polling Interval: Set to 15 minutes
✅ Initial Test: "monitoring started" message sent successfully
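The "Function Order" item reflects how shell scripts resolve function names: a function must already be defined when the line that calls it executes, otherwise the shell reports "command not found". The sketch below shows the safe layout; the body of send_discord_notification is a stand-in for illustration, not the real implementation.

```shell
#!/usr/bin/env bash
# Sketch only: define helpers first, call them afterwards.

send_discord_notification() {
    # Placeholder body; the deployed function posts to the Discord webhook.
    printf 'would send: %s\n' "$1"
}

main() {
    send_discord_notification "Tdarr timeout monitoring started"
}

# The entry point is invoked last, after every helper has been defined.
main
```

Keeping all logic inside functions and calling main on the final line avoids ordering problems as the script grows.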
Automation Setup
Status: Ready for automation - choose deployment method:
Option A: Cron Job
# Edit user crontab
crontab -e
# Add monitoring every 15 minutes
*/15 * * * * /mnt/NV2/Development/claude-home/scripts/monitoring/tdarr-timeout-monitor.sh >/dev/null 2>&1
Option B: Systemd Service (Recommended)
# Service file
sudo tee /etc/systemd/system/tdarr-monitor.service > /dev/null <<EOF
[Unit]
Description=Tdarr Timeout Monitor
After=network.target
[Service]
Type=oneshot
User=cal
ExecStart=/mnt/NV2/Development/claude-home/scripts/monitoring/tdarr-timeout-monitor.sh
EOF
# Timer file
sudo tee /etc/systemd/system/tdarr-monitor.timer > /dev/null <<EOF
[Unit]
Description=Run Tdarr Monitor every 15 minutes
Requires=tdarr-monitor.service
[Timer]
OnCalendar=*:0/15
Persistent=true
[Install]
WantedBy=timers.target
EOF
# Enable and start
sudo systemctl daemon-reload
sudo systemctl enable tdarr-monitor.timer
sudo systemctl start tdarr-monitor.timer
Discord Integration Details
Webhook Configuration
- Channel: Designated homelab monitoring channel
- Message Format: Rich embeds with color coding
- Alert Colors: Red (15158332) for errors, Green (3066993) for success
- Content: File paths, worker names, timestamps, hostname
Message Examples
Successful Start
{
"title": "🎬 Tdarr Monitoring Alert",
"description": "Tdarr timeout monitoring started",
"color": 3066993,
"timestamp": "2025-08-10T14:19:55.000Z",
"footer": {
"text": "nobara-pc - Sun Aug 10 09:19:55 AM CDT 2025"
}
}
Timeout Alert
{
"title": "🎬 Tdarr Monitoring Alert",
"description": "⚠️ **4 file(s) timed out in staging:**\n```\nTV/Survivor/Season 48/...\n```\nFiles were removed from staging and will retry.",
"color": 15158332,
"timestamp": "2025-08-10T14:49:00.000Z"
}
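The embed objects shown above need to be wrapped in an "embeds" array when posted to a Discord webhook, and any quotes or newlines in the description must be JSON-escaped. The following is a minimal sketch of that step using only sed and awk; json_escape and build_embed_payload are hypothetical helper names, and the escaping covers backslashes, double quotes, and newlines only.

```shell
#!/usr/bin/env bash
# Sketch only: build a webhook payload containing one embed.

# json_escape TEXT
# Escapes backslashes and double quotes, and joins lines with a literal \n.
json_escape() {
    printf '%s' "$1" \
        | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g' \
        | awk 'NR > 1 { printf "%s", "\\n" } { printf "%s", $0 }'
}

# build_embed_payload DESCRIPTION COLOR TIMESTAMP
# COLOR is the decimal value used above (15158332 red, 3066993 green).
build_embed_payload() {
    local description="$1" color="$2" timestamp="$3"
    printf '{"embeds":[{"title":"%s","description":"%s","color":%d,"timestamp":"%s"}]}' \
        "🎬 Tdarr Monitoring Alert" \
        "$(json_escape "$description")" \
        "$color" \
        "$timestamp"
}
```

The resulting string can be passed to curl with -d and a Content-Type of application/json, in the same way as the manual webhook test in the Troubleshooting section.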
Performance Impact
System Resources
- CPU Usage: Minimal - quick log parsing every 15 minutes
- Network Impact: One SSH connection + Docker logs query per check
- Storage: < 1MB in /tmp/tdarr-monitor/
- Execution Time: ~2-5 seconds per check
Benefits
- Proactive Issue Detection: Know about problems before manual checking
- Historical Tracking: Discord provides persistent log of all alerts
- Mobile Notifications: Get alerts on phone via Discord app
- Reduced Manual Monitoring: Automated awareness of system health
Troubleshooting
Common Issues
No Discord Messages
# Test webhook manually
curl -H "Content-Type: application/json" -X POST \
-d '{"content":"Manual test message"}' \
"https://discord.com/api/webhooks/1404105821549498398/y2Ud1RK9rzFjv58xbypUfQNe3jrL7ZUq1FkQHa4_dfOHm2ylp93z0f4tY0O8Z-vQgKhD"
# Check script execution
/mnt/NV2/Development/claude-home/scripts/monitoring/tdarr-timeout-monitor.sh
SSH Connection Issues
# Test server connection
ssh tdarr "echo 'SSH working'"
# Check SSH key authentication
ssh-add -l
Podman Container Access
# Verify container is running
podman ps | grep tdarr-node
# Test log access
podman logs --tail 5 tdarr-node-gpu-unmapped
Force Monitoring Check
# Reset timestamp to trigger immediate check
echo $(($(date +%s) - 1000)) > /tmp/tdarr-monitor/last_check.timestamp
# Run manual check
/mnt/NV2/Development/claude-home/scripts/monitoring/tdarr-timeout-monitor.sh
Security Considerations
Webhook Protection
- URL Security: Webhook URL contains sensitive token
- Access Control: Only monitoring script has access to webhook
- Scope Limitation: Webhook only has permission to post messages
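One way to keep the webhook token out of scripts and documentation is to read the URL from a separate file that only the monitoring user can read. This is a sketch of that approach, not a description of the current setup: the file location and the load_webhook_url function are assumptions introduced for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: load the webhook URL from a private file.
# WEBHOOK_FILE is a hypothetical location; any path restricted to the
# monitoring user would serve the same purpose.
WEBHOOK_FILE="${WEBHOOK_FILE:-$HOME/.config/tdarr-monitor/webhook_url}"

# load_webhook_url FILE
# Prints the URL on success; returns 1 if the file is unreadable or the
# first line does not look like a Discord webhook URL.
load_webhook_url() {
    local file="$1" url=""
    if [ ! -r "$file" ]; then
        echo "webhook file not readable: $file" >&2
        return 1
    fi
    # Read the first line; tolerate a missing trailing newline.
    IFS= read -r url < "$file" || [ -n "$url" ]
    case "$url" in
        https://discord.com/api/webhooks/*) printf '%s\n' "$url" ;;
        *) echo "unexpected webhook URL format in $file" >&2; return 1 ;;
    esac
}
```

With this layout the file would be created once with chmod 600, and the script would call load_webhook_url "$WEBHOOK_FILE" at startup instead of embedding the URL.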
SSH Access
- Key-based Authentication: No password authentication used
- Limited Commands: Only Docker logs and basic system commands
- Network Isolation: SSH connection within trusted homelab network
File Permissions
# Verify script permissions
ls -la /mnt/NV2/Development/claude-home/scripts/monitoring/tdarr-timeout-monitor.sh
# Should be: -rwxr-xr-x (755) or more restrictive
# Secure if needed
chmod 750 /mnt/NV2/Development/claude-home/scripts/monitoring/tdarr-timeout-monitor.sh
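The permission check above can also be automated. The sketch below reports whether users outside the owner and group have any access to a file; it assumes GNU stat (the -c '%a' form, as found on typical Linux systems), and the function name others_have_access is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: detect whether "others" have any permission bits set.

# others_have_access FILE
# Returns 0 if the last octal digit of the mode is non-zero,
# 1 if it is zero, and 2 if the mode could not be read.
others_have_access() {
    local mode
    mode="$(stat -c '%a' "$1")" || return 2   # GNU stat syntax (assumption)
    case "$mode" in
        *0) return 1 ;;   # e.g. 750: no access for others
        *)  return 0 ;;   # e.g. 755: others can read and execute
    esac
}
```

A periodic check could call this on the monitoring script and post a warning if it ever returns 0.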
Maintenance
Regular Checks
- Weekly: Verify Discord messages are still being received
- Monthly: Check /tmp/tdarr-monitor/ directory size
- Quarterly: Review alert frequency and adjust thresholds if needed
Log Rotation
# Clean old monitoring logs if they accumulate
find /tmp/tdarr-monitor/ -name "*.log" -mtime +7 -delete
Updates
- Script Updates: Version control via git in homelab documentation
- Webhook Rotation: Update webhook URL if Discord server changes
- Threshold Tuning: Adjust 15-minute interval based on operational experience
Integration with Other Systems
Future Enhancements
- Grafana Integration: Add metrics collection for dashboard visualization
- Prometheus Metrics: Export timing and error rate metrics
- Home Assistant: Integrate with home automation for additional alerting
- Email Backup: Secondary notification method for critical alerts
Related Documentation
- NAS Mount Configuration - SMB optimization context
- Tdarr Troubleshooting - Worker timeout background
- SSH Key Management - Server access setup
Status: ✅ Active and Configured
Last Updated: August 10, 2025
Next Review: September 10, 2025
Discord Channel: Homelab monitoring alerts configured and tested