# Tdarr Monitoring Configuration - Discord Integration

## Overview
This document describes the active Discord monitoring system for Tdarr worker timeouts, stalls, and operational status across the homelab environment.

## System Architecture

### Components
- **Tdarr Server**: ubuntu-ct (10.10.0.43) running `tdarr-clean` container
- **Tdarr Node**: nobara-pc (unmapped architecture) running `tdarr-node-gpu-unmapped`
- **Monitoring Script**: `/scripts/monitoring/tdarr-timeout-monitor.sh`
- **Discord Integration**: Webhook-based notifications to designated channel

### Monitoring Targets
- ✅ **Server Limbo Timeouts** - Files stuck in staging beyond the timeout period
- ✅ **Node Worker Stalls** - Workers that hang during transcoding operations
- ✅ **Worker Disconnections** - Unexpected worker disconnects
- ✅ **Success Notifications** - Optional completion tracking (currently disabled)

## Active Configuration

### Script Location
```
/mnt/NV2/Development/claude-home/scripts/monitoring/tdarr-timeout-monitor.sh
```

### Key Settings
- **Polling Interval**: 15 minutes (900 seconds)
- **Discord Webhook**: Configured and active
- **Server Connection**: SSH alias `tdarr` → ubuntu-ct (10.10.0.43)
- **Node Container**: `tdarr-node-gpu-unmapped` via Podman
- **Working Directory**: `/tmp/tdarr-monitor/`

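The settings above imply two log sources: the server container, reached over the `tdarr` SSH alias, and the node container, read locally through Podman. The sketch below shows how a single check might pull recent logs from both and count matching events. It assumes the server container runs under Docker (as noted under Performance Impact) and that a 15-minute `--since` window matches the polling interval. The function names are hypothetical and are not taken from `tdarr-timeout-monitor.sh`.

```bash
# Sketch only: fetch recent logs from both sources and count matching events.
# Assumption: "15m" mirrors the polling interval; the real script may track time differently.
fetch_server_logs() {
    ssh tdarr "docker logs --since 15m tdarr-clean" 2>&1
}

fetch_node_logs() {
    podman logs --since 15m tdarr-node-gpu-unmapped 2>&1
}

count_events() {
    local limbo worker_events
    # grep -c prints 0 and exits non-zero when nothing matches, hence "|| true".
    limbo=$(fetch_server_logs | grep -ci "has been in limbo" || true)
    worker_events=$(fetch_node_logs | grep -ci "worker.*stalled\|worker.*disconnected" || true)
    echo "limbo=$limbo worker_events=$worker_events"
}
```

Calling `count_events` prints one summary line, which is enough to decide whether an alert is worth sending.
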
### Monitored Log Patterns

#### Server Timeout Events
```bash
# Pattern: "has been in limbo for X seconds, removing from staging section"
grep -i "has been in limbo"
```
**Example Alert**:
```
⚠️ 4 file(s) timed out in staging:
TV/Survivor/Season 48/Survivor (2000) - S48E04...
TV/Survivor/Season 48/Survivor (2000) - S48E11...
TV/Survivor/Season 26/Survivor (2000) - S26E05...
TV/Survivor/Season 22/Survivor (2000) - S22E13...
Files were removed from staging and will retry.
```

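An alert like the one above can be assembled from the matching log lines. The sketch below assumes the log lines arrive on stdin and that truncating each line to 60 characters is acceptable. `summarise_limbo_events` is a hypothetical name, and this document does not show how the real script extracts file paths.

```bash
# Sketch only: summarise "limbo" lines read from stdin into an alert body.
summarise_limbo_events() {
    local matches count
    matches=$(grep -i "has been in limbo" || true)   # empty when nothing matched
    if [ -z "$matches" ]; then
        return 1                                     # nothing to report
    fi
    count=$(printf '%s\n' "$matches" | wc -l | tr -d ' ')
    printf '⚠️ %s file(s) timed out in staging:\n' "$count"
    # Truncate long lines to keep the message short (60 is an arbitrary choice).
    printf '%s\n' "$matches" | cut -c1-60
    printf 'Files were removed from staging and will retry.\n'
}
```

The non-zero return on "no matches" lets a caller skip the Discord call entirely when there is nothing to report.
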
#### Node Worker Events
```bash
# Patterns: "Worker X has stalled, cancelling" | "Worker X disconnected. Pruning."
grep -i "worker.*stalled\|worker.*disconnected"
```
**Example Alert**:
```
🔴 4 worker stall(s) detected:
Worker eager-eyas
Worker oblong-owl
Worker other-olm
Worker dry-dugong
Workers were cancelled and will restart.
```

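The worker names in the alert can be pulled out of the two message patterns quoted above. The sketch below assumes the messages appear exactly as written in the pattern comment. Both function names are hypothetical, and unlike the `grep -i` above, the `sed` expressions are case-sensitive.

```bash
# Sketch only: print the unique names of stalled workers found on stdin.
stalled_workers() {
    sed -n 's/.*Worker \([^ ]*\) has stalled.*/\1/p' | sort -u
}

# Same idea for "Worker X disconnected. Pruning."
disconnected_workers() {
    sed -n 's/.*Worker \([^ ]*\) disconnected.*/\1/p' | sort -u
}
```

`sort -u` collapses repeated stalls from the same worker, so the alert lists each name once.
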
## Historical Context

### Problem Timeline
1. **Initial Issue**: Multiple workers stalling due to resource competition
2. **Worker Reduction**: Reduced to 1 CPU + 1 GPU worker each
3. **Timeout Increase**: Extended staging timeout from 5 minutes to 15-20 minutes
4. **SMB Optimization**: Improved file transfer from 61.8 MB/s to 103 MB/s
5. **Monitoring Implementation**: Custom Discord alerts for proactive issue detection

### Root Causes Addressed
- **Large File Timeouts**: TV episodes (3-8GB+) exceeding 5-minute staging timeout
- **Worker Competition**: Multiple workers competing for GPU resources
- **Network Performance**: Slow SMB transfers causing download delays
- **Node Instability**: Workers hanging during complex transcoding flows

## Deployment Status

### Current State
- ✅ **Script Deployed**: Located and executable
- ✅ **Discord Webhook**: Configured and tested
- ✅ **Function Order**: Fixed (`send_discord_notification` defined before use)
- ✅ **Polling Interval**: Set to 15 minutes
- ✅ **Initial Test**: "monitoring started" message sent successfully

### Automation Setup
**Status**: Ready for automation - choose deployment method:

#### Option A: Cron Job
```bash
# Edit user crontab
crontab -e

# Add monitoring every 15 minutes
*/15 * * * * /mnt/NV2/Development/claude-home/scripts/monitoring/tdarr-timeout-monitor.sh >/dev/null 2>&1
```

#### Option B: Systemd Service (Recommended)
```bash
# Service file
sudo tee /etc/systemd/system/tdarr-monitor.service > /dev/null <<EOF
[Unit]
Description=Tdarr Timeout Monitor
After=network.target

[Service]
Type=oneshot
User=cal
ExecStart=/mnt/NV2/Development/claude-home/scripts/monitoring/tdarr-timeout-monitor.sh
EOF

# Timer file
sudo tee /etc/systemd/system/tdarr-monitor.timer > /dev/null <<EOF
[Unit]
Description=Run Tdarr Monitor every 15 minutes
Requires=tdarr-monitor.service

[Timer]
OnCalendar=*:0/15
Persistent=true

[Install]
WantedBy=timers.target
EOF

# Enable and start
sudo systemctl daemon-reload
sudo systemctl enable tdarr-monitor.timer
sudo systemctl start tdarr-monitor.timer
```

## Discord Integration Details

### Webhook Configuration
- **Channel**: Designated homelab monitoring channel
- **Message Format**: Rich embeds with color coding
- **Alert Colors**: Red (15158332) for errors, Green (3066993) for success
- **Content**: File paths, worker names, timestamps, hostname

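Discord embeds take `color` as a decimal integer, so the two values above are the decimal forms of the hex colors `#E74C3C` (red) and `#2ECC71` (green). A small helper makes the conversion explicit. `hex_to_discord_color` is a hypothetical name, not part of the monitoring script.

```bash
# Convert a hex color (with or without a leading '#') to the decimal form Discord expects.
hex_to_discord_color() {
    printf '%d\n' "0x${1#\#}"
}

hex_to_discord_color '#E74C3C'   # prints 15158332 (error red)
hex_to_discord_color '#2ECC71'   # prints 3066993 (success green)
```
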
### Message Examples

#### Successful Start
```json
{
  "title": "🎬 Tdarr Monitoring Alert",
  "description": "Tdarr timeout monitoring started",
  "color": 3066993,
  "timestamp": "2025-08-10T14:19:55.000Z",
  "footer": {
    "text": "nobara-pc - Sun Aug 10 09:19:55 AM CDT 2025"
  }
}
```

#### Timeout Alert
```json
{
  "title": "🎬 Tdarr Monitoring Alert",
  "description": "⚠️ **4 file(s) timed out in staging:**\n```\nTV/Survivor/Season 48/...\n```\nFiles were removed from staging and will retry.",
  "color": 15158332,
  "timestamp": "2025-08-10T14:49:00.000Z"
}
```

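Embed objects like the two above are posted to the webhook inside an `embeds` array. The sketch below shows one way to build that payload in plain bash. The script's own `send_discord_notification` is not reproduced in this document, so `json_escape`, `build_embed_payload`, `send_alert`, and the `DISCORD_WEBHOOK_URL` variable are all hypothetical names. The escaping covers only backslashes, quotes, newlines, and tabs, so a tool such as `jq` would be more robust for arbitrary input.

```bash
#!/usr/bin/env bash
# Sketch only: escape a string for use inside a JSON string literal.
json_escape() {
    local s=$1
    s=${s//\\/\\\\}      # backslashes first, so later escapes are not doubled
    s=${s//\"/\\\"}
    s=${s//$'\n'/\\n}
    s=${s//$'\t'/\\t}
    printf '%s' "$s"
}

# Build the webhook body: one embed with title, description, color, and UTC timestamp.
build_embed_payload() {
    local description=$1 color=$2
    printf '{"embeds":[{"title":"%s","description":"%s","color":%d,"timestamp":"%s"}]}\n' \
        "🎬 Tdarr Monitoring Alert" \
        "$(json_escape "$description")" \
        "$color" \
        "$(date -u +%Y-%m-%dT%H:%M:%S.000Z)"
}

# Post the payload; DISCORD_WEBHOOK_URL is an assumed environment variable.
send_alert() {
    build_embed_payload "$1" "$2" |
        curl -fsS -H "Content-Type: application/json" -X POST -d @- "$DISCORD_WEBHOOK_URL"
}
```

For example, `send_alert "Tdarr timeout monitoring started" 3066993` would post a green embed once the variable is exported.
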
## Performance Impact

### System Resources
- **CPU Usage**: Minimal - quick log parsing every 15 minutes
- **Network Impact**: One SSH connection + Docker logs query per check
- **Storage**: < 1 MB in `/tmp/tdarr-monitor/`
- **Execution Time**: ~2-5 seconds per check

### Benefits
- **Proactive Issue Detection**: Know about problems before manual checking
- **Historical Tracking**: Discord provides persistent log of all alerts
- **Mobile Notifications**: Get alerts on phone via Discord app
- **Reduced Manual Monitoring**: Automated awareness of system health

## Troubleshooting

### Common Issues

#### No Discord Messages
```bash
# Test webhook manually
# DISCORD_WEBHOOK_URL is a placeholder variable: export the real webhook URL first.
# The URL embeds a secret token, so keep it out of documentation and version control.
curl -H "Content-Type: application/json" -X POST \
  -d '{"content":"Manual test message"}' \
  "$DISCORD_WEBHOOK_URL"

# Check script execution
/mnt/NV2/Development/claude-home/scripts/monitoring/tdarr-timeout-monitor.sh
```

#### SSH Connection Issues
```bash
# Test server connection
ssh tdarr "echo 'SSH working'"

# Check SSH key authentication
ssh-add -l
```

#### Podman Container Access
```bash
# Verify container is running
podman ps | grep tdarr-node

# Test log access
podman logs --tail 5 tdarr-node-gpu-unmapped
```

#### Force Monitoring Check
```bash
# Reset timestamp to trigger immediate check
echo $(($(date +%s) - 1000)) > /tmp/tdarr-monitor/last_check.timestamp

# Run manual check
/mnt/NV2/Development/claude-home/scripts/monitoring/tdarr-timeout-monitor.sh
```

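The reset above writes a timestamp 1000 seconds in the past, which is older than the 900-second polling interval. One plausible reading is that the script skips a run unless the stored timestamp is at least one interval old. That gating logic is an assumption inferred from the command above and is not confirmed by the script source, and `check_is_due` is a hypothetical name.

```bash
POLL_INTERVAL=900   # seconds; 15 minutes, per Key Settings

# Return 0 when at least POLL_INTERVAL seconds have passed since the stored timestamp.
check_is_due() {
    local stamp_file=$1 now last
    now=$(date +%s)
    last=$(cat "$stamp_file" 2>/dev/null || true)
    last=${last:-0}                      # missing or empty file counts as "never ran"
    [ $(( now - last )) -ge "$POLL_INTERVAL" ]
}
```

With this reading, `check_is_due /tmp/tdarr-monitor/last_check.timestamp` succeeds right after the reset and fails right after a completed run.
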
## Security Considerations

### Webhook Protection
- **URL Security**: Webhook URL contains sensitive token
- **Access Control**: Only monitoring script has access to webhook
- **Scope Limitation**: Webhook only has permission to post messages

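One way to keep the token out of the script and the documentation is to read the webhook URL from an owner-only file. This is a sketch under assumptions: the file location is up to you, `load_webhook_url` and `DISCORD_WEBHOOK_URL` are hypothetical names, and `stat -c` is the GNU coreutils form.

```bash
# Sketch only: load the webhook URL from a file, refusing files that
# group or others can access (any of the 077 permission bits set).
load_webhook_url() {
    local file=$1 mode
    mode=$(stat -c '%a' "$file") || return 1
    if [ $(( 0$mode & 077 )) -ne 0 ]; then
        echo "refusing $file: mode $mode is accessible beyond the owner" >&2
        return 1
    fi
    DISCORD_WEBHOOK_URL=$(head -n1 "$file")
}
```

A file created with `chmod 600` passes the check, while the default `644` is rejected.
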
### SSH Access
- **Key-based Authentication**: No password authentication used
- **Limited Commands**: Only Docker logs and basic system commands
- **Network Isolation**: SSH connection within trusted homelab network

### File Permissions
```bash
# Verify script permissions
ls -la /mnt/NV2/Development/claude-home/scripts/monitoring/tdarr-timeout-monitor.sh
# Should be: -rwxr-xr-x (755) or more restrictive

# Secure if needed
chmod 750 /mnt/NV2/Development/claude-home/scripts/monitoring/tdarr-timeout-monitor.sh
```

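The same check can be scripted rather than read off `ls` output. The sketch below treats "755 or more restrictive" as "owner-executable and not writable by group or others". `script_perms_ok` is a hypothetical helper, and `stat -c` assumes GNU coreutils.

```bash
# Sketch only: succeed if the file is owner-executable and not group/other-writable.
script_perms_ok() {
    local mode
    mode=$(stat -c '%a' "$1") || return 1        # octal mode, e.g. 750
    [ $(( 0$mode & 0100 )) -ne 0 ] || return 1   # owner execute bit must be set
    [ $(( 0$mode & 022 )) -eq 0 ]                # group/other write bits must be clear
}
```

Modes such as `755`, `750`, and `700` pass. `777` fails the write check, and `640` fails the execute check.
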
## Maintenance

### Regular Checks
- **Weekly**: Verify Discord messages are still being received
- **Monthly**: Check `/tmp/tdarr-monitor/` directory size
- **Quarterly**: Review alert frequency and adjust thresholds if needed

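The monthly size check can be done with `du`. The sketch below compares the directory against a limit in KiB, defaulting to 1024 KiB to match the "< 1 MB" figure under Performance Impact. `dir_within_limit` is a hypothetical name.

```bash
# Sketch only: report a directory's size and fail if it exceeds the limit (in KiB).
dir_within_limit() {
    local dir=$1 limit_kb=${2:-1024} size_kb
    size_kb=$(du -sk "$dir" | cut -f1)
    echo "$dir uses ${size_kb} KiB (limit ${limit_kb} KiB)"
    [ "$size_kb" -le "$limit_kb" ]
}
```

For example, `dir_within_limit /tmp/tdarr-monitor/ || echo "time to clean up"` prints the size and flags the directory when it grows past the limit.
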
### Log Rotation
```bash
# Clean old monitoring logs if they accumulate
find /tmp/tdarr-monitor/ -name "*.log" -mtime +7 -delete
```

### Updates
- **Script Updates**: Tracked in git alongside the homelab documentation
- **Webhook Rotation**: Update the webhook URL if the Discord server changes
- **Threshold Tuning**: Adjust the 15-minute interval based on operational experience

## Integration with Other Systems

### Future Enhancements
- **Grafana Integration**: Add metrics collection for dashboard visualization
- **Prometheus Metrics**: Export timing and error rate metrics
- **Home Assistant**: Integrate with home automation for additional alerting
- **Email Backup**: Secondary notification method for critical alerts

### Related Documentation
- [NAS Mount Configuration](../networking/nas-mount-configuration.md) - SMB optimization context
- [Tdarr Troubleshooting](tdarr-troubleshooting.md) - Worker timeout background
- [SSH Key Management](../networking/ssh-key-management.md) - Server access setup

---

- **Status**: ✅ Active and Configured
- **Last Updated**: August 10, 2025
- **Next Review**: September 10, 2025
- **Discord Channel**: Homelab monitoring alerts configured and tested