claude-home/reference/docker/tdarr-monitoring-configuration.md

# Tdarr Monitoring Configuration - Discord Integration

## Overview
This document describes the active Discord monitoring system for Tdarr worker timeouts, stalls, and operational status across the homelab environment.

## System Architecture

### Components
- **Tdarr Server**: ubuntu-ct (10.10.0.43) running `tdarr-clean` container
- **Tdarr Node**: nobara-pc (unmapped architecture) running `tdarr-node-gpu-unmapped`
- **Monitoring Script**: `scripts/monitoring/tdarr-timeout-monitor.sh` (relative to the `claude-home` repository root)
- **Discord Integration**: Webhook-based notifications to designated channel

### Monitoring Targets
- ✅ **Server Limbo Timeouts** - Files stuck in staging beyond timeout period
- ✅ **Node Worker Stalls** - Workers that hang during transcoding operations
- ✅ **Worker Disconnections** - Unexpected worker disconnects
- ✅ **Success Notifications** - Optional completion tracking (currently disabled)

## Active Configuration

### Script Location
```
/mnt/NV2/Development/claude-home/scripts/monitoring/tdarr-timeout-monitor.sh
```

### Key Settings
- **Polling Interval**: 15 minutes (900 seconds)
- **Discord Webhook**: Configured and active
- **Server Connection**: SSH alias `tdarr` → ubuntu-ct (10.10.0.43)
- **Node Container**: `tdarr-node-gpu-unmapped` via Podman
- **Working Directory**: `/tmp/tdarr-monitor/`

### Monitored Log Patterns

#### Server Timeout Events
```bash
# Pattern: "has been in limbo for X seconds, removing from staging section"
grep -i "has been in limbo"
```
**Example Alert**:
```
⚠️ 4 file(s) timed out in staging:
TV/Survivor/Season 48/Survivor (2000) - S48E04...
TV/Survivor/Season 48/Survivor (2000) - S48E11...
TV/Survivor/Season 26/Survivor (2000) - S26E05...
TV/Survivor/Season 22/Survivor (2000) - S22E13...
Files were removed from staging and will retry.
```
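The grep above can be wrapped into a small helper that turns matching log lines into the alert headline. This is a minimal sketch, not the monitoring script's actual code: the function name `summarize_limbo_timeouts` is hypothetical, and it only counts matches because the full layout of a Tdarr log line (where the file path sits) is not shown here.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read log text on stdin and print an alert headline
# when at least one "in limbo" line is present.
summarize_limbo_timeouts() {
    local matches count
    # grep exits non-zero when nothing matches; "|| true" keeps that from
    # aborting callers that run with "set -e".
    matches=$(grep -i "has been in limbo" || true)
    if [ -z "$matches" ]; then
        return 1    # nothing to report
    fi
    count=$(printf '%s\n' "$matches" | wc -l | tr -d ' ')
    printf '⚠️ %s file(s) timed out in staging\n' "$count"
}
```

Assuming the server logs are reachable as described under Key Settings, usage might look like `ssh tdarr "docker logs --since 15m tdarr-clean" 2>&1 | summarize_limbo_timeouts`. The 15-minute window is an assumption chosen to match the polling interval.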

#### Node Worker Events
```bash
# Patterns: "Worker X has stalled, cancelling" | "Worker X disconnected. Pruning."
grep -i "worker.*stalled\|worker.*disconnected"
```
**Example Alert**:
```
🔴 4 worker stall(s) detected:
Worker eager-eyas
Worker oblong-owl
Worker other-olm
Worker dry-dugong
Workers were cancelled and will restart.
```
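To get from matching log lines to the worker names listed in the alert, the name can be pulled out with `sed`. The sketch below assumes the two message shapes quoted above ("Worker X has stalled, cancelling" and "Worker X disconnected. Pruning.") and that worker names contain no spaces. `extract_worker_names` is a hypothetical name, not a function from the actual script.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the unique worker names for one event type.
# $1 is the event keyword, e.g. "stalled" or "disconnected".
extract_worker_names() {
    local event=$1
    # Keep only lines for the requested event, capture the word after
    # "Worker", and de-duplicate the result.
    { grep -i "worker.*${event}" || true; } \
        | sed -n 's/.*[Ww]orker \([^ ]*\) .*/\1/p' \
        | sort -u
}
```

For example, `podman logs --tail 200 tdarr-node-gpu-unmapped 2>&1 | extract_worker_names stalled` would list each stalled worker once. The `--tail 200` window is an arbitrary choice for illustration.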

## Historical Context

### Problem Timeline
1. **Initial Issue**: Multiple workers stalling due to resource competition
2. **Worker Reduction**: Reduced to 1 CPU + 1 GPU worker each
3. **Timeout Increase**: Extended staging timeout from 5 minutes to 15-20 minutes
4. **SMB Optimization**: Improved file transfer from 61.8 MB/s to 103 MB/s
5. **Monitoring Implementation**: Custom Discord alerts for proactive issue detection

### Root Causes Addressed
- **Large File Timeouts**: TV episodes (3-8GB+) exceeding 5-minute staging timeout
- **Worker Competition**: Multiple workers competing for GPU resources
- **Network Performance**: Slow SMB transfers causing download delays
- **Node Instability**: Workers hanging during complex transcoding flows

## Deployment Status

### Current State
- ✅ **Script Deployed**: Located and executable
- ✅ **Discord Webhook**: Configured and tested
- ✅ **Function Order**: Fixed (`send_discord_notification` defined before use)
- ✅ **Polling Interval**: Set to 15 minutes
- ✅ **Initial Test**: "monitoring started" message sent successfully

### Automation Setup
**Status**: Ready for automation. Choose one deployment method:

#### Option A: Cron Job
```bash
# Edit user crontab
crontab -e

# Add monitoring every 15 minutes
*/15 * * * * /mnt/NV2/Development/claude-home/scripts/monitoring/tdarr-timeout-monitor.sh >/dev/null 2>&1
```

#### Option B: Systemd Service (Recommended)
```bash
# Service file
sudo tee /etc/systemd/system/tdarr-monitor.service > /dev/null <<EOF
[Unit]
Description=Tdarr Timeout Monitor
After=network.target

[Service]
Type=oneshot
User=cal
ExecStart=/mnt/NV2/Development/claude-home/scripts/monitoring/tdarr-timeout-monitor.sh
EOF

# Timer file
sudo tee /etc/systemd/system/tdarr-monitor.timer > /dev/null <<EOF
[Unit]
Description=Run Tdarr Monitor every 15 minutes
Requires=tdarr-monitor.service

[Timer]
OnCalendar=*:0/15
Persistent=true

[Install]
WantedBy=timers.target
EOF

# Enable and start
sudo systemctl daemon-reload
sudo systemctl enable tdarr-monitor.timer
sudo systemctl start tdarr-monitor.timer
```

## Discord Integration Details

### Webhook Configuration
- **Channel**: Designated homelab monitoring channel
- **Message Format**: Rich embeds with color coding
- **Alert Colors**: Red (15158332) for errors, Green (3066993) for success
- **Content**: File paths, worker names, timestamps, hostname
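
Discord embeds take the `color` field as a decimal integer encoding an RGB value, which is why the alert colors above look opaque. A quick way to check or derive them is to convert between decimal and hex; `embed_color_hex` below is a hypothetical helper name used only for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helper: show a decimal embed color as a #RRGGBB hex string.
embed_color_hex() {
    printf '#%06X\n' "$1"
}

embed_color_hex 15158332   # red used for errors   -> #E74C3C
embed_color_hex 3066993    # green used for success -> #2ECC71
```

Going the other way, `printf '%d\n' 0xE74C3C` prints the decimal value to place in the payload.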

### Message Examples

#### Successful Start
```json
{
  "title": "🎬 Tdarr Monitoring Alert",
  "description": "Tdarr timeout monitoring started",
  "color": 3066993,
  "timestamp": "2025-08-10T14:19:55.000Z",
  "footer": {
    "text": "nobara-pc - Sun Aug 10 09:19:55 AM CDT 2025"
  }
}
```

#### Timeout Alert
```json
{
  "title": "🎬 Tdarr Monitoring Alert",
  "description": "⚠️ **4 file(s) timed out in staging:**\n```\nTV/Survivor/Season 48/...\n```\nFiles were removed from staging and will retry.",
  "color": 15158332,
  "timestamp": "2025-08-10T14:49:00.000Z"
}
```
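
A payload like the ones above can be assembled in plain bash, provided the description is escaped before it is embedded in JSON. The following is a sketch under stated assumptions rather than the script's real `send_discord_notification`: the helper names `json_escape` and `build_embed_payload` are hypothetical, the footer is omitted, and only backslashes, double quotes, and newlines are escaped (tabs and other control characters are not handled). Discord webhooks expect embed objects wrapped in an `embeds` array.

```bash
#!/usr/bin/env bash
# Hypothetical helper: escape a string for use inside a JSON string literal.
# Handles backslash, double quote, and newline only.
json_escape() {
    local s=$1
    s=${s//\\/\\\\}       # backslash -> \\
    s=${s//\"/\\\"}       # "         -> \"
    s=${s//$'\n'/\\n}     # newline   -> \n
    printf '%s' "$s"
}

# Hypothetical helper: print a webhook payload with a single embed.
# $1 = description text, $2 = decimal color (e.g. 15158332 for red).
build_embed_payload() {
    local description=$1 color=$2 timestamp
    timestamp=$(date -u +%Y-%m-%dT%H:%M:%S.000Z)
    printf '{"embeds":[{"title":"%s","description":"%s","color":%d,"timestamp":"%s"}]}\n' \
        "🎬 Tdarr Monitoring Alert" \
        "$(json_escape "$description")" \
        "$color" \
        "$timestamp"
}
```

The output can be piped straight to curl, for example `build_embed_payload "Tdarr timeout monitoring started" 3066993 | curl -H "Content-Type: application/json" -X POST -d @- "$DISCORD_WEBHOOK_URL"`, where `DISCORD_WEBHOOK_URL` is an assumed environment variable holding the webhook URL.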

## Performance Impact

### System Resources
- **CPU Usage**: Minimal - quick log parsing every 15 minutes
- **Network Impact**: One SSH connection + Docker logs query per check
- **Storage**: < 1MB in `/tmp/tdarr-monitor/`
- **Execution Time**: ~2-5 seconds per check

### Benefits
- **Proactive Issue Detection**: Know about problems before manual checking
- **Historical Tracking**: Discord provides persistent log of all alerts
- **Mobile Notifications**: Get alerts on phone via Discord app
- **Reduced Manual Monitoring**: Automated awareness of system health

## Troubleshooting

### Common Issues

#### No Discord Messages
```bash
# Test webhook manually
curl -H "Content-Type: application/json" -X POST \
     -d '{"content":"Manual test message"}' \
     "https://discord.com/api/webhooks/1404105821549498398/y2Ud1RK9rzFjv58xbypUfQNe3jrL7ZUq1FkQHa4_dfOHm2ylp93z0f4tY0O8Z-vQgKhD"

# Check script execution
/mnt/NV2/Development/claude-home/scripts/monitoring/tdarr-timeout-monitor.sh
```

#### SSH Connection Issues
```bash
# Test server connection
ssh tdarr "echo 'SSH working'"

# Check SSH key authentication
ssh-add -l
```

#### Podman Container Access
```bash
# Verify container is running
podman ps | grep tdarr-node

# Test log access
podman logs --tail 5 tdarr-node-gpu-unmapped
```

#### Force Monitoring Check
```bash
# Reset timestamp to trigger immediate check
echo $(($(date +%s) - 1000)) > /tmp/tdarr-monitor/last_check.timestamp

# Run manual check
/mnt/NV2/Development/claude-home/scripts/monitoring/tdarr-timeout-monitor.sh
```
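
Resetting the timestamp works because the script presumably compares the stored value against the current time before doing any work. The script's actual logic is not reproduced in this document, so the following is only a sketch of how such a gate could look. `should_run_check` is a hypothetical name, and the 900-second default mirrors the configured polling interval.

```bash
#!/usr/bin/env bash
# Hypothetical gate: succeed (exit 0) when at least $2 seconds have passed
# since the epoch time stored in file $1. A missing or malformed file
# counts as "never checked".
should_run_check() {
    local stamp_file=$1 interval=${2:-900} now last=0
    now=$(date +%s)
    if [ -r "$stamp_file" ]; then
        last=$(cat "$stamp_file")
    fi
    # Fall back to 0 if the file did not contain a plain integer.
    case $last in
        ''|*[!0-9]*) last=0 ;;
    esac
    [ $((now - last)) -ge "$interval" ]
}
```

Under this assumption, writing a value 1000 seconds in the past makes the gate pass immediately, since 1000 exceeds the 900-second interval.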

## Security Considerations

### Webhook Protection
- **URL Security**: Webhook URL contains sensitive token
- **Access Control**: Only monitoring script has access to webhook
- **Scope Limitation**: Webhook only has permission to post messages

### SSH Access
- **Key-based Authentication**: No password authentication used
- **Limited Commands**: Only Docker logs and basic system commands
- **Network Isolation**: SSH connection within trusted homelab network

### File Permissions
```bash
# Verify script permissions
ls -la /mnt/NV2/Development/claude-home/scripts/monitoring/tdarr-timeout-monitor.sh
# Should be: -rwxr-xr-x (755) or more restrictive

# Secure if needed
chmod 750 /mnt/NV2/Development/claude-home/scripts/monitoring/tdarr-timeout-monitor.sh
```

## Maintenance

### Regular Checks
- **Weekly**: Verify Discord messages are still being received
- **Monthly**: Check `/tmp/tdarr-monitor/` directory size
- **Quarterly**: Review alert frequency and adjust thresholds if needed

### Log Rotation
```bash
# Clean old monitoring logs if they accumulate
find /tmp/tdarr-monitor/ -name "*.log" -mtime +7 -delete
```

### Updates
- **Script Updates**: Tracked in git alongside the homelab documentation
- **Webhook Rotation**: Update webhook URL if Discord server changes
- **Threshold Tuning**: Adjust 15-minute interval based on operational experience

## Integration with Other Systems

### Future Enhancements
- **Grafana Integration**: Add metrics collection for dashboard visualization
- **Prometheus Metrics**: Export timing and error rate metrics
- **Home Assistant**: Integrate with home automation for additional alerting
- **Email Backup**: Secondary notification method for critical alerts

### Related Documentation
- [NAS Mount Configuration](../networking/nas-mount-configuration.md) - SMB optimization context
- [Tdarr Troubleshooting](tdarr-troubleshooting.md) - Worker timeout background
- [SSH Key Management](../networking/ssh-key-management.md) - Server access setup

---
- **Status**: ✅ Active and Configured
- **Last Updated**: August 10, 2025
- **Next Review**: September 10, 2025
- **Discord Channel**: Homelab monitoring alerts configured and tested