CLAUDE: Add comprehensive documentation for Tdarr monitoring and NAS configuration
Complete documentation package for home lab infrastructure:

## New Documentation Files:
- **Tdarr Monitoring Configuration**: Complete setup guide for Discord-based Tdarr monitoring system
- **NAS Mount Configuration**: SMB/CIFS mount setup and troubleshooting for media storage
- **Discord Monitoring Setup**: Step-by-step guide for webhook configuration and notification testing

## Documentation Features:
- **Reference Architecture**: Best practices for distributed Tdarr deployments
- **Configuration Templates**: Copy-paste ready configurations with security considerations
- **Troubleshooting Guides**: Common issues and solutions for production environments
- **Integration Examples**: Real-world implementation patterns for home lab environments

## Coverage Areas:
- Docker container orchestration and monitoring
- Network storage integration and performance optimization
- Automated alerting and notification systems
- Production-ready configuration management

These documents support the enhanced monitoring system and provide comprehensive guidance for maintaining a robust home lab infrastructure.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
parent 6cc0d0df2e
commit 715354da7d
282
reference/docker/tdarr-monitoring-configuration.md
Normal file
@ -0,0 +1,282 @@
|
||||
# Tdarr Monitoring Configuration - Discord Integration
|
||||
|
||||
## Overview
|
||||
This document describes the active Discord monitoring system for Tdarr worker timeouts, stalls, and operational status across the homelab environment.
|
||||
|
||||
## System Architecture
|
||||
|
||||
### Components
|
||||
- **Tdarr Server**: ubuntu-ct (10.10.0.43) running `tdarr-clean` container
|
||||
- **Tdarr Node**: nobara-pc (unmapped architecture) running `tdarr-node-gpu-unmapped`
|
||||
- **Monitoring Script**: `/scripts/monitoring/tdarr-timeout-monitor.sh` (relative to the repository root; the absolute path is listed under Script Location)
|
||||
- **Discord Integration**: Webhook-based notifications to designated channel
|
||||
|
||||
### Monitoring Targets
|
||||
✅ **Server Limbo Timeouts** - Files stuck in staging beyond timeout period
|
||||
✅ **Node Worker Stalls** - Workers that hang during transcoding operations
|
||||
✅ **Worker Disconnections** - Unexpected worker disconnects
|
||||
✅ **Success Notifications** - Optional completion tracking (currently disabled)
|
||||
|
||||
## Active Configuration
|
||||
|
||||
### Script Location
|
||||
```
|
||||
/mnt/NV2/Development/claude-home/scripts/monitoring/tdarr-timeout-monitor.sh
|
||||
```
|
||||
|
||||
### Key Settings
|
||||
- **Polling Interval**: 15 minutes (900 seconds)
|
||||
- **Discord Webhook**: Configured and active
|
||||
- **Server Connection**: SSH alias `tdarr` → ubuntu-ct (10.10.0.43)
|
||||
- **Node Container**: `tdarr-node-gpu-unmapped` via Podman
|
||||
- **Working Directory**: `/tmp/tdarr-monitor/`
|
||||
|
||||
### Monitored Log Patterns
|
||||
|
||||
#### Server Timeout Events
|
||||
```bash
|
||||
# Pattern: "has been in limbo for X seconds, removing from staging section"
|
||||
grep -i "has been in limbo"
|
||||
```
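
As a minimal sketch of how the pattern above could feed the "N file(s) timed out" count, the function below counts matching lines on stdin. Only the phrase `has been in limbo` comes from this document; the function name and the way logs are fetched in the usage comment are assumptions, not the monitoring script's actual code.

```bash
# Sketch: count "limbo" timeout events in log text read from stdin.
# count_limbo_events is a hypothetical helper name.
count_limbo_events() {
    # grep -c prints the number of matching lines. It exits non-zero when
    # the count is 0, so "|| true" keeps callers running under "set -e".
    grep -ci "has been in limbo" || true
}

# Example (SSH alias and container name as documented above; the
# --since window is an assumption):
#   ssh tdarr "docker logs --since 15m tdarr-clean 2>&1" | count_limbo_events
```

A count of `0` means no files were removed from staging in the inspected window, so no alert would be needed.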
|
||||
**Example Alert**:
|
||||
```
|
||||
⚠️ 4 file(s) timed out in staging:
|
||||
TV/Survivor/Season 48/Survivor (2000) - S48E04...
|
||||
TV/Survivor/Season 48/Survivor (2000) - S48E11...
|
||||
TV/Survivor/Season 26/Survivor (2000) - S26E05...
|
||||
TV/Survivor/Season 22/Survivor (2000) - S22E13...
|
||||
Files were removed from staging and will retry.
|
||||
```
|
||||
|
||||
#### Node Worker Events
|
||||
```bash
|
||||
# Patterns: "Worker X has stalled, cancelling" | "Worker X disconnected. Pruning."
|
||||
grep -i "worker.*stalled\|worker.*disconnected"
|
||||
```
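
The worker names shown in the example alert can be pulled out of the matched lines with `sed`. The sketch below assumes the two message shapes quoted in the comment above ("Worker X has stalled, cancelling" and "Worker X disconnected. Pruning."); the helper name is hypothetical and real log lines may carry extra prefixes, which the leading `.*` tolerates.

```bash
# Sketch: print the unique worker names from stall/disconnect lines on stdin.
# extract_affected_workers is a hypothetical helper name.
extract_affected_workers() {
    sed -n \
        -e 's/.*[Ww]orker \([^ ]*\) has stalled.*/\1/p' \
        -e 's/.*[Ww]orker \([^ ]*\) disconnected.*/\1/p' |
        sort -u
}

# Example (container name as documented above; the --since window is
# an assumption):
#   podman logs --since 15m tdarr-node-gpu-unmapped 2>&1 | extract_affected_workers
```

Deduplicating with `sort -u` keeps one entry per worker even when the same worker stalls more than once in a window.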
|
||||
**Example Alert**:
|
||||
```
|
||||
🔴 4 worker stall(s) detected:
|
||||
Worker eager-eyas
|
||||
Worker oblong-owl
|
||||
Worker other-olm
|
||||
Worker dry-dugong
|
||||
Workers were cancelled and will restart.
|
||||
```
|
||||
|
||||
## Historical Context
|
||||
|
||||
### Problem Timeline
|
||||
1. **Initial Issue**: Multiple workers stalling due to resource competition
|
||||
2. **Worker Reduction**: Reduced to 1 CPU + 1 GPU worker each
|
||||
3. **Timeout Increase**: Extended staging timeout from 5 minutes to 15-20 minutes
|
||||
4. **SMB Optimization**: Improved file transfer from 61.8 MB/s to 103 MB/s
|
||||
5. **Monitoring Implementation**: Custom Discord alerts for proactive issue detection
|
||||
|
||||
### Root Causes Addressed
|
||||
- **Large File Timeouts**: TV episodes (3-8GB+) exceeding 5-minute staging timeout
|
||||
- **Worker Competition**: Multiple workers competing for GPU resources
|
||||
- **Network Performance**: Slow SMB transfers causing download delays
|
||||
- **Node Instability**: Workers hanging during complex transcoding flows
|
||||
|
||||
## Deployment Status
|
||||
|
||||
### Current State
|
||||
✅ **Script Deployed**: Located and executable
|
||||
✅ **Discord Webhook**: Configured and tested
|
||||
✅ **Function Order**: Fixed (send_discord_notification defined before use)
|
||||
✅ **Polling Interval**: Set to 15 minutes
|
||||
✅ **Initial Test**: "monitoring started" message sent successfully
|
||||
|
||||
### Automation Setup
|
||||
**Status**: Ready for automation - choose deployment method:
|
||||
|
||||
#### Option A: Cron Job
|
||||
```bash
|
||||
# Edit user crontab
|
||||
crontab -e
|
||||
|
||||
# Add monitoring every 15 minutes
|
||||
*/15 * * * * /mnt/NV2/Development/claude-home/scripts/monitoring/tdarr-timeout-monitor.sh >/dev/null 2>&1
|
||||
```
|
||||
|
||||
#### Option B: Systemd Service (Recommended)
|
||||
```bash
# Service file
sudo tee /etc/systemd/system/tdarr-monitor.service > /dev/null <<EOF
[Unit]
Description=Tdarr Timeout Monitor
After=network.target

[Service]
Type=oneshot
User=cal
ExecStart=/mnt/NV2/Development/claude-home/scripts/monitoring/tdarr-timeout-monitor.sh
EOF

# Timer file
sudo tee /etc/systemd/system/tdarr-monitor.timer > /dev/null <<EOF
[Unit]
Description=Run Tdarr Monitor every 15 minutes
Requires=tdarr-monitor.service

[Timer]
OnCalendar=*:0/15
Persistent=true

[Install]
WantedBy=timers.target
EOF

# Enable and start
sudo systemctl daemon-reload
sudo systemctl enable tdarr-monitor.timer
sudo systemctl start tdarr-monitor.timer
```
|
||||
|
||||
## Discord Integration Details
|
||||
|
||||
### Webhook Configuration
|
||||
- **Channel**: Designated homelab monitoring channel
|
||||
- **Message Format**: Rich embeds with color coding
|
||||
- **Alert Colors**: Red (15158332) for errors, Green (3066993) for success
|
||||
- **Content**: File paths, worker names, timestamps, hostname
|
||||
|
||||
### Message Examples
|
||||
|
||||
#### Successful Start
|
||||
```json
|
||||
{
|
||||
"title": "🎬 Tdarr Monitoring Alert",
|
||||
"description": "Tdarr timeout monitoring started",
|
||||
"color": 3066993,
|
||||
"timestamp": "2025-08-10T14:19:55.000Z",
|
||||
"footer": {
|
||||
"text": "nobara-pc - Sun Aug 10 09:19:55 AM CDT 2025"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
#### Timeout Alert
|
||||
```json
|
||||
{
|
||||
"title": "🎬 Tdarr Monitoring Alert",
|
||||
"description": "⚠️ **4 file(s) timed out in staging:**\n```\nTV/Survivor/Season 48/...\n```\nFiles were removed from staging and will retry.",
|
||||
"color": 15158332,
|
||||
"timestamp": "2025-08-10T14:49:00.000Z"
|
||||
}
|
||||
```
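
The objects above are single embeds; a Discord webhook request carries them inside an `embeds` array. The following is a rough sketch of building that payload in shell, assuming the title, description, and color fields shown above. The helper names are hypothetical, and the escaping is deliberately minimal (backslashes and double quotes only).

```bash
# Sketch: wrap one embed in the webhook payload shape {"embeds":[...]}.
# json_escape handles only backslashes and double quotes; text containing
# literal newlines or control characters needs a real JSON encoder.
json_escape() {
    printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Usage: build_embed_payload TITLE DESCRIPTION COLOR
# COLOR is a decimal RGB integer: 15158332 is 0xE74C3C (red),
# 3066993 is 0x2ECC71 (green).
build_embed_payload() {
    printf '{"embeds":[{"title":"%s","description":"%s","color":%d}]}\n' \
        "$(json_escape "$1")" "$(json_escape "$2")" "$3"
}

# Example send (DISCORD_WEBHOOK is a placeholder for the real URL):
#   build_embed_payload "Tdarr Monitoring Alert" "Manual test" 3066993 |
#       curl -fsS -H "Content-Type: application/json" -d @- "$DISCORD_WEBHOOK"
```

Piping the payload to `curl -d @-` avoids putting the message text on the command line, where shell quoting is easy to get wrong.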
|
||||
|
||||
## Performance Impact
|
||||
|
||||
### System Resources
|
||||
- **CPU Usage**: Minimal - quick log parsing every 15 minutes
|
||||
- **Network Impact**: One SSH connection + Docker logs query per check
|
||||
- **Storage**: < 1MB in `/tmp/tdarr-monitor/`
|
||||
- **Execution Time**: ~2-5 seconds per check
|
||||
|
||||
### Benefits
|
||||
- **Proactive Issue Detection**: Know about problems before manual checking
|
||||
- **Historical Tracking**: Discord provides persistent log of all alerts
|
||||
- **Mobile Notifications**: Get alerts on phone via Discord app
|
||||
- **Reduced Manual Monitoring**: Automated awareness of system health
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Common Issues
|
||||
|
||||
#### No Discord Messages
|
||||
```bash
# Test webhook manually (substitute the real webhook URL; do not commit it)
curl -H "Content-Type: application/json" -X POST \
  -d '{"content":"Manual test message"}' \
  "https://discord.com/api/webhooks/YOUR_WEBHOOK_ID/YOUR_WEBHOOK_TOKEN"

# Check script execution
/mnt/NV2/Development/claude-home/scripts/monitoring/tdarr-timeout-monitor.sh
```
|
||||
|
||||
#### SSH Connection Issues
|
||||
```bash
|
||||
# Test server connection
|
||||
ssh tdarr "echo 'SSH working'"
|
||||
|
||||
# Check SSH key authentication
|
||||
ssh-add -l
|
||||
```
|
||||
|
||||
#### Podman Container Access
|
||||
```bash
|
||||
# Verify container is running
|
||||
podman ps | grep tdarr-node
|
||||
|
||||
# Test log access
|
||||
podman logs --tail 5 tdarr-node-gpu-unmapped
|
||||
```
|
||||
|
||||
#### Force Monitoring Check
|
||||
```bash
|
||||
# Reset timestamp to trigger immediate check
|
||||
echo $(($(date +%s) - 1000)) > /tmp/tdarr-monitor/last_check.timestamp
|
||||
|
||||
# Run manual check
|
||||
/mnt/NV2/Development/claude-home/scripts/monitoring/tdarr-timeout-monitor.sh
|
||||
```
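
Writing a timestamp 1000 seconds in the past works because the script appears to skip a run when too little time has passed since the last check. The sketch below illustrates that kind of gate; the state-file path matches the one above, but the function and its logic are an assumption about the script's behavior, not a copy of it.

```bash
# Sketch: decide whether enough time has passed since the last check.
# Usage: should_check STATE_FILE MIN_INTERVAL_SECONDS [NOW_EPOCH]
# should_check is a hypothetical helper name.
should_check() {
    state_file=$1
    min_interval=$2
    now=${3:-$(date +%s)}
    # A missing or empty state file means "never checked": run now.
    [ -s "$state_file" ] || return 0
    last=$(cat "$state_file")
    [ $((now - last)) -ge "$min_interval" ]
}

# Example:
#   if should_check /tmp/tdarr-monitor/last_check.timestamp 60; then
#       echo "running check"
#   fi
```

Deleting the state file has the same effect as backdating it, since a missing file is treated as "never checked" in this sketch.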
|
||||
|
||||
## Security Considerations
|
||||
|
||||
### Webhook Protection
|
||||
- **URL Security**: Webhook URL contains sensitive token
|
||||
- **Access Control**: Only monitoring script has access to webhook
|
||||
- **Scope Limitation**: Webhook only has permission to post messages
|
||||
|
||||
### SSH Access
|
||||
- **Key-based Authentication**: No password authentication used
|
||||
- **Limited Commands**: Only Docker logs and basic system commands
|
||||
- **Network Isolation**: SSH connection within trusted homelab network
|
||||
|
||||
### File Permissions
|
||||
```bash
|
||||
# Verify script permissions
|
||||
ls -la /mnt/NV2/Development/claude-home/scripts/monitoring/tdarr-timeout-monitor.sh
|
||||
# Should be: -rwxr-xr-x (755) or more restrictive
|
||||
|
||||
# Secure if needed
|
||||
chmod 750 /mnt/NV2/Development/claude-home/scripts/monitoring/tdarr-timeout-monitor.sh
|
||||
```
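
The permission check can also be automated. This is a small sketch, assuming GNU coreutils `stat` (the `-c` flag differs on BSD/macOS); the helper name is hypothetical. It succeeds only when the file is neither group- nor world-writable, which holds for both 755 and 750.

```bash
# Sketch: succeed only if FILE is neither group- nor world-writable.
# not_writable_by_others is a hypothetical helper name.
not_writable_by_others() {
    mode=$(stat -c '%a' "$1") || return 2
    # Take the group and other digits of the octal mode and test the
    # write bit (value 2) in each.
    group=$(( (mode / 10) % 10 ))
    other=$(( mode % 10 ))
    [ $(( group & 2 )) -eq 0 ] && [ $(( other & 2 )) -eq 0 ]
}

# Example:
#   not_writable_by_others /mnt/NV2/Development/claude-home/scripts/monitoring/tdarr-timeout-monitor.sh \
#       || echo "script is writable by group or others"
```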
|
||||
|
||||
## Maintenance
|
||||
|
||||
### Regular Checks
|
||||
- **Weekly**: Verify Discord messages are still being received
|
||||
- **Monthly**: Check `/tmp/tdarr-monitor/` directory size
|
||||
- **Quarterly**: Review alert frequency and adjust thresholds if needed
|
||||
|
||||
### Log Rotation
|
||||
```bash
|
||||
# Clean old monitoring logs if they accumulate
|
||||
find /tmp/tdarr-monitor/ -name "*.log" -mtime +7 -delete
|
||||
```
|
||||
|
||||
### Updates
|
||||
- **Script Updates**: Version control via git in homelab documentation
|
||||
- **Webhook Rotation**: Update webhook URL if Discord server changes
|
||||
- **Threshold Tuning**: Adjust 15-minute interval based on operational experience
|
||||
|
||||
## Integration with Other Systems
|
||||
|
||||
### Future Enhancements
|
||||
- **Grafana Integration**: Add metrics collection for dashboard visualization
|
||||
- **Prometheus Metrics**: Export timing and error rate metrics
|
||||
- **Home Assistant**: Integrate with home automation for additional alerting
|
||||
- **Email Backup**: Secondary notification method for critical alerts
|
||||
|
||||
### Related Documentation
|
||||
- [NAS Mount Configuration](../networking/nas-mount-configuration.md) - SMB optimization context
|
||||
- [Tdarr Troubleshooting](tdarr-troubleshooting.md) - Worker timeout background
|
||||
- [SSH Key Management](../networking/ssh-key-management.md) - Server access setup
|
||||
|
||||
---
|
||||
**Status**: ✅ Active and Configured
|
||||
**Last Updated**: August 10, 2025
|
||||
**Next Review**: September 10, 2025
|
||||
**Discord Channel**: Homelab monitoring alerts configured and tested
|
||||
205
reference/networking/nas-mount-configuration.md
Normal file
@ -0,0 +1,205 @@
|
||||
# NAS Mount Configuration - TrueNAS SMB Optimization
|
||||
|
||||
## Overview
|
||||
This document contains the optimized SMB mount configurations for TrueNAS (10.10.0.35) across multiple systems in the homelab. Compared with the previous mount settings, these optimizations raised measured throughput from 61.8 MB/s to 103 MB/s on the Tdarr server and from 11.1 MB/s to 85.4 MB/s on the local workstation.
|
||||
|
||||
## TrueNAS System Details
|
||||
- **IP Address**: 10.10.0.35
|
||||
- **System**: TrueNAS Server
|
||||
- **SMB Version**: 3.1.1 (optimized from 3.0)
|
||||
- **Available Shares**: `/media` and `/cals-files`
|
||||
|
||||
## Performance Results
|
||||
|
||||
### Before Optimization
|
||||
- **Tdarr Server**: 61.8 MB/s
|
||||
- **Local Workstation**: 11.1 MB/s
|
||||
|
||||
### After Optimization
|
||||
- **Tdarr Server**: 103 MB/s (67% improvement)
|
||||
- **Local Workstation**: 85.4 MB/s (669% improvement)
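
The percentages above follow from `(after - before) / before * 100`. A short worked check, with a hypothetical helper name and `awk` doing the floating-point arithmetic:

```bash
# Sketch: percentage improvement from BEFORE to AFTER, rounded to an integer.
# pct_improvement is a hypothetical helper name.
pct_improvement() {
    awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%.0f\n", (after - before) / before * 100 }'
}

pct_improvement 61.8 103    # Tdarr server
pct_improvement 11.1 85.4   # local workstation
```

The two calls print `67` and `669`, matching the figures quoted for the Tdarr server and the local workstation.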
|
||||
|
||||
## Optimized Mount Configurations
|
||||
|
||||
### Tdarr Server (ubuntu-ct - 10.10.0.43)
|
||||
|
||||
**File**: `/etc/fstab`
|
||||
```bash
|
||||
//10.10.0.35/media /mnt/truenas-share cifs vers=3.1.1,cache=loose,credentials=/root/.truenascreds,rsize=16777216,wsize=16777216,bsize=4194304,actimeo=30,closetimeo=5,echo_interval=30
|
||||
```
|
||||
|
||||
**Active Mount Options** (negotiated):
|
||||
```
|
||||
vers=3.1.1,cache=loose,rsize=8388608,wsize=8388608,bsize=4194304,
|
||||
actimeo=30,closetimeo=5,echo_interval=30
|
||||
```
|
||||
|
||||
### Local Workstation (nobara-pc)
|
||||
|
||||
**File**: `/etc/fstab`
|
||||
```bash
//10.10.0.35/media /mnt/media cifs credentials=/home/cal/.samba_credentials,uid=1000,gid=1000,vers=3.1.1,cache=loose,rsize=16777216,wsize=16777216,bsize=4194304,actimeo=30,closetimeo=5,echo_interval=30,noperm 0 0
//10.10.0.35/cals-files /mnt/cals-files cifs credentials=/home/cal/.samba_credentials,uid=1000,gid=1000,vers=3.1.1,cache=loose,rsize=16777216,wsize=16777216,bsize=4194304,actimeo=30,closetimeo=5,echo_interval=30,noperm 0 0
```
|
||||
|
||||
**Active Mount Options** (negotiated):
|
||||
```
|
||||
vers=3.1.1,cache=loose,rsize=8388608,wsize=8388608,bsize=4194304,
|
||||
actimeo=30,closetimeo=5,echo_interval=30,noperm
|
||||
```
|
||||
|
||||
## Key Optimization Parameters
|
||||
|
||||
### Protocol Version
|
||||
- **Setting**: `vers=3.1.1`
|
||||
- **Previous**: `vers=3.0`
|
||||
- **Benefit**: Latest SMB protocol with performance improvements and better security
|
||||
|
||||
### Cache Strategy
|
||||
- **Setting**: `cache=loose`
|
||||
- **Previous**: `cache=strict`
|
||||
- **Benefit**: Better read performance, at the cost of weaker cache consistency (changes made by other clients may not be visible immediately)
|
||||
|
||||
### Buffer Sizes
|
||||
- **Read Size**: `rsize=16777216` (requested) → `rsize=8388608` (negotiated)
|
||||
- **Write Size**: `wsize=16777216` (requested) → `wsize=8388608` (negotiated)
|
||||
- **Block Size**: `bsize=4194304` (4MB, up from 1MB)
|
||||
- **Previous**: `rsize=4194304,wsize=4194304,bsize=1048576`
|
||||
- **Benefit**: Larger buffers reduce network round trips
|
||||
|
||||
### Connection Timeouts
|
||||
- **Attribute Cache**: `actimeo=30` (30 seconds, up from 1 second)
|
||||
- **Close Timeout**: `closetimeo=5` (5 seconds, up from 1 second)
|
||||
- **Echo Interval**: `echo_interval=30` (30 seconds, down from 60 seconds)
|
||||
- **Benefit**: Better connection persistence, fewer reconnections
|
||||
|
||||
## Credential Files
|
||||
|
||||
### Tdarr Server
|
||||
**File**: `/root/.truenascreds`
|
||||
```
|
||||
username=plex
|
||||
password=[password]
|
||||
domain=
|
||||
```
|
||||
|
||||
### Local Workstation
|
||||
**File**: `/home/cal/.samba_credentials`
|
||||
```
|
||||
username=plex
|
||||
password=[password]
|
||||
domain=
|
||||
```
|
||||
|
||||
**Security**: Both files should be owned by root with 600 permissions:
|
||||
```bash
|
||||
sudo chown root:root /path/to/credentials
|
||||
sudo chmod 600 /path/to/credentials
|
||||
```
|
||||
|
||||
## Mount Commands
|
||||
|
||||
### Apply New Configuration
|
||||
```bash
|
||||
# Unmount existing mounts
|
||||
sudo umount /mnt/media
|
||||
sudo umount /mnt/cals-files
|
||||
|
||||
# Mount with new optimized settings
|
||||
sudo mount /mnt/media
|
||||
sudo mount /mnt/cals-files
|
||||
```
|
||||
|
||||
### Verify Mount Options
|
||||
```bash
|
||||
mount | grep -E 'media|cals-files'
|
||||
```
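
Because the server may negotiate smaller buffers than requested (16 MiB requested, 8 MiB active in the configurations above), it helps to read individual values out of the option string. The helpers below are an illustrative sketch with hypothetical names; the option string would come from the `mount` output or from `findmnt -no OPTIONS /mnt/media`.

```bash
# Sketch: pull one option's value out of a comma-separated option string.
# Usage: mount_opt_value "vers=3.1.1,rsize=8388608,..." rsize
mount_opt_value() {
    printf '%s\n' "$1" | tr ',' '\n' | sed -n "s/^$2=//p" | head -n 1
}

# Convert a byte count to whole MiB (1 MiB = 1048576 bytes).
bytes_to_mib() {
    echo $(( $1 / 1048576 ))
}

# Example:
#   opts=$(findmnt -no OPTIONS /mnt/media)
#   bytes_to_mib "$(mount_opt_value "$opts" rsize)"
```

If the reported `rsize` is lower than the value in `/etc/fstab`, the server capped the buffer size during negotiation; that is expected and not a mount error.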
|
||||
|
||||
### Test Performance
|
||||
```bash
|
||||
# Test read performance (100MB test)
|
||||
time dd if="/mnt/media/path/to/large/file.mkv" bs=1M count=100 of=/dev/null
|
||||
```
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Common Issues
|
||||
|
||||
**Mount Error: Device or resource busy**
|
||||
- Solution: Unmount existing mount first: `sudo umount /mnt/media`
|
||||
|
||||
**Parse Error in fstab**
|
||||
- Check for missing `//` before IP address
|
||||
- Ensure no spaces in comma-separated options
|
||||
- Verify all commas are present between options
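
The first two checks above can be scripted before touching a live `/etc/fstab`. This is a rough sketch with a hypothetical function name; it catches a missing `//` prefix and stray spaces in the options (which show up as extra fields), but it cannot detect a comma that is missing between two options.

```bash
# Sketch: basic sanity checks for one CIFS line from /etc/fstab.
# Prints a reason and returns 1 on the first problem found.
# check_cifs_fstab_line is a hypothetical helper name.
check_cifs_fstab_line() {
    line=$1
    # Split on whitespace into positional parameters (fstab has 4 to 6 fields).
    set -f            # disable glob expansion while splitting
    set -- $line
    set +f
    case ${1:-} in
        //*) ;;
        *) echo "device must start with //"; return 1 ;;
    esac
    if [ "$#" -lt 4 ] || [ "$#" -gt 6 ]; then
        echo "expected 4-6 fields, got $# (space inside the options?)"
        return 1
    fi
    case $4 in
        *,,*|,*|*,) echo "empty entry in options list"; return 1 ;;
    esac
    echo "ok"
}

# Example:
#   check_cifs_fstab_line "//10.10.0.35/media /mnt/media cifs vers=3.1.1,cache=loose 0 0"
```

On systems with a recent util-linux, `sudo findmnt --verify` performs a fuller check of the whole file.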
|
||||
|
||||
**Permission Denied**
|
||||
- Verify credential file exists and has correct permissions (600)
|
||||
- Check username/password in credential file
|
||||
- Ensure TrueNAS user has access to the share
|
||||
|
||||
**Slow Performance After Changes**
|
||||
- Verify new mount options are active: `mount | grep media`
|
||||
- Test with different file sizes to confirm improvement
|
||||
- Check network connectivity: `ping 10.10.0.35`
|
||||
|
||||
### Performance Testing Commands
|
||||
```bash
|
||||
# Quick performance test (50MB)
|
||||
time dd if="/mnt/media/Movies/[movie-file]" bs=1M count=50 of=/dev/null
|
||||
|
||||
# Larger performance test (100MB)
|
||||
time dd if="/mnt/media/Movies/[movie-file]" bs=1M count=100 of=/dev/null
|
||||
|
||||
# Monitor network during transfer
|
||||
iftop -i eth0
|
||||
```
|
||||
|
||||
## Backup and Rollback
|
||||
|
||||
### Before Making Changes
|
||||
```bash
|
||||
# Backup fstab
|
||||
sudo cp /etc/fstab /etc/fstab.backup-$(date +%Y%m%d-%H%M%S)
|
||||
```
|
||||
|
||||
### Rollback if Needed
|
||||
```bash
|
||||
# Restore from backup
|
||||
sudo cp /etc/fstab.backup-YYYYMMDD-HHMMSS /etc/fstab
|
||||
|
||||
# Remount
|
||||
sudo umount /mnt/media
|
||||
sudo mount /mnt/media
|
||||
```
|
||||
|
||||
## System-Specific Notes
|
||||
|
||||
### Tdarr Integration
|
||||
- **Unmapped Node Architecture**: Node downloads files from Tdarr server at optimized 103 MB/s
|
||||
- **Cache Directory**: Uses local NVMe storage for maximum transcoding performance
|
||||
- **No Direct NAS Access**: Unmapped nodes don't directly access NAS mounts
|
||||
|
||||
### Local Workstation Usage
|
||||
- **Media Browsing**: 85.4 MB/s for fast local media access
|
||||
- **File Management**: Significantly faster file operations
|
||||
- **Backup Operations**: Improved speeds for large file transfers
|
||||
|
||||
## Future Expansion
|
||||
|
||||
When adding new systems, use these optimized settings as the baseline:
|
||||
|
||||
```bash
|
||||
//10.10.0.35/media /mnt/media cifs credentials=/path/to/creds,uid=1000,gid=1000,vers=3.1.1,cache=loose,rsize=16777216,wsize=16777216,bsize=4194304,actimeo=30,closetimeo=5,echo_interval=30,noperm 0 0
|
||||
```
|
||||
|
||||
Adjust `uid`, `gid`, and credential path as needed for each system.
|
||||
|
||||
## Related Documentation
|
||||
- [SSH Key Management](ssh-key-management.md) - For secure access to systems
|
||||
- [Tdarr Troubleshooting](../docker/tdarr-troubleshooting.md) - For Tdarr-specific issues
|
||||
- [Network Troubleshooting](ssh-troubleshooting.md) - For general network issues
|
||||
|
||||
---
|
||||
*Last updated: August 10, 2025*
|
||||
*Performance improvements: Tdarr Server 67% faster, Local Workstation 669% faster*
|
||||
191
scripts/monitoring/setup-discord-monitoring.md
Normal file
@ -0,0 +1,191 @@
|
||||
# Tdarr Discord Monitoring Setup Guide
|
||||
|
||||
## Overview
|
||||
This guide sets up automated Discord notifications for Tdarr worker timeouts, stalls, and completions using a custom log monitoring script.
|
||||
|
||||
## Prerequisites
|
||||
- Discord server where you want notifications
|
||||
- Administrative access to create webhooks
|
||||
- Tdarr server accessible via SSH
|
||||
- Podman/Docker access to Tdarr node
|
||||
|
||||
## Setup Steps
|
||||
|
||||
### 1. Create Discord Webhook
|
||||
1. Go to your Discord server → **Server Settings** → **Integrations** → **Webhooks**
|
||||
2. Click **Create Webhook**
|
||||
3. Name it "Tdarr Monitor" and select the channel for notifications
|
||||
4. Copy the **Webhook URL** (keep this secure!)
|
||||
|
||||
### 2. Configure Monitoring Script
|
||||
Edit the script to add your webhook:
|
||||
```bash
|
||||
nano /mnt/NV2/Development/claude-home/scripts/monitoring/tdarr-timeout-monitor.sh
|
||||
```
|
||||
|
||||
Update these lines:
|
||||
```bash
|
||||
DISCORD_WEBHOOK="https://discord.com/api/webhooks/YOUR_WEBHOOK_ID/YOUR_WEBHOOK_TOKEN"
|
||||
SERVER_HOST="tdarr" # Your SSH alias for Tdarr server
|
||||
NODE_CONTAINER="tdarr-node-gpu-unmapped" # Your node container name
|
||||
```
|
||||
|
||||
### 3. Make Script Executable
|
||||
```bash
|
||||
chmod +x /mnt/NV2/Development/claude-home/scripts/monitoring/tdarr-timeout-monitor.sh
|
||||
```
|
||||
|
||||
### 4. Test the Script
|
||||
```bash
|
||||
/mnt/NV2/Development/claude-home/scripts/monitoring/tdarr-timeout-monitor.sh
|
||||
```
|
||||
|
||||
You should see a "monitoring started" message in your Discord channel.
|
||||
|
||||
### 5. Setup Automated Monitoring (Choose One)
|
||||
|
||||
#### Option A: Cron Job (Simple)
|
||||
```bash
|
||||
# Edit crontab
|
||||
crontab -e
|
||||
|
||||
# Add this line to check every 5 minutes
|
||||
*/5 * * * * /mnt/NV2/Development/claude-home/scripts/monitoring/tdarr-timeout-monitor.sh >/dev/null 2>&1
|
||||
```
|
||||
|
||||
#### Option B: Systemd Service (Advanced)
|
||||
Create a systemd service for more reliable monitoring:
|
||||
|
||||
```bash
|
||||
sudo nano /etc/systemd/system/tdarr-monitor.service
|
||||
```
|
||||
|
||||
Content:
|
||||
```ini
|
||||
[Unit]
|
||||
Description=Tdarr Timeout Monitor
|
||||
After=network.target
|
||||
|
||||
[Service]
|
||||
Type=oneshot
|
||||
User=cal
|
||||
ExecStart=/mnt/NV2/Development/claude-home/scripts/monitoring/tdarr-timeout-monitor.sh
|
||||
```
|
||||
|
||||
Create timer:
|
||||
```bash
|
||||
sudo nano /etc/systemd/system/tdarr-monitor.timer
|
||||
```
|
||||
|
||||
Content:
|
||||
```ini
|
||||
[Unit]
|
||||
Description=Run Tdarr Monitor every 5 minutes
|
||||
Requires=tdarr-monitor.service
|
||||
|
||||
[Timer]
|
||||
OnCalendar=*:0/5
|
||||
Persistent=true
|
||||
|
||||
[Install]
|
||||
WantedBy=timers.target
|
||||
```
|
||||
|
||||
Enable and start:
|
||||
```bash
|
||||
sudo systemctl daemon-reload
|
||||
sudo systemctl enable tdarr-monitor.timer
|
||||
sudo systemctl start tdarr-monitor.timer
|
||||
```
|
||||
|
||||
## Notification Examples
|
||||
|
||||
### Worker Timeout Alert
|
||||
````
🎬 Tdarr Monitoring Alert
⚠️ 4 file(s) timed out in staging:
```
TV/Survivor/Season 48/Survivor (2000) - S48E04...
TV/Survivor/Season 48/Survivor (2000) - S48E11...
TV/Survivor/Season 26/Survivor (2000) - S26E05...
```
Files were removed from staging and will retry.
````
|
||||
|
||||
### Worker Stall Alert
|
||||
````
🎬 Tdarr Monitoring Alert
🔴 2 worker stall(s) detected:
```
Worker eager-eyas
Worker oblong-owl
```
Workers were cancelled and will restart.
````
|
||||
|
||||
### Success Notification (Optional)
|
||||
```
|
||||
🎬 Tdarr Monitoring Alert
|
||||
✅ 3 transcode(s) completed successfully in the last check period.
|
||||
```
|
||||
|
||||
## Monitoring Features
|
||||
|
||||
✅ **Server Limbo Timeouts** - Files stuck in staging > timeout period
|
||||
✅ **Node Worker Stalls** - Workers that hang during transcoding
|
||||
✅ **Success Notifications** - Optional completion alerts
|
||||
✅ **Smart Timing** - Only checks every 60+ seconds to avoid spam
|
||||
✅ **Rich Discord Embeds** - Color-coded messages with timestamps
|
||||
|
||||
## Customization Options
|
||||
|
||||
### Disable Success Messages
|
||||
Edit the script and comment out this line:
|
||||
```bash
|
||||
# check_completions # Comment out to disable success notifications
|
||||
```
|
||||
|
||||
### Change Check Frequency
|
||||
For cron job, modify the timing:
|
||||
```bash
# Check every 10 minutes instead of 5
*/10 * * * * /mnt/NV2/Development/claude-home/scripts/monitoring/tdarr-timeout-monitor.sh >/dev/null 2>&1
```
|
||||
|
||||
For systemd timer, update OnCalendar:
|
||||
```ini
# Check every 10 minutes (systemd does not allow trailing comments on a value line)
OnCalendar=*:0/10
```
|
||||
|
||||
### Add More Monitoring
|
||||
You can extend the script to monitor:
|
||||
- Disk space on cache directory
|
||||
- Network connectivity to TrueNAS
|
||||
- GPU utilization during transcoding
|
||||
- Queue depth and processing rates
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### No Notifications Received
|
||||
1. Check webhook URL is correct and accessible
|
||||
2. Test webhook manually:
|
||||
```bash
|
||||
curl -H "Content-Type: application/json" -X POST -d '{"content":"Test message"}' "YOUR_WEBHOOK_URL"
|
||||
```
|
||||
3. Check script logs: `/tmp/tdarr-monitor/`
|
||||
|
||||
### False Positives
|
||||
- Adjust the timing logic in the script
|
||||
- Filter out specific log patterns that aren't actual errors
|
||||
- Tune the timeout thresholds
|
||||
|
||||
### Missing SSH Access
|
||||
- Ensure SSH key authentication is set up for the `tdarr` server
|
||||
- Test: `ssh tdarr "echo 'SSH working'"`
|
||||
|
||||
## Security Notes
|
||||
- Keep your Discord webhook URL private
|
||||
- Consider using environment variables for sensitive data
|
||||
- Restrict file permissions on the script (chmod 750)
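
One way to follow the environment-variable suggestion is to have the script read `DISCORD_WEBHOOK` from the environment and refuse to run without it. The sketch below is illustrative only: the variable name matches the setup step, while the function name and the URL-prefix check are assumptions added here.

```bash
# Sketch: require the webhook URL from the environment instead of
# hard-coding it in the script. require_webhook is a hypothetical helper.
require_webhook() {
    case ${DISCORD_WEBHOOK:-} in
        https://discord.com/api/webhooks/*/*) return 0 ;;
        "") echo "DISCORD_WEBHOOK is not set" >&2; return 1 ;;
        *)  echo "DISCORD_WEBHOOK does not look like a webhook URL" >&2; return 1 ;;
    esac
}

# Example at the top of the monitoring script:
#   require_webhook || exit 1
```

For the systemd option, the variable could be supplied through an `EnvironmentFile=` entry in the service unit pointing at a file with 600 permissions, which keeps the token out of version control.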
|
||||
|
||||
---
|
||||
*This monitoring solution provides real-time alerts for Tdarr issues without requiring external monitoring infrastructure.*
|
||||