From 715354da7d71f556c6e266e0e4350954c18ad44f Mon Sep 17 00:00:00 2001
From: Cal Corum
Date: Sun, 10 Aug 2025 10:39:55 -0500
Subject: [PATCH] CLAUDE: Add comprehensive documentation for Tdarr monitoring and NAS configuration
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Complete documentation package for home lab infrastructure:

## New Documentation Files:
- **Tdarr Monitoring Configuration**: Complete setup guide for Discord-based Tdarr monitoring system
- **NAS Mount Configuration**: SMB/CIFS mount setup and troubleshooting for media storage
- **Discord Monitoring Setup**: Step-by-step guide for webhook configuration and notification testing

## Documentation Features:
- **Reference Architecture**: Best practices for distributed Tdarr deployments
- **Configuration Templates**: Copy-paste ready configurations with security considerations
- **Troubleshooting Guides**: Common issues and solutions for production environments
- **Integration Examples**: Real-world implementation patterns for home lab environments

## Coverage Areas:
- Docker container orchestration and monitoring
- Network storage integration and performance optimization
- Automated alerting and notification systems
- Production-ready configuration management

These documents support the enhanced monitoring system and provide comprehensive guidance for maintaining a robust home lab infrastructure.
🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude
---
 .../docker/tdarr-monitoring-configuration.md | 282 ++++++++++++++++++
 .../networking/nas-mount-configuration.md | 205 +++++++++++++
 .../monitoring/setup-discord-monitoring.md | 191 ++++++++++++
 3 files changed, 678 insertions(+)
 create mode 100644 reference/docker/tdarr-monitoring-configuration.md
 create mode 100644 reference/networking/nas-mount-configuration.md
 create mode 100644 scripts/monitoring/setup-discord-monitoring.md

diff --git a/reference/docker/tdarr-monitoring-configuration.md b/reference/docker/tdarr-monitoring-configuration.md
new file mode 100644
index 0000000..647da73
--- /dev/null
+++ b/reference/docker/tdarr-monitoring-configuration.md
@@ -0,0 +1,282 @@
+# Tdarr Monitoring Configuration - Discord Integration
+
+## Overview
+This document describes the active Discord monitoring system for Tdarr worker timeouts, stalls, and operational status across the homelab environment.
+
+## System Architecture
+
+### Components
+- **Tdarr Server**: ubuntu-ct (10.10.0.43) running `tdarr-clean` container
+- **Tdarr Node**: nobara-pc (unmapped architecture) running `tdarr-node-gpu-unmapped`
+- **Monitoring Script**: `/scripts/monitoring/tdarr-timeout-monitor.sh`
+- **Discord Integration**: Webhook-based notifications to designated channel
+
+### Monitoring Targets
+✅ **Server Limbo Timeouts** - Files stuck in staging beyond timeout period
+✅ **Node Worker Stalls** - Workers that hang during transcoding operations
+✅ **Worker Disconnections** - Unexpected worker disconnects
+✅ **Success Notifications** - Optional completion tracking (currently disabled)
+
+## Active Configuration
+
+### Script Location
+```
+/mnt/NV2/Development/claude-home/scripts/monitoring/tdarr-timeout-monitor.sh
+```
+
+### Key Settings
+- **Polling Interval**: 15 minutes (900 seconds)
+- **Discord Webhook**: Configured and active
+- **Server Connection**: SSH alias `tdarr` → ubuntu-ct (10.10.0.43)
+- **Node
Container**: `tdarr-node-gpu-unmapped` via Podman
+- **Working Directory**: `/tmp/tdarr-monitor/`
+
+### Monitored Log Patterns
+
+#### Server Timeout Events
+```bash
+# Pattern: "has been in limbo for X seconds, removing from staging section"
+grep -i "has been in limbo"
+```
+**Example Alert**:
+```
+⚠️ 4 file(s) timed out in staging:
+TV/Survivor/Season 48/Survivor (2000) - S48E04...
+TV/Survivor/Season 48/Survivor (2000) - S48E11...
+TV/Survivor/Season 26/Survivor (2000) - S26E05...
+TV/Survivor/Season 22/Survivor (2000) - S22E13...
+Files were removed from staging and will retry.
+```
+
+#### Node Worker Events
+```bash
+# Patterns: "Worker X has stalled, cancelling" | "Worker X disconnected. Pruning."
+grep -i "worker.*stalled\|worker.*disconnected"
+```
+**Example Alert**:
+```
+🔴 4 worker stall(s) detected:
+Worker eager-eyas
+Worker oblong-owl
+Worker other-olm
+Worker dry-dugong
+Workers were cancelled and will restart.
+```
+
+## Historical Context
+
+### Problem Timeline
+1. **Initial Issue**: Multiple workers stalling due to resource competition
+2. **Worker Reduction**: Reduced to 1 CPU + 1 GPU worker each
+3. **Timeout Increase**: Extended staging timeout from 5 minutes to 15-20 minutes
+4. **SMB Optimization**: Improved file transfer from 61.8 MB/s to 103 MB/s
+5.
**Monitoring Implementation**: Custom Discord alerts for proactive issue detection
+
+### Root Causes Addressed
+- **Large File Timeouts**: TV episodes (3-8GB+) exceeding 5-minute staging timeout
+- **Worker Competition**: Multiple workers competing for GPU resources
+- **Network Performance**: Slow SMB transfers causing download delays
+- **Node Instability**: Workers hanging during complex transcoding flows
+
+## Deployment Status
+
+### Current State
+✅ **Script Deployed**: Located and executable
+✅ **Discord Webhook**: Configured and tested
+✅ **Function Order**: Fixed (send_discord_notification defined before use)
+✅ **Polling Interval**: Set to 15 minutes
+✅ **Initial Test**: "monitoring started" message sent successfully
+
+### Automation Setup
+**Status**: Ready for automation - choose deployment method:
+
+#### Option A: Cron Job
+```bash
+# Edit user crontab
+crontab -e
+
+# Add monitoring every 15 minutes
+*/15 * * * * /mnt/NV2/Development/claude-home/scripts/monitoring/tdarr-timeout-monitor.sh >/dev/null 2>&1
+```
+
+#### Option B: Systemd Service (Recommended)
+```bash
+# Service file
+sudo tee /etc/systemd/system/tdarr-monitor.service > /dev/null < /dev/null < /tmp/tdarr-monitor/last_check.timestamp
+
+# Run manual check
+/mnt/NV2/Development/claude-home/scripts/monitoring/tdarr-timeout-monitor.sh
+```
+
+## Security Considerations
+
+### Webhook Protection
+- **URL Security**: Webhook URL contains sensitive token
+- **Access Control**: Only monitoring script has access to webhook
+- **Scope Limitation**: Webhook only has permission to post messages
+
+### SSH Access
+- **Key-based Authentication**: No password authentication used
+- **Limited Commands**: Only Docker logs and basic system commands
+- **Network Isolation**: SSH connection within trusted homelab network
+
+### File Permissions
+```bash
+# Verify script permissions
+ls -la /mnt/NV2/Development/claude-home/scripts/monitoring/tdarr-timeout-monitor.sh
+# Should be: -rwxr-xr-x (755) or more
restrictive
+
+# Secure if needed
+chmod 750 /mnt/NV2/Development/claude-home/scripts/monitoring/tdarr-timeout-monitor.sh
+```
+
+## Maintenance
+
+### Regular Checks
+- **Weekly**: Verify Discord messages are still being received
+- **Monthly**: Check `/tmp/tdarr-monitor/` directory size
+- **Quarterly**: Review alert frequency and adjust thresholds if needed
+
+### Log Rotation
+```bash
+# Clean old monitoring logs if they accumulate
+find /tmp/tdarr-monitor/ -name "*.log" -mtime +7 -delete
+```
+
+### Updates
+- **Script Updates**: Version control via git in homelab documentation
+- **Webhook Rotation**: Update webhook URL if Discord server changes
+- **Threshold Tuning**: Adjust 15-minute interval based on operational experience
+
+## Integration with Other Systems
+
+### Future Enhancements
+- **Grafana Integration**: Add metrics collection for dashboard visualization
+- **Prometheus Metrics**: Export timing and error rate metrics
+- **Home Assistant**: Integrate with home automation for additional alerting
+- **Email Backup**: Secondary notification method for critical alerts
+
+### Related Documentation
+- [NAS Mount Configuration](../networking/nas-mount-configuration.md) - SMB optimization context
+- [Tdarr Troubleshooting](tdarr-troubleshooting.md) - Worker timeout background
+- [SSH Key Management](../networking/ssh-key-management.md) - Server access setup
+
+---
+**Status**: ✅ Active and Configured
+**Last Updated**: August 10, 2025
+**Next Review**: September 10, 2025
+**Discord Channel**: Homelab monitoring alerts configured and tested
\ No newline at end of file
diff --git a/reference/networking/nas-mount-configuration.md b/reference/networking/nas-mount-configuration.md
new file mode 100644
index 0000000..302dad1
--- /dev/null
+++ b/reference/networking/nas-mount-configuration.md
@@ -0,0 +1,205 @@
+# NAS Mount Configuration - TrueNAS SMB Optimization
+
+## Overview
+This document contains the optimized SMB mount configurations for TrueNAS
(10.10.0.35) across multiple systems in the homelab. These optimizations provide significant performance improvements over default SMB settings.
+
+## TrueNAS System Details
+- **IP Address**: 10.10.0.35
+- **System**: TrueNAS Server
+- **SMB Version**: 3.1.1 (optimized from 3.0)
+- **Available Shares**: `/media` and `/cals-files`
+
+## Performance Results
+
+### Before Optimization
+- **Tdarr Server**: 61.8 MB/s
+- **Local Workstation**: 11.1 MB/s
+
+### After Optimization
+- **Tdarr Server**: 103 MB/s (67% improvement)
+- **Local Workstation**: 85.4 MB/s (669% improvement)
+
+## Optimized Mount Configurations
+
+### Tdarr Server (ubuntu-ct - 10.10.0.43)
+
+**File**: `/etc/fstab`
+```bash
+//10.10.0.35/media /mnt/truenas-share cifs vers=3.1.1,cache=loose,credentials=/root/.truenascreds,rsize=16777216,wsize=16777216,bsize=4194304,actimeo=30,closetimeo=5,echo_interval=30
+```
+
+**Active Mount Options** (negotiated):
+```
+vers=3.1.1,cache=loose,rsize=8388608,wsize=8388608,bsize=4194304,
+actimeo=30,closetimeo=5,echo_interval=30
+```
+
+### Local Workstation (nobara-pc)
+
+**File**: `/etc/fstab`
+```bash
+//10.10.0.35/media /mnt/media cifs credentials=/home/cal/.samba_credentials,uid=1000,gid=1000,vers=3.1.1,cache=loose,rsize=16777216,wsize=16777216,bsize=4194304,actimeo=30,closetimeo=5,echo_interval=30,noperm 0 0
+//10.10.0.35/cals-files /mnt/cals-files cifs credentials=/home/cal/.samba_credentials,uid=1000,gid=1000,vers=3.1.1,cache=loose,rsize=16777216,wsize=16777216,bsize=4194304,actimeo=30,closetimeo=5,echo_interval=30,noperm 0 0
+```
+
+**Active Mount Options** (negotiated):
+```
+vers=3.1.1,cache=loose,rsize=8388608,wsize=8388608,bsize=4194304,
+actimeo=30,closetimeo=5,echo_interval=30,noperm
+```
+
+## Key Optimization Parameters
+
+### Protocol Version
+- **Setting**: `vers=3.1.1`
+- **Previous**: `vers=3.0`
+- **Benefit**: Latest SMB protocol with performance improvements and better security
+
+### Cache Strategy
+- **Setting**: `cache=loose`
+-
**Previous**: `cache=strict`
+- **Benefit**: Better read performance, less strict consistency requirements
+
+### Buffer Sizes
+- **Read Size**: `rsize=16777216` (requested) → `rsize=8388608` (negotiated)
+- **Write Size**: `wsize=16777216` (requested) → `wsize=8388608` (negotiated)
+- **Block Size**: `bsize=4194304` (4MB, up from 1MB)
+- **Previous**: `rsize=4194304,wsize=4194304,bsize=1048576`
+- **Benefit**: Larger buffers reduce network round trips
+
+### Connection Timeouts
+- **Attribute Cache**: `actimeo=30` (30 seconds, up from 1 second)
+- **Close Timeout**: `closetimeo=5` (5 seconds, up from 1 second)
+- **Echo Interval**: `echo_interval=30` (30 seconds, down from 60 seconds)
+- **Benefit**: Better connection persistence, fewer reconnections
+
+## Credential Files
+
+### Tdarr Server
+**File**: `/root/.truenascreds`
+```
+username=plex
+password=[password]
+domain=
+```
+
+### Local Workstation
+**File**: `/home/cal/.samba_credentials`
+```
+username=plex
+password=[password]
+domain=
+```
+
+**Security**: Both files should be owned by root with 600 permissions:
+```bash
+sudo chown root:root /path/to/credentials
+sudo chmod 600 /path/to/credentials
+```
+
+## Mount Commands
+
+### Apply New Configuration
+```bash
+# Unmount existing mounts
+sudo umount /mnt/media
+sudo umount /mnt/cals-files
+
+# Mount with new optimized settings
+sudo mount /mnt/media
+sudo mount /mnt/cals-files
+```
+
+### Verify Mount Options
+```bash
+mount | grep -E 'media|cals-files'
+```
+
+### Test Performance
+```bash
+# Test read performance (100MB test)
+time dd if="/mnt/media/path/to/large/file.mkv" bs=1M count=100 of=/dev/null
+```
+
+## Troubleshooting
+
+### Common Issues
+
+**Mount Error: Device or resource busy**
+- Solution: Unmount existing mount first: `sudo umount /mnt/media`
+
+**Parse Error in fstab**
+- Check for missing `//` before IP address
+- Ensure no spaces in comma-separated options
+- Verify all commas are present between options
+
+**Permission Denied**
+- Verify credential file exists and has correct permissions (600)
+- Check username/password in credential file
+- Ensure TrueNAS user has access to the share
+
+**Slow Performance After Changes**
+- Verify new mount options are active: `mount | grep media`
+- Test with different file sizes to confirm improvement
+- Check network connectivity: `ping 10.10.0.35`
+
+### Performance Testing Commands
+```bash
+# Quick performance test (50MB)
+time dd if="/mnt/media/Movies/[movie-file]" bs=1M count=50 of=/dev/null
+
+# Larger performance test (100MB)
+time dd if="/mnt/media/Movies/[movie-file]" bs=1M count=100 of=/dev/null
+
+# Monitor network during transfer
+iftop -i eth0
+```
+
+## Backup and Rollback
+
+### Before Making Changes
+```bash
+# Backup fstab
+sudo cp /etc/fstab /etc/fstab.backup-$(date +%Y%m%d-%H%M%S)
+```
+
+### Rollback if Needed
+```bash
+# Restore from backup
+sudo cp /etc/fstab.backup-YYYYMMDD-HHMMSS /etc/fstab
+
+# Remount
+sudo umount /mnt/media
+sudo mount /mnt/media
+```
+
+## System-Specific Notes
+
+### Tdarr Integration
+- **Unmapped Node Architecture**: Node downloads files from Tdarr server at optimized 103 MB/s
+- **Cache Directory**: Uses local NVMe storage for maximum transcoding performance
+- **No Direct NAS Access**: Unmapped nodes don't directly access NAS mounts
+
+### Local Workstation Usage
+- **Media Browsing**: 85.4 MB/s for fast local media access
+- **File Management**: Significantly faster file operations
+- **Backup Operations**: Improved speeds for large file transfers
+
+## Future Expansion
+
+When adding new systems, use these optimized settings as the baseline:
+
+```bash
+//10.10.0.35/media /mnt/media cifs credentials=/path/to/creds,uid=1000,gid=1000,vers=3.1.1,cache=loose,rsize=16777216,wsize=16777216,bsize=4194304,actimeo=30,closetimeo=5,echo_interval=30,noperm 0 0
+```
+
+Adjust `uid`, `gid`, and credential path as needed for each system.
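
The baseline entry above can also be generated per system rather than hand-edited. The following is a minimal sketch, assuming the option values quoted in this document; `make_fstab_entry` is a hypothetical helper name and is not part of the repository.

```bash
# Hypothetical helper (not in the repository): print one /etc/fstab line
# using the baseline CIFS options quoted in this document.
# Arguments: share name, mount point, credentials file, uid, gid.
make_fstab_entry() {
    # Option values copied from the baseline entry; adjust if the baseline changes.
    opts="vers=3.1.1,cache=loose,rsize=16777216,wsize=16777216,bsize=4194304"
    opts="$opts,actimeo=30,closetimeo=5,echo_interval=30,noperm"
    printf '//10.10.0.35/%s %s cifs credentials=%s,uid=%s,gid=%s,%s 0 0\n' \
        "$1" "$2" "$3" "$4" "$5" "$opts"
}

# Example: reproduce the baseline entry for the media share.
make_fstab_entry media /mnt/media /path/to/creds 1000 1000
```

The function only prints the line. Reviewing the output before appending it to `/etc/fstab` keeps the backup-and-rollback steps above in the loop.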
+
+## Related Documentation
+- [SSH Key Management](ssh-key-management.md) - For secure access to systems
+- [Tdarr Troubleshooting](../docker/tdarr-troubleshooting.md) - For Tdarr-specific issues
+- [Network Troubleshooting](ssh-troubleshooting.md) - For general network issues
+
+---
+*Last updated: August 10, 2025*
+*Performance improvements: Tdarr Server 67% faster, Local Workstation 669% faster*
\ No newline at end of file
diff --git a/scripts/monitoring/setup-discord-monitoring.md b/scripts/monitoring/setup-discord-monitoring.md
new file mode 100644
index 0000000..9d1fd09
--- /dev/null
+++ b/scripts/monitoring/setup-discord-monitoring.md
@@ -0,0 +1,191 @@
+# Tdarr Discord Monitoring Setup Guide
+
+## Overview
+This guide sets up automated Discord notifications for Tdarr worker timeouts, stalls, and completions using a custom log monitoring script.
+
+## Prerequisites
+- Discord server where you want notifications
+- Administrative access to create webhooks
+- Tdarr server accessible via SSH
+- Podman/Docker access to Tdarr node
+
+## Setup Steps
+
+### 1. Create Discord Webhook
+1. Go to your Discord server → **Server Settings** → **Integrations** → **Webhooks**
+2. Click **Create Webhook**
+3. Name it "Tdarr Monitor" and select the channel for notifications
+4. Copy the **Webhook URL** (keep this secure!)
+
+### 2. Configure Monitoring Script
+Edit the script to add your webhook:
+```bash
+nano /mnt/NV2/Development/claude-home/scripts/monitoring/tdarr-timeout-monitor.sh
+```
+
+Update these lines:
+```bash
+DISCORD_WEBHOOK="https://discord.com/api/webhooks/YOUR_WEBHOOK_ID/YOUR_WEBHOOK_TOKEN"
+SERVER_HOST="tdarr" # Your SSH alias for Tdarr server
+NODE_CONTAINER="tdarr-node-gpu-unmapped" # Your node container name
+```
+
+### 3. Make Script Executable
+```bash
+chmod +x /mnt/NV2/Development/claude-home/scripts/monitoring/tdarr-timeout-monitor.sh
+```
+
+### 4.
Test the Script
+```bash
+/mnt/NV2/Development/claude-home/scripts/monitoring/tdarr-timeout-monitor.sh
+```
+
+You should see a "monitoring started" message in your Discord channel.
+
+### 5. Setup Automated Monitoring (Choose One)
+
+#### Option A: Cron Job (Simple)
+```bash
+# Edit crontab
+crontab -e
+
+# Add this line to check every 5 minutes
+*/5 * * * * /mnt/NV2/Development/claude-home/scripts/monitoring/tdarr-timeout-monitor.sh >/dev/null 2>&1
+```
+
+#### Option B: Systemd Service (Advanced)
+Create a systemd service for more reliable monitoring:
+
+```bash
+sudo nano /etc/systemd/system/tdarr-monitor.service
+```
+
+Content:
+```ini
+[Unit]
+Description=Tdarr Timeout Monitor
+After=network.target
+
+[Service]
+Type=oneshot
+User=cal
+ExecStart=/mnt/NV2/Development/claude-home/scripts/monitoring/tdarr-timeout-monitor.sh
+```
+
+Create timer:
+```bash
+sudo nano /etc/systemd/system/tdarr-monitor.timer
+```
+
+Content:
+```ini
+[Unit]
+Description=Run Tdarr Monitor every 5 minutes
+Requires=tdarr-monitor.service
+
+[Timer]
+OnCalendar=*:0/5
+Persistent=true
+
+[Install]
+WantedBy=timers.target
+```
+
+Enable and start:
+```bash
+sudo systemctl daemon-reload
+sudo systemctl enable tdarr-monitor.timer
+sudo systemctl start tdarr-monitor.timer
+```
+
+## Notification Examples
+
+### Worker Timeout Alert
+```
+🎬 Tdarr Monitoring Alert
+⚠️ 4 file(s) timed out in staging:
+```
+TV/Survivor/Season 48/Survivor (2000) - S48E04...
+TV/Survivor/Season 48/Survivor (2000) - S48E11...
+TV/Survivor/Season 26/Survivor (2000) - S26E05...
+```
+Files were removed from staging and will retry.
+```
+
+### Worker Stall Alert
+```
+🎬 Tdarr Monitoring Alert
+🔴 2 worker stall(s) detected:
+```
+Worker eager-eyas
+Worker oblong-owl
+```
+Workers were cancelled and will restart.
+```
+
+### Success Notification (Optional)
+```
+🎬 Tdarr Monitoring Alert
+✅ 3 transcode(s) completed successfully in the last check period.
+```
+
+## Monitoring Features
+
+✅ **Server Limbo Timeouts** - Files stuck in staging > timeout period
+✅ **Node Worker Stalls** - Workers that hang during transcoding
+✅ **Success Notifications** - Optional completion alerts
+✅ **Smart Timing** - Only checks every 60+ seconds to avoid spam
+✅ **Rich Discord Embeds** - Color-coded messages with timestamps
+
+## Customization Options
+
+### Disable Success Messages
+Edit the script and comment out this line:
+```bash
+# check_completions # Comment out to disable success notifications
+```
+
+### Change Check Frequency
+For cron job, modify the timing:
+```bash
+# Check every 10 minutes instead of 5
+*/10 * * * * /mnt/NV2/Development/claude-home/scripts/monitoring/tdarr-timeout-monitor.sh >/dev/null 2>&1
+```
+
+For systemd timer, update OnCalendar (unit files do not support trailing comments, so keep the comment on its own line):
+```ini
+# Check every 10 minutes
+OnCalendar=*:0/10
+```
+
+### Add More Monitoring
+You can extend the script to monitor:
+- Disk space on cache directory
+- Network connectivity to TrueNAS
+- GPU utilization during transcoding
+- Queue depth and processing rates
+
+## Troubleshooting
+
+### No Notifications Received
+1. Check webhook URL is correct and accessible
+2. Test webhook manually:
+   ```bash
+   curl -H "Content-Type: application/json" -X POST -d '{"content":"Test message"}' "YOUR_WEBHOOK_URL"
+   ```
+3. Check script logs: `/tmp/tdarr-monitor/`
+
+### False Positives
+- Adjust the timing logic in the script
+- Filter out specific log patterns that aren't actual errors
+- Tune the timeout thresholds
+
+### Missing SSH Access
+- Ensure SSH key authentication is set up for the `tdarr` server
+- Test: `ssh tdarr "echo 'SSH working'"`
+
+## Security Notes
+- Keep your Discord webhook URL private
+- Consider using environment variables for sensitive data
+- Restrict file permissions on the script (chmod 750)
+
+---
+*This monitoring solution provides real-time alerts for Tdarr issues without requiring external monitoring infrastructure.*
\ No newline at end of file
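
The security note about environment variables and the manual `curl` test from the setup guide can be combined. The sketch below is one possible shape, not the contents of `tdarr-timeout-monitor.sh`: `json_escape` and `build_payload` are hypothetical helpers, and although the patch mentions a `send_discord_notification` function, this body is an assumption. It reads `DISCORD_WEBHOOK` from the environment and escapes only backslashes and double quotes, so multi-line alert text would need extra handling.

```bash
#!/usr/bin/env bash
# Sketch only: helper names and bodies are assumptions, not the real script.

# Escape backslashes and double quotes so text is safe inside a JSON string.
# Newlines are NOT handled here.
json_escape() {
    printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Build the same minimal body used in the manual curl test: {"content":"..."}
build_payload() {
    printf '{"content":"%s"}' "$(json_escape "$1")"
}

# Post a message, taking the webhook URL from the environment rather than
# hard-coding it in the script.
send_discord_notification() {
    if [ -z "${DISCORD_WEBHOOK:-}" ]; then
        echo "DISCORD_WEBHOOK is not set" >&2
        return 1
    fi
    curl -sS -H "Content-Type: application/json" -X POST \
        -d "$(build_payload "$1")" "$DISCORD_WEBHOOK"
}

# Usage (URL placeholder as in the setup guide):
#   export DISCORD_WEBHOOK="https://discord.com/api/webhooks/YOUR_WEBHOOK_ID/YOUR_WEBHOOK_TOKEN"
#   send_discord_notification "Test message"
```

With cron or a systemd unit, the variable would have to be supplied by that environment (for example a root-readable file sourced by the job), which keeps the token out of the git-tracked script.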