CLAUDE: Update Tdarr context for ubuntu-manticore deployment

Rewrote documentation to reflect current deployment on ubuntu-manticore
(10.10.0.226) with actual performance metrics and queue status:
- Server specs: Ubuntu 24.04, GTX 1070, Docker Compose
- Storage: NFS media (48TB) + local NVMe cache (1.9TB)
- Performance: ~13 files/hour, 64% compression, HEVC output
- Queue: 7,675 pending, 37,406 total jobs processed
- Added operational commands, API access, GPU sharing notes
- Moved gaming-aware scheduler to legacy section (not needed on dedicated server)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Author: Cal Corum
Date: 2025-12-07 01:17:27 -06:00
Parent: 117788f216
Commit: b8b4b13130


# Tdarr Transcoding System - Technology Context
## Overview
Tdarr is a distributed transcoding system that converts media files to optimized formats. The current deployment runs on a dedicated Ubuntu server with GPU transcoding and NFS-based media storage.
## Current Deployment
### Server: ubuntu-manticore (10.10.0.226)
- **OS**: Ubuntu 24.04.3 LTS (Noble Numbat)
- **GPU**: NVIDIA GeForce GTX 1070 (8GB VRAM)
- **Driver**: 570.195.03
- **Container Runtime**: Docker with Compose
- **Web UI**: http://10.10.0.226:8265
### Storage Architecture
| Mount | Source | Purpose |
|-------|--------|---------|
| `/mnt/truenas/media` | NFS from 10.10.0.35 | Media library (48TB total, ~29TB used) |
| `/mnt/NV2/tdarr-cache` | Local NVMe | Transcode work directory (1.9TB, ~40% used) |
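A quick host-side check that both paths from the table are mounted and have headroom (paths as listed above; assumes SSH access to the server):
```bash
# Confirm the NFS mount is present and both filesystems have free space
ssh 10.10.0.226 'findmnt -T /mnt/truenas/media -o TARGET,SOURCE,FSTYPE; df -h /mnt/truenas/media /mnt/NV2/tdarr-cache'
```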
### Container Configuration
**Location**: `/home/cal/docker/tdarr/docker-compose.yml`
```yaml
version: "3.8"
services:
  tdarr:
    image: ghcr.io/haveagitgat/tdarr:latest
    container_name: tdarr-server
    restart: unless-stopped
    ports:
      - "8265:8265" # Web UI
      - "8266:8266" # Server port (for nodes)
    environment:
      - PUID=1000
      - PGID=1000
      - TZ=America/Chicago
      - serverIP=0.0.0.0
      - serverPort=8266
      - webUIPort=8265
    volumes:
      - ./server-data:/app/server
      - ./configs:/app/configs
      - ./logs:/app/logs
      - /mnt/truenas/media:/media
  tdarr-node:
    image: ghcr.io/haveagitgat/tdarr_node:latest
    container_name: tdarr-node
    restart: unless-stopped
    environment:
      - PUID=1000
      - PGID=1000
      - TZ=America/Chicago
      - serverIP=tdarr
      - serverPort=8266
      - nodeName=manticore-gpu
    volumes:
      - ./node-data:/app/configs
      - /mnt/truenas/media:/media
      - /mnt/NV2/tdarr-cache:/temp
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    depends_on:
      - tdarr
```
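After editing the compose file, the standard Docker Compose workflow applies (nothing here is Tdarr-specific):
```bash
cd /home/cal/docker/tdarr
docker compose config -q   # validate the file; silent on success
docker compose up -d       # create or update tdarr-server and tdarr-node in place
docker compose ps          # confirm both containers report a running status
```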
### Node Configuration
- **Node Name**: manticore-gpu
- **Node Type**: Mapped (both server and node access same NFS mount)
- **Workers**: 1 GPU transcode worker, 4 GPU healthcheck workers
- **Schedule**: Disabled (runs 24/7)
### Current Queue Status (Dec 2025)
| Metric | Value |
|--------|-------|
| Transcode Queue | ~7,675 files |
| Success/Not Required | 8,378 files |
| Healthy Files | 16,628 files |
| Job History | 37,406 total jobs |
### Performance Metrics
- **Throughput**: ~13 files/hour (varies by file size)
- **Average Compression**: ~64% of original size (35% space savings)
- **Codec**: HEVC (h265) output at 1080p
- **Typical File Sizes**: 3-7 GB input → 2-4.5 GB output
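A rough back-of-envelope from the figures above (actual time varies with file sizes and GPU contention):
```bash
# ~7,675 queued files at ~13 files/hour ≈ 590 hours, i.e. roughly 25 days of continuous transcoding
echo "scale=1; 7675 / 13 / 24" | bc
```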
## Architecture Patterns
### Mapped Node with Shared Storage
**Pattern**: Server and node share the same media mount via NFS
- **Advantage**: Simpler configuration, no file transfer overhead
- **Trade-off**: Depends on stable NFS connection during transcoding
**When to Use**:
- Dedicated transcoding server (not a gaming/desktop system)
- Reliable network storage infrastructure
- Single-node deployments
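A quick spot check that server and node see the same library tree through the shared mount (container names from the compose file above):
```bash
# Both listings should show the same top-level library folders
ssh 10.10.0.226 'docker exec tdarr-server ls /media | head -3; docker exec tdarr-node ls /media | head -3'
```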
### Local NVMe Cache
Work directory on local NVMe (`/mnt/NV2/tdarr-cache:/temp`) provides:
- Fast read/write for transcode operations
- Isolation from network latency during processing
- Sufficient space for large remux files (1TB+ available)
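Leftover work directories can accumulate in the cache after interrupted jobs; a simple sweep (the one-day threshold is an arbitrary choice, adjust to taste):
```bash
# Show cache usage and any work directories untouched for more than a day
ssh 10.10.0.226 'df -h /mnt/NV2/tdarr-cache; find /mnt/NV2/tdarr-cache -mindepth 1 -maxdepth 1 -type d -mtime +1 -exec du -sh {} +'
```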
## Operational Notes
### Recent Activity
System is actively processing with strong throughput. Recent successful transcodes include:
- Dead Like Me (2003) - multiple episodes
- Supernatural (2005) - S03 episodes
- I Dream of Jeannie (1965) - S01 episodes
- Da Vinci's Demons (2013) - S01 episodes
### Minor Issues
- **Occasional File Not Found (400)**: Files deleted/moved while queued fail after 5 retries
  - Impact: Minimal - system continues processing remaining queue
  - Resolution: Automatic - failed files are skipped
### Monitoring
- **Server Logs**: `/home/cal/docker/tdarr/logs/Tdarr_Server_Log.txt`
- **Docker Logs**: `docker logs tdarr-server` / `docker logs tdarr-node`
- **Library Scans**: Automatic hourly scans (2 libraries: ZWgKkmzJp, EjfWXCdU8)
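To spot-check the hourly library scans or recent errors in the server log (log path from above; the grep pattern is a guess at the log wording, adjust as needed):
```bash
ssh 10.10.0.226 'grep -iE "error|scan" /home/cal/docker/tdarr/logs/Tdarr_Server_Log.txt | tail -20'
```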
### Common Operations
**Check Status**:
```bash
ssh 10.10.0.226 "docker ps --format 'table {{.Names}}\t{{.Status}}' | grep tdarr"
```
**View Recent Logs**:
```bash
ssh 10.10.0.226 "docker logs tdarr-node --since 1h 2>&1 | tail -50"
```
**Restart Services**:
```bash
ssh 10.10.0.226 "cd /home/cal/docker/tdarr && docker compose restart"
```
**Check GPU Usage**:
```bash
ssh 10.10.0.226 "nvidia-smi"
```
### API Access
Base URL: `http://10.10.0.226:8265/api/v2/`
**Get Node Status**:
```bash
curl -s "http://10.10.0.226:8265/api/v2/get-nodes" | jq '.'
```
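A quick reachability check against the same endpoint, useful from a monitoring host (prints only the HTTP status code; expect 200 when the server is up):
```bash
curl -s -o /dev/null -w '%{http_code}\n' "http://10.10.0.226:8265/api/v2/get-nodes"
```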
## GPU Resource Sharing
This server also runs Jellyfin with GPU transcoding. Coordinate usage:
- Tdarr uses NVENC for encoding
- Jellyfin uses NVDEC for decoding
- Both can run simultaneously for different workloads
- Monitor GPU memory if running concurrent heavy transcodes
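To watch encoder/decoder contention while both services are busy, `nvidia-smi dmon` shows per-second utilization alongside overall GPU load:
```bash
# Columns of interest: enc (NVENC, Tdarr) and dec (NVDEC, Jellyfin); take 10 samples
ssh 10.10.0.226 'nvidia-smi dmon -s u -c 10'
```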
## Legacy: Gaming-Aware Architecture
The previous deployment on the local desktop used an unmapped node architecture with gaming detection. This is preserved for reference but not currently in use:
### Unmapped Node Pattern (Historical)
For gaming desktops requiring GPU priority management:
- Node downloads files to local cache before processing
- Gaming detection pauses transcoding automatically
- Scheduler script (`scripts/tdarr-schedule-manager.sh`) manages time windows, e.g. `"22-07:daily"` for overnight-only transcoding
**When to Consider**:
- Transcoding on a gaming/desktop system
- Need GPU priority for interactive applications
- Multiple nodes across network
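For reference, a minimal sketch of the kind of check the legacy `scripts/tdarr-schedule-manager.sh` performed (gaming-process detection plus the >15% GPU threshold). This illustrates the pattern only; the process list and container name are assumptions, and the desktop deployment actually ran the node under Podman:
```bash
#!/usr/bin/env bash
# Pause the Tdarr node while a game is running or the GPU is already busy.
GPU_THRESHOLD=15   # percent, mirroring the legacy >15% rule

gaming_active() {
  # Detect common gaming launchers/runtimes (list is illustrative)
  pgrep -f 'steam|lutris|wine' > /dev/null
}

gpu_busy() {
  local util
  util=$(nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits | head -n1)
  [ "${util:-0}" -gt "$GPU_THRESHOLD" ]
}

if gaming_active || gpu_busy; then
  docker stop tdarr-node    # yield the GPU to the interactive workload
else
  docker start tdarr-node   # resume transcoding when the window is clear
fi
```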
## Best Practices
### For Current Deployment
1. Monitor NFS stability - Tdarr depends on reliable media access
2. Check cache disk space periodically (`df -h /mnt/NV2`)
3. Review the queue for stale files after media library changes
4. Leave GPU memory headroom for concurrent Jellyfin usage
### Error Prevention
1. **Plugin Updates**: Automatic hourly plugin sync from server
2. **Retry Logic**: 5 attempts with exponential backoff for file operations
3. **Container Health**: `restart: unless-stopped` ensures recovery
### Troubleshooting Patterns
1. **File Not Found**: Source was deleted or moved - clear it from the queue via the UI
2. **Slow Transcodes**: Check NFS latency and GPU utilization
3. **Node Disconnected**: Restart the node container and check server connectivity
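Two quick probes that cover the common cases above - a crude NFS responsiveness check and a scan of recent node errors (the grep pattern is a guess; adjust to the actual log wording):
```bash
# A directory listing on the NFS mount should return in well under a second
ssh 10.10.0.226 'time ls /mnt/truenas/media > /dev/null'
# Recent node-side errors
ssh 10.10.0.226 'docker logs tdarr-node --since 30m 2>&1 | grep -i error | tail -20'
```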
## Space Savings Estimate
With ~7,675 files in queue averaging 35% reduction:
- If average input is 5 GB → saves ~1.75 GB per file
- Potential savings: ~13 TB when queue completes
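The ~13 TB figure follows directly from those assumptions:
```bash
# 7,675 files x 5 GB x 35% reduction ≈ 13.4 TB (decimal TB)
echo "scale=1; 7675 * 5 * 0.35 / 1000" | bc
```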
This technology context reflects the ubuntu-manticore deployment as of December 2025.