CLAUDE: Update Tdarr context for ubuntu-manticore deployment

Rewrote documentation to reflect current deployment on ubuntu-manticore
(10.10.0.226) with actual performance metrics and queue status:
- Server specs: Ubuntu 24.04, GTX 1070, Docker Compose
- Storage: NFS media (48TB) + local NVMe cache (1.9TB)
- Performance: ~13 files/hour, 64% compression, HEVC output
- Queue: 7,675 pending, 37,406 total jobs processed
- Added operational commands, API access, GPU sharing notes
- Moved gaming-aware scheduler to legacy section (not needed on dedicated server)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Cal Corum 2025-12-07 01:17:27 -06:00
parent 117788f216
commit b8b4b13130

# Tdarr Transcoding System - Technology Context
## Overview
Tdarr is a distributed transcoding system that converts media files to optimized formats. The current deployment runs on a dedicated Ubuntu server with GPU transcoding and NFS-based media storage.
## Current Deployment
### Server: ubuntu-manticore (10.10.0.226)
- **OS**: Ubuntu 24.04.3 LTS (Noble Numbat)
- **GPU**: NVIDIA GeForce GTX 1070 (8GB VRAM)
- **Driver**: 570.195.03
- **Container Runtime**: Docker with Compose
- **Web UI**: http://10.10.0.226:8265
### Storage Architecture
| Mount | Source | Purpose |
|-------|--------|---------|
| `/mnt/truenas/media` | NFS from 10.10.0.35 | Media library (48TB total, ~29TB used) |
| `/mnt/NV2/tdarr-cache` | Local NVMe | Transcode work directory (1.9TB, ~40% used) |
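A quick way to confirm both mounts are healthy before work starts (a sketch using the paths from the table above):
```bash
# Verify the NFS media mount and the local cache are mounted and have headroom
ssh 10.10.0.226 "df -h /mnt/truenas/media /mnt/NV2/tdarr-cache && mount | grep -E 'truenas|NV2'"
```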
### Container Configuration
**Location**: `/home/cal/docker/tdarr/docker-compose.yml`
```yaml
version: "3.8"
services:
  tdarr:
    image: ghcr.io/haveagitgat/tdarr:latest
    container_name: tdarr-server
    restart: unless-stopped
    ports:
      - "8265:8265" # Web UI
      - "8266:8266" # Server port (for nodes)
    environment:
      - PUID=1000
      - PGID=1000
      - TZ=America/Chicago
      - serverIP=0.0.0.0
      - serverPort=8266
      - webUIPort=8265
    volumes:
      - ./server-data:/app/server
      - ./configs:/app/configs
      - ./logs:/app/logs
      - /mnt/truenas/media:/media

  tdarr-node:
    image: ghcr.io/haveagitgat/tdarr_node:latest
    container_name: tdarr-node
    restart: unless-stopped
    environment:
      - PUID=1000
      - PGID=1000
      - TZ=America/Chicago
      - serverIP=tdarr
      - serverPort=8266
      - nodeName=manticore-gpu
    volumes:
      - ./node-data:/app/configs
      - /mnt/truenas/media:/media
      - /mnt/NV2/tdarr-cache:/temp
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    depends_on:
      - tdarr
```
### Node Configuration
- **Node Name**: manticore-gpu
- **Node Type**: Mapped (both server and node access same NFS mount)
- **Workers**: 1 GPU transcode worker, 4 GPU healthcheck workers
- **Schedule**: Disabled (runs 24/7)
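A quick sanity check that the node container can actually see the GTX 1070 (assumes the NVIDIA container toolkit is set up, as implied by the GPU reservation in the compose file above):
```bash
# nvidia-smi run inside the node container should list the GTX 1070
ssh 10.10.0.226 "docker exec tdarr-node nvidia-smi --query-gpu=name,memory.used,memory.total --format=csv"
```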
### Current Queue Status (Dec 2025)
| Metric | Value |
|--------|-------|
| Transcode Queue | ~7,675 files |
| Success/Not Required | 8,378 files |
| Healthy Files | 16,628 files |
| Job History | 37,406 total jobs |
### Performance Metrics
- **Throughput**: ~13 files/hour (varies by file size)
- **Average Compression**: ~64% of original size (35% space savings)
- **Codec**: HEVC (h265) output at 1080p
- **Typical File Sizes**: 3-7 GB input → 2-4.5 GB output
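At this throughput the current queue implies roughly 25 days of continuous processing; a back-of-the-envelope sketch using the figures above:
```bash
# Rough ETA for the current queue at ~13 files/hour
queue=7675; rate=13
hours=$(( queue / rate ))
echo "~${hours} hours (~$(( hours / 24 )) days) at current throughput"
```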
## Architecture Patterns
### Mapped Node with Shared Storage
**Pattern**: Server and node share the same media mount via NFS
- **Advantage**: Simpler configuration, no file transfer overhead
- **Trade-off**: Depends on stable NFS connection during transcoding
**When to Use**:
- Dedicated transcoding server (not a gaming/desktop system)
- Reliable network storage infrastructure
- Single-node deployments
### Local NVMe Cache
Work directory on local NVMe (`/mnt/NV2/tdarr-cache:/temp`) provides:
- Fast read/write for transcode operations
- Isolation from network latency during processing
- Sufficient space for large remux files (1TB+ available)
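A sketch for keeping an eye on cache headroom and stale work directories (the assumption that leftover job folders sit directly under `/mnt/NV2/tdarr-cache` may need adjusting to Tdarr's actual layout):
```bash
# Cache usage plus any work directories untouched for more than a day
ssh 10.10.0.226 "df -h /mnt/NV2 && find /mnt/NV2/tdarr-cache -maxdepth 1 -type d -mtime +1 -exec du -sh {} +"
```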
## Operational Notes
### Recent Activity
System is actively processing with strong throughput. Recent successful transcodes include:
- Dead Like Me (2003) - multiple episodes
- Supernatural (2005) - S03 episodes
- I Dream of Jeannie (1965) - S01 episodes
- Da Vinci's Demons (2013) - S01 episodes
### Minor Issues
- **Occasional File Not Found (400)**: Files deleted/moved while queued fail after 5 retries
- Impact: Minimal - system continues processing remaining queue
- Resolution: Automatic - failed files are skipped
### Monitoring
- **Server Logs**: `/home/cal/docker/tdarr/logs/Tdarr_Server_Log.txt`
- **Docker Logs**: `docker logs tdarr-server` / `docker logs tdarr-node`
- **Library Scans**: Automatic hourly scans (2 libraries: ZWgKkmzJp, EjfWXCdU8)
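To surface recent problems from the server log, a minimal sketch (the grep patterns are generic guesses, not a documented Tdarr log format):
```bash
ssh 10.10.0.226 "grep -iE 'error|fail' /home/cal/docker/tdarr/logs/Tdarr_Server_Log.txt | tail -20"
```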
### Common Operations
**Check Status**:
```bash
ssh 10.10.0.226 "docker ps --format 'table {{.Names}}\t{{.Status}}' | grep tdarr"
```
**View Recent Logs**:
```bash
ssh 10.10.0.226 "docker logs tdarr-node --since 1h 2>&1 | tail -50"
```
**Restart Services**:
```bash
ssh 10.10.0.226 "cd /home/cal/docker/tdarr && docker compose restart"
```
**Check GPU Usage**:
```bash
ssh 10.10.0.226 "nvidia-smi"
```
### API Access
Base URL: `http://10.10.0.226:8265/api/v2/`
**Get Node Status**:
```bash
curl -s "http://10.10.0.226:8265/api/v2/get-nodes" | jq '.'
```
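To pull just the node names out of that response, a sketch (assumes the payload contains `nodeName` fields somewhere; jq's recursive descent finds them regardless of nesting, so inspect the raw JSON if nothing prints):
```bash
curl -s "http://10.10.0.226:8265/api/v2/get-nodes" | jq '.. | .nodeName? // empty'
```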
## GPU Resource Sharing
This server also runs Jellyfin with GPU transcoding. Coordinate usage:
- Tdarr uses NVENC for encoding
- Jellyfin uses NVDEC for decoding
- Both can run simultaneously for different workloads
- Monitor GPU memory if running concurrent heavy transcodes
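Since both services share the card, encoder/decoder load can be sampled separately to see which workload is busy; `dmon -s u` prints per-second sm/mem/enc/dec utilization columns:
```bash
ssh 10.10.0.226 "nvidia-smi dmon -s u -c 5"
```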
## Legacy: Gaming-Aware Architecture
The previous deployment on the local desktop used an unmapped node architecture with gaming detection. This is preserved for reference but not currently in use:
### Unmapped Node Pattern (Historical)
For gaming desktops requiring GPU priority management:
- Node downloads files to local cache before processing
- Gaming detection pauses transcoding automatically
- Scheduler script manages time windows
**When to Consider**:
- Transcoding on a gaming/desktop system
- Need GPU priority for interactive applications
- Multiple nodes across network
## Best Practices
### For Current Deployment
1. Monitor NFS stability - Tdarr depends on reliable media access
2. Check cache disk space periodically (`df -h /mnt/NV2`)
3. Review queue for stale files after media library changes
4. GPU memory: Leave headroom for Jellyfin concurrent usage
### Error Prevention
1. **Plugin Updates**: Automatic hourly plugin sync from server
2. **Retry Logic**: 5 attempts with exponential backoff for file operations
3. **Container Health**: `restart: unless-stopped` ensures recovery
### Troubleshooting Patterns
1. **File Not Found**: Source was deleted - clear from queue via UI
2. **Slow Transcodes**: Check NFS latency, GPU utilization
3. **Node Disconnected**: Restart node container, check server connectivity
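For the slow-transcode case, a combined check of both usual suspects (assumes `nfsiostat` from nfs-common is installed on the host):
```bash
# GPU load plus NFS latency for the media mount (3 samples, 2s apart)
ssh 10.10.0.226 "nvidia-smi --query-gpu=utilization.gpu,utilization.memory --format=csv && nfsiostat 2 3 /mnt/truenas/media"
```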
## Space Savings Estimate
With ~7,675 files in queue averaging 35% reduction:
- If average input is 5 GB → saves ~1.75 GB per file
- Potential savings: ~13 TB when queue completes
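The same estimate as a one-liner, so it can be re-run when the queue size or the assumed 5 GB average changes:
```bash
# queue size x average input GB x reduction ratio, converted to TB
awk -v q=7675 -v g=5 -v r=0.35 'BEGIN { printf "~%.1f TB projected savings\n", q*g*r/1024 }'
```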
This technology context reflects the ubuntu-manticore deployment as of December 2025.