diff --git a/tdarr/CONTEXT.md b/tdarr/CONTEXT.md
index 0f6b399..3c83ed5 100644
--- a/tdarr/CONTEXT.md
+++ b/tdarr/CONTEXT.md
@@ -1,152 +1,204 @@
 # Tdarr Transcoding System - Technology Context
 
 ## Overview
-Tdarr is a distributed transcoding system that converts media files to optimized formats. This implementation uses an intelligent gaming-aware scheduler with unmapped node architecture for optimal performance and system stability.
+Tdarr is a distributed transcoding system that converts media files to optimized formats. The current deployment runs on a dedicated Ubuntu server with GPU transcoding and NFS-based media storage.
-## Architecture Patterns
+## Current Deployment
-### Distributed Unmapped Node Architecture (Recommended)
-**Pattern**: Server-Node separation with local high-speed cache
-- **Server**: Tdarr Server manages queue, web interface, and coordination
-- **Node**: Unmapped nodes with local NVMe cache for processing
-- **Benefits**: 3-5x performance improvement, network I/O reduction, linear scaling
+### Server: ubuntu-manticore (10.10.0.226)
+- **OS**: Ubuntu 24.04.3 LTS (Noble Numbat)
+- **GPU**: NVIDIA GeForce GTX 1070 (8GB VRAM)
+- **Driver**: 570.195.03
+- **Container Runtime**: Docker with Compose
+- **Web UI**: http://10.10.0.226:8265
-**When to Use**:
-- Multiple transcoding nodes across network
-- High-performance requirements (10GB+ files)
-- Network bandwidth limitations
-- Gaming systems requiring GPU priority management
+### Storage Architecture
+| Mount | Source | Purpose |
+|-------|--------|---------|
+| `/mnt/truenas/media` | NFS from 10.10.0.35 | Media library (48TB total, ~29TB used) |
+| `/mnt/NV2/tdarr-cache` | Local NVMe | Transcode work directory (1.9TB, ~40% used) |
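+
+Both mounts can be verified quickly from the host. This is a minimal check sketch (paths are taken from the table above; the output format is whatever `findmnt` and `df` report on this system):
+
+```bash
+# Confirm the NFS media mount is present and shows the expected source
+findmnt /mnt/truenas/media
+
+# Check remaining headroom on the media share and the local transcode cache
+df -h /mnt/truenas/media /mnt/NV2/tdarr-cache
+```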
-### Configuration Principles
-1. **Cache Optimization**: Use local NVMe storage for work directories
-2. **Gaming Detection**: Automatic pause during GPU-intensive activities
-3. **Resource Isolation**: Container limits prevent kernel-level crashes
-4. **Monitoring Integration**: Automated cleanup and Discord notifications
+### Container Configuration
+**Location**: `/home/cal/docker/tdarr/docker-compose.yml`
-## Core Components
-
-### Gaming-Aware Scheduler
-**Purpose**: Automatically manages Tdarr node to avoid conflicts with gaming
-**Location**: `scripts/tdarr-schedule-manager.sh`
-
-**Key Features**:
-- Detects gaming processes (Steam, Lutris, Wine, etc.)
-- GPU usage monitoring (>15% threshold)
-- Configurable time windows
-- Automated temporary directory cleanup
-
-**Schedule Format**: `"HOUR_START-HOUR_END:DAYS"`
-- `"22-07:daily"` - Overnight transcoding
-- `"09-17:1-5"` - Business hours weekdays only
-- `"14-16:6,7"` - Weekend afternoon window
-
-### Monitoring System
-**Purpose**: Prevents staging section timeouts and system instability
-**Location**: `scripts/monitoring/tdarr-timeout-monitor.sh`
-
-**Capabilities**:
-- Staging timeout detection (300-second hardcoded limit)
-- Automatic work directory cleanup
-- Discord notifications with user pings
-- Log rotation and retention management
-
-### Container Architecture
-**Server Configuration**:
 ```yaml
-# Hybrid storage with resource limits
+version: "3.8"
 services:
   tdarr:
     image: ghcr.io/haveagitgat/tdarr:latest
-    ports: ["8265:8266"]
+    container_name: tdarr-server
+    restart: unless-stopped
+    ports:
+      - "8265:8265"  # Web UI
+      - "8266:8266"  # Server port (for nodes)
+    environment:
+      - PUID=1000
+      - PGID=1000
+      - TZ=America/Chicago
+      - serverIP=0.0.0.0
+      - serverPort=8266
+      - webUIPort=8265
     volumes:
-      - "./tdarr-data:/app/configs"
-      - "/mnt/media:/media"
+      - ./server-data:/app/server
+      - ./configs:/app/configs
+      - ./logs:/app/logs
+      - /mnt/truenas/media:/media
+
+  tdarr-node:
+    image: ghcr.io/haveagitgat/tdarr_node:latest
+    container_name: tdarr-node
+    restart: unless-stopped
+    environment:
+      - PUID=1000
+      - PGID=1000
+      - TZ=America/Chicago
+      - serverIP=tdarr
+      - serverPort=8266
+      - nodeName=manticore-gpu
+    volumes:
+      - ./node-data:/app/configs
+      - /mnt/truenas/media:/media
+      - /mnt/NV2/tdarr-cache:/temp
+    deploy:
+      resources:
+        reservations:
+          devices:
+            - driver: nvidia
+              count: all
+              capabilities: [gpu]
+    depends_on:
+      - tdarr
 ```
 
-**Node Configuration**:
+### Node Configuration
+- **Node Name**: manticore-gpu
+- **Node Type**: Mapped (both server and node access the same NFS mount)
+- **Workers**: 1 GPU transcode worker, 4 GPU healthcheck workers
+- **Schedule**: Disabled (runs 24/7)
+
+### Current Queue Status (Dec 2025)
+| Metric | Value |
+|--------|-------|
+| Transcode Queue | ~7,675 files |
+| Success/Not Required | 8,378 files |
+| Healthy Files | 16,628 files |
+| Job History | 37,406 total jobs |
+
+### Performance Metrics
+- **Throughput**: ~13 files/hour (varies by file size)
+- **Average Compression**: ~64% of original size (~35% space savings)
+- **Codec**: HEVC (H.265) output at 1080p
+- **Typical File Sizes**: 3-7 GB input → 2-4.5 GB output
+
+## Architecture Patterns
+
+### Mapped Node with Shared Storage
+**Pattern**: Server and node share the same media mount via NFS
+- **Advantage**: Simpler configuration, no file transfer overhead
+- **Trade-off**: Depends on a stable NFS connection during transcoding
+
+**When to Use**:
+- Dedicated transcoding server (not a gaming/desktop system)
+- Reliable network storage infrastructure
+- Single-node deployments
+
+### Local NVMe Cache
+The work directory on local NVMe (`/mnt/NV2/tdarr-cache:/temp`) provides:
+- Fast read/write for transcode operations
+- Isolation from network latency during processing
+- Sufficient space for large remux files (1TB+ available)
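+
+A small watchdog along these lines can catch a filling cache before transcodes start failing. This is only a sketch: the 80% threshold and the plain `echo` alert are placeholders, not part of the current setup:
+
+```bash
+#!/usr/bin/env bash
+# Warn when the Tdarr work directory crosses a usage threshold.
+CACHE_DIR=/mnt/NV2/tdarr-cache
+THRESHOLD=80   # percent, arbitrary
+
+usage=$(df --output=pcent "$CACHE_DIR" | tail -1 | tr -dc '0-9')
+if [ "$usage" -ge "$THRESHOLD" ]; then
+  echo "Tdarr cache at ${usage}% - check for stale work folders in $CACHE_DIR"
+fi
+```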
+
+## Operational Notes
+
+### Recent Activity
+The system is actively processing with strong throughput. Recent successful transcodes include:
+- Dead Like Me (2003) - multiple episodes
+- Supernatural (2005) - S03 episodes
+- I Dream of Jeannie (1965) - S01 episodes
+- Da Vinci's Demons (2013) - S01 episodes
+
+### Minor Issues
+- **Occasional File Not Found (400)**: Files deleted or moved while queued fail after 5 retries
+  - Impact: Minimal - the system continues processing the remaining queue
+  - Resolution: Automatic - failed files are skipped
+
+### Monitoring
+- **Server Logs**: `/home/cal/docker/tdarr/logs/Tdarr_Server_Log.txt`
+- **Docker Logs**: `docker logs tdarr-server` / `docker logs tdarr-node`
+- **Library Scans**: Automatic hourly scans (2 libraries: ZWgKkmzJp, EjfWXCdU8)
+
+### Common Operations
+
+**Check Status**:
 ```bash
-# Unmapped node with local cache
-podman run -d \
-  --name tdarr-node-gpu \
-  -e nodeType=unmapped \
-  -v "/mnt/NV2/tdarr-cache:/cache" \
-  --device nvidia.com/gpu=all \
-  ghcr.io/haveagitgat/tdarr_node:latest
+ssh 10.10.0.226 "docker ps --format 'table {{.Names}}\t{{.Status}}' | grep tdarr"
 ```
-## Implementation Patterns
+**View Recent Logs**:
+```bash
+ssh 10.10.0.226 "docker logs tdarr-node --since 1h 2>&1 | tail -50"
+```
-### Performance Optimization
-1. **Local Cache Strategy**: Download → Process → Upload (vs. streaming)
-2. **Resource Limits**: Prevent memory exhaustion and kernel crashes
-3. **Network Resilience**: CIFS mount options for stability
-4. **Automated Cleanup**: Prevent accumulation of stuck directories
+**Restart Services**:
+```bash
+ssh 10.10.0.226 "cd /home/cal/docker/tdarr && docker compose restart"
+```
-### Error Prevention
-1. **Plugin Safety**: Null-safe forEach operations `(streams || []).forEach()`
-2. **Clean Installation**: Avoid custom plugin mounts causing version conflicts
-3. **Container Isolation**: Resource limits prevent system-level crashes
-4. **Network Stability**: Unmapped architecture reduces CIFS dependency
+**Check GPU Usage**:
+```bash
+ssh 10.10.0.226 "nvidia-smi"
+```
-### Gaming Integration
-1. **Process Detection**: Monitor for gaming applications and utilities
-2. **GPU Threshold**: Stop transcoding when GPU usage >15%
-3. **Time Windows**: Respect user-defined allowed transcoding hours
-4. **Manual Override**: Direct start/stop commands bypass scheduler
+### API Access
+Base URL: `http://10.10.0.226:8265/api/v2/`
-## Common Workflows
+**Get Node Status**:
+```bash
+curl -s "http://10.10.0.226:8265/api/v2/get-nodes" | jq '.'
+```
-### Initial Setup
-1. Start server with "Allow unmapped Nodes" enabled
-2. Configure node as unmapped with local cache
-3. Install gaming-aware scheduler via cron
-4. Set up monitoring system for automated cleanup
+## GPU Resource Sharing
+This server also runs Jellyfin with GPU transcoding. Coordinate usage:
+- Tdarr uses NVENC for encoding
+- Jellyfin uses NVDEC for decoding
+- Both can run simultaneously for different workloads
+- Monitor GPU memory if running concurrent heavy transcodes
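+
+When both services hit the GPU at once, per-engine utilization is more useful than the default `nvidia-smi` table. The device-monitoring mode shows encoder/decoder load; treat this as a sketch, since exact columns depend on the driver version:
+
+```bash
+# Sample sm/mem/enc/dec utilization once per second for 10 samples
+ssh 10.10.0.226 "nvidia-smi dmon -s u -c 10"
+
+# Spot-check VRAM headroom before queueing a heavy job
+ssh 10.10.0.226 "nvidia-smi --query-gpu=memory.used,memory.total --format=csv"
+```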
-### Troubleshooting Patterns
-1. **forEach Errors**: Clean plugin installation, avoid custom mounts
-2. **Staging Timeouts**: Monitor system handles automatic cleanup
-3. **System Crashes**: Convert to unmapped node architecture
-4. **Network Issues**: Implement CIFS resilience options
+## Legacy: Gaming-Aware Architecture
+The previous deployment on the local desktop used an unmapped node architecture with gaming detection. It is preserved here for reference but is not currently in use:
-### Performance Tuning
-1. **Cache Size**: 100-500GB NVMe for concurrent jobs
-2. **Bandwidth**: Unmapped nodes reduce streaming requirements
-3. **Scaling**: Linear scaling with additional unmapped nodes
-4. **GPU Priority**: Gaming detection ensures responsive system
+### Unmapped Node Pattern (Historical)
+For gaming desktops requiring GPU priority management:
+- Node downloads files to local cache before processing
+- Gaming detection pauses transcoding automatically
+- Scheduler script manages time windows
+
+**When to Consider**:
+- Transcoding on a gaming/desktop system
+- Need GPU priority for interactive applications
+- Multiple nodes across network
 
 ## Best Practices
 
-### Production Deployment
-- Use unmapped node architecture for stability
-- Implement comprehensive monitoring
-- Configure gaming-aware scheduling for desktop systems
-- Set appropriate container resource limits
+### For Current Deployment
+1. Monitor NFS stability - Tdarr depends on reliable media access
+2. Check cache disk space periodically (`df -h /mnt/NV2`)
+3. Review the queue for stale files after media library changes
+4. Leave GPU memory headroom for concurrent Jellyfin usage
 
-### Development Guidelines
-- Test with internal Tdarr test files first
-- Implement null-safety checks in custom plugins
-- Use structured logging for troubleshooting
-- Separate concerns: scheduling, monitoring, processing
+### Error Prevention
+1. **Plugin Updates**: Automatic hourly plugin sync from the server
+2. **Retry Logic**: 5 attempts with exponential backoff for file operations
+3. **Container Health**: `restart: unless-stopped` ensures recovery
 
-### Security Considerations
-- Container isolation prevents system-level failures
-- Resource limits protect against memory exhaustion
-- Network mount resilience prevents kernel crashes
-- Automated cleanup prevents disk space issues
+### Troubleshooting Patterns
+1. **File Not Found**: Source was deleted - clear it from the queue via the UI
+2. **Slow Transcodes**: Check NFS latency and GPU utilization
+3. **Node Disconnected**: Restart the node container, check server connectivity
 
-## Migration Patterns
+## Space Savings Estimate
+With ~7,675 files in the queue averaging a 35% reduction (see the short calculation at the end of this document):
+- If the average input is 5 GB → ~1.75 GB saved per file
+- Potential savings: ~13 TB when the queue completes
 
-### From Mapped to Unmapped Nodes
-1. Enable "Allow unmapped Nodes" in server options
-2. Update node configuration (add nodeType=unmapped)
-3. Change cache volume to local storage
-4. Remove media volume mapping
-5. Test workflow and monitor performance
-
-### Plugin System Cleanup
-1. Remove all custom plugin mounts
-2. Force server restart to regenerate plugin ZIP
-3. Restart nodes to download fresh plugins
-4. Verify forEach fixes in downloaded plugins
-
-This technology context provides the foundation for implementing, troubleshooting, and optimizing Tdarr transcoding systems in home lab environments.
\ No newline at end of file
+This technology context reflects the ubuntu-manticore deployment as of December 2025.
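+
+As a footnote to the Space Savings Estimate above, the headline figure is easy to re-derive when the queue size or averages change. A back-of-the-envelope sketch (the 5 GB average input is the assumption carried over from that section):
+
+```bash
+# ~7,675 queued files x 5 GB average input x ~35% reduction, in GB
+echo "$(( 7675 * 5 * 35 / 100 )) GB saved (~13 TB)"
+```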