
Tdarr Distributed Transcoding Pattern

Overview

Tdarr distributed transcoding with unmapped nodes scales video processing across multiple machines: the server coordinates jobs while each node transcodes against fast local cache instead of streaming every read and write through shared storage.

Architecture Pattern

┌─────────────────┐    ┌──────────────────────────────────┐
│   Tdarr Server  │    │          Unmapped Nodes          │
│                 │    │  ┌──────────┐ ┌──────────┐       │
│  - Web Interface│◄──►│  │  Node 1  │ │  Node 2  │ ...   │
│  - Job Queue    │    │  │ GPU+CPU  │ │ GPU+CPU  │       │
│  - File Mgmt    │    │  │NVMe Cache│ │NVMe Cache│       │
│                 │    │  └──────────┘ └──────────┘       │
└─────────────────┘    └──────────────────────────────────┘
         │                              │
         └──────── Shared Storage ──────┘
              (NAS/SAN for media files)

Key Components

  • Server: Centralizes job management and web interface
  • Unmapped Nodes: Independent transcoding with local cache
  • Shared Storage: Source and final file repository

Configuration Templates

Server Configuration (Optimized)

# docker-compose.yml - Hybrid Storage Strategy
version: "3.4"
services:
  tdarr:
    container_name: tdarr
    image: ghcr.io/haveagitgat/tdarr:latest
    restart: unless-stopped
    network_mode: bridge
    ports:
      - 8265:8265 # webUI port
      - 8266:8266 # server port
    environment:
      - TZ=America/Chicago
      - PUID=0
      - PGID=0
      - UMASK_SET=002
      - serverIP=0.0.0.0
      - serverPort=8266
      - webUIPort=8265
      - internalNode=false  # Disable for distributed setup
      - inContainer=true
      - ffmpegVersion=6
      - nodeName=docker-server
    volumes:
      # Hybrid storage strategy - Local for performance, Network for persistence
      - ./tdarr/server:/app/server           # Local: Database, configs, logs
      - ./tdarr/configs:/app/configs         # Local: Fast config access
      - ./tdarr/logs:/app/logs               # Local: Logging performance
      - /mnt/truenas-share/tdarr/tdarr-server/Backups:/app/server/Tdarr/Backups  # Network: Backups only
      
      # Media and cache (when using mapped nodes)
      - /mnt/truenas-share:/media            # Network: Source media
      - /mnt/truenas-share/tdarr/tdarr-cache:/temp  # Network: Shared cache (mapped nodes only)
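
A quick way to bring the server up and confirm the web UI answers, assuming the compose file above is saved as docker-compose.yml on the server host:

# Start (or update) the Tdarr server container
docker compose up -d tdarr

# Confirm the web UI responds on the published port
curl -sf http://localhost:8265 >/dev/null && echo "Tdarr web UI reachable on 8265"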

Unmapped Node Configuration (Production)

#!/bin/bash
# Tdarr Unmapped Node with GPU Support - NVMe Cache Optimization  
# Production script: scripts/tdarr/start-tdarr-gpu-podman-clean.sh

CONTAINER_NAME="tdarr-node-gpu-unmapped"
SERVER_IP="10.10.0.43"
SERVER_PORT="8266"
NODE_NAME="nobara-pc-gpu-unmapped"

# Clean container management
if podman ps -a --format "{{.Names}}" | grep -q "^${CONTAINER_NAME}$"; then
    podman stop "${CONTAINER_NAME}" 2>/dev/null || true
    podman rm "${CONTAINER_NAME}" 2>/dev/null || true
fi

# Production unmapped node with NVMe cache (3-7GB/s) mounted at /cache
podman run -d --name "${CONTAINER_NAME}" \
    --gpus all \
    --restart unless-stopped \
    -e TZ=America/Chicago \
    -e UMASK_SET=002 \
    -e nodeName="${NODE_NAME}" \
    -e serverIP="${SERVER_IP}" \
    -e serverPort="${SERVER_PORT}" \
    -e nodeType=unmapped \
    -e inContainer=true \
    -e ffmpegVersion=6 \
    -e logLevel=DEBUG \
    -e NVIDIA_DRIVER_CAPABILITIES=all \
    -e NVIDIA_VISIBLE_DEVICES=all \
    -v "/mnt/NV2/tdarr-cache:/cache" \                    # NVMe cache (3-7GB/s)
    -v "/mnt/media:/app/unmappedNodeCache/nobara-pc-gpu-unmapped/media" \
    ghcr.io/haveagitgat/tdarr_node:latest

File Transfer Optimizations

Hybrid Storage Strategy (Server)

The server uses a hybrid approach balancing performance and reliability:

# Local storage (SSD/NVMe) - High Performance Operations
./tdarr/server:/app/server           # Database - frequent read/write
./tdarr/configs:/app/configs         # Config files - startup performance  
./tdarr/logs:/app/logs               # Log files - continuous writing

# Network storage (NAS) - Persistence & Backup
/mnt/truenas-share/tdarr/tdarr-server/Backups:/app/server/Tdarr/Backups  # Infrequent access

Benefits:

  • Database performance: Local SQLite operations (100x faster than network)
  • Log performance: Eliminates network I/O bottleneck for continuous logging
  • Reliability: Critical backups stored on redundant NAS storage
  • Config speed: Fast server startup with local configuration files
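
A quick sanity check that the hybrid layout is in effect is to compare the filesystems behind the local server directory and the NAS backup mount (run from the compose project directory; paths match the volumes above):

# The first path should report a local SSD/NVMe filesystem, the second the NAS mount
df -h ./tdarr/server /mnt/truenas-share/tdarr/tdarr-server/Backups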

Container Platform Migration: Docker → Podman

Advantages of Podman for Tdarr:

# Enhanced GPU support
--gpus all                           # Improved NVIDIA integration
-e NVIDIA_DRIVER_CAPABILITIES=all    # Full GPU access
-e NVIDIA_VISIBLE_DEVICES=all        # All GPU visibility

# Better resource management  
--restart unless-stopped             # Smarter restart policies
# Rootless containers (when needed)  # Enhanced security

Migration Benefits:

  • GPU reliability: Better NVIDIA container integration
  • Resource isolation: Improved container resource management
  • System integration: Better integration with systemd and cgroups
  • Performance: Reduced overhead compared to Docker daemon
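
On hosts where --gpus all is not available to Podman, GPU access is usually enabled through the NVIDIA Container Toolkit's CDI support; a minimal sketch (toolkit installation omitted, image used only for verification):

# Generate the CDI spec for the installed NVIDIA driver and verify it registered
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
nvidia-ctk cdi list

# CDI-style device request (alternative to --gpus all)
podman run --rm --device nvidia.com/gpu=all docker.io/library/ubuntu:22.04 nvidia-smi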

Performance Optimization

Cache Storage Strategy (Updated)

# Production cache storage hierarchy (NVMe optimized)
/mnt/NV2/tdarr-cache/               # NVMe SSD (3-7GB/s) - PRODUCTION
├── tdarr-workDir-{jobId}/          # Active transcoding  
├── download/                       # Source file staging (API downloads)
└── upload/                         # Result file staging (API uploads)

# Alternative configurations:
/dev/shm/tdarr-cache/               # RAM disk (fastest, volatile, limited size)
/mnt/truenas-share/tdarr/tdarr-cache/  # Network cache (mapped nodes only)

# Performance comparison:
# NVMe cache:     3-7GB/s   (unmapped nodes - RECOMMENDED)
# Network cache:  100MB/s   (mapped nodes - legacy)
# RAM cache:      10GB/s+   (limited by available RAM)
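
Preparing the cache locations is a one-time step; a minimal sketch (ownership follows the PUID/PGID recommendation in the Security section below; /dev/shm sizing is the Linux default of half of RAM):

# Production NVMe cache directory
sudo mkdir -p /mnt/NV2/tdarr-cache
sudo chown 1000:1000 /mnt/NV2/tdarr-cache

# RAM-disk alternative: /dev/shm is already a tmpfs (volatile, lost on reboot)
mkdir -p /dev/shm/tdarr-cache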

Network I/O Pattern

Optimized Workflow:
1. 📥 Download source (once): NAS → Local NVMe
2. ⚡ Transcode: Local NVMe → Local NVMe  
3. 📤 Upload result (once): Local NVMe → NAS

vs Legacy Mapped Workflow:
1. 🐌 Read source: NAS → Node (streaming)
2. 🐌 Write temp: Node → NAS (streaming) 
3. 🐌 Read temp: NAS → Node (streaming)
4. 🐌 Write final: Node → NAS (streaming)
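
The unmapped node drives this flow through the Tdarr server API; a rough shell equivalent of the optimized workflow, with illustrative paths and filenames, looks like this:

# 1. Download the source once: NAS -> local NVMe
mkdir -p /mnt/NV2/tdarr-cache/work
rsync -a /mnt/truenas-share/movies/input.mkv /mnt/NV2/tdarr-cache/work/

# 2. Transcode entirely on local NVMe (no network I/O while encoding)
ffmpeg -i /mnt/NV2/tdarr-cache/work/input.mkv -c:v hevc_nvenc -c:a copy \
    /mnt/NV2/tdarr-cache/work/output.mkv

# 3. Upload the result once and clean up the local copies
rsync -a /mnt/NV2/tdarr-cache/work/output.mkv /mnt/truenas-share/movies/
rm -f /mnt/NV2/tdarr-cache/work/input.mkv /mnt/NV2/tdarr-cache/work/output.mkv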

Scaling Patterns

Horizontal Scaling

# Multiple nodes with load balancing
nodes:
  - name: "gpu-node-1"      # RTX 4090 + NVMe
    role: "heavy-transcode"
  - name: "gpu-node-2"      # RTX 3080 + NVMe  
    role: "standard-transcode"
  - name: "cpu-node-1"      # Multi-core + SSD
    role: "audio-processing"

Resource Specialization

# GPU-optimized node
-e hardwareEncoding=true
-e nvencTemporalAQ=1
-e processes_GPU=2

# CPU-optimized node  
-e hardwareEncoding=false
-e processes_CPU=8
-e ffmpegThreads=16

Monitoring and Operations

Health Checks

# Node connectivity
curl -f http://server:8266/api/v2/status || exit 1

# Cache usage monitoring  
df -h /mnt/NV2/tdarr-cache
du -sh /mnt/NV2/tdarr-cache/*

# Performance metrics
podman stats tdarr-node-1

Log Analysis

# Node registration
podman logs tdarr-node-1 | grep "Node connected"

# Transfer speeds
podman logs tdarr-node-1 | grep -E "(Download|Upload).*MB/s"

# Transcode performance
podman logs tdarr-node-1 | grep -E "fps=.*"

Security Considerations

Network Access

  • Server requires incoming connections on ports 8265/8266
  • Nodes require outbound access to server
  • Consider VPN for cross-site deployments
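
On a firewalld-based server host, for example, the two ports can be opened like this (adjust for ufw or other firewalls as needed):

# Allow browser access to the web UI and node connections to the server port
sudo firewall-cmd --permanent --add-port=8265/tcp
sudo firewall-cmd --permanent --add-port=8266/tcp
sudo firewall-cmd --reload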

File Permissions

# Ensure consistent UID/GID across nodes
-e PUID=1000
-e PGID=1000

# Cache directory permissions
chown -R 1000:1000 /mnt/NV2/tdarr-cache
chmod 755 /mnt/NV2/tdarr-cache

Production Enhancements

Gaming-Aware Scheduler

For GPU nodes that serve dual purposes (gaming + transcoding):

# Automated scheduler with gaming detection
scripts/tdarr/tdarr-schedule-manager.sh install

# Configure time windows (example: night-only transcoding)
scripts/tdarr/tdarr-schedule-manager.sh preset night-only  # 10PM-7AM only

Features:

  • Automatic GPU conflict prevention: Detects Steam, gaming processes, and GPU usage >15% (see the sketch after this list)
  • Configurable time windows: "22-07:daily" (10PM-7AM), "09-17:1-5" (work hours)
  • Real-time monitoring: 1-minute cron checks with instant gaming response
  • Automated cleanup: Removes abandoned temp directories every 6 hours
  • Zero-intervention operation: Stops/starts Tdarr automatically based on rules
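
A minimal sketch of the gaming check described above, assuming an NVIDIA GPU, a Steam process as the gaming indicator, and the container name from the node script (threshold and process names are illustrative):

#!/bin/bash
# Illustrative gaming detection - stop the node when gaming, start it otherwise
GPU_UTIL=$(nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits | head -n1)

if pgrep -x steam >/dev/null || [ "${GPU_UTIL:-0}" -gt 15 ]; then
    podman stop tdarr-node-gpu-unmapped 2>/dev/null || true
else
    podman start tdarr-node-gpu-unmapped 2>/dev/null || true
fi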

Benefits:

  • Gaming priority: Never interferes with gaming sessions
  • Resource optimization: Maximizes transcoding during off-hours
  • System stability: Prevents GPU contention and system slowdowns
  • Maintenance-free: Handles cleanup and scheduling without user intervention

Enhanced Monitoring System

Script: scripts/monitoring/tdarr-timeout-monitor.sh

  • Staging timeout detection: Monitors for download failures and cleanup issues
  • Discord notifications: Professional alerts with user pings for critical issues
  • Automatic recovery: Cleans up stuck work directories and partial downloads
  • Log management: Timestamped logs with automatic rotation
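
A minimal sketch of the staging-timeout idea, assuming the production cache path, a two-hour threshold, and a Discord webhook URL (all three are adjustable assumptions):

#!/bin/bash
# Illustrative staging-timeout check with Discord alert and cleanup
WEBHOOK_URL="https://discord.com/api/webhooks/CHANGE_ME"
CACHE_DIR="/mnt/NV2/tdarr-cache"

# Count work directories untouched for more than 120 minutes
STUCK=$(find "${CACHE_DIR}" -maxdepth 1 -type d -name 'tdarr-workDir-*' -mmin +120 | wc -l)

if [ "${STUCK}" -gt 0 ]; then
    curl -s -H "Content-Type: application/json" \
        -d "{\"content\": \"Tdarr: ${STUCK} stale work dir(s) in ${CACHE_DIR}\"}" \
        "${WEBHOOK_URL}" >/dev/null
    # Remove the abandoned work directories
    find "${CACHE_DIR}" -maxdepth 1 -type d -name 'tdarr-workDir-*' -mmin +120 -exec rm -rf {} +
fi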

Related Documentation

  • Troubleshooting: reference/docker/tdarr-troubleshooting.md
  • Gaming Scheduler: scripts/tdarr/README.md
  • Automation Scripts: scripts/tdarr/ (production-ready node management)
  • Performance: reference/docker/nvidia-troubleshooting.md