# Tdarr Container Memory Corruption Fixes

**Date**: 2025-08-11

**Issue**: Kernel memory corruption in the tdarr-ffmpeg process, causing a system crash

**Root Cause**: CIFS streaming of large video files during transcoding overwhelms the kernel page cache

## Critical Issues Identified

1. **CIFS Network Mount Stress**: The container directly mounts CIFS shares that are experiencing network issues
2. **No Resource Limits**: The container has no memory, CPU, or I/O constraints
3. **Mapped Node Architecture**: Forces 10GB+ remux files to be streamed over the network during transcoding
4. **Missing Error Handling**: No timeout handling or graceful degradation when network storage misbehaves
5. **Container Platform**: Podman is used without proper cgroup resource isolation

## Recommended Changes

### 1. Convert to Unmapped Node Architecture (CRITICAL)

**Current problematic configuration**:

```bash
# REMOVE these CIFS volume mounts:
-v "/mnt/media/TV:/media/TV" \
-v "/mnt/media/Movies:/media/Movies" \
```

**New unmapped configuration**:

```bash
# Update in scripts/tdarr/start-tdarr-gpu-podman-clean.sh
# Fragment: the remaining flags and the image name follow as in the existing script.
# KEY CHANGE: nodeType=unmapped, with the local NVMe cache as the only volume.
# The CIFS mounts are REMOVED entirely.
# (Comments sit above the command because a comment after a trailing
# backslash breaks the line continuation.)
podman run -d --name "${CONTAINER_NAME}" \
  --gpus all \
  --restart unless-stopped \
  -e nodeType=unmapped \
  -e unmappedNodeCache=/cache \
  -v "/mnt/NV2/tdarr-cache:/cache" \
```

**Benefits**:

- Eliminates CIFS streaming during transcoding
- Removes the I/O path implicated in the kernel memory corruption
- Expected 3-5x performance improvement with the NVMe cache

### 2. Implement Container Resource Limits (CRITICAL)

Add to the container configuration:

```bash
# Fragment: add these flags to the existing podman run command.
#   --memory=32g         Limit to 32GB (50% of system RAM)
#   --memory-swap=40g    Memory plus swap total, i.e. 8GB of additional swap
#   --cpus="14"          Leave 2 cores for the system
#   --pids-limit=1000    Prevent fork bomb scenarios
#   --ulimit nofile      File descriptor limits
#   --ulimit memlock     Prevent excessive memory locking (64MiB)
podman run -d --name "${CONTAINER_NAME}" \
  --memory=32g \
  --memory-swap=40g \
  --cpus="14" \
  --pids-limit=1000 \
  --ulimit nofile=65536:65536 \
  --ulimit memlock=67108864:67108864 \
```

### 3. Add I/O and Network Limits

```bash
# Add bandwidth controls (fragment: flags for the podman run command)
#   --device-read-bps    Limit cache reads to 1GB/s
#   --device-write-bps   Limit cache writes to 1GB/s
#   --network none       No direct network (use server API)
# NOTE: --network none leaves the container with loopback only. An unmapped
# node still has to reach the Tdarr server API, so confirm connectivity
# before keeping this flag.
  --device-read-bps /dev/nvme0n1:1g \
  --device-write-bps /dev/nvme0n1:1g \
  --network none \
```

### 4. Enhanced Error Handling and Monitoring

**Server-side configuration**:

```yaml
# In docker-compose.yml for Tdarr server
environment:
  - fileTimeout=1800      # 30 minutes for large file operations
  - downloadTimeout=1800  # Extended timeout for large downloads
  - uploadTimeout=1800    # Extended timeout for large uploads
```

**Monitoring setup**:

```bash
# Enable existing monitoring system
/mnt/NV2/Development/claude-home/scripts/monitoring/tdarr-timeout-monitor.sh

# Add to cron (crontab -e) for 20-minute checks:
*/20 * * * * /mnt/NV2/Development/claude-home/scripts/monitoring/tdarr-timeout-monitor.sh
```

### 5. Gaming-Aware Scheduling Integration

```bash
# Install the gaming-aware scheduler
/mnt/NV2/Development/claude-home/scripts/tdarr/tdarr-schedule-manager.sh install

# Configure for night-only transcoding during troubleshooting
/mnt/NV2/Development/claude-home/scripts/tdarr/tdarr-schedule-manager.sh preset night-only
```

## Implementation Priority

### Phase 1: Immediate (Prevent Crashes)

1. Add resource limits to the existing container
2. Install the monitoring system for early warning
3. Configure CIFS resilience parameters

### Phase 2: Architecture Migration (Performance + Stability)

1. Convert to unmapped node architecture
2. Remove CIFS volume mounts from the container
3. Test with a single large file (10GB+ remux)

### Phase 3: Production Hardening

1. Gaming-aware scheduling integration
2. Comprehensive monitoring with Discord alerts
3. Automated recovery scripts

## Expected Results

After implementing these changes:

- **Memory corruption eliminated**: No direct CIFS I/O during transcoding
- **System stability**: Resource limits keep the container from exhausting kernel resources
- **Performance improvement**: Expected 3-5x faster transcoding with the NVMe cache
- **Network resilience**: Unmapped nodes handle network issues gracefully
- **Automated recovery**: The monitoring system catches failures before they cascade

## Files to Modify

1. `/mnt/NV2/Development/claude-home/scripts/tdarr/start-tdarr-gpu-podman-clean.sh` - Main container startup script
2. Tdarr server docker-compose configuration - Add timeout settings
3. Cron configuration - Add monitoring script

## Testing Plan

1. **Test with resource limits first** - Verify the container constraints are applied
2. **Convert to unmapped architecture** - Test with small files initially
3. **Process large remux file** - Verify no memory corruption occurs
4. **Simulate network issues** - Confirm graceful handling
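
The first step of the testing plan, checking that the resource limits are in place, can be sketched as a small helper script. This is a minimal sketch, not part of the existing tooling: the function names (`size_to_bytes`, `check_limit`, `inspect_field`, `verify_container_limits`) are hypothetical, and it assumes `podman inspect` exposes the `Memory`, `MemorySwap`, and `PidsLimit` fields under `HostConfig`, with memory values reported in bytes.

```bash
#!/usr/bin/env bash
# Sketch: compare the limits Podman reports against the values recommended above.
# All function names here are illustrative, not part of any existing script.

# Convert a size such as 32g or 512m to bytes (binary units).
size_to_bytes() {
  local value="${1%[kKmMgG]}"
  case "$1" in
    *[kK]) echo $(( value * 1024 )) ;;
    *[mM]) echo $(( value * 1024 * 1024 )) ;;
    *[gG]) echo $(( value * 1024 * 1024 * 1024 )) ;;
    *)     echo "$1" ;;
  esac
}

# Print OK or MISMATCH for one limit; return non-zero on a mismatch.
check_limit() {
  local label="$1" expected="$2" actual="$3"
  if [ "$expected" = "$actual" ]; then
    echo "OK: $label = $actual"
  else
    echo "MISMATCH: $label expected $expected, got $actual"
    return 1
  fi
}

# Read one HostConfig field from a container (assumed field names, see lead-in).
inspect_field() {
  podman inspect --format "{{.HostConfig.$2}}" "$1"
}

# Check the memory, memory+swap, and PID limits of the named container.
verify_container_limits() {
  local name="$1" status=0
  check_limit "memory"      "$(size_to_bytes 32g)" "$(inspect_field "$name" Memory)"     || status=1
  check_limit "memory+swap" "$(size_to_bytes 40g)" "$(inspect_field "$name" MemorySwap)" || status=1
  check_limit "pids"        "1000"                 "$(inspect_field "$name" PidsLimit)"  || status=1
  return "$status"
}
```

After sourcing the script, `verify_container_limits "${CONTAINER_NAME}"` prints one line per limit and returns non-zero if any value differs. It only confirms what Podman recorded for the container; whether the kernel enforces those limits still depends on the host's cgroup setup, so the large-file test remains necessary.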