
# Tdarr Container Memory Corruption Fixes

**Date**: 2025-08-11
**Issue**: Kernel memory corruption in tdarr-ffmpeg process causing system crash
**Root Cause**: CIFS streaming of large video files during transcoding overwhelming kernel page cache

## Critical Issues Identified

1. **CIFS Network Mount Stress**: Container directly mounts CIFS shares experiencing network issues
2. **No Resource Limits**: Container lacks memory, CPU, and I/O constraints
3. **Mapped Node Architecture**: Forces streaming 10GB+ remux files over network during transcoding
4. **Missing Error Handling**: No timeout handling or graceful degradation for network storage issues
5. **Container Platform**: Using Podman without proper cgroup resource isolation

## Recommended Changes

### 1. Convert to Unmapped Node Architecture (CRITICAL)

**Current problematic configuration**:
```bash
# REMOVE these CIFS volume mounts:
-v "/mnt/media/TV:/media/TV" \
-v "/mnt/media/Movies:/media/Movies" \
```

**New unmapped configuration**:
```bash
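# Update in scripts/tdarr/start-tdarr-gpu-podman-clean.sh
# KEY CHANGE: nodeType=unmapped runs the node in unmapped mode.
# The only volume is the local NVMe cache; the CIFS mounts are removed entirely.
# Notes sit above the command because a comment after a trailing backslash
# breaks line continuation. Keep the script's remaining options and image
# argument after the last line shown.
podman run -d --name "${CONTAINER_NAME}" \
    --gpus all \
    --restart unless-stopped \
    -e nodeType=unmapped \
    -e unmappedNodeCache=/cache \
    -v "/mnt/NV2/tdarr-cache:/cache" \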
```

**Benefits**:
- Eliminates CIFS streaming during transcoding
- Prevents kernel memory corruption
- Expected 3-5x faster transcoding with the NVMe cache
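
Because the point of this change is that the cache must not sit on network storage, a pre-flight check in the startup script can catch a misconfigured path early. The following is a minimal sketch, not part of the existing scripts: the helper names `is_network_fstype` and `check_cache_path` are hypothetical, and the list of network filesystem types is an assumption that may need extending.

```shell
#!/usr/bin/env bash
# Sketch: refuse to start the node if the cache path is on a network filesystem.
# is_network_fstype and check_cache_path are hypothetical helper names.

# Return 0 if the given filesystem type is a network filesystem.
is_network_fstype() {
    case "$1" in
        cifs|smb3|smbfs|nfs|nfs4) return 0 ;;
        *) return 1 ;;
    esac
}

# Look up the filesystem type backing a path and reject network mounts.
check_cache_path() {
    local path="$1" fstype
    fstype="$(findmnt -n -o FSTYPE --target "$path")" || {
        echo "cannot determine filesystem for $path" >&2
        return 2
    }
    if is_network_fstype "$fstype"; then
        echo "$path is on $fstype; expected a local filesystem" >&2
        return 1
    fi
    echo "$path is on local filesystem type $fstype"
}
```

In the startup script this could run before `podman run`, for example `check_cache_path /mnt/NV2/tdarr-cache || exit 1`.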

### 2. Implement Container Resource Limits (CRITICAL)

Add to container configuration:
```bash
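# --memory=32g          limit to 32GB (50% of system RAM)
# --memory-swap=40g     memory plus swap combined, so 8GB of swap on top
# --cpus="14"           leaves 2 cores for the system
# --pids-limit=1000     prevents fork-bomb scenarios
# --ulimit nofile=...   file descriptor limit
# --ulimit memlock=...  caps locked memory at 64MiB (67108864 bytes)
# (Inline comments after a trailing backslash break line continuation.)
podman run -d --name "${CONTAINER_NAME}" \
    --memory=32g \
    --memory-swap=40g \
    --cpus="14" \
    --pids-limit=1000 \
    --ulimit nofile=65536:65536 \
    --ulimit memlock=67108864:67108864 \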
```
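
The 32GB and 14-core figures above are stated as 50% of system RAM and all cores minus two. If the script should follow the host rather than hard-code those values, the arithmetic can be sketched as below. The helper names `mem_limit_gib` and `cpu_limit` are hypothetical, and the default reserve of 2 cores is taken from the comment above.

```shell
#!/usr/bin/env bash
# Sketch: derive --memory and --cpus values from the host.
# mem_limit_gib and cpu_limit are hypothetical helper names.

# Half of the given MemTotal (in kB, as reported by /proc/meminfo), in whole GiB.
mem_limit_gib() {
    local total_kb="$1"
    echo $(( total_kb / 2 / 1024 / 1024 ))
}

# Core count minus a reserve for the system (default 2), never below 1.
cpu_limit() {
    local cores="$1" reserve="${2:-2}"
    local n=$(( cores - reserve ))
    if (( n < 1 )); then
        n=1
    fi
    echo "$n"
}

# Possible use inside the startup script:
#   total_kb="$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)"
#   --memory="$(mem_limit_gib "$total_kb")g" --cpus="$(cpu_limit "$(nproc)")"
```

Note that `MemTotal` is usually slightly below the installed capacity, so integer division on a 64GB machine may yield 31 rather than 32; rounding up or keeping the fixed value are both reasonable choices.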

### 3. Add I/O and Network Limits

```bash
# Add bandwidth controls: cap cache reads and writes at 1GB/s each.
# Do not add --network none: it removes all container networking, including
# the connection to the Tdarr server that an unmapped node depends on.
--device-read-bps /dev/nvme0n1:1g \
--device-write-bps /dev/nvme0n1:1g \
```

### 4. Enhanced Error Handling and Monitoring

**Server-side configuration**:
```yaml
# In docker-compose.yml for Tdarr server
environment:
  - fileTimeout=1800              # 30 minutes for large file operations
  - downloadTimeout=1800          # Extended timeout for large downloads
  - uploadTimeout=1800            # Extended timeout for large uploads
```

**Monitoring setup**:
```bash
# Enable existing monitoring system
/mnt/NV2/Development/claude-home/scripts/monitoring/tdarr-timeout-monitor.sh

# Add to cron for 20-minute checks:
*/20 * * * * /mnt/NV2/Development/claude-home/scripts/monitoring/tdarr-timeout-monitor.sh
```
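
Adding the cron entry by hand is fine, but if the setup is scripted it helps to make the step idempotent so repeated runs do not duplicate the line. A minimal sketch, assuming the crontab is edited through `crontab -l` and `crontab -`; the helper name `ensure_cron_line` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: add a crontab line only if it is not already present.
# ensure_cron_line is a hypothetical helper name.

# Reads an existing crontab on stdin and prints it with the given line
# appended, unless an identical line already exists.
ensure_cron_line() {
    local line="$1" existing
    existing="$(cat)"
    if [ -n "$existing" ]; then
        printf '%s\n' "$existing"
    fi
    # -x matches whole lines, -F treats the line as a fixed string.
    if ! grep -qxF -- "$line" <<<"$existing"; then
        printf '%s\n' "$line"
    fi
}

# Possible use:
#   CRON_LINE='*/20 * * * * /mnt/NV2/Development/claude-home/scripts/monitoring/tdarr-timeout-monitor.sh'
#   crontab -l 2>/dev/null | ensure_cron_line "$CRON_LINE" | crontab -
```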

### 5. Gaming-Aware Scheduling Integration

```bash
# Install the gaming-aware scheduler
/mnt/NV2/Development/claude-home/scripts/tdarr/tdarr-schedule-manager.sh install

# Configure for night-only transcoding during troubleshooting
/mnt/NV2/Development/claude-home/scripts/tdarr/tdarr-schedule-manager.sh preset night-only
```

## Implementation Priority

### Phase 1: Immediate (Prevent Crashes)
1. Add resource limits to existing container
2. Install monitoring system for early warning
3. Configure CIFS resilience parameters

### Phase 2: Architecture Migration (Performance + Stability)
1. Convert to unmapped node architecture
2. Remove CIFS volume mounts from container
3. Test with single large file (10GB+ remux)

### Phase 3: Production Hardening
1. Gaming-aware scheduling integration
2. Comprehensive monitoring with Discord alerts
3. Automated recovery scripts

## Expected Results

After implementing these changes:
- **Memory corruption eliminated**: No direct CIFS I/O during transcoding
- **System stability**: Resource limits prevent kernel exhaustion
- **Performance improvement**: 3-5x faster transcoding with NVMe cache
- **Network resilience**: Unmapped nodes handle network issues gracefully
- **Automated recovery**: Monitoring system prevents cascade failures

## Files to Modify

1. `/mnt/NV2/Development/claude-home/scripts/tdarr/start-tdarr-gpu-podman-clean.sh` - Main container startup script
2. Tdarr server docker-compose configuration - Add timeout settings
3. Cron configuration - Add monitoring script

## Testing Plan

1. **Test with resource limits first** - Verify container constraints work
2. **Convert to unmapped architecture** - Test with small files initially
3. **Process large remux file** - Verify no memory corruption occurs
4. **Simulate network issues** - Confirm graceful handling
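
For step 1, one way to confirm the limits were actually applied is to compare what Podman reports against the intended value. The sketch below checks only the memory limit and assumes `podman inspect` exposes it as `.HostConfig.Memory` in bytes; the helper names `gib_to_bytes` and `verify_memory_limit` are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: check that a running container has the expected memory limit.
# gib_to_bytes and verify_memory_limit are hypothetical helper names.

# Convert whole GiB to bytes.
gib_to_bytes() {
    echo $(( $1 * 1024 * 1024 * 1024 ))
}

# Compare the limit reported by podman inspect with the expected GiB value.
verify_memory_limit() {
    local container="$1" expected_gib="$2" actual expected
    actual="$(podman inspect --format '{{.HostConfig.Memory}}' "$container")" || return 2
    expected="$(gib_to_bytes "$expected_gib")"
    if [ "$actual" = "$expected" ]; then
        echo "memory limit OK: $actual bytes"
    else
        echo "memory limit mismatch: got $actual, expected $expected" >&2
        return 1
    fi
}

# Possible use after starting the container:
#   verify_memory_limit "${CONTAINER_NAME}" 32
```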