Adds title, description, type, domain, and tags frontmatter to every doc for improved KB semantic search. The description field is prepended to every search chunk, and domain/type/tags enable filtered queries. Type values: context, guide, runbook, reference, troubleshooting Domain values match directory structure (networking, docker, etc.) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
| title | description | type | domain | tags |
|---|---|---|---|---|
| Tdarr Container Memory Fixes | Container configuration changes to prevent kernel memory corruption caused by CIFS streaming during Tdarr transcoding, including unmapped node conversion, resource limits, and I/O constraints. | runbook | docker | |
Tdarr Container Memory Corruption Fixes
Date: 2025-08-11
Issue: Kernel memory corruption in tdarr-ffmpeg process causing system crash
Root Cause: CIFS streaming of large video files during transcoding overwhelming kernel page cache
Critical Issues Identified
- CIFS Network Mount Stress: Container directly mounts CIFS shares experiencing network issues
- No Resource Limits: Container lacks memory, CPU, and I/O constraints
- Mapped Node Architecture: Forces streaming 10GB+ remux files over network during transcoding
- Missing Error Handling: No timeout handling or graceful degradation for network storage issues
- Container Platform: Using Podman without proper cgroup resource isolation
Recommended Changes
1. Convert to Unmapped Node Architecture (CRITICAL)
Current problematic configuration:
# REMOVE these CIFS volume mounts:
-v "/mnt/media/TV:/media/TV" \
-v "/mnt/media/Movies:/media/Movies" \
New unmapped configuration:
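# Update in scripts/tdarr/start-tdarr-gpu-podman-clean.sh
# Comments are kept above the command: a comment after a trailing "\" breaks the line continuation.
# -e nodeType=unmapped: KEY CHANGE, runs the node in unmapped mode
# -v /mnt/NV2/tdarr-cache:/cache: NVMe local cache only
# CIFS mounts are REMOVED entirely
podman run -d --name "${CONTAINER_NAME}" \
--gpus all \
--restart unless-stopped \
-e nodeType=unmapped \
-e unmappedNodeCache=/cache \
-v "/mnt/NV2/tdarr-cache:/cache" \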
Benefits:
- Eliminates CIFS streaming during transcoding
- Prevents kernel memory corruption
- 3-5x performance improvement with NVMe cache
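Since the benefit depends on the cache living on local storage, it may be worth checking the cache path before starting the node. The following is a minimal sketch, assuming `findmnt` from util-linux is installed; the function names `is_network_fs` and `check_cache_path` are illustrative and not part of the existing scripts.

```shell
#!/usr/bin/env bash
# Illustrative helpers (hypothetical names, not part of the existing scripts).

# Return 0 if the given filesystem type is a network filesystem.
is_network_fs() {
    case "$1" in
        cifs|smb3|nfs|nfs4) return 0 ;;
        *) return 1 ;;
    esac
}

# Look up the filesystem type backing a path and refuse network mounts.
# Returns 2 if the lookup itself fails (findmnt missing, path not found).
check_cache_path() {
    local path="$1" fstype
    fstype="$(findmnt -n -o FSTYPE --target "$path")" || return 2
    if is_network_fs "$fstype"; then
        echo "ERROR: $path is on $fstype; use a local disk for the cache" >&2
        return 1
    fi
    echo "OK: $path is on $fstype"
}
```

Calling `check_cache_path /mnt/NV2/tdarr-cache` at the top of the startup script would stop the node from starting if the cache directory were ever remounted onto a CIFS share.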
2. Implement Container Resource Limits (CRITICAL)
Add to container configuration:
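# Flags explained here because a comment after a trailing "\" breaks the line continuation:
# --memory: limit to 32GB (50% of system RAM); --memory-swap: memory plus swap combined, so 8GB of swap
# --cpus: reserve 2 cores for the system; --pids-limit: prevent fork bomb scenarios
# --ulimit nofile: file descriptor limits; --ulimit memlock: prevent excessive memory locking (64 MiB)
podman run -d --name "${CONTAINER_NAME}" \
--memory=32g \
--memory-swap=40g \
--cpus="14" \
--pids-limit=1000 \
--ulimit nofile=65536:65536 \
--ulimit memlock=67108864:67108864 \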
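The memory and CPU values follow simple arithmetic: half of system RAM, 8GB of swap on top of that, and all cores but two. The sketch below reproduces that arithmetic so the flags can be derived for a differently sized host. The function name `tdarr_limit_flags` is hypothetical, and the 64 GiB / 16 core host in the usage note is only what the values above imply, not something the source states directly.

```shell
#!/usr/bin/env bash
# Hypothetical helper: derive the limit flags from the host size.
# Arguments: total RAM in GiB, number of CPU cores.
tdarr_limit_flags() {
    local ram_gib="$1" cores="$2"
    local mem=$(( ram_gib / 2 ))      # 50% of system RAM
    local memswap=$(( mem + 8 ))      # --memory-swap counts RAM + swap, so this allows 8 GiB of swap
    local cpus=$(( cores - 2 ))       # leave 2 cores for the host
    if [ "$cpus" -lt 1 ]; then
        cpus=1                        # never request fewer than one core
    fi
    printf '%s\n' "--memory=${mem}g --memory-swap=${memswap}g --cpus=${cpus}"
}
```

With these assumptions, `tdarr_limit_flags 64 16` prints `--memory=32g --memory-swap=40g --cpus=14`, matching the flags above.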
3. Add I/O and Network Limits
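# Add bandwidth controls (comments kept above the flags: a comment after a trailing "\" breaks the line continuation)
# --device-read-bps / --device-write-bps: limit cache reads and writes to 1GB/s each
# --network none: no direct network (use server API)
# Caution: --network none leaves the container with only a loopback interface, so confirm the node can still reach the Tdarr server before applying it
--device-read-bps /dev/nvme0n1:1g \
--device-write-bps /dev/nvme0n1:1g \
--network none \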
4. Enhanced Error Handling and Monitoring
Server-side configuration:
# In docker-compose.yml for Tdarr server
environment:
- fileTimeout=1800 # 30 minutes for large file operations
- downloadTimeout=1800 # Extended timeout for large downloads
- uploadTimeout=1800 # Extended timeout for large uploads
Monitoring setup:
# Enable existing monitoring system
/mnt/NV2/Development/claude-home/scripts/monitoring/tdarr-timeout-monitor.sh
# Add to cron for 20-minute checks:
*/20 * * * * /mnt/NV2/Development/claude-home/scripts/monitoring/tdarr-timeout-monitor.sh
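Adding the cron line by hand risks duplicating it on a second run. One possible approach, sketched below, is a filter that reads the current crontab on stdin and appends the entry only when it is missing. The names `MONITOR_ENTRY` and `with_monitor_entry` are illustrative and not part of the existing scripts.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read an existing crontab on stdin and print it with
# the monitor entry appended, unless an identical line is already present.
MONITOR_ENTRY='*/20 * * * * /mnt/NV2/Development/claude-home/scripts/monitoring/tdarr-timeout-monitor.sh'

with_monitor_entry() {
    local current
    current="$(cat)"
    # -x matches whole lines only, -F treats the entry as a literal string
    if printf '%s\n' "$current" | grep -qxF -- "$MONITOR_ENTRY"; then
        printf '%s\n' "$current"
    elif [ -z "$current" ]; then
        printf '%s\n' "$MONITOR_ENTRY"
    else
        printf '%s\n%s\n' "$current" "$MONITOR_ENTRY"
    fi
}
```

A typical invocation would be `crontab -l 2>/dev/null | with_monitor_entry | crontab -`, which is safe to rerun because the second pass leaves the crontab unchanged.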
5. Gaming-Aware Scheduling Integration
# Install the gaming-aware scheduler
/mnt/NV2/Development/claude-home/scripts/tdarr/tdarr-schedule-manager.sh install
# Configure for night-only transcoding during troubleshooting
/mnt/NV2/Development/claude-home/scripts/tdarr/tdarr-schedule-manager.sh preset night-only
Implementation Priority
Phase 1: Immediate (Prevent Crashes)
- Add resource limits to existing container
- Install monitoring system for early warning
- Configure CIFS resilience parameters
Phase 2: Architecture Migration (Performance + Stability)
- Convert to unmapped node architecture
- Remove CIFS volume mounts from container
- Test with single large file (10GB+ remux)
Phase 3: Production Hardening
- Gaming-aware scheduling integration
- Comprehensive monitoring with Discord alerts
- Automated recovery scripts
Expected Results
After implementing these changes:
- Memory corruption eliminated: No direct CIFS I/O during transcoding
- System stability: Resource limits prevent kernel exhaustion
- Performance improvement: 3-5x faster transcoding with NVMe cache
- Network resilience: Unmapped nodes handle network issues gracefully
- Automated recovery: Monitoring system prevents cascade failures
Files to Modify
- /mnt/NV2/Development/claude-home/scripts/tdarr/start-tdarr-gpu-podman-clean.sh - Main container startup script
- Tdarr server docker-compose configuration - Add timeout settings
- Cron configuration - Add monitoring script
Testing Plan
- Test with resource limits first - Verify the container limits are actually enforced
- Convert to unmapped architecture - Test with small files initially
- Process large remux file - Verify no memory corruption occurs
- Simulate network issues - Confirm graceful handling
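The first test step can be partly automated by comparing the limit podman reports with the value that was requested. This is a minimal sketch, assuming `podman inspect` exposes the memory limit in bytes under `HostConfig.Memory` as Docker does; the function names and the container name in the usage note are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for the first test step.

# Convert a whole number of GiB to bytes, the unit reported by inspect.
gib_to_bytes() {
    echo $(( $1 * 1024 * 1024 * 1024 ))
}

# Compare the memory limit reported for a container with the expected value.
# Usage: verify_memory_limit <container-name> <expected-gib>
# Returns 2 if the inspect call fails, 1 on a mismatch.
verify_memory_limit() {
    local name="$1" expected actual
    expected="$(gib_to_bytes "$2")"
    actual="$(podman inspect --format '{{.HostConfig.Memory}}' "$name")" || return 2
    if [ "$actual" = "$expected" ]; then
        echo "OK: memory limit is ${actual} bytes"
    else
        echo "MISMATCH: expected ${expected}, got ${actual}" >&2
        return 1
    fi
}
```

For the 32GB limit recommended earlier, `verify_memory_limit "${CONTAINER_NAME}" 32` would be expected to succeed once the container has been restarted with the new flags.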