Tdarr Container Memory Corruption Fixes
Date: 2025-08-11
Issue: Kernel memory corruption in tdarr-ffmpeg process causing system crash
Root Cause: CIFS streaming of large video files during transcoding overwhelming kernel page cache
Critical Issues Identified
- CIFS Network Mount Stress: Container directly mounts CIFS shares that are experiencing network issues
- No Resource Limits: Container lacks memory, CPU, and I/O constraints
- Mapped Node Architecture: Forces streaming of 10GB+ remux files over the network during transcoding
- Missing Error Handling: No timeout handling or graceful degradation for network storage issues
- Container Platform: Using Podman without proper cgroup resource isolation
Recommended Changes
1. Convert to Unmapped Node Architecture (CRITICAL)
Current problematic configuration:
# REMOVE these CIFS volume mounts:
-v "/mnt/media/TV:/media/TV" \
-v "/mnt/media/Movies:/media/Movies" \
New unmapped configuration:
# Update in scripts/tdarr/start-tdarr-gpu-podman-clean.sh
# KEY CHANGE: nodeType=unmapped, with the local NVMe cache as the only volume.
# The CIFS mounts are removed entirely.
podman run -d --name "${CONTAINER_NAME}" \
  --gpus all \
  --restart unless-stopped \
  -e nodeType=unmapped \
  -e unmappedNodeCache=/cache \
  -v "/mnt/NV2/tdarr-cache:/cache" \
Benefits:
- Eliminates CIFS streaming during transcoding
- Prevents kernel memory corruption
- Expected 3-5x performance improvement with NVMe cache
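Before removing the mounts, it can help to confirm which host paths handed to the container actually sit on network storage. The following is a minimal sketch and not part of the existing scripts: `is_network_fs` and `check_volume_source` are hypothetical helper names, and it assumes `findmnt` from util-linux is available on the host.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: flag host paths that live on a network filesystem.

# Return 0 if the given filesystem type is a network filesystem.
is_network_fs() {
  case "$1" in
    cifs|smb3|nfs|nfs4) return 0 ;;
    *) return 1 ;;
  esac
}

# Print a warning if the host path is backed by a network filesystem.
# Uses findmnt (util-linux) to resolve the filesystem type behind the path.
check_volume_source() {
  local path="$1" fstype
  fstype="$(findmnt -n -o FSTYPE --target "$path" 2>/dev/null)" || fstype="unknown"
  if is_network_fs "$fstype"; then
    echo "WARNING: $path is on $fstype; do not mount it into the node container"
  else
    echo "OK: $path is on ${fstype:-unknown}"
  fi
}
```

For example, `check_volume_source /mnt/media/TV` should warn on the current setup, while `check_volume_source /mnt/NV2/tdarr-cache` should report a local filesystem.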
2. Implement Container Resource Limits (CRITICAL)
Add to container configuration:
# Memory is capped at 32GB (50% of system RAM). --memory-swap is memory plus
# swap combined, so 40g allows 8GB of swap. 14 CPUs leaves 2 cores for the
# system. --pids-limit guards against fork bombs; the ulimits cap open file
# descriptors and locked memory (64MiB).
podman run -d --name "${CONTAINER_NAME}" \
  --memory=32g \
  --memory-swap=40g \
  --cpus="14" \
  --pids-limit=1000 \
  --ulimit nofile=65536:65536 \
  --ulimit memlock=67108864:67108864 \
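The memory and CPU values above follow simple arithmetic (half of RAM, 8GB of extra swap, all cores minus two), so they can be derived instead of hard-coded. This is a sketch under those assumptions; `calc_limits` is a hypothetical helper, not something the startup script currently contains.

```shell
#!/usr/bin/env bash
# Hypothetical helper: derive podman limit flags from host capacity.
# Arguments: total memory in KiB (as reported by MemTotal), CPU count.
calc_limits() {
  local mem_total_kib="$1" cpu_count="$2"
  local mem_gib=$(( mem_total_kib / 1024 / 1024 / 2 ))  # 50% of RAM, in GiB
  local swap_gib=$(( mem_gib + 8 ))  # --memory-swap is memory + swap combined
  local cpus=$(( cpu_count - 2 ))    # leave 2 cores for the host
  (( cpus < 1 )) && cpus=1
  echo "--memory=${mem_gib}g --memory-swap=${swap_gib}g --cpus=${cpus}"
}

# Example with the host assumed in this document: 64 GiB RAM, 16 cores.
calc_limits 67108864 16
# → --memory=32g --memory-swap=40g --cpus=14
```

On a live host the inputs could come from `awk '/^MemTotal:/ {print $2}' /proc/meminfo` and `nproc`. MemTotal is usually slightly below the installed RAM, so the integer division may yield 31g rather than 32g.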
3. Add I/O and Network Limits
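# Add bandwidth controls: cap cache reads and writes at 1GB/s each.
# Note: --network none leaves the container with only a loopback interface,
# so the node cannot reach a Tdarr server over the network. Verify the node's
# connectivity requirements before enabling it.
--device-read-bps /dev/nvme0n1:1g \
--device-write-bps /dev/nvme0n1:1g \
--network none \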
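The device path in the bandwidth flags should match the disk that actually backs the cache directory. As a rough sketch, the hypothetical helper `parent_disk` below strips a partition suffix from a device name. It only handles plain `sdX`, `nvme`, and `mmcblk` naming and does not cover LVM or device-mapper volumes.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a partition device to its parent disk, to match
# the whole-disk path given to --device-read-bps / --device-write-bps.
parent_disk() {
  local dev="$1"
  case "$dev" in
    # NVMe and MMC partitions are named <disk>p<N>, e.g. /dev/nvme0n1p1
    /dev/nvme*|/dev/mmcblk*) echo "$dev" | sed -E 's/p[0-9]+$//' ;;
    # SCSI/SATA partitions are named <disk><N>, e.g. /dev/sda1
    *)                       echo "$dev" | sed -E 's/[0-9]+$//' ;;
  esac
}
```

A possible use is `parent_disk "$(findmnt -n -o SOURCE --target /mnt/NV2/tdarr-cache)"`, which should print `/dev/nvme0n1` if the cache sits on a partition of that disk.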
4. Enhanced Error Handling and Monitoring
Server-side configuration:
# In docker-compose.yml for Tdarr server
environment:
  - fileTimeout=1800      # 30 minutes for large file operations
  - downloadTimeout=1800  # Extended timeout for large downloads
  - uploadTimeout=1800    # Extended timeout for large uploads
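Whether 1800 seconds is enough depends on file size and on the throughput between server and node. The sketch below estimates this with integer arithmetic. `transfer_seconds` and `fits_timeout` are hypothetical helpers, and the throughput figures in the usage note are illustrative assumptions, not measurements from this system.

```shell
#!/usr/bin/env bash
# Hypothetical helper: estimate the seconds needed to move a file.
# Arguments: file size in GB (decimal), sustained throughput in Mbit/s.
transfer_seconds() {
  local size_gb="$1" mbit_per_s="$2"
  echo $(( size_gb * 8 * 1000 / mbit_per_s ))
}

# Return 0 if the estimated transfer fits within the timeout (in seconds).
fits_timeout() {
  local size_gb="$1" mbit_per_s="$2" timeout_s="$3"
  (( $(transfer_seconds "$size_gb" "$mbit_per_s") <= timeout_s ))
}
```

Under these assumptions a 10GB remux at a sustained 100 Mbit/s takes about 800 seconds and fits within the timeout, while a 40GB file at the same rate takes about 3200 seconds and does not.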
Monitoring setup:
# Enable existing monitoring system
/mnt/NV2/Development/claude-home/scripts/monitoring/tdarr-timeout-monitor.sh
# Add to cron for 20-minute checks:
*/20 * * * * /mnt/NV2/Development/claude-home/scripts/monitoring/tdarr-timeout-monitor.sh
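Adding the cron entry by hand risks duplicating it on repeated setup runs. One way to make the step idempotent is sketched below. `merge_cron_entry` is a hypothetical helper that works on crontab text only, so it can be checked without touching the real crontab.

```shell
#!/usr/bin/env bash
# Hypothetical helper: add a cron entry to crontab text without duplicating it.
# Reads the current crontab on stdin and prints the merged crontab on stdout.
merge_cron_entry() {
  local entry="$1"
  grep -vxF -- "$entry" || true  # keep every line except an identical entry
  echo "$entry"                  # append the entry exactly once
}
```

It could then be applied with `crontab -l 2>/dev/null | merge_cron_entry "$ENTRY" | crontab -`, where `ENTRY` holds the `*/20 * * * * ...` line shown above.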
5. Gaming-Aware Scheduling Integration
# Install the gaming-aware scheduler
/mnt/NV2/Development/claude-home/scripts/tdarr/tdarr-schedule-manager.sh install
# Configure for night-only transcoding during troubleshooting
/mnt/NV2/Development/claude-home/scripts/tdarr/tdarr-schedule-manager.sh preset night-only
Implementation Priority
Phase 1: Immediate (Prevent Crashes)
- Add resource limits to existing container
- Install monitoring system for early warning
- Configure CIFS resilience parameters
Phase 2: Architecture Migration (Performance + Stability)
- Convert to unmapped node architecture
- Remove CIFS volume mounts from container
- Test with single large file (10GB+ remux)
Phase 3: Production Hardening
- Gaming-aware scheduling integration
- Comprehensive monitoring with Discord alerts
- Automated recovery scripts
Expected Results
After implementing these changes:
- Memory corruption eliminated: No direct CIFS I/O during transcoding
- System stability: Resource limits prevent kernel exhaustion
- Performance improvement: 3-5x faster transcoding with NVMe cache
- Network resilience: Unmapped nodes handle network issues gracefully
- Automated recovery: Monitoring system prevents cascade failures
Files to Modify
- /mnt/NV2/Development/claude-home/scripts/tdarr/start-tdarr-gpu-podman-clean.sh - Main container startup script
- Tdarr server docker-compose configuration - Add timeout settings
- Cron configuration - Add monitoring script
Testing Plan
- Test with resource limits first - Verify container constraints work
- Convert to unmapped architecture - Test with small files initially
- Process large remux file - Verify no memory corruption occurs
- Simulate network issues - Confirm graceful handling
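The first testing step can be partly automated by comparing the limits reported for the running container with the requested values. This is a sketch only: `gib_to_bytes` and `check_limit` are hypothetical helpers, and the `podman inspect` field names in the usage note (`HostConfig.Memory`, `HostConfig.PidsLimit`) are assumed to match the installed Podman version.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: compare requested container limits with reported ones.

# Convert GiB to bytes (memory limits are reported in bytes).
gib_to_bytes() { echo $(( $1 * 1024 * 1024 * 1024 )); }

# Compare an expected and an actual value and print a PASS/FAIL line.
check_limit() {
  local name="$1" expected="$2" actual="$3"
  if [ "$expected" = "$actual" ]; then
    echo "PASS: $name = $actual"
  else
    echo "FAIL: $name expected $expected, got $actual"
    return 1
  fi
}
```

For the memory limit, a check might look like `check_limit memory "$(gib_to_bytes 32)" "$(podman inspect --format '{{.HostConfig.Memory}}' "$CONTAINER_NAME")"`, with a similar call for the PID limit of 1000.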