Complete restructure from patterns/examples/reference to technology-focused directories: • Created technology-specific directories with comprehensive documentation: - /tdarr/ - Transcoding automation with gaming-aware scheduling - /docker/ - Container management with GPU acceleration patterns - /vm-management/ - Virtual machine automation and cloud-init - /networking/ - SSH infrastructure, reverse proxy, and security - /monitoring/ - System health checks and Discord notifications - /databases/ - Database patterns and troubleshooting - /development/ - Programming language patterns (bash, nodejs, python, vuejs) • Enhanced CLAUDE.md with intelligent context loading: - Technology-first loading rules for automatic context provision - Troubleshooting keyword triggers for emergency scenarios - Documentation maintenance protocols with automated reminders - Context window management for optimal documentation updates • Preserved valuable content from .claude/tmp/: - SSH security improvements and server inventory - Tdarr CIFS troubleshooting and Docker iptables solutions - Operational scripts with proper technology classification • Benefits achieved: - Self-contained technology directories with complete context - Automatic loading of relevant documentation based on keywords - Emergency-ready troubleshooting with comprehensive guides - Scalable structure for future technology additions - Eliminated context bloat through targeted loading 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
132 lines
4.8 KiB
Markdown
132 lines
4.8 KiB
Markdown
# Tdarr Container Memory Corruption Fixes
|
|
|
|
**Date**: 2025-08-11
|
|
**Issue**: Kernel memory corruption in tdarr-ffmpeg process causing system crash
|
|
**Root Cause**: CIFS streaming of large video files during transcoding overwhelming kernel page cache
|
|
|
|
## Critical Issues Identified
|
|
|
|
1. **CIFS Network Mount Stress**: Container directly mounts CIFS shares experiencing network issues
|
|
2. **No Resource Limits**: Container lacks memory, CPU, and I/O constraints
|
|
3. **Mapped Node Architecture**: Forces streaming 10GB+ remux files over network during transcoding
|
|
4. **Missing Error Handling**: No timeout handling or graceful degradation for network storage issues
|
|
5. **Container Platform**: Using Podman without proper cgroup resource isolation
|
|
|
|
## Recommended Changes
|
|
|
|
### 1. Convert to Unmapped Node Architecture (CRITICAL)
|
|
|
|
**Current problematic configuration**:
|
|
```bash
|
|
# REMOVE these CIFS volume mounts:
|
|
-v "/mnt/media/TV:/media/TV" \
|
|
-v "/mnt/media/Movies:/media/Movies" \
|
|
```
|
|
|
|
**New unmapped configuration**:
|
|
```bash
|
|
# Update in scripts/tdarr/start-tdarr-gpu-podman-clean.sh
|
|
podman run -d --name "${CONTAINER_NAME}" \
|
|
--gpus all \
|
|
--restart unless-stopped \
|
|
-e nodeType=unmapped \ # KEY CHANGE: unmapped mode
|
|
-e unmappedNodeCache=/cache \
|
|
-v "/mnt/NV2/tdarr-cache:/cache" \ # NVMe local cache only
|
|
# CIFS mounts REMOVED entirely
|
|
```
|
|
|
|
**Benefits**:
|
|
- Eliminates CIFS streaming during transcoding
|
|
- Prevents kernel memory corruption
|
|
- 3-5x performance improvement with NVMe cache
|
|
|
|
### 2. Implement Container Resource Limits (CRITICAL)
|
|
|
|
Add to container configuration:
|
|
```bash
|
|
podman run -d --name "${CONTAINER_NAME}" \
|
|
--memory=32g \ # Limit to 32GB (50% of system RAM)
|
|
--memory-swap=40g \ # Allow 8GB additional swap
|
|
--cpus="14" \ # Reserve 2 cores for system
|
|
--pids-limit=1000 \ # Prevent fork bomb scenarios
|
|
--ulimit nofile=65536:65536 \ # File descriptor limits
|
|
--ulimit memlock=67108864:67108864 \ # Prevent excessive memory locking
|
|
```
|
|
|
|
### 3. Add I/O and Network Limits
|
|
|
|
```bash
|
|
# Add bandwidth controls
|
|
--device-read-bps /dev/nvme0n1:1g \ # Limit cache read to 1GB/s
|
|
--device-write-bps /dev/nvme0n1:1g \ # Limit cache write to 1GB/s
|
|
--network none \ # No direct network (use server API)
|
|
```
|
|
|
|
### 4. Enhanced Error Handling and Monitoring
|
|
|
|
**Server-side configuration**:
|
|
```yaml
|
|
# In docker-compose.yml for Tdarr server
|
|
environment:
|
|
- fileTimeout=1800 # 30 minutes for large file operations
|
|
- downloadTimeout=1800 # Extended timeout for large downloads
|
|
- uploadTimeout=1800 # Extended timeout for large uploads
|
|
```
|
|
|
|
**Monitoring setup**:
|
|
```bash
|
|
# Enable existing monitoring system
|
|
/mnt/NV2/Development/claude-home/scripts/monitoring/tdarr-timeout-monitor.sh
|
|
|
|
# Add to cron for 20-minute checks:
|
|
*/20 * * * * /mnt/NV2/Development/claude-home/scripts/monitoring/tdarr-timeout-monitor.sh
|
|
```
|
|
|
|
### 5. Gaming-Aware Scheduling Integration
|
|
|
|
```bash
|
|
# Install the gaming-aware scheduler
|
|
/mnt/NV2/Development/claude-home/scripts/tdarr/tdarr-schedule-manager.sh install
|
|
|
|
# Configure for night-only transcoding during troubleshooting
|
|
/mnt/NV2/Development/claude-home/scripts/tdarr/tdarr-schedule-manager.sh preset night-only
|
|
```
|
|
|
|
## Implementation Priority
|
|
|
|
### Phase 1: Immediate (Prevent Crashes)
|
|
1. Add resource limits to existing container
|
|
2. Install monitoring system for early warning
|
|
3. Configure CIFS resilience parameters
|
|
|
|
### Phase 2: Architecture Migration (Performance + Stability)
|
|
1. Convert to unmapped node architecture
|
|
2. Remove CIFS volume mounts from container
|
|
3. Test with single large file (10GB+ remux)
|
|
|
|
### Phase 3: Production Hardening
|
|
1. Gaming-aware scheduling integration
|
|
2. Comprehensive monitoring with Discord alerts
|
|
3. Automated recovery scripts
|
|
|
|
## Expected Results
|
|
|
|
After implementing these changes:
|
|
- **Memory corruption eliminated**: No direct CIFS I/O during transcoding
|
|
- **System stability**: Resource limits prevent kernel exhaustion
|
|
- **Performance improvement**: 3-5x faster transcoding with NVMe cache
|
|
- **Network resilience**: Unmapped nodes handle network issues gracefully
|
|
- **Automated recovery**: Monitoring system prevents cascade failures
|
|
|
|
## Files to Modify
|
|
|
|
1. `/mnt/NV2/Development/claude-home/scripts/tdarr/start-tdarr-gpu-podman-clean.sh` - Main container startup script
|
|
2. Tdarr server docker-compose configuration - Add timeout settings
|
|
3. Cron configuration - Add monitoring script
|
|
|
|
## Testing Plan
|
|
|
|
1. **Test with resource limits first** - Verify container restraints work
|
|
2. **Convert to unmapped architecture** - Test with small files initially
|
|
3. **Process large remux file** - Verify no memory corruption occurs
|
|
4. **Simulate network issues** - Confirm graceful handling |