docs: add YAML frontmatter to all 151 markdown files

Adds title, description, type, domain, and tags frontmatter to every
doc for improved KB semantic search. The description field is prepended
to every search chunk, and domain/type/tags enable filtered queries.

Type values: context, guide, runbook, reference, troubleshooting
Domain values match directory structure (networking, docker, etc.)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-03-12 09:00:44 -05:00

Tdarr Container Memory Corruption Fixes

Date: 2025-08-11
Issue: Kernel memory corruption in tdarr-ffmpeg process causing system crash
Root Cause: CIFS streaming of large video files during transcoding overwhelming kernel page cache

Critical Issues Identified

CIFS Network Mount Stress: Container directly mounts CIFS shares experiencing network issues
No Resource Limits: Container lacks memory, CPU, and I/O constraints
Mapped Node Architecture: Forces streaming 10GB+ remux files over network during transcoding
Missing Error Handling: No timeout handling or graceful degradation for network storage issues
Container Platform: Using Podman without proper cgroup resource isolation

Recommended Changes

1. Convert to Unmapped Node Architecture (CRITICAL)

Current problematic configuration:

# REMOVE these CIFS volume mounts:
-v "/mnt/media/TV:/media/TV" \
-v "/mnt/media/Movies:/media/Movies" \

New unmapped configuration:

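# Update in scripts/tdarr/start-tdarr-gpu-podman-clean.sh
# KEY CHANGE: nodeType=unmapped, with the NVMe-local cache as the only volume.
# The CIFS mounts are removed entirely.
# Comments stay above the command because a comment after a trailing "\"
# ends the line continuation. Fragment: remaining options are unchanged.
podman run -d --name "${CONTAINER_NAME}" \
    --gpus all \
    --restart unless-stopped \
    -e nodeType=unmapped \
    -e unmappedNodeCache=/cache \
    -v "/mnt/NV2/tdarr-cache:/cache" \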

Benefits:

Eliminates CIFS streaming during transcoding
Prevents kernel memory corruption
3-5x performance improvement with NVMe cache
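
A comment placed after a trailing backslash ends the line continuation, so per-flag notes cannot sit inline in a multi-line `podman run`. One way to keep them is to build the arguments in a Bash array, where comment lines between elements are legal. This is a minimal sketch, not the actual startup script: the function name `build_unmapped_args` is hypothetical, and the image name and server connection settings are left out because they do not appear in this document.

```shell
# Hypothetical helper: prints one podman argument per line for the unmapped node.
build_unmapped_args() {
    local container_name="$1"
    local args=(
        run -d --name "$container_name"
        --gpus all
        --restart unless-stopped
        # KEY CHANGE: unmapped mode, so the node never reads media over CIFS
        -e nodeType=unmapped
        -e unmappedNodeCache=/cache
        # NVMe local cache only; no /mnt/media mounts
        -v /mnt/NV2/tdarr-cache:/cache
    )
    printf '%s\n' "${args[@]}"
}
```

A caller could read the lines back with `mapfile -t args < <(build_unmapped_args "${CONTAINER_NAME}")`, append the image and any remaining options, and then run `podman "${args[@]}"`.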

2. Implement Container Resource Limits (CRITICAL)

Add to container configuration:

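# Limits, in flag order:
#   --memory       32GB cap (50% of system RAM)
#   --memory-swap  40GB total of memory plus swap, i.e. 8GB of swap
#   --cpus         14 cores, leaving 2 for the system
#   --pids-limit   guards against fork-bomb scenarios
#   --ulimit       file descriptor limit, then a 64MiB memory-lock limit
# Comments stay above the command because a comment after a trailing "\"
# ends the line continuation.
podman run -d --name "${CONTAINER_NAME}" \
    --memory=32g \
    --memory-swap=40g \
    --cpus="14" \
    --pids-limit=1000 \
    --ulimit nofile=65536:65536 \
    --ulimit memlock=67108864:67108864 \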
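
The memory and CPU numbers above follow simple rules (half of RAM, 8GB of swap on top, two cores held back), so they can be derived rather than hard-coded. The sketch below assumes a host with 64GB of RAM and 16 cores, which is inferred from the comments above rather than stated directly; `compute_limits` is a hypothetical name.

```shell
# Hypothetical helper: derive podman limit flags from host totals.
# Assumptions: memory cap = 50% of RAM, 8GB of swap on top of that,
# and 2 CPU cores reserved for the host.
compute_limits() {
    local total_ram_gb="$1" total_cores="$2"
    local mem_gb=$(( total_ram_gb / 2 ))
    # --memory-swap is the combined memory + swap ceiling, not swap alone
    local swap_total_gb=$(( mem_gb + 8 ))
    local cpus=$(( total_cores - 2 ))
    if (( cpus < 1 )); then
        cpus=1   # never hand the container zero CPUs on a small host
    fi
    printf -- '--memory=%sg --memory-swap=%sg --cpus=%s\n' \
        "$mem_gb" "$swap_total_gb" "$cpus"
}
```

Calling `compute_limits 64 16` reproduces the three values used in the configuration above.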

3. Add I/O and Network Limits

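# Add bandwidth controls: cap cache reads and writes at 1GB/s each.
# Caution: an unmapped node transfers files through the server API, and
# --network none leaves the container with loopback only. Verify that the
# node can still reach the Tdarr server before keeping that flag.
--device-read-bps /dev/nvme0n1:1g \
--device-write-bps /dev/nvme0n1:1g \
--network none \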
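
Memory, CPU, PID, and I/O limits only take effect when the matching cgroup v2 controllers are available to the container. The sketch below checks a `cgroup.controllers` file for the ones these flags rely on. The helper name `missing_controllers` is hypothetical, and which file to inspect is an assumption: `/sys/fs/cgroup/cgroup.controllers` for rootful Podman, or the user service's `cgroup.controllers` under `/sys/fs/cgroup/user.slice/` for rootless.

```shell
# Hypothetical helper: print the required cgroup v2 controllers that are
# absent from a cgroup.controllers file (space-separated names on one line).
missing_controllers() {
    local controllers_file="$1"
    shift
    local available name missing=""
    # Pad with spaces so each name is matched as a whole word
    available=" $(cat "$controllers_file") "
    for name in "$@"; do
        if [[ "$available" != *" $name "* ]]; then
            missing="${missing:+$missing }$name"
        fi
    done
    printf '%s\n' "$missing"
}
```

For example, `missing_controllers /sys/fs/cgroup/cgroup.controllers memory cpu io pids` prints an empty line when all four are present, and otherwise names the controllers whose limits would not be enforced.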

4. Enhanced Error Handling and Monitoring

Server-side configuration:

# In docker-compose.yml for Tdarr server
environment:
  - fileTimeout=1800              # 30 minutes for large file operations
  - downloadTimeout=1800          # Extended timeout for large downloads
  - uploadTimeout=1800            # Extended timeout for large uploads

Monitoring setup:

# Enable existing monitoring system
/mnt/NV2/Development/claude-home/scripts/monitoring/tdarr-timeout-monitor.sh

# Add to cron for 20-minute checks:
*/20 * * * * /mnt/NV2/Development/claude-home/scripts/monitoring/tdarr-timeout-monitor.sh
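
Re-running an install step should not stack duplicate cron lines. A minimal sketch of an idempotent install follows; `ensure_monitor_entry` is a hypothetical helper that reads the current crontab on stdin and appends the entry only when that exact line is missing.

```shell
# Entry copied from the cron line above.
MONITOR_ENTRY='*/20 * * * * /mnt/NV2/Development/claude-home/scripts/monitoring/tdarr-timeout-monitor.sh'

# Hypothetical helper: echo the existing crontab, then add the monitor
# entry if no line matches it exactly.
ensure_monitor_entry() {
    local current
    current="$(cat)"
    if [ -n "$current" ]; then
        printf '%s\n' "$current"
    fi
    # -F: fixed string (the entry contains "*"), -x: whole-line match
    if ! grep -Fxq -- "$MONITOR_ENTRY" <<<"$current"; then
        printf '%s\n' "$MONITOR_ENTRY"
    fi
}
```

It could be wired up as `crontab -l 2>/dev/null | ensure_monitor_entry | crontab -`, assuming the local `crontab` accepts `-` for stdin.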

5. Gaming-Aware Scheduling Integration

# Install the gaming-aware scheduler
/mnt/NV2/Development/claude-home/scripts/tdarr/tdarr-schedule-manager.sh install

# Configure for night-only transcoding during troubleshooting
/mnt/NV2/Development/claude-home/scripts/tdarr/tdarr-schedule-manager.sh preset night-only

Implementation Priority

Phase 1: Immediate (Prevent Crashes)

Add resource limits to existing container
Install monitoring system for early warning
Configure CIFS resilience parameters

Phase 2: Architecture Migration (Performance + Stability)

Convert to unmapped node architecture
Remove CIFS volume mounts from container
Test with single large file (10GB+ remux)

Phase 3: Production Hardening

Gaming-aware scheduling integration
Comprehensive monitoring with Discord alerts
Automated recovery scripts

Expected Results

After implementing these changes:

Memory corruption eliminated: No direct CIFS I/O during transcoding
System stability: Resource limits prevent kernel exhaustion
Performance improvement: 3-5x faster transcoding with NVMe cache
Network resilience: Unmapped nodes handle network issues gracefully
Automated recovery: Monitoring system prevents cascade failures

Files to Modify

/mnt/NV2/Development/claude-home/scripts/tdarr/start-tdarr-gpu-podman-clean.sh - Main container startup script
Tdarr server docker-compose configuration - Add timeout settings
Cron configuration - Add monitoring script

Testing Plan

Test with resource limits first - Verify container constraints work
Convert to unmapped architecture - Test with small files initially
Process large remux file - Verify no memory corruption occurs
Simulate network issues - Confirm graceful handling
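
For the large-file and network-fault tests, a quick pass over the kernel log gives a yes/no signal. The sketch below counts suspicious lines read from stdin. The function name `count_suspect_lines` is hypothetical and the patterns are assumptions about what CIFS trouble and memory corruption look like in the log, so tune them against messages seen on the affected host.

```shell
# Hypothetical helper: count kernel log lines on stdin that match patterns
# assumed to indicate CIFS trouble or memory corruption.
count_suspect_lines() {
    # grep -c prints the count but exits 1 on zero matches, hence "|| true"
    grep -Eci 'CIFS.*(error|not respond)|general protection fault|BUG: |Out of memory' || true
}
```

After a test run, `journalctl -k -b --no-pager | count_suspect_lines` (or `dmesg | count_suspect_lines`) should print `0` if none of these patterns appeared in the current boot.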
