Cal Corum 10c9e0d854 CLAUDE: Migrate to technology-first documentation architecture

Complete restructure from patterns/examples/reference to technology-focused directories:

• Created technology-specific directories with comprehensive documentation:
  - /tdarr/ - Transcoding automation with gaming-aware scheduling
  - /docker/ - Container management with GPU acceleration patterns
  - /vm-management/ - Virtual machine automation and cloud-init
  - /networking/ - SSH infrastructure, reverse proxy, and security
  - /monitoring/ - System health checks and Discord notifications
  - /databases/ - Database patterns and troubleshooting
  - /development/ - Programming language patterns (bash, nodejs, python, vuejs)

• Enhanced CLAUDE.md with intelligent context loading:
  - Technology-first loading rules for automatic context provision
  - Troubleshooting keyword triggers for emergency scenarios
  - Documentation maintenance protocols with automated reminders
  - Context window management for optimal documentation updates

• Preserved valuable content from .claude/tmp/:
  - SSH security improvements and server inventory
  - Tdarr CIFS troubleshooting and Docker iptables solutions
  - Operational scripts with proper technology classification

• Benefits achieved:
  - Self-contained technology directories with complete context
  - Automatic loading of relevant documentation based on keywords
  - Emergency-ready troubleshooting with comprehensive guides
  - Scalable structure for future technology additions
  - Eliminated context bloat through targeted loading

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

2025-08-12 23:20:15 -05:00

4.8 KiB


Tdarr Container Memory Corruption Fixes

Date: 2025-08-11
Issue: Kernel memory corruption in tdarr-ffmpeg process causing system crash
Root Cause: CIFS streaming of large video files during transcoding overwhelming kernel page cache

Critical Issues Identified

CIFS Network Mount Stress: The container mounts CIFS shares directly, and those shares are experiencing network issues
No Resource Limits: Container lacks memory, CPU, and I/O constraints
Mapped Node Architecture: Forces streaming 10GB+ remux files over network during transcoding
Missing Error Handling: No timeout handling or graceful degradation for network storage issues
Container Platform: Using Podman without proper cgroup resource isolation

Recommended Changes

1. Convert to Unmapped Node Architecture (CRITICAL)

Current problematic configuration:

# REMOVE these CIFS volume mounts:
-v "/mnt/media/TV:/media/TV" \
-v "/mnt/media/Movies:/media/Movies" \

New unmapped configuration:

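# Update in scripts/tdarr/start-tdarr-gpu-podman-clean.sh
# Notes are kept above the command: text after a trailing "\" breaks line continuation.
#   nodeType=unmapped        KEY CHANGE: unmapped mode
#   /mnt/NV2/tdarr-cache     NVMe local cache only
#   CIFS mounts              REMOVED entirely
# (fragment: remaining options and the image name stay as in the existing script)
podman run -d --name "${CONTAINER_NAME}" \
    --gpus all \
    --restart unless-stopped \
    -e nodeType=unmapped \
    -e unmappedNodeCache=/cache \
    -v "/mnt/NV2/tdarr-cache:/cache" \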

Benefits:

Eliminates CIFS streaming during transcoding
Removes the suspected trigger for the kernel memory corruption
Expected 3-5x performance improvement with NVMe cache
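
Before and after the migration, it can help to confirm which filesystem actually backs each path handed to the container. The sketch below is a minimal, hypothetical helper (fs_type_for_path is not part of Tdarr or Podman). It assumes the usual /proc/mounts layout (device, mount point, type) and mount points without spaces.

```shell
# Hypothetical helper: print the filesystem type backing a path.
# Reads /proc/mounts-style lines: device, mount point, type, options...
fs_type_for_path() (
    path=$1
    mounts_file=${2:-/proc/mounts}
    best_len=-1
    best_type=unknown
    while read -r dev mnt type rest; do
        # A mount point matches if it equals the path or is a parent directory of it.
        case $path in
            "$mnt"|"$mnt"/*) match=1 ;;
            *) match=0 ;;
        esac
        # The root mount is a parent of every absolute path.
        [ "$mnt" = "/" ] && match=1
        # Longest matching mount point wins; later entries win ties (overmounts).
        if [ "$match" -eq 1 ] && [ "${#mnt}" -ge "$best_len" ]; then
            best_len=${#mnt}
            best_type=$type
        fi
    done < "$mounts_file"
    printf '%s\n' "$best_type"
)

# Example (not run here):
#   fs_type_for_path /mnt/media/TV          # a CIFS share would report "cifs"
#   fs_type_for_path /mnt/NV2/tdarr-cache   # a local NVMe filesystem type is expected
```

If the cache path reports cifs, the container would still be reading from the network share and the unmapped configuration has not taken effect as intended.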

2. Implement Container Resource Limits (CRITICAL)

Add to container configuration:

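# Notes are kept above the command so the "\" line continuations stay valid:
#   --memory=32g          32GB, 50% of system RAM
#   --memory-swap=40g     memory + swap combined, i.e. 8GB of swap on top of the 32GB
#   --cpus="14"           leaves 2 cores for the system
#   --pids-limit=1000     prevents fork bomb scenarios
#   --ulimit nofile       file descriptor limit
#   --ulimit memlock      caps locked memory at 64 MiB
podman run -d --name "${CONTAINER_NAME}" \
    --memory=32g \
    --memory-swap=40g \
    --cpus="14" \
    --pids-limit=1000 \
    --ulimit nofile=65536:65536 \
    --ulimit memlock=67108864:67108864 \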
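
The 32g / 40g / 14 figures follow from simple rules: half of RAM, 8GB of swap on top, and two cores held back. As a rough sketch, the same arithmetic can be derived from the host itself. suggest_limits is a hypothetical helper, and it rounds down to whole GiB, so a machine sold as 64GB may report slightly less in MemTotal and yield 31g.

```shell
# Hypothetical helper: derive limit flags from total RAM (in kB) and core count.
suggest_limits() (
    mem_total_kb=$1
    cores=$2
    mem_gib=$(( mem_total_kb / 1024 / 1024 / 2 ))   # 50% of RAM, whole GiB, rounded down
    swap_gib=$(( mem_gib + 8 ))                      # --memory-swap counts memory + swap
    cpus=$(( cores - 2 ))                            # hold back 2 cores for the host
    [ "$cpus" -lt 1 ] && cpus=1                      # never suggest fewer than 1 CPU
    printf '%s\n' "--memory=${mem_gib}g --memory-swap=${swap_gib}g --cpus=${cpus}"
)

# Example (not run here), reading the real values from the host:
#   suggest_limits "$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)" "$(nproc)"
```

With exactly 64 GiB (67108864 kB) and 16 cores as inputs, this reproduces the values used above.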

3. Add I/O and Network Limits

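# Add bandwidth controls: cap cache reads and writes at 1GB/s each.
# CAUTION: --network none removes all container networking. An unmapped node
# moves files through the server API, so confirm it can still reach the server.
--device-read-bps /dev/nvme0n1:1g \
--device-write-bps /dev/nvme0n1:1g \
--network none \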

4. Enhanced Error Handling and Monitoring

Server-side configuration:

# In docker-compose.yml for Tdarr server
environment:
  - fileTimeout=1800              # 30 minutes for large file operations
  - downloadTimeout=1800          # Extended timeout for large downloads
  - uploadTimeout=1800            # Extended timeout for large uploads

Monitoring setup:

# Enable existing monitoring system
/mnt/NV2/Development/claude-home/scripts/monitoring/tdarr-timeout-monitor.sh

# Add to cron for 20-minute checks:
*/20 * * * * /mnt/NV2/Development/claude-home/scripts/monitoring/tdarr-timeout-monitor.sh
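
Adding the cron line by hand more than once leaves duplicate entries, so the monitor would run several times per interval. A small sketch of an idempotent install follows. ensure_cron_entry is a hypothetical helper that reads the current crontab on stdin and appends the entry only when that exact line is missing.

```shell
# Hypothetical helper: read a crontab on stdin and print it back,
# appending the entry only if that exact line is not already present.
ensure_cron_entry() (
    entry=$1
    current=$(cat)
    if printf '%s\n' "$current" | grep -Fxq -- "$entry"; then
        printf '%s\n' "$current"            # already installed: unchanged
    elif [ -z "$current" ]; then
        printf '%s\n' "$entry"              # empty crontab: entry only
    else
        printf '%s\n%s\n' "$current" "$entry"
    fi
)

# Example (not run here):
#   entry='*/20 * * * * /mnt/NV2/Development/claude-home/scripts/monitoring/tdarr-timeout-monitor.sh'
#   crontab -l 2>/dev/null | ensure_cron_entry "$entry" | crontab -
```

The match is on the whole line as a fixed string, so the asterisks in the schedule are not treated as a pattern.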

5. Gaming-Aware Scheduling Integration

# Install the gaming-aware scheduler
/mnt/NV2/Development/claude-home/scripts/tdarr/tdarr-schedule-manager.sh install

# Configure for night-only transcoding during troubleshooting
/mnt/NV2/Development/claude-home/scripts/tdarr/tdarr-schedule-manager.sh preset night-only

Implementation Priority

Phase 1: Immediate (Prevent Crashes)

Add resource limits to existing container
Install monitoring system for early warning
Configure CIFS resilience parameters

Phase 2: Architecture Migration (Performance + Stability)

Convert to unmapped node architecture
Remove CIFS volume mounts from container
Test with single large file (10GB+ remux)

Phase 3: Production Hardening

Gaming-aware scheduling integration
Comprehensive monitoring with Discord alerts
Automated recovery scripts

Expected Results

After implementing these changes:

Memory corruption eliminated: No direct CIFS I/O during transcoding
System stability: Resource limits prevent kernel exhaustion
Performance improvement: 3-5x faster transcoding with NVMe cache
Network resilience: Unmapped nodes handle network issues gracefully
Automated recovery: Monitoring system prevents cascade failures

Files to Modify

/mnt/NV2/Development/claude-home/scripts/tdarr/start-tdarr-gpu-podman-clean.sh - Main container startup script
Tdarr server docker-compose configuration - Add timeout settings
Cron configuration - Add monitoring script

Testing Plan

Test with resource limits first - Verify container constraints work
Convert to unmapped architecture - Test with small files initially
Process large remux file - Verify no memory corruption occurs
Simulate network issues - Confirm graceful handling
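
For the large-file and network-fault tests, "no memory corruption" needs something concrete to check. One option is to scan the kernel log after each run. The sketch below is a minimal filter. kernel_warnings is a hypothetical helper, and its pattern list is an assumed, non-exhaustive set of common kernel warning strings, not the signature of the original crash.

```shell
# Hypothetical helper: read kernel log text on stdin and print lines that
# match an assumed set of warning signs. Exit status 0 means a match was found.
kernel_warnings() {
    grep -E -- 'BUG:|Oops|general protection fault|Out of memory|CIFS: VFS:'
}

# Example (not run here; dmesg may require root on some systems):
#   if dmesg | kernel_warnings; then
#       echo "kernel warning signs found after transcode test"
#   fi
```

A clean scan does not prove the problem is gone, but any match after a test run is a reason to stop and investigate before moving to the next phase.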
