# Tdarr CIFS Troubleshooting Session - 2025-08-11
## Problem Statement
A Tdarr unmapped node was experiencing persistent download timeouts (observed around 9:08 PM) with large files (31GB+ remuxes), producing "Cancelling" messages and stuck downloads. Downloads would hang for 33+ minutes before timing out, even though the container remained running.
## Initial Hypothesis: Mapped vs Unmapped Node Issue
**Status**: ❌ **DISPROVEN**
- Suspected differences in timeout configuration between unmapped and mapped nodes
- A Windows PC running a mapped Tdarr node works fine (slow but stable)
- Both mapped and unmapped Linux nodes exhibited identical timeout issues
- **Conclusion**: Architecture type was not the root cause
## Key Insight: Windows vs Linux Performance Difference
**Observation**: The Windows Tdarr node (mapped mode) works without timeouts, while the Linux nodes (both mapped and unmapped) fail
**Implication**: The issue is platform-specific, most likely in the network stack or the kernel CIFS implementation
## Root Cause Discovery Process
### Phase 1: Linux Client CIFS Analysis
**Method**: Direct CIFS mount testing on the Tdarr node machine (nobara-pc)
**Initial CIFS Mount Configuration** (problematic):
```bash
//10.10.0.35/media on /mnt/media type cifs (rw,relatime,vers=3.1.1,cache=strict,upcall_target=app,username=root,uid=1000,forceuid,gid=1000,forcegid,addr=10.10.0.35,file_mode=0755,dir_mode=0755,soft,nounix,serverino,mapposix,noperm,reparse=nfs,nativesocket,symlink=native,rsize=4194304,wsize=4194304,bsize=1048576,retrans=1,echo_interval=60,actimeo=30,closetimeo=1,_netdev,x-systemd.automount,x-systemd.device-timeout=10,x-systemd.mount-timeout=30)
```
**Critical Issues Identified**:
- `soft` - Mount fails on timeout instead of retrying indefinitely
- `retrans=1` - Only one retransmission attempt before a soft-mount I/O error
- `closetimeo=1` - Very short close timeout (1 second)
- `cache=strict` - Strict cache coherency, little client-side caching benefit for large files
- `x-systemd.mount-timeout=30` - 30-second mount timeout
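A quick way to confirm which options are actually in effect (the kernel fills in defaults for anything omitted from fstab) is to read the live mount table:
```bash
# Effective CIFS options as the kernel sees them, defaults included
grep cifs /proc/mounts

# Same information in a friendlier column layout
findmnt -t cifs -o TARGET,SOURCE,OPTIONS
```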
**Optimization Applied**:
```bash
//10.10.0.35/media /mnt/media cifs credentials=/home/cal/.samba_credentials,uid=1000,gid=1000,vers=3.1.1,hard,rsize=16777216,wsize=16777216,cache=loose,actimeo=60,echo_interval=30,_netdev,x-systemd.automount,x-systemd.device-timeout=60,x-systemd.mount-timeout=120,noperm 0 0
```
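Most of these options cannot be changed with `-o remount`; applying the new entry takes a full unmount/mount cycle. A minimal sequence on the client, assuming nothing is actively using the share:
```bash
sudo umount /mnt/media          # may need the mnt-media.automount unit stopped first
sudo systemctl daemon-reload    # regenerate mount units from the edited /etc/fstab
sudo mount /mnt/media           # remount with the optimized options
```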
**Performance Testing Results**:
- **Local SSD**: `dd` 800MB in 0.217s (4.0 GB/s) - baseline
- **CIFS 1MB blocks**: 42.7 MB/s - fast, no issues
- **CIFS 4MB blocks**: 205 MB/s - fast, no issues
- **CIFS 8MB blocks**: 83.1 MB/s - **3-minute terminal freeze**
**Critical Discovery**: A block-size dependency causes I/O blocking on large transfers (the 8MB run stalled for minutes)
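The block-size sweep is easy to reproduce with plain `dd`; the paths and counts below are illustrative, not the session's exact invocations:
```bash
# Baseline write to local SSD
dd if=/dev/zero of=/tmp/ddtest bs=1M count=800 conv=fdatasync status=progress

# Same data onto the CIFS share at increasing block sizes;
# the 8M run is where the multi-minute freeze appeared
dd if=/dev/zero of=/mnt/media/ddtest bs=1M count=800 conv=fdatasync status=progress
dd if=/dev/zero of=/mnt/media/ddtest bs=4M count=200 conv=fdatasync status=progress
dd if=/dev/zero of=/mnt/media/ddtest bs=8M count=100 conv=fdatasync status=progress
rm -f /tmp/ddtest /mnt/media/ddtest
```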
### Phase 2: Tdarr Server-Side Analysis
**Method**: Test Tdarr API download path directly
**API Test Command**:
```bash
curl -X POST "http://10.10.0.43:8265/api/v2/file/download" \
-H "Content-Type: application/json" \
-d '{"filePath": "/media/Movies/Jumanji (1995)/Jumanji (1995) Remux-1080p Proper.mkv"}' \
-o /tmp/tdarr-api-test.mkv
```
**Results**:
- **Performance**: 55.7-58.6 MB/s sustained
- **Progress**: Downloaded 15.3GB of 23GB (66%)
- **Failure**: **Download hung at 66% completion**
- **Timing**: Hung after ~5 minutes (consistent with previous timeout patterns)
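While such a test runs, the hang shows up as the output file simply stopping growing:
```bash
# Size should climb at roughly 55 MB/s; a flat reading means the transfer stalled
watch -n 5 'ls -lh /tmp/tdarr-api-test.mkv'
```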
### Phase 3: Tdarr Server CIFS Configuration Analysis
**Method**: Examine server-side storage mount
**Server CIFS Mount** (problematic):
```bash
//10.10.0.35/media /mnt/truenas-share cifs credentials=/root/.truenascreds,vers=3.1.1,rsize=4194304,wsize=4194304,cache=strict,actimeo=30,echo_interval=60,noperm 0 0
```
**Server Issues Identified**:
- **Missing `hard`** - Defaults to `soft` mount behavior
- `cache=strict` - No local caching (same issue as client)
- **No retry/timeout extensions** - Uses unreliable kernel defaults
- **No systemd timeout protection**
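Because a missing `hard` falls back to the soft default silently, the server's effective options are best confirmed from the live mount table rather than from fstab:
```bash
# fstab shows intent; /proc/mounts shows what the kernel actually applied
ssh tdarr "grep truenas-share /proc/mounts"
```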
## Root Cause Confirmed
**Primary Issue**: Tdarr server's CIFS mount to TrueNAS using suboptimal configuration
**Impact**: Large-file streaming via the Tdarr API hangs when the server's CIFS mount hits I/O blocking
**Evidence**: The API download hung in the same pattern as the node timeouts (66% through a large file)
## Solution Strategy
**Fix Tdarr Server CIFS Mount Configuration**:
```bash
//10.10.0.35/media /mnt/truenas-share cifs credentials=/root/.truenascreds,vers=3.1.1,hard,rsize=4194304,wsize=4194304,cache=loose,actimeo=60,echo_interval=30,_netdev,x-systemd.device-timeout=60,x-systemd.mount-timeout=120,noperm 0 0
```
**Key Optimizations**:
- `hard` - Retry indefinitely instead of timing out
- `cache=loose` - Enable local caching for large file performance
- `actimeo=60` - Longer attribute caching
- `echo_interval=30` - More frequent keep-alives
- Extended systemd timeouts for reliability
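Before remounting, the edited entry can be sanity-checked without disturbing the live mount (assuming a util-linux recent enough to ship `findmnt --verify`):
```bash
sudo findmnt --verify    # parse /etc/fstab and report syntax or option problems
sudo mount -fav          # fake-mount everything, printing what would happen
```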
## Implementation Steps
1. **Update server `/etc/fstab`** with optimized CIFS configuration
2. **Remount server storage**:
```bash
ssh tdarr "sudo umount /mnt/truenas-share"
ssh tdarr "sudo systemctl daemon-reload"
ssh tdarr "sudo mount /mnt/truenas-share"
```
3. **Test a large-file API download** to verify the fix (see the sketch below)
4. **Resume Tdarr transcoding** with confidence in large file handling
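For step 3, one option is to re-run the Phase 2 API call and confirm it passes the old 66% stall point:
```bash
curl -X POST "http://10.10.0.43:8265/api/v2/file/download" \
  -H "Content-Type: application/json" \
  -d '{"filePath": "/media/Movies/Jumanji (1995)/Jumanji (1995) Remux-1080p Proper.mkv"}' \
  -o /tmp/tdarr-api-verify.mkv
```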
## Technical Insights
### CIFS vs SMB Protocol Differences
- **Windows nodes**: Use native SMB implementation (stable)
- **Linux nodes**: Use the kernel CIFS module (prone to I/O blocking when poorly configured)
- **Block size sensitivity**: Large block transfers require careful timeout/retry configuration
### Tdarr Architecture Impact
- **Unmapped nodes**: Download entire files via API before processing (high bandwidth, vulnerable to server CIFS issues)
- **Mapped nodes**: Stream files during processing (lower bandwidth, still vulnerable to server CIFS issues)
- **Root cause affects both architectures** since server-side storage access is the bottleneck
### Performance Expectations Post-Fix
- **Consistent 50-100 MB/s** for large file downloads
- **No timeout failures** with properly configured hard mounts
- **Reliable processing** of 31GB+ remux files
## Files Modified
- **Client**: `/etc/fstab` on nobara-pc (CIFS optimization applied)
- **Server**: `/etc/fstab` on tdarr server (pending optimization)
## Monitoring and Validation
- **Success criteria**: Tdarr API download of 23GB+ file completes without hanging
- **Performance target**: Sustained 50+ MB/s throughout the entire transfer
- **Reliability target**: No timeouts during large file processing
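During the validation run it is also worth tailing kernel-side CIFS errors and reconnect messages on the server; a simple sketch assuming journald:
```bash
# Follow kernel messages for CIFS/SMB reconnects or timeouts during the test
ssh tdarr "journalctl -k -f | grep -iE 'cifs|smb'"
```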
## Session Outcome
**Status**: ✅ **ROOT CAUSE IDENTIFIED AND SOLUTION READY**
- Eliminated client-side variables through systematic testing
- Confirmed server-side CIFS configuration as bottleneck
- Validated fix strategy through client-side optimization success
- Ready to implement server-side solution
---
*Session Date: 2025-08-11*
*Duration: ~3 hours*
*Methods: Direct testing, API analysis, mount configuration review*