claude-home/tdarr/troubleshooting.md
Cal Corum c08e779e42 docs: add caddy migration config, tdarr flow backup, and troubleshooting updates
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-08 13:13:21 -05:00


# Tdarr Troubleshooting Guide
## forEach Error Resolution
### Problem: TypeError: Cannot read properties of undefined (reading 'forEach')
**Symptoms**: Scanning phase fails at "Tagging video res" step, preventing all transcodes
**Root Cause**: Custom plugin mounts override community plugins with incompatible versions
### Solution: Clean Plugin Installation
1. **Remove custom plugin mounts** from docker-compose.yml
2. **Force plugin regeneration**:
```bash
# Restart the server (remote, Docker)
ssh tdarr "docker restart tdarr"
# Restart the GPU node (local, Podman)
podman restart tdarr-node-gpu
```
3. **Verify clean plugins**: Check for null-safety fixes `(streams || []).forEach()`
### Plugin Safety Patterns
```javascript
// ❌ Unsafe - throws when streams is undefined
args.variables.ffmpegCommand.streams.forEach((stream) => { /* ... */ });
// ✅ Safe - falls back to an empty array
(args.variables.ffmpegCommand.streams || []).forEach((stream) => { /* ... */ });
```
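The guard above can be wrapped in a small helper so every plugin reads stream data the same way. This is a minimal sketch: the `args.variables.ffmpegCommand.streams` shape mirrors what this guide describes Tdarr passing to flow plugins, and `countVideoStreams` is a hypothetical name, not part of the plugin API.

```javascript
// Null-safe access to the stream list; returns 0 instead of throwing
// when any level of the args object is missing.
function countVideoStreams(args) {
  const streams =
    (args && args.variables && args.variables.ffmpegCommand &&
     args.variables.ffmpegCommand.streams) || [];
  let count = 0;
  streams.forEach((s) => {
    if (s.codec_type === 'video') count += 1;
  });
  return count;
}
```

Calling it with an empty object returns 0 rather than reproducing the `Cannot read properties of undefined` error.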
## Staging Section Timeout Issues
### Problem: Files removed from staging after 300 seconds
**Symptoms**:
- `.tmp` files stuck in work directories
- ENOTEMPTY errors during cleanup
- Subsequent jobs blocked
### Solution: Automated Monitoring System
**Monitor Script**: `/mnt/NV2/Development/claude-home/scripts/monitoring/tdarr-timeout-monitor.sh`
**Automatic Actions**:
- Checks for staging timeouts every 20 minutes
- Removes stuck work directories
- Sends Discord notifications
- Logs all cleanup activities
### Manual Cleanup Commands
```bash
# Check staging section
ssh tdarr "docker logs tdarr | tail -50"
# Find stuck work directories
find /mnt/NV2/tdarr-cache -name "tdarr-workDir*" -type d
# Force cleanup stuck directory
rm -rf /mnt/NV2/tdarr-cache/tdarr-workDir-[ID]
```
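The manual cleanup above can be wrapped in a one-function sketch similar to what the monitoring script does. The `tdarr-workDir*` name pattern comes from this guide; the 30-minute age threshold and the function name are assumptions, not the actual monitor implementation.

```shell
# Remove Tdarr work directories that have not been touched for more
# than 30 minutes (presumed stale after the 300s staging timeout).
cleanup_stale_workdirs() {
  local cache_dir="$1"
  # -mmin +30: only directories with mtime older than 30 minutes
  find "$cache_dir" -maxdepth 1 -type d -name 'tdarr-workDir*' -mmin +30 \
    -exec rm -rf {} +
}

# Example (path is an assumption -- match your cache mount):
# cleanup_stale_workdirs /mnt/NV2/tdarr-cache
```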
## System Stability Issues
### Problem: Kernel crashes during intensive transcoding
**Root Cause**: CIFS network issues during large file streaming (mapped nodes)
### Solution: Convert to Unmapped Node Architecture
1. **Enable unmapped nodes** in server Options
2. **Update node configuration**:
```bash
# Add to container environment
-e nodeType=unmapped
-e unmappedNodeCache=/cache
# Use local cache volume
-v "/mnt/NV2/tdarr-cache:/cache"
# Remove media volume (no longer needed)
```
3. **Benefits**: Eliminates CIFS streaming, prevents kernel crashes
### Container Resource Limits
```yaml
# Prevent memory exhaustion
deploy:
  resources:
    limits:
      memory: 8G
      cpus: '6'
```
## Gaming Detection Issues
### Problem: Tdarr doesn't stop during gaming
**Check gaming detection**:
```bash
# Test current gaming detection
./tdarr-schedule-manager.sh test
# View scheduler logs
tail -f /tmp/tdarr-scheduler.log
# Verify GPU usage detection
nvidia-smi
```
### Gaming Process Detection
**Monitored Processes**:
- Steam, Lutris, Heroic Games Launcher
- Wine, Bottles (Windows compatibility)
- GameMode, MangoHUD (utilities)
- **GPU usage >15%** (configurable threshold)
### Configuration Adjustments
```bash
# Edit gaming detection threshold
./tdarr-schedule-manager.sh edit
# Apply preset configurations
./tdarr-schedule-manager.sh preset gaming-only # No time limits
./tdarr-schedule-manager.sh preset night-only # 10PM-7AM only
```
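The GPU-usage part of gaming detection reduces to comparing `nvidia-smi` utilization against the threshold. A minimal sketch, assuming the 15% default above; the function name is hypothetical, and the utilization value can be passed in explicitly so the check is testable without a GPU:

```shell
# Returns success (0) when GPU utilization exceeds the threshold.
gpu_is_busy() {
  local threshold="${1:-15}"
  local util="$2"   # optional override for testing
  if [ -z "$util" ]; then
    # Query current GPU utilization as a bare integer percentage
    util=$(nvidia-smi --query-gpu=utilization.gpu \
             --format=csv,noheader,nounits | head -n1)
  fi
  [ "$util" -gt "$threshold" ]
}

# Example: pause Tdarr when the GPU looks busy
# gpu_is_busy 15 && podman stop tdarr-node-gpu
```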
## Network and Access Issues
### Server Connection Problems
**Server Access Commands**:
```bash
# SSH to Tdarr server
ssh tdarr
# Check server status
ssh tdarr "docker ps | grep tdarr"
# View server logs
ssh tdarr "docker logs tdarr"
# Access server container
ssh tdarr "docker exec -it tdarr /bin/bash"
```
### Node Registration Issues
```bash
# Check node logs
podman logs tdarr-node-gpu
# Verify node registration
# Look for "Node registered" in server logs
ssh tdarr "docker logs tdarr | grep -i node"
# Test node connectivity
curl http://10.10.0.43:8265/api/v2/status
```
## Performance Issues
### Slow Transcoding Performance
**Diagnosis**:
1. **Check cache location**: Should be local NVMe, not network
2. **Verify unmapped mode**: `nodeType=unmapped` in container
3. **Monitor I/O**: `iotop` during transcoding
**Expected Performance**:
- **Mapped nodes**: Constant SMB streaming (~100MB/s)
- **Unmapped nodes**: Download once → Process locally → Upload once
### GPU Utilization Problems
```bash
# Monitor GPU usage during transcoding
watch nvidia-smi
# Check GPU device access in container
podman exec tdarr-node-gpu nvidia-smi
# Verify NVENC encoder availability
podman exec tdarr-node-gpu ffmpeg -encoders | grep nvenc
```
## Plugin System Issues
### Plugin Loading Failures
**Troubleshooting Steps**:
1. **Check plugin directory**: Ensure no custom mounts override community plugins
2. **Verify dependencies**: FlowHelper files (`metadataUtils.js`, `letterboxUtils.js`)
3. **Test plugin syntax**:
```bash
# Test plugin in Node.js
node -e "require('./path/to/plugin.js')"
```
### Custom Plugin Integration
**Safe Integration Pattern**:
1. **Selective mounting**: Mount only specific required plugins
2. **Dependency verification**: Include all FlowHelper dependencies
3. **Version compatibility**: Ensure plugins match Tdarr version
4. **Null-safety checks**: Add `|| []` to forEach operations
## Monitoring and Logging
### Log Locations
```bash
# Scheduler logs
tail -f /tmp/tdarr-scheduler.log
# Monitor logs
tail -f /tmp/tdarr-monitor/monitor.log
# Server logs
ssh tdarr "docker logs tdarr"
# Node logs
podman logs tdarr-node-gpu
```
### Discord Notification Issues
**Check webhook configuration**:
```bash
# Test Discord webhook
curl -X POST [WEBHOOK_URL] \
-H "Content-Type: application/json" \
-d '{"content": "Test message"}'
```
**Common Issues**:
- JSON escaping in message content
- Markdown formatting in Discord
- User ping placement (outside code blocks)
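The JSON-escaping issue in the list above disappears if the payload is built programmatically instead of hand-written in the shell. A hedged sketch in Python (the function name and `user_id` parameter are assumptions for illustration; `<@id>` is Discord's user-mention syntax):

```python
import json


def build_discord_payload(message, user_id=None):
    """Build a Discord webhook payload with correct JSON escaping.

    Prepends the user ping as plain text, outside any code block,
    so the mention still notifies.
    """
    content = f"<@{user_id}> {message}" if user_id else message
    # json.dumps handles quotes, backslashes, and newlines in message
    return json.dumps({"content": content})
```

Pipe the result straight into `curl -d @-` to avoid shell quoting entirely.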
## Emergency Recovery
### Complete System Reset
```bash
# Stop all containers
podman stop tdarr-node-gpu
ssh tdarr "docker stop tdarr"
# Clean cache directories
rm -rf /mnt/NV2/tdarr-cache/tdarr-workDir*
# Remove scheduler
crontab -e # Delete tdarr lines
# Restart with clean configuration
./start-tdarr-gpu-podman-clean.sh
./tdarr-schedule-manager.sh preset work-safe
./tdarr-schedule-manager.sh install
```
### Data Recovery
**Important**: Tdarr transcodes into the cache and only replaces the source after a successful job, so original files remain untouched during processing
- **Queue data**: Stored in server configuration (`/app/configs`)
- **Progress data**: Lost on container restart (unmapped nodes)
- **Cache files**: Safe to delete, will re-download
## Database Modification & Requeue
### Problem: UI "Requeue All" Button Has No Effect
**Symptoms**: Clicking "Requeue all items (transcode)" in library UI does nothing
**Workaround**: Modify SQLite DB directly, then trigger scan:
```bash
# 1. Reset file statuses in DB (run Python on manticore)
python3 -c "
import sqlite3
conn = sqlite3.connect('/home/cal/docker/tdarr/server-data/Tdarr/DB2/SQL/database.db')
conn.execute(\"UPDATE filejsondb SET json_data = json_set(json_data, '$.TranscodeDecisionMaker', '') WHERE json_extract(json_data, '$.DB') = '<LIBRARY_ID>'\")
conn.commit()
conn.close()
"
# 2. Restart Tdarr
cd /home/cal/docker/tdarr && docker compose down && docker compose up -d
# 3. Trigger scan (required — DB changes alone won't queue files)
curl -s -X POST "http://localhost:8265/api/v2/scan-files" \
-H "Content-Type: application/json" \
-d '{"data":{"scanConfig":{"dbID":"<LIBRARY_ID>","arrayOrPath":"/media/Movies/","mode":"scanFindNew"}}}'
```
**Library IDs**: Movies=`ZWgKkmzJp`, TV Shows=`EjfWXCdU8`
**Note**: The CRUD API (`/api/v2/cruddb`) silently ignores write operations (update/insert/upsert all return 200 but don't persist). Always modify the SQLite DB directly.
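The `json_set`/`json_extract` reset above can be sanity-checked against an in-memory copy of the schema before touching the live DB. Self-contained sketch, assuming only the table and column names this guide documents (`filejsondb`, `json_data`) and using the Movies library ID from above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE filejsondb (json_data TEXT)")
conn.execute(
    "INSERT INTO filejsondb VALUES (?)",
    ('{"DB": "ZWgKkmzJp", "TranscodeDecisionMaker": "Transcode success"}',),
)
# Same UPDATE as the live command: blank the decision for one library
conn.execute(
    "UPDATE filejsondb "
    "SET json_data = json_set(json_data, '$.TranscodeDecisionMaker', '') "
    "WHERE json_extract(json_data, '$.DB') = ?",
    ("ZWgKkmzJp",),
)
conn.commit()
reset = conn.execute(
    "SELECT json_extract(json_data, '$.TranscodeDecisionMaker') FROM filejsondb"
).fetchone()[0]
# reset is now the empty string, i.e. the file is eligible for requeue
```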
### Problem: Library filterCodecsSkip Blocks Flow Plugins
**Symptoms**: Job report shows "File video_codec_name (hevc) is in ignored codecs"
**Cause**: `filterCodecsSkip: "hevc"` in library settings skips files before the flow runs
**Solution**: Clear the filter in DB — the flow's own logic handles codec decisions:
```bash
# In librarysettingsjsondb, set filterCodecsSkip to empty string
```
## Flow Plugin Issues
### Problem: clrSubDef Disposition Change Not Persisting (SRT→ASS Re-encode)
**Symptoms**: Job log shows "Clearing default flag from subtitle stream" but output file still has default subtitle. SRT subtitles become ASS in output.
**Root Cause**: The `clrSubDef` custom function pushed `-disposition:{outputIndex} 0` to `outputArgs` without also specifying `-c:{outputIndex} copy`. Tdarr's Execute plugin skips adding default `-c:N copy` for streams with custom `outputArgs`. Without a codec spec, ffmpeg re-encodes SRT→ASS (MKV default), resetting the disposition.
**Fix**: Always include codec copy when adding outputArgs:
```javascript
// WRONG - causes re-encode
stream.outputArgs.push('-disposition:{outputIndex}', '0');
// RIGHT - preserves codec, changes only disposition
stream.outputArgs.push('-c:{outputIndex}', 'copy', '-disposition:{outputIndex}', '0');
```
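The fix can live in one helper so the codec copy is never forgotten when a disposition is changed. A sketch using the `{outputIndex}` placeholder syntax shown above; the function name is hypothetical:

```javascript
// Clear the default flag on a stream without triggering a re-encode.
function clearDefaultDisposition(stream) {
  stream.outputArgs = stream.outputArgs || [];
  // Explicit codec copy: Tdarr's Execute plugin skips its default
  // `-c:N copy` for streams that carry custom outputArgs.
  stream.outputArgs.push(
    '-c:{outputIndex}', 'copy',
    '-disposition:{outputIndex}', '0'
  );
  return stream;
}
```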
### Problem: ensAC3str Matches Commentary Tracks as Existing AC3 Stereo
**Symptoms**: File has commentary AC3 2ch track but no main-audio AC3 stereo. Plugin logs "File already has en stream in ac3, 2 channels".
**Root Cause**: The community `ffmpegCommandEnsureAudioStream` plugin doesn't filter by track title — any AC3 2ch eng track satisfies the check, including commentary.
**Fix**: Replaced with `customFunction` that filters out tracks with "commentary" in the title tag before checking. Updated in flow `KeayMCz5Y` via direct SQLite modification.
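The commentary filter reduces to a title check on top of the codec/channel/language test. A hedged sketch modeled on ffprobe stream metadata; the function name is an assumption, not the exact `customFunction` stored in flow `KeayMCz5Y`:

```javascript
// True only if a non-commentary English AC3 stereo track exists.
function hasMainAc3Stereo(streams) {
  return (streams || []).some((s) => {
    const title = ((s.tags && s.tags.title) || '').toLowerCase();
    return (
      s.codec_name === 'ac3' &&
      s.channels === 2 &&
      (s.tags && s.tags.language) === 'eng' &&
      !title.includes('commentary')
    );
  });
}
```

With this check, a file whose only AC3 2ch track is titled "Commentary" no longer satisfies the plugin, so the main-audio AC3 stream gets created.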
### Combined Impact: Roku Playback Hang
When both bugs occur together (TrueHD default audio + default subtitle not cleared), Jellyfin must transcode audio AND burn in subtitles simultaneously over HLS. The ~30s startup delay causes Roku to time out at ~33% loading. Fixing either bug alone unblocks playback; clearing the subtitle default is sufficient, since TrueHD-only transcoding is fast enough.
## Common Error Patterns
### "Copy failed" in Staging Section
**Cause**: Network timeout during file transfer to unmapped node
**Solution**: Monitoring system automatically retries
### "ENOTEMPTY" Directory Cleanup Errors
**Cause**: Partial downloads leave files in work directories
**Solution**: Force remove directories, monitoring handles automatically
### Node Disconnection During Processing
**Cause**: Gaming detection or manual stop during active job
**Result**: File returns to queue automatically, safe to restart
## Prevention Best Practices
1. **Use unmapped node architecture** for stability
2. **Implement monitoring system** for automatic cleanup
3. **Configure gaming-aware scheduling** for desktop systems
4. **Set container resource limits** to prevent crashes
5. **Use clean plugin installation** to avoid forEach errors
6. **Monitor system resources** during intensive operations
This troubleshooting guide covers the most common issues and their resolutions for production Tdarr deployments.