# Tdarr Troubleshooting Guide
## forEach Error Resolution
### Problem: TypeError: Cannot read properties of undefined (reading 'forEach')
**Symptoms**: Scanning phase fails at "Tagging video res" step, preventing all transcodes
**Root Cause**: Custom plugin mounts override community plugins with incompatible versions
### Solution: Clean Plugin Installation
1. **Remove custom plugin mounts** from docker-compose.yml
2. **Force plugin regeneration**:
```bash
ssh tdarr "docker restart tdarr"
podman restart tdarr-node-gpu
```
3. **Verify clean plugins**: Check for null-safety fixes `(streams || []).forEach()`
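To spot-check step 3, grep the regenerated plugin tree for unguarded `forEach` calls; the plugin path inside the server container is an assumption and may differ between image versions:
```bash
# Plugin path inside the container is an assumption - adjust for your image
ssh tdarr "docker exec tdarr grep -rn 'streams.forEach' /app/server/Tdarr/Plugins | head -20"
# Clean community plugins guard the array first: (streams || []).forEach(...)
```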
### Plugin Safety Patterns
```javascript
// ❌ Unsafe - throws TypeError when streams is undefined
args.variables.ffmpegCommand.streams.forEach((stream) => {
  // per-stream logic
});
// ✅ Safe - null-safe forEach over a guaranteed array
(args.variables.ffmpegCommand.streams || []).forEach((stream) => {
  // per-stream logic
});
```
## Staging Section Timeout Issues
### Problem: Files removed from staging after 300 seconds
**Symptoms**:
- `.tmp` files stuck in work directories
- ENOTEMPTY errors during cleanup
- Subsequent jobs blocked
### Solution: Automated Monitoring System
**Monitor Script**: `/mnt/NV2/Development/claude-home/scripts/monitoring/tdarr-timeout-monitor.sh`
**Automatic Actions**:
- Detects staging timeouts every 20 minutes
- Removes stuck work directories
- Sends Discord notifications
- Logs all cleanup activities
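A minimal sketch of such a monitor, assuming a 20-minute cron interval and treating work directories idle for more than 30 minutes as stuck (the webhook URL is a placeholder; the production script lives at the path above):
```bash
#!/usr/bin/env bash
# Sketch of a staging-timeout monitor - not the production script
CACHE_DIR="/mnt/NV2/tdarr-cache"
LOG_FILE="/tmp/tdarr-monitor/monitor.log"
WEBHOOK_URL="${DISCORD_WEBHOOK_URL:?set a webhook URL}"   # placeholder
mkdir -p "$(dirname "$LOG_FILE")"

# Work directories untouched for >30 minutes are assumed stuck
find "$CACHE_DIR" -maxdepth 1 -type d -name 'tdarr-workDir*' -mmin +30 | while read -r dir; do
  echo "$(date -Is) removing stuck work dir: $dir" >> "$LOG_FILE"
  rm -rf "$dir"
  curl -s -X POST "$WEBHOOK_URL" \
    -H "Content-Type: application/json" \
    -d "{\"content\": \"Tdarr monitor: removed stuck $(basename "$dir")\"}" > /dev/null
done
```
Run it from cron every 20 minutes to match the detection interval described above.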
### Manual Cleanup Commands
```bash
# Check staging section
ssh tdarr "docker logs tdarr | tail -50"
# Find stuck work directories
find /mnt/NV2/tdarr-cache -name "tdarr-workDir*" -type d
# Force cleanup stuck directory
rm -rf /mnt/NV2/tdarr-cache/tdarr-workDir-[ID]
```
## System Stability Issues
### Problem: Kernel crashes during intensive transcoding
**Root Cause**: CIFS network issues during large file streaming (mapped nodes)
### Solution: Convert to Unmapped Node Architecture
1. **Enable unmapped nodes** in server Options
2. **Update node configuration**:
```bash
# Add to container environment
-e nodeType=unmapped
-e unmappedNodeCache=/cache
# Use local cache volume
-v "/mnt/NV2/tdarr-cache:/cache"
# Remove media volume (no longer needed)
```
3. **Benefits**: Eliminates CIFS streaming, prevents kernel crashes
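Assembled into a full launch command, an unmapped node might be started like this; the image tag, server address, port, and GPU passthrough flag are assumptions drawn from the rest of this guide:
```bash
# Sketch of an unmapped-node launch - verify ports and GPU flags for your setup
podman run -d --name tdarr-node-gpu \
  --device nvidia.com/gpu=all \
  -e serverIP=10.10.0.43 \
  -e serverPort=8266 \
  -e nodeType=unmapped \
  -e unmappedNodeCache=/cache \
  -v "/mnt/NV2/tdarr-cache:/cache" \
  ghcr.io/haveagitgat/tdarr_node:latest
```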
### Container Resource Limits
```yaml
# Prevent memory exhaustion
deploy:
  resources:
    limits:
      memory: 8G
      cpus: '6'
```
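To verify the limits actually took effect at runtime:
```bash
# Live CPU/memory usage for the node container
podman stats --no-stream tdarr-node-gpu
# Configured memory ceiling in bytes (0 means unlimited)
podman inspect tdarr-node-gpu --format '{{.HostConfig.Memory}}'
```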
## Gaming Detection Issues
### Problem: Tdarr doesn't stop during gaming
**Check gaming detection**:
```bash
# Test current gaming detection
./tdarr-schedule-manager.sh test
# View scheduler logs
tail -f /tmp/tdarr-scheduler.log
# Verify GPU usage detection
nvidia-smi
```
### Gaming Process Detection
**Monitored Processes**:
- Steam, Lutris, Heroic Games Launcher
- Wine, Bottles (Windows compatibility)
- GameMode, MangoHUD (utilities)
- **GPU usage >15%** (configurable threshold)
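A minimal sketch of the two detection signals, combining a process match with the GPU threshold from the list above:
```bash
#!/usr/bin/env bash
# Exit 0 when gaming is detected, 1 otherwise
GPU_THRESHOLD=15

# Signal 1: a known launcher or compatibility layer is running
if pgrep -f 'steam|lutris|heroic|wine|bottles|gamemode' > /dev/null; then
  echo "gaming: launcher process detected"
  exit 0
fi

# Signal 2: GPU utilization above the configurable threshold
gpu_util=$(nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits | head -1)
if [ "${gpu_util:-0}" -gt "$GPU_THRESHOLD" ]; then
  echo "gaming: GPU at ${gpu_util}% (threshold ${GPU_THRESHOLD}%)"
  exit 0
fi

echo "idle"
exit 1
```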
### Configuration Adjustments
```bash
# Edit gaming detection threshold
./tdarr-schedule-manager.sh edit
# Apply preset configurations
./tdarr-schedule-manager.sh preset gaming-only # No time limits
./tdarr-schedule-manager.sh preset night-only # 10PM-7AM only
```
## Network and Access Issues
### Server Connection Problems
**Server Access Commands**:
```bash
# SSH to Tdarr server
ssh tdarr
# Check server status
ssh tdarr "docker ps | grep tdarr"
# View server logs
ssh tdarr "docker logs tdarr"
# Access server container
ssh tdarr "docker exec -it tdarr /bin/bash"
```
### Node Registration Issues
```bash
# Check node logs
podman logs tdarr-node-gpu
# Verify node registration
# Look for "Node registered" in server logs
ssh tdarr "docker logs tdarr | grep -i node"
# Test node connectivity
curl http://10.10.0.43:8265/api/v2/status
```
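If registration is slow, a short polling loop against the same status endpoint confirms when the server becomes reachable:
```bash
# Poll the server API for up to ~2 minutes
for i in $(seq 1 24); do
  if curl -sf http://10.10.0.43:8265/api/v2/status > /dev/null; then
    echo "server reachable after attempt $i"
    break
  fi
  echo "attempt $i: no response, retrying in 5s"
  sleep 5
done
```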
## Performance Issues
### Slow Transcoding Performance
**Diagnosis**:
1. **Check cache location**: Should be local NVMe, not network
2. **Verify unmapped mode**: `nodeType=unmapped` in container
3. **Monitor I/O**: `iotop` during transcoding
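For step 1 of the diagnosis, confirm the cache path sits on a local filesystem rather than a network mount:
```bash
# Filesystem type should be local (ext4/xfs/btrfs), not cifs or nfs
df -hT /mnt/NV2/tdarr-cache
# Source device and FS type for the mount containing the cache
findmnt -T /mnt/NV2/tdarr-cache -no SOURCE,FSTYPE
```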
**Expected Performance**:
- **Mapped nodes**: Constant SMB streaming (~100MB/s)
- **Unmapped nodes**: Download once → Process locally → Upload once
### GPU Utilization Problems
```bash
# Monitor GPU usage during transcoding
watch nvidia-smi
# Check GPU device access in container
podman exec tdarr-node-gpu nvidia-smi
# Verify NVENC encoder availability
podman exec tdarr-node-gpu ffmpeg -encoders | grep nvenc
```
## Plugin System Issues
### Plugin Loading Failures
**Troubleshooting Steps**:
1. **Check plugin directory**: Ensure no custom mounts override community plugins
2. **Verify dependencies**: FlowHelper files (`metadataUtils.js`, `letterboxUtils.js`)
3. **Test plugin syntax**:
```bash
# Test plugin in Node.js
node -e "require('./path/to/plugin.js')"
```
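To sweep an entire plugin directory rather than a single file (the directory path here is illustrative):
```bash
# Syntax-check every plugin; prints only the broken ones
for f in /mnt/NV2/tdarr-plugins/*.js; do
  node -e "require('$f')" 2> /dev/null || echo "broken: $f"
done
```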
### Custom Plugin Integration
**Safe Integration Pattern**:
1. **Selective mounting**: Mount only specific required plugins
2. **Dependency verification**: Include all FlowHelper dependencies
3. **Version compatibility**: Ensure plugins match Tdarr version
4. **Null-safety checks**: Add `|| []` to forEach operations
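A hedged example of the selective-mount pattern from steps 1-2, in the same fragment style as the volume flags above; the host and container paths are illustrative and should be checked against your Tdarr version:
```bash
# Overlay a single custom plugin read-only, leaving community plugins untouched
-v "/mnt/NV2/tdarr-plugins/myCustomPlugin.js:/app/server/Tdarr/Plugins/Local/myCustomPlugin.js:ro"
# Include any FlowHelper dependencies the plugin requires
-v "/mnt/NV2/tdarr-plugins/FlowHelpers:/app/server/Tdarr/Plugins/FlowHelpers:ro"
```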
## Monitoring and Logging
### Log Locations
```bash
# Scheduler logs
tail -f /tmp/tdarr-scheduler.log
# Monitor logs
tail -f /tmp/tdarr-monitor/monitor.log
# Server logs
ssh tdarr "docker logs tdarr"
# Node logs
podman logs tdarr-node-gpu
```
### Discord Notification Issues
**Check webhook configuration**:
```bash
# Test Discord webhook
curl -X POST [WEBHOOK_URL] \
  -H "Content-Type: application/json" \
  -d '{"content": "Test message"}'
```
**Common Issues**:
- JSON escaping in message content
- Markdown formatting in Discord
- User ping placement (outside code blocks)
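To sidestep the JSON-escaping pitfall entirely, build the payload with jq, which handles quoting of arbitrary message text (assumes jq is installed; the webhook URL is a placeholder):
```bash
MESSAGE='Transcode complete: "movie.mkv" (saved 42%)'
curl -s -X POST "$WEBHOOK_URL" \
  -H "Content-Type: application/json" \
  -d "$(jq -n --arg content "$MESSAGE" '{content: $content}')"
```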
## Emergency Recovery
### Complete System Reset
```bash
# Stop all containers
podman stop tdarr-node-gpu
ssh tdarr "docker stop tdarr"
# Clean cache directories
rm -rf /mnt/NV2/tdarr-cache/tdarr-workDir*
# Remove scheduler
crontab -e # Delete tdarr lines
# Restart with clean configuration
./start-tdarr-gpu-podman-clean.sh
./tdarr-schedule-manager.sh preset work-safe
./tdarr-schedule-manager.sh install
```
### Data Recovery
**Important**: Transcodes run against cache copies, so original files remain untouched until a job completes successfully
- **Queue data**: Stored in server configuration (`/app/configs`)
- **Progress data**: Lost on container restart (unmapped nodes)
- **Cache files**: Safe to delete, will re-download
## Common Error Patterns
### "Copy failed" in Staging Section
**Cause**: Network timeout during file transfer to unmapped node
**Solution**: Monitoring system automatically retries
### "ENOTEMPTY" Directory Cleanup Errors
**Cause**: Partial downloads leave files in work directories
**Solution**: Force remove directories, monitoring handles automatically
### Node Disconnection During Processing
**Cause**: Gaming detection or manual stop during active job
**Result**: File returns to queue automatically, safe to restart
## Prevention Best Practices
1. **Use unmapped node architecture** for stability
2. **Implement monitoring system** for automatic cleanup
3. **Configure gaming-aware scheduling** for desktop systems
4. **Set container resource limits** to prevent crashes
5. **Use clean plugin installation** to avoid forEach errors
6. **Monitor system resources** during intensive operations
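For item 6, a simple combined watch over GPU and memory during a transcode run:
```bash
# Refresh GPU utilization and memory pressure every 5 seconds
watch -n 5 '
  nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv,noheader
  free -h | grep Mem
'
```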
This troubleshooting guide covers the most common issues and their resolutions for production Tdarr deployments.