# KDE Plasma Crash Analysis Summary

**Date**: 2025-08-11
**Incident**: Hard system crash requiring forced reboot
**Analysis Period**: ~11:00 - 11:58 (crash timeline)

## Executive Summary

KDE Plasma did not crash. The system hit **kernel-level deadlocks** triggered by CIFS network failures during intensive Tdarr transcoding, and the desktop environment became unresponsive as a symptom of those kernel problems.

## Timeline of Events

### 11:05 - Network Issues Begin
```
CIFS: VFS: \\10.10.0.35 has not responded in 90 seconds. Reconnecting...
CIFS: VFS: reconnect tcon failed rc = -11
```

### 11:22:18 - Kernel Memory Corruption
```
BUG: Bad page state in process tdarr-ffmpeg pfn:a1af35
page: refcount:0 mapcount:0 mapping:00000000438f9be4 index:0x0 pfn:0xa1af35
aops:cifs_addr_ops [cifs] ino:2f15 dentry name(?):"Alice Through the Looking Glass (2016) Remux-1080p-TdarrCacheF"
```

### 11:23:21+ - RCU Stall Deadlock
```
rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
rcu: Tasks blocked on level-0 rcu_node (CPUs 0-15): P456776
task:ffprobe state:R running task
```

### 11:26:40+ - System Deadlock
```
INFO: task NetworkManager:1806 blocked for more than 122 seconds
INFO: task tailscaled:188215 blocked for more than 122 seconds
INFO: task ThreadPoolForeg:125721 blocked for more than 122 seconds
```

### 11:46:56 - Display Issues (Symptom)
```
qt.qpa.wayland: There are no outputs - creating placeholder screen
kwin_wayland_drm: atomic commit failed: Invalid argument
```

## Root Cause Analysis

### Primary Cause: CIFS + Transcoding Interaction
1. **Network instability** to NAS (10.10.0.35) starting at 11:05
2. **Tdarr container** streaming large video file (10GB+ remux) over CIFS during transcoding
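3. **Kernel memory corruption** - `Bad page state` reported for a page owned by CIFS address operations (`cifs_addr_ops`) during heavy I/O
4. **RCU stall** - RCU grace periods could not complete, preventing the kernel from finishing critical operations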
5. **System-wide hang** affecting all processes including desktop environment

### Contributing Factors
- **No container resource limits** - Tdarr could consume unlimited memory
- **Mapped node architecture** - Forces streaming large files over network during processing
- **Aggressive CIFS buffers** - 16MB buffers under memory pressure
- **Inadequate timeout handling** - CIFS waited 90 seconds on the unresponsive server before attempting to reconnect
- **No interruption capability** - Kernel couldn't abort hung CIFS operations

## Why Hard Reboot Was Required

The kernel reached a state where:
- **RCU subsystem deadlocked** - Critical kernel operations couldn't complete
- **NetworkManager blocked** - Network stack unresponsive
- **Memory management corrupted** - `Bad page state` reported for a CIFS-backed page
- **Display driver affected** - GPU operations failed due to kernel issues

A normal shutdown was impossible because of the kernel-level deadlock.

## Evidence Summary

### System Recovered Cleanly
- **After reboot at 11:58:56** - All services started normally
- **No hardware failures** - All components functional
- **Memory test clean** - 62GB available, no corruption detected
- **KDE Plasma working** - Desktop environment fully operational

### KDE Plasma Was Victim, Not Cause
- **Wayland errors were symptoms** - Display issues occurred after kernel problems
- **No Plasma-specific crashes** - No segfaults or application failures in logs
- **Recovery immediate** - Desktop worked perfectly after reboot
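
The ordering argument can be checked with a little arithmetic on the timeline timestamps. The sketch below is only an illustration: the event labels are paraphrased, and the 11:05 entry has no seconds in the logs quoted here, so `11:05:00` is an assumption.

```python
from datetime import datetime

# Timestamps copied from the timeline in this report. The 11:05 entry has no
# seconds in the report, so ":00" is an assumption used only for this sketch.
EVENTS = [
    ("CIFS server stops responding", "11:05:00"),
    ("Bad page state in tdarr-ffmpeg", "11:22:18"),
    ("RCU stall reported", "11:23:21"),
    ("Hung-task warnings", "11:26:40"),
    ("Wayland/KWin display errors", "11:46:56"),
]


def offsets_from_first(events):
    """Return (label, timedelta since the first event) for each event."""
    parsed = [(label, datetime.strptime(ts, "%H:%M:%S")) for label, ts in events]
    start = parsed[0][1]
    return [(label, when - start) for label, when in parsed]


if __name__ == "__main__":
    for label, delta in offsets_from_first(EVENTS):
        print(f"+{delta}  {label}")
```

Independent of the assumed seconds on the first entry, the first display error (11:46:56) comes 24 minutes 38 seconds after the bad page report (11:22:18), which is consistent with the display problems being downstream of the kernel problems.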

## Recommended Actions

### Immediate (Prevent Recurrence)
1. **Implement Tdarr container resource limits** - Prevent memory exhaustion
2. **Update CIFS mount options** - Better timeout and error handling
3. **Convert to unmapped Tdarr node** - Eliminate CIFS streaming during transcoding

### Monitoring (Early Detection)
1. **CIFS error monitoring** - Detect network issues before escalation
2. **Container resource monitoring** - Alert on memory/CPU exhaustion
3. **RCU stall detection** - Kernel deadlock early warning
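
As a rough starting point for the detection items above, kernel log lines can be matched against the signatures seen in this incident. This is a minimal sketch, not a finished monitor: the pattern names and the `classify` and `count_events` helpers are hypothetical, and the regular expressions are derived only from the log excerpts in this report.

```python
import re

# Patterns derived from the log excerpts in this report. They are a starting
# point, not an exhaustive list of kernel failure signatures.
PATTERNS = {
    "cifs_unresponsive": re.compile(r"CIFS: VFS: .* has not responded in \d+ seconds"),
    "cifs_reconnect_failed": re.compile(r"CIFS: VFS: reconnect tcon failed rc = -?\d+"),
    "bad_page_state": re.compile(r"BUG: Bad page state in process"),
    "rcu_stall": re.compile(r"rcu: INFO: \S+ detected stalls on CPUs/tasks"),
    "hung_task": re.compile(r"INFO: task \S+ blocked for more than \d+ seconds"),
}


def classify(line):
    """Return the name of the first pattern that matches the line, or None."""
    for name, pattern in PATTERNS.items():
        if pattern.search(line):
            return name
    return None


def count_events(lines):
    """Count matched signatures across an iterable of log lines."""
    counts = {}
    for line in lines:
        name = classify(line)
        if name is not None:
            counts[name] = counts.get(name, 0) + 1
    return counts
```

One way to use this is to pipe kernel log output (for example from `journalctl -k`) into a small wrapper that calls `count_events(sys.stdin)`. Alert thresholds and delivery are left to whatever monitoring is already in place.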

### Architecture (Long-term Stability)
1. **Unmapped transcoding architecture** - Process files locally on NVMe cache
2. **Gaming-aware scheduling** - Prevent resource conflicts
3. **Automated recovery procedures** - Handle network issues gracefully
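
The idea behind local processing can be sketched as copy in, work locally, copy out, so the network share sees only two sequential transfers instead of sustained random I/O. This is an illustration of the pattern, not how Tdarr implements unmapped nodes: `process_via_local_cache` and the `transform` callable are hypothetical names, and `cache_dir` stands in for a directory on the local NVMe cache.

```python
import shutil
import tempfile
from pathlib import Path


def process_via_local_cache(source, dest, cache_dir, transform):
    """Copy source to a local cache, run transform there, copy the result to dest.

    transform(local_in, local_out) is a hypothetical callable standing in for
    the transcode step. cache_dir must already exist on local storage.
    """
    source, dest = Path(source), Path(dest)
    with tempfile.TemporaryDirectory(dir=cache_dir) as work:
        local_in = Path(work) / source.name
        local_out = Path(work) / ("out-" + source.name)
        shutil.copyfile(source, local_in)   # single sequential read from the share
        transform(local_in, local_out)      # heavy I/O stays on local disk
        shutil.copyfile(local_out, dest)    # single sequential write back
    # The working directory is removed on exit, including after an exception.
    return dest
```

If the share drops during either copy, the failure surfaces as an ordinary exception from the copy call rather than in the middle of the transcode, which leaves room for a retry or for skipping the file.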

## Key Learnings

1. **Network storage + intensive I/O = risk** - CIFS streaming large files during transcoding can trigger kernel issues
2. **Container resource limits critical** - Unlimited resources can destabilize entire system
3. **Timeouts limit hangs** - Proper timeout configuration shortens the 90-second waits on an unresponsive CIFS server
4. **Desktop symptoms != desktop cause** - Display issues often indicate deeper system problems

## Files Created

1. **`tdarr-container-fixes.md`** - Specific container configuration changes
2. **`cifs-mount-resilience-fixes.md`** - CIFS mount option improvements
3. **`crash-analysis-summary.md`** - This analysis summary

## Next Steps

Implement the recommendations in the order specified in the individual fix documents:
1. Phase 1: Immediate fixes to prevent crashes
2. Phase 2: Architecture migration for stability
3. Phase 3: Production hardening and monitoring

The system is currently stable, but without these changes, similar crashes are likely when processing large files over network storage during periods of network instability.