claude-home/reference/docker/crash-analysis-summary.md
Cal Corum 34702a37fc CLAUDE: Add comprehensive KDE Plasma crash analysis and prevention documentation
- Add crash-analysis-summary.md: Complete incident timeline and root cause analysis
- Add tdarr-container-fixes.md: Container resource limits and unmapped node conversion
- Add cifs-mount-resilience-fixes.md: CIFS mount options for kernel stability
- Update tdarr-troubleshooting.md: Link to new system crash prevention measures
- Update nas-mount-configuration.md: Add stability considerations for production systems

Root cause: CIFS streaming of large files during transcoding caused kernel memory
corruption and system deadlock. Documents provide comprehensive prevention strategy.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-08-11 12:29:31 -05:00


# KDE Plasma Crash Analysis Summary
**Date**: 2025-08-11
**Incident**: Hard system crash requiring forced reboot
**Analysis Period**: ~11:00 - 11:58 (crash timeline)
## Executive Summary
KDE Plasma did not actually crash. The system experienced **kernel-level deadlocks** caused by CIFS network issues combined with intensive Tdarr transcoding operations, and the desktop environment became unresponsive as a symptom of those deeper kernel problems.
## Timeline of Events
### 11:05 - Network Issues Begin
```
CIFS: VFS: \\10.10.0.35 has not responded in 90 seconds. Reconnecting...
CIFS: VFS: reconnect tcon failed rc = -11
```
### 11:22:18 - Kernel Memory Corruption
```
BUG: Bad page state in process tdarr-ffmpeg pfn:a1af35
page: refcount:0 mapcount:0 mapping:00000000438f9be4 index:0x0 pfn:0xa1af35
aops:cifs_addr_ops [cifs] ino:2f15 dentry name(?):"Alice Through the Looking Glass (2016) Remux-1080p-TdarrCacheF"
```
### 11:23:21+ - RCU Stall Deadlock
```
rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
rcu: Tasks blocked on level-0 rcu_node (CPUs 0-15): P456776
task:ffprobe state:R running task
```
### 11:26:40+ - System Deadlock
```
INFO: task NetworkManager:1806 blocked for more than 122 seconds
INFO: task tailscaled:188215 blocked for more than 122 seconds
INFO: task ThreadPoolForeg:125721 blocked for more than 122 seconds
```
### 11:46:56 - Display Issues (Symptom)
```
qt.qpa.wayland: There are no outputs - creating placeholder screen
kwin_wayland_drm: atomic commit failed: Invalid argument
```
## Root Cause Analysis
### Primary Cause: CIFS + Transcoding Interaction
1. **Network instability** to NAS (10.10.0.35) starting at 11:05
2. **Tdarr container** streaming a large video file (10GB+ remux) over CIFS during transcoding
3. **Kernel memory corruption** in CIFS address operations during heavy I/O
4. **RCU deadlock** preventing kernel from completing critical operations
5. **System-wide hang** affecting all processes including desktop environment
### Contributing Factors
- **No container resource limits** - Tdarr could consume unlimited memory
- **Mapped node architecture** - Forces streaming large files over network during processing
- **Aggressive CIFS buffers** - 16MB buffers under memory pressure
- **Inadequate timeout handling** - 90-second hangs before retry attempts
- **No interruption capability** - Kernel couldn't abort hung CIFS operations
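
To check whether a host has the buffer and retry settings listed above, the mount table can be audited. The following is a minimal sketch, not the fix itself. It assumes the standard `/proc/mounts` field layout and treats `rsize`/`wsize` values at or above 16 MiB, and the `hard` option, as worth reviewing. The 16 MiB threshold and the helper names are illustrative choices, not values taken from the fix documents.

```python
from typing import Dict, List, Optional, Tuple

# Assumption: 16 MiB matches the buffer size named in this report.
# Adjust the threshold for your own system.
LARGE_BUFFER_BYTES = 16 * 1024 * 1024


def parse_cifs_mounts(mounts_text: str) -> List[Tuple[str, Dict[str, Optional[str]]]]:
    """Return (mount_point, options) for every cifs entry in /proc/mounts text."""
    results = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, fs type, options, dump, pass
        if len(fields) < 4 or fields[2] != "cifs":
            continue
        options: Dict[str, Optional[str]] = {}
        for item in fields[3].split(","):
            key, sep, value = item.partition("=")
            # Flag-style options such as "rw" carry no value.
            options[key] = value if sep else None
        results.append((fields[1], options))
    return results


def find_concerns(options: Dict[str, Optional[str]]) -> List[str]:
    """List settings that may deserve review (hypothetical review criteria)."""
    concerns = []
    for key in ("rsize", "wsize"):
        value = options.get(key)
        if value is not None and value.isdigit() and int(value) >= LARGE_BUFFER_BYTES:
            concerns.append(f"{key}={value} is at or above 16 MiB")
    if "hard" in options:
        concerns.append("hard mount: callers block while the server is unreachable")
    return concerns
```

On a live system, pass the contents of `/proc/mounts` to `parse_cifs_mounts` and print `find_concerns` for each entry. The replacement option values belong in `cifs-mount-resilience-fixes.md`; this sketch only reports what is currently mounted.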
## Why Hard Reboot Was Required
The kernel reached a state where:
- **RCU subsystem deadlocked** - Critical kernel operations couldn't complete
- **NetworkManager blocked** - Network stack unresponsive
- **Memory management corrupted** - `Bad page state` errors on CIFS-backed pages
- **Display driver affected** - GPU operations failed due to kernel issues
A normal shutdown was impossible because of the kernel-level deadlock.
## Evidence Summary
### System Recovered Cleanly
- **After reboot at 11:58:56** - All services started normally
- **No hardware failures** - All components functional
- **Memory test clean** - 62GB available, no corruption detected
- **KDE Plasma working** - Desktop environment fully operational
### KDE Plasma Was Victim, Not Cause
- **Wayland errors were symptoms** - Display issues occurred after kernel problems
- **No Plasma-specific crashes** - No segfaults or application failures in logs
- **Recovery immediate** - Desktop worked perfectly after reboot
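
The ordering argument can be checked with simple arithmetic on the timestamps from the timeline. This sketch assumes the 11:05 entry means 11:05:00, since the log excerpt gives no seconds for it. The event labels are shortened paraphrases of the section headings.

```python
from datetime import datetime
from typing import List, Tuple

# Timestamps copied from the timeline in this report (2025-08-11).
# Assumption: "11:05" is treated as 11:05:00.
EVENTS = [
    ("11:05:00", "CIFS server stops responding"),
    ("11:22:18", "Bad page state in tdarr-ffmpeg"),
    ("11:23:21", "RCU stall detected"),
    ("11:26:40", "Hung task warnings"),
    ("11:46:56", "Wayland/KWin display errors"),
    ("11:58:56", "Reboot"),
]


def elapsed_since_first(events: List[Tuple[str, str]]) -> List[Tuple[str, int]]:
    """Return (label, seconds since the first event) for each event."""
    start = datetime.strptime(events[0][0], "%H:%M:%S")
    rows = []
    for stamp, label in events:
        delta = datetime.strptime(stamp, "%H:%M:%S") - start
        rows.append((label, int(delta.total_seconds())))
    return rows


def format_offset(seconds: int) -> str:
    """Render an offset in seconds as +MMmSSs."""
    return f"+{seconds // 60:02d}m{seconds % 60:02d}s"


for label, seconds in elapsed_since_first(EVENTS):
    print(format_offset(seconds), label)
```

By these timestamps the first display error arrives 24 minutes 38 seconds after the first `Bad page state` message, which is consistent with Plasma reacting to the kernel problems, not causing them.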
## Recommended Actions
### Immediate (Prevent Recurrence)
1. **Implement Tdarr container resource limits** - Prevent memory exhaustion
2. **Update CIFS mount options** - Better timeout and error handling
3. **Convert to unmapped Tdarr node** - Eliminate CIFS streaming during transcoding
### Monitoring (Early Detection)
1. **CIFS error monitoring** - Detect network issues before escalation
2. **Container resource monitoring** - Alert on memory/CPU exhaustion
3. **RCU stall detection** - Kernel deadlock early warning
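
One way to approach the first and third items is a small scanner that classifies kernel log lines by signature. This is a minimal sketch: the regular expressions are derived only from the log excerpts in this report, so other kernel versions may word these messages differently. The category names are hypothetical.

```python
import re
from collections import Counter
from typing import Iterable, Optional

# Patterns based on the log lines quoted in the timeline above.
PATTERNS = {
    "cifs_unresponsive": re.compile(r"CIFS: VFS: .* has not responded in \d+ seconds"),
    "cifs_reconnect_failed": re.compile(r"CIFS: VFS: reconnect tcon failed"),
    "bad_page_state": re.compile(r"BUG: Bad page state"),
    "rcu_stall": re.compile(r"rcu: INFO: \w+ detected stalls"),
    "hung_task": re.compile(r"INFO: task \S+ blocked for more than \d+ seconds"),
}


def classify(line: str) -> Optional[str]:
    """Return the first matching category name, or None for unrelated lines."""
    for name, pattern in PATTERNS.items():
        if pattern.search(line):
            return name
    return None


def summarize(lines: Iterable[str]) -> Counter:
    """Count how many lines fall into each category."""
    counts: Counter = Counter()
    for line in lines:
        category = classify(line)
        if category is not None:
            counts[category] += 1
    return counts
```

The input could come from any source of kernel messages, for example the output of `journalctl -k` piped to a script that calls `summarize` on `sys.stdin`. Alerting on the first `cifs_unresponsive` hit would have given roughly 17 minutes of warning before the `Bad page state` message in this incident.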
### Architecture (Long-term Stability)
1. **Unmapped transcoding architecture** - Process files locally on NVMe cache
2. **Gaming-aware scheduling** - Prevent resource conflicts
3. **Automated recovery procedures** - Handle network issues gracefully
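
The I/O pattern behind the first item can be illustrated as copy in, process locally, copy out. This is a sketch of the general idea only, not how Tdarr implements unmapped nodes. The function name, the `transform` callback, and the `.partial` suffix are all hypothetical.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def process_locally(
    source: Path,
    dest: Path,
    work_root: Path,
    transform: Callable[[Path, Path], None],
) -> None:
    """Stage a file on local storage, transform it there, then copy the result back.

    The network share sees one sequential read and one sequential write;
    the heavy random I/O of the transform stays on the local disk.
    """
    with tempfile.TemporaryDirectory(dir=work_root) as tmp:
        local_in = Path(tmp) / source.name
        local_out = Path(tmp) / ("out_" + source.name)

        shutil.copy2(source, local_in)    # single read from the share
        transform(local_in, local_out)    # all processing on local storage

        # Write under a temporary name, then rename, so readers of the
        # share never see a half-written result.
        partial = dest.with_name(dest.name + ".partial")
        shutil.copy2(local_out, partial)
        partial.replace(dest)
```

Here `work_root` would point at the local cache (the NVMe cache in this setup) and `transform` would wrap the transcoder invocation. The temporary directory is removed on exit, including when `transform` raises.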
## Key Learnings
1. **Network storage + intensive I/O = risk** - CIFS streaming large files during transcoding can trigger kernel issues
2. **Container resource limits critical** - Unlimited resources can destabilize entire system
3. **Timeouts prevent hangs** - Proper timeout configuration can avoid the 90-second CIFS stalls seen before reconnect attempts
4. **Desktop symptoms != desktop cause** - Display issues often indicate deeper system problems
## Files Created
1. **`tdarr-container-fixes.md`** - Specific container configuration changes
2. **`cifs-mount-resilience-fixes.md`** - CIFS mount option improvements
3. **`crash-analysis-summary.md`** - This comprehensive analysis
## Next Steps
Implement the recommendations in the order specified in the individual fix documents:
1. Phase 1: Immediate fixes to prevent crashes
2. Phase 2: Architecture migration for stability
3. Phase 3: Production hardening and monitoring
The system is currently stable, but without these changes, similar crashes are likely when processing large files over network storage during periods of network instability.