- Add crash-analysis-summary.md: Complete incident timeline and root cause analysis
- Add tdarr-container-fixes.md: Container resource limits and unmapped node conversion
- Add cifs-mount-resilience-fixes.md: CIFS mount options for kernel stability
- Update tdarr-troubleshooting.md: Link to new system crash prevention measures
- Update nas-mount-configuration.md: Add stability considerations for production systems

Root cause: CIFS streaming of large files during transcoding caused kernel memory corruption and system deadlock. Documents provide comprehensive prevention strategy.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

# KDE Plasma Crash Analysis Summary

**Date**: 2025-08-11
**Incident**: Hard system crash requiring forced reboot
**Analysis Period**: ~11:00 - 11:58 (crash timeline)

## Executive Summary

KDE Plasma did not actually crash. The system experienced **kernel-level deadlocks** caused by CIFS network issues combined with intensive Tdarr transcoding operations, and the desktop environment became unresponsive as a symptom of those deeper kernel problems.

## Timeline of Events

### 11:05 - Network Issues Begin
```
CIFS: VFS: \\10.10.0.35 has not responded in 90 seconds. Reconnecting...
CIFS: VFS: reconnect tcon failed rc = -11
```

### 11:22:18 - Kernel Memory Corruption
```
BUG: Bad page state in process tdarr-ffmpeg pfn:a1af35
page: refcount:0 mapcount:0 mapping:00000000438f9be4 index:0x0 pfn:0xa1af35
aops:cifs_addr_ops [cifs] ino:2f15 dentry name(?):"Alice Through the Looking Glass (2016) Remux-1080p-TdarrCacheF"
```

### 11:23:21+ - RCU Stall Deadlock
```
rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
rcu: Tasks blocked on level-0 rcu_node (CPUs 0-15): P456776
task:ffprobe state:R running task
```

### 11:26:40+ - System Deadlock
```
INFO: task NetworkManager:1806 blocked for more than 122 seconds
INFO: task tailscaled:188215 blocked for more than 122 seconds
INFO: task ThreadPoolForeg:125721 blocked for more than 122 seconds
```

### 11:46:56 - Display Issues (Symptom)
```
qt.qpa.wayland: There are no outputs - creating placeholder screen
kwin_wayland_drm: atomic commit failed: Invalid argument
```

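The escalation in the timeline above can be traced mechanically when reviewing a saved kernel log. The following is a minimal sketch, assuming the log is available as plain text lines. The regular expressions are derived only from the excerpts quoted above, and the stage labels (`cifs_unresponsive`, `rcu_stall`, and so on) are illustrative names, not terms from any tool.

```python
import re

# Patterns are based on the log excerpts in the timeline; labels are illustrative.
STAGES = [
    ("cifs_unresponsive", re.compile(r"CIFS: VFS: .* has not responded in \d+ seconds")),
    ("cifs_reconnect_failed", re.compile(r"CIFS: VFS: reconnect tcon failed")),
    ("bad_page_state", re.compile(r"BUG: Bad page state in process \S+")),
    ("rcu_stall", re.compile(r"rcu: INFO: \S+ detected stalls")),
    ("hung_task", re.compile(r"INFO: task \S+:\d+ blocked for more than \d+ seconds")),
    ("display_symptom", re.compile(r"kwin_wayland_drm: atomic commit failed")),
]


def classify(line):
    """Return the stage label for a log line, or None if no pattern matches."""
    for label, pattern in STAGES:
        if pattern.search(line):
            return label
    return None


def first_seen(lines):
    """Map each stage to the index of the first line on which it appears."""
    seen = {}
    for index, line in enumerate(lines):
        label = classify(line)
        if label is not None and label not in seen:
            seen[label] = index
    return seen
```

Feeding the output of a kernel log dump into `first_seen` gives the order in which the stages first appeared, which makes it easier to confirm that display errors came after the CIFS and RCU messages rather than before them.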
## Root Cause Analysis

### Primary Cause: CIFS + Transcoding Interaction
1. **Network instability** to NAS (10.10.0.35) starting at 11:05
2. **Tdarr container** streaming large video file (10GB+ remux) over CIFS during transcoding
3. **Kernel memory corruption** in CIFS address operations during heavy I/O
4. **RCU deadlock** preventing kernel from completing critical operations
5. **System-wide hang** affecting all processes including desktop environment

### Contributing Factors
- **No container resource limits** - Tdarr could consume unlimited memory
- **Mapped node architecture** - Forces streaming large files over network during processing
- **Aggressive CIFS buffers** - 16MB buffers under memory pressure
- **Inadequate timeout handling** - 90-second hangs before retry attempts
- **No interruption capability** - Kernel couldn't abort hung CIFS operations

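The buffer, timeout, and interruption factors above all map to CIFS mount options. As a rough sketch, the helper below assembles an option string with smaller buffers, a shorter echo interval, and a soft mount. `soft`, `rsize`, `wsize`, and `echo_interval` are standard `mount.cifs` options, but the default values here are placeholders chosen for illustration and are not the settings from `cifs-mount-resilience-fixes.md`. The function name `cifs_options` is hypothetical.

```python
MIB = 1024 * 1024


def cifs_options(rsize_mib=4, wsize_mib=4, echo_interval_s=30, soft=True):
    """Build a comma-separated CIFS mount option string.

    All default values are illustrative placeholders, not tested settings.
    """
    # The mount.cifs man page documents echo_interval as tunable from 1 to 600 seconds.
    if not 1 <= echo_interval_s <= 600:
        raise ValueError("echo_interval must be between 1 and 600 seconds")
    if rsize_mib <= 0 or wsize_mib <= 0:
        raise ValueError("buffer sizes must be positive")
    options = [
        # A soft mount returns an error to the caller instead of retrying forever.
        "soft" if soft else "hard",
        f"rsize={rsize_mib * MIB}",
        f"wsize={wsize_mib * MIB}",
        f"echo_interval={echo_interval_s}",
    ]
    return ",".join(options)
```

For comparison, `cifs_options(rsize_mib=16, wsize_mib=16)` produces the 16MB buffer sizes listed as a contributing factor, which is useful when diffing a proposed configuration against the one in place during the incident.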
## Why Hard Reboot Was Required

The kernel reached a state where:
- **RCU subsystem deadlocked** - Critical kernel operations couldn't complete
- **NetworkManager blocked** - Network stack unresponsive
- **Memory management corrupted** - Page allocation failures
- **Display driver affected** - GPU operations failed due to kernel issues

A normal shutdown was impossible because of the kernel-level deadlock.

## Evidence Summary

### System Recovered Cleanly
- **After reboot at 11:58:56** - All services started normally
- **No hardware failures** - All components functional
- **Memory test clean** - 62GB available, no corruption detected
- **KDE Plasma working** - Desktop environment fully operational

### KDE Plasma Was Victim, Not Cause
- **Wayland errors were symptoms** - Display issues occurred after kernel problems
- **No Plasma-specific crashes** - No segfaults or application failures in logs
- **Recovery immediate** - Desktop worked perfectly after reboot

## Recommended Actions

### Immediate (Prevent Recurrence)
1. **Implement Tdarr container resource limits** - Prevent memory exhaustion
2. **Update CIFS mount options** - Better timeout and error handling
3. **Convert to unmapped Tdarr node** - Eliminate CIFS streaming during transcoding

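For the first item, container resource limits are usually expressed as flags on the container runtime's command line. The sketch below assumes a Docker- or Podman-style CLI, where `--memory`, `--memory-swap`, and `--cpus` are existing flags. The function name `resource_limit_args` and the example values are hypothetical. The actual limits belong in `tdarr-container-fixes.md`.

```python
def resource_limit_args(memory_gib, cpus, swap_gib=0):
    """Return CLI flags that cap a container's memory, swap, and CPU usage."""
    if memory_gib <= 0 or cpus <= 0 or swap_gib < 0:
        raise ValueError("memory and cpus must be positive, swap must be non-negative")
    # --memory-swap is the combined memory + swap ceiling, so setting it equal
    # to --memory leaves the container with no swap at all.
    total_gib = memory_gib + swap_gib
    return [
        f"--memory={memory_gib}g",
        f"--memory-swap={total_gib}g",
        f"--cpus={cpus}",
    ]
```

A call such as `resource_limit_args(8, 6)` yields flags that could be spliced into an existing `docker run` invocation. The point of the cap is that a runaway transcode hits the container limit instead of putting the whole host under memory pressure.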
### Monitoring (Early Detection)
1. **CIFS error monitoring** - Detect network issues before escalation
2. **Container resource monitoring** - Alert on memory/CPU exhaustion
3. **RCU stall detection** - Kernel deadlock early warning

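Container resource monitoring can start as a simple threshold check. The sketch below assumes the input text comes from something like `docker stats --no-stream --format "{{.Name}} {{.MemPerc}}"`, which prints one container name and memory percentage per line. The function name `over_threshold` and the 80% default are illustrative assumptions, not values from the incident.

```python
def over_threshold(stats_text, limit_percent=80.0):
    """Return the names of containers whose memory usage exceeds limit_percent."""
    flagged = []
    for line in stats_text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        name, percent = parts
        try:
            value = float(percent.rstrip("%"))
        except ValueError:
            continue  # skip placeholder values that are not numbers
        if value > limit_percent:
            flagged.append(name)
    return flagged
```

Run periodically, a check like this gives a warning while the system is still responsive enough to stop the container, which was no longer possible once the hung-task messages appeared.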
### Architecture (Long-term Stability)
1. **Unmapped transcoding architecture** - Process files locally on NVMe cache
2. **Gaming-aware scheduling** - Prevent resource conflicts
3. **Automated recovery procedures** - Handle network issues gracefully

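The idea behind local-cache processing can be illustrated with a small copy-in, process, copy-out helper. This is a conceptual sketch only: it is not how Tdarr implements unmapped nodes, and the names `process_via_local_cache`, `transform`, and `cache_root` are hypothetical. It shows why the approach helps: network storage sees one sequential read and one sequential write, while the heavy I/O happens on local disk.

```python
import shutil
import tempfile
from pathlib import Path


def process_via_local_cache(source, destination, transform, cache_root=None):
    """Copy source to a local scratch directory, transform it there, copy the result out.

    transform is any callable taking (input_path, output_path).
    cache_root is the local directory to hold scratch files (system default if None).
    """
    source, destination = Path(source), Path(destination)
    with tempfile.TemporaryDirectory(dir=cache_root) as scratch:
        local_in = Path(scratch) / source.name
        local_out = Path(scratch) / ("out-" + source.name)
        shutil.copy2(source, local_in)         # single sequential read from the share
        transform(local_in, local_out)         # all intermediate I/O stays local
        shutil.copy2(local_out, destination)   # single sequential write back
    return destination
```

The scratch directory is removed when the `with` block exits, so a failed transform does not leave partial files behind on the cache volume.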
## Key Learnings

1. **Network storage + intensive I/O = risk** - CIFS streaming large files during transcoding can trigger kernel issues
2. **Container resource limits critical** - Unlimited resources can destabilize entire system
3. **Timeouts prevent hangs** - Proper timeout configuration avoids the 90-second hangs seen before each retry attempt
4. **Desktop symptoms != desktop cause** - Display issues often indicate deeper system problems

## Files Created

1. **`tdarr-container-fixes.md`** - Specific container configuration changes
2. **`cifs-mount-resilience-fixes.md`** - CIFS mount option improvements
3. **`crash-analysis-summary.md`** - This comprehensive analysis

## Next Steps

Implement the recommendations in the order specified in the individual fix documents:
1. Phase 1: Immediate fixes to prevent crashes
2. Phase 2: Architecture migration for stability
3. Phase 3: Production hardening and monitoring

The system is currently stable, but without these changes, similar crashes are likely when processing large files over network storage during periods of network instability.