Cal Corum 34702a37fc CLAUDE: Add comprehensive KDE Plasma crash analysis and prevention documentation

- Add crash-analysis-summary.md: Complete incident timeline and root cause analysis
- Add tdarr-container-fixes.md: Container resource limits and unmapped node conversion
- Add cifs-mount-resilience-fixes.md: CIFS mount options for kernel stability
- Update tdarr-troubleshooting.md: Link to new system crash prevention measures
- Update nas-mount-configuration.md: Add stability considerations for production systems

Root cause: CIFS streaming of large files during transcoding caused kernel memory
corruption and system deadlock. Documents provide comprehensive prevention strategy.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

2025-08-11 12:29:31 -05:00


KDE Plasma Crash Analysis Summary

Date: 2025-08-11
Incident: Hard system crash requiring forced reboot
Analysis Period: ~11:00 - 11:58 (crash timeline)

Executive Summary

KDE Plasma did not actually crash. The system experienced kernel-level deadlocks when CIFS connection failures to the NAS coincided with intensive Tdarr transcoding over the same mount. The desktop environment became unresponsive as a symptom of those kernel problems.

Timeline of Events

11:05 - Network Issues Begin

CIFS: VFS: \\10.10.0.35 has not responded in 90 seconds. Reconnecting...
CIFS: VFS: reconnect tcon failed rc = -11

11:22:18 - Kernel Memory Corruption

BUG: Bad page state in process tdarr-ffmpeg pfn:a1af35
page: refcount:0 mapcount:0 mapping:00000000438f9be4 index:0x0 pfn:0xa1af35
aops:cifs_addr_ops [cifs] ino:2f15 dentry name(?):"Alice Through the Looking Glass (2016) Remux-1080p-TdarrCacheF"

11:23:21+ - RCU Stall Deadlock

rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
rcu: Tasks blocked on level-0 rcu_node (CPUs 0-15): P456776
task:ffprobe state:R running task

11:26:40+ - System Deadlock

INFO: task NetworkManager:1806 blocked for more than 122 seconds
INFO: task tailscaled:188215 blocked for more than 122 seconds
INFO: task ThreadPoolForeg:125721 blocked for more than 122 seconds

11:46:56 - Display Issues (Symptom)

qt.qpa.wayland: There are no outputs - creating placeholder screen
kwin_wayland_drm: atomic commit failed: Invalid argument
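
The escalation in the excerpts above (CIFS timeout, bad page state, RCU stall, hung tasks, display failure) can be recovered from a kernel log mechanically. The following is a minimal sketch, assuming the log is available as plain text lines; the regular expressions are derived only from the excerpts shown here, and the stage labels are illustrative names rather than kernel terminology.

```python
import re

# Patterns mirror the log excerpts in the timeline; labels are hypothetical.
STAGE_PATTERNS = [
    ("cifs_unresponsive", re.compile(r"CIFS: VFS: .* has not responded in \d+ seconds")),
    ("cifs_reconnect_failed", re.compile(r"CIFS: VFS: reconnect tcon failed rc = -?\d+")),
    ("bad_page_state", re.compile(r"BUG: Bad page state in process \S+")),
    ("rcu_stall", re.compile(r"rcu: INFO: \w+ detected stalls on CPUs/tasks")),
    ("hung_task", re.compile(r"INFO: task \S+?:\d+ blocked for more than \d+ seconds")),
    ("display_failure", re.compile(r"kwin_wayland_drm: atomic commit failed")),
]


def classify_line(line):
    """Return the stage label for a log line, or None if nothing matches."""
    for stage, pattern in STAGE_PATTERNS:
        if pattern.search(line):
            return stage
    return None


def first_occurrences(lines):
    """Return stage labels in the order they first appear in the log."""
    seen = []
    for line in lines:
        stage = classify_line(line)
        if stage is not None and stage not in seen:
            seen.append(stage)
    return seen
```

Run over a full kernel log, the order returned by `first_occurrences` shows whether display errors preceded or followed the kernel-level events, which is the distinction this analysis rests on.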

Root Cause Analysis

Primary Cause: CIFS + Transcoding Interaction

Network instability to NAS (10.10.0.35) starting at 11:05
Tdarr container streaming large video file (10GB+ remux) over CIFS during transcoding
Kernel memory corruption in CIFS address operations during heavy I/O
RCU deadlock preventing kernel from completing critical operations
System-wide hang affecting all processes including desktop environment

Contributing Factors

No container resource limits - Tdarr could consume unlimited memory
Mapped node architecture - Forces streaming large files over network during processing
Aggressive CIFS buffers - 16MB buffers under memory pressure
Inadequate timeout handling - CIFS client waited 90 seconds for the NAS before attempting to reconnect
No interruption capability - Kernel couldn't abort hung CIFS operations
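
Two of the factors above (large CIFS buffers and operations that cannot be aborted) are visible in the active mount options. The following is a minimal sketch of an audit over `/proc/mounts`-style text. The 4 MiB threshold is an arbitrary assumption for illustration, not a value taken from cifs-mount-resilience-fixes.md, and the check for `hard` assumes the mount was configured that way; the mount.cifs manual describes `hard` as causing programs to hang while the server is unavailable.

```python
def parse_cifs_mounts(mounts_text):
    """Return {mount_point: {option: value}} for CIFS entries in /proc/mounts text."""
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[2] != "cifs":
            continue
        options = {}
        for item in fields[3].split(","):
            key, _, value = item.partition("=")
            options[key] = value  # flag-style options get an empty value
        result[fields[1]] = options
    return result


def audit_options(options, max_buffer_bytes=4 * 1024 * 1024):
    """Return a list of findings; max_buffer_bytes is an assumed threshold."""
    findings = []
    for key in ("rsize", "wsize"):
        value = options.get(key, "")
        if value.isdigit() and int(value) > max_buffer_bytes:
            findings.append(f"{key}={value} exceeds {max_buffer_bytes} bytes")
    if "hard" in options:
        findings.append("hard mount: I/O may hang while the server is unreachable")
    return findings
```

On a live system the input would be `open("/proc/mounts").read()`; the functions take text so they can be checked against a saved copy as well.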

Why Hard Reboot Was Required

The kernel reached a state where:

RCU subsystem deadlocked - Critical kernel operations couldn't complete
NetworkManager blocked - Network stack unresponsive
Memory management corrupted - Bad page state reported for CIFS-backed pages
Display driver affected - GPU operations failed due to kernel issues

Normal shutdown impossible due to kernel-level deadlock.

Evidence Summary

System Recovered Cleanly

After reboot at 11:58:56 - All services started normally
No hardware failures - All components functional
Memory test clean - 62GB available, no corruption detected
KDE Plasma working - Desktop environment fully operational

KDE Plasma Was Victim, Not Cause

Wayland errors were symptoms - Display issues occurred after kernel problems
No Plasma-specific crashes - No segfaults or application failures in logs
Recovery immediate - Desktop worked perfectly after reboot

Recommended Actions

Immediate (Prevent Recurrence)

Implement Tdarr container resource limits - Prevent memory exhaustion
Update CIFS mount options - Better timeout and error handling
Convert to unmapped Tdarr node - Eliminate CIFS streaming during transcoding

Monitoring (Early Detection)

CIFS error monitoring - Detect network issues before escalation
Container resource monitoring - Alert on memory/CPU exhaustion
RCU stall detection - Kernel deadlock early warning
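
For the container resource monitoring item, one possible approach is to read the container's cgroup v2 memory files. The following is a minimal sketch, assuming cgroup v2, where `memory.current` holds the usage in bytes and `memory.max` holds either a byte limit or the literal string `max` when no limit is set. The 0.9 warning ratio and the status strings are hypothetical choices.

```python
def memory_usage_ratio(current_text, max_text):
    """Return usage/limit, or None when the cgroup has no memory limit.

    Arguments are the raw contents of cgroup v2 memory.current and memory.max.
    """
    limit = max_text.strip()
    if limit == "max":
        return None  # no limit configured
    limit_bytes = int(limit)
    if limit_bytes == 0:
        return float("inf")
    return int(current_text.strip()) / limit_bytes


def check_container_memory(current_text, max_text, warn_ratio=0.9):
    """Classify memory state as 'unlimited', 'warning', or 'ok'."""
    ratio = memory_usage_ratio(current_text, max_text)
    if ratio is None:
        return "unlimited"
    if ratio >= warn_ratio:
        return "warning"
    return "ok"
```

A result of `unlimited` corresponds to the "no container resource limits" contributing factor, so it is worth alerting on by itself, before any usage threshold is crossed.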

Architecture (Long-term Stability)

Unmapped transcoding architecture - Process files locally on NVMe cache
Gaming-aware scheduling - Prevent resource conflicts
Automated recovery procedures - Handle network issues gracefully
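
The idea behind processing files locally can be illustrated as copy in, work on local disk, copy out, so the network share is touched only by two sequential copies rather than by random-access I/O during transcoding. The following is a minimal sketch of that pattern, not Tdarr's implementation; `process_locally`, its parameters, and the `.out` naming are hypothetical.

```python
import shutil
from pathlib import Path


def process_locally(source, cache_dir, transform):
    """Stage source into cache_dir, run transform there, copy the result back.

    transform(local_in, local_out) is a caller-supplied function that reads
    local_in and writes local_out; both paths are on the local cache.
    """
    source = Path(source)
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    local_in = cache_dir / source.name
    local_out = cache_dir / (source.stem + ".out" + source.suffix)
    shutil.copy2(source, local_in)  # one sequential read from the share
    try:
        transform(local_in, local_out)  # all heavy I/O stays local
        result = source.with_name(local_out.name)
        shutil.copy2(local_out, result)  # one sequential write to the share
        return result
    finally:
        # Remove staged files whether or not the transform succeeded.
        for temp in (local_in, local_out):
            temp.unlink(missing_ok=True)
```

If the share becomes unreachable, the failure surfaces in one of the two copy calls, where it can be retried, instead of in the middle of a long-running transcode.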

Key Learnings

Network storage + intensive I/O = risk - CIFS streaming large files during transcoding can trigger kernel issues
Container resource limits critical - Unlimited resources can destabilize entire system
Timeouts prevent hangs - Proper timeout configuration shortens stalls like the 90-second wait on the unresponsive NAS
Desktop symptoms != desktop cause - Display issues often indicate deeper system problems

Files Created

tdarr-container-fixes.md - Specific container configuration changes
cifs-mount-resilience-fixes.md - CIFS mount option improvements
crash-analysis-summary.md - This comprehensive analysis

Next Steps

Implement the recommendations in the order specified in the individual fix documents:

Phase 1: Immediate fixes to prevent crashes
Phase 2: Architecture migration for stability
Phase 3: Production hardening and monitoring

The system is currently stable, but without these changes, similar crashes are likely when processing large files over network storage during periods of network instability.
