claude-home/reference/docker/crash-analysis-summary.md
Cal Corum 34702a37fc CLAUDE: Add comprehensive KDE Plasma crash analysis and prevention documentation
- Add crash-analysis-summary.md: Complete incident timeline and root cause analysis
- Add tdarr-container-fixes.md: Container resource limits and unmapped node conversion
- Add cifs-mount-resilience-fixes.md: CIFS mount options for kernel stability
- Update tdarr-troubleshooting.md: Link to new system crash prevention measures
- Update nas-mount-configuration.md: Add stability considerations for production systems

Root cause: CIFS streaming of large files during transcoding caused kernel memory
corruption and system deadlock. Documents provide comprehensive prevention strategy.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-08-11 12:29:31 -05:00


KDE Plasma Crash Analysis Summary

Date: 2025-08-11
Incident: Hard system crash requiring forced reboot
Analysis Period: ~11:00 - 11:58 (crash timeline)

Executive Summary

KDE Plasma did not crash. The system hit kernel-level deadlocks triggered by CIFS network failures during intensive Tdarr transcoding, and the desktop environment became unresponsive as a symptom of those underlying kernel problems.

Timeline of Events

11:05 - Network Issues Begin

CIFS: VFS: \\10.10.0.35 has not responded in 90 seconds. Reconnecting...
CIFS: VFS: reconnect tcon failed rc = -11

11:22:18 - Kernel Memory Corruption

BUG: Bad page state in process tdarr-ffmpeg pfn:a1af35
page: refcount:0 mapcount:0 mapping:00000000438f9be4 index:0x0 pfn:0xa1af35
aops:cifs_addr_ops [cifs] ino:2f15 dentry name(?):"Alice Through the Looking Glass (2016) Remux-1080p-TdarrCacheF"

11:23:21+ - RCU Stall Deadlock

rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
rcu: Tasks blocked on level-0 rcu_node (CPUs 0-15): P456776
task:ffprobe state:R running task

11:26:40+ - System Deadlock

INFO: task NetworkManager:1806 blocked for more than 122 seconds
INFO: task tailscaled:188215 blocked for more than 122 seconds  
INFO: task ThreadPoolForeg:125721 blocked for more than 122 seconds

11:46:56 - Display Issues (Symptom)

qt.qpa.wayland: There are no outputs - creating placeholder screen
kwin_wayland_drm: atomic commit failed: Invalid argument
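
The gaps between these stages are easier to judge as durations. The following is a minimal Python sketch using the timestamps from this report. The first entry is an assumption: the timeline gives only 11:05 for the first CIFS warning, so 11:05:00 is used. The reboot time of 11:58:56 comes from the Evidence Summary.

```python
from datetime import datetime

# Timestamps taken from this report. "11:05:00" is an assumed value:
# the timeline gives only 11:05 for the first CIFS warning.
EVENTS = [
    ("CIFS server unresponsive", "11:05:00"),
    ("Bad page state", "11:22:18"),
    ("RCU stall", "11:23:21"),
    ("Hung tasks reported", "11:26:40"),
    ("Wayland/KWin errors", "11:46:56"),
    ("Reboot", "11:58:56"),
]


def elapsed_between(events):
    """Return (from_label, to_label, timedelta) for each consecutive pair."""
    parsed = [(label, datetime.strptime(ts, "%H:%M:%S")) for label, ts in events]
    return [
        (a_label, b_label, b_time - a_time)
        for (a_label, a_time), (b_label, b_time) in zip(parsed, parsed[1:])
    ]


if __name__ == "__main__":
    for start, end, delta in elapsed_between(EVENTS):
        print(f"{start} -> {end}: {delta}")
```

Under that assumption, about 17 minutes pass between the first CIFS warning and the bad page state, and about 54 minutes between the first warning and the reboot.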

Root Cause Analysis

Primary Cause: CIFS + Transcoding Interaction

  1. Network instability to NAS (10.10.0.35) starting at 11:05
  2. Tdarr container streaming large video file (10GB+ remux) over CIFS during transcoding
  3. Kernel memory corruption in CIFS address operations during heavy I/O
  4. RCU deadlock preventing kernel from completing critical operations
  5. System-wide hang affecting all processes including desktop environment

Contributing Factors

  • No container resource limits - Tdarr could consume unlimited memory
  • Mapped node architecture - Forces streaming large files over network during processing
  • Aggressive CIFS buffers - 16MB buffers under memory pressure
  • Inadequate timeout handling - 90-second hangs before retry attempts
  • No interruption capability - Kernel couldn't abort hung CIFS operations
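
The CIFS-related factors above can be checked mechanically against a mount's option string. The following is a minimal Python sketch, not the configuration from cifs-mount-resilience-fixes.md. The function names and the 4 MiB threshold are hypothetical and chosen only for illustration; `rsize`, `wsize`, and `hard` are option names documented for mount.cifs.

```python
def parse_mount_options(option_string):
    """Split a comma-separated mount option string into a dict.

    Flags without a value (e.g. "soft") map to None.
    """
    options = {}
    for item in option_string.split(","):
        item = item.strip()
        if not item:
            continue
        key, sep, value = item.partition("=")
        options[key] = value if sep else None
    return options


def audit_cifs_options(option_string, max_buffer_bytes=4 * 1024 * 1024):
    """Return warnings for options this report identifies as risky.

    max_buffer_bytes is an illustrative threshold, not a recommended value.
    """
    options = parse_mount_options(option_string)
    warnings = []
    for name in ("rsize", "wsize"):
        value = options.get(name)
        if value is not None and value.isdigit() and int(value) > max_buffer_bytes:
            warnings.append(f"{name}={value} exceeds {max_buffer_bytes} bytes")
    if "hard" in options:
        warnings.append("hard mount: I/O blocks until the server responds")
    return warnings
```

For example, `audit_cifs_options("rw,hard,rsize=16777216,wsize=16777216")` (a made-up option string with 16MB buffers) returns three warnings: one each for `rsize`, `wsize`, and `hard`.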

Why Hard Reboot Was Required

The kernel reached a state where:

  • RCU subsystem deadlocked - Critical kernel operations couldn't complete
  • NetworkManager blocked - Network stack unresponsive
  • Memory management corrupted - Kernel reported a bad page state in CIFS-backed pages
  • Display driver affected - GPU operations failed due to kernel issues

A normal shutdown was impossible because of the kernel-level deadlock.

Evidence Summary

System Recovered Cleanly

  • After reboot at 11:58:56 - All services started normally
  • No hardware failures - All components functional
  • Memory test clean - 62GB available, no corruption detected
  • KDE Plasma working - Desktop environment fully operational

KDE Plasma Was Victim, Not Cause

  • Wayland errors were symptoms - Display issues occurred after kernel problems
  • No Plasma-specific crashes - No segfaults or application failures in logs
  • Recovery immediate - Desktop worked perfectly after reboot

Recommendations

Immediate (Prevent Recurrence)

  1. Implement Tdarr container resource limits - Prevent memory exhaustion
  2. Update CIFS mount options - Better timeout and error handling
  3. Convert to unmapped Tdarr node - Eliminate CIFS streaming during transcoding
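
For the first item, Docker accepts memory and CPU caps as `docker run` flags. The following is a minimal Python sketch that assembles such a command. The limits, container name, and image name are placeholders; the values actually chosen belong in tdarr-container-fixes.md.

```python
import shlex


def docker_run_with_limits(image, name, memory, cpus, extra_args=()):
    """Build a `docker run` argument list with memory and CPU caps.

    --memory and --memory-swap are set to the same value so the
    container gets no additional swap beyond its memory limit.
    """
    return [
        "docker", "run", "--detach",
        "--name", name,
        "--memory", memory,
        "--memory-swap", memory,
        "--cpus", str(cpus),
        *extra_args,
        image,
    ]


if __name__ == "__main__":
    # All values below are placeholders for illustration.
    command = docker_run_with_limits(
        image="example/tdarr-node:latest",
        name="tdarr-node",
        memory="8g",
        cpus=6,
    )
    print(shlex.join(command))
```

Building the argument list in one place keeps the limits visible and reviewable, rather than scattered across ad hoc shell invocations.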

Monitoring (Early Detection)

  1. CIFS error monitoring - Detect network issues before escalation
  2. Container resource monitoring - Alert on memory/CPU exhaustion
  3. RCU stall detection - Kernel deadlock early warning
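
The CIFS and RCU signals above appeared as recognizable kernel log lines well before the desktop froze. The following is a minimal Python sketch of a classifier for them. The patterns are derived from the messages quoted in this report; the category names and the `classify_line` and `scan` helpers are hypothetical, and alerting is left out.

```python
import re
import sys

# Patterns are derived from the kernel messages quoted in this report.
PATTERNS = {
    "cifs_unresponsive": re.compile(r"CIFS: VFS: .* has not responded in (\d+) seconds"),
    "cifs_reconnect_failed": re.compile(r"CIFS: VFS: reconnect tcon failed rc = (-?\d+)"),
    "bad_page_state": re.compile(r"BUG: Bad page state in process (\S+)"),
    "rcu_stall": re.compile(r"rcu: INFO: \S+ detected stalls on CPUs/tasks"),
    "hung_task": re.compile(r"INFO: task (\S+):(\d+) blocked for more than (\d+) seconds"),
}


def classify_line(line):
    """Return (category, captured_groups) for a matching line, else None."""
    for category, pattern in PATTERNS.items():
        match = pattern.search(line)
        if match:
            return category, match.groups()
    return None


def scan(lines):
    """Count how many lines fall into each category."""
    counts = {}
    for line in lines:
        result = classify_line(line)
        if result:
            counts[result[0]] = counts.get(result[0], 0) + 1
    return counts


if __name__ == "__main__":
    # Reads kernel log text from standard input and prints the counts.
    print(scan(sys.stdin))
```

One way to feed it is to pipe kernel messages into the script, for example from `journalctl -k`. A real monitor would follow the log continuously and alert on the first match rather than counting after the fact.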

Architecture (Long-term Stability)

  1. Unmapped transcoding architecture - Process files locally on NVMe cache
  2. Gaming-aware scheduling - Prevent resource conflicts
  3. Automated recovery procedures - Handle network issues gracefully
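
The first item avoids streaming from the share while the transcode runs. The following is a minimal Python sketch of that copy-process-copy pattern. It is an illustration of the idea, not how Tdarr's unmapped nodes are implemented; `process_via_local_cache`, the output naming, and the `transcode` callback are all hypothetical.

```python
import shutil
from pathlib import Path


def process_via_local_cache(source, cache_dir, transcode):
    """Copy `source` to local storage, transcode there, then copy the result back.

    `transcode` is a caller-supplied function taking (input_path, output_path).
    The network share is touched only by the two bulk copies, never streamed
    from during the transcode itself.
    """
    source = Path(source)
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)

    local_input = cache_dir / source.name
    local_output = cache_dir / f"{source.stem}.transcoded{source.suffix}"
    destination = source.with_name(local_output.name)

    try:
        shutil.copy2(source, local_input)        # one bulk read from the share
        transcode(local_input, local_output)     # all heavy I/O stays local
        shutil.copy2(local_output, destination)  # one bulk write to the share
    finally:
        # Remove cache copies whether or not the transcode succeeded.
        local_input.unlink(missing_ok=True)
        local_output.unlink(missing_ok=True)
    return destination
```

If the share drops out, the failure surfaces in one of the two copy steps, where it can be retried, instead of in the middle of a long-running transcode.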

Key Learnings

  1. Network storage + intensive I/O = risk - CIFS streaming large files during transcoding can trigger kernel issues
  2. Container resource limits critical - Unlimited resources can destabilize entire system
  3. Timeouts prevent hangs - Proper timeout configuration shortens the 90-second stalls seen before CIFS attempted to reconnect
  4. Desktop symptoms != desktop cause - Display issues often indicate deeper system problems

Files Created

  1. tdarr-container-fixes.md - Specific container configuration changes
  2. cifs-mount-resilience-fixes.md - CIFS mount option improvements
  3. crash-analysis-summary.md - This comprehensive analysis

Next Steps

Implement the recommendations in the order specified in the individual fix documents:

  1. Phase 1: Immediate fixes to prevent crashes
  2. Phase 2: Architecture migration for stability
  3. Phase 3: Production hardening and monitoring

The system is currently stable, but without these changes, similar crashes are likely when processing large files over network storage during periods of network instability.