claude-home/docker/examples/crash-analysis-summary.md

title: System Crash Analysis - CIFS/Tdarr
description: Root cause analysis of a kernel-level deadlock caused by CIFS network mount failures during intensive Tdarr transcoding, resulting in memory corruption, RCU stalls, and a hard system crash on Nobara.
type: troubleshooting
domain: docker
tags: crash-analysis, kernel, cifs, tdarr, memory-corruption, rcu-stall, nobara, transcoding

KDE Plasma Crash Analysis Summary

Date: 2025-08-11
Incident: Hard system crash requiring forced reboot
Analysis Period: ~11:00 - 11:58 (crash timeline)

Executive Summary

KDE Plasma did not crash. The system hit kernel-level deadlocks triggered by CIFS network failures during intensive Tdarr transcoding, and the desktop environment became unresponsive as a symptom of those kernel problems.

Timeline of Events

11:05 - Network Issues Begin

CIFS: VFS: \\10.10.0.35 has not responded in 90 seconds. Reconnecting...
CIFS: VFS: reconnect tcon failed rc = -11

11:22:18 - Kernel Memory Corruption

BUG: Bad page state in process tdarr-ffmpeg pfn:a1af35
page: refcount:0 mapcount:0 mapping:00000000438f9be4 index:0x0 pfn:0xa1af35
aops:cifs_addr_ops [cifs] ino:2f15 dentry name(?):"Alice Through the Looking Glass (2016) Remux-1080p-TdarrCacheF"

11:23:21+ - RCU Stall Deadlock

rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
rcu: Tasks blocked on level-0 rcu_node (CPUs 0-15): P456776
task:ffprobe state:R running task

11:26:40+ - System Deadlock

INFO: task NetworkManager:1806 blocked for more than 122 seconds
INFO: task tailscaled:188215 blocked for more than 122 seconds
INFO: task ThreadPoolForeg:125721 blocked for more than 122 seconds

11:46:56 - Display Issues (Symptom)

qt.qpa.wayland: There are no outputs - creating placeholder screen
kwin_wayland_drm: atomic commit failed: Invalid argument

Root Cause Analysis

Primary Cause: CIFS + Transcoding Interaction

  1. Network instability to NAS (10.10.0.35) starting at 11:05
  2. Tdarr container streaming large video file (10GB+ remux) over CIFS during transcoding
  3. Kernel memory corruption in CIFS address operations during heavy I/O
  4. RCU deadlock preventing kernel from completing critical operations
  5. System-wide hang affecting all processes including desktop environment

Contributing Factors

  • No container resource limits - Tdarr could consume unlimited memory
  • Mapped node architecture - Forces streaming large files over network during processing
  • Aggressive CIFS buffers - 16MB buffers under memory pressure
  • Inadequate timeout handling - 90-second hangs before retry attempts
  • No interruption capability - Kernel couldn't abort hung CIFS operations
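
The CIFS-related factors above can be checked mechanically against a mount option string. The following is a minimal sketch, not the configuration used on this system: `rsize`, `wsize`, `echo_interval`, and `hard` are standard `mount.cifs` options, but the function names and the 4 MiB threshold are assumptions chosen for illustration.

```python
def parse_mount_options(option_string):
    """Split a comma-separated mount option string into a dict.

    Flags without a value (e.g. "hard") map to True.
    """
    options = {}
    for item in option_string.split(","):
        item = item.strip()
        if not item:
            continue
        key, sep, value = item.partition("=")
        options[key] = value if sep else True
    return options


def audit_cifs_options(option_string, max_buffer_bytes=4 * 1024 * 1024):
    """Return warnings for the CIFS settings named in this analysis.

    max_buffer_bytes is an illustrative threshold, not a recommendation.
    """
    options = parse_mount_options(option_string)
    warnings = []
    for key in ("rsize", "wsize"):
        value = options.get(key)
        if isinstance(value, str) and value.isdigit() and int(value) > max_buffer_bytes:
            warnings.append(f"{key}={value} exceeds {max_buffer_bytes} bytes")
    if "hard" in options:
        warnings.append("hard mount: processes hang while the server is unreachable")
    if "echo_interval" not in options:
        warnings.append("echo_interval not set: the kernel default applies")
    return warnings


if __name__ == "__main__":
    # Hypothetical option string with 16MB buffers, as described above.
    for warning in audit_cifs_options("rw,hard,rsize=16777216,wsize=16777216"):
        print(warning)
```

Running the audit against the real entry in `/etc/fstab` would show which of these factors still apply after the mount options are updated.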

Why Hard Reboot Was Required

The kernel reached a state where:

  • RCU subsystem deadlocked - Critical kernel operations couldn't complete
  • NetworkManager blocked - Network stack unresponsive
  • Memory management corrupted - Bad page state reported for CIFS-backed pages
  • Display driver affected - GPU operations failed due to kernel issues

Normal shutdown impossible due to kernel-level deadlock.

Evidence Summary

System Recovered Cleanly

  • After reboot at 11:58:56 - All services started normally
  • No hardware failures - All components functional
  • Memory test clean - 62GB available, no corruption detected
  • KDE Plasma working - Desktop environment fully operational

KDE Plasma Was Victim, Not Cause

  • Wayland errors were symptoms - Display issues occurred after kernel problems
  • No Plasma-specific crashes - No segfaults or application failures in logs
  • Recovery immediate - Desktop worked perfectly after reboot
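
The ordering argument can be made concrete with the timestamps from the timeline. This is a small worked calculation; the only assumption is that the 11:05 entry, which has no seconds in the logs quoted here, is treated as 11:05:00.

```python
from datetime import datetime

# Timestamps from the timeline in this document (2025-08-11).
# "11:05:00" assumes zero seconds; the source only gives 11:05.
EVENTS = [
    ("11:05:00", "CIFS server stops responding"),
    ("11:22:18", "Bad page state in tdarr-ffmpeg"),
    ("11:23:21", "RCU stall"),
    ("11:26:40", "Hung task warnings"),
    ("11:46:56", "Wayland/KWin display errors"),
]


def seconds_between(start, end):
    """Return the number of seconds from start to end (HH:MM:SS, same day)."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


if __name__ == "__main__":
    display_time = EVENTS[-1][0]
    for time, label in EVENTS[:-1]:
        lag = seconds_between(time, display_time)
        print(f"{label}: {lag} s before the display errors")
```

The display errors appear roughly 24 minutes after the first bad page report and about 42 minutes after the first CIFS timeout, which is consistent with the desktop failing last.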

Recommendations

Immediate (Prevent Recurrence)

  1. Implement Tdarr container resource limits - Prevent memory exhaustion
  2. Update CIFS mount options - Better timeout and error handling
  3. Convert to unmapped Tdarr node - Eliminate CIFS streaming during transcoding
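
As an illustration of the first item, the sketch below assembles a `docker run` command line with memory and CPU caps. `--memory`, `--memory-swap`, and `--cpus` are standard `docker run` flags; the limit values, container name, and image name are placeholders, not the settings from tdarr-container-fixes.md.

```python
import shlex


def build_docker_run_args(image, name, memory="8g", memory_swap="8g", cpus="6"):
    """Build a docker run argument list with resource limits.

    Setting memory_swap equal to memory prevents the container from
    using swap beyond its memory limit. All default values here are
    illustrative assumptions.
    """
    return [
        "docker", "run", "-d",
        "--name", name,
        "--memory", memory,
        "--memory-swap", memory_swap,
        "--cpus", cpus,
        image,
    ]


if __name__ == "__main__":
    # "tdarr-node-image" is a hypothetical image name.
    args = build_docker_run_args("tdarr-node-image", "tdarr-node")
    print(shlex.join(args))
```

The same limits can be expressed in a Compose file; the list form is used here only because it is easy to inspect and pass to `subprocess.run`.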

Monitoring (Early Detection)

  1. CIFS error monitoring - Detect network issues before escalation
  2. Container resource monitoring - Alert on memory/CPU exhaustion
  3. RCU stall detection - Kernel deadlock early warning
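
A monitoring check for the CIFS and RCU items could start from pattern matching on kernel log lines. The sketch below uses regular expressions written against the messages quoted in the timeline; the labels and function names are hypothetical, and feeding it live lines (for example from `journalctl -k`) is left out.

```python
import re

# Patterns derived from the log lines quoted in this document.
PATTERNS = {
    "cifs_unresponsive": re.compile(r"CIFS: VFS: .* has not responded in (\d+) seconds"),
    "cifs_reconnect_failed": re.compile(r"CIFS: VFS: reconnect tcon failed rc = (-?\d+)"),
    "bad_page_state": re.compile(r"BUG: Bad page state in process (\S+)"),
    "rcu_stall": re.compile(r"rcu: INFO: \S+ detected stalls"),
    "hung_task": re.compile(r"INFO: task (\S+):(\d+) blocked for more than (\d+) seconds"),
}


def classify_line(line):
    """Return (label, captured_groups) for the first matching pattern, else None."""
    for label, pattern in PATTERNS.items():
        match = pattern.search(line)
        if match:
            return label, match.groups()
    return None


def count_events(lines):
    """Count how many lines fall into each event category."""
    counts = {}
    for line in lines:
        result = classify_line(line)
        if result:
            counts[result[0]] = counts.get(result[0], 0) + 1
    return counts


if __name__ == "__main__":
    sample = [
        r"CIFS: VFS: \\10.10.0.35 has not responded in 90 seconds. Reconnecting...",
        "CIFS: VFS: reconnect tcon failed rc = -11",
        "rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:",
        "INFO: task NetworkManager:1806 blocked for more than 122 seconds",
    ]
    print(count_events(sample))
```

An alert on the first `cifs_unresponsive` event would have fired at 11:05, well before the bad page report at 11:22:18.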

Architecture (Long-term Stability)

  1. Unmapped transcoding architecture - Process files locally on NVMe cache
  2. Gaming-aware scheduling - Prevent resource conflicts
  3. Automated recovery procedures - Handle network issues gracefully
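
The idea behind the unmapped architecture is that the network share is touched only for one bulk copy in and one bulk copy out, while all transcoding I/O happens on local storage. The sketch below illustrates that pattern in general terms; it is not how Tdarr implements unmapped nodes, and the function name, the `.out` naming scheme, and the `transcode` callback are assumptions.

```python
import shutil
from pathlib import Path


def process_via_local_cache(source, cache_dir, transcode):
    """Copy source to a local cache, process it there, and copy the result back.

    transcode is a callable taking (input_path, output_path); here it is a
    stand-in for the real transcoding step.
    """
    source = Path(source)
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)

    local_input = cache_dir / source.name
    local_output = cache_dir / (source.stem + ".out" + source.suffix)

    # Single bulk read from the network share.
    shutil.copy2(source, local_input)
    try:
        # All heavy, random-access I/O stays on local storage.
        transcode(local_input, local_output)
        # Single bulk write back next to the original file.
        final_path = source.with_name(local_output.name)
        shutil.copy2(local_output, final_path)
        return final_path
    finally:
        # Always clear the cache, even if transcoding fails.
        for path in (local_input, local_output):
            path.unlink(missing_ok=True)
```

With this shape, a stalled share can only interrupt one of the two copy steps, which can be retried, rather than a long-running transcode that is reading pages through the CIFS mount.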

Key Learnings

  1. Network storage + intensive I/O = risk - CIFS streaming large files during transcoding can trigger kernel issues
  2. Container resource limits critical - Unlimited resources can destabilize entire system
  3. Timeouts prevent hangs - Proper timeout configuration keeps CIFS operations from stalling for 90 seconds before a reconnect is attempted
  4. Desktop symptoms != desktop cause - Display issues often indicate deeper system problems

Files Created

  1. tdarr-container-fixes.md - Specific container configuration changes
  2. cifs-mount-resilience-fixes.md - CIFS mount option improvements
  3. crash-analysis-summary.md - This comprehensive analysis

Next Steps

Implement the recommendations in the order specified in the individual fix documents:

  1. Phase 1: Immediate fixes to prevent crashes
  2. Phase 2: Architecture migration for stability
  3. Phase 3: Production hardening and monitoring

The system is currently stable, but without these changes, similar crashes are likely when processing large files over network storage during periods of network instability.