| title | description | type | domain | tags |
|---|---|---|---|---|
| System Crash Analysis - CIFS/Tdarr | Root cause analysis of a kernel-level deadlock caused by CIFS network mount failures during intensive Tdarr transcoding, resulting in memory corruption, RCU stalls, and a hard system crash on Nobara. | troubleshooting | docker | |
KDE Plasma Crash Analysis Summary
Date: 2025-08-11
Incident: Hard system crash requiring forced reboot
Analysis Period: ~11:00 - 11:58 (crash timeline)
Executive Summary
KDE Plasma did not actually crash. The system experienced kernel-level deadlocks caused by CIFS network failures combined with intensive Tdarr transcoding, and the desktop environment became unresponsive as a symptom of those deeper kernel problems.
Timeline of Events
11:05 - Network Issues Begin
CIFS: VFS: \\10.10.0.35 has not responded in 90 seconds. Reconnecting...
CIFS: VFS: reconnect tcon failed rc = -11
11:22:18 - Kernel Memory Corruption
BUG: Bad page state in process tdarr-ffmpeg pfn:a1af35
page: refcount:0 mapcount:0 mapping:00000000438f9be4 index:0x0 pfn:0xa1af35
aops:cifs_addr_ops [cifs] ino:2f15 dentry name(?):"Alice Through the Looking Glass (2016) Remux-1080p-TdarrCacheF"
11:23:21+ - RCU Stall Deadlock
rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
rcu: Tasks blocked on level-0 rcu_node (CPUs 0-15): P456776
task:ffprobe state:R running task
11:26:40+ - System Deadlock
INFO: task NetworkManager:1806 blocked for more than 122 seconds
INFO: task tailscaled:188215 blocked for more than 122 seconds
INFO: task ThreadPoolForeg:125721 blocked for more than 122 seconds
11:46:56 - Display Issues (Symptom)
qt.qpa.wayland: There are no outputs - creating placeholder screen
kwin_wayland_drm: atomic commit failed: Invalid argument
Root Cause Analysis
Primary Cause: CIFS + Transcoding Interaction
- Network instability to NAS (10.10.0.35) starting at 11:05
- Tdarr container streaming large video file (10GB+ remux) over CIFS during transcoding
- Kernel memory corruption in CIFS address operations during heavy I/O
- RCU deadlock preventing kernel from completing critical operations
- System-wide hang affecting all processes including desktop environment
Contributing Factors
- No container resource limits - Tdarr could consume unlimited memory
- Mapped node architecture - Forces streaming large files over network during processing
- Aggressive CIFS buffers - 16MB buffers under memory pressure
- Inadequate timeout handling - 90-second hangs before retry attempts
- No interruption capability - Kernel couldn't abort hung CIFS operations
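The CIFS-related factors above can be checked against the options of a live mount, for example the option column of a `cifs` entry in `/proc/mounts`. The following is a minimal sketch only: the 4 MiB threshold and the sample option strings are assumptions for illustration, not values taken from this incident or from the fix documents.

```python
from typing import Dict, List

# Hypothetical review threshold, not a value taken from the incident.
MAX_BUFFER_BYTES = 4 * 1024 * 1024


def parse_mount_options(options: str) -> Dict[str, str]:
    """Split a comma-separated mount option string into a dict.

    Flag-style options such as "soft" map to an empty string.
    """
    parsed = {}
    for item in options.split(","):
        if not item:
            continue
        key, _, value = item.partition("=")
        parsed[key] = value
    return parsed


def audit_cifs_options(options: str) -> List[str]:
    """Return human-readable warnings for risky CIFS mount options."""
    opts = parse_mount_options(options)
    warnings = []
    if "hard" in opts:
        warnings.append("'hard' is set: I/O hangs while the server is unreachable")
    for name in ("rsize", "wsize"):
        value = opts.get(name, "")
        if value.isdigit() and int(value) > MAX_BUFFER_BYTES:
            warnings.append(f"{name}={value} exceeds {MAX_BUFFER_BYTES} bytes")
    return warnings


if __name__ == "__main__":
    # Hypothetical option string with 16MB buffers, for illustration only.
    for warning in audit_cifs_options("rw,vers=3.0,hard,rsize=16777216,wsize=16777216"):
        print(warning)
```

The check is deliberately read-only: it reports options worth reviewing and leaves the actual mount changes to the CIFS fix document.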
Why Hard Reboot Was Required
The kernel reached a state where:
- RCU subsystem deadlocked - Critical kernel operations couldn't complete
- NetworkManager blocked - Network stack unresponsive
- Memory management corrupted - Kernel reported bad page state for CIFS-backed pages
- Display driver affected - GPU operations failed due to kernel issues
A normal shutdown was impossible because of the kernel-level deadlock.
Evidence Summary
System Recovered Cleanly
- After reboot at 11:58:56 - All services started normally
- No hardware failures - All components functional
- Memory test clean - 62GB available, no corruption detected
- KDE Plasma working - Desktop environment fully operational
KDE Plasma Was Victim, Not Cause
- Wayland errors were symptoms - Display issues occurred after kernel problems
- No Plasma-specific crashes - No segfaults or application failures in logs
- Recovery immediate - Desktop worked perfectly after reboot
Recommended Actions
Immediate (Prevent Recurrence)
- Implement Tdarr container resource limits - Prevent memory exhaustion
- Update CIFS mount options - Better timeout and error handling
- Convert to unmapped Tdarr node - Eliminate CIFS streaming during transcoding
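As one possible way to apply the first item to an already running container, the sketch below builds a `docker update` command with memory and CPU caps. The container name `tdarr` and the limit values are hypothetical placeholders; the values actually chosen for this system belong in the container fix document.

```python
import subprocess
from typing import List


def build_limit_command(container: str, memory: str, cpus: str) -> List[str]:
    """Build a `docker update` command that caps memory and CPU."""
    return [
        "docker", "update",
        "--memory", memory,
        "--memory-swap", memory,  # same as --memory, so no additional swap
        "--cpus", cpus,
        container,
    ]


def apply_limits(container: str, memory: str, cpus: str) -> None:
    """Run the command; raises CalledProcessError if docker reports failure."""
    subprocess.run(build_limit_command(container, memory, cpus), check=True)


if __name__ == "__main__":
    # Hypothetical container name and limits, for illustration only.
    print(" ".join(build_limit_command("tdarr", "8g", "6")))
```

Separating command construction from execution keeps the limits reviewable before anything touches the container.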
Monitoring (Early Detection)
- CIFS error monitoring - Detect network issues before escalation
- Container resource monitoring - Alert on memory/CPU exhaustion
- RCU stall detection - Kernel deadlock early warning
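The detection items above could start from a simple classifier over kernel log lines. This is a minimal sketch whose patterns are modeled on the log excerpts in the timeline; the category names are invented for illustration, and other kernel versions may word these messages differently.

```python
import re
from typing import Dict, Iterable, Optional

# Patterns modeled on the log excerpts in the timeline; names are illustrative.
PATTERNS = {
    "cifs_unresponsive": re.compile(r"CIFS: VFS: .* has not responded in \d+ seconds"),
    "cifs_reconnect_failed": re.compile(r"CIFS: VFS: reconnect tcon failed"),
    "bad_page_state": re.compile(r"BUG: Bad page state"),
    "rcu_stall": re.compile(r"rcu: INFO: \w+ detected stalls"),
    "hung_task": re.compile(r"INFO: task \S+ blocked for more than \d+ seconds"),
}


def classify(line: str) -> Optional[str]:
    """Return the first matching category name, or None for unrelated lines."""
    for name, pattern in PATTERNS.items():
        if pattern.search(line):
            return name
    return None


def summarize(lines: Iterable[str]) -> Dict[str, int]:
    """Count how many lines fall into each category."""
    counts: Dict[str, int] = {}
    for line in lines:
        label = classify(line)
        if label is not None:
            counts[label] = counts.get(label, 0) + 1
    return counts


if __name__ == "__main__":
    sample = [
        r"CIFS: VFS: \\10.10.0.35 has not responded in 90 seconds. Reconnecting...",
        "CIFS: VFS: reconnect tcon failed rc = -11",
        "rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:",
    ]
    print(summarize(sample))
```

In practice the input could be the output of `journalctl -k -f` piped into such a script, with an alert raised as soon as a CIFS category appears, well before the later RCU and hung-task messages.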
Architecture (Long-term Stability)
- Unmapped transcoding architecture - Process files locally on NVMe cache
- Gaming-aware scheduling - Prevent resource conflicts
- Automated recovery procedures - Handle network issues gracefully
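The idea behind processing files locally can be sketched as a copy-in, process, copy-out pattern. This illustrates the general staging approach only and is not how Tdarr implements unmapped nodes; the function name, paths, and the `transcode` callback are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def process_locally(source: Path, cache_dir: Path, output_dir: Path,
                    transcode: Callable[[Path, Path], None]) -> Path:
    """Copy source to a local cache, transcode there, then copy the result out."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    output_dir.mkdir(parents=True, exist_ok=True)
    # The working directory is removed automatically, even if transcode fails.
    with tempfile.TemporaryDirectory(dir=cache_dir) as work:
        work_dir = Path(work)
        local_in = work_dir / source.name
        local_out = work_dir / ("out_" + source.name)
        shutil.copy2(source, local_in)   # one sequential read from the share
        transcode(local_in, local_out)   # heavy I/O stays on local disk
        final = output_dir / source.name
        shutil.copy2(local_out, final)   # one sequential write of the result
    return final
```

With this shape, a network failure surfaces as an error during one of the two copy steps, where it can be retried, instead of interrupting reads in the middle of a long transcode.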
Key Learnings
- Network storage + intensive I/O = risk - CIFS streaming large files during transcoding can trigger kernel issues
- Container resource limits critical - Unlimited resources can destabilize entire system
- Timeouts prevent hangs - Proper timeout configuration avoids the 90-second stalls seen before each reconnect attempt
- Desktop symptoms != desktop cause - Display issues often indicate deeper system problems
Files Created
- tdarr-container-fixes.md - Specific container configuration changes
- cifs-mount-resilience-fixes.md - CIFS mount option improvements
- crash-analysis-summary.md - This comprehensive analysis
Next Steps
Implement the recommendations in the order specified in the individual fix documents:
- Phase 1: Immediate fixes to prevent crashes
- Phase 2: Architecture migration for stability
- Phase 3: Production hardening and monitoring
The system is currently stable, but without these changes, similar crashes are likely when processing large files over network storage during periods of network instability.