---
title: "System Crash Analysis - CIFS/Tdarr"
description: "Root cause analysis of a kernel-level deadlock caused by CIFS network mount failures during intensive Tdarr transcoding, resulting in memory corruption, RCU stalls, and a hard system crash on Nobara."
type: troubleshooting
domain: docker
tags: [crash-analysis, kernel, cifs, tdarr, memory-corruption, rcu-stall, nobara, transcoding]
---

# KDE Plasma Crash Analysis Summary

**Date**: 2025-08-11
**Incident**: Hard system crash requiring forced reboot
**Analysis Period**: ~11:00 - 11:58 (crash timeline)

## Executive Summary

KDE Plasma did not actually crash. The system experienced **kernel-level deadlocks** caused by CIFS network failures combined with intensive Tdarr transcoding. The desktop environment became unresponsive as a symptom of these deeper kernel problems.

## Timeline of Events

### 11:05 - Network Issues Begin
```
CIFS: VFS: \\10.10.0.35 has not responded in 90 seconds. Reconnecting...
CIFS: VFS: reconnect tcon failed rc = -11
```

### 11:22:18 - Kernel Memory Corruption
```
BUG: Bad page state in process tdarr-ffmpeg pfn:a1af35
page: refcount:0 mapcount:0 mapping:00000000438f9be4 index:0x0 pfn:0xa1af35
aops:cifs_addr_ops [cifs] ino:2f15 dentry name(?):"Alice Through the Looking Glass (2016) Remux-1080p-TdarrCacheF"
```

### 11:23:21+ - RCU Stall Deadlock
```
rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
rcu: Tasks blocked on level-0 rcu_node (CPUs 0-15): P456776
task:ffprobe state:R running task
```

### 11:26:40+ - System Deadlock
```
INFO: task NetworkManager:1806 blocked for more than 122 seconds
INFO: task tailscaled:188215 blocked for more than 122 seconds
INFO: task ThreadPoolForeg:125721 blocked for more than 122 seconds
```

### 11:46:56 - Display Issues (Symptom)
```
qt.qpa.wayland: There are no outputs - creating placeholder screen
kwin_wayland_drm: atomic commit failed: Invalid argument
```

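To make the ordering argument concrete, the gaps between the logged events can be computed directly. This is a minimal sketch: the labels are shorthand for the timeline headings, the 11:05 entry is assumed to be 11:05:00 because the log excerpt gives no seconds, and the final entry is the reboot time listed under Evidence Summary.

```python
from datetime import datetime

# Timestamps copied from this document. 11:05 is only given to the minute,
# so :00 seconds is an assumption. 11:58:56 is the reboot time.
EVENTS = [
    ("CIFS reconnect failures", "11:05:00"),
    ("Bad page state", "11:22:18"),
    ("RCU stall", "11:23:21"),
    ("Hung tasks reported", "11:26:40"),
    ("Wayland/KWin errors", "11:46:56"),
    ("Reboot", "11:58:56"),
]


def gaps(events):
    """Return (label, seconds since the previous event) for every event after the first."""
    times = [datetime.strptime(stamp, "%H:%M:%S") for _, stamp in events]
    return [
        (events[i][0], int((times[i] - times[i - 1]).total_seconds()))
        for i in range(1, len(events))
    ]


if __name__ == "__main__":
    for label, secs in gaps(EVENTS):
        print(f"{label}: +{secs // 60}m{secs % 60:02d}s")
```

Under these assumptions the display errors appear 24m38s after the first `Bad page state` report, which is consistent with the desktop being a downstream symptom rather than the trigger.
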
## Root Cause Analysis

### Primary Cause: CIFS + Transcoding Interaction
1. **Network instability** to the NAS (10.10.0.35) starting at 11:05
2. **Tdarr container** streaming a large video file (10 GB+ remux) over CIFS during transcoding
3. **Kernel memory corruption** in CIFS address operations during heavy I/O
4. **RCU stall** preventing the kernel from completing critical operations
5. **System-wide hang** affecting all processes, including the desktop environment

### Contributing Factors
- **No container resource limits** - Tdarr could consume unlimited memory
- **Mapped node architecture** - Forces streaming large files over the network during processing
- **Aggressive CIFS buffers** - 16 MB buffers under memory pressure
- **Inadequate timeout handling** - 90-second hangs before retry attempts
- **No interruption capability** - The kernel couldn't abort hung CIFS operations

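The buffer and timeout factors above can be checked against the live mount table. The sketch below parses `/proc/mounts`-style text and flags CIFS mounts whose `rsize`/`wsize` reach 16 MB or that are mounted `hard`. The share name and mount point in the sample line are hypothetical, and the checks are illustrative rather than a complete policy.

```python
LARGE_BUFFER_BYTES = 16 * 1024 * 1024  # the 16 MB figure from the list above


def parse_cifs_mounts(mounts_text):
    """Yield (mount_point, options) for each cifs entry in /proc/mounts-style text."""
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[2] != "cifs":
            continue
        options = {}
        for item in fields[3].split(","):
            key, _, value = item.partition("=")
            options[key] = value
        yield fields[1], options


def audit(mounts_text):
    """Return human-readable findings for risky CIFS mount options."""
    findings = []
    for mount_point, options in parse_cifs_mounts(mounts_text):
        for key in ("rsize", "wsize"):
            value = options.get(key, "")
            if value.isdigit() and int(value) >= LARGE_BUFFER_BYTES:
                findings.append(f"{mount_point}: {key}={value} (>= 16 MB)")
        if "hard" in options:
            findings.append(f"{mount_point}: 'hard' mount, I/O waits while the server is unreachable")
    return findings


# Hypothetical sample line; on a real host, read the text from /proc/mounts.
SAMPLE = (
    "//10.10.0.35/media /mnt/media cifs "
    "rw,relatime,vers=3.1.1,hard,rsize=16777216,wsize=16777216,echo_interval=60 0 0\n"
    "tmpfs /run tmpfs rw,nosuid,nodev 0 0\n"
)

if __name__ == "__main__":
    for finding in audit(SAMPLE):
        print(finding)
```

On the affected machine, `audit(open("/proc/mounts").read())` would run the same checks against the real mounts.
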
## Why Hard Reboot Was Required

The kernel reached a state where:
- **RCU subsystem deadlocked** - Critical kernel operations couldn't complete
- **NetworkManager blocked** - Network stack unresponsive
- **Memory management corrupted** - Page allocation failures
- **Display driver affected** - GPU operations failed due to kernel issues

A normal shutdown was impossible because of the kernel-level deadlock.

## Evidence Summary

### System Recovered Cleanly
- **After reboot at 11:58:56** - All services started normally
- **No hardware failures** - All components functional
- **Memory test clean** - 62 GB available, no corruption detected
- **KDE Plasma working** - Desktop environment fully operational

### KDE Plasma Was Victim, Not Cause
- **Wayland errors were symptoms** - Display issues occurred after kernel problems
- **No Plasma-specific crashes** - No segfaults or application failures in logs
- **Recovery immediate** - Desktop worked perfectly after reboot

## Recommended Actions

### Immediate (Prevent Recurrence)
1. **Implement Tdarr container resource limits** - Prevent memory exhaustion
2. **Update CIFS mount options** - Better timeout and error handling
3. **Convert to unmapped Tdarr node** - Eliminate CIFS streaming during transcoding

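Before and after applying resource limits, it helps to confirm what the container is actually running with. The sketch below reads the JSON printed by `docker inspect <container>` and reports whether memory and CPU limits are set; in that output, `HostConfig.Memory` and `HostConfig.NanoCpus` are `0` when no limit is configured. The container name and the 8 GiB / 4 CPU values in the sample are hypothetical.

```python
import json


def limit_report(inspect_json):
    """Summarize memory/CPU limits from `docker inspect` JSON (a list of containers)."""
    reports = []
    for container in json.loads(inspect_json):
        host = container.get("HostConfig", {})
        memory = host.get("Memory", 0)        # bytes; 0 means no limit
        nano_cpus = host.get("NanoCpus", 0)   # 1e9 units per CPU; 0 means no limit
        reports.append({
            "name": container.get("Name", "").lstrip("/"),
            "memory_limited": memory > 0,
            "memory_gib": memory / 2**30 if memory else None,
            "cpus": nano_cpus / 1e9 if nano_cpus else None,
        })
    return reports


# Hypothetical inspect output, trimmed to the fields used above.
SAMPLE = json.dumps([
    {"Name": "/tdarr-node", "HostConfig": {"Memory": 0, "NanoCpus": 0}},
    {"Name": "/tdarr-node-limited",
     "HostConfig": {"Memory": 8 * 2**30, "NanoCpus": 4_000_000_000}},
])

if __name__ == "__main__":
    for report in limit_report(SAMPLE):
        print(report)
```

A container that reports `memory_limited: False` matches the "no container resource limits" contributing factor described earlier.
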
### Monitoring (Early Detection)
1. **CIFS error monitoring** - Detect network issues before escalation
2. **Container resource monitoring** - Alert on memory/CPU exhaustion
3. **RCU stall detection** - Kernel deadlock early warning

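One low-effort way to cover the CIFS and RCU items is to match kernel log lines against the messages seen in this incident. The following is a minimal sketch: the regular expressions are derived only from the log excerpts in the timeline, the category names are made up for illustration, and other kernel versions may word these messages differently.

```python
import re

# Patterns derived from the log excerpts in this document (category names are illustrative).
PATTERNS = {
    "cifs_unresponsive": re.compile(r"CIFS: VFS: .* has not responded in (\d+) seconds"),
    "cifs_reconnect_failed": re.compile(r"CIFS: VFS: reconnect tcon failed rc = (-?\d+)"),
    "bad_page_state": re.compile(r"BUG: Bad page state in process (\S+)"),
    "rcu_stall": re.compile(r"rcu: INFO: \w+ detected stalls on CPUs/tasks"),
    "hung_task": re.compile(r"INFO: task (\S+):(\d+) blocked for more than (\d+) seconds"),
}


def classify(lines):
    """Return (category, captured_groups) for each line matching a known pattern."""
    hits = []
    for line in lines:
        for name, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                hits.append((name, match.groups()))
                break  # at most one category per line
    return hits


if __name__ == "__main__":
    sample = [
        r"CIFS: VFS: \\10.10.0.35 has not responded in 90 seconds. Reconnecting...",
        "rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:",
        "INFO: task NetworkManager:1806 blocked for more than 122 seconds",
    ]
    for category, groups in classify(sample):
        print(category, groups)
```

The same function could be fed lines from `journalctl -k -f`, with an alert raised on the first `cifs_unresponsive` hit, since in this incident that message preceded the memory corruption by roughly 17 minutes.
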
### Architecture (Long-term Stability)
1. **Unmapped transcoding architecture** - Process files locally on NVMe cache
2. **Gaming-aware scheduling** - Prevent resource conflicts
3. **Automated recovery procedures** - Handle network issues gracefully

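The idea behind local processing is to touch the network share only for one sequential copy in and one sequential copy out, so that the transcoder's heavy I/O never passes through CIFS. The sketch below illustrates that pattern in isolation. It is not how Tdarr implements unmapped nodes; `process_locally`, the `.transcoded` naming, and the `transcode` callback are hypothetical names chosen for the example.

```python
import shutil
import tempfile
from pathlib import Path


def process_locally(source, cache_dir, transcode):
    """Stage `source` in a scratch dir under `cache_dir`, transcode there, copy the result back.

    `transcode(local_in, local_out)` is a caller-supplied callable (hypothetical here).
    """
    source = Path(source)
    with tempfile.TemporaryDirectory(dir=cache_dir) as work:
        local_in = Path(work) / source.name
        local_out = Path(work) / f"out-{source.name}"
        shutil.copy2(source, local_in)       # single sequential read from the share
        transcode(local_in, local_out)       # all heavy I/O stays on local disk
        result = source.with_name(f"{source.stem}.transcoded{source.suffix}")
        shutil.copy2(local_out, result)      # single sequential write back
    return result                            # scratch dir is removed on exit


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as share, tempfile.TemporaryDirectory() as cache:
        src = Path(share) / "movie.mkv"
        src.write_bytes(b"example bytes")
        out = process_locally(src, cache, lambda i, o: shutil.copy2(i, o))
        print(out.name)
```

If the share becomes unreachable, the failure surfaces in one of the two copy steps, where it can be retried, instead of in the middle of a long-running read by the transcoder.
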
## Key Learnings

1. **Network storage + intensive I/O = risk** - CIFS streaming large files during transcoding can trigger kernel issues
2. **Container resource limits critical** - Unlimited resources can destabilize the entire system
3. **Timeouts prevent hangs** - Proper timeout configuration avoids the 90-second stalls seen before each reconnect attempt
4. **Desktop symptoms != desktop cause** - Display issues often indicate deeper system problems

## Files Created

1. **`tdarr-container-fixes.md`** - Specific container configuration changes
2. **`cifs-mount-resilience-fixes.md`** - CIFS mount option improvements
3. **`crash-analysis-summary.md`** - This comprehensive analysis

## Next Steps

Implement the recommendations in the order specified in the individual fix documents:
1. Phase 1: Immediate fixes to prevent crashes
2. Phase 2: Architecture migration for stability
3. Phase 3: Production hardening and monitoring

The system is currently stable, but without these changes similar crashes are likely whenever large files are processed over network storage during periods of network instability.