---
title: "System Crash Analysis - CIFS/Tdarr"
description: "Root cause analysis of a kernel-level deadlock caused by CIFS network mount failures during intensive Tdarr transcoding, resulting in memory corruption, RCU stalls, and a hard system crash on Nobara."
type: troubleshooting
domain: docker
tags: [crash-analysis, kernel, cifs, tdarr, memory-corruption, rcu-stall, nobara, transcoding]
---
# KDE Plasma Crash Analysis Summary
- **Date**: 2025-08-11
- **Incident**: Hard system crash requiring forced reboot
- **Analysis Period**: ~11:00 - 11:58 (crash timeline)
## Executive Summary
KDE Plasma did not actually crash - the system experienced **kernel-level deadlocks** caused by CIFS network issues combined with intensive Tdarr transcoding operations. The desktop environment became unresponsive as a symptom of deeper kernel problems.
## Timeline of Events
### 11:05 - Network Issues Begin
```
CIFS: VFS: \\10.10.0.35 has not responded in 90 seconds. Reconnecting...
CIFS: VFS: reconnect tcon failed rc = -11
```
### 11:22:18 - Kernel Memory Corruption
```
BUG: Bad page state in process tdarr-ffmpeg pfn:a1af35
page: refcount:0 mapcount:0 mapping:00000000438f9be4 index:0x0 pfn:0xa1af35
aops:cifs_addr_ops [cifs] ino:2f15 dentry name(?):"Alice Through the Looking Glass (2016) Remux-1080p-TdarrCacheF"
```
### 11:23:21+ - RCU Stall Deadlock
```
rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
rcu: Tasks blocked on level-0 rcu_node (CPUs 0-15): P456776
task:ffprobe state:R running task
```
### 11:26:40+ - System Deadlock
```
INFO: task NetworkManager:1806 blocked for more than 122 seconds
INFO: task tailscaled:188215 blocked for more than 122 seconds
INFO: task ThreadPoolForeg:125721 blocked for more than 122 seconds
```
### 11:46:56 - Display Issues (Symptom)
```
qt.qpa.wayland: There are no outputs - creating placeholder screen
kwin_wayland_drm: atomic commit failed: Invalid argument
```
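
To make the ordering of the timeline concrete, the gaps between the logged events can be computed directly. This is a minimal sketch that uses only the timestamps quoted above plus the 11:58:56 reboot time from the evidence summary; the first entry is given only to the minute in this report, so the `:00` seconds value is an assumption, and the event labels are shorthand introduced for illustration.

```python
from datetime import datetime

# Timestamps copied from this report. "11:05" has no seconds in the
# report, so ":00" is assumed here.
EVENTS = [
    ("CIFS server stops responding", "11:05:00"),
    ("Bad page state in tdarr-ffmpeg", "11:22:18"),
    ("RCU stall reported", "11:23:21"),
    ("Hung-task warnings", "11:26:40"),
    ("Wayland/KWin display errors", "11:46:56"),
    ("System back up after reboot", "11:58:56"),
]


def elapsed_between(events):
    """Return (label, gap since previous event, time since first event)."""
    parsed = [(label, datetime.strptime(ts, "%H:%M:%S")) for label, ts in events]
    start = parsed[0][1]
    previous = start
    rows = []
    for label, moment in parsed:
        rows.append((label, moment - previous, moment - start))
        previous = moment
    return rows


if __name__ == "__main__":
    for label, gap, total in elapsed_between(EVENTS):
        print(f"{label:<32} +{gap}  (T+{total})")
```

Under that assumption, the display errors are logged more than 24 minutes after the first `Bad page state` report, which is consistent with reading them as a downstream symptom rather than the trigger.
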
## Root Cause Analysis
### Primary Cause: CIFS + Transcoding Interaction
1. **Network instability** to NAS (10.10.0.35) starting at 11:05
2. **Tdarr container** streaming large video file (10GB+ remux) over CIFS during transcoding
3. **Kernel memory corruption** in CIFS address operations during heavy I/O
4. **RCU deadlock** preventing kernel from completing critical operations
5. **System-wide hang** affecting all processes including desktop environment
### Contributing Factors
- **No container resource limits** - Tdarr could consume unlimited memory
- **Mapped node architecture** - Forces streaming large files over network during processing
- **Aggressive CIFS buffers** - 16MB buffers under memory pressure
- **Inadequate timeout handling** - 90-second hangs before retry attempts
- **No interruption capability** - Kernel couldn't abort hung CIFS operations
## Why Hard Reboot Was Required
The kernel reached a state where:
- **RCU subsystem deadlocked** - Critical kernel operations couldn't complete
- **NetworkManager blocked** - Network stack unresponsive
- **Memory management corrupted** - Kernel reported `Bad page state` for CIFS-backed pages
- **Display driver affected** - GPU operations failed due to kernel issues
A normal shutdown was impossible because of the kernel-level deadlock.
## Evidence Summary
### System Recovered Cleanly
- **After reboot at 11:58:56** - All services started normally
- **No hardware failures** - All components functional
- **Memory test clean** - 62GB available, no corruption detected
- **KDE Plasma working** - Desktop environment fully operational
### KDE Plasma Was Victim, Not Cause
- **Wayland errors were symptoms** - Display issues occurred after kernel problems
- **No Plasma-specific crashes** - No segfaults or application failures in logs
- **Recovery immediate** - Desktop worked perfectly after reboot
## Recommended Actions
### Immediate (Prevent Recurrence)
1. **Implement Tdarr container resource limits** - Prevent memory exhaustion
2. **Update CIFS mount options** - Better timeout and error handling
3. **Convert to unmapped Tdarr node** - Eliminate CIFS streaming during transcoding
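
As an illustration of the first item, the sketch below builds a `docker run` command line with hard memory and CPU caps. The `--memory`, `--memory-swap`, and `--cpus` flags are standard Docker options; the image name, container name, and the specific limit values are placeholders chosen for this example, not the settings recorded in `tdarr-container-fixes.md`.

```python
import shlex


def docker_run_with_limits(image, name, memory="8g", memory_swap="8g", cpus="8"):
    """Build a `docker run` argument list with memory and CPU caps.

    Setting --memory-swap equal to --memory means the container gets no
    extra swap on top of its memory limit. All default values here are
    illustrative assumptions.
    """
    return [
        "docker", "run", "--detach",
        "--name", name,
        "--memory", memory,
        "--memory-swap", memory_swap,
        "--cpus", cpus,
        image,
    ]


if __name__ == "__main__":
    # "example/tdarr-node:latest" is a hypothetical image name.
    cmd = docker_run_with_limits("example/tdarr-node:latest", "tdarr-node")
    print(shlex.join(cmd))
```

Building the command as a list rather than a string keeps it safe to pass to `subprocess.run` without shell quoting issues.
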
### Monitoring (Early Detection)
1. **CIFS error monitoring** - Detect network issues before escalation
2. **Container resource monitoring** - Alert on memory/CPU exhaustion
3. **RCU stall detection** - Kernel deadlock early warning
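
One way to approach these three monitors is a small classifier over kernel log lines. The sketch below is an assumption about how such a check could look, not an existing tool: the regular expressions are derived only from the log excerpts in the timeline, and the category names are invented for this example. In practice the lines might come from `journalctl -k` or `dmesg`.

```python
import re
from collections import Counter

# Patterns modelled on the messages quoted in the timeline of this report.
PATTERNS = {
    "cifs_unresponsive": re.compile(r"CIFS: VFS: .* has not responded in \d+ seconds"),
    "cifs_reconnect_failed": re.compile(r"CIFS: VFS: reconnect tcon failed rc = -?\d+"),
    "bad_page_state": re.compile(r"BUG: Bad page state in process (\S+)"),
    "rcu_stall": re.compile(r"rcu: INFO: \w+ detected stalls on CPUs/tasks"),
    "hung_task": re.compile(r"INFO: task (\S+):(\d+) blocked for more than (\d+) seconds"),
}

# Sample input: lines copied from the timeline.
SAMPLE = [
    r"CIFS: VFS: \\10.10.0.35 has not responded in 90 seconds. Reconnecting...",
    "CIFS: VFS: reconnect tcon failed rc = -11",
    "BUG: Bad page state in process tdarr-ffmpeg pfn:a1af35",
    "rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:",
    "INFO: task NetworkManager:1806 blocked for more than 122 seconds",
    "INFO: task tailscaled:188215 blocked for more than 122 seconds",
    "INFO: task ThreadPoolForeg:125721 blocked for more than 122 seconds",
]


def classify(lines):
    """Count how many lines fall into each known failure category."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
                break  # each line belongs to at most one category
    return counts


if __name__ == "__main__":
    for name, count in classify(SAMPLE).items():
        print(f"{name}: {count}")
```

An alert on the first `cifs_unresponsive` match would have fired at 11:05 in this incident, well before the first `Bad page state` report at 11:22:18.
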
### Architecture (Long-term Stability)
1. **Unmapped transcoding architecture** - Process files locally on NVMe cache
2. **Gaming-aware scheduling** - Prevent resource conflicts
3. **Automated recovery procedures** - Handle network issues gracefully
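
The idea behind processing files locally can be sketched as copy in, work locally, copy out. The example below is a generic illustration of that pattern, not a description of how Tdarr implements unmapped nodes; the function name and the `transcode` callable are hypothetical stand-ins for the real transcoding step.

```python
import shutil
import tempfile
from pathlib import Path


def transcode_via_local_cache(source, cache_dir, transcode):
    """Stage a file on local disk, process it there, then move the result back.

    `transcode` is a caller-supplied callable taking (input_path, output_path).
    The network share is touched only for one sequential read and one
    sequential write; all intermediate I/O happens under `cache_dir`.
    """
    source = Path(source)
    with tempfile.TemporaryDirectory(dir=cache_dir) as workdir:
        local_in = Path(workdir) / source.name
        local_out = Path(workdir) / f"{source.stem}.out{source.suffix}"
        shutil.copy2(source, local_in)            # single read from the share
        transcode(local_in, local_out)            # work happens on local disk
        final = source.with_name(local_out.name)
        shutil.move(str(local_out), str(final))   # single write back
    return final


if __name__ == "__main__":
    share = Path(tempfile.mkdtemp())   # stands in for the CIFS mount
    cache = Path(tempfile.mkdtemp())   # stands in for the NVMe cache
    movie = share / "movie.mkv"
    movie.write_bytes(b"raw video bytes")
    result = transcode_via_local_cache(movie, cache, shutil.copyfile)
    print(result.name)
```

With this shape, a network stall can only interrupt the copy steps, which are simple to retry, instead of stalling the transcoder in the middle of reading a file over CIFS.
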
## Key Learnings
1. **Network storage + intensive I/O = risk** - CIFS streaming large files during transcoding can trigger kernel issues
2. **Container resource limits critical** - Unlimited resources can destabilize entire system
3. **Timeouts prevent hangs** - Proper timeout configuration limits how long a stalled CIFS operation can block, instead of allowing 90-second hangs
4. **Desktop symptoms != desktop cause** - Display issues often indicate deeper system problems
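
To illustrate the timeout point, the sketch below assembles a CIFS mount option string. The option names (`soft`, `hard`, `echo_interval`, `rsize`, `wsize`) are documented in `mount.cifs(8)`; the values are assumptions picked for this example and are not the settings recorded in `cifs-mount-resilience-fixes.md`. Note that `soft` makes a stalled mount return errors to applications rather than blocking, which the applications must be able to tolerate.

```python
def cifs_mount_options(echo_interval=15, rsize=4 * 1024 * 1024,
                       wsize=4 * 1024 * 1024, soft=True):
    """Return a comma-separated option string for a CIFS mount.

    Values are illustrative: a shorter echo interval so an unresponsive
    server is noticed sooner, and read/write buffers smaller than the
    16MB buffers mentioned in this report.
    """
    # mount.cifs(8) documents echo_interval as 1 to 600 seconds.
    if not 1 <= echo_interval <= 600:
        raise ValueError("echo_interval must be between 1 and 600 seconds")
    options = [
        "soft" if soft else "hard",
        f"echo_interval={echo_interval}",
        f"rsize={rsize}",
        f"wsize={wsize}",
    ]
    return ",".join(options)


if __name__ == "__main__":
    print(cifs_mount_options())
```

The resulting string is the kind of value that would go in the options field of an `/etc/fstab` entry or after `-o` on a `mount -t cifs` command line.
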
## Files Created
1. **`tdarr-container-fixes.md`** - Specific container configuration changes
2. **`cifs-mount-resilience-fixes.md`** - CIFS mount option improvements
3. **`crash-analysis-summary.md`** - This comprehensive analysis
## Next Steps
Implement the recommendations in the order specified in the individual fix documents:
1. Phase 1: Immediate fixes to prevent crashes
2. Phase 2: Architecture migration for stability
3. Phase 3: Production hardening and monitoring
The system is currently stable, but without these changes, similar crashes are likely when processing large files over network storage during periods of network instability.