KDE Plasma Crash Analysis Summary
Date: 2025-08-11
Incident: Hard system crash requiring forced reboot
Analysis Period: ~11:00 - 11:58 (crash timeline)
Executive Summary
KDE Plasma did not crash. The system hit kernel-level deadlocks caused by CIFS network problems during intensive Tdarr transcoding, and the desktop environment became unresponsive as a symptom of those kernel problems.
Timeline of Events
11:05 - Network Issues Begin
CIFS: VFS: \\10.10.0.35 has not responded in 90 seconds. Reconnecting...
CIFS: VFS: reconnect tcon failed rc = -11
11:22:18 - Kernel Memory Corruption
BUG: Bad page state in process tdarr-ffmpeg pfn:a1af35
page: refcount:0 mapcount:0 mapping:00000000438f9be4 index:0x0 pfn:0xa1af35
aops:cifs_addr_ops [cifs] ino:2f15 dentry name(?):"Alice Through the Looking Glass (2016) Remux-1080p-TdarrCacheF"
11:23:21+ - RCU Stall Deadlock
rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
rcu: Tasks blocked on level-0 rcu_node (CPUs 0-15): P456776
task:ffprobe state:R running task
11:26:40+ - System Deadlock
INFO: task NetworkManager:1806 blocked for more than 122 seconds
INFO: task tailscaled:188215 blocked for more than 122 seconds
INFO: task ThreadPoolForeg:125721 blocked for more than 122 seconds
11:46:56 - Display Issues (Symptom)
qt.qpa.wayland: There are no outputs - creating placeholder screen
kwin_wayland_drm: atomic commit failed: Invalid argument
Root Cause Analysis
Primary Cause: CIFS + Transcoding Interaction
- Network instability to NAS (10.10.0.35) starting at 11:05
- Tdarr container streaming large video file (10GB+ remux) over CIFS during transcoding
- Kernel memory corruption in CIFS address operations during heavy I/O
- RCU deadlock preventing kernel from completing critical operations
- System-wide hang affecting all processes including desktop environment
Contributing Factors
- No container resource limits - Tdarr could consume unlimited memory
- Mapped node architecture - Forces streaming large files over network during processing
- Aggressive CIFS buffers - 16MB buffers under memory pressure
- Inadequate timeout handling - 90-second hangs before retry attempts
- No interruption capability - Kernel couldn't abort hung CIFS operations
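To make the first factor concrete, here is a minimal sketch of what bounded container resources could look like. It only builds a `docker run` argument list. The image name, container name, and the 8g / 6-core values are placeholders chosen for illustration, not the settings from this incident or from the fix documents.

```python
import shlex


def docker_run_args(image, name, memory="8g", cpus="6.0"):
    """Build a `docker run` argument list with hard memory and CPU caps.

    All values are illustrative assumptions; size them for the real host.
    """
    return [
        "docker", "run", "--detach",
        "--name", name,
        "--memory", memory,       # hard RAM limit for the container
        "--memory-swap", memory,  # equal to --memory, so no additional swap
        "--cpus", cpus,           # CPU quota expressed in cores
        image,
    ]


if __name__ == "__main__":
    # Hypothetical image and container names.
    args = docker_run_args("example/tdarr-node:latest", "tdarr-node")
    print(shlex.join(args))
```

With limits like these, a runaway transcode is contained by the container's cgroup instead of competing with the rest of the host for memory.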
Why Hard Reboot Was Required
The kernel reached a state where:
- RCU subsystem deadlocked - Critical kernel operations couldn't complete
- NetworkManager blocked - Network stack unresponsive
- Memory management corrupted - Kernel reported a bad page state on a CIFS-backed page
- Display driver affected - GPU operations failed due to kernel issues
A normal shutdown was impossible because of the kernel-level deadlock.
Evidence Summary
System Recovered Cleanly
- After reboot at 11:58:56 - All services started normally
- No hardware failures - All components functional
- Memory test clean - 62GB available, no corruption detected
- KDE Plasma working - Desktop environment fully operational
KDE Plasma Was Victim, Not Cause
- Wayland errors were symptoms - Display issues occurred after kernel problems
- No Plasma-specific crashes - No segfaults or application failures in logs
- Recovery immediate - Desktop worked perfectly after reboot
Recommended Actions
Immediate (Prevent Recurrence)
- Implement Tdarr container resource limits - Prevent memory exhaustion
- Update CIFS mount options - Better timeout and error handling
- Convert to unmapped Tdarr node - Eliminate CIFS streaming during transcoding
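As a rough illustration of the CIFS mount option change, the sketch below assembles an option string from `mount.cifs` options that exist today: `soft`, `echo_interval`, `rsize`, and `wsize`. The specific values (15-second echo interval, 1 MiB buffers), the share name `media`, and the mount point are assumptions for the example. The authoritative settings belong in cifs-mount-resilience-fixes.md. Credentials and SMB version options are deliberately left out.

```python
def cifs_mount_options(echo_interval=15, io_size=1024 * 1024, soft=True):
    """Return a comma-separated mount.cifs option string.

    echo_interval: seconds between echo requests; lower values detect an
                   unresponsive server sooner (mount.cifs accepts 1..600).
    io_size:       rsize/wsize in bytes; 1 MiB here instead of 16 MiB.
    soft:          return errors to applications instead of hanging.
    """
    if not 1 <= echo_interval <= 600:
        raise ValueError("echo_interval must be between 1 and 600 seconds")
    options = [
        "soft" if soft else "hard",
        f"echo_interval={echo_interval}",
        f"rsize={io_size}",
        f"wsize={io_size}",
    ]
    return ",".join(options)


def fstab_line(share, mountpoint, options):
    """Format an /etc/fstab entry for a CIFS share."""
    return f"{share} {mountpoint} cifs {options} 0 0"


if __name__ == "__main__":
    # Share name and mount point are hypothetical.
    print(fstab_line("//10.10.0.35/media", "/mnt/media", cifs_mount_options()))
```

Smaller buffers reduce the amount of page cache tied to the share under memory pressure, and a shorter echo interval shortens the window before the client notices the NAS has gone quiet.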
Monitoring (Early Detection)
- CIFS error monitoring - Detect network issues before escalation
- Container resource monitoring - Alert on memory/CPU exhaustion
- RCU stall detection - Kernel deadlock early warning
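The monitoring items above could start from a simple pattern match over kernel log lines. The sketch below uses regular expressions derived from the messages quoted in the timeline. The category names and the `classify` / `scan` helpers are invented for this example, and the patterns may need adjusting for other kernel versions.

```python
import re
from collections import Counter

# Patterns modelled on the kernel messages seen during this incident.
PATTERNS = {
    "cifs_unresponsive": re.compile(r"CIFS: VFS: .* has not responded in \d+ seconds"),
    "cifs_reconnect_failed": re.compile(r"CIFS: VFS: reconnect tcon failed"),
    "bad_page_state": re.compile(r"BUG: Bad page state"),
    "rcu_stall": re.compile(r"rcu: INFO: \S+ detected stalls"),
    "hung_task": re.compile(r"INFO: task \S+ blocked for more than \d+ seconds"),
}


def classify(line):
    """Return the first matching category name for a log line, or None."""
    for name, pattern in PATTERNS.items():
        if pattern.search(line):
            return name
    return None


def scan(lines):
    """Count how many lines fall into each category."""
    counts = Counter()
    for line in lines:
        category = classify(line)
        if category is not None:
            counts[category] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "CIFS: VFS: reconnect tcon failed rc = -11",
        "INFO: task tailscaled:188215 blocked for more than 122 seconds",
    ]
    for category, count in scan(sample).items():
        print(category, count)
```

In practice the input could come from `journalctl -k -f`, with the counts forwarded to whatever alerting channel is already in use. In this incident the first CIFS warning appeared at 11:05, about 17 minutes before the bad page state at 11:22, so even a coarse alert on `cifs_unresponsive` would have left time to pause transcoding.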
Architecture (Long-term Stability)
- Unmapped transcoding architecture - Process files locally on NVMe cache
- Gaming-aware scheduling - Prevent resource conflicts
- Automated recovery procedures - Handle network issues gracefully
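The idea behind processing files locally can be sketched as a copy-in, work, copy-out pattern. This is a generic illustration and not Tdarr's own unmapped-node implementation. The `process_locally` function and the `transcode` callable are hypothetical names, and the cache directory stands in for the NVMe cache.

```python
import shutil
import tempfile
from pathlib import Path


def process_locally(source: Path, cache_dir: Path, transcode) -> Path:
    """Stage a file on local disk, process it there, and copy the result back.

    `transcode` is any callable that takes the local input path and returns
    the path of the output file it wrote (assumed to be in the same folder).
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    workdir = Path(tempfile.mkdtemp(dir=cache_dir))
    try:
        local_in = workdir / source.name
        shutil.copyfile(source, local_in)   # one sequential read from the share
        local_out = transcode(local_in)     # heavy random I/O stays on local disk
        final = source.with_name(local_out.name)
        shutil.copyfile(local_out, final)   # one sequential write back
        return final
    finally:
        # Always clean up the staging directory, even if transcoding fails.
        shutil.rmtree(workdir, ignore_errors=True)
```

The network share is touched only for two bulk copies, so a CIFS stall can still fail a copy but no longer sits underneath a long-running ffmpeg process.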
Key Learnings
- Network storage + intensive I/O = risk - CIFS streaming large files during transcoding can trigger kernel issues
- Container resource limits critical - Unlimited resources can destabilize entire system
- Timeouts prevent hangs - Proper timeout configuration shortens the 90-second hangs that preceded each CIFS reconnect attempt
- Desktop symptoms != desktop cause - Display issues often indicate deeper system problems
Files Created
- tdarr-container-fixes.md - Specific container configuration changes
- cifs-mount-resilience-fixes.md - CIFS mount option improvements
- crash-analysis-summary.md - This comprehensive analysis
Next Steps
Implement the recommendations in the order specified in the individual fix documents:
- Phase 1: Immediate fixes to prevent crashes
- Phase 2: Architecture migration for stability
- Phase 3: Production hardening and monitoring
The system is currently stable, but without these changes, similar crashes are likely when processing large files over network storage during periods of network instability.