Cal Corum 34702a37fc CLAUDE: Add comprehensive KDE Plasma crash analysis and prevention documentation

- Add crash-analysis-summary.md: Complete incident timeline and root cause analysis
- Add tdarr-container-fixes.md: Container resource limits and unmapped node conversion
- Add cifs-mount-resilience-fixes.md: CIFS mount options for kernel stability
- Update tdarr-troubleshooting.md: Link to new system crash prevention measures
- Update nas-mount-configuration.md: Add stability considerations for production systems

Root cause: CIFS streaming of large files during transcoding caused kernel memory
corruption and system deadlock. Documents provide comprehensive prevention strategy.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

2025-08-11 12:29:31 -05:00


CIFS Mount Resilience Improvements

Date: 2025-08-11
Issue: CIFS network errors escalating to kernel deadlocks and system crashes
Target: /mnt/media mount to NAS at 10.10.0.35

Current Configuration Analysis

Current fstab entry:

//10.10.0.35/media /mnt/media cifs credentials=/home/cal/.samba_credentials,uid=1000,gid=1000,vers=3.1.1,cache=loose,rsize=16777216,wsize=16777216,bsize=4194304,actimeo=30,closetimeo=5,echo_interval=30,noperm 0 0

Problems Identified:

No explicit timeout or failure-mode options, which the incident analysis links to 90-second hangs
Aggressive buffer sizes (rsize and wsize of 16 MiB) causing memory pressure during network issues
No retry tuning (the entry sets no retrans value), so resilience depends on client defaults
No explicit error handling for graceful degradation
No interruption handling, preventing recovery from network deadlocks
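
The buffer-size figures above can be checked directly from the option string. The following is a minimal sketch, not part of the original configuration: the function names get_mount_opt and bytes_to_mib are hypothetical helpers introduced only for illustration.

```shell
#!/bin/bash
# Hypothetical helper: print the value of one key=value mount option
# from a comma-separated option string. Prints nothing and returns 1
# if the key is absent or is a bare flag without a value.
get_mount_opt() {
    local opts="$1" key="$2" opt
    local IFS=','
    for opt in $opts; do
        if [ "${opt%%=*}" = "$key" ] && [ "$opt" != "$key" ]; then
            printf '%s\n' "${opt#*=}"
            return 0
        fi
    done
    return 1
}

# Hypothetical helper: convert a byte count to whole MiB
# using shell integer arithmetic.
bytes_to_mib() {
    echo $(( $1 / 1048576 ))
}
```

Passing the option field of the current fstab entry to get_mount_opt with the key rsize yields 16777216, which bytes_to_mib reports as 16.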

Recommended CIFS Mount Configuration

New improved fstab entry:

//10.10.0.35/media /mnt/media cifs credentials=/home/cal/.samba_credentials,uid=1000,gid=1000,vers=3.1.1,soft,intr,timeo=15,retrans=3,rsize=1048576,wsize=1048576,cache=loose,actimeo=10,echo_interval=60,_netdev,noauto,x-systemd.automount,x-systemd.device-timeout=10,x-systemd.mount-timeout=30,noperm 0 0

Key Improvements Explained

Better Timeout Handling

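timeo=15 - Intended as a 15-second request timeout to prevent 90-second hangs. Note that timeo is documented for NFS (where its unit is tenths of a second) and does not appear among the options in mount.cifs(8); confirm the running kernel accepts it, since an unrecognized CIFS option can cause the mount to fail
retrans=3 - Intended as 3 retry attempts. retrans is also primarily an NFS option; whether CIFS honors it depends on the kernel version, so verify before relying on it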
x-systemd.device-timeout=10 - 10-second systemd device timeout
x-systemd.mount-timeout=30 - 30-second mount operation timeout

Graceful Error Recovery

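soft - Allows operations to fail with an error instead of hanging indefinitely (mount.cifs(8) lists soft as the default, so this makes the behavior explicit)
intr - Intended to let the kernel interrupt hung operations. mount.cifs(8) describes intr as currently unimplemented, so it should not be relied on for deadlock prevention
_netdev - Indicates network dependency for proper boot ordering
noauto,x-systemd.automount - Mounts on first access instead of at boot. Unmounting when idle additionally requires x-systemd.idle-timeout, which this entry does not set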

Preventing Kernel Deadlocks

Smaller buffer sizes - rsize=1048576,wsize=1048576 (1 MiB instead of 16 MiB) reduces memory pressure
actimeo=10 - Shorter attribute cache timeout (10s vs 30s), so stale metadata is revalidated sooner
echo_interval=60 - Longer keepalive interval (60s vs 30s) reduces network chatter, at the cost of slower detection of an unresponsive server

Network Interruption Resilience

cache=loose - Unchanged from the current entry; keeps client-side caching so reads that hit the cache do not depend on the network
Combined timeout strategy - Mount-level and systemd-level timeouts are layered so that a single failure is less likely to hang the system

Implementation Steps

Step 1: Backup Current Configuration

sudo cp /etc/fstab /etc/fstab.backup

Step 2: Update /etc/fstab

Replace the current line with the recommended configuration above.

Step 3: Test the New Configuration

# Reload systemd so it regenerates mount units from the edited fstab
sudo systemctl daemon-reload

# Unmount current mount
sudo umount /mnt/media

# Remount with new options
sudo mount /mnt/media

# Verify new mount options are active
mount | grep /mnt/media
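
The verification step can be made less error-prone than reading grep output by eye. This is a minimal sketch under stated assumptions: has_mount_opts is a hypothetical helper, and the option list passed to it is only an example.

```shell
#!/bin/bash
# Hypothetical helper: succeed only if every option given in $2..$n
# appears in the comma-separated option string $1.
# Reports the first missing option on stderr and returns 1.
has_mount_opts() {
    local active=",$1," want
    shift
    for want in "$@"; do
        case "$active" in
            *",$want,"*) ;;
            *) echo "missing option: $want" >&2; return 1 ;;
        esac
    done
}
```

One possible invocation is has_mount_opts "$(findmnt -t cifs -no OPTIONS /mnt/media)" soft rsize=1048576 wsize=1048576. Options consumed by userspace, such as _netdev and the x-systemd.* entries, are not expected to appear in the kernel's option list, and the kernel may print some options in a different form than fstab uses.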

Step 4: Validate Network Resilience

# Test timeout behavior with network simulation
# (Temporarily disconnect NAS network cable for 30 seconds)
# Verify mount operations fail gracefully instead of hanging system
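
While the NAS is disconnected, a time-limited probe gives a rough pass/fail signal for the fail-fast behavior. This is a sketch only: probe_path is a hypothetical helper, and the 20-second default limit is an assumption, not a value from the configuration above.

```shell
#!/bin/bash
# Hypothetical helper: stat a path under a hard time limit.
# Prints "ok" if stat succeeds, "timeout" if the limit expires
# (coreutils timeout exits with 124), and "error" otherwise.
probe_path() {
    local path="$1" limit="${2:-20}" rc
    if timeout "$limit" stat -- "$path" >/dev/null 2>&1; then
        rc=0
    else
        rc=$?
    fi
    case "$rc" in
        0)   echo "ok" ;;
        124) echo "timeout" ;;
        *)   echo "error" ;;
    esac
}
```

Running probe_path /mnt/media during the outage should print error rather than timeout if operations are failing fast. A process blocked in uninterruptible I/O may not respond to the signal that timeout sends, so a probe that never returns is itself evidence that the hang persists.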

Additional System-Level Protections

1. Network Monitoring Script

Create a monitoring script to detect NAS connectivity issues:

#!/bin/bash
# /mnt/NV2/Development/claude-home/scripts/monitoring/nas-connectivity-monitor.sh
ping -c 1 -W 5 10.10.0.35 || echo "NAS connectivity issue detected"
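
A single dropped ping can produce false alarms, so the check may be worth retrying before reporting. The sketch below assumes a hypothetical helper named check_with_retries; it is not part of the script above.

```shell
#!/bin/bash
# Hypothetical helper: run a check command up to N times and report
# failure only when every attempt fails.
# Usage: check_with_retries N command [args...]
check_with_retries() {
    local attempts="$1" i
    shift
    for (( i = 1; i <= attempts; i++ )); do
        if "$@" >/dev/null 2>&1; then
            echo "reachable after $i attempt(s)"
            return 0
        fi
    done
    echo "unreachable after $attempts attempt(s)"
    return 1
}
```

For example, check_with_retries 3 ping -c 1 -W 5 10.10.0.35 || logger -t nas-monitor "NAS connectivity issue detected" would write to the system log only after three consecutive failures.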

2. Systemd Service Dependencies

Configure services to gracefully handle mount failures:

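# Add to the [Unit] section of services that depend on /mnt/media.
# With x-systemd.automount in fstab, systemd also generates mnt-media.automount.
[Unit]
After=mnt-media.mount
Wants=mnt-media.mount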

3. Kernel Parameter Tuning

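Consider CIFS module parameter tuning. Note that CIFSMaxBufSize is a network buffer size in bytes rather than a timeout (modinfo cifs reports its default and valid range), and module parameters are set when the module loads rather than through /etc/sysctl.conf:

# Inspect the current value (typically read-only at runtime)
cat /sys/module/cifs/parameters/CIFSMaxBufSize
# To change it, add a line like this to a file under /etc/modprobe.d/ and reload the module:
# options cifs CIFSMaxBufSize=<bytes>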
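
Before changing any CIFS module parameter, it helps to record its current value. The following is a minimal sketch: module_param is a hypothetical helper, and its optional third argument (an alternate sysfs root) exists only so the function can be exercised on a machine without the module loaded.

```shell
#!/bin/bash
# Hypothetical helper: print a kernel module parameter from sysfs,
# or "unavailable" if the module is not loaded or the parameter
# is not exposed. $3 optionally overrides the sysfs module root.
module_param() {
    local module="$1" param="$2" root="${3:-/sys/module}"
    local file="$root/$module/parameters/$param"
    if [ -r "$file" ]; then
        cat "$file"
    else
        echo "unavailable"
    fi
}
```

For example, module_param cifs CIFSMaxBufSize prints the buffer size currently in effect, which can be compared against the range that modinfo cifs reports.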

Expected Improvements

After implementing these changes:

Immediate Benefits

No more 90-second hangs - Operations are intended to fail fast instead of blocking
Graceful error recovery - soft lets operations return an error instead of hanging
Reduced memory pressure - Smaller 1 MiB buffers vs 16 MiB
Better retry behavior - Up to 3 retries, where the kernel honors retrans

System Stability

Lower deadlock risk - Operations can fail and be retried instead of blocking indefinitely
Fresher metadata - 10-second attribute cache timeout
Automatic recovery - systemd automount remounts the share on the next access

Performance

Maintained caching benefits - cache=loose preserves performance
Reduced network overhead - 60-second keepalive intervals
Efficient buffer usage - 1MB buffers balance performance and stability

Files to Modify

/etc/fstab - Primary mount configuration
Optional monitoring scripts - NAS connectivity checks
Service configurations - Dependencies on mount availability

Testing Checklist

Backup current fstab configuration
Apply new mount options
Test normal operation (read/write files)
Test network interruption handling (disconnect NAS briefly)
Verify fast failure instead of system hangs
Monitor system stability over 24 hours
Validate with Tdarr container operations
