claude-home/reference/networking/cifs-mount-resilience-fixes.md
Cal Corum 34702a37fc CLAUDE: Add comprehensive KDE Plasma crash analysis and prevention documentation
2025-08-11 12:29:31 -05:00

CIFS Mount Resilience Improvements

Date: 2025-08-11
Issue: CIFS network errors escalating to kernel deadlocks and system crashes
Target: /mnt/media mount to NAS at 10.10.0.35

Current Configuration Analysis

Current fstab entry:

//10.10.0.35/media /mnt/media cifs credentials=/home/cal/.samba_credentials,uid=1000,gid=1000,vers=3.1.1,cache=loose,rsize=16777216,wsize=16777216,bsize=4194304,actimeo=30,closetimeo=5,echo_interval=30,noperm 0 0

Problems Identified:

  • Missing critical timeout options leading to 90-second hangs
  • Aggressive buffer sizes (16MB) causing memory pressure during network issues
  • Limited retry attempts (retrans=1) providing minimal resilience
  • No explicit error handling for graceful degradation
  • Missing interruption handling preventing recovery from network deadlocks

Recommended fstab entry:

//10.10.0.35/media /mnt/media cifs credentials=/home/cal/.samba_credentials,uid=1000,gid=1000,vers=3.1.1,soft,intr,timeo=15,retrans=3,rsize=1048576,wsize=1048576,cache=loose,actimeo=10,echo_interval=60,_netdev,noauto,x-systemd.automount,x-systemd.device-timeout=10,x-systemd.mount-timeout=30,noperm 0 0
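To see exactly which options differ between the two entries, a small helper can list the options present in one comma-separated option string and absent from the other. This is a minimal sketch; `opts_only_in` is a hypothetical helper name, and it compares exact `option=value` pairs, so a changed value shows up as a difference.

```shell
# Print the options that appear in FIRST but not in SECOND.
# Both arguments are comma-separated option strings as used in /etc/fstab.
opts_only_in() {
    first=$1
    second=$2
    printf '%s\n' "$first" | tr ',' '\n' | while IFS= read -r opt; do
        case ",$second," in
            *",$opt,"*) ;;                 # same option (and value) in SECOND
            *) printf '%s\n' "$opt" ;;     # new or changed option
        esac
    done
}

# Example with shortened option strings (new entry first, old entry second)
opts_only_in "soft,rsize=1048576,noperm" "rsize=16777216,noperm"
```

Swapping the two arguments lists the options that were dropped or changed relative to the old entry.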

Key Improvements Explained

Better Timeout Handling

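  • timeo=15 - Intended as a 15-second request timeout to prevent 90-second hangs. timeo is an NFS option (where it is measured in tenths of a second) and is not listed in mount.cifs(8), so confirm the kernel accepts it on a CIFS mount before relying on it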
  • retrans=3 - 3 retry attempts instead of 1
  • x-systemd.device-timeout=10 - 10-second systemd device timeout
  • x-systemd.mount-timeout=30 - 30-second mount operation timeout

Graceful Error Recovery

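  • soft - Allows operations to fail with an error instead of hanging indefinitely (mount.cifs(8) lists this as the default; setting it explicitly documents the intent)
  • intr - Intended to let the kernel interrupt hung operations. mount.cifs(8) documents intr as currently unimplemented, so do not rely on it alone to prevent deadlocks
  • _netdev - Indicates network dependency for proper boot ordering
  • noauto,x-systemd.automount - Mounts on first access instead of at boot (unmounting when idle additionally requires x-systemd.idle-timeout, which this entry does not set)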

Preventing Kernel Deadlocks

  • Smaller buffer sizes - rsize=1048576,wsize=1048576 (1MB instead of 16MB) reduces memory pressure
  • actimeo=10 - Shorter attribute cache timeout (10s vs 30s) for faster error detection
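  • echo_interval=60 - Longer keepalive interval reduces network chatter, at the cost of slower detection of an unresponsive server, since the echo interval also affects that timeout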
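The rsize and wsize values are byte counts, so the change from 16MB to 1MB is easier to check when converted. The sketch below does only that conversion; `bytes_to_mib` is a hypothetical helper, and it does not model how much memory the kernel actually holds for in-flight requests.

```shell
# Convert a byte count (as used by rsize/wsize) to whole mebibytes.
bytes_to_mib() {
    echo $(( $1 / 1048576 ))
}

bytes_to_mib 16777216   # previous rsize/wsize
bytes_to_mib 1048576    # recommended rsize/wsize
```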

Network Interruption Resilience

  • cache=loose - Maintains loose caching for better performance with network issues
  • Combined timeout strategy - Multiple timeout layers prevent single failure from hanging system

Implementation Steps

Step 1: Backup Current Configuration

sudo cp /etc/fstab /etc/fstab.backup
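If several rounds of edits are likely, a timestamped copy avoids overwriting an earlier backup. A minimal sketch, assuming a hypothetical helper named `backup_with_timestamp`; it needs to run as root when the target is /etc/fstab.

```shell
# Copy FILE to FILE.backup-YYYYmmdd-HHMMSS (preserving mode and timestamps)
# and print the path of the backup that was created.
backup_with_timestamp() {
    src=$1
    dest="$src.backup-$(date +%Y%m%d-%H%M%S)"
    cp -p -- "$src" "$dest" && printf '%s\n' "$dest"
}

# Usage (as root):
#   backup_with_timestamp /etc/fstab
```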

Step 2: Update /etc/fstab

Replace the current line with the recommended configuration above.

Step 3: Test the New Configuration

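# Reload systemd so units generated from /etc/fstab pick up the new options
sudo systemctl daemon-reload

# Unmount current mount
sudo umount /mnt/media

# Remount with new options
sudo mount /mnt/media

# Verify new mount options are active
mount | grep /mnt/media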
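Reading a long option string by eye is error-prone, so the check can be scripted. The sketch below tests whether every expected option appears in the active option string; `has_all_opts` is a hypothetical helper. Note that the kernel reports options in its own form, so fstab-only entries such as _netdev and the x-systemd.* options are not expected to appear in the active list.

```shell
# Succeed only if every option in WANTED appears in the ACTIVE option string.
# Prints one "missing: <option>" line for each option that is absent.
has_all_opts() {
    active=$1
    wanted=$2
    missing=0
    for opt in $(printf '%s' "$wanted" | tr ',' ' '); do
        case ",$active," in
            *",$opt,"*) ;;
            *) printf 'missing: %s\n' "$opt"; missing=1 ;;
        esac
    done
    return "$missing"
}

# Usage against the live mount:
#   has_all_opts "$(findmnt -t cifs -n -o OPTIONS /mnt/media)" "soft,rsize=1048576,wsize=1048576"
```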

Step 4: Validate Network Resilience

# Test timeout behavior with network simulation
# (Temporarily disconnect NAS network cable for 30 seconds)
# Verify mount operations fail gracefully instead of hanging system
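The interruption test can be made measurable by wrapping a read-only access in a time limit. This is a sketch under assumptions: `probe_mount` is a hypothetical helper, it relies on the coreutils `timeout` command, and a process stuck in an uninterruptible kernel wait may not be killable, in which case the probe itself can hang.

```shell
# Try to list TARGET, giving up after LIMIT seconds.
# Prints "ok", "failed" (an error was returned in time) or "timed out".
probe_mount() {
    target=$1
    limit=$2
    timeout "$limit" ls "$target" >/dev/null 2>&1
    case $? in
        0)   echo "ok" ;;
        124) echo "timed out" ;;   # exit status GNU timeout uses when the limit is hit
        *)   echo "failed" ;;
    esac
}

# Usage while the NAS is disconnected:
#   probe_mount /mnt/media 30
```

A result of "failed" within the limit is the graceful outcome described above; "timed out" means the access was still blocked when the limit expired.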

Additional System-Level Protections

1. Network Monitoring Script

Create a monitoring script to detect NAS connectivity issues:

#!/bin/bash
# /mnt/NV2/Development/claude-home/scripts/monitoring/nas-connectivity-monitor.sh
ping -c 1 -W 5 10.10.0.35 || echo "NAS connectivity issue detected"
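A single dropped ping can produce false alarms, so one possible refinement is to retry the check a few times before reporting. The sketch below is illustrative; `check_with_retries` and the `NAS_HOST` variable are hypothetical names, and the retry count is an arbitrary choice.

```shell
NAS_HOST="${NAS_HOST:-10.10.0.35}"

# Run a check command up to ATTEMPTS times and succeed on the first pass.
# Usage: check_with_retries ATTEMPTS COMMAND [ARGS...]
check_with_retries() {
    attempts=$1
    shift
    i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@" >/dev/null 2>&1; then
            echo "ok after $i attempt(s)"
            return 0
        fi
        i=$((i + 1))
    done
    echo "failed after $attempts attempt(s)"
    return 1
}

# Usage from cron or a systemd timer:
#   check_with_retries 3 ping -c 1 -W 5 "$NAS_HOST" \
#       || logger -t nas-monitor "NAS connectivity issue detected"
```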

2. Systemd Service Dependencies

Configure services to gracefully handle mount failures:

# Add to the [Unit] section of services that depend on /mnt/media
After=mnt-media.mount
Wants=mnt-media.mount
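The unit name mnt-media.mount is derived from the mount path. The sketch below shows that mapping for plain paths only; `mount_unit_name` is a hypothetical helper that does not implement systemd's full escaping rules (for example for paths containing "-"), for which `systemd-escape -p --suffix=mount PATH` is the authoritative tool.

```shell
# Simplified path-to-unit-name conversion for plain paths such as /mnt/media.
mount_unit_name() {
    p=${1#/}          # drop the leading slash
    p=${p%/}          # drop a trailing slash, if any
    printf '%s.mount\n' "$(printf '%s' "$p" | tr '/' '-')"
}

mount_unit_name /mnt/media
```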

3. Kernel Parameter Tuning

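Consider CIFS module parameter tuning:

# CIFSMaxBufSize is a cifs module parameter (buffer size in bytes, not a timeout);
# it is set at module load time (e.g. via /etc/modprobe.d), not in /etc/sysctl.conf
cat /sys/module/cifs/parameters/CIFSMaxBufSize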

Expected Improvements

After implementing these changes:

Immediate Benefits

  • No more 90-second hangs - Operations fail fast with 15-second timeouts
  • Graceful error recovery - soft lets stalled operations return errors instead of blocking
  • Reduced memory pressure - Smaller 1MB buffers vs 16MB
  • Better retry behavior - 3 attempts instead of 1

System Stability

  • Prevents kernel deadlocks - Operations can be interrupted and retried
  • Faster error detection - 10-second attribute cache timeout
  • Automatic recovery - systemd auto-mounting handles reconnection

Performance

  • Maintained caching benefits - cache=loose preserves performance
  • Reduced network overhead - 60-second keepalive intervals
  • Efficient buffer usage - 1MB buffers balance performance and stability

Files to Modify

  1. /etc/fstab - Primary mount configuration
  2. Optional monitoring scripts - NAS connectivity checks
  3. Service configurations - Dependencies on mount availability

Testing Checklist

  • Backup current fstab configuration
  • Apply new mount options
  • Test normal operation (read/write files)
  • Test network interruption handling (disconnect NAS briefly)
  • Verify fast failure instead of system hangs
  • Monitor system stability over 24 hours
  • Validate with Tdarr container operations

Monitoring and Validation

Success Criteria

  • Mount operations fail within 30 seconds during network issues
  • No kernel RCU stalls or deadlock messages in journal
  • System remains responsive during NAS network problems
  • Automatic remount when network connectivity is restored
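The journal criterion can be checked with a simple filter over kernel messages. This is a rough sketch: `count_stability_warnings` is a hypothetical helper, and the patterns are assumptions about typical message wording that should be adjusted to what actually appears in the journal on this system.

```shell
# Count log lines on stdin that look like CIFS errors, RCU stalls,
# or hung-task warnings. The patterns are illustrative, not exhaustive.
count_stability_warnings() {
    grep -ciE 'cifs: vfs|rcu.*stall|blocked for more than'
}

# Usage against the kernel log of the current boot:
#   journalctl -k -b --no-pager | count_stability_warnings
```

A count of 0 over the 24-hour observation window would be consistent with the criteria above, although it does not prove that no other failure mode occurred.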

Long-term Monitoring

  • Monitor journal for CIFS error patterns
  • Track system stability metrics
  • Validate performance impact of smaller buffers
  • Ensure gaming and transcoding workloads remain unaffected