# CIFS Mount Resilience Improvements

**Date**: 2025-08-11
**Issue**: CIFS network errors escalating to kernel deadlocks and system crashes
**Target**: /mnt/media mount to NAS at 10.10.0.35

## Current Configuration Analysis

**Current fstab entry**:
```bash
//10.10.0.35/media /mnt/media cifs credentials=/home/cal/.samba_credentials,uid=1000,gid=1000,vers=3.1.1,cache=loose,rsize=16777216,wsize=16777216,bsize=4194304,actimeo=30,closetimeo=5,echo_interval=30,noperm 0 0
```

**Problems Identified**:
- Missing critical timeout options leading to 90-second hangs
- Aggressive buffer sizes (16MB) causing memory pressure during network issues
- Limited retry attempts (retrans=1) providing minimal resilience
- No explicit error handling for graceful degradation
- Missing interruption handling preventing recovery from network deadlocks

## Recommended CIFS Mount Configuration

**New improved fstab entry**:
```bash
//10.10.0.35/media /mnt/media cifs credentials=/home/cal/.samba_credentials,uid=1000,gid=1000,vers=3.1.1,soft,intr,timeo=15,retrans=3,rsize=1048576,wsize=1048576,cache=loose,actimeo=10,echo_interval=60,_netdev,noauto,x-systemd.automount,x-systemd.device-timeout=10,x-systemd.mount-timeout=30,noperm 0 0
```

## Key Improvements Explained

### Better Timeout Handling
- **`timeo=15`** - Intended as a 15-second request timeout to prevent 90-second hangs. Caveat: `timeo` is an NFS mount option (where its unit is tenths of a second) and is not listed in `mount.cifs(8)`; if the mount fails with an unknown-parameter error, remove it
- **`retrans=3`** - 3 retry attempts instead of 1 (only recognized by recent kernels; check `mount.cifs(8)` on the target system)
- **`x-systemd.device-timeout=10`** - 10-second systemd device timeout
- **`x-systemd.mount-timeout=30`** - 30-second mount operation timeout

### Graceful Error Recovery
- **`soft`** - Allows operations to fail instead of hanging indefinitely (already the CIFS default; listed explicitly to document intent)
- **`intr`** - Meant to let the kernel interrupt hung operations. `mount.cifs(8)` describes this option as currently unimplemented, so do not rely on it alone to prevent deadlocks
- **`_netdev`** - Indicates network dependency for proper boot ordering
- **`noauto,x-systemd.automount`** - Auto-mount on first access (unmounting when idle additionally requires `x-systemd.idle-timeout=`)

### Preventing Kernel Deadlocks
- **Smaller buffer sizes** - `rsize=1048576,wsize=1048576` (1MB instead of
16MB) reduces memory pressure
- **`actimeo=10`** - Shorter attribute cache timeout (10s vs 30s), so cached metadata is revalidated sooner and server errors surface earlier
- **`echo_interval=60`** - Longer keepalive interval reduces network chatter, at the cost of slower detection of an unresponsive server

### Network Interruption Resilience
- **`cache=loose`** - Maintains loose caching for better performance with network issues
- **Combined timeout strategy** - Multiple timeout layers keep a single failure from hanging the system

## Implementation Steps

### Step 1: Backup Current Configuration
```bash
sudo cp /etc/fstab /etc/fstab.backup
```

### Step 2: Update /etc/fstab
Replace the current line with the recommended configuration above.

### Step 3: Test the New Configuration
```bash
# Reload systemd so it regenerates mount units from the edited fstab
sudo systemctl daemon-reload

# Unmount current mount
sudo umount /mnt/media

# Remount with new options
sudo mount /mnt/media

# Verify new mount options are active
mount | grep /mnt/media
```

### Step 4: Validate Network Resilience
```bash
# Test timeout behavior with network simulation
# (Temporarily disconnect NAS network cable for 30 seconds)
# Verify mount operations fail gracefully instead of hanging system
# While disconnected, a probe like this should return an error rather than block:
timeout 60 ls /mnt/media; echo "exit status: $?"
```

## Additional System-Level Protections

### 1. Network Monitoring Script
Create a monitoring script to detect NAS connectivity issues:
```bash
#!/bin/bash
# /mnt/NV2/Development/claude-home/scripts/monitoring/nas-connectivity-monitor.sh
ping -c 1 -W 5 10.10.0.35 || echo "NAS connectivity issue detected"
```

### 2. Systemd Service Dependencies
Configure services to gracefully handle mount failures:
```bash
# Add to the [Unit] section of services that depend on /mnt/media
After=mnt-media.mount
Wants=mnt-media.mount
```

### 3. Kernel Parameter Tuning
Consider tuning CIFS module parameters. `CIFSMaxBufSize` is a network buffer size in bytes (documented range 8192 to 130048), not a timeout, and it cannot be changed by writing to `/sys` at runtime or through `/etc/sysctl.conf`:
```bash
# CIFSMaxBufSize is a cifs module parameter (buffer size in bytes), not a sysctl or a timeout
cat /sys/module/cifs/parameters/CIFSMaxBufSize   # inspect the current value
# To change it, set it at module load time, e.g. "options cifs CIFSMaxBufSize=<bytes>" in /etc/modprobe.d/
```

## Expected Improvements

After implementing these changes:

### Immediate Benefits
- **No more 90-second hangs** - Operations are expected to return an error quickly instead of blocking, provided the client accepts the timeout options
- **Graceful error recovery** - `soft` semantics let operations fail rather than hang; `intr` is requested as well, where the client implements it
- **Reduced memory pressure** - Smaller 1MB buffers vs 16MB
- **Better retry behavior** - `retrans=3` requests 3 attempts instead of 1

### System Stability
- **Lower risk of kernel deadlocks** - Operations can fail and be retried instead of blocking indefinitely
- **Faster error detection** - 10-second attribute cache timeout
- **Automatic recovery** - systemd auto-mounting handles reconnection

### Performance
- **Maintained caching benefits** - `cache=loose` preserves performance
- **Reduced network overhead** - 60-second keepalive intervals
- **Efficient buffer usage** - 1MB buffers balance performance and stability

## Files to Modify

1. **`/etc/fstab`** - Primary mount configuration
2. **Optional monitoring scripts** - NAS connectivity checks
3. **Service configurations** - Dependencies on mount availability

## Testing Checklist

- [ ] Backup current fstab configuration
- [ ] Apply new mount options
- [ ] Test normal operation (read/write files)
- [ ] Test network interruption handling (disconnect NAS briefly)
- [ ] Verify fast failure instead of system hangs
- [ ] Monitor system stability over 24 hours
- [ ] Validate with Tdarr container operations

## Monitoring and Validation

### Success Criteria
- Mount operations fail within 30 seconds during network issues
- No kernel RCU stalls or deadlock messages in journal
- System remains responsive during NAS network problems
- Automatic remount when network connectivity is restored

### Long-term Monitoring
- Monitor journal for CIFS error patterns
- Track system stability metrics
- Validate performance impact of smaller buffers
- Ensure gaming and transcoding workloads remain unaffected
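
The journal checks listed above could be scripted along the following lines. This is a minimal sketch, not a complete detector: the function name `count_cifs_errors` is hypothetical, and the pattern list (the `CIFS: VFS:` message prefix, RCU stall reports, hung-task warnings) is an assumption about what the kernel log contains on this system and should be adjusted to the messages actually observed.

```bash
#!/bin/bash
# Sketch: count kernel log lines that suggest CIFS trouble.
# The function name and the pattern list are illustrative assumptions,
# not an exhaustive list of messages the cifs module can emit.

count_cifs_errors() {
    # Reads log text on stdin and prints the number of matching lines.
    # grep -c exits non-zero when nothing matches, so "|| true" keeps the
    # function usable under "set -e"; the count (0) is still printed.
    grep -Eic 'CIFS: VFS:|rcu.*stall|blocked for more than [0-9]+ seconds' || true
}

# Usage (assumes systemd's journalctl; -k limits output to kernel messages):
#   journalctl -k --since "24 hours ago" --no-pager | count_cifs_errors
```

Reading from stdin keeps the function independent of `journalctl`, so the same check can be run against a saved log file during the 24-hour stability test.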