
---
title: CIFS Mount Resilience Fixes
description: Improved CIFS fstab config to prevent kernel deadlocks during NAS network issues, with soft mounts, interrupt handling, reduced buffers, and systemd automount.
type: runbook
domain: networking
tags:
  - cifs
  - smb
  - nas
  - fstab
  - kernel
  - stability
  - truenas
---
# CIFS Mount Resilience Improvements

- **Date:** 2025-08-11
- **Issue:** CIFS network errors escalating to kernel deadlocks and system crashes
- **Target:** `/mnt/media` mount to NAS at 10.10.0.35

## Current Configuration Analysis

Current fstab entry:

```
//10.10.0.35/media /mnt/media cifs credentials=/home/cal/.samba_credentials,uid=1000,gid=1000,vers=3.1.1,cache=loose,rsize=16777216,wsize=16777216,bsize=4194304,actimeo=30,closetimeo=5,echo_interval=30,noperm 0 0
```

Problems identified:

- Missing critical timeout options, leading to 90-second hangs
- Aggressive buffer sizes (16 MB) causing memory pressure during network issues
- No retry tuning (`retrans` unset), providing minimal resilience
- No explicit error handling for graceful degradation
- Missing interrupt handling, preventing recovery from network deadlocks

New improved fstab entry:

```
//10.10.0.35/media /mnt/media cifs credentials=/home/cal/.samba_credentials,uid=1000,gid=1000,vers=3.1.1,soft,intr,timeo=15,retrans=3,rsize=1048576,wsize=1048576,cache=loose,actimeo=10,echo_interval=60,_netdev,noauto,x-systemd.automount,x-systemd.device-timeout=10,x-systemd.mount-timeout=30,noperm 0 0
```
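
The option string is dense; splitting the fourth fstab field one option per line makes it easier to review. A small sketch (the line is inlined here for illustration; in practice you would grep it out of `/etc/fstab`):

```sh
#!/bin/sh
# Split the options field (field 4) of a CIFS fstab line, one option
# per line, so each setting can be reviewed at a glance.
line='//10.10.0.35/media /mnt/media cifs credentials=/home/cal/.samba_credentials,uid=1000,gid=1000,vers=3.1.1,soft,intr,timeo=15,retrans=3,rsize=1048576,wsize=1048576,cache=loose,actimeo=10,echo_interval=60,_netdev,noauto,x-systemd.automount,x-systemd.device-timeout=10,x-systemd.mount-timeout=30,noperm 0 0'

printf '%s\n' "$line" | awk '{print $4}' | tr ',' '\n'
```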

## Key Improvements Explained

### Better Timeout Handling

- `timeo=15` - 15-second timeout for RPC calls (prevents 90-second hangs)
- `retrans=3` - 3 retry attempts before giving up
- `x-systemd.device-timeout=10` - 10-second systemd device timeout
- `x-systemd.mount-timeout=30` - 30-second mount operation timeout

### Graceful Error Recovery

- `soft` - Allows operations to fail instead of hanging indefinitely
- `intr` - Allows the kernel to interrupt hung operations (critical for preventing deadlocks)
- `_netdev` - Marks a network dependency for proper boot ordering
- `noauto,x-systemd.automount` - Mounts on first access via a systemd automount unit; add `x-systemd.idle-timeout=` if you also want idle unmounting
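
With `noauto,x-systemd.automount`, systemd generates `mnt-media.automount` and `mnt-media.mount` units from the fstab entry; unit names are derived from the mount point with `/` mapped to `-`. A sketch of that naming rule for simple paths (for paths with special characters, use `systemd-escape -p --suffix=mount /mnt/media` instead):

```sh
#!/bin/sh
# Derive systemd unit names for a mount point. Handles simple paths
# only; systemd-escape is the authoritative tool for the general case.
unit_for_mountpoint() {
    mp=$1
    # Strip the leading slash, then map remaining slashes to dashes.
    echo "${mp#/}" | tr '/' '-'
}

echo "$(unit_for_mountpoint /mnt/media).automount"  # on-demand trigger unit
echo "$(unit_for_mountpoint /mnt/media).mount"      # the actual mount unit
```

After a `daemon-reload`, `systemctl status mnt-media.automount` shows whether the trigger is armed.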

### Preventing Kernel Deadlocks

- **Smaller buffer sizes** - `rsize=1048576,wsize=1048576` (1 MB instead of 16 MB) reduces memory pressure
- `actimeo=10` - Shorter attribute cache timeout (10 s vs 30 s) for faster error detection
- `echo_interval=60` - Longer keepalive interval reduces network chatter

### Network Interruption Resilience

- `cache=loose` - Maintains loose caching for better performance during network issues
- **Combined timeout strategy** - Multiple timeout layers prevent a single failure from hanging the system

## Implementation Steps

### Step 1: Backup Current Configuration

```
sudo cp /etc/fstab /etc/fstab.backup
```

### Step 2: Update /etc/fstab

Replace the current line with the recommended configuration above.

### Step 3: Test the New Configuration

```
# Unmount the current mount
sudo umount /mnt/media

# Reload systemd so the new x-systemd.* options are picked up
sudo systemctl daemon-reload

# Remount with the new options
sudo mount /mnt/media

# Verify the new mount options are active
mount | grep /mnt/media
```
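
To make that verification scriptable, the active options can be checked against the mounts table. A sketch; the helper reads a `/proc/self/mounts`-style table on stdin so it can also be exercised against a sample line:

```sh
#!/bin/sh
# Return success if the given mount point is mounted with the given
# option. Reads a /proc/self/mounts-style table on stdin.
mount_has_opt() {
    mp=$1; opt=$2
    awk -v mp="$mp" -v opt="$opt" '
        $2 == mp {
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++) if (opts[i] == opt) found = 1
        }
        END { exit found ? 0 : 1 }'
}

# Against the live system:
#   mount_has_opt /mnt/media soft < /proc/self/mounts && echo "soft is active"
```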

### Step 4: Validate Network Resilience

```
# Test timeout behavior with network simulation
# (temporarily disconnect the NAS network cable for 30 seconds)
# Verify mount operations fail gracefully instead of hanging the system
```
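
The fail-fast expectation can be checked with a bounded probe: `timeout` kills the command if it exceeds the budget, so a dead mount cannot hang the shell. A sketch; `probe_mount` and the 30-second budget are illustrative:

```sh
#!/bin/sh
# Probe a directory with a hard time budget. A healthy mount reports OK,
# a soft-mounted dead share reports FAIL quickly, and a hung mount
# reports HANG after the budget expires (instead of blocking forever).
probe_mount() {
    dir=$1; budget=${2:-30}
    if timeout "$budget" ls "$dir" > /dev/null 2>&1; then
        echo "OK: $dir responded within ${budget}s"
    elif [ $? -eq 124 ]; then
        echo "HANG: $dir still blocked after ${budget}s"
    else
        echo "FAIL: $dir returned an error quickly (expected with soft)"
    fi
}

# probe_mount /mnt/media 30
```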

## Additional System-Level Protections

### 1. Network Monitoring Script

Create a monitoring script to detect NAS connectivity issues:

```
#!/bin/bash
# /mnt/NV2/Development/claude-home/scripts/monitoring/nas-connectivity-monitor.sh
ping -c 1 -W 5 10.10.0.35 || echo "NAS connectivity issue detected"
```

### 2. Systemd Service Dependencies

Configure services to handle mount failures gracefully:

```
# Add to the [Unit] section of services that depend on /mnt/media
After=mnt-media.mount
Wants=mnt-media.mount
```
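
A fuller variant is to put the dependency in a drop-in rather than editing the unit itself. `tdarr.service` is just an example name (Tdarr appears in the testing checklist below); `RequiresMountsFor=` is often preferable because systemd derives the mount and ordering dependencies from the path automatically:

```ini
# Hypothetical drop-in: /etc/systemd/system/tdarr.service.d/media-mount.conf
[Unit]
# Pull in and order after whatever unit provides /mnt/media
RequiresMountsFor=/mnt/media
```

Run `sudo systemctl daemon-reload` after adding a drop-in.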

### 3. Kernel Parameter Tuning

Consider tuning the CIFS module's buffer limit if needed:

```
# CIFSMaxBufSize is a cifs.ko module parameter (in bytes), not a sysctl.
# Check the current value:
cat /sys/module/cifs/parameters/CIFSMaxBufSize
# To change it, set a module option (e.g. in /etc/modprobe.d/cifs.conf):
#   options cifs CIFSMaxBufSize=130048
```

## Expected Improvements

After implementing these changes:

### Immediate Benefits

- **No more 90-second hangs** - Operations fail fast with 15-second timeouts
- **Graceful error recovery** - `intr` allows the kernel to interrupt hung operations
- **Reduced memory pressure** - Smaller 1 MB buffers instead of 16 MB
- **Better retry behavior** - Up to 3 retransmission attempts before failing

### System Stability

- **Prevents kernel deadlocks** - Operations can be interrupted and retried
- **Faster error detection** - 10-second attribute cache timeout
- **Automatic recovery** - systemd auto-mounting handles reconnection

### Performance

- **Maintained caching benefits** - `cache=loose` preserves performance
- **Reduced network overhead** - 60-second keepalive intervals
- **Efficient buffer usage** - 1 MB buffers balance performance and stability

## Files to Modify

1. `/etc/fstab` - Primary mount configuration
2. Optional monitoring scripts - NAS connectivity checks
3. Service configurations - Dependencies on mount availability

## Testing Checklist

- [ ] Backup current fstab configuration
- [ ] Apply new mount options
- [ ] Test normal operation (read/write files)
- [ ] Test network interruption handling (disconnect NAS briefly)
- [ ] Verify fast failure instead of system hangs
- [ ] Monitor system stability over 24 hours
- [ ] Validate with Tdarr container operations

## Monitoring and Validation

### Success Criteria

- Mount operations fail within 30 seconds during network issues
- No kernel RCU stalls or deadlock messages in the journal
- System remains responsive during NAS network problems
- Automatic remount when network connectivity is restored
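
A simple journal scan covers the first two criteria. A sketch; the patterns are illustrative rather than exhaustive, and the helper reads log lines on stdin so it works on `journalctl -k` output or a saved log:

```sh
#!/bin/sh
# Scan kernel log lines (stdin) for CIFS errors and RCU stall reports.
# Prints matching lines; prints nothing on a healthy log.
scan_cifs_health() {
    grep -E -i 'cifs.*(error|timed out|reconnect)|rcu.*stall' || true
}

# Against the live journal:
#   journalctl -k --since "-24h" | scan_cifs_health
```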

### Long-term Monitoring

- Monitor the journal for CIFS error patterns
- Track system stability metrics
- Validate the performance impact of smaller buffers
- Ensure gaming and transcoding workloads remain unaffected