claude-home/networking/examples/cifs-mount-resilience-fixes.md

# CIFS Mount Resilience Improvements

**Date**: 2025-08-11
**Issue**: CIFS network errors escalating to kernel deadlocks and system crashes
**Target**: /mnt/media mount to NAS at 10.10.0.35

## Current Configuration Analysis

**Current fstab entry**:
```bash
//10.10.0.35/media /mnt/media cifs credentials=/home/cal/.samba_credentials,uid=1000,gid=1000,vers=3.1.1,cache=loose,rsize=16777216,wsize=16777216,bsize=4194304,actimeo=30,closetimeo=5,echo_interval=30,noperm 0 0
```

**Problems Identified**:
- Missing critical timeout options, leading to 90-second hangs
- Aggressive buffer sizes (`rsize`/`wsize` of 16 MiB) causing memory pressure during network issues
- Limited retry attempts (`retrans=1`, which is not set in the fstab entry and is presumably the effective default) providing minimal resilience
- No explicit error handling for graceful degradation
- Missing interruption handling, preventing recovery from network deadlocks
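
To confirm which values are actually in effect, including defaults that never appear in fstab, the options of the live mount can be read back and queried. The following is a minimal sketch, assuming `findmnt` from util-linux is installed; `get_opt` is a hypothetical helper name, not part of any existing tooling.

```bash
#!/bin/bash
# get_opt OPTIONS KEY
# Print the value of KEY from a comma-separated mount option string.
# Flag options without a value (e.g. "soft") print the flag name itself.
# Returns 1 if the option is absent.
get_opt() {
    printf '%s\n' "$1" | tr ',' '\n' | awk -F= -v key="$2" '
        $1 == key { print ($2 == "" ? $1 : $2); found = 1 }
        END { exit !found }'
}

# Demo against a shortened copy of the current fstab options
current='uid=1000,gid=1000,vers=3.1.1,cache=loose,rsize=16777216,wsize=16777216'
get_opt "$current" rsize

# Against the live mount (requires the share to be mounted):
#   get_opt "$(findmnt -no OPTIONS /mnt/media)" rsize
```

Querying the live mount rather than fstab shows what the kernel negotiated, which can differ from the requested values.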

## Recommended CIFS Mount Configuration

**New improved fstab entry**:
```bash
//10.10.0.35/media /mnt/media cifs credentials=/home/cal/.samba_credentials,uid=1000,gid=1000,vers=3.1.1,soft,intr,timeo=15,retrans=3,rsize=1048576,wsize=1048576,cache=loose,actimeo=10,echo_interval=60,_netdev,noauto,x-systemd.automount,x-systemd.device-timeout=10,x-systemd.mount-timeout=30,noperm 0 0
```

## Key Improvements Explained

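### Better Timeout Handling
- **`timeo=15`** - Intended as a 15-second per-request timeout to prevent 90-second hangs. Note that `timeo` is an NFS option (where the unit is tenths of a second, so `15` would mean 1.5 seconds) and is not listed in `mount.cifs(8)`; confirm that the running kernel accepts it on a CIFS mount before relying on it
- **`retrans=3`** - 3 retry attempts instead of 1. This option originates from NFS, so verify that the installed kernel's CIFS client supports it
- **`x-systemd.device-timeout=10`** - systemd waits at most 10 seconds for the device before giving up on the fstab entry
- **`x-systemd.mount-timeout=30`** - systemd gives up on the mount operation after 30 seconds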

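### Graceful Error Recovery
- **`soft`** - Allows operations to fail with an error instead of hanging indefinitely (`mount.cifs(8)` lists this as the default, so setting it mainly makes the behavior explicit)
- **`intr`** - Intended to let hung operations be interrupted (CRITICAL for preventing deadlocks). `mount.cifs(8)` documents `intr` as currently unimplemented, so it should not be the only safeguard
- **`_netdev`** - Marks the mount as network-dependent for proper boot ordering
- **`noauto,x-systemd.automount`** - Mounts on first access instead of at boot. Unmounting when idle additionally requires `x-systemd.idle-timeout=`, which this entry does not set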

### Preventing Kernel Deadlocks
- **Smaller buffer sizes** - `rsize=1048576,wsize=1048576` (1 MiB instead of 16 MiB) reduces memory pressure
- **`actimeo=10`** - Shorter attribute cache lifetime (10s vs 30s), so cached metadata is revalidated against the server sooner and errors surface faster
- **`echo_interval=60`** - Longer keepalive interval (60s vs 30s) reduces network chatter, at the cost of slower detection of an unresponsive server

### Network Interruption Resilience
- **`cache=loose`** - Keeps the existing loose caching mode, which favors performance over strict cache coherency
- **Combined timeout strategy** - Multiple timeout layers prevent a single failure from hanging the system

## Implementation Steps

### Step 1: Backup Current Configuration
```bash
sudo cp /etc/fstab /etc/fstab.backup
```

### Step 2: Update /etc/fstab
Replace the current line with the recommended configuration above.
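
Before saving the edit, it can help to see exactly which options the change adds and removes. The sketch below compares two comma-separated option strings; `opt_diff` is a hypothetical helper name, and the demo uses shortened option lists taken from the entries above.

```bash
#!/bin/bash
# opt_diff OLD NEW
# Print options present only in OLD prefixed with "-",
# and options present only in NEW prefixed with "+".
opt_diff() {
    old_file=$(mktemp) || return 1
    new_file=$(mktemp) || return 1
    printf '%s\n' "$1" | tr ',' '\n' | sort > "$old_file"
    printf '%s\n' "$2" | tr ',' '\n' | sort > "$new_file"
    comm -23 "$old_file" "$new_file" | sed 's/^/-/'
    comm -13 "$old_file" "$new_file" | sed 's/^/+/'
    rm -f "$old_file" "$new_file"
}

# Demo with a subset of the old and new options
opt_diff 'cache=loose,actimeo=30,echo_interval=30' \
         'soft,cache=loose,actimeo=10,echo_interval=60'
```

After saving, `findmnt --verify` (util-linux) can be used to check the edited fstab for syntax problems before remounting.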

### Step 3: Test the New Configuration
```bash
# Unmount current mount
sudo umount /mnt/media

# Make systemd regenerate its mount units from the edited fstab
sudo systemctl daemon-reload

# Remount with new options
sudo mount /mnt/media

# Verify new mount options are active
mount | grep /mnt/media
```

### Step 4: Validate Network Resilience
```bash
# Test timeout behavior with network simulation
# (Temporarily disconnect NAS network cable for 30 seconds)
# Verify mount operations fail gracefully instead of hanging system
```
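
The manual test above can be paired with a timed probe, so that "fails gracefully" becomes something measurable. This is a minimal sketch, assuming `timeout` from GNU coreutils; `probe_path` is a hypothetical helper name and the 30-second limit is an arbitrary choice.

```bash
#!/bin/bash
# probe_path PATH [SECONDS]
# Prints "ok" if PATH can be listed within the limit, "timeout" if the
# listing had to be terminated, and "error" if it failed on its own.
# The exit status of the underlying command is returned.
probe_path() {
    limit=${2:-30}
    # -k 5: send SIGKILL if the command ignores SIGTERM for 5 more seconds
    timeout -k 5 "$limit" ls "$1" >/dev/null 2>&1
    status=$?
    case $status in
        0)       echo ok ;;
        124|137) echo timeout ;;
        *)       echo error ;;
    esac
    return "$status"
}

# Run while the NAS cable is disconnected
result=$(probe_path /mnt/media 30) || true
echo "probe of /mnt/media: $result"
```

A process stuck in uninterruptible sleep cannot be killed even by SIGKILL, so if the probe itself never returns, that should be treated as a failed test rather than a tooling problem.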

## Additional System-Level Protections

### 1. Network Monitoring Script
Create a monitoring script to detect NAS connectivity issues:
```bash
#!/bin/bash
# /mnt/NV2/Development/claude-home/scripts/monitoring/nas-connectivity-monitor.sh
ping -c 1 -W 5 10.10.0.35 || echo "NAS connectivity issue detected"
```

### 2. Systemd Service Dependencies
Configure services to gracefully handle mount failures:
```ini
# Add to the [Unit] section of services that depend on /mnt/media
[Unit]
After=mnt-media.mount
Wants=mnt-media.mount
```
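
Rather than editing unit files directly, the same two directives can be placed in a drop-in. The sketch below writes one; `write_mount_dropin`, the file name `10-media-mount.conf`, and `example.service` are all hypothetical names chosen for illustration.

```bash
#!/bin/bash
# write_mount_dropin SERVICE [BASE_DIR]
# Create a drop-in that orders SERVICE after the /mnt/media mount.
# BASE_DIR defaults to /etc/systemd/system. Prints the path written.
write_mount_dropin() {
    service=$1
    base=${2:-/etc/systemd/system}
    dir="$base/$service.d"
    mkdir -p "$dir" || return 1
    cat > "$dir/10-media-mount.conf" <<'EOF'
[Unit]
After=mnt-media.mount
Wants=mnt-media.mount
EOF
    echo "$dir/10-media-mount.conf"
}

# Dry run into a scratch directory; "example.service" is a placeholder
scratch=$(mktemp -d)
cat "$(write_mount_dropin example.service "$scratch")"
rm -rf "$scratch"
```

When writing to the real `/etc/systemd/system`, the script needs root, and `systemctl daemon-reload` must be run afterwards for the drop-in to take effect. `Wants=` keeps the dependency soft, so the service still starts if the mount fails.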

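### 3. Kernel Parameter Tuning
Consider tuning the CIFS module parameters. `CIFSMaxBufSize` is a buffer size in bytes rather than a timeout, and it is a module parameter, so it is configured through modprobe options instead of `/etc/sysctl.conf`:
```bash
# Module parameter (bytes), not a sysctl: inspect the current value
cat /sys/module/cifs/parameters/CIFSMaxBufSize
# To change it, set a modprobe option and reload the module (value is a placeholder)
echo "options cifs CIFSMaxBufSize=16384" | sudo tee /etc/modprobe.d/cifs.conf
```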

## Expected Improvements

After implementing these changes:

### Immediate Benefits
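- **No more 90-second hangs** - Operations are expected to fail fast instead of blocking, provided the timeout options take effect on this kernel
- **Graceful error recovery** - `soft` lets stalled operations return an error; `intr` is included for interruptibility but is documented as unimplemented for CIFS
- **Reduced memory pressure** - Smaller 1 MiB buffers vs 16 MiB
- **Better retry behavior** - 3 attempts instead of 1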

### System Stability
- **Prevents kernel deadlocks** - Operations can be interrupted and retried
- **Faster error detection** - 10-second attribute cache timeout
- **Automatic recovery** - The systemd automount unit remounts the share on the next access after a failure

### Performance
- **Maintained caching benefits** - `cache=loose` preserves performance
- **Reduced network overhead** - 60-second keepalive intervals
- **Efficient buffer usage** - 1MB buffers balance performance and stability

## Files to Modify

1. **`/etc/fstab`** - Primary mount configuration
2. **Optional monitoring scripts** - NAS connectivity checks
3. **Service configurations** - Dependencies on mount availability

## Testing Checklist

- [ ] Backup current fstab configuration
- [ ] Apply new mount options
- [ ] Test normal operation (read/write files)
- [ ] Test network interruption handling (disconnect NAS briefly)
- [ ] Verify fast failure instead of system hangs
- [ ] Monitor system stability over 24 hours
- [ ] Validate with Tdarr container operations

## Monitoring and Validation

### Success Criteria
- Mount operations fail within 30 seconds during network issues
- No kernel RCU stalls or deadlock messages in journal
- System remains responsive during NAS network problems
- Automatic remount when network connectivity is restored
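
The journal criterion can be checked with a simple filter over kernel messages. The patterns below are an assumption about what the relevant lines look like (RCU stall reports, hung-task warnings, and anything tagged CIFS) and may need adjusting; `filter_stall_messages` is a hypothetical helper name.

```bash
#!/bin/bash
# filter_stall_messages
# Read kernel log lines on stdin and print those that suggest a stall,
# a hung task, or a CIFS problem. Exits non-zero if nothing matches.
filter_stall_messages() {
    grep -Ei 'rcu.*stall|blocked for more than [0-9]+ seconds|CIFS'
}

# Usage against the current boot's kernel log:
#   journalctl -k -b --no-pager | filter_stall_messages
```

An empty result after a disconnect test is the desired outcome; any match is worth reading in full context with `journalctl -k -b`.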

### Long-term Monitoring
- Monitor journal for CIFS error patterns
- Track system stability metrics
- Validate performance impact of smaller buffers
- Ensure gaming and transcoding workloads remain unaffected