---
title: "CIFS Mount Resilience Fixes"
description: "Improved CIFS fstab config to prevent kernel deadlocks during NAS network issues, with soft mounts, interrupt handling, reduced buffers, and systemd automount."
type: runbook
domain: networking
tags: [cifs, smb, nas, fstab, kernel, stability, truenas]
---

# CIFS Mount Resilience Improvements

**Date**: 2025-08-11

**Issue**: CIFS network errors escalating to kernel deadlocks and system crashes

**Target**: `/mnt/media` mount to NAS at 10.10.0.35

## Current Configuration Analysis

**Current fstab entry**:

```bash
//10.10.0.35/media /mnt/media cifs credentials=/home/cal/.samba_credentials,uid=1000,gid=1000,vers=3.1.1,cache=loose,rsize=16777216,wsize=16777216,bsize=4194304,actimeo=30,closetimeo=5,echo_interval=30,noperm 0 0
```

**Problems Identified**:

- No timeout tuning, so failed operations can hang for 90-plus seconds
- Aggressive 16 MB buffers (`rsize`/`wsize`) causing memory pressure during network issues
- No retry tuning (retries left at defaults), providing minimal resilience
- No `soft` option for graceful degradation, so failed operations block indefinitely
- No interrupt handling, preventing recovery from network deadlocks

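Before changing anything, it helps to confirm which options the kernel actually applied, since defaults silently fill in anything fstab omits (a sketch; requires util-linux `findmnt` and an active mount):

```shell
# Show the live mount options the kernel is using for /mnt/media
findmnt -t cifs -o TARGET,SOURCE,OPTIONS /mnt/media
```
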
## Recommended CIFS Mount Configuration

**New improved fstab entry**:

```bash
//10.10.0.35/media /mnt/media cifs credentials=/home/cal/.samba_credentials,uid=1000,gid=1000,vers=3.1.1,soft,intr,timeo=15,retrans=3,rsize=1048576,wsize=1048576,cache=loose,actimeo=10,echo_interval=60,_netdev,noauto,x-systemd.automount,x-systemd.device-timeout=10,x-systemd.mount-timeout=30,noperm 0 0
```

## Key Improvements Explained

### Better Timeout Handling

- **`timeo=15`** - 15-second timeout for RPC calls (prevents 90-second hangs)
- **`retrans=3`** - 3 retry attempts instead of 1
- **`x-systemd.device-timeout=10`** - 10-second systemd device timeout
- **`x-systemd.mount-timeout=30`** - 30-second mount operation timeout

### Graceful Error Recovery

- **`soft`** - Allows operations to fail instead of hanging indefinitely
- **`intr`** - Allows the kernel to interrupt hung operations (critical for preventing deadlocks)
- **`_netdev`** - Marks the mount as network-dependent for proper boot ordering
- **`noauto,x-systemd.automount`** - Mounts on first access instead of at boot

### Preventing Kernel Deadlocks

- **Smaller buffer sizes** - `rsize=1048576,wsize=1048576` (1 MB instead of 16 MB) reduces memory pressure
- **`actimeo=10`** - Shorter attribute cache timeout (10 s vs 30 s) for faster error detection
- **`echo_interval=60`** - Longer keepalive interval reduces network chatter

### Network Interruption Resilience

- **`cache=loose`** - Keeps loose caching for better performance during network issues
- **Layered timeouts** - Multiple timeout layers prevent a single failure from hanging the system

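systemd derives mount unit names by escaping the mount path, so the `x-systemd.automount` option above should produce `mnt-media.automount`. A quick way to confirm (a sketch, assuming systemd and the fstab entry above):

```shell
# Derive the unit name systemd will use for this mount path
systemd-escape -p --suffix=mount /mnt/media   # prints: mnt-media.mount

# After a daemon-reload, the fstab generator should have created the unit
systemctl status mnt-media.automount
```
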
## Implementation Steps

### Step 1: Backup Current Configuration

```bash
sudo cp /etc/fstab /etc/fstab.backup
```

### Step 2: Update /etc/fstab

Replace the current line with the recommended configuration above.

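Because systemd generates mount units from fstab, it is worth validating the edit and reloading before testing (a sketch; `findmnt --verify` checks fstab syntax):

```shell
# Validate fstab syntax (flags bad fields, unknown filesystems, etc.)
sudo findmnt --verify

# Regenerate systemd mount/automount units from the edited fstab
sudo systemctl daemon-reload
```
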
### Step 3: Test the New Configuration

```bash
# Unmount the current mount
sudo umount /mnt/media

# Remount with the new options
sudo mount /mnt/media

# Verify the new mount options are active
mount | grep /mnt/media
```

### Step 4: Validate Network Resilience

Temporarily interrupt NAS connectivity (e.g. disconnect its network cable for ~30 seconds), then confirm operations fail gracefully instead of hanging the system:

```bash
# With the NAS unreachable, this should return an error within the
# configured timeouts rather than hang indefinitely
time ls /mnt/media

# Watch the kernel log for CIFS errors (and the absence of RCU stalls)
sudo dmesg --follow | grep -i cifs
```

## Additional System-Level Protections

### 1. Network Monitoring Script

Create a monitoring script to detect NAS connectivity issues:

```bash
#!/bin/bash
# /mnt/NV2/Development/claude-home/scripts/monitoring/nas-connectivity-monitor.sh
# Ping the NAS once with a 5-second timeout; report only on failure
ping -c 1 -W 5 10.10.0.35 > /dev/null || echo "NAS connectivity issue detected"
```

### 2. Systemd Service Dependencies

Configure services to gracefully handle mount failures:

```ini
# Add to the [Unit] section of services that depend on /mnt/media
After=mnt-media.mount
Wants=mnt-media.mount
```

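One way to apply this without editing vendor unit files is a systemd drop-in (a sketch; `myservice` is a placeholder for the real unit name):

```shell
# Create a drop-in adding the mount dependency ("myservice" is hypothetical)
sudo mkdir -p /etc/systemd/system/myservice.service.d
sudo tee /etc/systemd/system/myservice.service.d/10-media-mount.conf <<'EOF'
[Unit]
After=mnt-media.mount
Wants=mnt-media.mount
EOF
sudo systemctl daemon-reload
```
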
### 3. Kernel Parameter Tuning

`CIFSMaxBufSize` is a cifs kernel module parameter (not a sysctl), so it is set at module load time rather than via `/etc/sysctl.conf`. A conservative sketch, should buffer tuning prove necessary:

```bash
# Inspect the current value
cat /sys/module/cifs/parameters/CIFSMaxBufSize

# Set it persistently via modprobe configuration (takes effect after the
# cifs module is reloaded or the system is rebooted)
echo "options cifs CIFSMaxBufSize=16384" | sudo tee /etc/modprobe.d/cifs.conf
```

## Expected Improvements

After implementing these changes:

### Immediate Benefits

- **No more 90-second hangs** - Operations fail fast with 15-second timeouts
- **Graceful error recovery** - `intr` allows the kernel to interrupt hung operations
- **Reduced memory pressure** - 1 MB buffers instead of 16 MB
- **Better retry behavior** - 3 retry attempts instead of 1

### System Stability

- **Prevents kernel deadlocks** - Operations can be interrupted and retried
- **Faster error detection** - 10-second attribute cache timeout
- **Automatic recovery** - systemd auto-mounting handles reconnection

### Performance

- **Maintained caching benefits** - `cache=loose` preserves performance
- **Reduced network overhead** - 60-second keepalive intervals
- **Efficient buffer usage** - 1 MB buffers balance performance and stability

## Files to Modify

1. **`/etc/fstab`** - Primary mount configuration
2. **Optional monitoring scripts** - NAS connectivity checks
3. **Service configurations** - Dependencies on mount availability

## Testing Checklist

- [ ] Back up the current fstab configuration
- [ ] Apply the new mount options
- [ ] Test normal operation (read/write files)
- [ ] Test network interruption handling (disconnect the NAS briefly)
- [ ] Verify fast failure instead of system hangs
- [ ] Monitor system stability over 24 hours
- [ ] Validate with Tdarr container operations

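The read/write item can be scripted as a quick smoke test (a sketch; the filename is arbitrary):

```shell
# Write, read back, and clean up a throwaway file on the mount
TESTFILE=/mnt/media/.cifs-smoke-test.$$
if echo ok > "$TESTFILE" && grep -q ok "$TESTFILE"; then
    echo "read/write OK"
else
    echo "read/write FAILED"
fi
rm -f "$TESTFILE"
```
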
## Monitoring and Validation

### Success Criteria

- Mount operations fail within 30 seconds during network issues
- No kernel RCU stalls or deadlock messages in the journal
- System remains responsive during NAS network problems
- Automatic remount when network connectivity is restored

### Long-term Monitoring

- Monitor the journal for CIFS error patterns
- Track system stability metrics
- Validate the performance impact of smaller buffers
- Ensure gaming and transcoding workloads remain unaffected
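
A sketch of a recurring journal check for CIFS errors and RCU stalls (could be run from cron or a systemd timer):

```shell
# Scan the last 24 hours of kernel messages for CIFS errors and RCU stalls
if journalctl -k --since "24 hours ago" --no-pager \
     | grep -Eiq "cifs.*(error|timed out)|rcu.*stall"; then
    echo "CIFS/RCU issues found in the last 24h - investigate"
else
    echo "No CIFS/RCU issues in the last 24h"
fi
```
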