# CIFS Mount Resilience Improvements
**Date**: 2025-08-11
**Issue**: CIFS network errors escalating to kernel deadlocks and system crashes
**Target**: /mnt/media mount to NAS at 10.10.0.35
## Current Configuration Analysis
**Current fstab entry**:
```bash
//10.10.0.35/media /mnt/media cifs credentials=/home/cal/.samba_credentials,uid=1000,gid=1000,vers=3.1.1,cache=loose,rsize=16777216,wsize=16777216,bsize=4194304,actimeo=30,closetimeo=5,echo_interval=30,noperm 0 0
```
**Problems Identified**:
- Missing critical timeout options leading to 90-second hangs
- Aggressive buffer sizes (16MB) causing memory pressure during network issues
- Limited retry attempts (retrans=1) providing minimal resilience
- No explicit error handling for graceful degradation
- Missing interruption handling preventing recovery from network deadlocks
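To make the buffer-size point concrete, the sketch below pulls a single value out of a comma-separated option string and converts it to MiB. `get_opt` and `to_mib` are hypothetical helper names introduced here for illustration; they are not part of any CIFS tooling.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for inspecting a mount option string.

# Print the value of KEY from a comma-separated option list.
# Returns non-zero (and prints nothing) if KEY is absent or has no value.
get_opt() {
    local opts=$1 key=$2 item
    local -a items
    IFS=',' read -r -a items <<< "$opts"
    for item in "${items[@]}"; do
        if [[ $item == "$key="* ]]; then
            printf '%s\n' "${item#*=}"
            return 0
        fi
    done
    return 1
}

# Convert a byte count to whole MiB using integer arithmetic.
to_mib() {
    printf '%s\n' "$(( $1 / 1024 / 1024 ))"
}

# A subset of the options from the current fstab entry.
current_opts='vers=3.1.1,cache=loose,rsize=16777216,wsize=16777216,bsize=4194304,actimeo=30'
echo "rsize: $(to_mib "$(get_opt "$current_opts" rsize)") MiB"
```

Run against the current options this reports an `rsize` of 16 MiB; the recommended `rsize=1048576` works out to 1 MiB.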
## Recommended CIFS Mount Configuration
**New improved fstab entry**:
```bash
//10.10.0.35/media /mnt/media cifs credentials=/home/cal/.samba_credentials,uid=1000,gid=1000,vers=3.1.1,soft,intr,timeo=15,retrans=3,rsize=1048576,wsize=1048576,cache=loose,actimeo=10,echo_interval=60,_netdev,noauto,x-systemd.automount,x-systemd.device-timeout=10,x-systemd.mount-timeout=30,noperm 0 0
```
## Key Improvements Explained
### Better Timeout Handling
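- **`timeo=15`** - Intended as a 15-second per-request timeout to prevent 90-second hangs. Note that `timeo` is documented as an NFS option (where the unit is tenths of a second) and is not listed in `mount.cifs(8)`; confirm the kernel accepts it before relying on it
- **`retrans=3`** - 3 retry attempts instead of 1 (requires a kernel whose CIFS client supports `retrans`; older kernels may reject the option)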
- **`x-systemd.device-timeout=10`** - 10-second systemd device timeout
- **`x-systemd.mount-timeout=30`** - 30-second mount operation timeout
### Graceful Error Recovery
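- **`soft`** - Allows operations to fail instead of hanging indefinitely (this is already the CIFS default; it is stated explicitly here)
- **`intr`** - Intended to let hung operations be interrupted. `mount.cifs(8)` lists `intr` as currently unimplemented, so it should not be relied on as the main defense against deadlocks
- **`_netdev`** - Indicates network dependency for proper boot ordering
- **`noauto,x-systemd.automount`** - Mounts on first access instead of at boot. Unmounting when idle is not enabled by these options alone; it additionally requires `x-systemd.idle-timeout=`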
### Preventing Kernel Deadlocks
- **Smaller buffer sizes** - `rsize=1048576,wsize=1048576` (1MB instead of 16MB) reduces memory pressure
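- **`actimeo=10`** - Shorter attribute cache timeout (10s vs 30s), so cached file metadata is revalidated against the server sooner
- **`echo_interval=60`** - Longer keepalive interval (60s vs 30s) reduces network chatter. The trade-off is slower detection of an unresponsive server, since the client reconnects at twice the echo interval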
### Network Interruption Resilience
- **`cache=loose`** - Maintains loose caching for better performance with network issues
- **Combined timeout strategy** - Multiple timeout layers prevent single failure from hanging system
## Implementation Steps
### Step 1: Backup Current Configuration
```bash
sudo cp /etc/fstab /etc/fstab.backup
```
### Step 2: Update /etc/fstab
Replace the current line with the recommended configuration above.
### Step 3: Test the New Configuration
```bash
# Reload systemd so it regenerates mount units from the edited fstab
sudo systemctl daemon-reload
# Unmount current mount
sudo umount /mnt/media
# Remount with new options
sudo mount /mnt/media
# Verify new mount options are active
mount | grep /mnt/media
```
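Because the kernel may drop, rename, or normalize options, it can help to compare what was requested with what is actually active. The following is a minimal sketch, assuming `findmnt` from util-linux is available; `missing_opts` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: list requested options absent from the active set.

# Print each option in WANTED (comma-separated) that does not appear
# verbatim in ACTIVE (comma-separated), one per line.
missing_opts() {
    local wanted=$1 active=$2 opt
    local -a items
    IFS=',' read -r -a items <<< "$wanted"
    for opt in "${items[@]}"; do
        case ",$active," in
            *",$opt,"*) ;;                  # present, nothing to report
            *) printf '%s\n' "$opt" ;;      # requested but not active
        esac
    done
}

# Usage against the live mount (requires the share to be mounted):
#   missing_opts 'soft,rsize=1048576,wsize=1048576,cache=loose' \
#                "$(findmnt -n -t cifs -o OPTIONS /mnt/media)"

# Self-contained demonstration with a made-up active option string:
missing_opts 'soft,rsize=1048576,cache=loose' 'rw,relatime,vers=3.1.1,cache=loose,soft'
```

The comparison is purely textual. Options handled in user space (for example `credentials=`, `noauto`, or the `x-systemd.*` entries) may never appear in the kernel's list, and some values may be reported under different names, so a reported "missing" option is a prompt to look closer rather than proof of a problem.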
### Step 4: Validate Network Resilience
```bash
# Test timeout behavior with network simulation
# (Temporarily disconnect NAS network cable for 30 seconds)
# Verify mount operations fail gracefully instead of hanging system
```
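While the NAS is disconnected, the fail-fast behavior can be measured rather than eyeballed. The sketch below wraps `stat` in GNU `timeout`; `probe_path` is a hypothetical helper name and the 20-second limit is an arbitrary choice for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helper: check that a path answers within a deadline.

# Return 0 if `stat PATH` completes successfully within LIMIT seconds.
# Returns non-zero on error or on timeout (GNU timeout exits with 124).
probe_path() {
    local path=$1 limit=${2:-20}
    timeout "$limit" stat -- "$path" > /dev/null 2>&1
}

# Time how long the mount point takes to answer.
start=$(date +%s)
if probe_path /mnt/media 20; then
    echo "mount responded"
else
    echo "mount failed or timed out"
fi
echo "elapsed: $(( $(date +%s) - start ))s"
```

One caveat: `timeout` works by sending a signal, so a process stuck in uninterruptible sleep may not be killed. If the script itself hangs well past the limit, that is a sign the mount is still blocking in the kernel.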
## Additional System-Level Protections
### 1. Network Monitoring Script
Create a monitoring script to detect NAS connectivity issues:
```bash
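#!/bin/bash
# /mnt/NV2/Development/claude-home/scripts/monitoring/nas-connectivity-monitor.sh
# Send one ping with a 5-second reply timeout; report and exit non-zero on failure
if ! ping -c 1 -W 5 10.10.0.35 > /dev/null 2>&1; then
    echo "NAS connectivity issue detected" >&2
    exit 1
fi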
```
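A successful ping only shows that the host is up, not that the SMB service is answering. As a complementary check, the sketch below tries a TCP connection to port 445 (the standard SMB port) using bash's built-in `/dev/tcp` redirection. `tcp_port_open` is a hypothetical helper name, and the approach assumes a bash build with `/dev/tcp` support.

```bash
#!/usr/bin/env bash
# Hypothetical helper: test whether a TCP port accepts connections.

# Return 0 if HOST:PORT accepts a TCP connection within LIMIT seconds.
# The inner bash receives HOST as $0 and PORT as $1.
tcp_port_open() {
    local host=$1 port=$2 limit=${3:-5}
    timeout "$limit" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2> /dev/null
}

if tcp_port_open 10.10.0.35 445 5; then
    echo "SMB port reachable"
else
    echo "NAS SMB port 445 not reachable"
fi
```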
### 2. Systemd Service Dependencies
Configure services to gracefully handle mount failures:
```ini
# Add to the [Unit] section of services that depend on /mnt/media
[Unit]
After=mnt-media.mount
Wants=mnt-media.mount
```
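The unit name `mnt-media.mount` is derived from the mount path. A small sketch of that derivation follows, using `systemd-escape` when it is installed and a deliberately simplified fallback otherwise. `mount_unit_name` is a hypothetical helper name, and the fallback only handles paths made of letters, digits, underscores, and single slashes.

```bash
#!/usr/bin/env bash
# Hypothetical helper: derive the systemd mount unit name for a path.

# Print the mount unit name for PATH.
mount_unit_name() {
    local path=$1 trimmed
    if command -v systemd-escape > /dev/null 2>&1; then
        # Preferred: let systemd apply its own escaping rules.
        systemd-escape --path --suffix=mount "$path"
        return
    fi
    # Simplified fallback: refuse anything that would need real escaping.
    if [[ ! $path =~ ^(/[A-Za-z0-9_]+)+/?$ ]]; then
        echo "path needs systemd-escape: $path" >&2
        return 1
    fi
    trimmed=${path#/}       # drop the leading slash
    trimmed=${trimmed%/}    # drop a trailing slash, if any
    printf '%s.mount\n' "${trimmed//\//-}"
}

mount_unit_name /mnt/media
```

For `/mnt/media` this prints `mnt-media.mount`. Passing `--suffix=automount` to `systemd-escape` instead yields the name of the matching automount unit.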
### 3. Kernel Parameter Tuning
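`CIFSMaxBufSize` is a CIFS module parameter that controls the network buffer size, not a timeout. It is set when the module loads rather than through `/etc/sysctl.conf`, so check the current value before changing anything:
```bash
# Show the buffer size the cifs module was loaded with
cat /sys/module/cifs/parameters/CIFSMaxBufSize
# To change it, set a module option (e.g. in /etc/modprobe.d/) and reload the cifs module
```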
## Expected Improvements
After implementing these changes:
### Immediate Benefits
- **No more 90-second hangs** - Operations fail fast with 15-second timeouts
- **Graceful error recovery** - `soft` lets stalled operations return errors instead of blocking indefinitely
- **Reduced memory pressure** - Smaller 1MB buffers vs 16MB
- **Better retry behavior** - 3 attempts instead of 1
### System Stability
- **Prevents kernel deadlocks** - Operations can be interrupted and retried
- **Faster error detection** - 10-second attribute cache timeout
- **Automatic recovery** - systemd auto-mounting handles reconnection
### Performance
- **Maintained caching benefits** - `cache=loose` preserves performance
- **Reduced network overhead** - 60-second keepalive intervals
- **Efficient buffer usage** - 1MB buffers balance performance and stability
## Files to Modify
1. **`/etc/fstab`** - Primary mount configuration
2. **Optional monitoring scripts** - NAS connectivity checks
3. **Service configurations** - Dependencies on mount availability
## Testing Checklist
- [ ] Backup current fstab configuration
- [ ] Apply new mount options
- [ ] Test normal operation (read/write files)
- [ ] Test network interruption handling (disconnect NAS briefly)
- [ ] Verify fast failure instead of system hangs
- [ ] Monitor system stability over 24 hours
- [ ] Validate with Tdarr container operations
## Monitoring and Validation
### Success Criteria
- Mount operations fail within 30 seconds during network issues
- No kernel RCU stalls or deadlock messages in journal
- System remains responsive during NAS network problems
- Automatic remount when network connectivity restored
### Long-term Monitoring
- Monitor journal for CIFS error patterns
- Track system stability metrics
- Validate performance impact of smaller buffers
- Ensure gaming and transcoding workloads remain unaffected
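
Journal monitoring can be partly automated. The sketch below counts kernel log lines that match a few patterns. `count_cifs_problems` is a hypothetical helper name, and the patterns are assumptions based on common kernel message prefixes; adjust them to whatever actually appears in the journal on this host.

```bash
#!/usr/bin/env bash
# Hypothetical helper: count kernel log lines that suggest CIFS trouble.

# Read log lines on stdin and print how many match the patterns below.
# `|| true` keeps the exit status at 0 when grep finds no matches.
count_cifs_problems() {
    grep -Eic 'CIFS: VFS|rcu.*stall|blocked for more than [0-9]+ seconds' || true
}

# Usage against the kernel journal for the last 24 hours:
#   journalctl -k --since "24 hours ago" --no-pager | count_cifs_problems

# Self-contained demonstration with illustrative sample lines
# (not captured from the affected system):
printf '%s\n' \
    'kernel: CIFS: VFS: \\10.10.0.35 has not responded in 180 seconds. Reconnecting...' \
    'kernel: usb 1-1: new high-speed USB device' \
    'kernel: rcu: INFO: rcu_preempt detected stalls on CPUs/tasks' \
    | count_cifs_problems
```

A non-zero count is a cue to read the matching lines in full, not a verdict on its own.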