# CIFS Mount Resilience Improvements

**Date**: 2025-08-11
**Issue**: CIFS network errors escalating to kernel deadlocks and system crashes
**Target**: /mnt/media mount to NAS at 10.10.0.35

## Current Configuration Analysis

**Current fstab entry**:
```bash
//10.10.0.35/media /mnt/media cifs credentials=/home/cal/.samba_credentials,uid=1000,gid=1000,vers=3.1.1,cache=loose,rsize=16777216,wsize=16777216,bsize=4194304,actimeo=30,closetimeo=5,echo_interval=30,noperm 0 0
```

**Problems Identified**:
- No explicit timeout options, leading to 90-second hangs
- Aggressive 16MB buffer sizes (`rsize`/`wsize`=16777216) causing memory pressure during network issues
- Limited retry attempts (`retrans=1` is the value cited here; the entry itself sets no `retrans`) providing minimal resilience
- No explicit error handling for graceful degradation
- No interruption handling, which prevents recovery from network deadlocks
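
To check which options an entry really sets (for example, whether `retrans` or any timeout option appears at all), the option field can be split into one option per line. This is a minimal sketch; the function name `fstab_opts` is hypothetical and not part of any existing tool.

```bash
#!/bin/bash
# fstab_opts <mountpoint> [fstab_file]
# Print the mount options of the non-comment fstab entry whose second
# field equals <mountpoint>, one option per line.
fstab_opts() {
    local mountpoint=$1 fstab=${2:-/etc/fstab}
    awk -v mp="$mountpoint" '
        $1 !~ /^#/ && $2 == mp {
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++) print opts[i]
        }
    ' "$fstab"
}
```

For instance, `fstab_opts /mnt/media | grep -E '^(rsize|wsize|retrans)='` lists only the buffer and retry settings; no output for `retrans` means the entry does not set it.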

## Recommended CIFS Mount Configuration

**New improved fstab entry**:
```bash
//10.10.0.35/media /mnt/media cifs credentials=/home/cal/.samba_credentials,uid=1000,gid=1000,vers=3.1.1,soft,intr,timeo=15,retrans=3,rsize=1048576,wsize=1048576,cache=loose,actimeo=10,echo_interval=60,_netdev,noauto,x-systemd.automount,x-systemd.device-timeout=10,x-systemd.mount-timeout=30,noperm 0 0
```

**Caution**: `timeo` and `retrans` are documented in nfs(5), not in mount.cifs(8). If the mount fails and the kernel log reports an unknown mount option, remove these two options from the entry.
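
## Key Improvements Explained

### Better Timeout Handling
- **`timeo=15`** - Intended as a 15-second request timeout to prevent 90-second hangs. Note that nfs(5) measures `timeo` in tenths of a second, and mount.cifs(8) does not document the option, so verify that it takes effect
- **`retrans=3`** - Intended to allow 3 retry attempts instead of 1 (also an NFS option; same caveat)
- **`x-systemd.device-timeout=10`** - systemd waits at most 10 seconds for the device to become available
- **`x-systemd.mount-timeout=30`** - systemd aborts the mount operation after 30 seconds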
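
### Graceful Error Recovery
- **`soft`** - Allows operations to fail with an error instead of hanging indefinitely (already the CIFS default; set explicitly here)
- **`intr`** - Intended to let the kernel interrupt hung operations. mount.cifs(8) describes `intr` as currently unimplemented, so it should not be the only safeguard against deadlocks
- **`_netdev`** - Indicates network dependency for proper boot ordering
- **`noauto,x-systemd.automount`** - Mounts on first access instead of at boot. Unmounting when idle additionally requires `x-systemd.idle-timeout=`, which this entry does not set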

### Preventing Kernel Deadlocks
- **Smaller buffer sizes** - `rsize=1048576,wsize=1048576` (1MB instead of 16MB) reduces memory pressure
- **`actimeo=10`** - Shorter attribute cache timeout (10s vs 30s), so cached metadata is revalidated sooner and errors surface faster
- **`echo_interval=60`** - Longer keepalive interval (60s vs 30s) reduces network chatter. The trade-off is slower detection of an unresponsive server, because reconnection is attempted after twice the echo interval
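
The buffer sizes are given in bytes, so the 16MB and 1MB figures follow from dividing by 1048576 (1024 × 1024). A small sketch of that arithmetic, using a hypothetical helper named `to_mib`:

```bash
#!/bin/bash
# to_mib <bytes>
# Convert a byte count to whole mebibytes using integer division.
to_mib() {
    echo $(( $1 / 1048576 ))
}

to_mib 16777216   # old rsize/wsize  -> prints 16
to_mib 1048576    # new rsize/wsize  -> prints 1
to_mib 4194304    # old bsize        -> prints 4
```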

### Network Interruption Resilience
- **`cache=loose`** - Maintains loose caching for better performance with network issues
- **Combined timeout strategy** - Multiple timeout layers prevent a single failure from hanging the system

## Implementation Steps

### Step 1: Backup Current Configuration
```bash
sudo cp /etc/fstab /etc/fstab.backup
```

### Step 2: Update /etc/fstab
Replace the current `/mnt/media` line in `/etc/fstab` with the recommended entry from the "Recommended CIFS Mount Configuration" section.
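
One way to make this replacement reviewable is to generate the edited file as a copy first and diff it against the original. The sketch below assumes a hypothetical helper `replace_fstab_entry`; it only prints the result and never writes to `/etc/fstab` itself.

```bash
#!/bin/bash
# replace_fstab_entry <mountpoint> <new_line> [fstab_file]
# Print the fstab with the non-comment entry for <mountpoint> replaced by
# <new_line>. All other lines pass through unchanged.
replace_fstab_entry() {
    local mountpoint=$1 new_line=$2 fstab=${3:-/etc/fstab}
    # The new line travels through the environment so awk does not
    # interpret backslash escapes in it.
    NEW_LINE=$new_line awk -v mp="$mountpoint" '
        $1 !~ /^#/ && $2 == mp { print ENVIRON["NEW_LINE"]; next }
        { print }
    ' "$fstab"
}
```

A possible workflow: `replace_fstab_entry /mnt/media "$new_entry" > /tmp/fstab.new`, then `diff -u /etc/fstab /tmp/fstab.new`, and only after reviewing the diff `sudo cp /tmp/fstab.new /etc/fstab`. Here `$new_entry` and `/tmp/fstab.new` are placeholder names.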

### Step 3: Test the New Configuration
```bash
# Make systemd regenerate its mount units from the edited fstab
sudo systemctl daemon-reload

# Unmount current mount
sudo umount /mnt/media

# Remount with new options
sudo mount /mnt/media

# Verify new mount options are active
mount | grep /mnt/media
```
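
Reading the `mount` output by eye is error-prone with this many options, so the check can be scripted. The following is a sketch with a hypothetical helper `opts_contain` that tests for an exact option inside a comma-separated option string.

```bash
#!/bin/bash
# opts_contain <option_string> <option>
# Succeed only if <option> appears as a complete comma-separated item
# in <option_string> (so "soft" does not match "softish").
opts_contain() {
    case ",$1," in
        *",$2,"*) return 0 ;;
        *)        return 1 ;;
    esac
}
```

For example, `opts_contain "$(findmnt -n -o OPTIONS -t cifs /mnt/media)" soft && echo "soft is active"`. The kernel may report a negotiated `rsize`/`wsize` that differs from the requested value, so a mismatch on those two options is not necessarily a configuration error.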

### Step 4: Validate Network Resilience
```bash
# Test timeout behavior with network simulation
# (Temporarily disconnect NAS network cable for 30 seconds)
# Verify mount operations fail gracefully instead of hanging system
```
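
To put numbers on "fail gracefully", access attempts can be timed while the cable is disconnected. This is a rough sketch, not a definitive test harness: the function name `watch_access` and its default values are assumptions, and a process stuck in uninterruptible sleep may not react to the signals that `timeout` sends.

```bash
#!/bin/bash
# watch_access <path> [rounds] [limit_seconds] [pause_seconds]
# Repeatedly list <path> under a time limit and print one line per round
# with the wall-clock time, the exit status and the elapsed seconds.
# rc=124 means the time limit was reached.
watch_access() {
    local path=$1 rounds=${2:-12} limit=${3:-30} pause=${4:-5}
    local i start rc
    for ((i = 1; i <= rounds; i++)); do
        start=$SECONDS
        rc=0
        timeout -k 5 "$limit" ls -- "$path" >/dev/null 2>&1 || rc=$?
        printf '%s rc=%d elapsed=%ds\n' "$(date +%T)" "$rc" "$((SECONDS - start))"
        if (( i < rounds )); then sleep "$pause"; fi
    done
}
```

Running `watch_access /mnt/media` in one terminal while disconnecting the NAS shows whether each attempt returns an error within the expected window or runs into the limit.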

## Additional System-Level Protections

### 1. Network Monitoring Script
Create a monitoring script to detect NAS connectivity issues:
```bash
#!/bin/bash
# /mnt/NV2/Development/claude-home/scripts/monitoring/nas-connectivity-monitor.sh
# Send one ping, wait up to 5 seconds, and report only when it fails
ping -c 1 -W 5 10.10.0.35 >/dev/null 2>&1 || echo "NAS connectivity issue detected"
```
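
A single lost ping can raise a false alarm, so one possible refinement is to retry a few times before reporting. The sketch below is an assumption-laden variant of the script above: `ping_nas`, `nas_reachable` and the `NAS_HOST` variable are hypothetical names, and the probe command is a parameter so the logic can be exercised without a NAS.

```bash
#!/bin/bash
NAS_HOST=${NAS_HOST:-10.10.0.35}

# Default probe: one ping with a 5-second wait, output discarded.
ping_nas() {
    ping -c 1 -W 5 "$NAS_HOST" >/dev/null 2>&1
}

# nas_reachable [attempts] [delay_seconds] [probe_command]
# Succeed as soon as one probe succeeds; fail after <attempts> failures.
nas_reachable() {
    local attempts=${1:-3} delay=${2:-2} probe=${3:-ping_nas}
    local i
    for ((i = 1; i <= attempts; i++)); do
        if "$probe"; then return 0; fi
        if (( i < attempts )); then sleep "$delay"; fi
    done
    return 1
}
```

A cron job or timer could then run `nas_reachable || logger -t nas-monitor "NAS connectivity issue detected"` to record failures in the system log.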

### 2. Systemd Service Dependencies
Configure services to gracefully handle mount failures. `Wants=` (rather than `Requires=`) lets the service start even if the mount fails:
```ini
# Add to the [Unit] section of services that depend on /mnt/media
[Unit]
After=mnt-media.mount
Wants=mnt-media.mount
```
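
Rather than editing a unit file in place, the same two directives can live in a drop-in file. The following sketch writes such a drop-in; the function name `write_mount_dropin`, the file name `10-media-mount.conf` and the example unit name are all hypothetical.

```bash
#!/bin/bash
# write_mount_dropin <unit_name> [base_dir]
# Create <base_dir>/<unit_name>.d/10-media-mount.conf containing the
# After=/Wants= dependency on the /mnt/media mount unit.
write_mount_dropin() {
    local unit=$1 base=${2:-/etc/systemd/system}
    mkdir -p "$base/$unit.d" || return 1
    cat > "$base/$unit.d/10-media-mount.conf" <<'EOF'
[Unit]
After=mnt-media.mount
Wants=mnt-media.mount
EOF
}
```

Run from a root shell, `write_mount_dropin example.service` followed by `systemctl daemon-reload` would apply the dependency to a unit called `example.service`; passing a scratch directory as the second argument allows a dry run.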
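
### 3. Kernel Parameter Tuning
Consider CIFS module parameter tuning. `CIFSMaxBufSize` is a buffer size in bytes, not a timeout, and it is a module parameter rather than a sysctl, so it does not belong in /etc/sysctl.conf:
```bash
# Inspect the current value
cat /sys/module/cifs/parameters/CIFSMaxBufSize
# To change it, set it at module load time with an "options cifs ..." line
# in a file under /etc/modprobe.d/, then reload the cifs module
```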

## Expected Improvements

After implementing these changes:

### Immediate Benefits
- **No more 90-second hangs** - Operations are intended to fail fast, within about 15 seconds, provided the timeout options are honored
- **Graceful error recovery** - `soft` lets operations return an error instead of blocking; `intr` is meant to allow interruption of hung operations
- **Reduced memory pressure** - Smaller 1MB buffers vs 16MB
- **Better retry behavior** - Up to 3 attempts instead of 1

### System Stability
- **Prevents kernel deadlocks** - Operations can be interrupted and retried
- **Faster error detection** - 10-second attribute cache timeout
- **Automatic recovery** - systemd auto-mounting handles reconnection

### Performance
- **Maintained caching benefits** - `cache=loose` preserves performance
- **Reduced network overhead** - 60-second keepalive intervals
- **Efficient buffer usage** - 1MB buffers balance performance and stability

## Files to Modify

1. **`/etc/fstab`** - Primary mount configuration
2. **Optional monitoring scripts** - NAS connectivity checks
3. **Service configurations** - Dependencies on mount availability

## Testing Checklist

- [ ] Backup current fstab configuration
- [ ] Apply new mount options
- [ ] Test normal operation (read/write files)
- [ ] Test network interruption handling (disconnect NAS briefly)
- [ ] Verify fast failure instead of system hangs
- [ ] Monitor system stability over 24 hours
- [ ] Validate with Tdarr container operations

## Monitoring and Validation

### Success Criteria
- Mount operations fail within 30 seconds during network issues
- No kernel RCU stalls or deadlock messages in the journal
- System remains responsive during NAS network problems
- Automatic remount when network connectivity is restored
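
The journal criterion can be checked mechanically by counting suspicious kernel messages. This is a sketch only: the function name `count_kernel_trouble` is hypothetical, and the three patterns are assumptions about typical stall, hung-task and CIFS messages that should be adjusted to what actually appears in the journal.

```bash
#!/bin/bash
# count_kernel_trouble
# Read log lines on stdin and print how many look like RCU stalls,
# hung-task warnings or unresponsive-server messages from CIFS.
count_kernel_trouble() {
    # grep -c prints the count; "|| true" hides its non-zero exit status
    # when the count is 0.
    grep -Eic 'rcu.*stall|blocked for more than [0-9]+ seconds|CIFS.*has not responded' || true
}
```

For the 24-hour stability check, `journalctl -k --since "24 hours ago" --no-pager | count_kernel_trouble` should print `0` if none of the patterns occurred.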

### Long-term Monitoring
- Monitor journal for CIFS error patterns
- Track system stability metrics
- Validate performance impact of smaller buffers
- Ensure gaming and transcoding workloads remain unaffected