CIFS Mount Resilience Improvements
Date: 2025-08-11
Issue: CIFS network errors escalating to kernel deadlocks and system crashes
Target: /mnt/media mount to NAS at 10.10.0.35
Current Configuration Analysis
Current fstab entry:
//10.10.0.35/media /mnt/media cifs credentials=/home/cal/.samba_credentials,uid=1000,gid=1000,vers=3.1.1,cache=loose,rsize=16777216,wsize=16777216,bsize=4194304,actimeo=30,closetimeo=5,echo_interval=30,noperm 0 0
Problems Identified:
- Missing critical timeout options leading to 90-second hangs
- Aggressive buffer sizes (16MB) causing memory pressure during network issues
- Limited retry attempts (retrans=1) providing minimal resilience
- No explicit error handling for graceful degradation
- Missing interruption handling preventing recovery from network deadlocks
Recommended CIFS Mount Configuration
New improved fstab entry:
//10.10.0.35/media /mnt/media cifs credentials=/home/cal/.samba_credentials,uid=1000,gid=1000,vers=3.1.1,soft,intr,timeo=15,retrans=3,rsize=1048576,wsize=1048576,cache=loose,actimeo=10,echo_interval=60,_netdev,noauto,x-systemd.automount,x-systemd.device-timeout=10,x-systemd.mount-timeout=30,noperm 0 0
Key Improvements Explained
Better Timeout Handling
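- timeo=15 - Intended as a 15-second request timeout to prevent 90-second hangs. timeo is documented for NFS and is not listed in mount.cifs(8), so confirm the kernel accepts it before relying on it
- retrans=3 - 3 retry attempts instead of 1
- x-systemd.device-timeout=10 - 10-second systemd device timeout
- x-systemd.mount-timeout=30 - 30-second mount operation timeout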
Graceful Error Recovery
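- soft - Allows operations to fail instead of hanging indefinitely
- intr - Intended to let the kernel interrupt hung operations (CRITICAL for preventing deadlocks). mount.cifs(8) lists intr as currently unimplemented, so soft does the real work here
- _netdev - Indicates network dependency for proper boot ordering
- noauto,x-systemd.automount - Mounts on first access instead of at boot (unmounting when idle additionally needs x-systemd.idle-timeout)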
Preventing Kernel Deadlocks
- Smaller buffer sizes - rsize=1048576,wsize=1048576 (1MB instead of 16MB) reduces memory pressure
- actimeo=10 - Shorter attribute cache timeout (10s vs 30s) for faster error detection
- echo_interval=60 - Longer keepalive interval reduces network chatter
Network Interruption Resilience
- cache=loose - Maintains loose caching for better performance with network issues
- Combined timeout strategy - Multiple timeout layers prevent single failure from hanging system
Implementation Steps
Step 1: Backup Current Configuration
sudo cp /etc/fstab /etc/fstab.backup
Step 2: Update /etc/fstab
Replace the current line with the recommended configuration above.
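The replacement can be rehearsed on a copy of fstab before touching the real file. This is a minimal sketch, not a definitive procedure: the function name replace_fstab_entry is hypothetical, and it assumes the entry is identified by its mount point in the second field.

```bash
#!/bin/bash
# Hypothetical helper: replace the fstab line whose second field is the given
# mount point. Works on any file path, so it can be tried on a copy first.
replace_fstab_entry() {
    local file="$1" mountpoint="$2" new_entry="$3" tmp rc
    tmp=$(mktemp) || return 1
    # The new entry is passed through the environment so awk does not
    # interpret backslash escapes in it.
    NEW_ENTRY="$new_entry" awk -v mp="$mountpoint" '
        $1 !~ /^#/ && $2 == mp { print ENVIRON["NEW_ENTRY"]; next }  # swap matching line
        { print }                                                    # keep everything else
    ' "$file" > "$tmp" && cat "$tmp" > "$file"   # cat keeps the file's owner and mode
    rc=$?
    rm -f "$tmp"
    return "$rc"
}

# Example (on a copy, not the live file):
# cp /etc/fstab /tmp/fstab.test
# replace_fstab_entry /tmp/fstab.test /mnt/media "<recommended entry from above>"
```

After editing the real file, running findmnt --verify is a reasonable extra check for syntax problems.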
Step 3: Test the New Configuration
# Unmount current mount
sudo umount /mnt/media
# Reload systemd so it regenerates mount units from the edited fstab
sudo systemctl daemon-reload
# Remount with new options
sudo mount /mnt/media
# Verify new mount options are active
mount | grep /mnt/media
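Reading the option string by eye is error-prone, so the verification can be scripted. The sketch below assumes findmnt from util-linux is available; missing_opts is a hypothetical helper name. Only kernel-visible options are worth checking this way, since x-systemd.* and noauto are handled in userspace and do not appear in the live mount options.

```bash
#!/bin/bash
# Hypothetical helper: print every expected option that is absent from the
# comma-separated option string reported for the live mount.
missing_opts() {
    local actual=",$1," opt rc=0
    shift
    for opt in "$@"; do
        case "$actual" in
            *",$opt,"*) ;;                      # present, nothing to report
            *) printf '%s\n' "$opt"; rc=1 ;;    # absent, report it
        esac
    done
    return "$rc"
}

# Example against the live mount:
# missing_opts "$(findmnt -no OPTIONS /mnt/media)" soft rsize=1048576 wsize=1048576 echo_interval=60
```

An empty result with exit status 0 means every listed option is active. The server may negotiate rsize and wsize down, so a mismatch there is not necessarily a configuration error.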
Step 4: Validate Network Resilience
# Test timeout behavior with network simulation
# (Temporarily disconnect NAS network cable for 30 seconds)
# Verify mount operations fail gracefully instead of hanging system
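While the cable is disconnected, a timed probe makes "fails gracefully" measurable. This is a sketch under stated assumptions: probe_mount is a hypothetical name, and timeout comes from GNU coreutils.

```bash
#!/bin/bash
# Hypothetical helper: list a directory under a hard time limit and report
# how long it took. Exit status 124 from `timeout` means the limit was hit.
# Caveat: if ls is stuck in uninterruptible sleep, this wrapper may block too;
# that outcome itself shows the hang has not been fixed.
probe_mount() {
    local path="$1" limit="${2:-30}" start end rc
    start=$(date +%s)
    timeout "$limit" ls "$path" > /dev/null 2>&1
    rc=$?
    end=$(date +%s)
    if [ "$rc" -eq 124 ]; then
        echo "HUNG: no response from $path within ${limit}s"
    elif [ "$rc" -ne 0 ]; then
        echo "FAILED: $path returned an error after $((end - start))s"
    else
        echo "OK: $path responded in $((end - start))s"
    fi
    return "$rc"
}

# Example during the disconnect test:
# probe_mount /mnt/media 30
```

With the new options in effect, the expected result during an outage is FAILED rather than HUNG.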
Additional System-Level Protections
1. Network Monitoring Script
Create a monitoring script to detect NAS connectivity issues:
#!/bin/bash
# /mnt/NV2/Development/claude-home/scripts/monitoring/nas-connectivity-monitor.sh
ping -c 1 -W 5 10.10.0.35 || echo "NAS connectivity issue detected"
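The one-line check above reports on a single lost packet. A slightly more tolerant variant retries before alerting. This is a sketch only: nas_check, CHECK_CMD, and RETRY_DELAY are hypothetical names introduced for illustration.

```bash
#!/bin/bash
# Sketch: alert only after several consecutive failed checks, so one dropped
# packet does not raise an alarm. CHECK_CMD is a hypothetical override hook.
NAS_IP="${NAS_IP:-10.10.0.35}"
CHECK_CMD="${CHECK_CMD:-ping -c 1 -W 5 $NAS_IP}"

nas_check() {
    local attempts="${1:-3}" i
    for ((i = 1; i <= attempts; i++)); do
        if $CHECK_CMD > /dev/null 2>&1; then
            echo "NAS reachable (attempt $i)"
            return 0
        fi
        sleep "${RETRY_DELAY:-2}"   # brief pause before the next attempt
    done
    echo "NAS connectivity issue detected after $attempts attempts"
    return 1
}

# Example:
# nas_check 3 || logger -t nas-monitor "NAS unreachable"
```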
2. Systemd Service Dependencies
Configure services to gracefully handle mount failures:
# Add to services that depend on /mnt/media
After=mnt-media.mount
Wants=mnt-media.mount
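One way to apply these directives without editing a unit file directly is a systemd drop-in. The helper below is a sketch: write_mount_dropin, the drop-in file name, and example.service are all hypothetical placeholders.

```bash
#!/bin/bash
# Hypothetical helper: write a drop-in carrying the After=/Wants= lines above.
# dropin_root defaults to /etc/systemd/system; pass another path to rehearse.
write_mount_dropin() {
    local unit="$1" dropin_root="${2:-/etc/systemd/system}"
    local dir="$dropin_root/$unit.d"
    mkdir -p "$dir" || return 1
    cat > "$dir/10-media-mount.conf" <<'EOF'
[Unit]
After=mnt-media.mount
Wants=mnt-media.mount
EOF
}

# Example (placeholder unit name):
# sudo bash -c "$(declare -f write_mount_dropin); write_mount_dropin example.service"
# sudo systemctl daemon-reload
```

Wants= keeps the dependency soft, so the service still starts if the mount fails, which matches the goal of handling mount failures gracefully.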
3. Kernel Parameter Tuning
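Consider CIFS module parameter tuning. CIFSMaxBufSize is a buffer size in bytes (default 16384, range 8192 to 130048), not a timeout, and it is set when the cifs module loads rather than through sysctl:
# Add to /etc/modprobe.d/cifs.conf if needed (takes effect the next time the cifs module is loaded)
options cifs CIFSMaxBufSize=16384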
Expected Improvements
After implementing these changes:
Immediate Benefits
- No more 90-second hangs - Operations fail fast with 15-second timeouts
- Graceful error recovery - soft lets hung operations return an error instead of blocking (intr is accepted but may have no effect on CIFS)
- Reduced memory pressure - Smaller 1MB buffers vs 16MB
- Better retry behavior - 3 attempts instead of 1
System Stability
- Prevents kernel deadlocks - Operations can be interrupted and retried
- Faster error detection - 10-second attribute cache timeout
- Automatic recovery - systemd auto-mounting handles reconnection
Performance
- Maintained caching benefits - cache=loose preserves performance
- Reduced network overhead - 60-second keepalive intervals
- Efficient buffer usage - 1MB buffers balance performance and stability
Files to Modify
- /etc/fstab - Primary mount configuration
- Optional monitoring scripts - NAS connectivity checks
- Service configurations - Dependencies on mount availability
Testing Checklist
- Backup current fstab configuration
- Apply new mount options
- Test normal operation (read/write files)
- Test network interruption handling (disconnect NAS briefly)
- Verify fast failure instead of system hangs
- Monitor system stability over 24 hours
- Validate with Tdarr container operations
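The "test normal operation" item in the checklist above can be scripted as a small write-then-read check. This is a minimal sketch; rw_smoke_test is a hypothetical name. With cache=loose the read may be served from the local cache, so it confirms basic operation rather than a full round trip to the NAS.

```bash
#!/bin/bash
# Hypothetical helper: write a small file, read it back, compare, clean up.
rw_smoke_test() {
    local dir="$1" file expected actual
    file="$dir/.rw-test-$$"
    expected="cifs-rw-test $(date +%s)"
    printf '%s\n' "$expected" > "$file" || { echo "write failed"; return 1; }
    actual=$(cat "$file") || { echo "read failed"; rm -f "$file"; return 1; }
    rm -f "$file"
    [ "$actual" = "$expected" ] || { echo "content mismatch"; return 1; }
    echo "read/write OK in $dir"
}

# Example:
# rw_smoke_test /mnt/media
```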
Monitoring and Validation
Success Criteria
- Mount operations fail within 30 seconds during network issues
- No kernel RCU stalls or deadlock messages in journal
- System remains responsive during NAS network problems
- Automatic remount when network connectivity restored
Long-term Monitoring
- Monitor journal for CIFS error patterns
- Track system stability metrics
- Validate performance impact of smaller buffers
- Ensure gaming and transcoding workloads remain unaffected
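Journal monitoring can start with a simple filter over kernel messages. The sketch below is illustrative only: cifs_trouble_lines is a hypothetical name and the patterns are examples of typical kernel wording, not an exhaustive list.

```bash
#!/bin/bash
# Hypothetical filter: keep kernel log lines that look like CIFS trouble,
# RCU stalls, or blocked tasks. Reads from stdin, writes matches to stdout.
cifs_trouble_lines() {
    grep -Ei 'cifs.*(error|not responded|reconnect)|rcu.*stall|blocked for more than [0-9]+ seconds'
}

# Example: scan the last 24 hours of kernel messages.
# journalctl -k --since "24 hours ago" --no-pager | cifs_trouble_lines
```

No output over the 24-hour observation window is consistent with the success criteria above.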