CIFS Mount Resilience Improvements

Date: 2025-08-11
Issue: CIFS network errors escalating to kernel deadlocks and system crashes
Target: /mnt/media mount to NAS at 10.10.0.35

Current Configuration Analysis

Current fstab entry:

//10.10.0.35/media /mnt/media cifs credentials=/home/cal/.samba_credentials,uid=1000,gid=1000,vers=3.1.1,cache=loose,rsize=16777216,wsize=16777216,bsize=4194304,actimeo=30,closetimeo=5,echo_interval=30,noperm 0 0

Problems Identified:

  • No timeout tuning, so failed operations can hang for 90+ seconds
  • Aggressive 16 MB buffer sizes causing memory pressure during network issues
  • No retry tuning, providing minimal resilience to transient failures
  • No explicit soft-failure handling for graceful degradation
  • No interruption handling, preventing recovery from network deadlocks
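
Before changing anything, it helps to compare what fstab requests with what the kernel actually negotiated: `/proc/mounts` (or `findmnt`) shows the effective options, which can differ from the fstab line. A small audit helper, assuming the mount point /mnt/media (the `has_opt` name is ours):

```shell
#!/usr/bin/env bash
# has_opt OPTS NAME — succeed if NAME appears in a comma-separated
# mount-option string such as the OPTIONS column printed by findmnt.
has_opt() {
    case ",$1," in
        *",$2,"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Audit the live mount, e.g.:
#   opts=$(findmnt -no OPTIONS /mnt/media)
#   has_opt "$opts" soft || echo "WARNING: mount is not using soft"
```

This also works for valued options, e.g. `has_opt "$opts" "vers=3.1.1"`.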

New improved fstab entry:

//10.10.0.35/media /mnt/media cifs credentials=/home/cal/.samba_credentials,uid=1000,gid=1000,vers=3.1.1,soft,intr,timeo=15,retrans=3,rsize=1048576,wsize=1048576,cache=loose,actimeo=10,echo_interval=60,_netdev,noauto,x-systemd.automount,x-systemd.device-timeout=10,x-systemd.mount-timeout=30,noperm 0 0

Key Improvements Explained

Better Timeout Handling

  • timeo=15 - 15-second timeout for individual requests (prevents the 90-second hangs)
  • retrans=3 - three retry attempts instead of one
  • x-systemd.device-timeout=10 - 10-second systemd device timeout
  • x-systemd.mount-timeout=30 - 30-second mount operation timeout

Note: timeo and retrans are historically NFS options; some CIFS kernel versions ignore or reject options they do not recognize, so verify the full option string with a manual mount before committing it to fstab.

Graceful Error Recovery

  • soft - Allows operations to fail instead of hanging indefinitely
  • intr - Historically let the kernel interrupt hung operations; modern CIFS kernels may ignore it, so soft does the real work here
  • _netdev - Indicates network dependency for proper boot ordering
  • noauto,x-systemd.automount - Mount on first access; add x-systemd.idle-timeout= if the share should also unmount when idle

Preventing Kernel Deadlocks

  • Smaller buffer sizes - rsize=1048576,wsize=1048576 (1MB instead of 16MB) reduces memory pressure
  • actimeo=10 - Shorter attribute cache timeout (10s vs 30s) for faster error detection
  • echo_interval=60 - Longer keepalive interval reduces network chatter

Network Interruption Resilience

  • cache=loose - Maintains loose caching for better performance with network issues
  • Combined timeout strategy - Multiple timeout layers prevent single failure from hanging system

Implementation Steps

Step 1: Backup Current Configuration

sudo cp /etc/fstab /etc/fstab.backup

Step 2: Update /etc/fstab

Replace the current line with the recommended configuration above.
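
Before remounting, the edited fstab can be sanity-checked without mounting anything, using util-linux's `findmnt --verify` (available since util-linux 2.31; treat warnings about network filesystems as informational). A thin wrapper so the check is easy to point at a copy of the file:

```shell
# verify_fstab [FILE] — parse and verify fstab entries without mounting;
# defaults to /etc/fstab when no file is given (findmnt is util-linux).
verify_fstab() {
    findmnt --verify --tab-file "${1:-/etc/fstab}"
}
```

A non-zero exit means at least one entry has a hard error worth fixing before the next reboot.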

Step 3: Test the New Configuration

# Unmount the current mount
sudo umount /mnt/media

# Reload the systemd units generated from fstab (required because the new
# entry uses noauto,x-systemd.automount), then start the automount
sudo systemctl daemon-reload
sudo systemctl start mnt-media.automount

# First access triggers the CIFS mount; verify the new options are active
ls /mnt/media
mount | grep /mnt/media

Step 4: Validate Network Resilience

# Test timeout behavior with network simulation
# (Temporarily disconnect NAS network cable for 30 seconds)
# Verify mount operations fail gracefully instead of hanging system
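
Unplugging a cable is not always practical; a firewall rule gives a reversible way to simulate the outage, and a small timing helper (our name, `fails_fast`) checks that operations return within the bound the new options should guarantee. Assumes root and iptables; adjust for nftables:

```shell
#!/usr/bin/env bash
# fails_fast CMD... — succeed only if CMD finishes (pass or fail)
# within 30 seconds, the worst case the new mount options should allow.
fails_fast() {
    local start end
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    end=$(date +%s)
    [ $(( end - start )) -le 30 ]
}

# Suggested drill (run as root):
#   iptables -I OUTPUT -d 10.10.0.35 -j DROP    # simulate NAS outage
#   fails_fast ls /mnt/media && echo "failed fast (good)"
#   iptables -D OUTPUT -d 10.10.0.35 -j DROP    # restore connectivity
```

If `fails_fast ls /mnt/media` does not return within the window, the old hang behavior is still present and the mount options did not take effect.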

Additional System-Level Protections

1. Network Monitoring Script

Create a monitoring script to detect NAS connectivity issues:

#!/bin/bash
# /mnt/NV2/Development/claude-home/scripts/monitoring/nas-connectivity-monitor.sh
# One ICMP probe with a 5-second deadline; non-zero exit flags a problem.
ping -c 1 -W 5 10.10.0.35 >/dev/null 2>&1 \
    || { echo "$(date -Is) NAS connectivity issue detected" >&2; exit 1; }

2. Systemd Service Dependencies

Configure services to gracefully handle mount failures:

# Add to services that depend on /mnt/media
After=mnt-media.mount
Wants=mnt-media.mount
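
As a concrete sketch, these directives live in a per-service drop-in; the unit name tdarr.service below is illustrative, so substitute the real consumer of /mnt/media:

```ini
# /etc/systemd/system/tdarr.service.d/10-media-mount.conf
# Created with: sudo systemctl edit tdarr.service
[Unit]
After=mnt-media.mount
Wants=mnt-media.mount
```

Run `sudo systemctl daemon-reload` after adding the drop-in. Wants= keeps the service running if the mount later drops; use Requires= (or BindsTo=) instead if the service should stop when the mount goes away.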

3. Kernel Module Parameter Tuning

The cifs module exposes its tunables under /sys/module/cifs/parameters/, not through sysctl, so persistent changes belong in modprobe configuration rather than /etc/sysctl.conf. Note that CIFSMaxBufSize is a buffer size in bytes (default 16384), not a timeout, so a value like 30 is invalid:

# Inspect the current value at runtime
cat /sys/module/cifs/parameters/CIFSMaxBufSize

# Persist a change across reboots via /etc/modprobe.d/cifs.conf:
# options cifs CIFSMaxBufSize=16384

Expected Improvements

After implementing these changes:

Immediate Benefits

  • No more 90-second hangs - Operations fail fast with 15-second timeouts
  • Graceful error recovery - soft lets operations fail instead of hanging, with intr as a legacy backstop where the kernel honors it
  • Reduced memory pressure - Smaller 1 MB buffers vs 16 MB
  • Better retry behavior - Three retransmission attempts instead of one

System Stability

  • Prevents kernel deadlocks - Operations can be interrupted and retried
  • Faster error detection - 10-second attribute cache timeout
  • Automatic recovery - systemd auto-mounting handles reconnection

Performance

  • Maintained caching benefits - cache=loose preserves performance
  • Reduced network overhead - 60-second keepalive intervals
  • Efficient buffer usage - 1MB buffers balance performance and stability

Files to Modify

  1. /etc/fstab - Primary mount configuration
  2. Optional monitoring scripts - NAS connectivity checks
  3. Service configurations - Dependencies on mount availability

Testing Checklist

  • Backup current fstab configuration
  • Apply new mount options
  • Test normal operation (read/write files)
  • Test network interruption handling (disconnect NAS briefly)
  • Verify fast failure instead of system hangs
  • Monitor system stability over 24 hours
  • Validate with Tdarr container operations
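
The read/write item on the checklist can be scripted. The helper below (`smoke_test`, our name) round-trips a small marker file and cleans up after itself; it defaults to /mnt/media but accepts any directory, which makes rehearsing the test easy:

```shell
#!/usr/bin/env bash
# smoke_test [DIR] — write, read back, and delete a marker file in DIR
# (default /mnt/media); non-zero exit means the round trip failed.
smoke_test() {
    local dir=${1:-/mnt/media} f
    f="$dir/.cifs-smoke-$$"
    echo "cifs-smoke" > "$f" 2>/dev/null || return 1
    [ "$(cat "$f")" = "cifs-smoke" ] || { rm -f "$f"; return 1; }
    rm -f "$f"
}
```

Running it in a loop during the 24-hour stability window gives a simple pass/fail signal alongside the journal checks below.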

Monitoring and Validation

Success Criteria

  • Mount operations fail within 30 seconds during network issues
  • No kernel RCU stalls or deadlock messages in journal
  • System remains responsive during NAS network problems
  • Automatic remount when network connectivity restored

Long-term Monitoring

  • Monitor journal for CIFS error patterns
  • Track system stability metrics
  • Validate performance impact of smaller buffers
  • Ensure gaming and transcoding workloads remain unaffected
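
The journal checks above reduce to a grep over kernel messages. `cifs_alerts` (our name) bundles the patterns worth alerting on, so it can be piped from journalctl or fed a saved log:

```shell
#!/usr/bin/env bash
# cifs_alerts — filter stdin for kernel messages indicating CIFS trouble
# or the RCU stalls / hung tasks this document aims to prevent.
cifs_alerts() {
    grep -iE 'cifs vfs|cifs: |rcu_sched|rcu stall|hung task|blocked for more than'
}

# Typical use:
#   journalctl -k --since "-1 day" | cifs_alerts
```

An empty result over a quiet day is the success signal; any hit warrants a closer look at the surrounding journal context.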