claude-home/networking/troubleshooting.md
Cal Corum 10c9e0d854 CLAUDE: Migrate to technology-first documentation architecture
2025-08-12 23:20:15 -05:00


# Networking Infrastructure Troubleshooting Guide
## SSH Connection Issues
### SSH Authentication Failures
**Symptoms**: Permission denied, connection refused, timeout
**Diagnosis**:
```bash
# Verbose SSH debugging
ssh -vvv user@host
# Test each authentication method in isolation
ssh -o PreferredAuthentications=publickey -o PasswordAuthentication=no user@host
ssh -o PreferredAuthentications=password -o PubkeyAuthentication=no user@host
# Check local key files
ls -la ~/.ssh/
ssh-keygen -lf ~/.ssh/homelab_rsa.pub
```
**Solutions**:
```bash
# Re-deploy SSH keys
ssh-copy-id -i ~/.ssh/homelab_rsa.pub user@host
ssh-copy-id -i ~/.ssh/emergency_homelab_rsa.pub user@host
# Fix key permissions
chmod 600 ~/.ssh/homelab_rsa
chmod 644 ~/.ssh/homelab_rsa.pub
chmod 700 ~/.ssh
# Verify remote authorized_keys
ssh user@host 'chmod 700 ~/.ssh && chmod 600 ~/.ssh/authorized_keys'
```
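The permission fixes above can be verified before retrying a login. The following is a minimal sketch, assuming GNU `stat` and the 700/600 modes used in this guide; the function name `check_ssh_perms` is hypothetical, and private keys are recognised only by having a sibling `.pub` file.

```bash
# check_ssh_perms [DIR]: warn about SSH files whose modes differ from
# the ones set in this guide. DIR defaults to ~/.ssh.
# Assumption: GNU stat (stat -c) is available.
check_ssh_perms() {
    local dir="${1:-$HOME/.ssh}" status=0 mode f
    mode=$(stat -c '%a' "$dir") || return 2
    if [ "$mode" != "700" ]; then
        echo "WARN: $dir has mode $mode, expected 700"
        status=1
    fi
    for f in "$dir"/*; do
        # Check authorized_keys and any file that has a matching .pub (a private key)
        if [ "${f##*/}" = "authorized_keys" ] || [ -f "$f.pub" ]; then
            mode=$(stat -c '%a' "$f")
            if [ "$mode" != "600" ]; then
                echo "WARN: $f has mode $mode, expected 600"
                status=1
            fi
        fi
    done
    return "$status"
}
```

Running `check_ssh_perms` with no arguments checks `~/.ssh`; a non-zero exit status means at least one warning was printed.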
### SSH Service Issues
**Symptoms**: Connection refused, service not running
**Diagnosis**:
```bash
# Check SSH service status
systemctl status sshd
ss -tlnp | grep :22
# Test port connectivity
nc -zv host 22
nmap -p 22 host
```
**Solutions**:
```bash
# Restart SSH service (the unit is named "ssh" on Debian/Ubuntu, "sshd" on RHEL-family and Arch)
sudo systemctl restart sshd
sudo systemctl enable sshd
# Check firewall
sudo ufw status
sudo ufw allow ssh
# Verify SSH configuration
sudo sshd -T | grep -E "(passwordauth|pubkeyauth|permitroot)"
```
## Network Connectivity Problems
### Basic Network Troubleshooting
**Symptoms**: Cannot reach hosts, timeouts, routing issues
**Diagnosis**:
```bash
# Basic connectivity tests
ping host
traceroute host
mtr host
# Check local network configuration
ip addr show
ip route show
cat /etc/resolv.conf
```
**Solutions**:
```bash
# Restart networking (pick the command that matches the host's network stack)
sudo systemctl restart networking # Debian ifupdown
sudo netplan apply # Ubuntu
# Reset network interface
sudo ip link set eth0 down
sudo ip link set eth0 up
# Add the default gateway if `ip route show` lists none
sudo ip route add default via 10.10.0.1
```
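To decide whether a default route needs to be added at all, the output of `ip route show` can be filtered. This is a small sketch with a hypothetical helper name, `default_gateway`; it reads route lines on stdin and assumes the usual `default via <gateway> dev <interface>` layout.

```bash
# default_gateway: print "<gateway> <interface>" for each default route.
# Reads `ip route show` output on stdin; prints nothing if no default route exists.
default_gateway() {
    awk '$1 == "default" {
        gw = "-"; dev = "-"
        for (i = 2; i < NF; i++) {
            if ($i == "via") gw = $(i + 1)
            if ($i == "dev") dev = $(i + 1)
        }
        print gw, dev
    }'
}

# Usage:
#   ip route show | default_gateway
```

Empty output means the routing table has no default route, which is the situation the `ip route add default` command addresses.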
### DNS Resolution Issues
**Symptoms**: Cannot resolve hostnames, slow resolution
**Diagnosis**:
```bash
# Test DNS resolution
nslookup google.com
dig google.com
host google.com
# Check DNS servers (older systemd releases use `systemd-resolve --status`)
resolvectl status
cat /etc/resolv.conf
```
**Solutions**:
```bash
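# Temporary DNS fix (overwrites the file; systemd-resolved or DHCP may regenerate it)
echo "nameserver 8.8.8.8" | sudo tee /etc/resolv.conf
# Restart DNS services
sudo systemctl restart systemd-resolved
# Flush DNS cache (older systemd releases use `systemd-resolve --flush-caches`)
sudo resolvectl flush-caches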
```
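Before overwriting `/etc/resolv.conf`, it helps to see which resolvers the file currently lists. The helpers below are a sketch with hypothetical names; `uses_resolved_stub` relies on the fact that systemd-resolved's local stub listens on `127.0.0.53`.

```bash
# list_nameservers [FILE]: print each nameserver address, one per line.
# FILE defaults to /etc/resolv.conf.
list_nameservers() {
    local file="${1:-/etc/resolv.conf}"
    awk '$1 == "nameserver" { print $2 }' "$file"
}

# uses_resolved_stub [FILE]: succeed if the only nameserver is the
# systemd-resolved stub, meaning the real upstreams are shown by `resolvectl status`.
uses_resolved_stub() {
    [ "$(list_nameservers "${1:-/etc/resolv.conf}")" = "127.0.0.53" ]
}
```

If `uses_resolved_stub` succeeds, editing `/etc/resolv.conf` directly is unlikely to persist, and the upstream servers are better inspected through systemd-resolved.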
## Reverse Proxy and Load Balancer Issues
### Nginx Configuration Problems
**Symptoms**: 502 Bad Gateway, 503 Service Unavailable, SSL errors
**Diagnosis**:
```bash
# Check Nginx status and logs
systemctl status nginx
sudo tail -f /var/log/nginx/error.log
sudo tail -f /var/log/nginx/access.log
# Test Nginx configuration
sudo nginx -t
sudo nginx -T # Show full configuration
```
**Solutions**:
```bash
# Reload Nginx configuration
sudo nginx -s reload
# Check upstream servers
curl -I http://backend-server:port
telnet backend-server port
# Fix common configuration issues
sudo nano /etc/nginx/sites-available/default
# Check proxy_pass URLs, upstream definitions
```
### SSL/TLS Certificate Issues
**Symptoms**: Certificate warnings, expired certificates, connection errors
**Diagnosis**:
```bash
# Check certificate validity
openssl s_client -connect host:443 -servername host
openssl x509 -in /etc/ssl/certs/cert.pem -text -noout
# Check certificate expiry
openssl x509 -in /etc/ssl/certs/cert.pem -noout -dates
```
**Solutions**:
```bash
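# Test Let's Encrypt renewal without changing any certificates
sudo certbot renew --dry-run
# Force renewal only if the dry run succeeds (counts against Let's Encrypt rate limits)
sudo certbot renew --force-renewal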
# Generate self-signed certificate
sudo openssl req -x509 -nodes -days 365 -newkey rsa:2048 \
-keyout /etc/ssl/private/selfsigned.key \
-out /etc/ssl/certs/selfsigned.crt
```
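The expiry date printed by `openssl x509 -noout -enddate` can be turned into a number of days remaining. This is a sketch only: `cert_days_left` is a hypothetical helper, and it assumes GNU `date`, which accepts the `notAfter` date format.

```bash
# cert_days_left NOT_AFTER [NOW_EPOCH]: print whole days until NOT_AFTER.
# NOT_AFTER is the date part of openssl's "notAfter=..." line.
# NOW_EPOCH defaults to the current time. Negative output means already expired.
# Assumption: GNU date (date -d) is available.
cert_days_left() {
    local not_after="$1" now="${2:-$(date +%s)}" end
    end=$(date -d "$not_after" +%s) || return 1
    echo $(( (end - now) / 86400 ))
}

# Usage (certificate path taken from the diagnosis commands above):
#   not_after=$(openssl x509 -in /etc/ssl/certs/cert.pem -noout -enddate | cut -d= -f2)
#   cert_days_left "$not_after"
```

Integer division truncates toward zero, so a certificate with less than 24 hours left reports `0`.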
## Network Storage Issues
### CIFS/SMB Mount Problems
**Symptoms**: Mount failures, connection timeouts, permission errors
**Diagnosis**:
```bash
# Test SMB connectivity
smbclient -L //nas-server -U username
testparm # Validate smb.conf; run this on the Samba server itself
# Check mount status
mount | grep cifs
df -h | grep cifs
```
**Solutions**:
```bash
# Remount with verbose output (the credentials file keeps the password out of shell history)
sudo mount -v -t cifs //server/share /mnt/point -o credentials=/etc/cifs/credentials,vers=3.0
# Example /etc/fstab entry (a line for that file, not a shell command)
//server/share /mnt/point cifs credentials=/etc/cifs/credentials,uid=1000,gid=1000,iocharset=utf8,file_mode=0644,dir_mode=0755,cache=strict,_netdev 0 0
# Inspect the credentials file (this prints the password; avoid on shared screens)
sudo cat /etc/cifs/credentials
# Should contain: username=, password=, domain=
```
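The credentials file can also be checked without printing its contents. The following sketch uses a hypothetical helper, `check_cifs_credentials`; it assumes GNU `stat` and treats mode 600 as the expected permission, which is a common recommendation rather than something this guide states.

```bash
# check_cifs_credentials FILE: verify that FILE has username= and password=
# lines and is readable only by its owner. Never prints the values.
# Assumption: GNU stat (stat -c) is available.
check_cifs_credentials() {
    local file="$1" status=0 key mode
    [ -r "$file" ] || { echo "ERROR: cannot read $file"; return 2; }
    for key in username password; do
        if ! grep -q "^${key}=" "$file"; then
            echo "WARN: missing ${key}= line"
            status=1
        fi
    done
    mode=$(stat -c '%a' "$file")
    if [ "$mode" != "600" ]; then
        echo "WARN: $file has mode $mode, expected 600"
        status=1
    fi
    return "$status"
}
```

Because `/etc/cifs/credentials` is normally root-owned, run the check from a root shell. The `domain=` line is not checked since it is optional for many shares.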
### NFS Mount Issues
**Symptoms**: Stale file handles, mount hangs, permission denied
**Diagnosis**:
```bash
# Check NFS services
systemctl status nfs-client.target
showmount -e nfs-server
# Test NFS connectivity
rpcinfo -p nfs-server
```
**Solutions**:
```bash
# Restart NFS services
sudo systemctl restart nfs-client.target
# Remount NFS shares
sudo umount /mnt/nfs-share
sudo mount -t nfs server:/path /mnt/nfs-share
# Fix stale file handles
sudo umount -f /mnt/nfs-share
sudo mount /mnt/nfs-share
```
## Firewall and Security Issues
### Port Access Problems
**Symptoms**: Connection refused, filtered ports, blocked services
**Diagnosis**:
```bash
# Check firewall status
sudo ufw status verbose
sudo iptables -L -n -v
# Test port accessibility
nc -zv host port
nmap -p port host
```
**Solutions**:
```bash
# Open required ports
sudo ufw allow ssh
sudo ufw allow 80/tcp
sudo ufw allow 443/tcp
sudo ufw allow from 10.10.0.0/24
# Reset firewall if needed (deletes all rules; re-allow SSH before enabling to avoid a lockout)
sudo ufw --force reset
sudo ufw allow ssh
sudo ufw enable
```
### Network Security Issues
**Symptoms**: Unauthorized access, suspicious traffic, security alerts
**Diagnosis**:
```bash
# Check active connections
ss -tuln
netstat -tuln
# Review logs for security events
sudo tail -f /var/log/auth.log
sudo tail -f /var/log/syslog | grep -i security
```
**Solutions**:
```bash
# Block suspicious IPs
sudo ufw deny from suspicious-ip
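# Update SSH security
sudo nano /etc/ssh/sshd_config
# Set: PasswordAuthentication no, PermitRootLogin no
# Validate the configuration, then restart; keep the current session open until a new login works
sudo sshd -t
sudo systemctl restart sshd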
```
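To find candidates for `ufw deny from`, failed SSH password attempts can be counted per source address. This is a sketch with a hypothetical helper name, `failed_ssh_by_ip`; it assumes the standard sshd `Failed password for ... from <ip> port ...` log line and a text log such as `/var/log/auth.log` (hosts that log only to the journal will not have that file).

```bash
# failed_ssh_by_ip [FILE]: print "<count> <ip>" for failed sshd password
# logins, highest count first. FILE defaults to /var/log/auth.log.
failed_ssh_by_ip() {
    local file="${1:-/var/log/auth.log}"
    awk '/sshd/ && /Failed password/ {
        # The source address is the field that follows the word "from"
        for (i = 1; i < NF; i++)
            if ($i == "from") { count[$(i + 1)]++; break }
    }
    END { for (ip in count) print count[ip], ip }' "$file" | sort -rn
}

# Usage:
#   sudo bash -c "$(declare -f failed_ssh_by_ip); failed_ssh_by_ip" | head
```

A high count is only a hint; confirm the address is not one of your own machines before blocking it.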
## Service Discovery and DNS Issues
### Local DNS Problems
**Symptoms**: Services unreachable by hostname, DNS timeouts
**Diagnosis**:
```bash
# Test local DNS resolution
nslookup service.homelab.local
dig @10.10.0.16 service.homelab.local
# Check DNS server status
systemctl status bind9 # or named
```
**Solutions**:
```bash
# Add to /etc/hosts as temporary fix
echo "10.10.0.100 service.homelab.local" | sudo tee -a /etc/hosts
# Restart DNS services
sudo systemctl restart bind9
sudo systemctl restart systemd-resolved
```
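Appending to `/etc/hosts` with `tee -a` adds a duplicate line every time it is repeated. A guarded version might look like the sketch below; `add_hosts_entry` is a hypothetical helper, and the optional third argument exists only so it can be tried on a scratch file first.

```bash
# add_hosts_entry IP NAME [FILE]: append "IP NAME" unless NAME is already
# listed on a non-comment line. FILE defaults to /etc/hosts.
add_hosts_entry() {
    local ip="$1" name="$2" file="${3:-/etc/hosts}"
    if awk -v n="$name" '
        $1 !~ /^#/ {
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^#/) break      # ignore trailing comments
                if ($i == n) found = 1
            }
        }
        END { exit !found }' "$file"; then
        echo "exists: $name"
        return 0
    fi
    printf '%s %s\n' "$ip" "$name" >> "$file"
    echo "added: $name"
}
```

Writing to the real `/etc/hosts` needs root, so call the function from a root shell. Remove the entry again once the DNS server is answering, since a stale hosts line silently overrides DNS.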
### Container Networking Issues
**Symptoms**: Containers cannot communicate, service discovery fails
**Diagnosis**:
```bash
# Check Docker networks
docker network ls
docker network inspect bridge
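# Test container connectivity
# (containers resolve each other by name only on user-defined networks, not on the default bridge)
docker exec container1 ping -c 3 container2
docker exec container1 nslookup container2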
```
**Solutions**:
```bash
# Create custom network
docker network create --driver bridge app-network
docker run --network app-network container
# Fix DNS in containers
docker run --dns 8.8.8.8 container
```
## Performance Issues
### Network Latency Problems
**Symptoms**: Slow response times, timeouts, poor performance
**Diagnosis**:
```bash
# Measure network latency
ping -c 100 host
mtr --report host
# Check network interface stats
ip -s link show
cat /proc/net/dev
```
**Solutions**:
```bash
# Optimize network settings
echo 'net.core.rmem_max = 134217728' | sudo tee -a /etc/sysctl.conf
echo 'net.core.wmem_max = 134217728' | sudo tee -a /etc/sysctl.conf
sudo sysctl -p
# Check for network congestion
iftop
nethogs
```
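For before-and-after comparisons, the loss and average round-trip time can be pulled out of the `ping` summary. This is a sketch with a hypothetical helper, `ping_summary`; it assumes the summary layout printed by common Linux `ping` implementations (a `% packet loss` line and an `rtt min/avg/max/mdev = ...` line).

```bash
# ping_summary: read ping output on stdin and print "loss=<pct> avg_ms=<avg>".
# Fields that cannot be found are reported as n/a.
ping_summary() {
    awk '
        /packet loss/ {
            for (i = 1; i <= NF; i++) if ($i ~ /%$/) loss = $i
        }
        /^(rtt|round-trip)/ {
            split($0, halves, " = ")      # right half holds min/avg/max/mdev
            split(halves[2], t, "/")
            avg = t[2]
        }
        END {
            printf "loss=%s avg_ms=%s\n", (loss == "" ? "n/a" : loss), (avg == "" ? "n/a" : avg)
        }
    '
}

# Usage:
#   ping -c 100 host | ping_summary
```

Recording this one-line result before and after a tuning change makes it easier to tell whether the change helped.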
### Bandwidth Issues
**Symptoms**: Slow transfers, network congestion, dropped packets
**Diagnosis**:
```bash
# Test bandwidth
iperf3 -s # Server
iperf3 -c server-ip # Client
# Check interface utilization
vnstat -i eth0
```
**Solutions**:
```bash
# Implement QoS if needed
sudo tc qdisc add dev eth0 root fq_codel
# Increase NIC ring buffer sizes (check the hardware maximums first with `ethtool -g eth0`)
sudo ethtool -G eth0 rx 4096 tx 4096
```
## Emergency Recovery Procedures
### Network Emergency Recovery
**Complete network failure recovery**:
```bash
# Reset all network configuration (run from a local or hypervisor console; this drops SSH sessions)
sudo systemctl stop networking
sudo ip addr flush dev eth0
sudo ip route flush table main
sudo systemctl start networking
# Manual network configuration
sudo ip addr add 10.10.0.100/24 dev eth0
sudo ip route add default via 10.10.0.1
echo "nameserver 8.8.8.8" | sudo tee /etc/resolv.conf
```
### SSH Emergency Access
**When locked out of systems**:
```bash
# Use emergency SSH key
ssh -i ~/.ssh/emergency_homelab_rsa user@host
# Via console access (if available)
# Use hypervisor console or physical access
# Reset SSH to allow password auth temporarily (run from the console; revert once key access is restored)
sudo sed -i 's/PasswordAuthentication no/PasswordAuthentication yes/' /etc/ssh/sshd_config
sudo systemctl restart sshd
```
### Service Recovery
**Critical service restoration**:
```bash
# Restart all network services
sudo systemctl restart networking
sudo systemctl restart nginx
sudo systemctl restart sshd
# Emergency firewall disable
sudo ufw disable # CAUTION: Only for troubleshooting
# Service-specific recovery
sudo systemctl restart docker
sudo systemctl restart systemd-resolved
```
## Monitoring and Prevention
### Network Health Monitoring
```bash
#!/bin/bash
# network-monitor.sh
CRITICAL_HOSTS="10.10.0.1 10.10.0.16 nas.homelab.local"
CRITICAL_SERVICES="https://homelab.local http://proxmox.homelab.local:8006"
# The lists are left unquoted in the loops so they split on spaces
for host in $CRITICAL_HOSTS; do
    if ! ping -c1 -W5 "$host" >/dev/null 2>&1; then
        echo "ALERT: $host unreachable" | logger -t network-monitor
    fi
done
for service in $CRITICAL_SERVICES; do
    if ! curl -sSf --max-time 10 "$service" >/dev/null 2>&1; then
        echo "ALERT: $service unavailable" | logger -t network-monitor
    fi
done
```
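A single dropped ping is enough to raise an alert in a monitor like this. One way to reduce false alarms is to retry each check a few times first; the `retry` helper below is a hypothetical sketch, not part of the original script.

```bash
# retry ATTEMPTS DELAY CMD...: run CMD until it succeeds, at most ATTEMPTS
# times, sleeping DELAY seconds between tries. Returns 1 if every try fails.
retry() {
    local attempts="$1" delay="$2" n=1
    shift 2
    while ! "$@"; do
        if [ "$n" -ge "$attempts" ]; then
            return 1
        fi
        n=$((n + 1))
        sleep "$delay"
    done
    return 0
}

# Usage inside the host loop:
#   if ! retry 3 2 ping -c1 -W5 "$host" >/dev/null 2>&1; then
#       echo "ALERT: $host unreachable" | logger -t network-monitor
#   fi
```

With three attempts and a two-second delay, a host has to miss three consecutive checks before an alert is logged.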
### Automated Recovery Scripts
```bash
#!/bin/bash
# network-recovery.sh
if ! ping -c1 8.8.8.8 >/dev/null 2>&1; then
    echo "Network down, attempting recovery..."
    sudo systemctl restart networking
    sleep 10
    if ping -c1 8.8.8.8 >/dev/null 2>&1; then
        echo "Network recovered"
    else
        echo "Manual intervention required"
    fi
fi
```
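If a recovery script like this runs on a timer during a long outage, it restarts networking on every run. A cooldown guard is one way to limit that; `cooldown_ok` and the stamp file path in the usage comment are hypothetical, shown here as a sketch.

```bash
# cooldown_ok STAMP_FILE INTERVAL [NOW_EPOCH]: succeed and record the time if
# at least INTERVAL seconds have passed since the last recorded run,
# otherwise fail without touching the stamp file.
cooldown_ok() {
    local stamp="$1" interval="$2" now="${3:-$(date +%s)}" last=0
    [ -f "$stamp" ] && last=$(cat "$stamp")
    if [ $((now - last)) -lt "$interval" ]; then
        return 1
    fi
    echo "$now" > "$stamp"
    return 0
}

# Usage at the top of the recovery branch (path is an example):
#   cooldown_ok /run/network-recovery.stamp 300 || exit 0
```

With an interval of 300, the restart is attempted at most once every five minutes regardless of how often the script is scheduled.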
## Quick Reference Commands
### Network Diagnostics
```bash
# Connectivity tests
ping host
traceroute host
mtr host
nc -zv host port
# Service checks
systemctl status networking
systemctl status nginx
systemctl status sshd
# Network configuration
ip addr show
ip route show
ss -tuln
```
### Emergency Commands
```bash
# Network restart
sudo systemctl restart networking
# SSH emergency access
ssh -i ~/.ssh/emergency_homelab_rsa user@host
# Firewall quick disable (emergency only)
sudo ufw disable
# DNS quick fix
echo "nameserver 8.8.8.8" | sudo tee /etc/resolv.conf
```
This guide covers diagnosis and recovery steps for common networking issues in home lab environments.