claude-home/networking/troubleshooting.md
Cal Corum 10c9e0d854 CLAUDE: Migrate to technology-first documentation architecture
2025-08-12 23:20:15 -05:00

Networking Infrastructure Troubleshooting Guide

SSH Connection Issues

SSH Authentication Failures

Symptoms: Permission denied, connection refused, timeout

Diagnosis:

# Verbose SSH debugging
ssh -vvv user@host

# Test each authentication method in isolation
ssh -o PasswordAuthentication=no user@host  # Key-based login only
ssh -o PubkeyAuthentication=no -o PreferredAuthentications=password user@host  # Password login only

# Check local key files
ls -la ~/.ssh/
ssh-keygen -lf ~/.ssh/homelab_rsa.pub

Solutions:

# Re-deploy SSH keys
ssh-copy-id -i ~/.ssh/homelab_rsa.pub user@host
ssh-copy-id -i ~/.ssh/emergency_homelab_rsa.pub user@host

# Fix key permissions
chmod 600 ~/.ssh/homelab_rsa
chmod 644 ~/.ssh/homelab_rsa.pub
chmod 700 ~/.ssh

# Fix permissions on the remote ~/.ssh directory and authorized_keys
ssh user@host 'chmod 700 ~/.ssh && chmod 600 ~/.ssh/authorized_keys'
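The permission fixes above can be checked before and after they are applied. The following is a minimal sketch, not part of the original guide: `expect_mode` and `check_ssh_perms` are hypothetical helper names, and the script assumes GNU `stat` (`stat -c '%a'`). It checks for the exact modes used in this guide, so a stricter mode such as 400 on a private key is also reported.

```shell
# expect_mode PATH MODE: succeed only if PATH has exactly the octal mode MODE.
# Assumption: GNU stat is available (stat -c '%a').
expect_mode() {
    local actual
    actual=$(stat -c '%a' "$1") || return 1
    if [ "$actual" != "$2" ]; then
        echo "WARN: $1 is $actual, expected $2"
        return 1
    fi
}

# check_ssh_perms [DIR]: check DIR (700), each *.pub file (644), the private
# key next to each *.pub file (600), and authorized_keys (600) if present.
check_ssh_perms() {
    local dir="${1:-$HOME/.ssh}" status=0 pub key
    expect_mode "$dir" 700 || status=1
    for pub in "$dir"/*.pub; do
        [ -f "$pub" ] || continue   # the glob matched nothing
        key="${pub%.pub}"
        expect_mode "$pub" 644 || status=1
        if [ -f "$key" ]; then
            expect_mode "$key" 600 || status=1
        fi
    done
    if [ -f "$dir/authorized_keys" ]; then
        expect_mode "$dir/authorized_keys" 600 || status=1
    fi
    return "$status"
}
```

Running `check_ssh_perms` with no argument inspects `~/.ssh`; a non-zero exit status means at least one file needs the `chmod` fix shown above.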

SSH Service Issues

Symptoms: Connection refused, service not running

Diagnosis:

# Check SSH service status
systemctl status sshd
ss -tlnp | grep :22

# Test port connectivity
nc -zv host 22
nmap -p 22 host

Solutions:

# Restart SSH service
# (on Debian/Ubuntu the unit is named ssh; sshd is only an alias there, so use ssh with enable)
sudo systemctl restart sshd
sudo systemctl enable sshd

# Check firewall
sudo ufw status
sudo ufw allow ssh

# Verify SSH configuration
sudo sshd -T | grep -E "(passwordauth|pubkeyauth|permitroot)"

Network Connectivity Problems

Basic Network Troubleshooting

Symptoms: Cannot reach hosts, timeouts, routing issues

Diagnosis:

# Basic connectivity tests
ping host
traceroute host
mtr host

# Check local network configuration
ip addr show
ip route show
cat /etc/resolv.conf

Solutions:

# Restart networking
sudo systemctl restart networking  # Debian (ifupdown)
sudo netplan apply                 # Ubuntu (netplan)

# Reset network interface
sudo ip link set eth0 down
sudo ip link set eth0 up

# Check the default gateway, and add one only if it is missing
ip route show default
sudo ip route add default via 10.10.0.1
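When a host is unreachable, it helps to confirm that a default route exists and that its gateway answers before looking further out. The sketch below is illustrative only: `default_gateway` is a hypothetical helper that parses the output of `ip route show` from standard input.

```shell
# default_gateway: read `ip route show` output on stdin and print the first
# default gateway. Exits non-zero when no default route is present.
default_gateway() {
    awk '
        $1 == "default" && $2 == "via" { print $3; found = 1; exit }
        END { exit !found }
    '
}

# Usage on a live system (left commented out because it needs a network):
#   gw=$(ip route show | default_gateway) && ping -c1 -W2 "$gw"
```

If the function exits non-zero, the routing table has no default route and the `ip route add default` command above applies. If the gateway is found but does not answer, the problem is more likely on the local link.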

DNS Resolution Issues

Symptoms: Cannot resolve hostnames, slow resolution

Diagnosis:

# Test DNS resolution
nslookup google.com
dig google.com
host google.com

# Check DNS servers
resolvectl status  # replaces the deprecated systemd-resolve --status
cat /etc/resolv.conf

Solutions:

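# Temporary DNS fix
# (on systemd-resolved hosts /etc/resolv.conf is managed, so this change may be overwritten)
echo "nameserver 8.8.8.8" | sudo tee /etc/resolv.conf

# Restart DNS services
sudo systemctl restart systemd-resolved

# Flush DNS cache
sudo resolvectl flush-caches  # replaces the deprecated systemd-resolve --flush-caches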
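Before overwriting `/etc/resolv.conf`, it is worth seeing which resolvers are actually configured. This is a small sketch under stated assumptions: `list_nameservers` is a hypothetical helper, and it treats a lone `127.0.0.53` entry as the systemd-resolved stub listener.

```shell
# list_nameservers [FILE]: print each nameserver from a resolv.conf-style file.
# Writes a note to stderr when the only entry is 127.0.0.53, the
# systemd-resolved stub listener, because the real upstream servers are then
# configured elsewhere.
list_nameservers() {
    local file="${1:-/etc/resolv.conf}" servers
    servers=$(awk '$1 == "nameserver" { print $2 }' "$file") || return 2
    if [ -z "$servers" ]; then
        echo "no nameserver entries in $file" >&2
        return 1
    fi
    echo "$servers"
    if [ "$servers" = "127.0.0.53" ]; then
        echo "note: stub resolver only; check upstream servers with resolvectl status" >&2
    fi
}
```

Calling `list_nameservers` with no argument reads `/etc/resolv.conf`. An exit status of 1 means the file has no `nameserver` line at all, which is the case the temporary fix above addresses.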

Reverse Proxy and Load Balancer Issues

Nginx Configuration Problems

Symptoms: 502 Bad Gateway, 503 Service Unavailable, SSL errors

Diagnosis:

# Check Nginx status and logs
systemctl status nginx
sudo tail -f /var/log/nginx/error.log
sudo tail -f /var/log/nginx/access.log

# Test Nginx configuration
sudo nginx -t
sudo nginx -T  # Show full configuration

Solutions:

# Reload Nginx configuration
sudo nginx -s reload

# Check upstream servers
curl -I http://backend-server:port
telnet backend-server port

# Fix common configuration issues
sudo nano /etc/nginx/sites-available/default
# Check proxy_pass URLs, upstream definitions

SSL/TLS Certificate Issues

Symptoms: Certificate warnings, expired certificates, connection errors

Diagnosis:

# Check certificate validity
openssl s_client -connect host:443 -servername host
openssl x509 -in /etc/ssl/certs/cert.pem -text -noout

# Check certificate expiry
openssl x509 -in /etc/ssl/certs/cert.pem -noout -dates

Solutions:

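# Renew Let's Encrypt certificates
sudo certbot renew --dry-run        # Test renewal without changing certificates
sudo certbot renew                  # Renew certificates that are close to expiry
sudo certbot renew --force-renewal  # Force renewal regardless of expiry; counts against rate limits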

# Generate self-signed certificate
sudo openssl req -x509 -nodes -days 365 -newkey rsa:2048 \
    -keyout /etc/ssl/private/selfsigned.key \
    -out /etc/ssl/certs/selfsigned.crt
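The expiry check above prints raw dates; turning them into a number of days makes them easier to act on. The following is a sketch rather than a definitive implementation: `days_until_expiry` is a hypothetical helper, and it assumes GNU `date` (`date -d`) to parse the `notAfter=` line printed by `openssl x509 -noout -enddate`.

```shell
# days_until_expiry ENDDATE [NOW_EPOCH]
# ENDDATE is the "notAfter=..." line printed by `openssl x509 -noout -enddate`.
# NOW_EPOCH defaults to the current time. Prints whole days remaining
# (negative when the certificate has already expired).
# Assumption: GNU date is available for parsing (date -d).
days_until_expiry() {
    local enddate="${1#notAfter=}" now="${2:-$(date +%s)}" end
    end=$(date -u -d "$enddate" +%s) || return 2
    echo $(( (end - now) / 86400 ))
}

# Usage on a live system (left commented out because it needs a certificate):
#   days_until_expiry "$(openssl x509 -in /etc/ssl/certs/cert.pem -noout -enddate)"
```

For a simple pass/fail check, `openssl x509 -checkend SECONDS` is an alternative that needs no date arithmetic.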

Network Storage Issues

CIFS/SMB Mount Problems

Symptoms: Mount failures, connection timeouts, permission errors

Diagnosis:

# Test SMB connectivity
smbclient -L //nas-server -U username
testparm  # Run on the Samba server: validates its local smb.conf

# Check mount status
mount | grep cifs
df -h | grep cifs

Solutions:

# Remount with verbose output (prompts for the password instead of exposing it in shell history)
sudo mount -v -t cifs //server/share /mnt/point -o username=user,vers=3.0

# Fix mount options in /etc/fstab
//server/share /mnt/point cifs credentials=/etc/cifs/credentials,uid=1000,gid=1000,iocharset=utf8,file_mode=0644,dir_mode=0755,cache=strict,_netdev 0 0

# Test credentials
sudo cat /etc/cifs/credentials
# Should contain: username=, password=, domain=
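The fstab entry above relies on two options that are easy to forget: `_netdev`, which defers the mount until the network is up, and `credentials=`, which keeps the password out of the file. As a rough sketch, a hypothetical helper `check_cifs_fstab` could flag CIFS entries that lack either one:

```shell
# check_cifs_fstab [FILE]: list cifs entries that are missing _netdev or a
# credentials= file. Exits non-zero when any entry is flagged.
check_cifs_fstab() {
    awk '
        $1 ~ /^#/ || NF < 4 || $3 != "cifs" { next }
        {
            opts = "," $4 ","
            if (opts !~ /,_netdev,/)     { print $2 ": missing _netdev"; bad = 1 }
            if (opts !~ /,credentials=/) { print $2 ": no credentials= file"; bad = 1 }
        }
        END { exit bad + 0 }
    ' "${1:-/etc/fstab}"
}
```

Running `check_cifs_fstab` with no argument reads `/etc/fstab`. Each reported line names the mount point, so the matching entry can be corrected using the example entry above as a template.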

NFS Mount Issues

Symptoms: Stale file handles, mount hangs, permission denied

Diagnosis:

# Check NFS services
systemctl status nfs-client.target
showmount -e nfs-server

# Test NFS connectivity
rpcinfo -p nfs-server

Solutions:

# Restart NFS services
sudo systemctl restart nfs-client.target

# Remount NFS shares
sudo umount /mnt/nfs-share
sudo mount -t nfs server:/path /mnt/nfs-share

# Fix stale file handles
sudo umount -f /mnt/nfs-share
sudo mount /mnt/nfs-share

Firewall and Security Issues

Port Access Problems

Symptoms: Connection refused, filtered ports, blocked services

Diagnosis:

# Check firewall status
sudo ufw status verbose
sudo iptables -L -n -v

# Test port accessibility
nc -zv host port
nmap -p port host

Solutions:

# Open required ports
sudo ufw allow ssh
sudo ufw allow 80/tcp
sudo ufw allow 443/tcp
sudo ufw allow from 10.10.0.0/24

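# Reset firewall if needed
# (reset deletes every rule, so re-allow SSH before enabling or remote sessions will be locked out)
sudo ufw --force reset
sudo ufw allow ssh
sudo ufw enable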

Network Security Issues

Symptoms: Unauthorized access, suspicious traffic, security alerts

Diagnosis:

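# Check listening sockets and established connections
ss -tuln                  # Listening TCP/UDP sockets
ss -tn state established  # Established TCP connections
netstat -tuln             # Legacy equivalent of ss -tuln (net-tools)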

# Review logs for security events
sudo tail -f /var/log/auth.log
sudo tail -f /var/log/syslog | grep -i security

Solutions:

# Block suspicious IPs
sudo ufw deny from suspicious-ip

# Update SSH security
sudo nano /etc/ssh/sshd_config
# Set: PasswordAuthentication no, PermitRootLogin no
sudo systemctl restart sshd
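After editing `sshd_config`, the effective settings can be checked against the two values recommended above. This is a minimal sketch: `audit_sshd` is a hypothetical helper that reads the lowercase `keyword value` lines printed by `sshd -T` from standard input.

```shell
# audit_sshd: read `sshd -T` output on stdin and warn when password logins or
# root logins are not fully disabled. Exits non-zero if any warning is printed.
audit_sshd() {
    awk '
        BEGIN { bad = 0 }
        $1 == "passwordauthentication" && $2 != "no" {
            print "WARN: PasswordAuthentication is " $2; bad = 1
        }
        $1 == "permitrootlogin" && $2 != "no" {
            print "WARN: PermitRootLogin is " $2; bad = 1
        }
        END { exit bad }
    '
}

# Usage on a live system (left commented out because it needs root and sshd):
#   sudo sshd -T | audit_sshd
```

Checking the output of `sshd -T` rather than the file itself has the advantage of reflecting the configuration sshd would actually load, including defaults for keywords that are not set explicitly.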

Service Discovery and DNS Issues

Local DNS Problems

Symptoms: Services unreachable by hostname, DNS timeouts

Diagnosis:

# Test local DNS resolution
nslookup service.homelab.local
dig @10.10.0.16 service.homelab.local

# Check DNS server status
systemctl status bind9  # or named

Solutions:

# Add to /etc/hosts as temporary fix
echo "10.10.0.100 service.homelab.local" | sudo tee -a /etc/hosts

# Restart DNS services
sudo systemctl restart bind9
sudo systemctl restart systemd-resolved
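Appending to `/etc/hosts` with `tee -a` adds a duplicate line every time the fix is repeated. A possible idempotent variant is sketched below; `add_host_entry` is a hypothetical helper, and writing to the real `/etc/hosts` requires running it as root.

```shell
# add_host_entry IP NAME [FILE]: append "IP NAME" to a hosts-style file unless
# NAME already appears as a hostname on a non-comment line.
add_host_entry() {
    local ip="$1" name="$2" file="${3:-/etc/hosts}"
    if awk -v n="$name" '
            $1 !~ /^#/ { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
            END { exit !found }
        ' "$file"; then
        echo "$name already present in $file"
        return 0
    fi
    printf '%s %s\n' "$ip" "$name" >> "$file"
}
```

For example, `add_host_entry 10.10.0.100 service.homelab.local` can be run repeatedly and leaves a single entry in place. The entry is still a temporary workaround and should be removed once local DNS is answering again.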

Container Networking Issues

Symptoms: Containers cannot communicate, service discovery fails

Diagnosis:

# Check Docker networks
docker network ls
docker network inspect bridge

# Test container connectivity
# (container names resolve only on user-defined networks, not on the default bridge)
docker exec container1 ping container2
docker exec container1 nslookup container2

Solutions:

# Create custom network
docker network create --driver bridge app-network
docker run --network app-network container

# Fix DNS in containers
docker run --dns 8.8.8.8 container

Performance Issues

Network Latency Problems

Symptoms: Slow response times, timeouts, poor performance

Diagnosis:

# Measure network latency
ping -c 100 host
mtr --report host

# Check network interface stats
ip -s link show
cat /proc/net/dev

Solutions:

# Optimize network settings
echo 'net.core.rmem_max = 134217728' | sudo tee -a /etc/sysctl.conf
echo 'net.core.wmem_max = 134217728' | sudo tee -a /etc/sysctl.conf
sudo sysctl -p

# Check for network congestion
iftop
nethogs
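The 100-packet ping above produces a long transcript when only two figures usually matter: packet loss and average round-trip time. The sketch below pulls those out of the summary lines; `ping_summary` is a hypothetical helper and assumes the summary format printed by the common Linux (iputils) `ping`.

```shell
# ping_summary: read ping output on stdin and print the packet loss and the
# average round-trip time taken from the two summary lines.
# Assumption: iputils-style summary, e.g.
#   "100 packets transmitted, 98 received, 2% packet loss, time 99123ms"
#   "rtt min/avg/max/mdev = 0.321/0.456/1.234/0.100 ms"
ping_summary() {
    awk '
        /packet loss/ { for (i = 1; i <= NF; i++) if ($i ~ /%$/) loss = $i }
        /^(rtt|round-trip)/ { split($4, t, "/"); avg = t[2] }
        END { printf "loss=%s avg_ms=%s\n", loss, avg }
    '
}

# Usage on a live system (left commented out because it needs a network):
#   ping -c 100 host | ping_summary
```

When every packet is lost, ping prints no round-trip line, so `avg_ms` is left empty.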

Bandwidth Issues

Symptoms: Slow transfers, network congestion, dropped packets

Diagnosis:

# Test bandwidth
iperf3 -s  # Server
iperf3 -c server-ip  # Client

# Check interface utilization
vnstat -i eth0

Solutions:

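# Implement QoS if needed
sudo tc qdisc replace dev eth0 root fq_codel  # replace also works when a root qdisc already exists

# Optimize buffer sizes
ethtool -g eth0                       # Show current and maximum ring sizes first
sudo ethtool -G eth0 rx 4096 tx 4096  # Must not exceed the hardware maximums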

Emergency Recovery Procedures

Network Emergency Recovery

Complete network failure recovery:

# Reset all network configuration
# (run from a local or hypervisor console: flushing addresses drops SSH sessions)
sudo systemctl stop networking
sudo ip addr flush dev eth0
sudo ip route flush table main
sudo systemctl start networking

# Manual network configuration
sudo ip addr add 10.10.0.100/24 dev eth0
sudo ip route add default via 10.10.0.1
echo "nameserver 8.8.8.8" | sudo tee /etc/resolv.conf

SSH Emergency Access

When locked out of systems:

# Use emergency SSH key
ssh -i ~/.ssh/emergency_homelab_rsa user@host

# Via console access (if available)
# Use hypervisor console or physical access

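# Reset SSH to allow password auth temporarily (revert once key access is restored)
# The pattern also matches a commented-out or differently valued PasswordAuthentication line
sudo sed -i -E 's/^#?PasswordAuthentication .*/PasswordAuthentication yes/' /etc/ssh/sshd_config
sudo systemctl restart sshd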

Service Recovery

Critical service restoration:

# Restart all network services
sudo systemctl restart networking
sudo systemctl restart nginx
sudo systemctl restart sshd

# Emergency firewall disable
sudo ufw disable  # CAUTION: Only for troubleshooting

# Service-specific recovery
sudo systemctl restart docker
sudo systemctl restart systemd-resolved

Monitoring and Prevention

Network Health Monitoring

#!/bin/bash
# network-monitor.sh
CRITICAL_HOSTS="10.10.0.1 10.10.0.16 nas.homelab.local"
CRITICAL_SERVICES="https://homelab.local http://proxmox.homelab.local:8006"

for host in $CRITICAL_HOSTS; do
    # One ping with a 5-second timeout; log an alert to syslog on failure
    if ! ping -c1 -W5 "$host" >/dev/null 2>&1; then
        echo "ALERT: $host unreachable" | logger -t network-monitor
    fi
done

for service in $CRITICAL_SERVICES; do
    if ! curl -sSf --max-time 10 "$service" >/dev/null 2>&1; then
        echo "ALERT: $service unavailable" | logger -t network-monitor
    fi
done
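The monitor above alerts on a single failed ping or request, so one dropped packet is enough to raise an alarm. One way to reduce that noise, sketched here as an assumption and not as part of the original script, is a small `retry` wrapper (a hypothetical helper) that reports failure only when every attempt fails.

```shell
# retry ATTEMPTS DELAY COMMAND [ARGS...]
# Run COMMAND up to ATTEMPTS times, sleeping DELAY seconds between tries.
# Returns 0 as soon as one attempt succeeds, 1 if all of them fail.
retry() {
    local attempts="$1" delay="$2" i=1
    shift 2
    while [ "$i" -le "$attempts" ]; do
        if "$@"; then
            return 0
        fi
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    return 1
}

# Usage inside the monitoring loop (left commented out because it needs a network):
#   if ! retry 3 2 ping -c1 -W5 "$host" >/dev/null 2>&1; then
#       echo "ALERT: $host unreachable" | logger -t network-monitor
#   fi
```

The trade-off is slower detection: with three attempts, a two-second delay, and a five-second ping timeout, a host that is really down takes roughly twenty seconds to be reported.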

Automated Recovery Scripts

#!/bin/bash
# network-recovery.sh
if ! ping -c1 8.8.8.8 >/dev/null 2>&1; then
    echo "Network down, attempting recovery..."
    sudo systemctl restart networking
    sleep 10
    if ping -c1 8.8.8.8 >/dev/null 2>&1; then
        echo "Network recovered"
    else
        echo "Manual intervention required"
    fi
fi

Quick Reference Commands

Network Diagnostics

# Connectivity tests
ping host
traceroute host
mtr host
nc -zv host port

# Service checks
systemctl status networking
systemctl status nginx
systemctl status sshd

# Network configuration
ip addr show
ip route show
ss -tuln

Emergency Commands

# Network restart
sudo systemctl restart networking

# SSH emergency access
ssh -i ~/.ssh/emergency_homelab_rsa user@host

# Firewall quick disable (emergency only)
sudo ufw disable

# DNS quick fix
echo "nameserver 8.8.8.8" | sudo tee /etc/resolv.conf

This troubleshooting guide provides comprehensive solutions for common networking issues in home lab environments.