
---
title: Networking Troubleshooting Guide
description: Comprehensive troubleshooting for SSH, DNS, reverse proxy, SSL, CIFS/NFS mounts, Pi-hole HA, iOS DNS bypass, UniFi firewall rules, and emergency recovery procedures.
type: troubleshooting
domain: networking
tags:
  - ssh
  - dns
  - pihole
  - ssl
  - cifs
  - nfs
  - firewall
  - unifi
  - ios
  - nginx
  - troubleshooting
---

Networking Infrastructure Troubleshooting Guide

SSH Connection Issues

SSH Authentication Failures

Symptoms: Permission denied, connection refused, timeout

Diagnosis:

# Verbose SSH debugging
ssh -vvv user@host

# Test different authentication methods
ssh -o PasswordAuthentication=no user@host
ssh -o PubkeyAuthentication=yes user@host

# Check local key files
ls -la ~/.ssh/
ssh-keygen -lf ~/.ssh/homelab_rsa.pub

Solutions:

# Re-deploy SSH keys
ssh-copy-id -i ~/.ssh/homelab_rsa.pub user@host
ssh-copy-id -i ~/.ssh/emergency_homelab_rsa.pub user@host

# Fix key permissions
chmod 600 ~/.ssh/homelab_rsa
chmod 644 ~/.ssh/homelab_rsa.pub
chmod 700 ~/.ssh

# Verify remote authorized_keys
ssh user@host 'chmod 700 ~/.ssh && chmod 600 ~/.ssh/authorized_keys'

SSH Service Issues

Symptoms: Connection refused, service not running

Diagnosis:

# Check SSH service status
systemctl status sshd
ss -tlnp | grep :22

# Test port connectivity
nc -zv host 22
nmap -p 22 host

Solutions:

# Restart SSH service
sudo systemctl restart sshd
sudo systemctl enable sshd

# Check firewall
sudo ufw status
sudo ufw allow ssh

# Verify SSH configuration
sudo sshd -T | grep -E "(passwordauth|pubkeyauth|permitroot)"

Network Connectivity Problems

Basic Network Troubleshooting

Symptoms: Cannot reach hosts, timeouts, routing issues

Diagnosis:

# Basic connectivity tests
ping host
traceroute host
mtr host

# Check local network configuration
ip addr show
ip route show
cat /etc/resolv.conf

Solutions:

# Restart networking
sudo systemctl restart networking
sudo netplan apply  # Ubuntu

# Reset network interface
sudo ip link set eth0 down
sudo ip link set eth0 up

# Check default gateway
sudo ip route add default via 10.10.0.1

DNS Resolution Issues

Symptoms: Cannot resolve hostnames, slow resolution

Diagnosis:

# Test DNS resolution
nslookup google.com
dig google.com
host google.com

# Check DNS servers
systemd-resolve --status
cat /etc/resolv.conf

Solutions:

# Temporary DNS fix
echo "nameserver 8.8.8.8" | sudo tee /etc/resolv.conf

# Restart DNS services
sudo systemctl restart systemd-resolved

# Flush DNS cache
sudo systemd-resolve --flush-caches

UniFi Firewall Blocking DNS to New Networks

Symptoms: New network/VLAN has "no internet access" - devices connect to WiFi but cannot browse or resolve domain names. Ping to IP addresses (8.8.8.8) works, but DNS resolution fails.

Root Cause: Firewall rules blocking traffic from DNS servers (Pi-holes in "Servers" network group) to new networks. Rules like "Servers to WiFi" or "Servers to Home" with DROP action block ALL traffic including DNS responses on port 53.

Diagnosis:

# From affected device on new network:

# Test if routing works (should succeed)
ping 8.8.8.8
traceroute 8.8.8.8

# Test if DNS resolution works (will fail)
nslookup google.com

# Test DNS servers directly (will timeout or fail)
nslookup google.com 10.10.0.16
nslookup google.com 10.10.0.226

# Test public DNS (should work)
nslookup google.com 8.8.8.8

# Check DHCP-assigned DNS servers
# Windows:
ipconfig /all | findstr DNS

# Linux/macOS:
cat /etc/resolv.conf

If routing works but DNS fails, the issue is firewall blocking DNS traffic, not network configuration.

Solutions:

Step 1: Identify Blocking Rules

  • In UniFi: Settings → Firewall & Security → Traffic Rules → LAN In
  • Look for DROP rules with:
    • Source: Servers (or network group containing Pi-holes)
    • Destination: Your new network (e.g., "Home WiFi", "Home Network")
    • Examples: "Servers to WiFi", "Servers to Home"

Step 2: Create DNS Allow Rules (BEFORE Drop Rules)

Create new rules positioned ABOVE the drop rules:

Name: Allow DNS - Servers to [Network Name]
Action: Accept
Rule Applied: Before Predefined Rules
Type: LAN In
Protocol: TCP and UDP
Source:
  - Network/Group: Servers (or specific Pi-hole IPs: 10.10.0.16, 10.10.0.226)
  - Port: Any
Destination:
  - Network: [Your new network - e.g., Home WiFi]
  - Port: 53 (DNS)

Repeat for each network that needs DNS access from servers.

Step 3: Verify Rule Order

CRITICAL: Firewall rules are processed top-to-bottom; the first match wins!

Correct order:

✅ Allow DNS - Servers to Home Network (Accept, Port 53)
✅ Allow DNS - Servers to Home WiFi (Accept, Port 53)
❌ Servers to Home (Drop, All ports)
❌ Servers to WiFi (Drop, All ports)
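The first-match semantics can be sketched in a few lines of Python (illustrative only — not UniFi's actual implementation; rule names mirror the example above):

```python
# Illustrative first-match firewall evaluation: rules are checked
# top-to-bottom and the first matching rule decides the verdict.
rules = [
    {"name": "Allow DNS - Servers to Home WiFi", "src": "Servers", "dst_port": 53, "action": "ACCEPT"},
    {"name": "Servers to WiFi", "src": "Servers", "dst_port": None, "action": "DROP"},  # None = any port
]

def evaluate(src, dst_port):
    for rule in rules:
        if rule["src"] == src and rule["dst_port"] in (None, dst_port):
            return rule["action"]  # later rules are never consulted
    return "ACCEPT"  # default policy if nothing matches (assumption)

print(evaluate("Servers", 53))  # DNS reply hits the allow rule first → ACCEPT
print(evaluate("Servers", 22))  # everything else falls through to the drop rule → DROP
```

Swap the two rules and the DROP matches port 53 traffic first — exactly the failure mode described above.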

Step 4: Re-enable Drop Rules

Once DNS allow rules are in place and positioned correctly, re-enable the drop rules.

Verification:

# From device on new network:

# DNS should work
nslookup google.com

# Browsing should work
ping google.com

# Other server traffic should still be blocked (expected)
ping 10.10.0.16  # Should fail or timeout
ssh 10.10.0.16   # Should be blocked

Real-World Example: New "Home WiFi" network (10.1.0.0/24, VLAN 2)

  • Problem: Devices connected but couldn't browse web
  • Diagnosis: traceroute 8.8.8.8 worked (16ms), but nslookup google.com failed
  • Cause: Firewall rule "Servers to WiFi" (rule 20004) blocked Pi-hole DNS responses
  • Solution: Added "Allow DNS - Servers to Home WiFi" rule (Accept, port 53) above drop rule
  • Result: DNS resolution works, other server traffic remains properly blocked

Reverse Proxy and Load Balancer Issues

Nginx Configuration Problems

Symptoms: 502 Bad Gateway, 503 Service Unavailable, SSL errors

Diagnosis:

# Check Nginx status and logs
systemctl status nginx
sudo tail -f /var/log/nginx/error.log
sudo tail -f /var/log/nginx/access.log

# Test Nginx configuration
sudo nginx -t
sudo nginx -T  # Show full configuration

Solutions:

# Reload Nginx configuration
sudo nginx -s reload

# Check upstream servers
curl -I http://backend-server:port
telnet backend-server port

# Fix common configuration issues
sudo nano /etc/nginx/sites-available/default
# Check proxy_pass URLs, upstream definitions

SSL/TLS Certificate Issues

Symptoms: Certificate warnings, expired certificates, connection errors

Diagnosis:

# Check certificate validity
openssl s_client -connect host:443 -servername host
openssl x509 -in /etc/ssl/certs/cert.pem -text -noout

# Check certificate expiry
openssl x509 -in /etc/ssl/certs/cert.pem -noout -dates

Solutions:

# Renew Let's Encrypt certificates
sudo certbot renew --dry-run
sudo certbot renew --force-renewal

# Generate self-signed certificate
sudo openssl req -x509 -nodes -days 365 -newkey rsa:2048 \
    -keyout /etc/ssl/private/selfsigned.key \
    -out /etc/ssl/certs/selfsigned.crt

Intermittent SSL Errors (ERR_SSL_UNRECOGNIZED_NAME_ALERT)

Symptoms: Connections that intermittently fail with SSL errors, ERR_SSL_UNRECOGNIZED_NAME_ALERT in the browser, access from the internal network working only some of the time

Root Cause: IPv6/IPv4 DNS conflicts where public DNS returns Cloudflare IPv6 addresses while local DNS (Pi-hole) only overrides IPv4. Modern systems prefer IPv6, causing intermittent failures when IPv6 connection attempts fail.

Diagnosis:

# Check for multiple DNS records (IPv4 + IPv6)
nslookup domain.example.com 10.10.0.16
dig domain.example.com @10.10.0.16

# Compare with public DNS
host domain.example.com 8.8.8.8

# Test IPv6 vs IPv4 connectivity
curl -6 -I https://domain.example.com  # IPv6 (may fail)
curl -4 -I https://domain.example.com  # IPv4 (should work)

# Check if system has IPv6 connectivity
ip -6 addr show | grep global

Example Problem:

# Local Pi-hole returns:
domain.example.com → 10.10.0.16 (IPv4 internal NPM)

# Public DNS also returns:
domain.example.com → 2606:4700:... (Cloudflare IPv6)

# System tries IPv6 first → fails
# Sometimes falls back to IPv4 → works
# Result: Intermittent SSL errors
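The address-family preference driving this can be observed with Python's getaddrinfo, which returns candidates in the order the OS would try them (querying localhost here so the sketch needs no network access):

```python
import socket

# getaddrinfo returns candidate addresses in OS preference order (RFC 6724);
# on dual-stack systems IPv6 addresses typically sort ahead of IPv4 ones,
# which is why a stray public AAAA record causes intermittent failures.
def candidates(host, port=443):
    results = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    return [("IPv6" if family == socket.AF_INET6 else "IPv4", sockaddr[0])
            for family, _, _, _, sockaddr in results]

for family, ip in candidates("localhost"):
    print(family, ip)
```

Run this against the affected domain on a client: if an unwanted IPv6 address sorts first, that is the address the system tries before falling back to the IPv4 override.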

Solutions:

Option 1: Add IPv6 Local DNS Override (Recommended)

# Add non-routable IPv6 address to Pi-hole custom.list
ssh pihole "docker exec pihole bash -c 'echo \"fe80::1 domain.example.com\" >> /etc/pihole/custom.list'"

# Restart Pi-hole DNS
ssh pihole "docker exec pihole pihole restartdns"

# Verify fix
nslookup domain.example.com 10.10.0.16
# Should show: 10.10.0.16 (IPv4) and fe80::1 (IPv6 link-local)

Option 2: Remove Cloudflare DNS Records (If public access not needed)

# In Cloudflare dashboard:
# - Turn off orange cloud (proxy) for the domain
# - Or delete A/AAAA records entirely

# This removes Cloudflare IPs from public DNS

Option 3: Disable IPv6 on Client (Temporary testing)

# Disable IPv6 temporarily to confirm diagnosis
sudo sysctl -w net.ipv6.conf.all.disable_ipv6=1

# Test domain - should work consistently now

# Re-enable when done testing
sudo sysctl -w net.ipv6.conf.all.disable_ipv6=0

Verification:

# After applying fix, verify consistent resolution
for i in {1..10}; do
    echo "Test $i:"
    curl -I https://domain.example.com 2>&1 | grep -E "(HTTP|SSL|certificate)"
    sleep 1
done

# All attempts should succeed consistently

Real-World Example: git.manticorum.com

  • Problem: Intermittent SSL errors from internal network (10.0.0.0/24)
  • Diagnosis: Pi-hole had IPv4 override (10.10.0.16) but public DNS returned Cloudflare IPv6
  • Solution: Added fe80::1 git.manticorum.com to Pi-hole custom.list
  • Result: Consistent successful connections, always routes to internal NPM

iOS DNS Bypass Issues (Encrypted DNS)

Symptoms: iOS device gets 403 errors when accessing internal services, NPM logs show external public IP as source instead of local 10.x.x.x IP, even with correct Pi-hole DNS configuration

Root Cause: iOS devices can use encrypted DNS (DNS-over-HTTPS or DNS-over-TLS) that bypasses traditional DNS servers, even when correctly configured. This causes the device to resolve to public/Cloudflare IPs instead of local overrides, routing traffic through the public internet and triggering ACL denials.

Diagnosis:

# Check NPM access logs for the service
ssh 10.10.0.16 "docker exec nginx-proxy-manager_app_1 tail -50 /data/logs/proxy-host-*_access.log | grep 403"

# Look for external IPs in logs instead of local 10.x.x.x:
# BAD:  [Client 73.36.102.55] - - 403 (external IP, blocked by ACL)
# GOOD: [Client 10.0.0.207] - 200 200 (local IP, allowed)

# Verify iOS device is on local network
# On iOS: Settings → Wi-Fi → (i) → IP Address
# Should show 10.0.0.x or 10.10.0.x

# Verify Pi-hole DNS is configured
# On iOS: Settings → Wi-Fi → (i) → DNS
# Should show 10.10.0.16

# Test if DNS is actually being used
nslookup domain.example.com 10.10.0.16  # Shows what Pi-hole returns
# Then check what iOS actually resolves (if possible via network sniffer)

Example Problem:

# iOS device configuration:
IP Address: 10.0.0.207 (correct, on local network)
DNS: 10.10.0.16 (correct, Pi-hole configured)
Cellular Data: OFF

# But NPM logs show:
[Client 73.36.102.55] - - 403  # Coming from ISP public IP!

# Why: iOS is using encrypted DNS, bypassing Pi-hole
# Result: Resolves to Cloudflare IP, routes through public internet,
#         NPM sees external IP, ACL blocks with 403

Solutions:

Option 1: Add Public IP to NPM Access Rules (Quickest, recommended for mobile devices)

# Find which config file contains your domain
ssh 10.10.0.16 "docker exec nginx-proxy-manager_app_1 sh -c 'grep -l domain.example.com /data/nginx/proxy_host/*.conf'"
# Example output: /data/nginx/proxy_host/19.conf

# Add public IP to access rules (replace YOUR_PUBLIC_IP and config number)
ssh 10.10.0.16 "docker exec nginx-proxy-manager_app_1 sed -i '/allow 10.10.0.0\/24;/a \    \n    allow YOUR_PUBLIC_IP;' /data/nginx/proxy_host/19.conf"

# Verify the change
ssh 10.10.0.16 "docker exec nginx-proxy-manager_app_1 cat /data/nginx/proxy_host/19.conf" | grep -A 8 "Access Rules"

# Test and reload nginx
ssh 10.10.0.16 "docker exec nginx-proxy-manager_app_1 nginx -t"
ssh 10.10.0.16 "docker exec nginx-proxy-manager_app_1 nginx -s reload"

Option 2: Reset iOS Network Settings (Nuclear option, clears DNS cache/profiles)

iOS: Settings → General → Transfer or Reset iPhone → Reset → Reset Network Settings
WARNING: This removes all saved WiFi passwords and network configurations

Option 3: Check for DNS Configuration Profiles

iOS: Settings → General → VPN & Device Management
- Look for any DNS or Configuration Profiles
- Remove any third-party DNS profiles (AdGuard, NextDNS, etc.)

Option 4: Disable Private Relay and IP Tracking (Usually already tried)

iOS: Settings → [Your Name] → iCloud → Private Relay → OFF
iOS: Settings → Wi-Fi → (i) → Limit IP Address Tracking → OFF

Option 5: Check Browser DNS Settings (If using Brave or Firefox)

Brave: Settings → Brave Shields & Privacy → Use secure DNS → OFF
Firefox: Settings → DNS over HTTPS → OFF

Verification:

# After applying fix, check NPM logs while accessing from iOS
ssh 10.10.0.16 "docker exec nginx-proxy-manager_app_1 tail -f /data/logs/proxy-host-*_access.log"

# With Option 1 (added public IP): Should see 200 status with external IP
# With Option 2-5 (fixed DNS): Should see 200 status with local 10.x.x.x IP

Important Notes:

  • Option 1 is recommended for mobile devices as iOS encrypted DNS behavior is inconsistent
  • Public IP workaround requires updating if ISP changes your IP (rare for residential)
  • Manual nginx config changes (Option 1) will be overwritten if you edit the proxy host in NPM UI
  • To make permanent, either use NPM UI to add the IP, or re-apply after UI changes
  • This issue can affect any iOS device (iPhone, iPad) and some Android devices with encrypted DNS

Real-World Example: git.manticorum.com iOS Access

  • Problem: iPhone showing 403 errors, desktop working fine on same network
  • iOS Config: IP 10.0.0.207, DNS 10.10.0.16, Cellular OFF (all correct)
  • NPM Logs: iPhone requests showing as [Client 73.36.102.55] (ISP public IP)
  • Diagnosis: iOS using encrypted DNS, bypassing Pi-hole, routing through Cloudflare
  • Solution: Added allow 73.36.102.55; to NPM proxy_host/19.conf ACL rules
  • Result: Immediate access, user able to log in to Gitea successfully

Network Storage Issues

CIFS/SMB Mount Problems

Symptoms: Mount failures, connection timeouts, permission errors

Diagnosis:

# Test SMB connectivity
smbclient -L //nas-server -U username
testparm  # Test Samba configuration

# Check mount status
mount | grep cifs
df -h | grep cifs

Solutions:

# Remount with explicit SMB version and credentials (add -v to mount for verbose output)
sudo mount -t cifs //server/share /mnt/point -o username=user,password=pass,vers=3.0

# Fix mount options in /etc/fstab
//server/share /mnt/point cifs credentials=/etc/cifs/credentials,uid=1000,gid=1000,iocharset=utf8,file_mode=0644,dir_mode=0755,cache=strict,_netdev 0 0

# Test credentials
sudo cat /etc/cifs/credentials
# Should contain: username=, password=, domain=
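A minimal /etc/cifs/credentials file looks like this (values are placeholders — substitute your own, and keep the file owned by root with mode 600):

```
username=smb-user
password=changeme
domain=WORKGROUP
```

The domain= line is only needed for domain-joined shares; omit it for standalone NAS accounts.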

NFS Mount Issues

Symptoms: Stale file handles, mount hangs, permission denied

Diagnosis:

# Check NFS services
systemctl status nfs-client.target
showmount -e nfs-server

# Test NFS connectivity
rpcinfo -p nfs-server

Solutions:

# Restart NFS services
sudo systemctl restart nfs-client.target

# Remount NFS shares
sudo umount /mnt/nfs-share
sudo mount -t nfs server:/path /mnt/nfs-share

# Fix stale file handles
sudo umount -f /mnt/nfs-share
sudo mount /mnt/nfs-share

Firewall and Security Issues

Port Access Problems

Symptoms: Connection refused, filtered ports, blocked services

Diagnosis:

# Check firewall status
sudo ufw status verbose
sudo iptables -L -n -v

# Test port accessibility
nc -zv host port
nmap -p port host

Solutions:

# Open required ports
sudo ufw allow ssh
sudo ufw allow 80/tcp
sudo ufw allow 443/tcp
sudo ufw allow from 10.10.0.0/24

# Reset firewall if needed
sudo ufw --force reset
sudo ufw enable

Network Security Issues

Symptoms: Unauthorized access, suspicious traffic, security alerts

Diagnosis:

# Check active connections
ss -tuln
netstat -tuln

# Review logs for security events
sudo tail -f /var/log/auth.log
sudo tail -f /var/log/syslog | grep -i security

Solutions:

# Block suspicious IPs
sudo ufw deny from suspicious-ip

# Update SSH security
sudo nano /etc/ssh/sshd_config
# Set: PasswordAuthentication no, PermitRootLogin no
sudo systemctl restart sshd

Pi-hole High Availability Troubleshooting

Pi-hole Not Responding to DNS Queries

Symptoms: DNS resolution failures, clients cannot resolve domains, Pi-hole web UI inaccessible

Diagnosis:

# Test DNS response from both Pi-holes
dig @10.10.0.16 google.com
dig @10.10.0.226 google.com

# Check Pi-hole container status
ssh npm-pihole "docker ps | grep pihole"
ssh ubuntu-manticore "docker ps | grep pihole"

# Check Pi-hole logs
ssh npm-pihole "docker logs pihole --tail 50"
ssh ubuntu-manticore "docker logs pihole --tail 50"

# Test port 53 is listening
ssh ubuntu-manticore "netstat -tulpn | grep :53"
ssh ubuntu-manticore "ss -tulpn | grep :53"

Solutions:

# Restart Pi-hole containers
ssh npm-pihole "docker restart pihole"
ssh ubuntu-manticore "cd ~/docker/pihole && docker compose restart"

# Check for port conflicts
ssh ubuntu-manticore "lsof -i :53"

# If systemd-resolved is conflicting, disable it
ssh ubuntu-manticore "sudo systemctl stop systemd-resolved"
ssh ubuntu-manticore "sudo systemctl disable systemd-resolved"

# Rebuild Pi-hole container
ssh ubuntu-manticore "cd ~/docker/pihole && docker compose down && docker compose up -d"

DNS Failover Not Working

Symptoms: DNS stops working when primary Pi-hole fails, clients not using secondary DNS

Diagnosis:

# Check UniFi DHCP DNS configuration
# Via UniFi UI: Settings → Networks → LAN → DHCP
# DNS Server 1: 10.10.0.16
# DNS Server 2: 10.10.0.226

# Check client DNS configuration
# Windows:
ipconfig /all | findstr /i "DNS"

# Linux/macOS:
cat /etc/resolv.conf

# Check if secondary Pi-hole is reachable
ping -c 4 10.10.0.226
dig @10.10.0.226 google.com

# Test failover manually
ssh npm-pihole "docker stop pihole"
dig google.com  # Should still work via secondary
ssh npm-pihole "docker start pihole"

Solutions:

# Force DHCP lease renewal to get updated DNS servers
# Windows:
ipconfig /release && ipconfig /renew

# Linux:
sudo dhclient -r && sudo dhclient

# macOS/iOS:
# Disconnect and reconnect to WiFi

# Verify UniFi DHCP settings are correct
# Both DNS servers must be configured in UniFi controller

# Check client respects both DNS servers
# Some clients may cache failed DNS responses
# Flush DNS cache:
# Windows: ipconfig /flushdns
# macOS: sudo dscacheutil -flushcache
# Linux: sudo systemd-resolve --flush-caches

Orbital Sync Not Syncing

Symptoms: Blocklists/whitelists differ between Pi-holes, custom DNS entries missing on secondary

Diagnosis:

# Check Orbital Sync container status
ssh ubuntu-manticore "docker ps | grep orbital-sync"

# Check Orbital Sync logs
ssh ubuntu-manticore "docker logs orbital-sync --tail 100"

# Look for sync errors in logs
ssh ubuntu-manticore "docker logs orbital-sync 2>&1 | grep -i error"

# Verify API tokens are correct
ssh ubuntu-manticore "cat ~/docker/orbital-sync/.env"

# Test API access manually (the API token is under Settings → API in the web UI,
# or stored as WEBPASSWORD in setupVars.conf)
ssh npm-pihole "docker exec pihole grep WEBPASSWORD /etc/pihole/setupVars.conf"
curl "http://10.10.0.16/admin/api.php?status&auth=YOUR_TOKEN"

# Compare blocklist domain counts between Pi-holes
ssh npm-pihole "docker exec pihole pihole-FTL sqlite3 /etc/pihole/gravity.db 'SELECT COUNT(*) FROM gravity;'"
ssh ubuntu-manticore "docker exec pihole pihole-FTL sqlite3 /etc/pihole/gravity.db 'SELECT COUNT(*) FROM gravity;'"

Solutions:

# Regenerate API tokens
# Primary Pi-hole: http://10.10.0.16/admin → Settings → API → Generate New Token
# Secondary Pi-hole: http://10.10.0.226:8053/admin → Settings → API → Generate New Token

# Update Orbital Sync .env file
ssh ubuntu-manticore "nano ~/docker/orbital-sync/.env"
# Update PRIMARY_HOST_PASSWORD and SECONDARY_HOST_PASSWORD

# Restart Orbital Sync
ssh ubuntu-manticore "cd ~/docker/orbital-sync && docker compose restart"

# Force immediate sync by restarting
ssh ubuntu-manticore "cd ~/docker/orbital-sync && docker compose down && docker compose up -d"

# Monitor sync in real-time
ssh ubuntu-manticore "docker logs orbital-sync -f"

# If all else fails, manually sync via Teleporter
# Primary: Settings → Teleporter → Backup
# Secondary: Settings → Teleporter → Restore (upload backup file)

NPM DNS Sync Failing

Symptoms: NPM proxy hosts missing from Pi-hole custom.list, new domains not resolving

Diagnosis:

# Check NPM sync script status
ssh npm-pihole "cat /var/log/cron.log | grep npm-pihole-sync"

# Run sync script manually to see errors
ssh npm-pihole "/home/cal/scripts/npm-pihole-sync.sh"

# Check script can access both Pi-holes
ssh npm-pihole "docker exec pihole cat /etc/pihole/custom.list | grep git.manticorum.com"
ssh npm-pihole "ssh ubuntu-manticore 'docker exec pihole cat /etc/pihole/custom.list | grep git.manticorum.com'"

# Verify SSH connectivity to ubuntu-manticore
ssh npm-pihole "ssh ubuntu-manticore 'echo SSH OK'"

Solutions:

# Fix SSH key authentication (if needed)
ssh npm-pihole "ssh-copy-id ubuntu-manticore"

# Test script with dry-run
ssh npm-pihole "/home/cal/scripts/npm-pihole-sync.sh --dry-run"

# Run script manually to sync immediately
ssh npm-pihole "/home/cal/scripts/npm-pihole-sync.sh"

# Verify cron job is configured
ssh npm-pihole "crontab -l | grep npm-pihole-sync"

# If cron job missing, add it
ssh npm-pihole "crontab -e"
# Add: 0 * * * * /home/cal/scripts/npm-pihole-sync.sh >> /var/log/npm-pihole-sync.log 2>&1

# Check script logs
ssh npm-pihole "tail -50 /var/log/npm-pihole-sync.log"

Secondary Pi-hole Performance Issues

Symptoms: ubuntu-manticore slow, high CPU/RAM usage, Pi-hole affecting Jellyfin/Tdarr

Diagnosis:

# Check resource usage
ssh ubuntu-manticore "docker stats --no-stream"

# Pi-hole should use <1% CPU and ~150MB RAM
# If higher, investigate:
ssh ubuntu-manticore "docker logs pihole --tail 100"

# Check for excessive queries
ssh ubuntu-manticore "docker exec pihole pihole -c -e"

# Check for DNS loops or misconfiguration
ssh ubuntu-manticore "docker exec pihole pihole -t"  # Tail pihole.log

Solutions:

# Restart Pi-hole if resource usage is high
ssh ubuntu-manticore "docker restart pihole"

# Check for DNS query loops
# Look for same domain being queried repeatedly
ssh ubuntu-manticore "docker exec pihole pihole -t | grep -A 5 'query\[A\]'"

# Adjust Pi-hole cache settings if needed
ssh ubuntu-manticore "docker exec pihole bash -c 'echo \"cache-size=10000\" >> /etc/dnsmasq.d/99-custom.conf'"
ssh ubuntu-manticore "docker restart pihole"

# If Jellyfin/Tdarr are affected, verify Pi-hole is using minimal resources
# Resource limits can be added to docker-compose.yml:
ssh ubuntu-manticore "nano ~/docker/pihole/docker-compose.yml"
# Add under pihole service:
#   deploy:
#     resources:
#       limits:
#         cpus: '0.5'
#         memory: 256M

iOS Devices Still Getting 403 Errors (Post-HA Deployment)

Symptoms: After deploying dual Pi-hole setup, iOS devices still bypass DNS and get 403 errors on internal services

Diagnosis:

# Verify UniFi DHCP has BOTH Pi-holes configured, NO public DNS
# UniFi UI: Settings → Networks → LAN → DHCP → Name Server
# DNS1: 10.10.0.16
# DNS2: 10.10.0.226
# Public DNS (1.1.1.1, 8.8.8.8): REMOVED

# Check iOS DNS settings
# iOS: Settings → WiFi → (i) → DNS
# Should show: 10.10.0.16

# Force iOS DHCP renewal
# iOS: Settings → WiFi → Forget Network → Reconnect

# Check NPM logs for request source
ssh npm-pihole "docker exec nginx-proxy-manager_app_1 tail -50 /data/logs/proxy-host-*_access.log | grep 403"

# Verify both Pi-holes have custom DNS entries
ssh npm-pihole "docker exec pihole cat /etc/pihole/custom.list | grep git.manticorum.com"
ssh ubuntu-manticore "docker exec pihole cat /etc/pihole/custom.list | grep git.manticorum.com"

Solutions:

# Solution 1: Verify public DNS is removed from UniFi DHCP
# If public DNS (1.1.1.1) is still configured, iOS will prefer it
# Remove ALL public DNS servers from UniFi DHCP configuration

# Solution 2: Force iOS to renew DHCP lease
# iOS: Settings → WiFi → Forget Network
# Then reconnect to WiFi
# This forces device to get new DNS servers from DHCP

# Solution 3: Disable iOS encrypted DNS if still active
# iOS: Settings → [Your Name] → iCloud → Private Relay → OFF
# iOS: Check for DNS profiles: Settings → General → VPN & Device Management

# Solution 4: If encrypted DNS persists, add public IP to NPM ACL (fallback)
# See "iOS DNS Bypass Issues" section above for detailed steps

# Solution 5: Test with different iOS device to isolate issue
# If other iOS devices work, issue is device-specific configuration

# Verification after fix
ssh npm-pihole "docker exec nginx-proxy-manager_app_1 tail -f /data/logs/proxy-host-*_access.log"
# Access git.manticorum.com from iOS
# Should see: [Client 10.0.0.x] - - 200 (local IP)

Both Pi-holes Failing Simultaneously

Symptoms: Complete DNS failure across network, all devices cannot resolve domains

Diagnosis:

# Check both Pi-hole containers
ssh npm-pihole "docker ps -a | grep pihole"
ssh ubuntu-manticore "docker ps -a | grep pihole"

# Check both hosts are reachable
ping -c 4 10.10.0.16
ping -c 4 10.10.0.226

# Check Docker daemon on both hosts
ssh npm-pihole "systemctl status docker"
ssh ubuntu-manticore "systemctl status docker"

# Test emergency DNS (bypassing Pi-hole)
dig @8.8.8.8 google.com

Solutions:

# Emergency: Temporarily use public DNS
# UniFi UI: Settings → Networks → LAN → DHCP → Name Server
# DNS1: 8.8.8.8 (Google DNS - temporary)
# DNS2: 1.1.1.1 (Cloudflare - temporary)

# Restart both Pi-holes
ssh npm-pihole "docker restart pihole"
ssh ubuntu-manticore "docker restart pihole"

# If Docker daemon issues:
ssh npm-pihole "sudo systemctl restart docker"
ssh ubuntu-manticore "sudo systemctl restart docker"

# Rebuild both Pi-holes if corruption suspected
ssh npm-pihole "cd ~/pihole && docker compose down && docker compose up -d"
ssh ubuntu-manticore "cd ~/docker/pihole && docker compose down && docker compose up -d"

# After Pi-holes are restored, revert UniFi DHCP to Pi-holes
# UniFi UI: Settings → Networks → LAN → DHCP → Name Server
# DNS1: 10.10.0.16
# DNS2: 10.10.0.226

Query Load Not Balanced Between Pi-holes

Symptoms: Primary Pi-hole getting most queries, secondary rarely used

Diagnosis:

# Check query counts on both Pi-holes
# Primary: http://10.10.0.16/admin → Dashboard → Total Queries
# Secondary: http://10.10.0.226:8053/admin → Dashboard → Total Queries

# This is NORMAL behavior - clients prefer DNS1 by default
# Secondary is for failover, not load balancing

# To verify failover works:
ssh npm-pihole "docker stop pihole"
# Wait 30 seconds
# Check secondary query count - should increase
ssh npm-pihole "docker start pihole"

Solutions:

# No action needed - this is expected behavior
# DNS failover is for redundancy, not load distribution

# If you want true load balancing (advanced):
# Option 1: Configure some devices to prefer DNS2
# Manually set DNS on specific devices to 10.10.0.226, 10.10.0.16

# Option 2: Implement DNS round-robin (requires custom DHCP)
# Not recommended for homelab - adds complexity

# Option 3: Accept default behavior (recommended)
# Primary handles most traffic, secondary provides failover
# This is industry standard DNS HA behavior

Pi-hole Blocklist Blocking Legitimate Apps

Facebook Blocklist Breaking Messenger Kids (2026-03-05)

Symptoms: iPad could not connect to Facebook Messenger Kids. App would not load or send/receive messages. Disconnecting iPad from WiFi (using cellular) restored functionality.

Root Cause: The anudeepND/blacklist/master/facebook.txt blocklist was subscribed in Pi-hole, which blocked all core Facebook domains needed by Messenger Kids.

Blocked Domains (from pihole.log):

Domain                          Purpose
edge-mqtt.facebook.com          MQTT real-time message transport
graph.facebook.com              Facebook Graph API (login, contacts, profiles)
graph-fallback.facebook.com     Graph API fallback (blocked via CNAME chain)
www.facebook.com                Core Facebook domain

Allowed Domains (not on the blocklist, resolved fine):

  • dgw.c10r.facebook.com - Data gateway
  • mqtt.fallback.c10r.facebook.com - MQTT fallback
  • chat-e2ee.c10r.facebook.com - E2E encrypted chat

Diagnosis:

# Find blocked domains for a specific client IP
ssh pihole "docker exec pihole grep 'CLIENT_IP' /var/log/pihole/pihole.log | grep 'gravity blocked'"

# Check which blocklist contains a domain
ssh pihole "docker exec pihole pihole -q edge-mqtt.facebook.com"
# Output: https://raw.githubusercontent.com/anudeepND/blacklist/master/facebook.txt (block)

Resolution: Removed the Facebook blocklist from primary Pi-hole (secondary didn't have it). The blocklist contained ~3,997 Facebook domains.

Pi-hole v6 API - Deleting a Blocklist:

# Authenticate and get session ID
SID=$(curl -s -X POST 'http://PIHOLE_IP:PORT/api/auth' \
  -H 'Content-Type: application/json' \
  -d '{"password":"APP_PASSWORD"}' \
  | python3 -c 'import sys,json; print(json.load(sys.stdin)["session"]["sid"])')

# DELETE uses the URL-encoded list ADDRESS as path parameter (NOT numeric ID)
# The ?type=block parameter is REQUIRED
curl -s -X DELETE \
  "http://PIHOLE_IP:PORT/api/lists/URL_ENCODED_LIST_ADDRESS?type=block" \
  -H "X-FTL-SID: $SID"
# Success returns HTTP 204 No Content

# Update gravity after removal
ssh pihole "docker exec pihole pihole -g"

# Verify domain is no longer blocked
ssh pihole "docker exec pihole pihole -q edge-mqtt.facebook.com"

Important Pi-hole v6 API Notes:

  • List endpoints use the URL-encoded blocklist address as path param, not numeric IDs
  • ?type=block query parameter is mandatory for DELETE operations
  • Numeric ID DELETE returns 200 with {"took": ...} but DOES NOT actually delete (silent failure)
  • Successful address-based DELETE returns HTTP 204 (no body)
  • Must run pihole -g (gravity update) after deletion for changes to take effect
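Since the DELETE path parameter is the URL-encoded list address, it helps to generate the encoding reliably rather than by hand; a small sketch using python3 for the percent-encoding (the list URL shown is the one removed above):

```shell
# Percent-encode the blocklist address for use as the DELETE path parameter.
# safe="" forces ':' and '/' to be encoded too.
list_url="https://raw.githubusercontent.com/anudeepND/blacklist/master/facebook.txt"
encoded=$(python3 -c 'import sys, urllib.parse; print(urllib.parse.quote(sys.argv[1], safe=""))' "$list_url")
echo "$encoded"
# → https%3A%2F%2Fraw.githubusercontent.com%2FanudeepND%2Fblacklist%2Fmaster%2Ffacebook.txt
```

The echoed value is what replaces URL_ENCODED_LIST_ADDRESS in the curl DELETE call above.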

Future Improvement (TODO): Implement Pi-hole v6 group/client-based approach:

  • Create a group for the iPad that bypasses the Facebook blocklist
  • Re-add the Facebook blocklist assigned to the default group only
  • Assign the iPad's IP to a "Kids Devices" client group that excludes the Facebook list
  • This would maintain Facebook blocking for other devices while allowing Messenger Kids
  • See: Pi-hole v6 Admin -> Groups/Clients for per-device blocklist management

Service Discovery and DNS Issues

Local DNS Problems

Symptoms: Services unreachable by hostname, DNS timeouts

Diagnosis:

# Test local DNS resolution
nslookup service.homelab.local
dig @10.10.0.16 service.homelab.local

# Check DNS server status
systemctl status bind9  # or named

Solutions:

# Add to /etc/hosts as temporary fix
echo "10.10.0.100 service.homelab.local" | sudo tee -a /etc/hosts

# Restart DNS services
sudo systemctl restart bind9
sudo systemctl restart systemd-resolved

Container Networking Issues

Symptoms: Containers cannot communicate, service discovery fails

Diagnosis:

# Check Docker networks
docker network ls
docker network inspect bridge

# Test container connectivity
docker exec container1 ping container2
docker exec container1 nslookup container2

Solutions:

# Create custom network
docker network create --driver bridge app-network
docker run --network app-network container

# Fix DNS in containers
docker run --dns 8.8.8.8 container

Performance Issues

Network Latency Problems

Symptoms: Slow response times, timeouts, poor performance

Diagnosis:

# Measure network latency
ping -c 100 host
mtr --report host

# Check network interface stats
ip -s link show
cat /proc/net/dev

Solutions:

# Optimize network settings
echo 'net.core.rmem_max = 134217728' | sudo tee -a /etc/sysctl.conf
echo 'net.core.wmem_max = 134217728' | sudo tee -a /etc/sysctl.conf
sudo sysctl -p

# Check for network congestion
iftop
nethogs

Bandwidth Issues

Symptoms: Slow transfers, network congestion, dropped packets

Diagnosis:

# Test bandwidth
iperf3 -s  # Server
iperf3 -c server-ip  # Client

# Check interface utilization
vnstat -i eth0

Solutions:

# Implement QoS if needed
sudo tc qdisc add dev eth0 root fq_codel

# Optimize buffer sizes
sudo ethtool -G eth0 rx 4096 tx 4096

Emergency Recovery Procedures

Network Emergency Recovery

Complete network failure recovery:

# Reset all network configuration
sudo systemctl stop networking
sudo ip addr flush eth0
sudo ip route flush table main
sudo systemctl start networking

# Manual network configuration
sudo ip addr add 10.10.0.100/24 dev eth0
sudo ip route add default via 10.10.0.1
echo "nameserver 8.8.8.8" | sudo tee /etc/resolv.conf

SSH Emergency Access

When locked out of systems:

# Use emergency SSH key
ssh -i ~/.ssh/emergency_homelab_rsa user@host

# Via console access (if available)
# Use hypervisor console or physical access

# Reset SSH to allow password auth temporarily
sudo sed -i 's/PasswordAuthentication no/PasswordAuthentication yes/' /etc/ssh/sshd_config
sudo systemctl restart sshd

Service Recovery

Critical service restoration:

# Restart all network services
sudo systemctl restart networking
sudo systemctl restart nginx
sudo systemctl restart sshd

# Emergency firewall disable
sudo ufw disable  # CAUTION: Only for troubleshooting

# Service-specific recovery
sudo systemctl restart docker
sudo systemctl restart systemd-resolved

Monitoring and Prevention

Network Health Monitoring

#!/bin/bash
# network-monitor.sh
CRITICAL_HOSTS="10.10.0.1 10.10.0.16 nas.homelab.local"
CRITICAL_SERVICES="https://homelab.local http://proxmox.homelab.local:8006"

for host in $CRITICAL_HOSTS; do
    if ! ping -c1 -W5 $host >/dev/null 2>&1; then
        echo "ALERT: $host unreachable" | logger -t network-monitor
    fi
done

for service in $CRITICAL_SERVICES; do
    if ! curl -sSf --max-time 10 "$service" >/dev/null 2>&1; then
        echo "ALERT: $service unavailable" | logger -t network-monitor
    fi
done

Automated Recovery Scripts

#!/bin/bash
# network-recovery.sh
if ! ping -c1 8.8.8.8 >/dev/null 2>&1; then
    echo "Network down, attempting recovery..."
    sudo systemctl restart networking
    sleep 10
    if ping -c1 8.8.8.8 >/dev/null 2>&1; then
        echo "Network recovered"
    else
        echo "Manual intervention required"
    fi
fi

Quick Reference Commands

Network Diagnostics

# Connectivity tests
ping host
traceroute host
mtr host
nc -zv host port

# Service checks
systemctl status networking
systemctl status nginx
systemctl status sshd

# Network configuration
ip addr show
ip route show
ss -tuln

Emergency Commands

# Network restart
sudo systemctl restart networking

# SSH emergency access
ssh -i ~/.ssh/emergency_homelab_rsa user@host

# Firewall quick disable (emergency only)
sudo ufw disable

# DNS quick fix
echo "nameserver 8.8.8.8" | sudo tee /etc/resolv.conf

This troubleshooting guide provides comprehensive solutions for common networking issues in home lab environments.