
Networking Infrastructure Troubleshooting Guide

SSH Connection Issues

SSH Authentication Failures

Symptoms: Permission denied, connection refused, timeout

Diagnosis:

# Verbose SSH debugging
ssh -vvv user@host

# Test different authentication methods
ssh -o PasswordAuthentication=no user@host
ssh -o PubkeyAuthentication=yes user@host

# Check local key files
ls -la ~/.ssh/
ssh-keygen -lf ~/.ssh/homelab_rsa.pub

Solutions:

# Re-deploy SSH keys
ssh-copy-id -i ~/.ssh/homelab_rsa.pub user@host
ssh-copy-id -i ~/.ssh/emergency_homelab_rsa.pub user@host

# Fix key permissions
chmod 600 ~/.ssh/homelab_rsa
chmod 644 ~/.ssh/homelab_rsa.pub
chmod 700 ~/.ssh

# Verify remote authorized_keys
ssh user@host 'chmod 700 ~/.ssh && chmod 600 ~/.ssh/authorized_keys'

SSH Service Issues

Symptoms: Connection refused, service not running

Diagnosis:

# Check SSH service status
systemctl status sshd
ss -tlnp | grep :22

# Test port connectivity
nc -zv host 22
nmap -p 22 host

Solutions:

# Restart SSH service
sudo systemctl restart sshd
sudo systemctl enable sshd

# Check firewall
sudo ufw status
sudo ufw allow ssh

# Verify SSH configuration
sudo sshd -T | grep -E "(passwordauth|pubkeyauth|permitroot)"

Network Connectivity Problems

Basic Network Troubleshooting

Symptoms: Cannot reach hosts, timeouts, routing issues

Diagnosis:

# Basic connectivity tests
ping host
traceroute host
mtr host

# Check local network configuration
ip addr show
ip route show
cat /etc/resolv.conf

Solutions:

# Restart networking (service name varies by distro)
sudo systemctl restart networking       # Debian/ifupdown
sudo systemctl restart NetworkManager   # NetworkManager-based systems
sudo netplan apply                      # Ubuntu/netplan

# Reset network interface
sudo ip link set eth0 down
sudo ip link set eth0 up

# Check default gateway
sudo ip route add default via 10.10.0.1

DNS Resolution Issues

Symptoms: Cannot resolve hostnames, slow resolution

Diagnosis:

# Test DNS resolution
nslookup google.com
dig google.com
host google.com

# Check DNS servers
resolvectl status  # (systemd-resolve --status on older systems)
cat /etc/resolv.conf

Solutions:

# Temporary DNS fix (systemd-resolved may overwrite /etc/resolv.conf)
echo "nameserver 8.8.8.8" | sudo tee /etc/resolv.conf

# Restart DNS services
sudo systemctl restart systemd-resolved

# Flush DNS cache
sudo resolvectl flush-caches  # (systemd-resolve --flush-caches on older systems)

UniFi Firewall Blocking DNS to New Networks

Symptoms: New network/VLAN has "no internet access" - devices connect to WiFi but cannot browse or resolve domain names. Ping to IP addresses (8.8.8.8) works, but DNS resolution fails.

Root Cause: Firewall rules blocking traffic from DNS servers (Pi-holes in "Servers" network group) to new networks. Rules like "Servers to WiFi" or "Servers to Home" with DROP action block ALL traffic including DNS responses on port 53.

Diagnosis:

# From affected device on new network:

# Test if routing works (should succeed)
ping 8.8.8.8
traceroute 8.8.8.8

# Test if DNS resolution works (will fail)
nslookup google.com

# Test DNS servers directly (will timeout or fail)
nslookup google.com 10.10.0.16
nslookup google.com 10.10.0.226

# Test public DNS (should work)
nslookup google.com 8.8.8.8

# Check DHCP-assigned DNS servers
# Windows:
ipconfig /all | findstr DNS

# Linux/macOS:
cat /etc/resolv.conf

If routing works but DNS fails, the issue is firewall blocking DNS traffic, not network configuration.
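That decision logic can be summarized in a few lines (a sketch; the three inputs mirror the ping, local-DNS, and public-DNS tests above):

```python
# Triage for "routing works but DNS fails": if the Pi-holes time out while
# public DNS answers, suspect a firewall rule dropping DNS traffic rather
# than a basic connectivity or client configuration problem.

def triage(ping_ip_ok, local_dns_ok, public_dns_ok):
    if not ping_ip_ok:
        return "routing/connectivity problem"
    if local_dns_ok:
        return "network OK"
    if public_dns_ok:
        return "firewall likely blocking DNS to local resolvers"
    return "DNS broken generally (check client resolver config)"

print(triage(True, False, True))  # firewall likely blocking DNS to local resolvers
```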

Solutions:

Step 1: Identify Blocking Rules

  • In UniFi: Settings → Firewall & Security → Traffic Rules → LAN In
  • Look for DROP rules with:
    • Source: Servers (or network group containing Pi-holes)
    • Destination: Your new network (e.g., "Home WiFi", "Home Network")
    • Examples: "Servers to WiFi", "Servers to Home"

Step 2: Create DNS Allow Rules (BEFORE Drop Rules)

Create new rules positioned ABOVE the drop rules:

Name: Allow DNS - Servers to [Network Name]
Action: Accept
Rule Applied: Before Predefined Rules
Type: LAN In
Protocol: TCP and UDP
Source:
  - Network/Group: Servers (or specific Pi-hole IPs: 10.10.0.16, 10.10.0.226)
  - Port: Any
Destination:
  - Network: [Your new network - e.g., Home WiFi]
  - Port: 53 (DNS)

Repeat for each network that needs DNS access from servers.

Step 3: Verify Rule Order

CRITICAL: Firewall rules process top-to-bottom, first match wins!

Correct order:

✅ Allow DNS - Servers to Home Network (Accept, Port 53)
✅ Allow DNS - Servers to Home WiFi (Accept, Port 53)
❌ Servers to Home (Drop, All ports)
❌ Servers to WiFi (Drop, All ports)
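The top-to-bottom, first-match behavior can be sketched as a small Python function (the rule structure here is illustrative only, not UniFi's actual data model):

```python
# First-match firewall evaluation: rules are checked in order and the first
# rule whose source, destination, and port all match decides the verdict.

def evaluate(rules, src, dst, port):
    for rule in rules:
        if (rule["src"] == src
                and rule["dst"] == dst
                and rule["port"] in (port, "any")):
            return rule["action"]
    return "accept"  # illustrative default policy if nothing matches

rules = [
    {"src": "Servers", "dst": "Home WiFi", "port": 53,    "action": "accept"},
    {"src": "Servers", "dst": "Home WiFi", "port": "any", "action": "drop"},
]

# DNS replies on port 53 hit the allow rule first...
print(evaluate(rules, "Servers", "Home WiFi", 53))  # accept
# ...while everything else falls through to the drop rule.
print(evaluate(rules, "Servers", "Home WiFi", 22))  # drop
# With the order reversed, the broad drop shadows the DNS allow:
print(evaluate(list(reversed(rules)), "Servers", "Home WiFi", 53))  # drop
```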

Step 4: Re-enable Drop Rules

Once DNS allow rules are in place and positioned correctly, re-enable the drop rules.

Verification:

# From device on new network:

# DNS should work
nslookup google.com

# Browsing should work
ping google.com

# Other server traffic should still be blocked (expected)
ping 10.10.0.16  # Should fail or timeout
ssh 10.10.0.16   # Should be blocked

Real-World Example: New "Home WiFi" network (10.1.0.0/24, VLAN 2)

  • Problem: Devices connected but couldn't browse web
  • Diagnosis: traceroute 8.8.8.8 worked (16ms), but nslookup google.com failed
  • Cause: Firewall rule "Servers to WiFi" (rule 20004) blocked Pi-hole DNS responses
  • Solution: Added "Allow DNS - Servers to Home WiFi" rule (Accept, port 53) above drop rule
  • Result: DNS resolution works, other server traffic remains properly blocked

Reverse Proxy and Load Balancer Issues

Nginx Configuration Problems

Symptoms: 502 Bad Gateway, 503 Service Unavailable, SSL errors

Diagnosis:

# Check Nginx status and logs
systemctl status nginx
sudo tail -f /var/log/nginx/error.log
sudo tail -f /var/log/nginx/access.log

# Test Nginx configuration
sudo nginx -t
sudo nginx -T  # Show full configuration

Solutions:

# Reload Nginx configuration
sudo nginx -s reload

# Check upstream servers
curl -I http://backend-server:port
telnet backend-server port

# Fix common configuration issues
sudo nano /etc/nginx/sites-available/default
# Check proxy_pass URLs, upstream definitions

SSL/TLS Certificate Issues

Symptoms: Certificate warnings, expired certificates, connection errors

Diagnosis:

# Check certificate validity
openssl s_client -connect host:443 -servername host
openssl x509 -in /etc/ssl/certs/cert.pem -text -noout

# Check certificate expiry
openssl x509 -in /etc/ssl/certs/cert.pem -noout -dates

Solutions:

# Renew Let's Encrypt certificates
sudo certbot renew --dry-run
sudo certbot renew --force-renewal

# Generate self-signed certificate
sudo openssl req -x509 -nodes -days 365 -newkey rsa:2048 \
    -keyout /etc/ssl/private/selfsigned.key \
    -out /etc/ssl/certs/selfsigned.crt

Intermittent SSL Errors (ERR_SSL_UNRECOGNIZED_NAME_ALERT)

Symptoms: SSL errors that work sometimes but fail other times, ERR_SSL_UNRECOGNIZED_NAME_ALERT in browser, connection works from internal network intermittently

Root Cause: IPv6/IPv4 DNS conflicts where public DNS returns Cloudflare IPv6 addresses while local DNS (Pi-hole) only overrides IPv4. Modern systems prefer IPv6, causing intermittent failures when IPv6 connection attempts fail.

Diagnosis:

# Check for multiple DNS records (IPv4 + IPv6)
nslookup domain.example.com 10.10.0.16
dig domain.example.com @10.10.0.16

# Compare with public DNS
host domain.example.com 8.8.8.8

# Test IPv6 vs IPv4 connectivity
curl -6 -I https://domain.example.com  # IPv6 (may fail)
curl -4 -I https://domain.example.com  # IPv4 (should work)

# Check if system has IPv6 connectivity
ip -6 addr show | grep global

Example Problem:

# Local Pi-hole returns:
domain.example.com → 10.10.0.16 (IPv4 internal NPM)

# Public DNS also returns:
domain.example.com → 2606:4700:... (Cloudflare IPv6)

# System tries IPv6 first → fails
# Sometimes falls back to IPv4 → works
# Result: Intermittent SSL errors
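One way to check for this condition without eyeballing dig output is to compare the AAAA answers the local resolver actually returns against the internal overrides you intended (a minimal sketch; the record lists would come from `dig +short AAAA domain @server`):

```python
# Flag IPv6 addresses the local resolver hands out that are NOT your own
# internal overrides — i.e. public AAAA records leaking through and causing
# the intermittent IPv6-first connection failures described above.

def ipv6_leak(local_resolver_aaaa, internal_overrides):
    """AAAA answers from the local resolver that are not internal overrides."""
    return sorted(set(local_resolver_aaaa) - set(internal_overrides))

# Pi-hole only overrides the A record, so the Cloudflare AAAA leaks through:
print(ipv6_leak(["2606:4700::1"], []))      # ['2606:4700::1']

# After adding the IPv6 override, the resolver's AAAA answer is your own:
print(ipv6_leak(["fe80::1"], ["fe80::1"]))  # []
```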

Solutions:

Option 1: Add IPv6 Local DNS Override (Recommended)

# Add non-routable IPv6 address to Pi-hole custom.list
ssh pihole "docker exec pihole bash -c 'echo \"fe80::1 domain.example.com\" >> /etc/pihole/custom.list'"

# Restart Pi-hole DNS
ssh pihole "docker exec pihole pihole restartdns"

# Verify fix
nslookup domain.example.com 10.10.0.16
# Should show: 10.10.0.16 (IPv4) and fe80::1 (IPv6 link-local)

Option 2: Remove Cloudflare DNS Records (If public access not needed)

# In Cloudflare dashboard:
# - Turn off orange cloud (proxy) for the domain
# - Or delete A/AAAA records entirely

# This removes Cloudflare IPs from public DNS

Option 3: Disable IPv6 on Client (Temporary testing)

# Disable IPv6 temporarily to confirm diagnosis
sudo sysctl -w net.ipv6.conf.all.disable_ipv6=1

# Test domain - should work consistently now

# Re-enable when done testing
sudo sysctl -w net.ipv6.conf.all.disable_ipv6=0

Verification:

# After applying fix, verify consistent resolution
for i in {1..10}; do
    echo "Test $i:"
    curl -I https://domain.example.com 2>&1 | grep -E "(HTTP|SSL|certificate)"
    sleep 1
done

# All attempts should succeed consistently

Real-World Example: git.manticorum.com

  • Problem: Intermittent SSL errors from internal network (10.0.0.0/24)
  • Diagnosis: Pi-hole had IPv4 override (10.10.0.16) but public DNS returned Cloudflare IPv6
  • Solution: Added fe80::1 git.manticorum.com to Pi-hole custom.list
  • Result: Consistent successful connections, always routes to internal NPM

iOS DNS Bypass Issues (Encrypted DNS)

Symptoms: iOS device gets 403 errors when accessing internal services, NPM logs show external public IP as source instead of local 10.x.x.x IP, even with correct Pi-hole DNS configuration

Root Cause: iOS devices can use encrypted DNS (DNS-over-HTTPS or DNS-over-TLS) that bypasses traditional DNS servers, even when correctly configured. This causes the device to resolve to public/Cloudflare IPs instead of local overrides, routing traffic through the public internet and triggering ACL denials.

Diagnosis:

# Check NPM access logs for the service
ssh 10.10.0.16 "docker exec nginx-proxy-manager_app_1 tail -50 /data/logs/proxy-host-*_access.log | grep 403"

# Look for external IPs in logs instead of local 10.x.x.x:
# BAD:  [Client 73.36.102.55] - - 403 (external IP, blocked by ACL)
# GOOD: [Client 10.0.0.207] - 200 200 (local IP, allowed)

# Verify iOS device is on local network
# On iOS: Settings → Wi-Fi → (i) → IP Address
# Should show 10.0.0.x or 10.10.0.x

# Verify Pi-hole DNS is configured
# On iOS: Settings → Wi-Fi → (i) → DNS
# Should show 10.10.0.16

# Test if DNS is actually being used
nslookup domain.example.com 10.10.0.16  # Shows what Pi-hole returns
# Then check what iOS actually resolves (if possible via network sniffer)

Example Problem:

# iOS device configuration:
IP Address: 10.0.0.207 (correct, on local network)
DNS: 10.10.0.16 (correct, Pi-hole configured)
Cellular Data: OFF

# But NPM logs show:
[Client 73.36.102.55] - - 403  # Coming from ISP public IP!

# Why: iOS is using encrypted DNS, bypassing Pi-hole
# Result: Resolves to Cloudflare IP, routes through public internet,
#         NPM sees external IP, ACL blocks with 403
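The log-reading step above can be automated with a small parser that flags requests arriving from outside RFC 1918 space (a sketch; the `[Client x.x.x.x]` pattern follows the simplified log excerpts above, not necessarily NPM's exact log format):

```python
import ipaddress
import re

# Classify client IPs in NPM-style access log lines to spot the DNS-bypass
# signature: your own devices showing up with an external source IP.

PRIVATE = [ipaddress.ip_network(n)
           for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]

def is_internal(ip):
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in PRIVATE)

def bypassing_clients(log_lines):
    """Return client IPs that reached the proxy from outside the LAN."""
    ips = re.findall(r"\[Client ([0-9.]+)\]", "\n".join(log_lines))
    return sorted({ip for ip in ips if not is_internal(ip)})

logs = [
    "[Client 10.0.0.207] - 200 200",  # desktop, resolved via Pi-hole
    "[Client 73.36.102.55] - - 403",  # iPhone on encrypted DNS
]
print(bypassing_clients(logs))  # ['73.36.102.55']
```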

Solutions:

Option 1: Add Public IP to NPM Access Rules (Quickest, recommended for mobile devices)

# Find which config file contains your domain
ssh 10.10.0.16 "docker exec nginx-proxy-manager_app_1 sh -c 'grep -l domain.example.com /data/nginx/proxy_host/*.conf'"
# Example output: /data/nginx/proxy_host/19.conf

# Add public IP to access rules (replace YOUR_PUBLIC_IP and config number)
ssh 10.10.0.16 "docker exec nginx-proxy-manager_app_1 sed -i '/allow 10.10.0.0\/24;/a \    \n    allow YOUR_PUBLIC_IP;' /data/nginx/proxy_host/19.conf"

# Verify the change
ssh 10.10.0.16 "docker exec nginx-proxy-manager_app_1 cat /data/nginx/proxy_host/19.conf" | grep -A 8 "Access Rules"

# Test and reload nginx
ssh 10.10.0.16 "docker exec nginx-proxy-manager_app_1 nginx -t"
ssh 10.10.0.16 "docker exec nginx-proxy-manager_app_1 nginx -s reload"

Option 2: Reset iOS Network Settings (Nuclear option, clears DNS cache/profiles)

iOS: Settings → General → Transfer or Reset iPhone → Reset → Reset Network Settings
WARNING: This removes all saved WiFi passwords and network configurations

Option 3: Check for DNS Configuration Profiles

iOS: Settings → General → VPN & Device Management
- Look for any DNS or Configuration Profiles
- Remove any third-party DNS profiles (AdGuard, NextDNS, etc.)

Option 4: Disable Private Relay and IP Tracking (Usually already tried)

iOS: Settings → [Your Name] → iCloud → Private Relay → OFF
iOS: Settings → Wi-Fi → (i) → Limit IP Address Tracking → OFF

Option 5: Check Browser DNS Settings (If using Brave or Firefox)

Brave: Settings → Brave Shields & Privacy → Use secure DNS → OFF
Firefox: Settings → DNS over HTTPS → OFF

Verification:

# After applying fix, check NPM logs while accessing from iOS
ssh 10.10.0.16 "docker exec nginx-proxy-manager_app_1 tail -f /data/logs/proxy-host-*_access.log"

# With Option 1 (added public IP): Should see 200 status with external IP
# With Option 2-5 (fixed DNS): Should see 200 status with local 10.x.x.x IP

Important Notes:

  • Option 1 is recommended for mobile devices as iOS encrypted DNS behavior is inconsistent
  • Public IP workaround requires updating if ISP changes your IP (rare for residential)
  • Manual nginx config changes (Option 1) will be overwritten if you edit the proxy host in NPM UI
  • To make permanent, either use NPM UI to add the IP, or re-apply after UI changes
  • This issue can affect any iOS device (iPhone, iPad) and some Android devices with encrypted DNS

Real-World Example: git.manticorum.com iOS Access

  • Problem: iPhone showing 403 errors, desktop working fine on same network
  • iOS Config: IP 10.0.0.207, DNS 10.10.0.16, Cellular OFF (all correct)
  • NPM Logs: iPhone requests showing as [Client 73.36.102.55] (ISP public IP)
  • Diagnosis: iOS using encrypted DNS, bypassing Pi-hole, routing through Cloudflare
  • Solution: Added allow 73.36.102.55; to NPM proxy_host/19.conf ACL rules
  • Result: Immediate access, user able to log in to Gitea successfully

Network Storage Issues

CIFS/SMB Mount Problems

Symptoms: Mount failures, connection timeouts, permission errors

Diagnosis:

# Test SMB connectivity
smbclient -L //nas-server -U username
testparm  # Test Samba configuration

# Check mount status
mount | grep cifs
df -h | grep cifs

Solutions:

# Remount with verbose logging
sudo mount -t cifs //server/share /mnt/point -o username=user,password=pass,vers=3.0

# Fix mount options in /etc/fstab
//server/share /mnt/point cifs credentials=/etc/cifs/credentials,uid=1000,gid=1000,iocharset=utf8,file_mode=0644,dir_mode=0755,cache=strict,_netdev 0 0

# Test credentials
sudo cat /etc/cifs/credentials
# Should contain: username=, password=, domain=
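For reference, a minimal credentials file matching the fstab entry above (placeholder values):

```
username=smbuser
password=smbpassword
domain=WORKGROUP
```

Keep it root-owned and unreadable by other users: sudo chmod 600 /etc/cifs/credentials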

NFS Mount Issues

Symptoms: Stale file handles, mount hangs, permission denied

Diagnosis:

# Check NFS services
systemctl status nfs-client.target
showmount -e nfs-server

# Test NFS connectivity
rpcinfo -p nfs-server

Solutions:

# Restart NFS services
sudo systemctl restart nfs-client.target

# Remount NFS shares
sudo umount /mnt/nfs-share
sudo mount -t nfs server:/path /mnt/nfs-share

# Fix stale file handles
sudo umount -f /mnt/nfs-share
sudo mount /mnt/nfs-share

Firewall and Security Issues

Port Access Problems

Symptoms: Connection refused, filtered ports, blocked services

Diagnosis:

# Check firewall status
sudo ufw status verbose
sudo iptables -L -n -v

# Test port accessibility
nc -zv host port
nmap -p port host

Solutions:

# Open required ports
sudo ufw allow ssh
sudo ufw allow 80/tcp
sudo ufw allow 443/tcp
sudo ufw allow from 10.10.0.0/24

# Reset firewall if needed
sudo ufw --force reset
sudo ufw enable

Network Security Issues

Symptoms: Unauthorized access, suspicious traffic, security alerts

Diagnosis:

# Check active connections
ss -tuln
netstat -tuln

# Review logs for security events
sudo tail -f /var/log/auth.log
sudo tail -f /var/log/syslog | grep -i security

Solutions:

# Block suspicious IPs
sudo ufw deny from suspicious-ip

# Update SSH security
sudo nano /etc/ssh/sshd_config
# Set: PasswordAuthentication no, PermitRootLogin no
sudo systemctl restart sshd

Pi-hole High Availability Troubleshooting

Pi-hole Not Responding to DNS Queries

Symptoms: DNS resolution failures, clients cannot resolve domains, Pi-hole web UI inaccessible

Diagnosis:

# Test DNS response from both Pi-holes
dig @10.10.0.16 google.com
dig @10.10.0.226 google.com

# Check Pi-hole container status
ssh npm-pihole "docker ps | grep pihole"
ssh ubuntu-manticore "docker ps | grep pihole"

# Check Pi-hole logs
ssh npm-pihole "docker logs pihole --tail 50"
ssh ubuntu-manticore "docker logs pihole --tail 50"

# Test port 53 is listening
ssh ubuntu-manticore "netstat -tulpn | grep :53"
ssh ubuntu-manticore "ss -tulpn | grep :53"

Solutions:

# Restart Pi-hole containers
ssh npm-pihole "docker restart pihole"
ssh ubuntu-manticore "cd ~/docker/pihole && docker compose restart"

# Check for port conflicts
ssh ubuntu-manticore "lsof -i :53"

# If systemd-resolved is conflicting, disable it
ssh ubuntu-manticore "sudo systemctl stop systemd-resolved"
ssh ubuntu-manticore "sudo systemctl disable systemd-resolved"

# Rebuild Pi-hole container
ssh ubuntu-manticore "cd ~/docker/pihole && docker compose down && docker compose up -d"

DNS Failover Not Working

Symptoms: DNS stops working when primary Pi-hole fails, clients not using secondary DNS

Diagnosis:

# Check UniFi DHCP DNS configuration
# Via UniFi UI: Settings → Networks → LAN → DHCP
# DNS Server 1: 10.10.0.16
# DNS Server 2: 10.10.0.226

# Check client DNS configuration
# Windows:
ipconfig /all | findstr /i "DNS"

# Linux/macOS:
cat /etc/resolv.conf

# Check if secondary Pi-hole is reachable
ping -c 4 10.10.0.226
dig @10.10.0.226 google.com

# Test failover manually
ssh npm-pihole "docker stop pihole"
dig google.com  # Should still work via secondary
ssh npm-pihole "docker start pihole"

Solutions:

# Force DHCP lease renewal to get updated DNS servers
# Windows:
ipconfig /release && ipconfig /renew

# Linux:
sudo dhclient -r && sudo dhclient

# macOS/iOS:
# Disconnect and reconnect to WiFi

# Verify UniFi DHCP settings are correct
# Both DNS servers must be configured in UniFi controller

# Check client respects both DNS servers
# Some clients may cache failed DNS responses
# Flush DNS cache:
# Windows: ipconfig /flushdns
# macOS: sudo dscacheutil -flushcache
# Linux: sudo resolvectl flush-caches

Orbital Sync Not Syncing

Symptoms: Blocklists/whitelists differ between Pi-holes, custom DNS entries missing on secondary

Diagnosis:

# Check Orbital Sync container status
ssh ubuntu-manticore "docker ps | grep orbital-sync"

# Check Orbital Sync logs
ssh ubuntu-manticore "docker logs orbital-sync --tail 100"

# Look for sync errors in logs
ssh ubuntu-manticore "docker logs orbital-sync 2>&1 | grep -i error"

# Verify API tokens are correct
ssh ubuntu-manticore "cat ~/docker/orbital-sync/.env"

# Test API access manually (v5 API token: web UI → Settings → API;
# note that `pihole -a -p` SETS the admin password, it does not print the token)
curl "http://10.10.0.16/admin/api.php?status&auth=YOUR_TOKEN"

# Compare blocklist counts between Pi-holes
ssh npm-pihole "docker exec pihole pihole -g -l"
ssh ubuntu-manticore "docker exec pihole pihole -g -l"

Solutions:

# Regenerate API tokens
# Primary Pi-hole: http://10.10.0.16/admin → Settings → API → Generate New Token
# Secondary Pi-hole: http://10.10.0.226:8053/admin → Settings → API → Generate New Token

# Update Orbital Sync .env file
ssh ubuntu-manticore "nano ~/docker/orbital-sync/.env"
# Update PRIMARY_HOST_PASSWORD and SECONDARY_HOST_PASSWORD

# Restart Orbital Sync
ssh ubuntu-manticore "cd ~/docker/orbital-sync && docker compose restart"

# Force immediate sync by restarting
ssh ubuntu-manticore "cd ~/docker/orbital-sync && docker compose down && docker compose up -d"

# Monitor sync in real-time
ssh ubuntu-manticore "docker logs orbital-sync -f"

# If all else fails, manually sync via Teleporter
# Primary: Settings → Teleporter → Backup
# Secondary: Settings → Teleporter → Restore (upload backup file)

NPM DNS Sync Failing

Symptoms: NPM proxy hosts missing from Pi-hole custom.list, new domains not resolving

Diagnosis:

# Check NPM sync script status
ssh npm-pihole "cat /var/log/cron.log | grep npm-pihole-sync"

# Run sync script manually to see errors
ssh npm-pihole "/home/cal/scripts/npm-pihole-sync.sh"

# Check script can access both Pi-holes
ssh npm-pihole "docker exec pihole cat /etc/pihole/custom.list | grep git.manticorum.com"
ssh npm-pihole "ssh ubuntu-manticore 'docker exec pihole cat /etc/pihole/custom.list | grep git.manticorum.com'"

# Verify SSH connectivity to ubuntu-manticore
ssh npm-pihole "ssh ubuntu-manticore 'echo SSH OK'"

Solutions:

# Fix SSH key authentication (if needed)
ssh npm-pihole "ssh-copy-id ubuntu-manticore"

# Test script with dry-run
ssh npm-pihole "/home/cal/scripts/npm-pihole-sync.sh --dry-run"

# Run script manually to sync immediately
ssh npm-pihole "/home/cal/scripts/npm-pihole-sync.sh"

# Verify cron job is configured
ssh npm-pihole "crontab -l | grep npm-pihole-sync"

# If cron job missing, add it
ssh npm-pihole "crontab -e"
# Add: 0 * * * * /home/cal/scripts/npm-pihole-sync.sh >> /var/log/npm-pihole-sync.log 2>&1

# Check script logs
ssh npm-pihole "tail -50 /var/log/npm-pihole-sync.log"

Secondary Pi-hole Performance Issues

Symptoms: ubuntu-manticore slow, high CPU/RAM usage, Pi-hole affecting Jellyfin/Tdarr

Diagnosis:

# Check resource usage
ssh ubuntu-manticore "docker stats --no-stream"

# Pi-hole should use <1% CPU and ~150MB RAM
# If higher, investigate:
ssh ubuntu-manticore "docker logs pihole --tail 100"

# Check for excessive queries
ssh ubuntu-manticore "docker exec pihole pihole -c -e"

# Check for DNS loops or misconfiguration
ssh ubuntu-manticore "docker exec pihole pihole -t"  # Tail pihole.log

Solutions:

# Restart Pi-hole if resource usage is high
ssh ubuntu-manticore "docker restart pihole"

# Check for DNS query loops
# Look for same domain being queried repeatedly
ssh ubuntu-manticore "docker exec pihole pihole -t | grep -A 5 'query\[A\]'"

# Adjust Pi-hole cache settings if needed
ssh ubuntu-manticore "docker exec pihole bash -c 'echo \"cache-size=10000\" >> /etc/dnsmasq.d/99-custom.conf'"
ssh ubuntu-manticore "docker restart pihole"

# If Jellyfin/Tdarr are affected, verify Pi-hole is using minimal resources
# Resource limits can be added to docker-compose.yml:
ssh ubuntu-manticore "nano ~/docker/pihole/docker-compose.yml"
# Add under pihole service:
#   deploy:
#     resources:
#       limits:
#         cpus: '0.5'
#         memory: 256M

iOS Devices Still Getting 403 Errors (Post-HA Deployment)

Symptoms: After deploying dual Pi-hole setup, iOS devices still bypass DNS and get 403 errors on internal services

Diagnosis:

# Verify UniFi DHCP has BOTH Pi-holes configured, NO public DNS
# UniFi UI: Settings → Networks → LAN → DHCP → Name Server
# DNS1: 10.10.0.16
# DNS2: 10.10.0.226
# Public DNS (1.1.1.1, 8.8.8.8): REMOVED

# Check iOS DNS settings
# iOS: Settings → WiFi → (i) → DNS
# Should show: 10.10.0.16

# Force iOS DHCP renewal
# iOS: Settings → WiFi → Forget Network → Reconnect

# Check NPM logs for request source
ssh npm-pihole "docker exec nginx-proxy-manager_app_1 tail -50 /data/logs/proxy-host-*_access.log | grep 403"

# Verify both Pi-holes have custom DNS entries
ssh npm-pihole "docker exec pihole cat /etc/pihole/custom.list | grep git.manticorum.com"
ssh ubuntu-manticore "docker exec pihole cat /etc/pihole/custom.list | grep git.manticorum.com"

Solutions:

# Solution 1: Verify public DNS is removed from UniFi DHCP
# If public DNS (1.1.1.1) is still configured, iOS will prefer it
# Remove ALL public DNS servers from UniFi DHCP configuration

# Solution 2: Force iOS to renew DHCP lease
# iOS: Settings → WiFi → Forget Network
# Then reconnect to WiFi
# This forces device to get new DNS servers from DHCP

# Solution 3: Disable iOS encrypted DNS if still active
# iOS: Settings → [Your Name] → iCloud → Private Relay → OFF
# iOS: Check for DNS profiles: Settings → General → VPN & Device Management

# Solution 4: If encrypted DNS persists, add public IP to NPM ACL (fallback)
# See "iOS DNS Bypass Issues" section above for detailed steps

# Solution 5: Test with different iOS device to isolate issue
# If other iOS devices work, issue is device-specific configuration

# Verification after fix
ssh npm-pihole "docker exec nginx-proxy-manager_app_1 tail -f /data/logs/proxy-host-*_access.log"
# Access git.manticorum.com from iOS
# Should see: [Client 10.0.0.x] - - 200 (local IP)

Both Pi-holes Failing Simultaneously

Symptoms: Complete DNS failure across network, all devices cannot resolve domains

Diagnosis:

# Check both Pi-hole containers
ssh npm-pihole "docker ps -a | grep pihole"
ssh ubuntu-manticore "docker ps -a | grep pihole"

# Check both hosts are reachable
ping -c 4 10.10.0.16
ping -c 4 10.10.0.226

# Check Docker daemon on both hosts
ssh npm-pihole "systemctl status docker"
ssh ubuntu-manticore "systemctl status docker"

# Test emergency DNS (bypassing Pi-hole)
dig @8.8.8.8 google.com

Solutions:

# Emergency: Temporarily use public DNS
# UniFi UI: Settings → Networks → LAN → DHCP → Name Server
# DNS1: 8.8.8.8 (Google DNS - temporary)
# DNS2: 1.1.1.1 (Cloudflare - temporary)

# Restart both Pi-holes
ssh npm-pihole "docker restart pihole"
ssh ubuntu-manticore "docker restart pihole"

# If Docker daemon issues:
ssh npm-pihole "sudo systemctl restart docker"
ssh ubuntu-manticore "sudo systemctl restart docker"

# Rebuild both Pi-holes if corruption suspected
ssh npm-pihole "cd ~/pihole && docker compose down && docker compose up -d"
ssh ubuntu-manticore "cd ~/docker/pihole && docker compose down && docker compose up -d"

# After Pi-holes are restored, revert UniFi DHCP to Pi-holes
# UniFi UI: Settings → Networks → LAN → DHCP → Name Server
# DNS1: 10.10.0.16
# DNS2: 10.10.0.226

Query Load Not Balanced Between Pi-holes

Symptoms: Primary Pi-hole getting most queries, secondary rarely used

Diagnosis:

# Check query counts on both Pi-holes
# Primary: http://10.10.0.16/admin → Dashboard → Total Queries
# Secondary: http://10.10.0.226:8053/admin → Dashboard → Total Queries

# This is NORMAL behavior - clients prefer DNS1 by default
# Secondary is for failover, not load balancing

# To verify failover works:
ssh npm-pihole "docker stop pihole"
# Wait 30 seconds
# Check secondary query count - should increase
ssh npm-pihole "docker start pihole"

Solutions:

# No action needed - this is expected behavior
# DNS failover is for redundancy, not load distribution

# If you want true load balancing (advanced):
# Option 1: Configure some devices to prefer DNS2
# Manually set DNS on specific devices to 10.10.0.226, 10.10.0.16

# Option 2: Implement DNS round-robin (requires custom DHCP)
# Not recommended for homelab - adds complexity

# Option 3: Accept default behavior (recommended)
# Primary handles most traffic, secondary provides failover
# This is industry standard DNS HA behavior

Pi-hole Blocklist Blocking Legitimate Apps

Facebook Blocklist Breaking Messenger Kids (2026-03-05)

Symptoms: iPad could not connect to Facebook Messenger Kids. App would not load or send/receive messages. Disconnecting iPad from WiFi (using cellular) restored functionality.

Root Cause: The anudeepND/blacklist/master/facebook.txt blocklist was subscribed in Pi-hole, which blocked all core Facebook domains needed by Messenger Kids.

Blocked Domains (from pihole.log):

  • edge-mqtt.facebook.com - MQTT real-time message transport
  • graph.facebook.com - Facebook Graph API (login, contacts, profiles)
  • graph-fallback.facebook.com - Graph API fallback (blocked via CNAME chain)
  • www.facebook.com - Core Facebook domain

Allowed Domains (not on the blocklist, resolved fine):

  • dgw.c10r.facebook.com - Data gateway
  • mqtt.fallback.c10r.facebook.com - MQTT fallback
  • chat-e2ee.c10r.facebook.com - E2E encrypted chat

Diagnosis:

# Find blocked domains for a specific client IP
ssh pihole "docker exec pihole grep 'CLIENT_IP' /var/log/pihole/pihole.log | grep 'gravity blocked'"

# Check which blocklist contains a domain
ssh pihole "docker exec pihole pihole -q edge-mqtt.facebook.com"
# Output: https://raw.githubusercontent.com/anudeepND/blacklist/master/facebook.txt (block)

Resolution: Removed the Facebook blocklist from primary Pi-hole (secondary didn't have it). The blocklist contained ~3,997 Facebook domains.

Pi-hole v6 API - Deleting a Blocklist:

# Authenticate and get session ID
SID=$(curl -s -X POST 'http://PIHOLE_IP:PORT/api/auth' \
  -H 'Content-Type: application/json' \
  -d '{"password":"APP_PASSWORD"}' \
  | python3 -c 'import sys,json; print(json.load(sys.stdin)["session"]["sid"])')

# DELETE uses the URL-encoded list ADDRESS as path parameter (NOT numeric ID)
# The ?type=block parameter is REQUIRED
curl -s -X DELETE \
  "http://PIHOLE_IP:PORT/api/lists/URL_ENCODED_LIST_ADDRESS?type=block" \
  -H "X-FTL-SID: $SID"
# Success returns HTTP 204 No Content

# Update gravity after removal
ssh pihole "docker exec pihole pihole -g"

# Verify domain is no longer blocked
ssh pihole "docker exec pihole pihole -q edge-mqtt.facebook.com"

Important Pi-hole v6 API Notes:

  • List endpoints use the URL-encoded blocklist address as path param, not numeric IDs
  • ?type=block query parameter is mandatory for DELETE operations
  • Numeric ID DELETE returns 200 with {"took": ...} but DOES NOT actually delete (silent failure)
  • Successful address-based DELETE returns HTTP 204 (no body)
  • Must run pihole -g (gravity update) after deletion for changes to take effect
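Since the list address (not the numeric ID) is the path parameter, the URL-encoding step is easy to get wrong; a small Python helper (hypothetical function name, for illustration) that builds the DELETE path:

```python
from urllib.parse import quote

# Build the Pi-hole v6 DELETE path for a blocklist. The address is used as a
# path parameter, so every reserved character (including ':' and '/') must be
# percent-encoded; quote(..., safe="") encodes them all. The required
# ?type=block parameter is appended as noted above.

def delete_list_path(address, list_type="block"):
    return f"/api/lists/{quote(address, safe='')}?type={list_type}"

url = delete_list_path(
    "https://raw.githubusercontent.com/anudeepND/blacklist/master/facebook.txt")
print(url)
# /api/lists/https%3A%2F%2Fraw.githubusercontent.com%2FanudeepND%2Fblacklist%2Fmaster%2Ffacebook.txt?type=block
```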

Future Improvement (TODO): Implement Pi-hole v6 group/client-based approach:

  • Create a group for the iPad that bypasses the Facebook blocklist
  • Re-add the Facebook blocklist assigned to the default group only
  • Assign the iPad's IP to a "Kids Devices" client group that excludes the Facebook list
  • This would maintain Facebook blocking for other devices while allowing Messenger Kids
  • See: Pi-hole v6 Admin -> Groups/Clients for per-device blocklist management

Service Discovery and DNS Issues

Local DNS Problems

Symptoms: Services unreachable by hostname, DNS timeouts

Diagnosis:

# Test local DNS resolution
nslookup service.homelab.local
dig @10.10.0.16 service.homelab.local

# Check DNS server status
systemctl status bind9  # or named

Solutions:

# Add to /etc/hosts as temporary fix
echo "10.10.0.100 service.homelab.local" | sudo tee -a /etc/hosts

# Restart DNS services
sudo systemctl restart bind9
sudo systemctl restart systemd-resolved

Container Networking Issues

Symptoms: Containers cannot communicate, service discovery fails

Diagnosis:

# Check Docker networks
docker network ls
docker network inspect bridge

# Test container connectivity
docker exec container1 ping container2
docker exec container1 nslookup container2

Solutions:

# Create custom network
docker network create --driver bridge app-network
docker run --network app-network container

# Fix DNS in containers
docker run --dns 8.8.8.8 container

Performance Issues

Network Latency Problems

Symptoms: Slow response times, timeouts, poor performance

Diagnosis:

# Measure network latency
ping -c 100 host
mtr --report host

# Check network interface stats
ip -s link show
cat /proc/net/dev

Solutions:

# Optimize network settings
echo 'net.core.rmem_max = 134217728' | sudo tee -a /etc/sysctl.conf
echo 'net.core.wmem_max = 134217728' | sudo tee -a /etc/sysctl.conf
sudo sysctl -p

# Check for network congestion
iftop
nethogs

Bandwidth Issues

Symptoms: Slow transfers, network congestion, dropped packets

Diagnosis:

# Test bandwidth
iperf3 -s  # Server
iperf3 -c server-ip  # Client

# Check interface utilization
vnstat -i eth0

Solutions:

# Implement QoS if needed
sudo tc qdisc add dev eth0 root fq_codel

# Optimize buffer sizes
sudo ethtool -G eth0 rx 4096 tx 4096

Emergency Recovery Procedures

Network Emergency Recovery

Complete network failure recovery:

# Reset all network configuration
sudo systemctl stop networking
sudo ip addr flush eth0
sudo ip route flush table main
sudo systemctl start networking

# Manual network configuration
sudo ip addr add 10.10.0.100/24 dev eth0
sudo ip route add default via 10.10.0.1
echo "nameserver 8.8.8.8" | sudo tee /etc/resolv.conf

SSH Emergency Access

When locked out of systems:

# Use emergency SSH key
ssh -i ~/.ssh/emergency_homelab_rsa user@host

# Via console access (if available)
# Use hypervisor console or physical access

# Reset SSH to allow password auth temporarily
sudo sed -i 's/PasswordAuthentication no/PasswordAuthentication yes/' /etc/ssh/sshd_config
sudo systemctl restart sshd

Service Recovery

Critical service restoration:

# Restart all network services
sudo systemctl restart networking
sudo systemctl restart nginx
sudo systemctl restart sshd

# Emergency firewall disable
sudo ufw disable  # CAUTION: Only for troubleshooting

# Service-specific recovery
sudo systemctl restart docker
sudo systemctl restart systemd-resolved

Monitoring and Prevention

Network Health Monitoring

#!/bin/bash
# network-monitor.sh
CRITICAL_HOSTS="10.10.0.1 10.10.0.16 nas.homelab.local"
CRITICAL_SERVICES="https://homelab.local http://proxmox.homelab.local:8006"

for host in $CRITICAL_HOSTS; do
    if ! ping -c1 -W5 $host >/dev/null 2>&1; then
        echo "ALERT: $host unreachable" | logger -t network-monitor
    fi
done

for service in $CRITICAL_SERVICES; do
    if ! curl -sSf --max-time 10 "$service" >/dev/null 2>&1; then
        echo "ALERT: $service unavailable" | logger -t network-monitor
    fi
done

Automated Recovery Scripts

#!/bin/bash
# network-recovery.sh
if ! ping -c1 8.8.8.8 >/dev/null 2>&1; then
    echo "Network down, attempting recovery..."
    sudo systemctl restart networking
    sleep 10
    if ping -c1 8.8.8.8 >/dev/null 2>&1; then
        echo "Network recovered"
    else
        echo "Manual intervention required"
    fi
fi

Quick Reference Commands

Network Diagnostics

# Connectivity tests
ping host
traceroute host
mtr host
nc -zv host port

# Service checks
systemctl status networking
systemctl status nginx
systemctl status sshd

# Network configuration
ip addr show
ip route show
ss -tuln

Emergency Commands

# Network restart
sudo systemctl restart networking

# SSH emergency access
ssh -i ~/.ssh/emergency_homelab_rsa user@host

# Firewall quick disable (emergency only)
sudo ufw disable

# DNS quick fix
echo "nameserver 8.8.8.8" | sudo tee /etc/resolv.conf

This troubleshooting guide provides comprehensive solutions for common networking issues in home lab environments.