claude-home/monitoring/troubleshooting.md
Cal Corum 10c9e0d854 CLAUDE: Migrate to technology-first documentation architecture
2025-08-12 23:20:15 -05:00


Monitoring System Troubleshooting Guide

Discord Notification Issues

Webhook Not Working

Symptoms: No Discord messages received, connection errors

Diagnosis:

# Test webhook manually
curl -X POST "$DISCORD_WEBHOOK_URL" \
  -H "Content-Type: application/json" \
  -d '{"content": "Test message"}'

# Check webhook URL format
echo "$DISCORD_WEBHOOK_URL" | grep -E '^https://discord\.com/api/webhooks/[0-9]+/.+'

Solutions:

# Verify webhook URL is correct
# Format: https://discord.com/api/webhooks/ID/TOKEN

# Test with minimal payload
curl -X POST "$DISCORD_WEBHOOK_URL" \
  -H "Content-Type: application/json" \
  -d '{"content": "✅ Webhook working"}'

# Check for JSON formatting issues
echo '{"content": "test"}' | jq .  # Validate JSON
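The format check above can be wrapped in a function so scripts can refuse to run with a bad URL. This is a minimal sketch: `is_discord_webhook_url` is a hypothetical helper name, and the token pattern (letters, digits, `_`, `-`) is an assumption rather than a documented guarantee.

```shell
#!/bin/bash
# Hypothetical helper: succeeds only if the argument looks like
# https://discord.com/api/webhooks/ID/TOKEN
is_discord_webhook_url() {
    # ID must be numeric; the allowed token characters are an assumption.
    printf '%s\n' "$1" |
        grep -Eq '^https://discord\.com/api/webhooks/[0-9]+/[A-Za-z0-9_-]+$'
}
```

Usage: `is_discord_webhook_url "$DISCORD_WEBHOOK_URL" || echo "Webhook URL looks malformed" >&2`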

Message Formatting Problems

Symptoms: Malformed messages, broken markdown, missing user pings

Common Issues:

# ❌ Broken JSON escaping
{"content": "Error: "quotes" break JSON"}

# ✅ Proper JSON escaping  
{"content": "Error: \"quotes\" properly escaped"}

# ❌ User ping inside code block (doesn't work)
{"content": "```\nIssue occurred <@user_id>\n```"}

# ✅ User ping outside code block
{"content": "```\nIssue occurred\n```\nManual intervention needed <@user_id>"}
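Escaping by hand is easy to get wrong when the message text comes from a variable. The sketch below shows one way to build the payload; `json_escape` and `build_payload` are hypothetical names, and only backslashes and double quotes are handled (newlines and other control characters are not).

```shell
#!/bin/bash
# Hypothetical helper: escape backslashes and double quotes so the text
# can sit inside a JSON string. Newlines and other control characters
# are NOT handled in this sketch.
json_escape() {
    printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Build the webhook payload without sending it.
build_payload() {
    printf '{"content": "%s"}' "$(json_escape "$1")"
}
```

If jq is installed, `jq -n --arg content "$message" '{content: $content}'` is the more robust choice because it escapes every character correctly.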

Tdarr Monitoring Issues

Script Not Running

Symptoms: No monitoring alerts, script execution failures

Diagnosis:

# Check cron job status
crontab -l | grep tdarr-timeout-monitor
systemctl status cron

# Run script manually for debugging
bash -x /path/to/tdarr-timeout-monitor.sh

# Check script permissions
ls -la /path/to/tdarr-timeout-monitor.sh

Solutions:

# Fix script permissions
chmod +x /path/to/tdarr-timeout-monitor.sh

# Reinstall cron job
crontab -e
# Add: */20 * * * * /full/path/to/tdarr-timeout-monitor.sh

# Check script environment
# Ensure PATH and variables are set correctly in script

API Connection Failures

Symptoms: Cannot connect to Tdarr server, timeout errors

Diagnosis:

# Test Tdarr API manually
curl -f "http://tdarr-server:8266/api/v2/status"

# Check network connectivity
ping tdarr-server
nc -zv tdarr-server 8266

# Verify SSH access to server
ssh tdarr "docker ps | grep tdarr"

Solutions:

# Update server connection in script
# Verify server IP and port are correct

# Test API endpoints
curl "http://10.10.0.43:8265/api/v2/status"  # Web port
curl "http://10.10.0.43:8266/api/v2/status"  # Server port

# Check Tdarr server logs
ssh tdarr "docker logs --tail 20 tdarr"

Windows Desktop Monitoring Issues

PowerShell Script Not Running

Symptoms: No reboot notifications from Windows systems

Diagnosis:

# Check scheduled task status
Get-ScheduledTask -TaskName "Reboot*" | Get-ScheduledTaskInfo

# Test script execution manually
PowerShell -ExecutionPolicy Bypass -File "C:\path\to\windows-reboot-monitor.ps1"

# Check PowerShell execution policy
Get-ExecutionPolicy

Solutions:

# Set execution policy
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser

# Recreate scheduled tasks
schtasks /Create /XML "C:\path\to\task.xml" /TN "RebootMonitor"

# Check task trigger configuration
(Get-ScheduledTask -TaskName "RebootMonitor").Triggers

Network Access from Windows

Symptoms: PowerShell cannot reach Discord webhook

Diagnosis:

# Test network connectivity
Test-NetConnection discord.com -Port 443

# Test webhook manually
Invoke-RestMethod -Uri $webhookUrl -Method Post -Body '{"content":"test"}' -ContentType "application/json"

# Check Windows firewall
Get-NetFirewallRule | Where-Object {$_.DisplayName -like "*PowerShell*"}

Solutions:

# Allow PowerShell through firewall
New-NetFirewallRule -DisplayName "PowerShell Outbound" -Direction Outbound -Program "C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe" -Action Allow

# Test with simplified request
$body = @{content="Test from Windows"} | ConvertTo-Json
Invoke-RestMethod -Uri $webhookUrl -Method Post -Body $body -ContentType "application/json"

Log Management Issues

Log Files Growing Too Large

Symptoms: Disk space filling up, slow log access

Diagnosis:

# Check log file sizes
du -sh /var/log/homelab-*
du -sh /tmp/*monitor*.log

# Check available disk space
df -h /var/log
df -h /tmp

Solutions:

# Implement log rotation
cat > /etc/logrotate.d/homelab-monitoring << 'EOF'
/var/log/homelab-*.log {
    daily
    missingok
    rotate 7
    compress
    notifempty
    create 644 root root
}
EOF

# Manual log cleanup
find /tmp -name "*monitor*.log" -size +10M -delete
truncate -s 0 /tmp/large-log-file.log
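Deleting a log that a running script still holds open does not free the disk space until that process exits, whereas truncating does. The cleanup above can be sketched as a function that empties oversized logs in place; `truncate_large_logs` is a hypothetical name, and the `*monitor*.log` pattern is taken from the commands above.

```shell
#!/bin/bash
# Hypothetical helper: empty (rather than delete) every *monitor*.log
# in a directory that is larger than max_kb KiB.
truncate_large_logs() {
    local dir="$1" max_kb="$2"
    find "$dir" -maxdepth 1 -type f -name '*monitor*.log' \
        -size +"${max_kb}"k -exec truncate -s 0 {} +
}
```

Usage: `truncate_large_logs /tmp 10240` empties matching logs over 10 MiB and leaves smaller ones untouched.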

Log Rotation Not Working

Symptoms: Old logs not being cleaned up

Diagnosis:

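# Check logrotate status (on systemd hosts logrotate is usually driven by a timer)
systemctl status logrotate
systemctl status logrotate.timer
cat /var/lib/logrotate/status  # Debian/Ubuntu; RHEL-family uses /var/lib/logrotate/logrotate.status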

# Test logrotate configuration
logrotate -d /etc/logrotate.d/homelab-monitoring

Solutions:

# Force log rotation
logrotate -f /etc/logrotate.d/homelab-monitoring

# Fix logrotate configuration
sudo nano /etc/logrotate.d/homelab-monitoring
# Verify syntax and permissions

Cron Job Issues

Scheduled Tasks Not Running

Symptoms: Scripts not executing at scheduled times

Diagnosis:

# Check cron service
systemctl status cron
systemctl status crond  # RHEL/CentOS

# View cron logs
grep CRON /var/log/syslog
journalctl -u cron

# List all cron jobs
crontab -l
sudo crontab -l  # root's crontab
cat /etc/crontab  # System crontab

Solutions:

# Restart cron service
sudo systemctl restart cron

# Fix cron job syntax
# Ensure absolute paths are used
# Example: */20 * * * * /full/path/to/script.sh

# Check script permissions and execution
ls -la /path/to/script.sh
/path/to/script.sh  # Test manual execution

Environment Variables in Cron

Symptoms: Scripts work manually but fail in cron

Diagnosis:

# Create test cron job to check environment
* * * * * env > /tmp/cron-env.txt

# Compare with shell environment
env > /tmp/shell-env.txt
diff /tmp/shell-env.txt /tmp/cron-env.txt

Solutions:

# Set PATH in crontab
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

# Or set PATH in script
#!/bin/bash
export PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"

# Source environment if needed (set -a exports the plain KEY=value lines)
set -a; source /etc/environment; set +a
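To reproduce a cron-only failure without waiting for the next scheduled run, a command can be started with a stripped-down environment. This is a sketch only: `run_like_cron` is a hypothetical helper, and the `PATH` and `SHELL` values are typical cron defaults, not guaranteed for every distribution.

```shell
#!/bin/bash
# Hypothetical helper: run a command string with a minimal, cron-like
# environment. The PATH and SHELL values are assumed typical defaults.
run_like_cron() {
    env -i HOME="${HOME:-/}" LOGNAME="${LOGNAME:-$(id -un)}" \
        PATH=/usr/bin:/bin SHELL=/bin/sh \
        /bin/sh -c "$1"
}
```

Usage: `run_like_cron '/full/path/to/script.sh'`. If the script fails here but works in an interactive shell, a missing variable or `PATH` entry is the likely cause.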

Network Monitoring Issues

False Positives

Symptoms: Alerts for services that are actually working

Diagnosis:

# Test monitoring checks manually
curl -sSf --max-time 10 "https://service.homelab.local"
ping -c1 -W5 10.10.0.100

# Check for intermittent network issues
for i in {1..10}; do ping -c1 host || echo "Fail $i"; done

Solutions:

# Adjust timeout values
curl --max-time 30 "$service"  # Increase timeout

# Add retry logic
for retry in {1..3}; do
    if curl -sSf --max-time 10 "$service" >/dev/null 2>&1; then
        break
    elif [ "$retry" -eq 3 ]; then
        send_alert "Service $service failed after 3 retries"
        break  # Skip the pointless sleep after the final attempt
    fi
    sleep 5
done
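The same retry pattern can be factored into a reusable function so every check shares one implementation. This is a minimal sketch; `retry` is a hypothetical helper name, not part of the existing scripts.

```shell
#!/bin/bash
# Hypothetical helper: run a command up to <attempts> times, sleeping
# <delay> seconds between failed attempts. Returns 0 on the first
# success and 1 if every attempt fails.
retry() {
    local attempts="$1" delay="$2" n=1
    shift 2
    while ! "$@"; do
        if [ "$n" -ge "$attempts" ]; then
            return 1
        fi
        n=$((n + 1))
        sleep "$delay"
    done
    return 0
}
```

Usage: `retry 3 5 curl -sSf --max-time 10 "$service" >/dev/null || send_alert "Service $service failed after 3 retries"`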

Missing Alerts

Symptoms: Real failures not triggering notifications

Diagnosis:

# Verify monitoring script logic
bash -x monitoring-script.sh

# Check if services are actually down
systemctl status service-name
curl -v service-url

Solutions:

# Lower detection thresholds
# Increase monitoring frequency
# Add redundant monitoring methods

# Test alert mechanism (send_alert takes the message as its first argument)
send_alert "Test alert"

System Resource Issues

Monitoring Overhead

Symptoms: High CPU/memory usage from monitoring scripts

Diagnosis:

# Monitor the monitoring scripts
top -p "$(pgrep -d, -f monitor)"   # top expects a comma-separated PID list
ps aux | grep '[m]onitor'          # Brackets keep grep from matching itself

# Check monitoring frequency
crontab -l | grep monitor

Solutions:

# Reduce monitoring frequency
# Change from */1 to */5 minutes

# Optimize scripts
# Remove unnecessary commands
# Use efficient tools (prefer curl over wget, etc.)

# Add resource limits
timeout 30 monitoring-script.sh
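GNU `timeout` exits with status 124 when it has to kill the command, which lets a wrapper tell a hung check apart from an ordinary failure. The sketch below assumes GNU coreutils; `run_with_limit` is a hypothetical name.

```shell
#!/bin/bash
# Hypothetical helper: run a command under a time limit (in seconds)
# and report on stderr when the limit was hit.
run_with_limit() {
    local limit="$1"
    shift
    timeout "$limit" "$@"
    local rc=$?
    if [ "$rc" -eq 124 ]; then
        # 124 is GNU timeout's exit status for "command timed out".
        echo "TIMEOUT after ${limit}s: $*" >&2
    fi
    return "$rc"
}
```

Usage: `run_with_limit 30 /path/to/monitoring-script.sh`. Any other non-zero status is the script's own exit code, passed through unchanged.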

Emergency Recovery

Complete Monitoring Failure

Recovery Steps:

# Restart all monitoring services
sudo systemctl restart cron
sudo systemctl restart rsyslog

# Reinstall monitoring scripts
cd /path/to/scripts
./install-monitoring.sh

# Test all components
./test-monitoring.sh

Discord Integration Lost

Quick Recovery:

# Test the backup webhook
curl -X POST "$BACKUP_WEBHOOK_URL" -H "Content-Type: application/json" -d '{"content": "Monitoring restored"}'

# Switch to backup webhook if needed
export DISCORD_WEBHOOK_URL="$BACKUP_WEBHOOK_URL"

Prevention and Best Practices

Monitoring Health Checks

#!/bin/bash
# monitor-the-monitors.sh
MONITORING_SCRIPTS="/path/to/tdarr-monitor.sh /path/to/network-monitor.sh"

for script in $MONITORING_SCRIPTS; do
    if [ ! -x "$script" ]; then
        send_alert "ALERT: $script not executable"
    fi

    # Check if script has run recently (assumes each script touches "$script.last_run" when it finishes)
    last_run=$(stat -c %Y "$script.last_run" 2>/dev/null || echo 0)
    if [ $(( $(date +%s) - last_run )) -gt 3600 ]; then
        send_alert "ALERT: $script hasn't run in over an hour"
    fi
done
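The health check above depends on a `.last_run` marker file that each monitoring script has to refresh itself. A minimal sketch of both sides follows; `is_stale` is a hypothetical helper, and `stat -c %Y` assumes GNU coreutils as in the script above.

```shell
#!/bin/bash
# Hypothetical helper: succeeds if the marker file is missing or its
# modification time is more than max_age seconds in the past.
is_stale() {
    local marker="$1" max_age="$2" now mtime
    now=$(date +%s)
    # A missing marker counts as "never ran" (mtime 0).
    mtime=$(stat -c %Y "$marker" 2>/dev/null || echo 0)
    [ $((now - mtime)) -gt "$max_age" ]
}

# In each monitoring script, as the last step of a successful run:
#   touch "$0.last_run"
```

Usage: `is_stale /path/to/tdarr-monitor.sh.last_run 3600 && send_alert "tdarr-monitor.sh hasn't run in over an hour"`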

Backup Alerting Channels

# Multiple notification methods
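send_alert() {
    local message="$1"

    # Primary: Discord (-f makes curl fail on HTTP errors so the fallbacks can trigger)
    # Note: $message must not contain unescaped double quotes or backslashes
    if curl -sSf -X POST "$DISCORD_WEBHOOK_URL" \
        -H "Content-Type: application/json" \
        -d "{\"content\":\"$message\"}" >/dev/null; then
        return 0
    fi
    # Backup: Email
    if echo "$message" | mail -s "Homelab Alert" admin@domain.com; then
        return 0
    fi
    # Last resort: Local log
    echo "$(date): $message" >> /var/log/critical-alerts.log
}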

This troubleshooting guide covers the most common monitoring system issues and provides systematic recovery procedures.