# Monitoring System Troubleshooting Guide ## Discord Notification Issues ### Webhook Not Working **Symptoms**: No Discord messages received, connection errors **Diagnosis**: ```bash # Test webhook manually curl -X POST "$DISCORD_WEBHOOK_URL" \ -H "Content-Type: application/json" \ -d '{"content": "Test message"}' # Check webhook URL format echo $DISCORD_WEBHOOK_URL | grep -E "https://discord.com/api/webhooks/[0-9]+/.+" ``` **Solutions**: ```bash # Verify webhook URL is correct # Format: https://discord.com/api/webhooks/ID/TOKEN # Test with minimal payload curl -X POST "$DISCORD_WEBHOOK_URL" \ -H "Content-Type: application/json" \ -d '{"content": "✅ Webhook working"}' # Check for JSON formatting issues echo '{"content": "test"}' | jq . # Validate JSON ``` ### Message Formatting Problems **Symptoms**: Malformed messages, broken markdown, missing user pings **Common Issues**: ```bash # ❌ Broken JSON escaping {"content": "Error: "quotes" break JSON"} # ✅ Proper JSON escaping {"content": "Error: \"quotes\" properly escaped"} # ❌ User ping inside code block (doesn't work) {"content": "```\nIssue occurred <@user_id>\n```"} # ✅ User ping outside code block {"content": "```\nIssue occurred\n```\nManual intervention needed <@user_id>"} ``` ## Tdarr Monitoring Issues ### Script Not Running **Symptoms**: No monitoring alerts, script execution failures **Diagnosis**: ```bash # Check cron job status crontab -l | grep tdarr-timeout-monitor systemctl status cron # Run script manually for debugging bash -x /path/to/tdarr-timeout-monitor.sh # Check script permissions ls -la /path/to/tdarr-timeout-monitor.sh ``` **Solutions**: ```bash # Fix script permissions chmod +x /path/to/tdarr-timeout-monitor.sh # Reinstall cron job crontab -e # Add: */20 * * * * /full/path/to/tdarr-timeout-monitor.sh # Check script environment # Ensure PATH and variables are set correctly in script ``` ### API Connection Failures **Symptoms**: Cannot connect to Tdarr server, timeout errors **Diagnosis**: ```bash # Test Tdarr API manually curl -f "http://tdarr-server:8266/api/v2/status" # Check network connectivity ping tdarr-server nc -zv tdarr-server 8266 # Verify SSH access to server ssh tdarr "docker ps | grep tdarr" ``` **Solutions**: ```bash # Update server connection in script # Verify server IP and port are correct # Test API endpoints curl "http://10.10.0.43:8265/api/v2/status" # Web port curl "http://10.10.0.43:8266/api/v2/status" # Server port # Check Tdarr server logs ssh tdarr "docker logs tdarr | tail -20" ``` ## Windows Desktop Monitoring Issues ### PowerShell Script Not Running **Symptoms**: No reboot notifications from Windows systems **Diagnosis**: ```powershell # Check scheduled task status Get-ScheduledTask -TaskName "Reboot*" | Get-ScheduledTaskInfo # Test script execution manually PowerShell -ExecutionPolicy Bypass -File "C:\path\to\windows-reboot-monitor.ps1" # Check PowerShell execution policy Get-ExecutionPolicy ``` **Solutions**: ```powershell # Set execution policy Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser # Recreate scheduled tasks schtasks /Create /XML "C:\path\to\task.xml" /TN "RebootMonitor" # Check task trigger configuration Get-ScheduledTask -TaskName "RebootMonitor" | Get-ScheduledTaskTrigger ``` ### Network Access from Windows **Symptoms**: PowerShell cannot reach Discord webhook **Diagnosis**: ```powershell # Test network connectivity Test-NetConnection discord.com -Port 443 # Test webhook manually Invoke-RestMethod -Uri $webhookUrl -Method Post -Body '{"content":"test"}' -ContentType "application/json" # Check Windows firewall Get-NetFirewallRule | Where-Object {$_.DisplayName -like "*PowerShell*"} ``` **Solutions**: ```powershell # Allow PowerShell through firewall New-NetFirewallRule -DisplayName "PowerShell Outbound" -Direction Outbound -Program "C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe" -Action Allow # Test with simplified request $body = @{content="Test from Windows"} | ConvertTo-Json Invoke-RestMethod -Uri $webhookUrl -Method Post -Body $body -ContentType "application/json" ``` ## Log Management Issues ### Log Files Growing Too Large **Symptoms**: Disk space filling up, slow log access **Diagnosis**: ```bash # Check log file sizes du -sh /var/log/homelab-* du -sh /tmp/*monitor*.log # Check available disk space df -h /var/log df -h /tmp ``` **Solutions**: ```bash # Implement log rotation cat > /etc/logrotate.d/homelab-monitoring << 'EOF' /var/log/homelab-*.log { daily missingok rotate 7 compress notifempty create 644 root root } EOF # Manual log cleanup find /tmp -name "*monitor*.log" -size +10M -delete truncate -s 0 /tmp/large-log-file.log ``` ### Log Rotation Not Working **Symptoms**: Old logs not being cleaned up **Diagnosis**: ```bash # Check logrotate status systemctl status logrotate cat /var/lib/logrotate/status # Test logrotate configuration logrotate -d /etc/logrotate.d/homelab-monitoring ``` **Solutions**: ```bash # Force log rotation logrotate -f /etc/logrotate.d/homelab-monitoring # Fix logrotate configuration sudo nano /etc/logrotate.d/homelab-monitoring # Verify syntax and permissions ``` ## Cron Job Issues ### Scheduled Tasks Not Running **Symptoms**: Scripts not executing at scheduled times **Diagnosis**: ```bash # Check cron service systemctl status cron systemctl status crond # RHEL/CentOS # View cron logs grep CRON /var/log/syslog journalctl -u cron # List all cron jobs crontab -l sudo crontab -l # System crontab ``` **Solutions**: ```bash # Restart cron service sudo systemctl restart cron # Fix cron job syntax # Ensure absolute paths are used # Example: */20 * * * * /full/path/to/script.sh # Check script permissions and execution ls -la /path/to/script.sh /path/to/script.sh # Test manual execution ``` ### Environment Variables in Cron **Symptoms**: Scripts work manually but fail in cron **Diagnosis**: ```bash # Create test cron job to check environment * * * * * env > /tmp/cron-env.txt # Compare with shell environment env > /tmp/shell-env.txt diff /tmp/shell-env.txt /tmp/cron-env.txt ``` **Solutions**: ```bash # Set PATH in crontab PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin # Or set PATH in script #!/bin/bash export PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin" # Source environment if needed source /etc/environment ``` ## Network Monitoring Issues ### False Positives **Symptoms**: Alerts for services that are actually working **Diagnosis**: ```bash # Test monitoring checks manually curl -sSf --max-time 10 "https://service.homelab.local" ping -c1 -W5 10.10.0.100 # Check for intermittent network issues for i in {1..10}; do ping -c1 host || echo "Fail $i"; done ``` **Solutions**: ```bash # Adjust timeout values curl --max-time 30 "$service" # Increase timeout # Add retry logic for retry in {1..3}; do if curl -sSf "$service" >/dev/null 2>&1; then break elif [ $retry -eq 3 ]; then send_alert "Service $service failed after 3 retries" fi sleep 5 done ``` ### Missing Alerts **Symptoms**: Real failures not triggering notifications **Diagnosis**: ```bash # Verify monitoring script logic bash -x monitoring-script.sh # Check if services are actually down systemctl status service-name curl -v service-url ``` **Solutions**: ```bash # Lower detection thresholds # Increase monitoring frequency # Add redundant monitoring methods # Test alert mechanism echo "Test alert" | send_alert_function ``` ## System Resource Issues ### Monitoring Overhead **Symptoms**: High CPU/memory usage from monitoring scripts **Diagnosis**: ```bash # Monitor the monitoring scripts top -p $(pgrep -f monitor) ps aux | grep monitor # Check monitoring frequency crontab -l | grep monitor ``` **Solutions**: ```bash # Reduce monitoring frequency # Change from */1 to */5 minutes # Optimize scripts # Remove unnecessary commands # Use efficient tools (prefer curl over wget, etc.) # Add resource limits timeout 30 monitoring-script.sh ``` ## Emergency Recovery ### Complete Monitoring Failure **Recovery Steps**: ```bash # Restart all monitoring services sudo systemctl restart cron sudo systemctl restart rsyslog # Reinstall monitoring scripts cd /path/to/scripts ./install-monitoring.sh # Test all components ./test-monitoring.sh ``` ### Discord Integration Lost **Quick Recovery**: ```bash # Test webhook curl -X POST "$BACKUP_WEBHOOK_URL" -H "Content-Type: application/json" -d '{"content": "Monitoring restored"}' # Switch to backup webhook if needed export DISCORD_WEBHOOK_URL="$BACKUP_WEBHOOK_URL" ``` ## Prevention and Best Practices ### Monitoring Health Checks ```bash #!/bin/bash # monitor-the-monitors.sh MONITORING_SCRIPTS="/path/to/tdarr-monitor.sh /path/to/network-monitor.sh" for script in $MONITORING_SCRIPTS; do if [ ! -x "$script" ]; then echo "ALERT: $script not executable" | send_alert fi # Check if script has run recently if [ $(($(date +%s) - $(stat -c %Y "$script.last_run" 2>/dev/null || echo 0))) -gt 3600 ]; then echo "ALERT: $script hasn't run in over an hour" | send_alert fi done ``` ### Backup Alerting Channels ```bash # Multiple notification methods send_alert() { local message="$1" # Primary: Discord curl -X POST "$DISCORD_WEBHOOK" -d "{\"content\":\"$message\"}" || \ # Backup: Email echo "$message" | mail -s "Homelab Alert" admin@domain.com || \ # Last resort: Local log echo "$(date): $message" >> /var/log/critical-alerts.log } ``` This troubleshooting guide covers the most common monitoring system issues and provides systematic recovery procedures.