claude-home/monitoring/troubleshooting.md
Cal Corum 4b7eca8a46
2026-03-12 09:00:44 -05:00

title: Monitoring Troubleshooting Guide
description: Troubleshooting procedures for Discord webhook failures, Tdarr monitoring issues, Windows PowerShell script problems, log rotation, cron job failures, network false positives, and monitoring overhead.
type: troubleshooting
domain: monitoring
tags: discord, webhook, tdarr, windows, powershell, cron, log-rotation, network, alerts
Monitoring System Troubleshooting Guide

Discord Notification Issues

Webhook Not Working

Symptoms: No Discord messages received, connection errors

Diagnosis:

# Test webhook manually
curl -X POST "$DISCORD_WEBHOOK_URL" \
  -H "Content-Type: application/json" \
  -d '{"content": "Test message"}'

# Check webhook URL format
echo "$DISCORD_WEBHOOK_URL" | grep -E '^https://discord\.com/api/webhooks/[0-9]+/.+'

Solutions:

# Verify webhook URL is correct
# Format: https://discord.com/api/webhooks/ID/TOKEN

# Test with minimal payload
curl -X POST "$DISCORD_WEBHOOK_URL" \
  -H "Content-Type: application/json" \
  -d '{"content": "✅ Webhook working"}'

# Check for JSON formatting issues
echo '{"content": "test"}' | jq .  # Validate JSON
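The format check above can be wrapped in a small function so monitoring scripts can refuse to start with a malformed URL. This is a minimal sketch: the function name `is_webhook_url` is hypothetical, and the pattern only mirrors the `ID/TOKEN` shape described above, so it may reject other URL forms that Discord accepts.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds when the argument looks like
# https://discord.com/api/webhooks/ID/TOKEN (numeric ID, non-empty token).
is_webhook_url() {
    local re='^https://discord\.com/api/webhooks/[0-9]+/[^/[:space:]]+$'
    [[ $1 =~ $re ]]
}
```

A script might call it early, for example `is_webhook_url "$DISCORD_WEBHOOK_URL" || { echo "Bad webhook URL" >&2; exit 1; }`. It checks the shape only; it does not confirm that the webhook still exists.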

Message Formatting Problems

Symptoms: Malformed messages, broken markdown, missing user pings

Common Issues:

# ❌ Broken JSON escaping
{"content": "Error: "quotes" break JSON"}

# ✅ Proper JSON escaping  
{"content": "Error: \"quotes\" properly escaped"}

# ❌ User ping inside code block (doesn't work)
{"content": "```\nIssue occurred <@user_id>\n```"}

# ✅ User ping outside code block
{"content": "```\nIssue occurred\n```\nManual intervention needed <@user_id>"}
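Both problems above can be avoided by building the payload in one place. The sketch below is pure Bash so it runs on hosts without `jq`. The function names `json_escape` and `build_alert_payload` are hypothetical, and the escaping covers only backslashes, double quotes, newlines, carriage returns, and tabs.

```shell
#!/usr/bin/env bash
# Escape the characters that most often break a JSON string value.
# Other control characters are not handled in this sketch.
json_escape() {
    local s=$1
    s=${s//\\/\\\\}      # backslash first, so later escapes are not doubled
    s=${s//\"/\\\"}      # double quote
    s=${s//$'\n'/\\n}    # newline
    s=${s//$'\r'/\\r}    # carriage return
    s=${s//$'\t'/\\t}    # tab
    printf '%s' "$s"
}

# Put the details inside a code block and the user ping after it,
# because pings inside a code block are not delivered.
build_alert_payload() {
    local details=$1 summary=$2 user_id=$3 fence
    fence=$(printf '\140\140\140')   # three backticks (octal 140)
    printf '{"content": "%s\\n%s\\n%s\\n%s <@%s>"}\n' \
        "$fence" "$(json_escape "$details")" "$fence" \
        "$(json_escape "$summary")" "$user_id"
}
```

The output can be passed straight to `curl -d`. Where `jq` is installed, piping the result through `jq .` is a quick way to confirm it parses.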

Tdarr Monitoring Issues

Script Not Running

Symptoms: No monitoring alerts, script execution failures

Diagnosis:

# Check cron job status
crontab -l | grep tdarr-timeout-monitor
systemctl status cron

# Run script manually for debugging
bash -x /path/to/tdarr-timeout-monitor.sh

# Check script permissions
ls -la /path/to/tdarr-timeout-monitor.sh

Solutions:

# Fix script permissions
chmod +x /path/to/tdarr-timeout-monitor.sh

# Reinstall cron job
crontab -e
# Add: */20 * * * * /full/path/to/tdarr-timeout-monitor.sh

# Check script environment
# Ensure PATH and variables are set correctly in script

API Connection Failures

Symptoms: Cannot connect to Tdarr server, timeout errors

Diagnosis:

# Test Tdarr API manually
curl -f "http://tdarr-server:8266/api/v2/status"

# Check network connectivity
ping tdarr-server
nc -zv tdarr-server 8266

# Verify SSH access to server
ssh tdarr "docker ps | grep tdarr"

Solutions:

# Update server connection in script
# Verify server IP and port are correct

# Test API endpoints
curl "http://10.10.0.43:8265/api/v2/status"  # Web port
curl "http://10.10.0.43:8266/api/v2/status"  # Server port

# Check Tdarr server logs
ssh tdarr "docker logs --tail 20 tdarr 2>&1"
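When `nc` is not installed on the monitoring host, the port checks above can be approximated with Bash's built-in `/dev/tcp` redirection. This is a sketch under assumptions: `check_port` and `check_tdarr_ports` are hypothetical names, the ports are the two used above, and `/dev/tcp` must be enabled in the local Bash build.

```shell
#!/usr/bin/env bash
# Succeeds if a TCP connection to host:port opens within the timeout (seconds).
check_port() {
    local host=$1 port=$2 secs=${3:-5}
    timeout "$secs" bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null
}

# Report each Tdarr port separately so a single closed port is easy to spot.
check_tdarr_ports() {
    local host=$1 port status=0
    for port in 8265 8266; do
        if check_port "$host" "$port"; then
            echo "OK   $host:$port"
        else
            echo "FAIL $host:$port"
            status=1
        fi
    done
    return "$status"
}
```

Running `check_tdarr_ports 10.10.0.43` separates "host unreachable" (both ports fail) from "one service down" (only one port fails) before you look at the API itself.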

Windows Desktop Monitoring Issues

PowerShell Script Not Running

Symptoms: No reboot notifications from Windows systems

Diagnosis:

# Check scheduled task status
Get-ScheduledTask -TaskName "Reboot*" | Get-ScheduledTaskInfo

# Test script execution manually
PowerShell -ExecutionPolicy Bypass -File "C:\path\to\windows-reboot-monitor.ps1"

# Check PowerShell execution policy
Get-ExecutionPolicy

Solutions:

# Set execution policy
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser

# Recreate scheduled tasks
schtasks /Create /XML "C:\path\to\task.xml" /TN "RebootMonitor"

# Check task trigger configuration
(Get-ScheduledTask -TaskName "RebootMonitor").Triggers

Network Access from Windows

Symptoms: PowerShell cannot reach Discord webhook

Diagnosis:

# Test network connectivity
Test-NetConnection discord.com -Port 443

# Test webhook manually
Invoke-RestMethod -Uri $webhookUrl -Method Post -Body '{"content":"test"}' -ContentType "application/json"

# Check Windows firewall
Get-NetFirewallRule | Where-Object {$_.DisplayName -like "*PowerShell*"}

Solutions:

# Allow PowerShell through firewall
New-NetFirewallRule -DisplayName "PowerShell Outbound" -Direction Outbound -Program "C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe" -Action Allow

# Test with simplified request
$body = @{content="Test from Windows"} | ConvertTo-Json
Invoke-RestMethod -Uri $webhookUrl -Method Post -Body $body -ContentType "application/json"

Log Management Issues

Log Files Growing Too Large

Symptoms: Disk space filling up, slow log access

Diagnosis:

# Check log file sizes
du -sh /var/log/homelab-*
du -sh /tmp/*monitor*.log

# Check available disk space
df -h /var/log
df -h /tmp

Solutions:

# Implement log rotation
sudo tee /etc/logrotate.d/homelab-monitoring > /dev/null << 'EOF'
/var/log/homelab-*.log {
    daily
    missingok
    rotate 7
    compress
    notifempty
    create 644 root root
}
EOF

# Manual log cleanup
find /tmp -name "*monitor*.log" -size +10M -delete
truncate -s 0 /tmp/large-log-file.log
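Deleting a log that a running script still has open may not free the space until that script exits, so truncating in place is often the safer cleanup. The sketch below combines the `find` and `truncate` commands above. The function name `truncate_large_logs` is hypothetical, and it assumes GNU `find` and `truncate`.

```shell
#!/usr/bin/env bash
# Truncate (rather than delete) monitor logs in a directory that exceed
# the given size. The size uses find's -size syntax, e.g. 10M.
truncate_large_logs() {
    local dir=$1 min_size=$2
    find "$dir" -maxdepth 1 -type f -name '*monitor*.log' \
        -size "+$min_size" -exec truncate -s 0 {} +
}
```

For example, `truncate_large_logs /tmp 10M` empties every matching log over 10 MiB in `/tmp` and leaves smaller logs untouched.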

Log Rotation Not Working

Symptoms: Old logs not being cleaned up

Diagnosis:

# Check logrotate status
systemctl status logrotate
cat /var/lib/logrotate/status

# Test logrotate configuration
logrotate -d /etc/logrotate.d/homelab-monitoring

Solutions:

# Force log rotation
sudo logrotate -f /etc/logrotate.d/homelab-monitoring

# Fix logrotate configuration
sudo nano /etc/logrotate.d/homelab-monitoring
# Verify syntax and permissions

Cron Job Issues

Scheduled Tasks Not Running

Symptoms: Scripts not executing at scheduled times

Diagnosis:

# Check cron service
systemctl status cron
systemctl status crond  # RHEL/CentOS

# View cron logs
grep CRON /var/log/syslog
journalctl -u cron

# List all cron jobs
crontab -l
sudo crontab -l  # root's crontab

Solutions:

# Restart cron service
sudo systemctl restart cron

# Fix cron job syntax
# Ensure absolute paths are used
# Example: */20 * * * * /full/path/to/script.sh

# Check script permissions and execution
ls -la /path/to/script.sh
/path/to/script.sh  # Test manual execution

Environment Variables in Cron

Symptoms: Scripts work manually but fail in cron

Diagnosis:

# Create test cron job to check environment
* * * * * env > /tmp/cron-env.txt

# Compare with shell environment
env > /tmp/shell-env.txt
diff <(sort /tmp/shell-env.txt) <(sort /tmp/cron-env.txt)  # Sort first so ordering differences are ignored

Solutions:

# Set PATH in crontab
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

# Or set PATH in script
#!/bin/bash
export PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"

# Source environment if needed
set -a; source /etc/environment; set +a  # set -a exports the KEY=value pairs to child processes
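The PATH and environment fixes above can be collected into a preamble that each cron-launched script calls first. This is only a sketch: `setup_cron_path` and `load_env_file` are hypothetical names, and the env file is assumed to contain plain `KEY=value` lines that are safe to source.

```shell
#!/usr/bin/env bash
# Put the standard directories first, keeping whatever cron already provided.
setup_cron_path() {
    export PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin${PATH:+:$PATH}"
}

# Source KEY=value lines from a file and export them to child processes.
# Returns 1 if the file is missing or unreadable.
load_env_file() {
    local file=$1
    [ -r "$file" ] || return 1
    set -a          # export every variable assigned while sourcing
    . "$file"
    set +a
}
```

A monitoring script would then start with `setup_cron_path` followed by `load_env_file /etc/environment`, so it sees the same variables under cron as in an interactive shell.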

Network Monitoring Issues

False Positives

Symptoms: Alerts for services that are actually working

Diagnosis:

# Test monitoring checks manually
curl -sSf --max-time 10 "https://service.homelab.local"
ping -c1 -W5 10.10.0.100

# Check for intermittent network issues
for i in {1..10}; do ping -c1 -W5 host >/dev/null || echo "Fail $i"; done

Solutions:

# Adjust timeout values
curl --max-time 30 "$service"  # Increase timeout

# Add retry logic
for retry in {1..3}; do
    if curl -sSf --max-time 30 "$service" >/dev/null 2>&1; then
        break
    elif [ "$retry" -eq 3 ]; then
        send_alert "Service $service failed after 3 attempts"
    else
        sleep 5  # Pause before the next attempt
    fi
done
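If several checks need the same behaviour, the retry loop can be factored into a reusable function. The sketch below is one way to do that; the name `retry` and its argument order (attempts, delay in seconds, then the command) are assumptions, not part of the existing scripts.

```shell
#!/usr/bin/env bash
# Run a command up to N times, sleeping between failed attempts.
# Usage: retry ATTEMPTS DELAY_SECONDS command [args...]
retry() {
    local attempts=$1 delay=$2 n
    shift 2
    for ((n = 1; n <= attempts; n++)); do
        "$@" && return 0                      # success: stop retrying
        (( n < attempts )) && sleep "$delay"  # no sleep after the final attempt
    done
    return 1
}
```

A check then becomes `retry 3 5 curl -sSf --max-time 30 "$service" >/dev/null || send_alert "Service $service failed"`, which alerts only after every attempt has failed.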

Missing Alerts

Symptoms: Real failures not triggering notifications

Diagnosis:

# Verify monitoring script logic
bash -x monitoring-script.sh

# Check if services are actually down
systemctl status service-name
curl -v service-url

Solutions:

# Lower detection thresholds
# Increase monitoring frequency
# Add redundant monitoring methods

# Test alert mechanism
send_alert "Test alert"

System Resource Issues

Monitoring Overhead

Symptoms: High CPU/memory usage from monitoring scripts

Diagnosis:

# Monitor the monitoring scripts
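top -p "$(pgrep -d, -f monitor)"  # top -p expects a comma-separated PID list
ps aux | grep '[m]onitor'         # Bracket trick keeps grep from matching itself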

# Check monitoring frequency
crontab -l | grep monitor

Solutions:

# Reduce monitoring frequency
# Change from */1 to */5 minutes

# Optimize scripts
# Remove unnecessary commands
# Use efficient tools (prefer curl over wget, etc.)

# Add resource limits
timeout 30 monitoring-script.sh
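With GNU `timeout`, a command that is killed for exceeding the limit exits with status 124, which lets a wrapper tell "too slow" apart from an ordinary failure. The sketch below assumes GNU coreutils; the function name `run_with_limit` is hypothetical.

```shell
#!/usr/bin/env bash
# Run a command under a time limit and report when the limit was hit.
# Usage: run_with_limit SECONDS command [args...]
run_with_limit() {
    local limit=$1
    shift
    timeout "$limit" "$@"
    local status=$?
    if [ "$status" -eq 124 ]; then
        echo "Timed out after ${limit}s: $*" >&2
    fi
    return "$status"
}
```

Logging the timeout separately makes it easier to see whether a script is hanging on a slow endpoint or failing outright.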

Emergency Recovery

Complete Monitoring Failure

Recovery Steps:

# Restart all monitoring services
sudo systemctl restart cron
sudo systemctl restart rsyslog

# Reinstall monitoring scripts
cd /path/to/scripts
./install-monitoring.sh

# Test all components
./test-monitoring.sh

Discord Integration Lost

Quick Recovery:

# Test the backup webhook
curl -X POST "$BACKUP_WEBHOOK_URL" -H "Content-Type: application/json" -d '{"content": "Monitoring restored"}'

# Switch to backup webhook if needed
export DISCORD_WEBHOOK_URL="$BACKUP_WEBHOOK_URL"

Prevention and Best Practices

Monitoring Health Checks

#!/bin/bash
# monitor-the-monitors.sh
MONITORING_SCRIPTS="/path/to/tdarr-monitor.sh /path/to/network-monitor.sh"

for script in $MONITORING_SCRIPTS; do
    if [ ! -x "$script" ]; then
        send_alert "ALERT: $script not executable"
    fi

    # Check if script has run recently
    # (assumes each script touches "$script.last_run" when it finishes)
    last_run=$(stat -c %Y "$script.last_run" 2>/dev/null || echo 0)
    if [ $(( $(date +%s) - last_run )) -gt 3600 ]; then
        send_alert "ALERT: $script hasn't run in over an hour"
    fi
done
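The health check above depends on a `.last_run` marker next to each script, so each monitoring script has to refresh that marker itself. A minimal sketch of both sides follows. The helper names `mark_run` and `is_stale` are hypothetical, and GNU `stat` is assumed, as in the check above.

```shell
#!/usr/bin/env bash
# Called at the end of a monitoring script to record a successful run.
mark_run() {
    touch "$1.last_run"
}

# Succeeds if the marker is missing or older than max_age seconds.
is_stale() {
    local marker=$1 max_age=$2 now mtime
    now=$(date +%s)
    mtime=$(stat -c %Y "$marker" 2>/dev/null || echo 0)
    [ $((now - mtime)) -gt "$max_age" ]
}
```

A monitoring script would finish with `mark_run "$0"`, and the health check could call `is_stale "$script.last_run" 3600`. A missing marker counts as stale, so a script that has never run is also reported.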

Backup Alerting Channels

# Multiple notification methods
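send_alert() {
    local message="$1"
    local payload

    # Build the payload with jq so quotes and newlines in the message are escaped
    payload=$(jq -n --arg content "$message" '{content: $content}')

    # Primary: Discord (-f makes curl fail on HTTP errors so the fallbacks run)
    curl -sSf -X POST "$DISCORD_WEBHOOK_URL" \
        -H "Content-Type: application/json" -d "$payload" ||
    # Backup: Email
    echo "$message" | mail -s "Homelab Alert" admin@domain.com ||
    # Last resort: Local log
    echo "$(date): $message" >> /var/log/critical-alerts.log
}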

This troubleshooting guide covers the most common monitoring system issues and provides systematic recovery procedures.