---
title: "Monitoring Troubleshooting Guide"
description: "Troubleshooting procedures for Discord webhook failures, Tdarr monitoring issues, Windows PowerShell script problems, log rotation, cron job failures, network false positives, and monitoring overhead."
type: troubleshooting
domain: monitoring
tags: [discord, webhook, tdarr, windows, powershell, cron, log-rotation, network, alerts]
---

# Monitoring System Troubleshooting Guide

## Discord Notification Issues

### Webhook Not Working
**Symptoms**: No Discord messages received, connection errors
**Diagnosis**:
```bash
# Test webhook manually
curl -X POST "$DISCORD_WEBHOOK_URL" \
  -H "Content-Type: application/json" \
  -d '{"content": "Test message"}'

# Check webhook URL format (quoted variable, anchored pattern, escaped dots)
echo "$DISCORD_WEBHOOK_URL" | grep -E '^https://discord\.com/api/webhooks/[0-9]+/.+'
```

**Solutions**:
```bash
# Verify webhook URL is correct
# Format: https://discord.com/api/webhooks/ID/TOKEN

# Test with minimal payload
curl -X POST "$DISCORD_WEBHOOK_URL" \
  -H "Content-Type: application/json" \
  -d '{"content": "✅ Webhook working"}'

# Check for JSON formatting issues
echo '{"content": "test"}' | jq .  # Validate JSON
```
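
When the manual test fails, the HTTP status code usually narrows down the cause. The sketch below assumes that Discord answers a successful webhook POST with `204 No Content`, and that curl reports `000` when no HTTP response arrives at all. The helper names `explain_webhook_status` and `test_webhook` are hypothetical and not part of the existing monitoring scripts, and the cause attached to each code is a likely reading rather than a definitive diagnosis.

```bash
#!/bin/bash
# Hypothetical helper: map an HTTP status code to a likely cause.
explain_webhook_status() {
    case "$1" in
        200|204) echo "ok: webhook accepted the message" ;;
        400)     echo "bad request: check the JSON payload" ;;
        401|404) echo "invalid webhook: check the ID and token in the URL" ;;
        429)     echo "rate limited: wait before retrying" ;;
        000)     echo "no response: check DNS, network, or firewall" ;;
        *)       echo "unexpected status $1" ;;
    esac
}

# Hypothetical wrapper: send a test message to the URL in $1 and report the result.
test_webhook() {
    local code
    code=$(curl -s -o /dev/null -w '%{http_code}' -X POST "$1" \
        -H "Content-Type: application/json" \
        -d '{"content": "Test message"}')
    explain_webhook_status "$code"
}
```

Calling `test_webhook "$DISCORD_WEBHOOK_URL"` prints a one-line diagnosis instead of the raw response body.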

### Message Formatting Problems
**Symptoms**: Malformed messages, broken markdown, missing user pings
**Common Issues**:
```bash
# ❌ Broken JSON escaping
{"content": "Error: "quotes" break JSON"}

# ✅ Proper JSON escaping
{"content": "Error: \"quotes\" properly escaped"}

# ❌ User ping inside code block (doesn't work)
{"content": "```\nIssue occurred <@user_id>\n```"}

# ✅ User ping outside code block
{"content": "```\nIssue occurred\n```\nManual intervention needed <@user_id>"}
```
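
Both problems above can be avoided by building the payload in one place. The following is a minimal sketch in plain bash, assuming messages contain only quotes, backslashes, newlines, carriage returns, and tabs as special characters; `json_escape` and `build_payload` are hypothetical names, and `jq` remains the more robust choice when it is available.

```bash
#!/bin/bash
# Hypothetical helper: escape a string for use inside a JSON string literal.
json_escape() {
    local s="$1"
    s=${s//\\/\\\\}        # backslash first, so later escapes are not doubled
    s=${s//\"/\\\"}        # double quotes
    s=${s//$'\n'/\\n}      # newlines
    s=${s//$'\r'/\\r}      # carriage returns
    s=${s//$'\t'/\\t}      # tabs
    printf '%s' "$s"
}

# Hypothetical helper: wrap MESSAGE in a code block and, if USER_ID is given,
# place the ping after the closing fence so Discord still renders it.
# Usage: build_payload MESSAGE [USER_ID]
build_payload() {
    local body mention=""
    body=$(json_escape "$1")
    if [ -n "${2:-}" ]; then
        mention="\\n<@$2>"
    fi
    printf '{"content": "```\\n%s\\n```%s"}\n' "$body" "$mention"
}
```

The result can be passed straight to curl, for example `curl -X POST "$DISCORD_WEBHOOK_URL" -H "Content-Type: application/json" -d "$(build_payload "$message" "$user_id")"`, where `$message` and `$user_id` are placeholders.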

## Tdarr Monitoring Issues

### Script Not Running
**Symptoms**: No monitoring alerts, script execution failures
**Diagnosis**:
```bash
# Check cron job status
crontab -l | grep tdarr-timeout-monitor
systemctl status cron

# Run script manually for debugging
bash -x /path/to/tdarr-timeout-monitor.sh

# Check script permissions
ls -la /path/to/tdarr-timeout-monitor.sh
```

**Solutions**:
```bash
# Fix script permissions
chmod +x /path/to/tdarr-timeout-monitor.sh

# Reinstall cron job
crontab -e
# Add: */20 * * * * /full/path/to/tdarr-timeout-monitor.sh

# Check script environment
# Ensure PATH and variables are set correctly in script
```

### API Connection Failures
**Symptoms**: Cannot connect to Tdarr server, timeout errors
**Diagnosis**:
```bash
# Test Tdarr API manually
curl -f "http://tdarr-server:8266/api/v2/status"

# Check network connectivity
ping tdarr-server
nc -zv tdarr-server 8266

# Verify SSH access to server
ssh tdarr "docker ps | grep tdarr"
```

**Solutions**:
```bash
# Update server connection in script
# Verify server IP and port are correct

# Test API endpoints
curl "http://10.10.0.43:8265/api/v2/status"  # Web port
curl "http://10.10.0.43:8266/api/v2/status"  # Server port

# Check Tdarr server logs (docker logs writes to both stdout and stderr)
ssh tdarr "docker logs --tail 20 tdarr 2>&1"
```

## Windows Desktop Monitoring Issues

### PowerShell Script Not Running
**Symptoms**: No reboot notifications from Windows systems
**Diagnosis**:
```powershell
# Check scheduled task status
Get-ScheduledTask -TaskName "Reboot*" | Get-ScheduledTaskInfo

# Test script execution manually
PowerShell -ExecutionPolicy Bypass -File "C:\path\to\windows-reboot-monitor.ps1"

# Check PowerShell execution policy
Get-ExecutionPolicy
```

**Solutions**:
```powershell
# Set execution policy
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser

# Recreate scheduled tasks
schtasks /Create /XML "C:\path\to\task.xml" /TN "RebootMonitor"

# Check task trigger configuration (triggers are a property of the task object)
(Get-ScheduledTask -TaskName "RebootMonitor").Triggers
```

### Network Access from Windows
**Symptoms**: PowerShell cannot reach Discord webhook
**Diagnosis**:
```powershell
# Test network connectivity
Test-NetConnection discord.com -Port 443

# Test webhook manually
Invoke-RestMethod -Uri $webhookUrl -Method Post -Body '{"content":"test"}' -ContentType "application/json"

# Check Windows firewall
Get-NetFirewallRule | Where-Object {$_.DisplayName -like "*PowerShell*"}
```

**Solutions**:
```powershell
# Allow PowerShell through firewall
New-NetFirewallRule -DisplayName "PowerShell Outbound" -Direction Outbound -Program "C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe" -Action Allow

# Test with simplified request
$body = @{content="Test from Windows"} | ConvertTo-Json
Invoke-RestMethod -Uri $webhookUrl -Method Post -Body $body -ContentType "application/json"
```

## Log Management Issues

### Log Files Growing Too Large
**Symptoms**: Disk space filling up, slow log access
**Diagnosis**:
```bash
# Check log file sizes
du -sh /var/log/homelab-*
du -sh /tmp/*monitor*.log

# Check available disk space
df -h /var/log
df -h /tmp
```

**Solutions**:
```bash
# Implement log rotation (writing to /etc/logrotate.d requires root)
sudo tee /etc/logrotate.d/homelab-monitoring > /dev/null << 'EOF'
/var/log/homelab-*.log {
    daily
    missingok
    rotate 7
    compress
    notifempty
    create 644 root root
}
EOF

# Manual log cleanup
find /tmp -name "*monitor*.log" -size +10M -delete
truncate -s 0 /tmp/large-log-file.log
```

### Log Rotation Not Working
**Symptoms**: Old logs not being cleaned up
**Diagnosis**:
```bash
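# Check logrotate status (on systemd systems logrotate is started by a timer)
systemctl status logrotate.timer
cat /var/lib/logrotate/status  # Debian/Ubuntu; RHEL-family uses /var/lib/logrotate/logrotate.status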

# Test logrotate configuration
logrotate -d /etc/logrotate.d/homelab-monitoring
```

**Solutions**:
```bash
# Force log rotation (needs root to rotate files under /var/log)
sudo logrotate -f /etc/logrotate.d/homelab-monitoring

# Fix logrotate configuration
sudo nano /etc/logrotate.d/homelab-monitoring
# Verify syntax and permissions
```

## Cron Job Issues

### Scheduled Tasks Not Running
**Symptoms**: Scripts not executing at scheduled times
**Diagnosis**:
```bash
# Check cron service
systemctl status cron
systemctl status crond  # RHEL/CentOS

# View cron logs
grep CRON /var/log/syslog
journalctl -u cron

# List all cron jobs
crontab -l
sudo crontab -l  # root's crontab
cat /etc/crontab; ls /etc/cron.d  # System-wide crontabs
```

**Solutions**:
```bash
# Restart cron service
sudo systemctl restart cron

# Fix cron job syntax
# Ensure absolute paths are used
# Example: */20 * * * * /full/path/to/script.sh

# Check script permissions and execution
ls -la /path/to/script.sh
/path/to/script.sh  # Test manual execution
```

### Environment Variables in Cron
**Symptoms**: Scripts work manually but fail in cron
**Diagnosis**:
```bash
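# Create test cron job to check environment (remove it once the file exists)
# Add via crontab -e: * * * * * env > /tmp/cron-env.txt

# Compare with shell environment (sorted so ordering differences are ignored)
env | sort > /tmp/shell-env.txt
diff /tmp/shell-env.txt <(sort /tmp/cron-env.txt)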
```

**Solutions**:
```bash
# Set PATH in crontab
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

# Or set PATH in script
#!/bin/bash
export PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"

# Source environment if needed (set -a exports the KEY=value pairs it defines)
set -a; . /etc/environment; set +a
```
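
A quicker way to reproduce a cron-only failure is to run the script by hand with a stripped-down environment. The sketch below is an approximation, not an exact copy of what cron provides: the `PATH` value and the choice of variables are assumptions, and `run_like_cron` is a hypothetical helper name.

```bash
#!/bin/bash
# Hypothetical helper: run a command with a minimal, cron-like environment.
# Assumption: cron supplies little more than HOME, SHELL, and a short PATH.
# Usage: run_like_cron COMMAND [ARGS...]
run_like_cron() {
    env -i HOME="${HOME:-/}" SHELL=/bin/sh PATH=/usr/bin:/bin "$@"
}
```

For example, `run_like_cron /path/to/script.sh` will fail in the same way as the cron run if the script depends on variables or `PATH` entries that only exist in an interactive shell.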

## Network Monitoring Issues

### False Positives
**Symptoms**: Alerts for services that are actually working
**Diagnosis**:
```bash
# Test monitoring checks manually
curl -sSf --max-time 10 "https://service.homelab.local"
ping -c1 -W5 10.10.0.100

# Check for intermittent network issues
for i in {1..10}; do ping -c1 host || echo "Fail $i"; done
```

**Solutions**:
```bash
# Adjust timeout values
curl --max-time 30 "$service"  # Increase timeout

# Add retry logic (alert only after the third failure, no sleep after the last try)
for retry in 1 2 3; do
    if curl -sSf --max-time 30 "$service" >/dev/null 2>&1; then
        break
    elif [ "$retry" -eq 3 ]; then
        send_alert "Service $service failed after 3 retries"
        break
    fi
    sleep 5
done
```
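
The retry loop can be factored into a reusable function so that every check uses the same policy. This is a minimal sketch; `check_with_retries` is a hypothetical name, and the attempt count and delay are passed in rather than taken from any existing configuration.

```bash
#!/bin/bash
# Hypothetical helper: run a check up to ATTEMPTS times, pausing DELAY seconds
# between attempts. Returns 0 on the first success, 1 if every attempt fails.
# Usage: check_with_retries ATTEMPTS DELAY COMMAND [ARGS...]
check_with_retries() {
    local attempts="$1" delay="$2" try
    shift 2
    for ((try = 1; try <= attempts; try++)); do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        # Only sleep if another attempt remains
        if ((try < attempts)); then
            sleep "$delay"
        fi
    done
    return 1
}
```

A check then reads `check_with_retries 3 5 curl -sSf --max-time 10 "$service" || send_alert "Service $service failed after 3 retries"`, which keeps the alerting decision next to the check itself.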

### Missing Alerts
**Symptoms**: Real failures not triggering notifications
**Diagnosis**:
```bash
# Verify monitoring script logic
bash -x monitoring-script.sh

# Check if services are actually down
systemctl status service-name
curl -v service-url
```

**Solutions**:
```bash
# Lower detection thresholds
# Increase monitoring frequency
# Add redundant monitoring methods

# Test alert mechanism (send_alert takes the message as its first argument)
send_alert "Test alert"
```

## System Resource Issues

### Monitoring Overhead
**Symptoms**: High CPU/memory usage from monitoring scripts
**Diagnosis**:
```bash
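# Monitor the monitoring scripts (top -p expects a comma-separated PID list)
top -p "$(pgrep -d, -f monitor)"
ps aux | grep '[m]onitor'  # Bracket pattern keeps grep from matching itself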

# Check monitoring frequency
crontab -l | grep monitor
```

**Solutions**:
```bash
# Reduce monitoring frequency
# Change from */1 to */5 minutes

# Optimize scripts
# Remove unnecessary commands
# Use efficient tools (prefer curl over wget, etc.)

# Add resource limits
timeout 30 monitoring-script.sh
```
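
The time limit can be combined with a lower CPU priority and a visible message when the limit is hit. The sketch below assumes GNU coreutils `timeout`, which exits with status 124 when it kills the command; `run_limited` is a hypothetical wrapper name.

```bash
#!/bin/bash
# Hypothetical wrapper: run a command at the lowest CPU priority with a time limit.
# Returns the command's exit status, or 124 if the limit was reached.
# Usage: run_limited SECONDS COMMAND [ARGS...]
run_limited() {
    local limit="$1" status
    shift
    nice -n 19 timeout "$limit" "$@"
    status=$?
    if [ "$status" -eq 124 ]; then
        echo "Timed out after ${limit}s: $*" >&2
    fi
    return "$status"
}
```

In a crontab entry or wrapper script this would be called as `run_limited 30 /path/to/monitoring-script.sh`.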

## Emergency Recovery

### Complete Monitoring Failure
**Recovery Steps**:
```bash
# Restart all monitoring services
sudo systemctl restart cron
sudo systemctl restart rsyslog

# Reinstall monitoring scripts
cd /path/to/scripts
./install-monitoring.sh

# Test all components
./test-monitoring.sh
```

### Discord Integration Lost
**Quick Recovery**:
```bash
# Test the backup webhook
curl -X POST "$BACKUP_WEBHOOK_URL" -H "Content-Type: application/json" -d '{"content": "Monitoring restored"}'

# Switch to backup webhook if needed
export DISCORD_WEBHOOK_URL="$BACKUP_WEBHOOK_URL"
```

## Prevention and Best Practices

### Monitoring Health Checks
```bash
#!/bin/bash
# monitor-the-monitors.sh
# Assumes send_alert (see "Backup Alerting Channels") is defined or sourced,
# and that each monitoring script touches "<script>.last_run" when it finishes.
MONITORING_SCRIPTS=(/path/to/tdarr-monitor.sh /path/to/network-monitor.sh)

for script in "${MONITORING_SCRIPTS[@]}"; do
    if [ ! -x "$script" ]; then
        send_alert "ALERT: $script not executable"
    fi

    # Check if script has run recently (a missing marker counts as never run)
    last_run=$(stat -c %Y "$script.last_run" 2>/dev/null || echo 0)
    if [ $(( $(date +%s) - last_run )) -gt 3600 ]; then
        send_alert "ALERT: $script hasn't run in over an hour"
    fi
done
```
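
The health check reads the modification time of `<script>.last_run`, but something has to create that file. One option, assuming the monitoring scripts can be edited and that GNU `stat` is available, is a pair of small helpers; `mark_last_run` and `is_stale` are hypothetical names, not functions from the existing scripts.

```bash
#!/bin/bash
# Hypothetical helper: record a completed run by touching SCRIPT.last_run.
# Usage: mark_last_run /path/to/script.sh
mark_last_run() {
    touch "$1.last_run"
}

# Hypothetical helper: succeed if SCRIPT.last_run is missing or older than
# MAX_AGE seconds (default 3600).
# Usage: is_stale /path/to/script.sh [MAX_AGE]
is_stale() {
    local marker="$1.last_run" max_age="${2:-3600}" mtime
    mtime=$(stat -c %Y "$marker" 2>/dev/null || echo 0)
    [ $(( $(date +%s) - mtime )) -gt "$max_age" ]
}
```

Each monitoring script would call `mark_last_run` with its own absolute path as its final step, so that a run that aborts early leaves the marker untouched and eventually shows up as stale.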

### Backup Alerting Channels
```bash
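# Multiple notification methods
send_alert() {
    local message="$1"
    local payload

    # Build the payload with jq so quotes and newlines are escaped correctly
    payload=$(jq -n --arg content "$message" '{content: $content}')

    # Primary: Discord (-f makes curl fail on HTTP errors so the fallback runs)
    curl -fsS -X POST "$DISCORD_WEBHOOK_URL" \
        -H "Content-Type: application/json" \
        -d "$payload" >/dev/null && return 0

    # Backup: Email
    echo "$message" | mail -s "Homelab Alert" admin@domain.com && return 0

    # Last resort: Local log
    echo "$(date): $message" >> /var/log/critical-alerts.log
}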
```

This troubleshooting guide covers the most common monitoring system issues and provides systematic recovery procedures.