---
title: "Monitoring Troubleshooting Guide"
description: "Troubleshooting procedures for Discord webhook failures, Tdarr monitoring issues, Windows PowerShell script problems, log rotation, cron job failures, network false positives, and monitoring overhead."
type: troubleshooting
domain: monitoring
tags: [discord, webhook, tdarr, windows, powershell, cron, log-rotation, network, alerts]
---

# Monitoring System Troubleshooting Guide

## Discord Notification Issues

### Webhook Not Working
**Symptoms**: No Discord messages received, connection errors
**Diagnosis**:
```bash
# Test webhook manually
curl -X POST "$DISCORD_WEBHOOK_URL" \
    -H "Content-Type: application/json" \
    -d '{"content": "Test message"}'

# Check webhook URL format (prints the URL only if it matches)
echo "$DISCORD_WEBHOOK_URL" | grep -E '^https://discord\.com/api/webhooks/[0-9]+/.+'
```

**Solutions**:
```bash
# Verify webhook URL is correct
# Format: https://discord.com/api/webhooks/ID/TOKEN

# Test with minimal payload
curl -X POST "$DISCORD_WEBHOOK_URL" \
    -H "Content-Type: application/json" \
    -d '{"content": "✅ Webhook working"}'

# Check for JSON formatting issues
echo '{"content": "test"}' | jq . # Validate JSON
```

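The URL format check above can also live inside a monitoring script, so a malformed URL is caught before any request is sent. The following is a minimal sketch, assuming bash; the function name `is_discord_webhook_url` is hypothetical, and the token character set (letters, digits, `_`, `-`) is an assumption, not something this guide specifies.

```bash
# Hypothetical helper: returns 0 when the URL looks like a Discord webhook URL.
is_discord_webhook_url() {
    local url="$1"
    # Numeric webhook ID followed by a token.
    # The allowed token characters are an assumption.
    local pattern='^https://discord\.com/api/webhooks/[0-9]+/[A-Za-z0-9_-]+$'
    [[ "$url" =~ $pattern ]]
}

# Possible usage at the top of a script:
#   is_discord_webhook_url "$DISCORD_WEBHOOK_URL" || { echo "Bad webhook URL" >&2; exit 1; }
```

Keeping the pattern in a variable avoids quoting problems with `=~`, which treats a quoted right-hand side as a literal string.
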
### Message Formatting Problems
**Symptoms**: Malformed messages, broken markdown, missing user pings
**Common Issues**:
```bash
# ❌ Broken JSON escaping
{"content": "Error: "quotes" break JSON"}

# ✅ Proper JSON escaping
{"content": "Error: \"quotes\" properly escaped"}

# ❌ User ping inside code block (doesn't work)
{"content": "```\nIssue occurred <@user_id>\n```"}

# ✅ User ping outside code block
{"content": "```\nIssue occurred\n```\nManual intervention needed <@user_id>"}
```

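Escaping problems like the ones above usually appear when a message is interpolated straight into a JSON string. The sketch below shows one way to escape a message in plain bash before building the payload. The names `json_escape` and `discord_payload` are hypothetical, and only backslashes, double quotes, newlines, tabs, and carriage returns are handled; other control characters are not.

```bash
# Hypothetical helper: escape a string for use inside a JSON string literal.
json_escape() {
    local s="$1"
    s=${s//\\/\\\\}       # backslashes first, so later escapes are not doubled
    s=${s//\"/\\\"}       # double quotes
    s=${s//$'\n'/\\n}     # newlines
    s=${s//$'\t'/\\t}     # tabs
    s=${s//$'\r'/\\r}     # carriage returns
    printf '%s' "$s"
}

# Hypothetical helper: build a Discord payload with an escaped message.
discord_payload() {
    printf '{"content": "%s"}' "$(json_escape "$1")"
}

# Possible usage:
#   curl -X POST "$DISCORD_WEBHOOK_URL" \
#       -H "Content-Type: application/json" \
#       -d "$(discord_payload "$message")"
```

Where `jq` is installed, `jq -n --arg content "$message" '{content: $content}'` produces the same payload and handles every character correctly, so it is likely the safer choice.
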
## Tdarr Monitoring Issues

### Script Not Running
**Symptoms**: No monitoring alerts, script execution failures
**Diagnosis**:
```bash
# Check cron job status
crontab -l | grep tdarr-timeout-monitor
systemctl status cron

# Run script manually for debugging
bash -x /path/to/tdarr-timeout-monitor.sh

# Check script permissions
ls -la /path/to/tdarr-timeout-monitor.sh
```

**Solutions**:
```bash
# Fix script permissions
chmod +x /path/to/tdarr-timeout-monitor.sh

# Reinstall cron job
crontab -e
# Add: */20 * * * * /full/path/to/tdarr-timeout-monitor.sh

# Check script environment
# Ensure PATH and variables are set correctly in script
```

### API Connection Failures
**Symptoms**: Cannot connect to Tdarr server, timeout errors
**Diagnosis**:
```bash
# Test Tdarr API manually
curl -f "http://tdarr-server:8266/api/v2/status"

# Check network connectivity
ping tdarr-server
nc -zv tdarr-server 8266

# Verify SSH access to server
ssh tdarr "docker ps | grep tdarr"
```

**Solutions**:
```bash
# Update server connection in script
# Verify server IP and port are correct

# Test API endpoints
curl "http://10.10.0.43:8265/api/v2/status" # Web port
curl "http://10.10.0.43:8266/api/v2/status" # Server port

# Check Tdarr server logs (2>&1 includes the container's stderr)
ssh tdarr "docker logs --tail 20 tdarr 2>&1"
```

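When the `curl` checks above fail, the exit code usually narrows down the cause faster than the error text. The sketch below maps a few well-known `curl` exit codes to a likely cause; the function name `explain_curl_exit` is hypothetical, and the suggested causes are general guidance, not specific to Tdarr.

```bash
# Hypothetical helper: translate a curl exit code into a likely cause.
explain_curl_exit() {
    case "$1" in
        0)  echo "ok" ;;
        6)  echo "could not resolve host (check DNS or the hostname)" ;;
        7)  echo "failed to connect (wrong port, service down, or firewall)" ;;
        22) echo "HTTP error status returned (only reported with -f)" ;;
        28) echo "operation timed out" ;;
        *)  echo "curl exit code $1 (see EXIT CODES in man curl)" ;;
    esac
}

# Possible usage:
#   curl -sf --max-time 10 "http://tdarr-server:8266/api/v2/status" >/dev/null
#   explain_curl_exit "$?"
```
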
## Windows Desktop Monitoring Issues

### PowerShell Script Not Running
**Symptoms**: No reboot notifications from Windows systems
**Diagnosis**:
```powershell
# Check scheduled task status
Get-ScheduledTask -TaskName "Reboot*" | Get-ScheduledTaskInfo

# Test script execution manually
PowerShell -ExecutionPolicy Bypass -File "C:\path\to\windows-reboot-monitor.ps1"

# Check PowerShell execution policy
Get-ExecutionPolicy
```

**Solutions**:
```powershell
# Set execution policy
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser

# Recreate scheduled tasks
schtasks /Create /XML "C:\path\to\task.xml" /TN "RebootMonitor"

# Check task trigger configuration
(Get-ScheduledTask -TaskName "RebootMonitor").Triggers
```

### Network Access from Windows
**Symptoms**: PowerShell cannot reach Discord webhook
**Diagnosis**:
```powershell
# Test network connectivity
Test-NetConnection discord.com -Port 443

# Test webhook manually
Invoke-RestMethod -Uri $webhookUrl -Method Post -Body '{"content":"test"}' -ContentType "application/json"

# Check Windows firewall
Get-NetFirewallRule | Where-Object {$_.DisplayName -like "*PowerShell*"}
```

**Solutions**:
```powershell
# Allow PowerShell through firewall
New-NetFirewallRule -DisplayName "PowerShell Outbound" -Direction Outbound -Program "C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe" -Action Allow

# Test with simplified request
$body = @{content="Test from Windows"} | ConvertTo-Json
Invoke-RestMethod -Uri $webhookUrl -Method Post -Body $body -ContentType "application/json"
```

## Log Management Issues

### Log Files Growing Too Large
**Symptoms**: Disk space filling up, slow log access
**Diagnosis**:
```bash
# Check log file sizes
du -sh /var/log/homelab-*
du -sh /tmp/*monitor*.log

# Check available disk space
df -h /var/log
df -h /tmp
```

**Solutions**:
```bash
# Implement log rotation (writing under /etc requires root)
sudo tee /etc/logrotate.d/homelab-monitoring > /dev/null << 'EOF'
/var/log/homelab-*.log {
    daily
    missingok
    rotate 7
    compress
    notifempty
    create 644 root root
}
EOF

# Manual log cleanup
find /tmp -name "*monitor*.log" -size +10M -delete
truncate -s 0 /tmp/large-log-file.log
```

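The manual cleanup above can be made conditional, so a log is only emptied once it passes a size limit. This is a minimal sketch; the function name `log_exceeds` and the 10 MiB limit in the usage comment are assumptions.

```bash
# Hypothetical helper: returns 0 when the file exists and is larger than
# the given number of bytes.
log_exceeds() {
    local file="$1" limit_bytes="$2" size
    [ -f "$file" ] || return 1
    # Arithmetic expansion strips any padding that wc adds on some systems.
    size=$(( $(wc -c < "$file") ))
    [ "$size" -gt "$limit_bytes" ]
}

# Possible usage: empty the log in place once it passes 10 MiB.
#   log_exceeds /tmp/large-log-file.log $((10 * 1024 * 1024)) &&
#       truncate -s 0 /tmp/large-log-file.log
```

Truncating is often preferable to deleting while a script still has the log open, because a deleted file keeps occupying disk space until the writing process closes it.
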
### Log Rotation Not Working
**Symptoms**: Old logs not being cleaned up
**Diagnosis**:
```bash
# Check logrotate status
systemctl status logrotate
cat /var/lib/logrotate/status

# Test logrotate configuration (-d is a dry run; nothing is rotated)
logrotate -d /etc/logrotate.d/homelab-monitoring
```

**Solutions**:
```bash
# Force log rotation
sudo logrotate -f /etc/logrotate.d/homelab-monitoring

# Fix logrotate configuration
sudo nano /etc/logrotate.d/homelab-monitoring
# Verify syntax and permissions
```

## Cron Job Issues

### Scheduled Tasks Not Running
**Symptoms**: Scripts not executing at scheduled times
**Diagnosis**:
```bash
# Check cron service
systemctl status cron
systemctl status crond # RHEL/CentOS

# View cron logs
grep CRON /var/log/syslog
journalctl -u cron

# List all cron jobs
crontab -l
sudo crontab -l # root's crontab
```

**Solutions**:
```bash
# Restart cron service
sudo systemctl restart cron

# Fix cron job syntax
# Ensure absolute paths are used
# Example: */20 * * * * /full/path/to/script.sh

# Check script permissions and execution
ls -la /path/to/script.sh
/path/to/script.sh # Test manual execution
```

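The absolute-path rule above is easy to check mechanically. The sketch below tests a single crontab line; the function name `cron_command_is_absolute` is hypothetical, and it only understands the standard five-field schedule, so `@reboot` style entries and variable assignments such as `PATH=...` are reported as failures.

```bash
# Hypothetical helper: returns 0 when a crontab line has five schedule
# fields followed by a command that starts with an absolute path.
cron_command_is_absolute() {
    local line="$1" f1 f2 f3 f4 f5 cmd
    # The first five words are the schedule; the remainder is the command.
    read -r f1 f2 f3 f4 f5 cmd <<< "$line"
    [[ -n "$cmd" && "$cmd" == /* ]]
}

# Possible usage: list entries that rely on a relative command.
#   crontab -l | grep -v '^#' | while IFS= read -r entry; do
#       cron_command_is_absolute "$entry" || echo "Check: $entry"
#   done
```
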
### Environment Variables in Cron
**Symptoms**: Scripts work manually but fail in cron
**Diagnosis**:
```bash
# Create test cron job to check environment (remove it once the file exists)
* * * * * env > /tmp/cron-env.txt

# Compare with shell environment (sorted so the diff is meaningful)
env | sort > /tmp/shell-env.txt
diff /tmp/shell-env.txt <(sort /tmp/cron-env.txt)
```

**Solutions**:
```bash
# Set PATH in crontab
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

# Or set PATH in script
#!/bin/bash
export PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"

# Source environment if needed (set -a exports the variables it defines)
set -a; source /etc/environment; set +a
```

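A script that fails only under cron is often missing a command from its reduced `PATH`. The sketch below checks for required commands up front, so the failure is reported clearly instead of appearing halfway through a run. The function name `missing_commands` is hypothetical, and the command list in the usage comment is only an example.

```bash
# Hypothetical helper: prints each command that cannot be found on PATH
# and returns 1 if any are missing, 0 otherwise.
missing_commands() {
    local cmd missing=()
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || missing+=("$cmd")
    done
    if [ "${#missing[@]}" -eq 0 ]; then
        return 0
    fi
    printf '%s\n' "${missing[@]}"
    return 1
}

# Possible usage near the top of a monitoring script:
#   if ! missing=$(missing_commands curl jq ssh); then
#       echo "Missing under current PATH: $missing" >&2
#       exit 1
#   fi
```
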
## Network Monitoring Issues

### False Positives
**Symptoms**: Alerts for services that are actually working
**Diagnosis**:
```bash
# Test monitoring checks manually
curl -sSf --max-time 10 "https://service.homelab.local"
ping -c1 -W5 10.10.0.100

# Check for intermittent network issues
for i in {1..10}; do ping -c1 host || echo "Fail $i"; done
```

**Solutions**:
```bash
# Adjust timeout values
curl --max-time 30 "$service" # Increase timeout

# Add retry logic
for retry in {1..3}; do
    if curl -sSf "$service" >/dev/null 2>&1; then
        break
    elif [ $retry -eq 3 ]; then
        send_alert "Service $service failed after 3 retries"
    fi
    sleep 5
done
```

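The retry loop above can be factored into a reusable function so every check shares the same behavior. This is a minimal sketch; the function name `retry` and its argument order (attempts, delay in seconds, then the command) are assumptions.

```bash
# Hypothetical helper: run a command up to N times, pausing between attempts.
# Usage: retry <attempts> <delay_seconds> <command> [args...]
retry() {
    local attempts="$1" delay="$2" n
    shift 2
    for (( n = 1; n <= attempts; n++ )); do
        if "$@"; then
            return 0
        fi
        # Sleep only between attempts, not after the final failure.
        if (( n < attempts )); then
            sleep "$delay"
        fi
    done
    return 1
}

# Possible usage with the send_alert function used elsewhere in this guide:
#   retry 3 5 curl -sSf --max-time 10 -o /dev/null "$service" ||
#       send_alert "Service $service failed after 3 retries"
```

Unlike the inline loop, this version does not sleep after the last failed attempt, so the alert goes out without the extra delay.
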
### Missing Alerts
**Symptoms**: Real failures not triggering notifications
**Diagnosis**:
```bash
# Verify monitoring script logic
bash -x monitoring-script.sh

# Check if services are actually down
systemctl status service-name
curl -v service-url
```

**Solutions**:
```bash
# Lower detection thresholds
# Increase monitoring frequency
# Add redundant monitoring methods

# Test alert mechanism (send_alert as defined under Backup Alerting Channels)
send_alert "Test alert"
```

## System Resource Issues

### Monitoring Overhead
**Symptoms**: High CPU/memory usage from monitoring scripts
**Diagnosis**:
```bash
# Monitor the monitoring scripts (top -p expects a comma-separated PID list)
top -p "$(pgrep -d, -f monitor)"
ps aux | grep '[m]onitor' # Bracket keeps grep from matching itself

# Check monitoring frequency
crontab -l | grep monitor
```

**Solutions**:
```bash
# Reduce monitoring frequency
# Change from */1 to */5 minutes

# Optimize scripts
# Remove unnecessary commands
# Use efficient tools (prefer curl over wget, etc.)

# Add resource limits
timeout 30 monitoring-script.sh
```

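One possible source of overhead is a slow run that is still active when cron starts the next one, so copies pile up. Whether that applies here is an assumption, but a simple lock rules it out. The sketch below uses `mkdir`, which is atomic; the function names and the lock path in the usage comment are hypothetical.

```bash
# Hypothetical helper: take a lock by creating a directory.
# mkdir fails if the directory already exists, so only one caller wins.
acquire_lock() {
    mkdir "$1" 2>/dev/null
}

# Hypothetical helper: release the lock taken by acquire_lock.
release_lock() {
    rmdir "$1" 2>/dev/null
}

# Possible usage at the top of a monitoring script:
#   LOCK_DIR=/tmp/tdarr-monitor.lock
#   acquire_lock "$LOCK_DIR" || exit 0      # previous run still active
#   trap 'release_lock "$LOCK_DIR"' EXIT    # always release on exit
```

If a run is killed without its exit trap firing, the lock directory stays behind and must be removed by hand, which is the main trade-off of this approach.
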
## Emergency Recovery

### Complete Monitoring Failure
**Recovery Steps**:
```bash
# Restart all monitoring services
sudo systemctl restart cron
sudo systemctl restart rsyslog

# Reinstall monitoring scripts
cd /path/to/scripts
./install-monitoring.sh

# Test all components
./test-monitoring.sh
```

### Discord Integration Lost
**Quick Recovery**:
```bash
# Test webhook
curl -X POST "$BACKUP_WEBHOOK_URL" -H "Content-Type: application/json" -d '{"content": "Monitoring restored"}'

# Switch to backup webhook if needed
export DISCORD_WEBHOOK_URL="$BACKUP_WEBHOOK_URL"
```

## Prevention and Best Practices

### Monitoring Health Checks
```bash
#!/bin/bash
# monitor-the-monitors.sh
# Assumes send_alert (see Backup Alerting Channels) is defined or sourced.
MONITORING_SCRIPTS="/path/to/tdarr-monitor.sh /path/to/network-monitor.sh"

for script in $MONITORING_SCRIPTS; do
    if [ ! -x "$script" ]; then
        send_alert "ALERT: $script not executable"
    fi

    # Check if script has run recently (a missing marker counts as never run)
    last_run=$(stat -c %Y "$script.last_run" 2>/dev/null || echo 0)
    if [ $(( $(date +%s) - last_run )) -gt 3600 ]; then
        send_alert "ALERT: $script hasn't run in over an hour"
    fi
done
```

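The health check above reads a `.last_run` marker next to each script, but something has to write that marker. The sketch below shows one way a monitored script could do so, together with a helper that reports the marker's age. Both function names are hypothetical, and `stat -c %Y` assumes GNU coreutils, as in the health check itself.

```bash
# Hypothetical helper: record that the given script has just finished a run.
mark_last_run() {
    touch "$1.last_run"
}

# Hypothetical helper: print the seconds since the script's last recorded run.
# A missing marker is treated as "never run" (modification time 0).
seconds_since_last_run() {
    local marker="$1.last_run" now mtime
    now=$(date +%s)
    mtime=$(stat -c %Y "$marker" 2>/dev/null || echo 0)
    echo $(( now - mtime ))
}

# Possible usage as the last line of each monitoring script:
#   mark_last_run "/path/to/tdarr-monitor.sh"
```

The path passed to `mark_last_run` needs to match the entry in `MONITORING_SCRIPTS` exactly, which is why the usage comment spells it out instead of relying on `$0`.
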
### Backup Alerting Channels
```bash
# Multiple notification methods
# Note: $message is inserted as-is, so it must not contain quotes or newlines.
send_alert() {
    local message="$1"

    # Primary: Discord (-f makes curl fail on HTTP errors so the fallback runs)
    curl -sSf -X POST "$DISCORD_WEBHOOK_URL" \
        -H "Content-Type: application/json" \
        -d "{\"content\":\"$message\"}" ||
    # Backup: Email
    echo "$message" | mail -s "Homelab Alert" admin@domain.com ||
    # Last resort: Local log
    echo "$(date): $message" >> /var/log/critical-alerts.log
}
```

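The fallback order can also be expressed as a list of channel functions tried in turn, which makes it possible to test the ordering with stubs before relying on it. This is a sketch of that idea, not a replacement for `send_alert`; the name `send_with_fallback` and the channel function names in the usage comment are hypothetical.

```bash
# Hypothetical helper: try each channel function in order until one succeeds.
# Usage: send_with_fallback <message> <channel_fn> [<channel_fn>...]
send_with_fallback() {
    local message="$1" channel
    shift
    for channel in "$@"; do
        if "$channel" "$message"; then
            return 0    # delivered; skip the remaining channels
        fi
    done
    return 1            # every channel failed
}

# Possible usage, assuming each channel is a function that takes the message
# as its first argument and returns non-zero on failure:
#   send_with_fallback "Disk almost full" notify_discord notify_email notify_log
```
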
This troubleshooting guide covers the most common monitoring system issues and provides systematic recovery procedures.