claude-home/monitoring/troubleshooting.md
Cal Corum 4b7eca8a46
2026-03-12 09:00:44 -05:00


---
title: "Monitoring Troubleshooting Guide"
description: "Troubleshooting procedures for Discord webhook failures, Tdarr monitoring issues, Windows PowerShell script problems, log rotation, cron job failures, network false positives, and monitoring overhead."
type: troubleshooting
domain: monitoring
tags: [discord, webhook, tdarr, windows, powershell, cron, log-rotation, network, alerts]
---
# Monitoring System Troubleshooting Guide
## Discord Notification Issues
### Webhook Not Working
**Symptoms**: No Discord messages received, connection errors
**Diagnosis**:
```bash
# Test webhook manually
curl -X POST "$DISCORD_WEBHOOK_URL" \
-H "Content-Type: application/json" \
-d '{"content": "Test message"}'
# Check webhook URL format (prints the URL if it matches)
echo "$DISCORD_WEBHOOK_URL" | grep -E '^https://discord\.com/api/webhooks/[0-9]+/.+'
```
**Solutions**:
```bash
# Verify webhook URL is correct
# Format: https://discord.com/api/webhooks/ID/TOKEN
# Test with minimal payload
curl -X POST "$DISCORD_WEBHOOK_URL" \
-H "Content-Type: application/json" \
-d '{"content": "✅ Webhook working"}'
# Check for JSON formatting issues
echo '{"content": "test"}' | jq . # Validate JSON
```
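When the manual test fails, the HTTP status code usually narrows down the cause. The sketch below is a hypothetical set of helpers (the names `webhook_url_ok`, `explain_webhook_status`, and `webhook_status` are illustrative and not part of any existing script). It assumes Discord answers a successful webhook post with 204 and uses 401/404 for bad or deleted webhooks and 429 for rate limiting; treat the mapping as a starting point rather than a complete list.

```bash
# Hypothetical helpers; names and messages are illustrative.

# Return 0 if the URL has the expected Discord webhook shape.
webhook_url_ok() {
    printf '%s\n' "$1" |
        grep -Eq '^https://discord\.com/api/webhooks/[0-9]+/[^/[:space:]]+$'
}

# Map an HTTP status code from the webhook endpoint to a likely cause.
explain_webhook_status() {
    case "$1" in
        200|204) echo "ok" ;;
        400)     echo "bad request: check the JSON payload" ;;
        401|404) echo "webhook URL or token is invalid or was deleted" ;;
        429)     echo "rate limited: slow down and retry later" ;;
        5??)     echo "Discord server error: retry later" ;;
        *)       echo "unexpected status $1" ;;
    esac
}

# Post a test message and print only the HTTP status code.
webhook_status() {
    curl -s -o /dev/null -w '%{http_code}' --max-time 10 \
        -X POST "$1" \
        -H "Content-Type: application/json" \
        -d '{"content": "Test message"}'
}
```

A possible way to combine them: `explain_webhook_status "$(webhook_status "$DISCORD_WEBHOOK_URL")"`.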
### Message Formatting Problems
**Symptoms**: Malformed messages, broken markdown, missing user pings
**Common Issues**:
```bash
# ❌ Broken JSON escaping
{"content": "Error: "quotes" break JSON"}
# ✅ Proper JSON escaping
{"content": "Error: \"quotes\" properly escaped"}
# ❌ User ping inside code block (doesn't work)
{"content": "```\nIssue occurred <@user_id>\n```"}
# ✅ User ping outside code block
{"content": "```\nIssue occurred\n```\nManual intervention needed <@user_id>"}
```
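Escaping by hand is error-prone once messages contain command output. A minimal sketch of a shell-only escaper follows; `json_escape` and `build_payload` are hypothetical names, and the helper only handles backslashes, double quotes, and newlines (tabs and other control characters are left as-is). Where `jq` is available, building the payload with `jq` is likely the more robust choice.

```bash
# Hypothetical helpers; they cover backslash, double quote, and newline only.

# Escape a string for use inside a JSON string literal.
json_escape() {
    printf '%s' "$1" |
        sed -e 's/\\/\\\\/g' -e 's/"/\\"/g' |
        awk 'NR > 1 { printf "\\n" } { printf "%s", $0 }'
}

# Wrap a message in a Discord-style {"content": ...} payload.
build_payload() {
    printf '{"content": "%s"}\n' "$(json_escape "$1")"
}
```

For example, `build_payload 'Error: "quotes" survive'` produces a payload with the inner quotes backslash-escaped, which can then be passed to `curl -d`.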
## Tdarr Monitoring Issues
### Script Not Running
**Symptoms**: No monitoring alerts, script execution failures
**Diagnosis**:
```bash
# Check cron job status
crontab -l | grep tdarr-timeout-monitor
systemctl status cron
# Run script manually for debugging
bash -x /path/to/tdarr-timeout-monitor.sh
# Check script permissions
ls -la /path/to/tdarr-timeout-monitor.sh
```
**Solutions**:
```bash
# Fix script permissions
chmod +x /path/to/tdarr-timeout-monitor.sh
# Reinstall cron job
crontab -e
# Add: */20 * * * * /full/path/to/tdarr-timeout-monitor.sh
# Check script environment
# Ensure PATH and variables are set correctly in script
```
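The permission and path checks above can be bundled into one preflight step. This is a sketch under the assumption that the script is a regular file with a shebang line; `preflight_script` is a hypothetical helper name.

```bash
# Hypothetical helper: print one line per problem, nothing if the script looks runnable.
preflight_script() {
    local script="$1"
    if [ ! -f "$script" ]; then
        echo "missing: $script"
        return 1
    fi
    if [ ! -x "$script" ]; then
        echo "not executable: $script (fix with chmod +x)"
    fi
    case "$script" in
        /*) ;;  # absolute path, fine for cron
        *)  echo "relative path: cron needs an absolute path" ;;
    esac
    # The first line should be a shebang such as #!/bin/bash
    if ! head -n 1 "$script" | grep -q '^#!'; then
        echo "no shebang on first line: $script"
    fi
    return 0
}
```

Running `preflight_script /path/to/tdarr-timeout-monitor.sh` before editing the crontab gives a quick list of the usual culprits.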
### API Connection Failures
**Symptoms**: Cannot connect to Tdarr server, timeout errors
**Diagnosis**:
```bash
# Test Tdarr API manually
curl -f "http://tdarr-server:8266/api/v2/status"
# Check network connectivity
ping tdarr-server
nc -zv tdarr-server 8266
# Verify SSH access to server
ssh tdarr "docker ps | grep tdarr"
```
**Solutions**:
```bash
# Update server connection in script
# Verify server IP and port are correct
# Test API endpoints
curl "http://10.10.0.43:8265/api/v2/status" # Web port
curl "http://10.10.0.43:8266/api/v2/status" # Server port
# Check Tdarr server logs (--tail also limits stderr, which a pipe to tail would miss)
ssh tdarr "docker logs --tail 20 tdarr"
```
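curl's exit code distinguishes DNS problems, refused connections, HTTP errors, and timeouts, which helps decide which of the checks above to run next. The following sketch maps the common codes to hints; `explain_curl_exit` and `check_api` are hypothetical names, and the hint texts are suggestions rather than definitive diagnoses.

```bash
# Hypothetical helpers; the exit codes are curl's documented ones.

# Translate a curl exit code into a troubleshooting hint.
explain_curl_exit() {
    case "$1" in
        0)  echo "ok" ;;
        6)  echo "could not resolve host: check DNS or the hostname" ;;
        7)  echo "failed to connect: check the port and that the container is running" ;;
        22) echo "HTTP error status: server reachable but the request was rejected" ;;
        28) echo "timed out: check firewall rules or server load" ;;
        *)  echo "curl exit code $1: see EXIT CODES in man curl" ;;
    esac
}

# Probe a URL and print the hint for the resulting exit code.
check_api() {
    local rc=0
    curl -sf --max-time 10 -o /dev/null "$1" || rc=$?
    explain_curl_exit "$rc"
}
```

For instance, `check_api "http://tdarr-server:8266/api/v2/status"` prints `ok` or a hint pointing at DNS, the port, or a timeout.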
## Windows Desktop Monitoring Issues
### PowerShell Script Not Running
**Symptoms**: No reboot notifications from Windows systems
**Diagnosis**:
```powershell
# Check scheduled task status
Get-ScheduledTask -TaskName "Reboot*" | Get-ScheduledTaskInfo
# Test script execution manually
PowerShell -ExecutionPolicy Bypass -File "C:\path\to\windows-reboot-monitor.ps1"
# Check PowerShell execution policy
Get-ExecutionPolicy
```
**Solutions**:
```powershell
# Set execution policy
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser
# Recreate scheduled tasks
schtasks /Create /XML "C:\path\to\task.xml" /TN "RebootMonitor"
# Check task trigger configuration
(Get-ScheduledTask -TaskName "RebootMonitor").Triggers
```
### Network Access from Windows
**Symptoms**: PowerShell cannot reach Discord webhook
**Diagnosis**:
```powershell
# Test network connectivity
Test-NetConnection discord.com -Port 443
# Test webhook manually
Invoke-RestMethod -Uri $webhookUrl -Method Post -Body '{"content":"test"}' -ContentType "application/json"
# Check Windows firewall
Get-NetFirewallRule | Where-Object {$_.DisplayName -like "*PowerShell*"}
```
**Solutions**:
```powershell
# Allow PowerShell through firewall
New-NetFirewallRule -DisplayName "PowerShell Outbound" -Direction Outbound -Program "C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe" -Action Allow
# Test with simplified request
$body = @{content="Test from Windows"} | ConvertTo-Json
Invoke-RestMethod -Uri $webhookUrl -Method Post -Body $body -ContentType "application/json"
```
## Log Management Issues
### Log Files Growing Too Large
**Symptoms**: Disk space filling up, slow log access
**Diagnosis**:
```bash
# Check log file sizes
du -sh /var/log/homelab-*
du -sh /tmp/*monitor*.log
# Check available disk space
df -h /var/log
df -h /tmp
```
**Solutions**:
```bash
# Implement log rotation (writing to /etc/logrotate.d requires root)
sudo tee /etc/logrotate.d/homelab-monitoring > /dev/null << 'EOF'
/var/log/homelab-*.log {
daily
missingok
rotate 7
compress
notifempty
create 644 root root
}
EOF
# Manual log cleanup
find /tmp -name "*monitor*.log" -size +10M -delete
truncate -s 0 /tmp/large-log-file.log
```
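Because `find ... -delete` is irreversible, it can help to list the matching files first. A small sketch, assuming a `find` that accepts the `k` size suffix (GNU and BSD find both do); `find_large_logs` is a hypothetical helper name.

```bash
# Hypothetical helper: list files under a directory that match a name
# pattern and are larger than a given size in KiB.
find_large_logs() {
    local dir="$1" pattern="$2" min_kib="$3"
    find "$dir" -type f -name "$pattern" -size +"${min_kib}k" 2>/dev/null
}
```

For example, `find_large_logs /tmp '*monitor*.log' 10240` lists the monitor logs over 10 MiB so they can be reviewed before deleting or truncating them.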
### Log Rotation Not Working
**Symptoms**: Old logs not being cleaned up
**Diagnosis**:
```bash
# Check logrotate status (on systemd distros it is triggered by logrotate.timer)
systemctl status logrotate.timer
systemctl status logrotate
cat /var/lib/logrotate/status
# Test logrotate configuration
logrotate -d /etc/logrotate.d/homelab-monitoring
```
**Solutions**:
```bash
# Force log rotation
logrotate -f /etc/logrotate.d/homelab-monitoring
# Fix logrotate configuration
sudo nano /etc/logrotate.d/homelab-monitoring
# Verify syntax and permissions
```
## Cron Job Issues
### Scheduled Tasks Not Running
**Symptoms**: Scripts not executing at scheduled times
**Diagnosis**:
```bash
# Check cron service
systemctl status cron
systemctl status crond # RHEL/CentOS
# View cron logs
grep CRON /var/log/syslog
journalctl -u cron
# List all cron jobs
crontab -l
sudo crontab -l # root's crontab (system-wide entries live in /etc/crontab and /etc/cron.d/)
```
**Solutions**:
```bash
# Restart cron service
sudo systemctl restart cron
# Fix cron job syntax
# Ensure absolute paths are used
# Example: */20 * * * * /full/path/to/script.sh
# Check script permissions and execution
ls -la /path/to/script.sh
/path/to/script.sh # Test manual execution
```
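Cron discards or mails a job's output by default, so a failing script can look like one that never ran. One option is a thin wrapper that records start time, output, and exit status. This is a sketch; `run_logged` and the log path in the usage note are hypothetical.

```bash
# Hypothetical helper: run a command, appending its output and exit
# status to a log file, and return the command's exit status.
run_logged() {
    local log="$1" status=0
    shift
    {
        echo "$(date '+%Y-%m-%d %H:%M:%S') START $*"
        "$@" || status=$?
        echo "$(date '+%Y-%m-%d %H:%M:%S') EXIT $status"
    } >> "$log" 2>&1
    return "$status"
}
```

Called from a wrapper script as `run_logged /var/log/homelab-cron.log /full/path/to/script.sh`, every scheduled run leaves a START and EXIT line, which makes missed runs and non-zero exits easy to spot.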
### Environment Variables in Cron
**Symptoms**: Scripts work manually but fail in cron
**Diagnosis**:
```bash
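# Add a temporary crontab entry (crontab -e) to capture cron's environment;
# remove it once /tmp/cron-env.txt exists
* * * * * env > /tmp/cron-env.txt
# Compare with shell environment (sorted so the diff lines up)
env | sort > /tmp/shell-env.txt
sort /tmp/cron-env.txt | diff /tmp/shell-env.txt -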
```
**Solutions**:
```bash
# Set PATH in crontab
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
# Or set PATH in script
#!/bin/bash
export PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
# Source environment if needed (set -a exports the variables it defines)
set -a; source /etc/environment; set +a
```
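To reproduce a cron-only failure without waiting for the next scheduled run, a script can be started with a stripped-down environment. The sketch below assumes cron's environment is roughly `HOME`, `LOGNAME`, `SHELL=/bin/sh`, and `PATH=/usr/bin:/bin`, which is common but varies by distribution; `run_like_cron` is a hypothetical helper name.

```bash
# Hypothetical helper: run a command with a minimal, cron-like environment.
# The variable set below is an assumption about typical cron defaults.
run_like_cron() {
    env -i \
        HOME="${HOME:-/}" \
        LOGNAME="${LOGNAME:-$(id -un 2>/dev/null)}" \
        SHELL=/bin/sh \
        PATH=/usr/bin:/bin \
        "$@"
}
```

Running `run_like_cron /path/to/script.sh` then tends to surface the same "command not found" or missing-variable errors that appear under cron.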
## Network Monitoring Issues
### False Positives
**Symptoms**: Alerts for services that are actually working
**Diagnosis**:
```bash
# Test monitoring checks manually
curl -sSf --max-time 10 "https://service.homelab.local"
ping -c1 -W5 10.10.0.100
# Check for intermittent network issues
for i in {1..10}; do ping -c1 host || echo "Fail $i"; done
```
**Solutions**:
```bash
# Adjust timeout values
curl --max-time 30 "$service" # Increase timeout
# Add retry logic
for retry in {1..3}; do
    if curl -sSf --max-time 30 "$service" >/dev/null 2>&1; then
        break
    elif [ "$retry" -eq 3 ]; then
        send_alert "Service $service failed after 3 attempts"
        break
    fi
    sleep 5
done
```
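The retry pattern can also be factored into a reusable function so every check uses the same logic. A minimal sketch, with `retry` as a hypothetical helper name and the attempt count and delay passed as arguments:

```bash
# Hypothetical helper: run a command up to max_attempts times,
# sleeping delay seconds between attempts. Returns 0 on the first success.
retry() {
    local max_attempts="$1" delay="$2" attempt=1
    shift 2
    while [ "$attempt" -le "$max_attempts" ]; do
        if "$@"; then
            return 0
        fi
        if [ "$attempt" -lt "$max_attempts" ]; then
            sleep "$delay"
        fi
        attempt=$((attempt + 1))
    done
    return 1
}
```

A check could then read `retry 3 5 curl -sSf --max-time 30 -o /dev/null "$service" || send_alert "Service $service failed after 3 attempts"`.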
### Missing Alerts
**Symptoms**: Real failures not triggering notifications
**Diagnosis**:
```bash
# Verify monitoring script logic
bash -x monitoring-script.sh
# Check if services are actually down
systemctl status service-name
curl -v service-url
```
**Solutions**:
```bash
# Lower detection thresholds
# Increase monitoring frequency
# Add redundant monitoring methods
# Test alert mechanism
send_alert "Test alert"
```
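Making the detection threshold explicit helps when tuning it. One possible approach is to count consecutive failures per service in a small state file and alert once the count reaches a chosen limit. In this sketch `record_check` and the state-file path in the usage note are hypothetical.

```bash
# Hypothetical helper: update a per-service failure counter and print
# the number of consecutive failures (0 after a successful check).
record_check() {
    local state_file="$1" result="$2" count=0
    if [ "$result" = "ok" ]; then
        rm -f "$state_file"
    else
        if [ -f "$state_file" ]; then
            count=$(cat "$state_file")
        fi
        count=$((count + 1))
        echo "$count" > "$state_file"
    fi
    echo "$count"
}
```

A monitoring script might call `failures=$(record_check /tmp/tdarr.failcount fail)` after a failed check and send an alert when `$failures` reaches the threshold; lowering the threshold to 1 alerts on the first failure.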
## System Resource Issues
### Monitoring Overhead
**Symptoms**: High CPU/memory usage from monitoring scripts
**Diagnosis**:
```bash
# Monitor the monitoring scripts (top -p expects a comma-separated PID list)
top -p "$(pgrep -d, -f monitor)"
ps aux | grep '[m]onitor'
# Check monitoring frequency
crontab -l | grep monitor
```
**Solutions**:
```bash
# Reduce monitoring frequency
# Change from */1 to */5 minutes
# Optimize scripts
# Remove unnecessary commands
# Use efficient tools (prefer curl over wget, etc.)
# Add resource limits
timeout 30 monitoring-script.sh
```
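When a script is wrapped in `timeout`, a silent kill can hide the fact that it keeps overrunning. The sketch below relies on GNU `timeout` returning exit status 124 when the limit is hit; `run_limited` is a hypothetical helper name.

```bash
# Hypothetical helper: run a command under a time limit (seconds) and
# report on stderr when the limit was hit. Returns the command's status,
# or 124 if GNU timeout had to stop it.
run_limited() {
    local limit="$1" status=0
    shift
    timeout "$limit" "$@" || status=$?
    if [ "$status" -eq 124 ]; then
        echo "timed out after ${limit}s: $*" >&2
    fi
    return "$status"
}
```

For example, `run_limited 30 monitoring-script.sh` behaves like the plain `timeout` call but leaves a message that can be logged or alerted on.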
## Emergency Recovery
### Complete Monitoring Failure
**Recovery Steps**:
```bash
# Restart all monitoring services
sudo systemctl restart cron
sudo systemctl restart rsyslog
# Reinstall monitoring scripts
cd /path/to/scripts
./install-monitoring.sh
# Test all components
./test-monitoring.sh
```
### Discord Integration Lost
**Quick Recovery**:
```bash
# Test backup webhook
curl -X POST "$BACKUP_WEBHOOK_URL" -H "Content-Type: application/json" -d '{"content": "Monitoring restored"}'
# Switch to backup webhook if needed
export DISCORD_WEBHOOK_URL="$BACKUP_WEBHOOK_URL"
```
## Prevention and Best Practices
### Monitoring Health Checks
```bash
#!/bin/bash
# monitor-the-monitors.sh
MONITORING_SCRIPTS="/path/to/tdarr-monitor.sh /path/to/network-monitor.sh"
for script in $MONITORING_SCRIPTS; do
    if [ ! -x "$script" ]; then
        send_alert "ALERT: $script not executable"
    fi
    # Check if script has run recently (relies on each script updating "$script.last_run")
    last_run=$(stat -c %Y "$script.last_run" 2>/dev/null || echo 0)
    if [ $(( $(date +%s) - last_run )) -gt 3600 ]; then
        send_alert "ALERT: $script hasn't run in over an hour"
    fi
done
```
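The health check above only works if each monitoring script updates its `.last_run` marker. A minimal sketch of both sides follows, assuming GNU `stat` (for `-c %Y`); `mark_run` and `is_stale` are hypothetical helper names.

```bash
# Hypothetical helpers for the .last_run marker convention.

# Call at the end of a monitoring script to record a completed run.
mark_run() {
    touch "$1.last_run"
}

# Return 0 if the marker is missing or older than max_age seconds.
is_stale() {
    local script="$1" max_age="$2" now last
    now=$(date +%s)
    last=$(stat -c %Y "$script.last_run" 2>/dev/null || echo 0)
    [ $((now - last)) -gt "$max_age" ]
}
```

A monitoring script would end with `mark_run "$0"` (given that it is invoked by its full path), and the health check can use `is_stale "$script" 3600` as its condition.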
### Backup Alerting Channels
```bash
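# Multiple notification methods
send_alert() {
    local message="$1"
    local payload
    # Build the payload with jq so quotes and newlines in the message are escaped
    payload=$(jq -n --arg content "$message" '{content: $content}')
    # Primary: Discord (-f makes curl fail on HTTP errors so the fallbacks run)
    curl -sSf -X POST "$DISCORD_WEBHOOK_URL" \
        -H "Content-Type: application/json" \
        -d "$payload" ||
    # Backup: Email
    echo "$message" | mail -s "Homelab Alert" admin@domain.com ||
    # Last resort: Local log
    echo "$(date): $message" >> /var/log/critical-alerts.log
}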
```
This troubleshooting guide covers the most common monitoring system issues and provides systematic recovery procedures.