claude-home/monitoring/troubleshooting.md
Cal Corum 10c9e0d854 CLAUDE: Migrate to technology-first documentation architecture
2025-08-12 23:20:15 -05:00


# Monitoring System Troubleshooting Guide
## Discord Notification Issues
### Webhook Not Working
**Symptoms**: No Discord messages received, connection errors
**Diagnosis**:
```bash
# Test webhook manually
curl -X POST "$DISCORD_WEBHOOK_URL" \
    -H "Content-Type: application/json" \
    -d '{"content": "Test message"}'
# Check webhook URL format (prints the URL only if it matches)
echo "$DISCORD_WEBHOOK_URL" | grep -E '^https://discord\.com/api/webhooks/[0-9]+/.+'
```
**Solutions**:
```bash
# Verify webhook URL is correct
# Format: https://discord.com/api/webhooks/ID/TOKEN
# Test with minimal payload
curl -X POST "$DISCORD_WEBHOOK_URL" \
    -H "Content-Type: application/json" \
    -d '{"content": "✅ Webhook working"}'
# Check for JSON formatting issues
echo '{"content": "test"}' | jq . # Validate JSON
```
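The format check can also be wrapped in a small function so that a script refuses to start with a malformed URL. This is a minimal sketch, not part of the existing monitoring scripts: `is_valid_webhook_url` is a hypothetical name, and the pattern only checks the `https://discord.com/api/webhooks/ID/TOKEN` shape. It does not confirm that Discord accepts the webhook.

```bash
# Hypothetical helper: succeeds when the argument has the expected webhook URL shape.
# Checks shape only (https, discord.com host, numeric ID, non-empty token).
is_valid_webhook_url() {
    printf '%s\n' "$1" |
        grep -Eq '^https://discord\.com/api/webhooks/[0-9]+/[^/[:space:]]+$'
}
```

A script could then run `is_valid_webhook_url "$DISCORD_WEBHOOK_URL" || exit 1` before sending anything.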
### Message Formatting Problems
**Symptoms**: Malformed messages, broken markdown, missing user pings
**Common Issues**:
```bash
# ❌ Broken JSON escaping
{"content": "Error: "quotes" break JSON"}
# ✅ Proper JSON escaping
{"content": "Error: \"quotes\" properly escaped"}
# ❌ User ping inside code block (doesn't work)
{"content": "```\nIssue occurred <@user_id>\n```"}
# ✅ User ping outside code block
{"content": "```\nIssue occurred\n```\nManual intervention needed <@user_id>"}
```
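Escaping problems usually appear when a message is interpolated straight into a JSON string. The sketch below shows one way to escape a message before building the payload. `json_escape` and `build_payload` are hypothetical names, and the escaping covers only backslashes, double quotes, and newlines. Where `jq` is installed, `jq -n --arg content "$message" '{content: $content}'` is the more robust choice because it handles every control character.

```bash
# Hypothetical helper: escape backslashes, double quotes, and newlines for a JSON string.
# Other control characters (tabs, etc.) are NOT handled in this sketch.
json_escape() {
    printf '%s\n' "$1" |
        sed -e 's/\\/\\\\/g' -e 's/"/\\"/g' |
        {
            first=1
            while IFS= read -r line; do
                # Join lines with a literal backslash-n sequence
                [ "$first" -eq 1 ] || printf '\\n'
                printf '%s' "$line"
                first=0
            done
        }
}

# Hypothetical helper: wrap an escaped message in a Discord "content" payload.
build_payload() {
    printf '{"content": "%s"}\n' "$(json_escape "$1")"
}
```

The output of `build_payload` can be passed to `curl` with `-d`, and piping it through `jq .` is a quick way to confirm it parses.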
## Tdarr Monitoring Issues
### Script Not Running
**Symptoms**: No monitoring alerts, script execution failures
**Diagnosis**:
```bash
# Check cron job status
crontab -l | grep tdarr-timeout-monitor
systemctl status cron
# Run script manually for debugging
bash -x /path/to/tdarr-timeout-monitor.sh
# Check script permissions
ls -la /path/to/tdarr-timeout-monitor.sh
```
**Solutions**:
```bash
# Fix script permissions
chmod +x /path/to/tdarr-timeout-monitor.sh
# Reinstall cron job
crontab -e
# Add: */20 * * * * /full/path/to/tdarr-timeout-monitor.sh
# Check script environment
# Ensure PATH and variables are set correctly in script
```
### API Connection Failures
**Symptoms**: Cannot connect to Tdarr server, timeout errors
**Diagnosis**:
```bash
# Test Tdarr API manually
curl -f "http://tdarr-server:8266/api/v2/status"
# Check network connectivity
ping tdarr-server
nc -zv tdarr-server 8266
# Verify SSH access to server
ssh tdarr "docker ps | grep tdarr"
```
**Solutions**:
```bash
# Update server connection in script
# Verify server IP and port are correct
# Test API endpoints
curl "http://10.10.0.43:8265/api/v2/status" # Web port
curl "http://10.10.0.43:8266/api/v2/status" # Server port
# Check Tdarr server logs (docker logs writes to both stdout and stderr, so limit lines with --tail)
ssh tdarr "docker logs --tail 20 tdarr 2>&1"
```
## Windows Desktop Monitoring Issues
### PowerShell Script Not Running
**Symptoms**: No reboot notifications from Windows systems
**Diagnosis**:
```powershell
# Check scheduled task status
Get-ScheduledTask -TaskName "Reboot*" | Get-ScheduledTaskInfo
# Test script execution manually
PowerShell -ExecutionPolicy Bypass -File "C:\path\to\windows-reboot-monitor.ps1"
# Check PowerShell execution policy
Get-ExecutionPolicy
```
**Solutions**:
```powershell
# Set execution policy
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser
# Recreate scheduled tasks
schtasks /Create /XML "C:\path\to\task.xml" /TN "RebootMonitor"
# Check task trigger configuration (triggers are a property of the task object)
(Get-ScheduledTask -TaskName "RebootMonitor").Triggers
```
### Network Access from Windows
**Symptoms**: PowerShell cannot reach Discord webhook
**Diagnosis**:
```powershell
# Test network connectivity
Test-NetConnection discord.com -Port 443
# Test webhook manually
Invoke-RestMethod -Uri $webhookUrl -Method Post -Body '{"content":"test"}' -ContentType "application/json"
# Check Windows firewall
Get-NetFirewallRule | Where-Object {$_.DisplayName -like "*PowerShell*"}
```
**Solutions**:
```powershell
# Allow PowerShell through firewall
New-NetFirewallRule -DisplayName "PowerShell Outbound" -Direction Outbound -Program "C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe" -Action Allow
# Test with simplified request
$body = @{content="Test from Windows"} | ConvertTo-Json
Invoke-RestMethod -Uri $webhookUrl -Method Post -Body $body -ContentType "application/json"
```
## Log Management Issues
### Log Files Growing Too Large
**Symptoms**: Disk space filling up, slow log access
**Diagnosis**:
```bash
# Check log file sizes
du -sh /var/log/homelab-*
du -sh /tmp/*monitor*.log
# Check available disk space
df -h /var/log
df -h /tmp
```
**Solutions**:
```bash
# Implement log rotation (writing under /etc requires root)
sudo tee /etc/logrotate.d/homelab-monitoring > /dev/null << 'EOF'
/var/log/homelab-*.log {
    daily
    missingok
    rotate 7
    compress
    notifempty
    create 644 root root
}
EOF
# Manual log cleanup
find /tmp -name "*monitor*.log" -size +10M -delete
truncate -s 0 /tmp/large-log-file.log
```
### Log Rotation Not Working
**Symptoms**: Old logs not being cleaned up
**Diagnosis**:
```bash
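# Check logrotate status (run by a systemd timer on many current distros, otherwise from /etc/cron.daily)
systemctl status logrotate.timer
cat /var/lib/logrotate/status # RHEL-family path: /var/lib/logrotate/logrotate.status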
# Test logrotate configuration
logrotate -d /etc/logrotate.d/homelab-monitoring
```
**Solutions**:
```bash
# Force log rotation
sudo logrotate -f /etc/logrotate.d/homelab-monitoring
# Fix logrotate configuration
sudo nano /etc/logrotate.d/homelab-monitoring
# Verify syntax and permissions
```
## Cron Job Issues
### Scheduled Tasks Not Running
**Symptoms**: Scripts not executing at scheduled times
**Diagnosis**:
```bash
# Check cron service
systemctl status cron
systemctl status crond # RHEL/CentOS
# View cron logs
grep CRON /var/log/syslog
journalctl -u cron
# List all cron jobs
crontab -l
sudo crontab -l # root's crontab (system-wide jobs live in /etc/crontab and /etc/cron.d/)
```
**Solutions**:
```bash
# Restart cron service
sudo systemctl restart cron
# Fix cron job syntax
# Ensure absolute paths are used
# Example: */20 * * * * /full/path/to/script.sh
# Check script permissions and execution
ls -la /path/to/script.sh
/path/to/script.sh # Test manual execution
```
### Environment Variables in Cron
**Symptoms**: Scripts work manually but fail in cron
**Diagnosis**:
```bash
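# Add a temporary crontab entry (via crontab -e) to capture cron's environment
* * * * * env > /tmp/cron-env.txt
# After a minute, capture the shell environment and compare (sorted so the diff lines up)
env | sort > /tmp/shell-env.txt
diff /tmp/shell-env.txt <(sort /tmp/cron-env.txt)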
```
**Solutions**:
```bash
# Set PATH at the top of the crontab (crontab syntax, not shell)
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
# Or set PATH at the top of the script
#!/bin/bash
export PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
# Source environment if needed (set -a exports the plain KEY=value lines)
set -a; source /etc/environment; set +a
```
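To reproduce a cron-only failure interactively, it can help to run the script under a stripped-down environment instead of waiting for the next scheduled run. The sketch below assumes cron's defaults are roughly `PATH=/usr/bin:/bin` and `SHELL=/bin/sh`. That matches common cron implementations but is an assumption, so compare it against the captured `/tmp/cron-env.txt`. `run_like_cron` is a hypothetical helper name.

```bash
# Hypothetical helper: run a command string with a minimal, cron-like environment.
# PATH and SHELL values are assumed defaults; adjust them to match /tmp/cron-env.txt.
run_like_cron() {
    env -i \
        HOME="${HOME:-/}" \
        LOGNAME="${LOGNAME:-$(id -un)}" \
        PATH=/usr/bin:/bin \
        SHELL=/bin/sh \
        /bin/sh -c "$1"
}
```

For example, `run_like_cron '/full/path/to/script.sh'` should fail in the same way the cron run does if the cause is a missing `PATH` entry or an unset variable.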
## Network Monitoring Issues
### False Positives
**Symptoms**: Alerts for services that are actually working
**Diagnosis**:
```bash
# Test monitoring checks manually
curl -sSf --max-time 10 "https://service.homelab.local"
ping -c1 -W5 10.10.0.100
# Check for intermittent network issues
for i in {1..10}; do ping -c1 host || echo "Fail $i"; done
```
**Solutions**:
```bash
# Adjust timeout values
curl --max-time 30 "$service" # Increase timeout
# Add retry logic (alert only after the final attempt; sleep only between attempts)
for retry in {1..3}; do
    if curl -sSf "$service" >/dev/null 2>&1; then
        break
    elif [ "$retry" -eq 3 ]; then
        send_alert "Service $service failed after 3 attempts"
    else
        sleep 5
    fi
done
```
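If several checks need the same retry behavior, the loop can be factored into a reusable function. This is a sketch under stated assumptions: `retry` is a hypothetical helper, and the attempt count and delay are passed by the caller rather than taken from any existing script.

```bash
# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND until it succeeds or MAX_ATTEMPTS is reached.
retry() {
    local max_attempts="$1" delay="$2" attempt=1
    shift 2
    while [ "$attempt" -le "$max_attempts" ]; do
        if "$@"; then
            return 0
        fi
        # Sleep only between attempts, not after the final failure
        if [ "$attempt" -lt "$max_attempts" ]; then
            sleep "$delay"
        fi
        attempt=$((attempt + 1))
    done
    return 1
}
```

A check would then read `retry 3 5 curl -sSf --max-time 10 "$service" >/dev/null || send_alert "Service $service failed"`, which keeps the alert decision in one place.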
### Missing Alerts
**Symptoms**: Real failures not triggering notifications
**Diagnosis**:
```bash
# Verify monitoring script logic
bash -x monitoring-script.sh
# Check if services are actually down
systemctl status service-name
curl -v service-url
```
**Solutions**:
```bash
# Lower detection thresholds
# Increase monitoring frequency
# Add redundant monitoring methods
# Test alert mechanism (send_alert takes the message as its first argument)
send_alert "Test alert"
```
## System Resource Issues
### Monitoring Overhead
**Symptoms**: High CPU/memory usage from monitoring scripts
**Diagnosis**:
```bash
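# Monitor the monitoring scripts (top -p expects a comma-separated PID list)
top -p "$(pgrep -d, -f monitor)"
ps aux | grep '[m]onitor' # bracket trick keeps grep from matching itself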
# Check monitoring frequency
crontab -l | grep monitor
```
**Solutions**:
```bash
# Reduce monitoring frequency
# Change from */1 to */5 minutes
# Optimize scripts
# Remove unnecessary commands
# Use efficient tools (prefer curl over wget, etc.)
# Add resource limits
timeout 30 monitoring-script.sh
```
## Emergency Recovery
### Complete Monitoring Failure
**Recovery Steps**:
```bash
# Restart all monitoring services
sudo systemctl restart cron
sudo systemctl restart rsyslog
# Reinstall monitoring scripts
cd /path/to/scripts
./install-monitoring.sh
# Test all components
./test-monitoring.sh
```
### Discord Integration Lost
**Quick Recovery**:
```bash
# Test the backup webhook
curl -X POST "$BACKUP_WEBHOOK_URL" -H "Content-Type: application/json" -d '{"content": "Monitoring restored"}'
# Switch to the backup webhook if the test message arrives
export DISCORD_WEBHOOK_URL="$BACKUP_WEBHOOK_URL"
```
## Prevention and Best Practices
### Monitoring Health Checks
```bash
#!/bin/bash
# monitor-the-monitors.sh
# Assumes send_alert (see Backup Alerting Channels) is defined or sourced before this point.
MONITORING_SCRIPTS="/path/to/tdarr-monitor.sh /path/to/network-monitor.sh"
for script in $MONITORING_SCRIPTS; do
    if [ ! -x "$script" ]; then
        send_alert "ALERT: $script not executable"
    fi
    # Check if script has run recently (each script must touch "$script.last_run" when it finishes)
    last_run=$(stat -c %Y "$script.last_run" 2>/dev/null || echo 0)
    if [ $(( $(date +%s) - last_run )) -gt 3600 ]; then
        send_alert "ALERT: $script hasn't run in over an hour"
    fi
done
```
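The health check reads the modification time of a `$script.last_run` marker file, so each monitoring script has to update that marker when it finishes. A minimal sketch of that bookkeeping follows. `mark_last_run` and `seconds_since_last_run` are hypothetical helper names, `stat -c %Y` assumes GNU coreutils, and the script's directory is assumed to be writable by the user running it.

```bash
# Hypothetical helper: record a successful run next to the script.
# Call with the same absolute path that appears in MONITORING_SCRIPTS.
mark_last_run() {
    touch "$1.last_run"
}

# Hypothetical helper: print seconds since the marker was last touched.
# A missing marker counts as "never run" (epoch 0), so the age is very large.
seconds_since_last_run() {
    local marker="$1.last_run" last
    last=$(stat -c %Y "$marker" 2>/dev/null || echo 0)
    echo $(( $(date +%s) - last ))
}
```

A monitoring script would end with something like `mark_last_run "/path/to/tdarr-monitor.sh"`, placed after the work it is meant to vouch for so that a crash midway leaves the marker stale.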
### Backup Alerting Channels
```bash
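# Multiple notification methods
send_alert() {
    local message="$1"
    local payload
    # Build the payload with jq so quotes and newlines in the message are escaped
    payload=$(jq -n --arg content "$message" '{content: $content}')
    # Primary: Discord (-f makes curl fail on HTTP errors so the fallbacks can run)
    curl -sSf -X POST "$DISCORD_WEBHOOK_URL" \
        -H "Content-Type: application/json" \
        -d "$payload" ||
    # Backup: Email
    echo "$message" | mail -s "Homelab Alert" admin@domain.com ||
    # Last resort: Local log
    echo "$(date): $message" >> /var/log/critical-alerts.log
}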
```
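The fallback order is easier to test when each channel is its own function and a small dispatcher tries them in turn. The sketch below is one possible arrangement, not the existing implementation: `notify_with_fallback` is a hypothetical name, and each channel is assumed to be a function that takes the message as its first argument and returns non-zero on failure.

```bash
# notify_with_fallback MESSAGE CHANNEL [CHANNEL...]
# Hypothetical dispatcher: tries each channel function in order and prints
# the name of the first one that succeeds. Returns 1 if every channel fails.
notify_with_fallback() {
    local message="$1" channel
    shift
    for channel in "$@"; do
        if "$channel" "$message"; then
            echo "$channel"
            return 0
        fi
    done
    return 1
}
```

With stub channels in place of the real Discord, email, and log functions, the ordering can be exercised without sending anything, which also makes it possible to verify the last-resort path before it is needed.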
This troubleshooting guide covers the most common monitoring system issues and provides systematic recovery procedures.