Monitoring System Troubleshooting Guide
Discord Notification Issues
Webhook Not Working
Symptoms: No Discord messages received, connection errors
Diagnosis:
# Test webhook manually
curl -X POST "$DISCORD_WEBHOOK_URL" \
-H "Content-Type: application/json" \
-d '{"content": "Test message"}'
# Check webhook URL format
echo "$DISCORD_WEBHOOK_URL" | grep -E '^https://discord\.com/api/webhooks/[0-9]+/.+'
Solutions:
# Verify webhook URL is correct
# Format: https://discord.com/api/webhooks/ID/TOKEN
# Test with minimal payload
curl -X POST "$DISCORD_WEBHOOK_URL" \
-H "Content-Type: application/json" \
-d '{"content": "✅ Webhook working"}'
# Check for JSON formatting issues
echo '{"content": "test"}' | jq . # Validate JSON
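A malformed or truncated URL is a common cause of webhook failures, so it can help to check the format before sending anything. The following is a minimal bash sketch of such a check. The function name is_valid_webhook_url is hypothetical, and the pattern only covers the https://discord.com/api/webhooks/ID/TOKEN format described above.

```shell
# Hypothetical helper: returns 0 when the URL matches the webhook format
# given in this guide (https://discord.com/api/webhooks/ID/TOKEN).
is_valid_webhook_url() {
    local url="$1"
    # Numeric ID, then a token with no slashes or whitespace. Anchoring at
    # both ends also rejects stray spaces left over from copy-paste.
    local pattern='^https://discord\.com/api/webhooks/[0-9]+/[^/[:space:]]+$'
    [[ "$url" =~ $pattern ]]
}

# Example use in a monitoring script:
#   is_valid_webhook_url "$DISCORD_WEBHOOK_URL" || echo "Webhook URL looks malformed" >&2
```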
Message Formatting Problems
Symptoms: Malformed messages, broken markdown, missing user pings
Common Issues:
# ❌ Broken JSON escaping
{"content": "Error: "quotes" break JSON"}
# ✅ Proper JSON escaping
{"content": "Error: \"quotes\" properly escaped"}
# ❌ User ping inside code block (doesn't work)
{"content": "```\nIssue occurred <@user_id>\n```"}
# ✅ User ping outside code block
{"content": "```\nIssue occurred\n```\nManual intervention needed <@user_id>"}
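Both problems above can be avoided by building the payload in one place instead of hand-writing JSON in each script. The sketch below is one possible approach in plain bash. The helper names json_escape and build_alert_payload are hypothetical, and the escaping covers only backslashes, quotes, newlines, carriage returns, and tabs. Where jq is available, generating the JSON with jq is likely more robust.

```shell
# Hypothetical helper: escape a string for use inside a JSON string literal.
json_escape() {
    local s="$1"
    s=${s//\\/\\\\}       # backslashes first, so later escapes are not doubled
    s=${s//\"/\\\"}       # double quotes
    s=${s//$'\n'/\\n}     # newlines
    s=${s//$'\r'/\\r}     # carriage returns
    s=${s//$'\t'/\\t}     # tabs
    printf '%s' "$s"
}

# Hypothetical helper: put the details in a code block and the ping after it,
# because a ping inside a code block is shown as plain text.
build_alert_payload() {
    local details="$1" note="$2" user_id="$3"
    local tick='`'
    local fence="$tick$tick$tick"   # Discord code-block delimiter
    local content
    content=$(printf '%s\n%s\n%s\n%s <@%s>' \
        "$fence" "$details" "$fence" "$note" "$user_id")
    printf '{"content": "%s"}\n' "$(json_escape "$content")"
}

# Example use:
#   build_alert_payload 'Error: "disk" full' 'Manual intervention needed' "$USER_ID"
```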
Tdarr Monitoring Issues
Script Not Running
Symptoms: No monitoring alerts, script execution failures
Diagnosis:
# Check cron job status
crontab -l | grep tdarr-timeout-monitor
systemctl status cron
# Run script manually for debugging
bash -x /path/to/tdarr-timeout-monitor.sh
# Check script permissions
ls -la /path/to/tdarr-timeout-monitor.sh
Solutions:
# Fix script permissions
chmod +x /path/to/tdarr-timeout-monitor.sh
# Reinstall cron job
crontab -e
# Add: */20 * * * * /full/path/to/tdarr-timeout-monitor.sh
# Check script environment
# Ensure PATH and variables are set correctly in script
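The permission and crontab checks above can be combined into a small preflight test. This is a sketch under assumptions: the function names check_script_ready and has_cron_entry are hypothetical, and has_cron_entry reads crontab text on stdin so it can be fed from crontab -l.

```shell
# Hypothetical helper: verify that a script can be launched by cron.
check_script_ready() {
    local script="$1"
    if [[ "$script" != /* ]]; then
        echo "not an absolute path: $script"
        return 1
    fi
    if [[ ! -f "$script" ]]; then
        echo "missing: $script"
        return 1
    fi
    if [[ ! -x "$script" ]]; then
        echo "not executable: $script"
        return 1
    fi
    echo "ok: $script"
}

# Hypothetical helper: succeed if an active (non-commented) crontab line
# on stdin mentions the given script name.
has_cron_entry() {
    local name="$1"
    grep -v '^[[:space:]]*#' | grep -qF -- "$name"
}

# Example use:
#   check_script_ready /path/to/tdarr-timeout-monitor.sh
#   crontab -l 2>/dev/null | has_cron_entry tdarr-timeout-monitor || echo "cron entry missing"
```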
API Connection Failures
Symptoms: Cannot connect to Tdarr server, timeout errors
Diagnosis:
# Test Tdarr API manually
curl -f "http://tdarr-server:8266/api/v2/status"
# Check network connectivity
ping tdarr-server
nc -zv tdarr-server 8266
# Verify SSH access to server
ssh tdarr "docker ps | grep tdarr"
Solutions:
# Update server connection in script
# Verify server IP and port are correct
# Test API endpoints
curl "http://10.10.0.43:8265/api/v2/status" # Web port
curl "http://10.10.0.43:8266/api/v2/status" # Server port
# Check Tdarr server logs (--tail also covers stderr, which a pipe to tail would miss)
ssh tdarr "docker logs --tail 20 tdarr"
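When nc is not installed on the monitoring host, bash's built-in /dev/tcp redirection can serve as a rough substitute for the port check. The sketch below assumes bash was built with /dev/tcp support and that the coreutils timeout command is available. The function names port_open and check_tdarr_ports are hypothetical. The ports are the two used in this guide (8265 web, 8266 server).

```shell
# Hypothetical helper: succeed if a TCP connection to host:port can be opened
# within wait_secs seconds (default 3).
port_open() {
    local host="$1" port="$2" wait_secs="${3:-3}"
    # The child bash receives host and port as $0 and $1.
    timeout "$wait_secs" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# Hypothetical helper: report the state of both Tdarr ports on one host.
check_tdarr_ports() {
    local host="$1" port status=0
    for port in 8265 8266; do
        if port_open "$host" "$port"; then
            echo "$host:$port open"
        else
            echo "$host:$port unreachable"
            status=1
        fi
    done
    return "$status"
}

# Example use:
#   check_tdarr_ports 10.10.0.43
```

An open port only shows that something is listening. The curl tests above are still needed to confirm that the API itself responds.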
Windows Desktop Monitoring Issues
PowerShell Script Not Running
Symptoms: No reboot notifications from Windows systems
Diagnosis:
# Check scheduled task status
Get-ScheduledTask -TaskName "Reboot*" | Get-ScheduledTaskInfo
# Test script execution manually
PowerShell -ExecutionPolicy Bypass -File "C:\path\to\windows-reboot-monitor.ps1"
# Check PowerShell execution policy
Get-ExecutionPolicy
Solutions:
# Set execution policy
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser
# Recreate scheduled tasks
schtasks /Create /XML "C:\path\to\task.xml" /TN "RebootMonitor"
# Check task trigger configuration
(Get-ScheduledTask -TaskName "RebootMonitor").Triggers
Network Access from Windows
Symptoms: PowerShell cannot reach Discord webhook
Diagnosis:
# Test network connectivity
Test-NetConnection discord.com -Port 443
# Test webhook manually
Invoke-RestMethod -Uri $webhookUrl -Method Post -Body '{"content":"test"}' -ContentType "application/json"
# Check Windows firewall
Get-NetFirewallRule | Where-Object {$_.DisplayName -like "*PowerShell*"}
Solutions:
# Allow PowerShell through firewall
New-NetFirewallRule -DisplayName "PowerShell Outbound" -Direction Outbound -Program "C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe" -Action Allow
# Test with simplified request
$body = @{content="Test from Windows"} | ConvertTo-Json
Invoke-RestMethod -Uri $webhookUrl -Method Post -Body $body -ContentType "application/json"
Log Management Issues
Log Files Growing Too Large
Symptoms: Disk space filling up, slow log access
Diagnosis:
# Check log file sizes
du -sh /var/log/homelab-*
du -sh /tmp/*monitor*.log
# Check available disk space
df -h /var/log
df -h /tmp
Solutions:
# Implement log rotation
cat > /etc/logrotate.d/homelab-monitoring << 'EOF'
/var/log/homelab-*.log {
    daily
    missingok
    rotate 7
    compress
    notifempty
    create 644 root root
}
EOF
# Manual log cleanup
find /tmp -name "*monitor*.log" -size +10M -delete
truncate -s 0 /tmp/large-log-file.log
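Deleting a log that a running script still has open does not free the space until that script exits, so truncating in place is often the safer cleanup. The following sketch lists oversized monitor logs before emptying them. The function names list_large_logs and empty_large_logs are hypothetical, and the search is limited to the top level of the given directory, which is an assumption rather than something this guide requires.

```shell
# Hypothetical helper: print monitor logs in dir larger than min_size
# (find -size syntax, default 10M).
list_large_logs() {
    local dir="$1" min_size="${2:-10M}"
    find "$dir" -maxdepth 1 -type f -name '*monitor*.log' -size "+$min_size"
}

# Hypothetical helper: truncate those same logs to zero bytes in place.
empty_large_logs() {
    local dir="$1" min_size="${2:-10M}"
    find "$dir" -maxdepth 1 -type f -name '*monitor*.log' -size "+$min_size" \
        -exec truncate -s 0 {} +
}

# Example use: review first, then clean up.
#   list_large_logs /tmp
#   empty_large_logs /tmp
```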
Log Rotation Not Working
Symptoms: Old logs not being cleaned up
Diagnosis:
# Check logrotate status
systemctl status logrotate.service logrotate.timer # On systemd hosts the timer schedules the runs
cat /var/lib/logrotate/status
# Test logrotate configuration
logrotate -d /etc/logrotate.d/homelab-monitoring
Solutions:
# Force log rotation
logrotate -f /etc/logrotate.d/homelab-monitoring
# Fix logrotate configuration
sudo nano /etc/logrotate.d/homelab-monitoring
# Verify syntax and permissions
Cron Job Issues
Scheduled Tasks Not Running
Symptoms: Scripts not executing at scheduled times
Diagnosis:
# Check cron service
systemctl status cron
systemctl status crond # RHEL/CentOS
# View cron logs
grep CRON /var/log/syslog
journalctl -u cron
# List all cron jobs
crontab -l
sudo crontab -l # root's crontab (system-wide jobs live in /etc/crontab and /etc/cron.d/)
Solutions:
# Restart cron service
sudo systemctl restart cron
# Fix cron job syntax
# Ensure absolute paths are used
# Example: */20 * * * * /full/path/to/script.sh
# Check script permissions and execution
ls -la /path/to/script.sh
/path/to/script.sh # Test manual execution
Environment Variables in Cron
Symptoms: Scripts work manually but fail in cron
Diagnosis:
# Create test cron job to check environment
* * * * * env > /tmp/cron-env.txt
# Compare with shell environment
env > /tmp/shell-env.txt
# Sort both files so ordering differences do not show up as changes
diff <(sort /tmp/shell-env.txt) <(sort /tmp/cron-env.txt)
Solutions:
# Set PATH in crontab
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
# Or set PATH in script
#!/bin/bash
export PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
# Source environment if needed (set -a exports the plain KEY=value lines)
set -a; source /etc/environment; set +a
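To reproduce a cron-only failure without waiting for the next scheduled run, a script can be started with a stripped-down environment similar to the one cron provides. The sketch below assumes common defaults (SHELL=/bin/sh, PATH=/usr/bin:/bin). The exact values vary by cron implementation, so check crontab(5) on the host. The function name run_like_cron is hypothetical.

```shell
# Hypothetical helper: run a command line with a minimal, cron-like environment.
run_like_cron() {
    # env -i starts from an empty environment, so nothing from the
    # interactive shell (PATH additions, webhook variables) leaks in.
    env -i \
        HOME="${HOME:-/}" \
        LOGNAME="${LOGNAME:-$(id -un)}" \
        SHELL=/bin/sh \
        PATH=/usr/bin:/bin \
        /bin/sh -c "$1"
}

# Example use:
#   run_like_cron '/full/path/to/script.sh'
```

If the script fails here but works in a normal shell, the difference is most likely a missing PATH entry or an unset variable.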
Network Monitoring Issues
False Positives
Symptoms: Alerts for services that are actually working
Diagnosis:
# Test monitoring checks manually
curl -sSf --max-time 10 "https://service.homelab.local"
ping -c1 -W5 10.10.0.100
# Check for intermittent network issues
for i in {1..10}; do ping -c1 host || echo "Fail $i"; done
Solutions:
# Adjust timeout values
curl --max-time 30 "$service" # Increase timeout
# Add retry logic
for retry in {1..3}; do
    if curl -sSf "$service" >/dev/null 2>&1; then
        break
    elif [ "$retry" -eq 3 ]; then
        send_alert "Service $service failed after 3 retries"
    fi
    sleep 5
done
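The same retry pattern can be factored into a reusable function so that every check in a monitoring script gets identical behavior. This is a sketch rather than the scripts' actual implementation. The function name retry and its argument order (attempts, delay in seconds, then the command) are assumptions.

```shell
# Hypothetical helper: run a command up to `attempts` times, pausing `delay`
# seconds between tries. Returns 0 on the first success, 1 if all tries fail.
retry() {
    local attempts="$1" delay="$2"
    shift 2
    local n
    for ((n = 1; n <= attempts; n++)); do
        if "$@"; then
            return 0
        fi
        # Skip the pause after the final attempt.
        if ((n < attempts)); then
            sleep "$delay"
        fi
    done
    return 1
}

# Example use: alert only when all three attempts fail.
#   retry 3 5 curl -sSf --max-time 10 "$service" >/dev/null 2>&1 ||
#       echo "Service $service failed after 3 retries"
```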
Missing Alerts
Symptoms: Real failures not triggering notifications
Diagnosis:
# Verify monitoring script logic
bash -x monitoring-script.sh
# Check if services are actually down
systemctl status service-name
curl -v service-url
Solutions:
# Lower detection thresholds
# Increase monitoring frequency
# Add redundant monitoring methods
# Test alert mechanism
send_alert "Test alert" # Call the script's alert function directly with a known message
System Resource Issues
Monitoring Overhead
Symptoms: High CPU/memory usage from monitoring scripts
Diagnosis:
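# Monitor the monitoring scripts
top -p "$(pgrep -d, -f monitor)" # top expects a comma-separated PID list
ps aux | grep '[m]onitor' # Bracket keeps grep from matching its own process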
# Check monitoring frequency
crontab -l | grep monitor
Solutions:
# Reduce monitoring frequency
# Change from */1 to */5 minutes
# Optimize scripts
# Remove unnecessary commands
# Use efficient tools (prefer curl over wget, etc.)
# Add resource limits
timeout 30 monitoring-script.sh
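With GNU coreutils, timeout exits with status 124 when it has to stop the command, which makes it possible to tell a hung script apart from one that failed on its own. The wrapper below is a sketch under that assumption. The function name run_with_limit is hypothetical, and the added nice level of 10 is an arbitrary choice for lowering CPU priority.

```shell
# Hypothetical helper: run a command at reduced priority with a time limit
# in seconds, and report on stderr when the limit was hit.
run_with_limit() {
    local limit="$1"
    shift
    local rc=0
    timeout "$limit" nice -n 10 "$@" || rc=$?
    if [ "$rc" -eq 124 ]; then
        echo "timed out after ${limit}s: $*" >&2
    fi
    return "$rc"
}

# Example use:
#   run_with_limit 30 /path/to/monitoring-script.sh
```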
Emergency Recovery
Complete Monitoring Failure
Recovery Steps:
# Restart all monitoring services
sudo systemctl restart cron
sudo systemctl restart rsyslog
# Reinstall monitoring scripts
cd /path/to/scripts
./install-monitoring.sh
# Test all components
./test-monitoring.sh
Discord Integration Lost
Quick Recovery:
# Test the backup webhook
curl -X POST "$BACKUP_WEBHOOK_URL" -H "Content-Type: application/json" -d '{"content": "Monitoring restored"}'
# Switch to backup webhook if needed
export DISCORD_WEBHOOK_URL="$BACKUP_WEBHOOK_URL"
Prevention and Best Practices
Monitoring Health Checks
#!/bin/bash
# monitor-the-monitors.sh
MONITORING_SCRIPTS="/path/to/tdarr-monitor.sh /path/to/network-monitor.sh"
# send_alert takes the message as its first argument
for script in $MONITORING_SCRIPTS; do
    if [ ! -x "$script" ]; then
        send_alert "ALERT: $script not executable"
    fi
    # Check if script has run recently
    if [ $(($(date +%s) - $(stat -c %Y "$script.last_run" 2>/dev/null || echo 0))) -gt 3600 ]; then
        send_alert "ALERT: $script hasn't run in over an hour"
    fi
done
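The health check above depends on each monitoring script updating a .last_run marker file, for example with touch "$0.last_run" as its final step. That convention is inferred from the check, not confirmed by the scripts themselves. Under that assumption, the age test can be pulled into a small function. The name is_stale is hypothetical, and stat -c and touch -d are GNU options.

```shell
# Hypothetical helper: succeed if the marker file is missing or older than
# max_age seconds (default 3600).
is_stale() {
    local marker="$1" max_age="${2:-3600}"
    local now mtime
    now=$(date +%s)
    # A missing marker yields mtime 0, which always counts as stale.
    mtime=$(stat -c %Y "$marker" 2>/dev/null || echo 0)
    ((now - mtime > max_age))
}

# Example use:
#   is_stale "/path/to/tdarr-monitor.sh.last_run" && echo "tdarr monitor has not run in over an hour"
```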
Backup Alerting Channels
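# Multiple notification methods, tried in order until one succeeds
send_alert() {
    local message="$1"
    local payload
    # Build the payload with jq so quotes and newlines are escaped correctly
    payload=$(jq -n --arg content "$message" '{content: $content}')
    # Primary: Discord (-f makes curl fail on HTTP errors so the fallback runs)
    curl -fsS -X POST "$DISCORD_WEBHOOK_URL" \
        -H "Content-Type: application/json" \
        -d "$payload" >/dev/null ||
    # Backup: Email
    echo "$message" | mail -s "Homelab Alert" admin@domain.com ||
    # Last resort: Local log
    echo "$(date): $message" >> /var/log/critical-alerts.log
}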
This troubleshooting guide covers the most common monitoring system issues and provides systematic recovery procedures.