Cal Corum 7edb4a3a9c CLAUDE: Update VM management patterns and Tdarr operational scripts

2025-08-12 12:18:43 -05:00


Tdarr Monitoring Configuration - Discord Integration

Overview

This document describes the active Discord monitoring system for Tdarr worker timeouts, stalls, and operational status across the homelab environment.

System Architecture

Components

Tdarr Server: ubuntu-ct (10.10.0.43) running tdarr-clean container
Tdarr Node: nobara-pc (unmapped architecture) running tdarr-node-gpu-unmapped
Monitoring Script: /scripts/monitoring/tdarr-timeout-monitor.sh
Discord Integration: Webhook-based notifications to designated channel

Monitoring Targets

✅ Server Limbo Timeouts - Files stuck in staging beyond timeout period
✅ Node Worker Stalls - Workers that hang during transcoding operations
✅ Worker Disconnections - Unexpected worker disconnects
✅ Success Notifications - Optional completion tracking (currently disabled)

Active Configuration

Script Location

/mnt/NV2/Development/claude-home/scripts/monitoring/tdarr-timeout-monitor.sh

Key Settings

Polling Interval: 15 minutes (900 seconds)
Discord Webhook: Configured and active
Server Connection: SSH alias tdarr → ubuntu-ct (10.10.0.43)
Node Container: tdarr-node-gpu-unmapped via Podman
Working Directory: /tmp/tdarr-monitor/

Monitored Log Patterns

Server Timeout Events

# Pattern: "has been in limbo for X seconds, removing from staging section"
grep -i "has been in limbo"

Example Alert:

⚠️ 4 file(s) timed out in staging:
TV/Survivor/Season 48/Survivor (2000) - S48E04...
TV/Survivor/Season 48/Survivor (2000) - S48E11...
TV/Survivor/Season 26/Survivor (2000) - S26E05...
TV/Survivor/Season 22/Survivor (2000) - S22E13...
Files were removed from staging and will retry.
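
The grep above can also be written as a small parser. This is a minimal Python sketch, assuming each timeout line contains the phrase quoted in the pattern comment with a numeric seconds value. The function names and the sample log lines are hypothetical and are not taken from tdarr-timeout-monitor.sh.

```python
import re

# Phrase taken from the pattern comment above; assumes X is a number.
LIMBO_RE = re.compile(r"has been in limbo for (\d+(?:\.\d+)?) seconds", re.IGNORECASE)


def find_limbo_timeouts(log_text):
    """Return the limbo duration in seconds for every timeout line in log_text."""
    durations = []
    for line in log_text.splitlines():
        match = LIMBO_RE.search(line)
        if match:
            durations.append(float(match.group(1)))
    return durations


def format_timeout_alert(durations):
    """Build a short summary line, or return None when nothing timed out."""
    if not durations:
        return None
    longest = int(max(durations))
    return f"⚠️ {len(durations)} file(s) timed out in staging (longest wait: {longest}s)"


# Hypothetical sample lines, for illustration only.
sample = (
    "File A has been in limbo for 901 seconds, removing from staging section\n"
    "Unrelated log line\n"
    "File B has been in limbo for 1200 seconds, removing from staging section\n"
)
print(format_timeout_alert(find_limbo_timeouts(sample)))
```

Extracting the file path is left out on purpose, because the position of the path within the log line is not shown in this document.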

Node Worker Events

# Patterns: "Worker X has stalled, cancelling" | "Worker X disconnected. Pruning."
grep -i "worker.*stalled\|worker.*disconnected"

Example Alert:

🔴 4 worker stall(s) detected:
Worker eager-eyas
Worker oblong-owl
Worker other-olm
Worker dry-dugong
Workers were cancelled and will restart.
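
The two worker patterns can be separated so that stalls and disconnects are counted independently. The sketch below assumes the worker name is a single whitespace-free token, as in the example alert; the function name and sample lines are hypothetical.

```python
import re

# Phrases taken from the pattern comment above.
STALL_RE = re.compile(r"Worker (\S+) has stalled", re.IGNORECASE)
DISCONNECT_RE = re.compile(r"Worker (\S+) disconnected", re.IGNORECASE)


def classify_worker_events(log_text):
    """Return (stalled, disconnected) lists of worker names in order of appearance."""
    stalled, disconnected = [], []
    for line in log_text.splitlines():
        stall = STALL_RE.search(line)
        if stall:
            stalled.append(stall.group(1))
            continue
        disconnect = DISCONNECT_RE.search(line)
        if disconnect:
            disconnected.append(disconnect.group(1))
    return stalled, disconnected


# Hypothetical sample lines, for illustration only.
sample_node_log = (
    "Worker eager-eyas has stalled, cancelling\n"
    "Some other node message\n"
    "Worker oblong-owl disconnected. Pruning.\n"
)
stalled_workers, disconnected_workers = classify_worker_events(sample_node_log)
print(f"{len(stalled_workers)} stall(s), {len(disconnected_workers)} disconnect(s)")
```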

Historical Context

Problem Timeline

Initial Issue: Multiple workers stalling due to resource competition
Worker Reduction: Reduced to 1 CPU + 1 GPU worker each
Timeout Increase: Extended staging timeout from 5 minutes to 15-20 minutes
SMB Optimization: Improved file transfer from 61.8 MB/s to 103 MB/s
Monitoring Implementation: Custom Discord alerts for proactive issue detection

Root Causes Addressed

Large File Timeouts: TV episodes (3-8GB+) exceeding 5-minute staging timeout
Worker Competition: Multiple workers competing for GPU resources
Network Performance: Slow SMB transfers causing download delays
Node Instability: Workers hanging during complex transcoding flows

Deployment Status

Current State

✅ Script Deployed: Located and executable
✅ Discord Webhook: Configured and tested
✅ Function Order: Fixed (send_discord_notification defined before use)
✅ Polling Interval: Set to 15 minutes
✅ Initial Test: "monitoring started" message sent successfully

Automation Setup

Status: Ready for automation - choose deployment method:

Option A: Cron Job

# Edit user crontab
crontab -e

# Add monitoring every 15 minutes
*/15 * * * * /mnt/NV2/Development/claude-home/scripts/monitoring/tdarr-timeout-monitor.sh >/dev/null 2>&1

Option B: Systemd Service (Recommended)

# Service file
sudo tee /etc/systemd/system/tdarr-monitor.service > /dev/null <<EOF
[Unit]
Description=Tdarr Timeout Monitor
After=network.target

[Service]
Type=oneshot
User=cal
ExecStart=/mnt/NV2/Development/claude-home/scripts/monitoring/tdarr-timeout-monitor.sh
EOF

# Timer file
sudo tee /etc/systemd/system/tdarr-monitor.timer > /dev/null <<EOF
[Unit]
Description=Run Tdarr Monitor every 15 minutes
Requires=tdarr-monitor.service

[Timer]
OnCalendar=*:0/15
Persistent=true

[Install]
WantedBy=timers.target
EOF

# Enable and start
sudo systemctl daemon-reload
sudo systemctl enable tdarr-monitor.timer
sudo systemctl start tdarr-monitor.timer

Discord Integration Details

Webhook Configuration

Channel: Designated homelab monitoring channel
Message Format: Rich embeds with color coding
Alert Colors: Red (15158332) for errors, Green (3066993) for success
Content: File paths, worker names, timestamps, hostname

Message Examples

Successful Start

{
  "title": "🎬 Tdarr Monitoring Alert",
  "description": "Tdarr timeout monitoring started",
  "color": 3066993,
  "timestamp": "2025-08-10T14:19:55.000Z",
  "footer": {
    "text": "nobara-pc - Sun Aug 10 09:19:55 AM CDT 2025"
  }
}

Timeout Alert

{
  "title": "🎬 Tdarr Monitoring Alert", 
  "description": "⚠️ **4 file(s) timed out in staging:**\n```\nTV/Survivor/Season 48/...\n```\nFiles were removed from staging and will retry.",
  "color": 15158332,
  "timestamp": "2025-08-10T14:49:00.000Z"
}
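
The embed objects above are sent to Discord inside an `embeds` array in the webhook request body. The following Python sketch shows one way to assemble that body using the color codes listed under Webhook Configuration. The function name is hypothetical, and the footer is simplified to the hostname alone rather than the hostname-plus-local-date string shown in the first example.

```python
import json
from datetime import datetime, timezone

# Color values from the Webhook Configuration list.
COLOR_ERROR = 15158332   # red
COLOR_SUCCESS = 3066993  # green


def build_alert_payload(description, is_error, hostname, now=None):
    """Wrap a single embed in the {"embeds": [...]} body a Discord webhook accepts."""
    utc_now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    embed = {
        "title": "🎬 Tdarr Monitoring Alert",
        "description": description,
        "color": COLOR_ERROR if is_error else COLOR_SUCCESS,
        # ISO 8601 UTC with millisecond precision, matching the examples above.
        "timestamp": utc_now.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
        "footer": {"text": hostname},
    }
    return {"embeds": [embed]}


payload = build_alert_payload("Tdarr timeout monitoring started", False, "nobara-pc")
print(json.dumps(payload, ensure_ascii=False, indent=2))
```

The serialized payload would be POSTed to the webhook URL with a `Content-Type: application/json` header.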

Performance Impact

System Resources

CPU Usage: Minimal - quick log parsing every 15 minutes
Network Impact: One SSH connection + Docker logs query per check
Storage: < 1MB in /tmp/tdarr-monitor/
Execution Time: ~2-5 seconds per check

Benefits

Proactive Issue Detection: Know about problems before manual checking
Historical Tracking: Discord provides persistent log of all alerts
Mobile Notifications: Get alerts on phone via Discord app
Reduced Manual Monitoring: Automated awareness of system health

Troubleshooting

Common Issues

No Discord Messages

# Test webhook manually
# DISCORD_WEBHOOK_URL is a placeholder variable: export it with the real webhook URL first.
# The URL embeds a secret token, so it should not be written into documentation.
curl -H "Content-Type: application/json" -X POST \
     -d '{"content":"Manual test message"}' \
     "$DISCORD_WEBHOOK_URL"

# Check script execution
/mnt/NV2/Development/claude-home/scripts/monitoring/tdarr-timeout-monitor.sh

SSH Connection Issues

# Test server connection
ssh tdarr "echo 'SSH working'"

# Check SSH key authentication
ssh-add -l

Podman Container Access

# Verify container is running
podman ps | grep tdarr-node

# Test log access
podman logs --tail 5 tdarr-node-gpu-unmapped

Force Monitoring Check

# Reset timestamp to trigger immediate check
echo $(($(date +%s) - 1000)) > /tmp/tdarr-monitor/last_check.timestamp

# Run manual check
/mnt/NV2/Development/claude-home/scripts/monitoring/tdarr-timeout-monitor.sh
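
The reset above suggests that the script stores the time of its last check as a Unix epoch value and compares it against the polling interval. The following is a sketch of that comparison under those assumptions; the actual logic inside tdarr-timeout-monitor.sh may differ, and the function name is hypothetical.

```python
import time
from pathlib import Path

POLL_INTERVAL_SECONDS = 900  # 15 minutes, per Key Settings


def check_is_due(timestamp_file, now=None, interval=POLL_INTERVAL_SECONDS):
    """Return True when the last recorded check is at least `interval` seconds old."""
    current = time.time() if now is None else now
    try:
        last_check = int(Path(timestamp_file).read_text().strip())
    except (FileNotFoundError, ValueError):
        # Missing or unparsable state is treated as "never checked".
        return True
    return current - last_check >= interval


print(check_is_due("/tmp/tdarr-monitor/last_check.timestamp"))
```

Writing a value 1000 seconds in the past, as the command above does, exceeds the 900-second interval, so a comparison of this kind would report the check as due.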

Security Considerations

Webhook Protection

URL Security: Webhook URL contains sensitive token
Access Control: Only monitoring script has access to webhook
Scope Limitation: Webhook only has permission to post messages

SSH Access

Key-based Authentication: No password authentication used
Limited Commands: Only Docker logs and basic system commands
Network Isolation: SSH connection within trusted homelab network

File Permissions

# Verify script permissions
ls -la /mnt/NV2/Development/claude-home/scripts/monitoring/tdarr-timeout-monitor.sh
# Should be: -rwxr-xr-x (755) or more restrictive

# Secure if needed
chmod 750 /mnt/NV2/Development/claude-home/scripts/monitoring/tdarr-timeout-monitor.sh

Maintenance

Regular Checks

Weekly: Verify Discord messages are still being received
Monthly: Check /tmp/tdarr-monitor/ directory size
Quarterly: Review alert frequency and adjust thresholds if needed

Log Rotation

# Clean old monitoring logs if they accumulate
find /tmp/tdarr-monitor/ -name "*.log" -mtime +7 -delete

Updates

Script Updates: Version control via git in homelab documentation
Webhook Rotation: Update webhook URL if Discord server changes
Threshold Tuning: Adjust 15-minute interval based on operational experience

API-Based Monitoring Enhancement

Tdarr API Monitoring Script

Location: tdarr_monitor.py - Comprehensive API-based monitoring client
Server: http://10.10.0.43:8265 (main Tdarr server)
Dependencies: Python 3 with requests library

Key Features

Server Health Monitoring: Version, uptime, connectivity status
Queue Management: Processing statistics, queue depth, item details
Node Status Tracking: Online/offline nodes, worker counts, active jobs
Library Scan Progress: File counts, scan status, completion percentage
Overall Statistics: Transcoding metrics, space saved, processing speeds
Comprehensive Health Checks: Multi-component status assessment

API Endpoints Monitored

/api/v2/get-server-info - Server version, system info, uptime
/api/v2/get-queue - Current queue status and processing items
/api/v2/get-nodes - Connected nodes and their status
/api/v2/get-libraries - Library scan progress and file counts
/api/v2/get-stats - Overall transcoding statistics and metrics
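
A polling loop over these endpoints might look like the sketch below. It is not the code of tdarr_monitor.py: it uses the standard library instead of requests, treats the endpoint paths exactly as listed above, and assumes each one answers a plain GET with a JSON body. The function names are hypothetical.

```python
import json
from urllib.request import urlopen

# Endpoint paths as listed above; plain GET with a JSON response is an assumption.
ENDPOINTS = {
    "server": "/api/v2/get-server-info",
    "queue": "/api/v2/get-queue",
    "nodes": "/api/v2/get-nodes",
    "libraries": "/api/v2/get-libraries",
    "stats": "/api/v2/get-stats",
}


def http_get_json(url, timeout=10):
    """Fetch a URL and decode the response body as JSON."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def poll_endpoints(base_url, fetch=http_get_json):
    """Query every endpoint and record either the decoded data or the error text."""
    results = {}
    for name, path in ENDPOINTS.items():
        url = base_url.rstrip("/") + path
        try:
            results[name] = {"ok": True, "data": fetch(url)}
        except Exception as exc:  # network failure, HTTP error, or invalid JSON
            results[name] = {"ok": False, "error": str(exc)}
    return results
```

Calling `poll_endpoints("http://10.10.0.43:8265")` would return one entry per endpoint. Passing the fetch function as a parameter keeps the loop testable without a running server.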

Usage Examples

# Quick health check
python3 tdarr_monitor.py --server http://10.10.0.43:8265 --check health

# Queue status monitoring
python3 tdarr_monitor.py --server http://10.10.0.43:8265 --check queue --output json

# Node performance check
python3 tdarr_monitor.py --server http://10.10.0.43:8265 --check nodes --verbose

# Complete system status
python3 tdarr_monitor.py --server http://10.10.0.43:8265 --check all

Integration with Existing Discord Monitoring

#!/bin/bash
# Enhanced health monitoring with API integration
HEALTH_OUTPUT=$(python3 tdarr_monitor.py --server http://10.10.0.43:8265 --check health --output json)
STATUS=$(echo "$HEALTH_OUTPUT" | jq -r '.overall_status')

if [ "$STATUS" != "healthy" ]; then
    # Extract specific issues
    SERVER_HEALTHY=$(echo "$HEALTH_OUTPUT" | jq -r '.checks.server.healthy')
    NODE_COUNT=$(echo "$HEALTH_OUTPUT" | jq -r '.checks.nodes.online_count')
    QUEUE_HEALTHY=$(echo "$HEALTH_OUTPUT" | jq -r '.checks.queue.healthy')
    
    MESSAGE="🎬 **Tdarr API Health Alert**\n"
    MESSAGE+="Overall Status: **$STATUS**\n\n"
    
    if [ "$SERVER_HEALTHY" != "true" ]; then
        MESSAGE+="❌ Server: Offline or unreachable\n"
    fi
    
    if [ "$NODE_COUNT" == "0" ]; then
        MESSAGE+="❌ Nodes: No online transcoding nodes\n"
    fi
    
    if [ "$QUEUE_HEALTHY" != "true" ]; then
        MESSAGE+="❌ Queue: Unable to access queue data\n"
    fi
    
    MESSAGE+="\nCheck server status and node connectivity."
    
    # Send via existing Discord webhook
    send_discord_message "$MESSAGE"
fi
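
The same decision logic can be expressed in Python, which avoids the repeated jq calls. This sketch assumes the JSON shape implied by the jq filters above (`overall_status`, `checks.server.healthy`, `checks.nodes.online_count`, `checks.queue.healthy`); the function name is hypothetical.

```python
def build_health_alert(health):
    """Return an alert message for an unhealthy report, or None when healthy."""
    status = health.get("overall_status", "unknown")
    if status == "healthy":
        return None

    checks = health.get("checks", {})
    lines = ["🎬 **Tdarr API Health Alert**", f"Overall Status: **{status}**", ""]

    if checks.get("server", {}).get("healthy") is not True:
        lines.append("❌ Server: Offline or unreachable")
    if checks.get("nodes", {}).get("online_count") == 0:
        lines.append("❌ Nodes: No online transcoding nodes")
    if checks.get("queue", {}).get("healthy") is not True:
        lines.append("❌ Queue: Unable to access queue data")

    lines += ["", "Check server status and node connectivity."]
    return "\n".join(lines)


# Hypothetical report, for illustration only.
example_report = {
    "overall_status": "degraded",
    "checks": {
        "server": {"healthy": True},
        "nodes": {"online_count": 0},
        "queue": {"healthy": True},
    },
}
print(build_health_alert(example_report))
```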

Cron-based API Monitoring

# API health check every 5 minutes (complement log-based monitoring)
*/5 * * * * python3 /path/to/tdarr_monitor.py --server http://10.10.0.43:8265 --check health >> /tmp/tdarr-api-health.log 2>&1

# Full status check hourly for detailed metrics
0 * * * * python3 /path/to/tdarr_monitor.py --server http://10.10.0.43:8265 --check all --output json > /tmp/tdarr-status.json

Gaming Scheduler Integration

# Before starting transcoding, verify server health via API
if python3 tdarr_monitor.py --server http://10.10.0.43:8265 --check health; then
    echo "Tdarr server healthy, proceeding with scheduled start"
    ./start-tdarr-gpu-podman-clean.sh
else
    echo "Tdarr server unhealthy, skipping scheduled start"
    # Send alert via existing Discord system
    send_discord_message "🎮 **Gaming Scheduler Alert**: Tdarr server unhealthy, skipping transcoding session"
fi
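
The shell gate above relies on the exit status of the health check. A Python equivalent is sketched below, assuming (as the `if` statement does) that tdarr_monitor.py exits non-zero when the server is unhealthy. The function name and the timeout value are hypothetical.

```python
import subprocess


def health_gate(check_cmd, start_cmd, timeout=60):
    """Run start_cmd only if check_cmd exits 0; return True when the start was attempted."""
    try:
        check = subprocess.run(check_cmd, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired):
        # A check that cannot run or hangs is treated as unhealthy.
        return False
    if check.returncode != 0:
        return False
    subprocess.run(start_cmd, check=False)
    return True


# Example invocation with the commands from the shell snippet above:
# health_gate(
#     ["python3", "tdarr_monitor.py", "--server", "http://10.10.0.43:8265", "--check", "health"],
#     ["./start-tdarr-gpu-podman-clean.sh"],
# )
```

Treating a timeout as unhealthy keeps a hung API call from blocking the scheduler indefinitely.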

Monitoring Metrics Available

Server Metrics: Uptime, version, system info, connectivity
Queue Metrics: Total items, processing count, queued count, completed count
Node Metrics: Online/offline status, worker counts, active jobs, heartbeat
Library Metrics: Total files, scan progress, library count, scan status
Performance Metrics: Total transcodes, space saved, processing speed, error rates

Integration with Other Systems

Current Monitoring Stack

Log-based Monitoring: Timeout and worker stall detection (15-minute polling)
API-based Monitoring: Health and performance metrics polled every 5 minutes
Discord Integration: Unified alerting for both monitoring methods
Gaming Scheduler: API health checks before transcoding sessions

Future Enhancements

Grafana Integration: Visualize API metrics (queue depth, processing rates, node performance)
Prometheus Metrics: Export API data for time-series analysis
Home Assistant: Integrate server status with home automation
Email Backup: Secondary notification method for critical API alerts
Metric Correlation: Combine log-based alerts with API performance data

Related Documentation

NAS Mount Configuration - SMB optimization context
Tdarr Troubleshooting - Worker timeout background
SSH Key Management - Server access setup
Tdarr Gaming Scheduler - Gaming-aware automation

Status: ✅ Log-based monitoring active, API monitoring script created
Last Updated: August 12, 2025
Next Review: September 12, 2025
Discord Channel: Homelab monitoring alerts configured and tested
API Endpoint: http://10.10.0.43:8265 (verified accessible)
