Prometheus + Grafana Home Lab Monitoring Setup
Overview
This document provides a complete setup for monitoring a Proxmox home lab with 8 Ubuntu Server VMs running Docker applications. The solution uses the Prometheus + Grafana + Alertmanager stack to provide comprehensive monitoring with custom metrics, alerting, and visualization.
Architecture
Components
- Prometheus: Time-series database for metrics collection (pull-based)
- Grafana: Web-based visualization and dashboards
- Alertmanager: Alert routing and notifications
- Node Exporter: System metrics (CPU, memory, disk, network)
- cAdvisor: Docker container metrics
- Custom Exporters: Application-specific metrics (transcodes, web server stats, etc.)
Deployment Strategy
- Main Monitoring VM: Runs Prometheus, Grafana, and Alertmanager
- Each Monitored VM: Runs Node Exporter, cAdvisor, and any custom exporters
- Proxmox Host: Runs Proxmox VE Exporter for hypervisor metrics
Main Monitoring Stack Deployment
Directory Structure
monitoring/
├── docker-compose.yml
├── prometheus/
│   ├── prometheus.yml
│   └── alerts.yml
├── alertmanager/
│   └── alertmanager.yml
├── grafana/
│   └── provisioning/
│       ├── datasources/
│       │   └── prometheus.yml
│       └── dashboards/
│           └── dashboard.yml
└── data/
    ├── prometheus/
    ├── grafana/
    └── alertmanager/
Docker Compose Configuration
version: '3.8'

networks:
  monitoring:
    driver: bridge

volumes:
  prometheus_data:
  grafana_data:
  alertmanager_data:

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    restart: unless-stopped
    ports:
      - "9090:9090"
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--storage.tsdb.retention.time=200h'
      - '--web.enable-lifecycle'
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus/alerts.yml:/etc/prometheus/alerts.yml
      - prometheus_data:/prometheus
    networks:
      - monitoring

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    restart: unless-stopped
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=admin123
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    networks:
      - monitoring

  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    restart: unless-stopped
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
      - alertmanager_data:/alertmanager
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
      - '--web.external-url=http://localhost:9093'
    networks:
      - monitoring
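After the configuration files in the next section are in place, the stack can be brought up and sanity-checked against each service's built-in health endpoint (assuming the default ports above):

docker-compose up -d

# Quick health checks
curl http://localhost:9090/-/healthy      # Prometheus
curl http://localhost:9093/-/healthy      # Alertmanager
curl http://localhost:3000/api/health     # Grafana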
Configuration Files
Prometheus Configuration (prometheus/prometheus.yml)
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alerts.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node Exporters - Update with your VM IPs
  - job_name: 'node-exporter'
    static_configs:
      - targets:
          - '192.168.1.101:9100'  # VM1
          - '192.168.1.102:9100'  # VM2
          - '192.168.1.103:9100'  # VM3
          - '192.168.1.104:9100'  # VM4
          - '192.168.1.105:9100'  # VM5
          - '192.168.1.106:9100'  # VM6
          - '192.168.1.107:9100'  # VM7
          - '192.168.1.108:9100'  # VM8

  # cAdvisor for Docker metrics
  - job_name: 'cadvisor'
    static_configs:
      - targets:
          - '192.168.1.101:8080'  # VM1
          - '192.168.1.102:8080'  # VM2
          - '192.168.1.103:8080'  # VM3
          - '192.168.1.104:8080'  # VM4
          - '192.168.1.105:8080'  # VM5
          - '192.168.1.106:8080'  # VM6
          - '192.168.1.107:8080'  # VM7
          - '192.168.1.108:8080'  # VM8

  # Proxmox VE Exporter - Update with your Proxmox host IP
  - job_name: 'proxmox'
    static_configs:
      - targets: ['192.168.1.100:9221']  # Proxmox host

  # Custom application exporters
  - job_name: 'custom-apps'
    static_configs:
      - targets:
          - '192.168.1.101:9999'  # Custom app metrics
          - '192.168.1.102:9999'  # Media server metrics

  # Home Assistant Prometheus integration
  - job_name: 'homeassistant'
    scrape_interval: 30s
    metrics_path: /api/prometheus
    bearer_token: 'YOUR_LONG_LIVED_ACCESS_TOKEN'  # Generate in HA Profile settings
    static_configs:
      - targets: ['192.168.1.XXX:8123']  # Your Home Assistant IP

  # Home Assistant API exporter (alternative)
  - job_name: 'homeassistant-api'
    static_configs:
      - targets: ['192.168.1.XXX:9998']  # Custom HA exporter

  # HomeKit via MQTT bridge (if using)
  - job_name: 'homekit'
    static_configs:
      - targets: ['192.168.1.XXX:9997']  # Custom HomeKit exporter
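Because --web.enable-lifecycle is set in the compose file above, the configuration can be validated and hot-reloaded without restarting the container (promtool ships inside the prom/prometheus image):

# Validate the configuration (also checks the referenced rule files)
docker exec prometheus promtool check config /etc/prometheus/prometheus.yml

# Reload Prometheus after editing prometheus.yml or alerts.yml
curl -X POST http://localhost:9090/-/reload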
Alert Rules (prometheus/alerts.yml)
groups:
  - name: system-alerts
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} has been down for more than 1 minute."

      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 80
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is above 80% for more than 2 minutes."

      - alert: HighMemoryUsage
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 80
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is above 80% for more than 2 minutes."

      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 20
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Disk space is below 20% on root filesystem."

  - name: docker-alerts
    rules:
      - alert: ContainerDown
        expr: absent(container_last_seen) or time() - container_last_seen > 60
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.name }} is down"
          description: "Container has been down for more than 1 minute."
Alertmanager Configuration (alertmanager/alertmanager.yml)
global:
  smtp_smarthost: 'localhost:587'
  smtp_from: 'alertmanager@yourdomain.com'
  # Configure with your email settings

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'
  routes:
    - match:
        severity: critical
      receiver: 'critical-alerts'
    - match:
        severity: warning
      receiver: 'warning-alerts'

receivers:
  - name: 'web.hook'
    webhook_configs:
      - url: 'http://127.0.0.1:5001/'

  - name: 'critical-alerts'
    email_configs:
      - to: 'admin@yourdomain.com'
        headers:
          Subject: 'CRITICAL: {{ .GroupLabels.alertname }}'
        text: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          {{ end }}
    # Uncomment and configure for Discord/Slack webhooks
    # webhook_configs:
    #   - url: 'YOUR_DISCORD_WEBHOOK_URL'

  - name: 'warning-alerts'
    email_configs:
      - to: 'admin@yourdomain.com'
        headers:
          Subject: 'WARNING: {{ .GroupLabels.alertname }}'
        text: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          {{ end }}

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
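The prom/alertmanager image includes amtool, which can validate this file and manage silences during maintenance windows. A quick sketch using the HighCPUUsage alert defined earlier (flag names are from recent amtool versions, check amtool --help on yours):

# Validate the Alertmanager configuration
docker exec alertmanager amtool check-config /etc/alertmanager/alertmanager.yml

# Silence an alert for two hours during planned maintenance
docker exec alertmanager amtool silence add alertname=HighCPUUsage \
  --alertmanager.url=http://localhost:9093 --duration=2h --comment="planned maintenance"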
Grafana Datasource Provisioning (grafana/provisioning/datasources/prometheus.yml)
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
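The directory layout above also lists grafana/provisioning/dashboards/dashboard.yml, which is not shown; a minimal file-based provider looks roughly like the sketch below. The /var/lib/grafana/dashboards path is an assumption and needs a matching volume mount for any dashboard JSON files you want auto-loaded:

apiVersion: 1

providers:
  - name: 'default'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30
    options:
      path: /var/lib/grafana/dashboards   # assumed path; place dashboard JSON files here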
VM Configuration
Node Exporter and cAdvisor Deployment
Deploy this on each VM:
# docker-compose.yml for each VM
version: '3.8'

services:
  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    restart: unless-stopped
    ports:
      - "9100:9100"
    command:
      - '--path.procfs=/host/proc'
      - '--path.rootfs=/rootfs'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    container_name: cadvisor
    restart: unless-stopped
    ports:
      - "8080:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
    devices:
      - /dev/kmsg
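Once both containers are running on a VM, the endpoints that Prometheus will scrape can be checked locally:

curl -s http://localhost:9100/metrics | head    # Node Exporter
curl -s http://localhost:8080/metrics | head    # cAdvisor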
Custom Metrics Example
For media server transcode monitoring:
#!/usr/bin/env python3
# custom_exporter.py
from http.server import HTTPServer, BaseHTTPRequestHandler


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == '/metrics':
            metrics = self.get_metrics()
            self.send_response(200)
            self.send_header('Content-type', 'text/plain')
            self.end_headers()
            self.wfile.write(metrics.encode())
        else:
            self.send_response(404)
            self.end_headers()

    def get_metrics(self):
        metrics = []
        # Example: count completed and failed transcodes from a log file
        try:
            with open('/var/log/transcodes.log', 'r') as f:
                lines = f.readlines()
            completed = len([line for line in lines if 'COMPLETED' in line])
            failed = len([line for line in lines if 'FAILED' in line])
            metrics.append(f'transcodes_completed_total {completed}')
            metrics.append(f'transcodes_failed_total {failed}')
        except FileNotFoundError:
            metrics.append('transcodes_completed_total 0')
            metrics.append('transcodes_failed_total 0')
        # Add more custom metrics as needed
        return '\n'.join(metrics) + '\n'


if __name__ == '__main__':
    server = HTTPServer(('0.0.0.0', 9999), MetricsHandler)
    print("Custom metrics server running on port 9999")
    server.serve_forever()
Deploy as a systemd service or Docker container.
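For the systemd route, a minimal unit might look like the following sketch; the /opt/exporters path and unit name are assumptions, so adjust them to wherever the script actually lives:

# /etc/systemd/system/custom-exporter.service (hypothetical path and name)
[Unit]
Description=Custom transcode metrics exporter
After=network-online.target

[Service]
ExecStart=/usr/bin/python3 /opt/exporters/custom_exporter.py
Restart=always

[Install]
WantedBy=multi-user.target

Enable it with systemctl daemon-reload followed by systemctl enable --now custom-exporter.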
Home Assistant Metrics Exporter
#!/usr/bin/env python3
# homeassistant_exporter.py
from http.server import HTTPServer, BaseHTTPRequestHandler

import requests

HA_URL = "http://192.168.1.XXX:8123"       # Your HA IP
HA_TOKEN = "your-long-lived-access-token"  # Generate in HA Profile


class HAMetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == '/metrics':
            metrics = self.get_ha_metrics()
            self.send_response(200)
            self.send_header('Content-type', 'text/plain')
            self.end_headers()
            self.wfile.write(metrics.encode())
        else:
            self.send_response(404)
            self.end_headers()

    def get_ha_metrics(self):
        headers = {
            "Authorization": f"Bearer {HA_TOKEN}",
            "Content-Type": "application/json"
        }
        try:
            response = requests.get(f"{HA_URL}/api/states", headers=headers, timeout=10)
            entities = response.json()
        except Exception as e:
            return f"# Error fetching HA data: {e}\n"

        metrics = []
        metrics.append("# HELP homeassistant_entity_state Home Assistant entity states")
        metrics.append("# TYPE homeassistant_entity_state gauge")
        for entity in entities:
            entity_id = entity['entity_id']
            domain = entity_id.split('.')[0]
            state = entity['state']
            # Add labels for better organization
            labels = f'domain="{domain}",entity_id="{entity_id}"'
            try:
                # Try to convert to a numeric value
                value = float(state)
                metrics.append(f'homeassistant_entity_state{{{labels}}} {value}')
            except (ValueError, TypeError):
                # Handle boolean-like states
                if state.lower() in ['on', 'true', 'open', 'home']:
                    metrics.append(f'homeassistant_entity_state{{{labels}}} 1')
                elif state.lower() in ['off', 'false', 'closed', 'away']:
                    metrics.append(f'homeassistant_entity_state{{{labels}}} 0')
                else:
                    # For text states, create an info metric
                    info_labels = f'domain="{domain}",entity_id="{entity_id}",state="{state}"'
                    metrics.append(f'homeassistant_entity_info{{{info_labels}}} 1')
        return '\n'.join(metrics) + '\n'


if __name__ == '__main__':
    server = HTTPServer(('0.0.0.0', 9998), HAMetricsHandler)
    print("Home Assistant metrics exporter running on port 9998")
    server.serve_forever()
HomeKit Bridge Exporter (Optional)
#!/usr/bin/env python3
# homekit_exporter.py - Requires homekit2mqtt or a similar bridge
import json
import threading
from http.server import HTTPServer, BaseHTTPRequestHandler

import paho.mqtt.client as mqtt

MQTT_BROKER = "192.168.1.XXX"  # Your MQTT broker
MQTT_PORT = 1883
HOMEKIT_TOPIC = "homekit/#"


class HomeKitData:
    def __init__(self):
        self.devices = {}
        self.client = mqtt.Client()
        self.client.on_connect = self.on_connect
        self.client.on_message = self.on_message

    def on_connect(self, client, userdata, flags, rc):
        print(f"Connected to MQTT broker with result code {rc}")
        client.subscribe(HOMEKIT_TOPIC)

    def on_message(self, client, userdata, msg):
        try:
            topic_parts = msg.topic.split('/')
            device_id = topic_parts[1] if len(topic_parts) > 1 else "unknown"
            characteristic = topic_parts[2] if len(topic_parts) > 2 else "state"
            value = json.loads(msg.payload.decode())
            if device_id not in self.devices:
                self.devices[device_id] = {}
            self.devices[device_id][characteristic] = value
        except Exception as e:
            print(f"Error processing MQTT message: {e}")


homekit_data = HomeKitData()


class HomeKitMetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == '/metrics':
            metrics = self.get_homekit_metrics()
            self.send_response(200)
            self.send_header('Content-type', 'text/plain')
            self.end_headers()
            self.wfile.write(metrics.encode())
        else:
            self.send_response(404)
            self.end_headers()

    def get_homekit_metrics(self):
        metrics = []
        metrics.append("# HELP homekit_device_state HomeKit device states")
        metrics.append("# TYPE homekit_device_state gauge")
        for device_id, characteristics in homekit_data.devices.items():
            for char_name, value in characteristics.items():
                labels = f'device_id="{device_id}",characteristic="{char_name}"'
                try:
                    numeric_value = float(value)
                    metrics.append(f'homekit_device_state{{{labels}}} {numeric_value}')
                except (ValueError, TypeError):
                    # Handle boolean values
                    if str(value).lower() in ['true', 'on']:
                        metrics.append(f'homekit_device_state{{{labels}}} 1')
                    elif str(value).lower() in ['false', 'off']:
                        metrics.append(f'homekit_device_state{{{labels}}} 0')
        return '\n'.join(metrics) + '\n'


def start_mqtt_client():
    homekit_data.client.connect(MQTT_BROKER, MQTT_PORT, 60)
    homekit_data.client.loop_forever()


if __name__ == '__main__':
    # Start the MQTT client in a background thread
    mqtt_thread = threading.Thread(target=start_mqtt_client)
    mqtt_thread.daemon = True
    mqtt_thread.start()

    # Start the HTTP server
    server = HTTPServer(('0.0.0.0', 9997), HomeKitMetricsHandler)
    print("HomeKit metrics exporter running on port 9997")
    server.serve_forever()
Proxmox Host Configuration
Install Proxmox VE Exporter on your Proxmox host:
# On the Proxmox host (the exporter is distributed as a Python package)
pip3 install prometheus-pve-exporter

# Create config file
cat > pve.yml << EOF
default:
  user: monitoring@pve
  password: your-password
  verify_ssl: false
EOF

# Run the exporter (listens on port 9221 by default; the flag syntax varies by version, see pve_exporter --help)
pve_exporter --config.file pve.yml
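The config above references a monitoring@pve user; on recent Proxmox releases it can be created and given read-only access roughly as follows (the exact pveum syntax and the PVEAuditor role assignment are a sketch, check pveum --help on your version):

# On the Proxmox host
pveum user add monitoring@pve --password 'your-password'
pveum acl modify / --users monitoring@pve --roles PVEAuditor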
Installation Steps
1. Create a monitoring VM with adequate resources (4 GB RAM, 20 GB disk recommended)

2. Set up the main monitoring stack:

   mkdir -p monitoring/{prometheus,alertmanager,grafana/provisioning/{datasources,dashboards}}
   cd monitoring
   # Copy all configuration files from above
   docker-compose up -d

3. Deploy exporters on each VM:

   # On each VM
   docker-compose -f node-cadvisor-compose.yml up -d

4. Configure the Proxmox exporter on the host

5. Set up Home Assistant integration:

   Option A: Enable the Prometheus integration in Home Assistant

   # Add to Home Assistant configuration.yaml
   prometheus:
     namespace: homeassistant
     filter:
       include_domains:
         - sensor
         - binary_sensor
         - switch
         - light
         - climate
         - weather

   Option B: Deploy the custom HA exporter

   # Create and run the Home Assistant exporter
   python3 homeassistant_exporter.py

6. Optional: Set up HomeKit integration

   # If using the homekit2mqtt bridge
   npm install -g homekit2mqtt
   homekit2mqtt --mqtt-url mqtt://your-mqtt-broker
   # Then run the HomeKit exporter
   python3 homekit_exporter.py

7. Access the interfaces:
   - Grafana: http://monitoring-vm-ip:3000 (admin/admin123)
   - Prometheus: http://monitoring-vm-ip:9090
   - Alertmanager: http://monitoring-vm-ip:9093

8. Import dashboards in Grafana:
   - Node Exporter Full (Dashboard ID: 1860)
   - Docker and system monitoring (Dashboard ID: 893)
   - Proxmox VE (Dashboard ID: 10347)
   - Home Assistant (Dashboard ID: 11021)

9. Generate a Home Assistant Long-Lived Access Token:
   - Go to HA Profile → Long-Lived Access Tokens
   - Create a new token and update the exporter configs
Customization Notes
- Update all IP addresses in prometheus.yml to match your network
- Configure email settings in alertmanager.yml
- Adjust alert thresholds in alerts.yml based on your requirements
- Add custom exporters for specific application monitoring
- Set up proper authentication and SSL for production use (see the sketch after this list)
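For the last point, recent versions of Prometheus, Node Exporter, and Alertmanager support basic auth and TLS through a web config file passed via --web.config.file. A minimal basic-auth sketch, assuming the htpasswd tool from apache2-utils is available:

# Generate a bcrypt hash for the password
htpasswd -nbB prometheus 'your-password'

# web-config.yml, passed to the service with --web.config.file=/etc/prometheus/web-config.yml
basic_auth_users:
  prometheus: <bcrypt-hash-from-above>

Scrape jobs protected this way also need a matching basic_auth block in prometheus.yml.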
Maintenance
- Monitor disk space usage for time-series data
- Regular backups of configuration files (see the example after this list)
- Update container images periodically
- Review and tune alert rules based on false positive rates
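For the backup and image-update items above, something along these lines, run from the monitoring directory, is usually enough:

# Back up the configuration (metric data lives in the named Docker volumes)
tar czf monitoring-config-$(date +%F).tar.gz docker-compose.yml prometheus/ alertmanager/ grafana/

# Pull newer images and recreate the containers
docker-compose pull && docker-compose up -d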
This setup provides comprehensive monitoring for your home lab with room for expansion as your infrastructure grows.