
Prometheus + Grafana Home Lab Monitoring Setup

Overview

This document provides a complete setup for monitoring a Proxmox home lab with 8 Ubuntu Server VMs running Docker applications. The solution uses the Prometheus + Grafana + Alertmanager stack to provide comprehensive monitoring with custom metrics, alerting, and visualization.

Architecture

Components

  • Prometheus: Time-series database for metrics collection (pull-based)
  • Grafana: Web-based visualization and dashboards
  • Alertmanager: Alert routing and notifications
  • Node Exporter: System metrics (CPU, memory, disk, network)
  • cAdvisor: Docker container metrics
  • Custom Exporters: Application-specific metrics (transcodes, web server stats, etc.)

Deployment Strategy

  • Main Monitoring VM: Runs Prometheus, Grafana, and Alertmanager
  • Each Monitored VM: Runs Node Exporter, cAdvisor, and any custom exporters
  • Proxmox Host: Runs Proxmox VE Exporter for hypervisor metrics

Main Monitoring Stack Deployment

Directory Structure

monitoring/
├── docker-compose.yml
├── prometheus/
│   ├── prometheus.yml
│   └── alerts.yml
├── alertmanager/
│   └── alertmanager.yml
├── grafana/
│   └── provisioning/
│       ├── datasources/
│       │   └── prometheus.yml
│       └── dashboards/
│           └── dashboard.yml
└── data/
    ├── prometheus/
    ├── grafana/
    └── alertmanager/

Docker Compose Configuration

version: '3.8'

networks:
  monitoring:
    driver: bridge

volumes:
  prometheus_data:
  grafana_data:
  alertmanager_data:

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    restart: unless-stopped
    ports:
      - "9090:9090"
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--storage.tsdb.retention.time=200h'
      - '--web.enable-lifecycle'
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus/alerts.yml:/etc/prometheus/alerts.yml
      - prometheus_data:/prometheus
    networks:
      - monitoring

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    restart: unless-stopped
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=admin123
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    networks:
      - monitoring

  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    restart: unless-stopped
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
      - alertmanager_data:/alertmanager
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
      - '--web.external-url=http://localhost:9093'
    networks:
      - monitoring

Configuration Files

Prometheus Configuration (prometheus/prometheus.yml)

global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alerts.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - alertmanager:9093

scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node Exporters - Update with your VM IPs
  - job_name: 'node-exporter'
    static_configs:
      - targets:
        - '192.168.1.101:9100'  # VM1
        - '192.168.1.102:9100'  # VM2
        - '192.168.1.103:9100'  # VM3
        - '192.168.1.104:9100'  # VM4
        - '192.168.1.105:9100'  # VM5
        - '192.168.1.106:9100'  # VM6
        - '192.168.1.107:9100'  # VM7
        - '192.168.1.108:9100'  # VM8

  # cAdvisor for Docker metrics
  - job_name: 'cadvisor'
    static_configs:
      - targets:
        - '192.168.1.101:8080'  # VM1
        - '192.168.1.102:8080'  # VM2
        - '192.168.1.103:8080'  # VM3
        - '192.168.1.104:8080'  # VM4
        - '192.168.1.105:8080'  # VM5
        - '192.168.1.106:8080'  # VM6
        - '192.168.1.107:8080'  # VM7
        - '192.168.1.108:8080'  # VM8

  # Proxmox VE Exporter - Update with your Proxmox host IP
  - job_name: 'proxmox'
    metrics_path: /pve  # prometheus-pve-exporter serves metrics at /pve, not the default /metrics
    static_configs:
      - targets: ['192.168.1.100:9221']  # Proxmox host

  # Custom application exporters
  - job_name: 'custom-apps'
    static_configs:
      - targets:
        - '192.168.1.101:9999'  # Custom app metrics
        - '192.168.1.102:9999'  # Media server metrics

  # Home Assistant Prometheus integration
  - job_name: 'homeassistant'
    scrape_interval: 30s
    metrics_path: /api/prometheus
    authorization:
      credentials: 'YOUR_LONG_LIVED_ACCESS_TOKEN'  # Generate in HA Profile settings (bearer_token is deprecated)
    static_configs:
      - targets: ['192.168.1.XXX:8123']  # Your Home Assistant IP

  # Home Assistant API exporter (alternative)
  - job_name: 'homeassistant-api'
    static_configs:
      - targets: ['192.168.1.XXX:9998']  # Custom HA exporter

  # HomeKit via MQTT bridge (if using)
  - job_name: 'homekit'
    static_configs:
      - targets: ['192.168.1.XXX:9997']  # Custom HomeKit exporter

Alert Rules (prometheus/alerts.yml)

groups:
  - name: system-alerts
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} has been down for more than 1 minute."

      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 80
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is above 80% for more than 2 minutes."

      - alert: HighMemoryUsage
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 80
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is above 80% for more than 2 minutes."

      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 20
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Disk space is below 20% on root filesystem."

  - name: docker-alerts
    rules:
      - alert: ContainerDown
        expr: time() - container_last_seen{name!=""} > 60
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.name }} is down"
          description: "Container has been down for more than 1 minute."
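The HighCPUUsage expression converts the cumulative idle counter into a busy percentage by subtracting the idle rate from 100. The same arithmetic over two raw counter samples, in plain Python (illustrative values only):

```python
# Busy % from two samples of node_cpu_seconds_total{mode="idle"}, mirroring
# 100 - rate(idle[2m]) * 100. The counter accumulates seconds spent idle.
def cpu_busy_percent(idle_start, idle_end, window_seconds):
    idle_rate = (idle_end - idle_start) / window_seconds  # idle sec per sec, 0..1
    return 100.0 - idle_rate * 100.0

# One core idle for 12 of the last 120 seconds -> 90% busy
busy = cpu_busy_percent(idle_start=1000.0, idle_end=1012.0, window_seconds=120)
print(busy)  # 90.0
```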

Alertmanager Configuration (alertmanager/alertmanager.yml)

global:
  smtp_smarthost: 'localhost:587'
  smtp_from: 'alertmanager@yourdomain.com'
  # Configure with your email settings

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'
  routes:
    - match:
        severity: critical
      receiver: 'critical-alerts'
    - match:
        severity: warning
      receiver: 'warning-alerts'

receivers:
  - name: 'web.hook'
    webhook_configs:
      - url: 'http://127.0.0.1:5001/'

  - name: 'critical-alerts'
    email_configs:
      - to: 'admin@yourdomain.com'
        subject: 'CRITICAL: {{ .GroupLabels.alertname }}'
        body: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          {{ end }}          
    # Uncomment and configure for Discord/Slack webhooks
    # webhook_configs:
    #   - url: 'YOUR_DISCORD_WEBHOOK_URL'

  - name: 'warning-alerts'
    email_configs:
      - to: 'admin@yourdomain.com'
        subject: 'WARNING: {{ .GroupLabels.alertname }}'
        body: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          {{ end }}          

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
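The inhibit rule above suppresses a warning alert whenever a critical alert with matching `alertname`, `dev`, and `instance` labels is firing. A simplified sketch of that semantics in plain Python (labels absent on both sides count as equal, as in Alertmanager):

```python
# Simplified inhibition check: a warning target is silenced if any firing
# critical source agrees on all `equal` labels.
def is_inhibited(target, firing, equal=('alertname', 'dev', 'instance')):
    """True if a critical alert matching target on the `equal` labels is firing."""
    if target.get('severity') != 'warning':
        return False
    return any(src.get('severity') == 'critical'
               and all(src.get(k) == target.get(k) for k in equal)
               for src in firing)

crit = {'alertname': 'HighCPUUsage', 'instance': 'vm1:9100', 'severity': 'critical'}
warn = {'alertname': 'HighCPUUsage', 'instance': 'vm1:9100', 'severity': 'warning'}
print(is_inhibited(warn, [crit]))  # True
```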

Grafana Datasource Provisioning (grafana/provisioning/datasources/prometheus.yml)

apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true

VM Configuration

Node Exporter and cAdvisor Deployment

Deploy this on each VM:

# node-cadvisor-compose.yml for each VM
version: '3.8'

services:
  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    restart: unless-stopped
    ports:
      - "9100:9100"
    command:
      - '--path.procfs=/host/proc'
      - '--path.rootfs=/rootfs'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    container_name: cadvisor
    restart: unless-stopped
    ports:
      - "8080:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
    devices:
      - /dev/kmsg
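Once the exporters are up, you can sanity-check an endpoint the way Prometheus does: GET `/metrics` and count the samples returned. The sketch below starts a stub exporter so it is self-contained; in practice, point `url` at `http://<vm-ip>:9100/metrics`.

```python
# Minimal scrape check against a local stub exporter (hypothetical sample data).
import threading
import urllib.request
from http.server import HTTPServer, BaseHTTPRequestHandler

SAMPLE = (b'node_cpu_seconds_total{mode="idle"} 123.4\n'
          b'node_memory_MemTotal_bytes 8.2e+09\n')

class StubExporter(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header('Content-type', 'text/plain')
        self.end_headers()
        self.wfile.write(SAMPLE)

    def log_message(self, *args):
        pass  # keep the demo output clean

def count_samples(url):
    """Count non-comment, non-blank lines (one sample each) at a /metrics URL."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        body = resp.read().decode()
    return sum(1 for line in body.splitlines()
               if line.strip() and not line.startswith('#'))

server = HTTPServer(('127.0.0.1', 0), StubExporter)
threading.Thread(target=server.serve_forever, daemon=True).start()
n = count_samples(f'http://127.0.0.1:{server.server_address[1]}/metrics')
server.shutdown()
print(n)  # 2
```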

Custom Metrics Example

For media server transcode monitoring:

#!/usr/bin/env python3
# custom_exporter.py
from http.server import HTTPServer, BaseHTTPRequestHandler

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == '/metrics':
            metrics = self.get_metrics()
            self.send_response(200)
            self.send_header('Content-type', 'text/plain')
            self.end_headers()
            self.wfile.write(metrics.encode())
        else:
            self.send_response(404)
            self.end_headers()

    def get_metrics(self):
        metrics = []
        
        # Example: Count completed transcodes from log file
        try:
            # Read the file once; a second pass over the same handle would see nothing
            with open('/var/log/transcodes.log', 'r') as f:
                lines = f.readlines()
            completed = sum(1 for line in lines if 'COMPLETED' in line)
            failed = sum(1 for line in lines if 'FAILED' in line)
            
            metrics.append(f'transcodes_completed_total {completed}')
            metrics.append(f'transcodes_failed_total {failed}')
        except FileNotFoundError:
            metrics.append('transcodes_completed_total 0')
            metrics.append('transcodes_failed_total 0')
        
        # Add more custom metrics as needed
        return '\n'.join(metrics) + '\n'

if __name__ == '__main__':
    server = HTTPServer(('0.0.0.0', 9999), MetricsHandler)
    print("Custom metrics server running on port 9999")
    server.serve_forever()

Deploy as a systemd service or Docker container.
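A minimal systemd unit for the exporter above might look like the following; the install path and service user are assumptions to adapt to your layout:

```ini
# /etc/systemd/system/custom-exporter.service (example path and user)
[Unit]
Description=Custom transcode metrics exporter
After=network-online.target

[Service]
ExecStart=/usr/bin/python3 /opt/exporters/custom_exporter.py
Restart=on-failure
User=nobody

[Install]
WantedBy=multi-user.target
```

Enable it with `systemctl daemon-reload && systemctl enable --now custom-exporter`.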

Home Assistant Metrics Exporter

#!/usr/bin/env python3
# homeassistant_exporter.py
import requests
from http.server import HTTPServer, BaseHTTPRequestHandler

HA_URL = "http://192.168.1.XXX:8123"  # Your HA IP
HA_TOKEN = "your-long-lived-access-token"  # Generate in HA Profile

class HAMetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == '/metrics':
            metrics = self.get_ha_metrics()
            self.send_response(200)
            self.send_header('Content-type', 'text/plain')
            self.end_headers()
            self.wfile.write(metrics.encode())
        else:
            self.send_response(404)
            self.end_headers()

    def get_ha_metrics(self):
        headers = {
            "Authorization": f"Bearer {HA_TOKEN}",
            "Content-Type": "application/json"
        }
        
        try:
            response = requests.get(f"{HA_URL}/api/states", headers=headers, timeout=10)
            entities = response.json()
        except Exception as e:
            return f"# Error fetching HA data: {e}\n"
        
        metrics = []
        metrics.append("# HELP homeassistant_entity_state Home Assistant entity states")
        metrics.append("# TYPE homeassistant_entity_state gauge")
        
        for entity in entities:
            entity_id = entity['entity_id']
            domain = entity_id.split('.')[0]
            state = entity['state']
            
            # Add labels for better organization
            labels = f'domain="{domain}",entity_id="{entity_id}"'
            
            try:
                # Try to convert to numeric value
                value = float(state)
                metrics.append(f'homeassistant_entity_state{{{labels}}} {value}')
            except (ValueError, TypeError):
                # Handle boolean states
                if state.lower() in ['on', 'true', 'open', 'home']:
                    metrics.append(f'homeassistant_entity_state{{{labels}}} 1')
                elif state.lower() in ['off', 'false', 'closed', 'away']:
                    metrics.append(f'homeassistant_entity_state{{{labels}}} 0')
                else:
                    # For text states, create info metric
                    info_labels = f'domain="{domain}",entity_id="{entity_id}",state="{state}"'
                    metrics.append(f'homeassistant_entity_info{{{info_labels}}} 1')
        
        return '\n'.join(metrics) + '\n'

if __name__ == '__main__':
    server = HTTPServer(('0.0.0.0', 9998), HAMetricsHandler)
    print("Home Assistant metrics exporter running on port 9998")
    server.serve_forever()
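The state-to-value mapping inside `get_ha_metrics` is the part most worth unit-testing: numeric strings become gauge values, common binary states map to 1/0, and anything else falls back to an info metric. Extracted as a standalone function:

```python
# The exporter's state mapping as a pure function (same conventions as above).
def state_to_value(state):
    """Map an HA state string to a gauge value, or None for free-text states."""
    try:
        return float(state)
    except (ValueError, TypeError):
        pass
    s = str(state).lower()
    if s in ('on', 'true', 'open', 'home'):
        return 1.0
    if s in ('off', 'false', 'closed', 'away'):
        return 0.0
    return None  # caller emits homeassistant_entity_info instead

print(state_to_value('21.5'), state_to_value('on'), state_to_value('away'))
```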

HomeKit Bridge Exporter (Optional)

#!/usr/bin/env python3
# homekit_exporter.py - Requires homekit2mqtt or similar bridge
import paho.mqtt.client as mqtt
import json
from http.server import HTTPServer, BaseHTTPRequestHandler
import threading

MQTT_BROKER = "192.168.1.XXX"  # Your MQTT broker
MQTT_PORT = 1883
HOMEKIT_TOPIC = "homekit/#"

class HomeKitData:
    def __init__(self):
        self.devices = {}
        self.client = mqtt.Client()  # paho-mqtt 1.x API; 2.x needs mqtt.Client(mqtt.CallbackAPIVersion.VERSION1)
        self.client.on_connect = self.on_connect
        self.client.on_message = self.on_message
        
    def on_connect(self, client, userdata, flags, rc):
        print(f"Connected to MQTT broker with result code {rc}")
        client.subscribe(HOMEKIT_TOPIC)
        
    def on_message(self, client, userdata, msg):
        try:
            topic_parts = msg.topic.split('/')
            device_id = topic_parts[1] if len(topic_parts) > 1 else "unknown"
            characteristic = topic_parts[2] if len(topic_parts) > 2 else "state"
            
            value = json.loads(msg.payload.decode())
            
            if device_id not in self.devices:
                self.devices[device_id] = {}
            self.devices[device_id][characteristic] = value
        except Exception as e:
            print(f"Error processing MQTT message: {e}")

homekit_data = HomeKitData()

class HomeKitMetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == '/metrics':
            metrics = self.get_homekit_metrics()
            self.send_response(200)
            self.send_header('Content-type', 'text/plain')
            self.end_headers()
            self.wfile.write(metrics.encode())
        else:
            self.send_response(404)
            self.end_headers()

    def get_homekit_metrics(self):
        metrics = []
        metrics.append("# HELP homekit_device_state HomeKit device states")
        metrics.append("# TYPE homekit_device_state gauge")
        
        for device_id, characteristics in homekit_data.devices.items():
            for char_name, value in characteristics.items():
                labels = f'device_id="{device_id}",characteristic="{char_name}"'
                try:
                    numeric_value = float(value)
                    metrics.append(f'homekit_device_state{{{labels}}} {numeric_value}')
                except (ValueError, TypeError):
                    # Handle boolean values
                    if str(value).lower() in ['true', 'on']:
                        metrics.append(f'homekit_device_state{{{labels}}} 1')
                    elif str(value).lower() in ['false', 'off']:
                        metrics.append(f'homekit_device_state{{{labels}}} 0')
        
        return '\n'.join(metrics) + '\n'

def start_mqtt_client():
    homekit_data.client.connect(MQTT_BROKER, MQTT_PORT, 60)
    homekit_data.client.loop_forever()

if __name__ == '__main__':
    # Start MQTT client in background thread
    mqtt_thread = threading.Thread(target=start_mqtt_client)
    mqtt_thread.daemon = True
    mqtt_thread.start()
    
    # Start HTTP server
    server = HTTPServer(('0.0.0.0', 9997), HomeKitMetricsHandler)
    print("HomeKit metrics exporter running on port 9997")
    server.serve_forever()
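The topic handling in `on_message` assumes a `homekit/<device>/<characteristic>` layout with defaults for shorter topics. That parsing, pulled out as a helper so the fallbacks are easy to verify:

```python
# Topic parsing used by on_message: homekit/<device_id>/<characteristic>,
# with 'unknown'/'state' fallbacks for shorter topics.
def parse_homekit_topic(topic):
    parts = topic.split('/')
    device_id = parts[1] if len(parts) > 1 else 'unknown'
    characteristic = parts[2] if len(parts) > 2 else 'state'
    return device_id, characteristic

print(parse_homekit_topic('homekit/living_room_lamp/On'))
```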

Proxmox Host Configuration

Install Proxmox VE Exporter on your Proxmox host:

# On Proxmox host (prometheus-pve-exporter ships as a Python package, not a prebuilt binary)
pip3 install prometheus-pve-exporter

# Create a read-only monitoring user for the exporter
pveum user add monitoring@pve --password your-password
pveum acl modify / --users monitoring@pve --roles PVEAuditor

# Create config file
cat > pve.yml << EOF
default:
  user: monitoring@pve
  password: your-password
  verify_ssl: false
EOF

# Run the exporter (listens on :9221 by default)
pve_exporter --config.file pve.yml

Installation Steps

  1. Create monitoring VM with adequate resources (4GB RAM, 20GB disk recommended)

  2. Set up main monitoring stack:

    mkdir -p monitoring/{prometheus,alertmanager,grafana/provisioning/{datasources,dashboards}}
    cd monitoring
    # Copy all configuration files from above
    docker-compose up -d
    
  3. Deploy exporters on each VM:

    # On each VM
    docker-compose -f node-cadvisor-compose.yml up -d
    
  4. Configure Proxmox exporter on the host

  5. Set up Home Assistant integration:

    Option A: Enable Prometheus in Home Assistant

    # Add to Home Assistant configuration.yaml
    prometheus:
      namespace: homeassistant
      filter:
        include_domains:
          - sensor
          - binary_sensor
          - switch
          - light
          - climate
          - weather
    

    Option B: Deploy custom HA exporter

    # Create and run the Home Assistant exporter
    python3 homeassistant_exporter.py
    
  6. Optional: Set up HomeKit integration

    # If using homekit2mqtt bridge
    npm install -g homekit2mqtt
    homekit2mqtt --mqtt-url mqtt://your-mqtt-broker
    
    # Then run the HomeKit exporter
    python3 homekit_exporter.py
    
  7. Access interfaces:

    • Prometheus: http://<monitoring-vm-ip>:9090
    • Grafana: http://<monitoring-vm-ip>:3000 (admin / admin123; change this immediately)
    • Alertmanager: http://<monitoring-vm-ip>:9093

  8. Import dashboards in Grafana:

    • Node Exporter Full (Dashboard ID: 1860)
    • Docker and system monitoring (Dashboard ID: 893)
    • Proxmox VE (Dashboard ID: 10347)
    • Home Assistant (Dashboard ID: 11021)
  9. Generate Home Assistant Long-Lived Access Token:

    • Go to HA Profile → Long-Lived Access Tokens
    • Create new token and update exporter configs

Customization Notes

  • Update all IP addresses in prometheus.yml to match your network
  • Configure email settings in alertmanager.yml
  • Adjust alert thresholds in alerts.yml based on your requirements
  • Add custom exporters for specific application monitoring
  • Set up proper authentication and SSL for production use

Maintenance

  • Monitor disk space usage for time-series data
  • Regular backups of configuration files
  • Update container images periodically
  • Review and tune alert rules based on false positive rates

This setup provides comprehensive monitoring for your home lab with room for expansion as your infrastructure grows.