# Prometheus + Grafana Home Lab Monitoring Setup

## Overview

This document provides a complete setup for monitoring a Proxmox home lab with 8 Ubuntu Server VMs running Docker applications. The solution uses the Prometheus + Grafana + Alertmanager stack to provide comprehensive monitoring with custom metrics, alerting, and visualization.

## Architecture

### Components

- **Prometheus**: Time-series database for metrics collection (pull-based; see the quick check below)
- **Grafana**: Web-based visualization and dashboards
- **Alertmanager**: Alert routing and notifications
- **Node Exporter**: System metrics (CPU, memory, disk, network)
- **cAdvisor**: Docker container metrics
- **Custom Exporters**: Application-specific metrics (transcodes, web server stats, etc.)

### Deployment Strategy

- **Main Monitoring VM**: Runs Prometheus, Grafana, and Alertmanager
- **Each Monitored VM**: Runs Node Exporter, cAdvisor, and any custom exporters
- **Proxmox Host**: Runs the Proxmox VE Exporter for hypervisor metrics

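Since every component here is scraped over plain HTTP, an exporter can be spot-checked with `curl` before it is wired into Prometheus. A quick sanity check, assuming the example addresses used in the scrape configuration later in this document:

```bash
# Node Exporter on a monitored VM (first sample lines only)
curl -s http://192.168.1.101:9100/metrics | head -n 5

# cAdvisor on the same VM
curl -s http://192.168.1.101:8080/metrics | head -n 5
```
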
## Main Monitoring Stack Deployment

### Directory Structure

```
monitoring/
├── docker-compose.yml
├── prometheus/
│   ├── prometheus.yml
│   └── alerts.yml
├── alertmanager/
│   └── alertmanager.yml
├── grafana/
│   └── provisioning/
│       ├── datasources/
│       │   └── prometheus.yml
│       └── dashboards/
│           └── dashboard.yml
└── data/
    ├── prometheus/
    ├── grafana/
    └── alertmanager/
```

### Docker Compose Configuration

```yaml
version: '3.8'

networks:
  monitoring:
    driver: bridge

volumes:
  prometheus_data:
  grafana_data:
  alertmanager_data:

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    restart: unless-stopped
    ports:
      - "9090:9090"
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--storage.tsdb.retention.time=200h'
      - '--web.enable-lifecycle'
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus/alerts.yml:/etc/prometheus/alerts.yml
      - prometheus_data:/prometheus
    networks:
      - monitoring

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    restart: unless-stopped
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=admin123
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    networks:
      - monitoring

  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    restart: unless-stopped
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
      - alertmanager_data:/alertmanager
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
      - '--web.external-url=http://localhost:9093'
    networks:
      - monitoring
```
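With the files in place, bring the stack up and confirm each service is healthy; Prometheus and Alertmanager expose a `/-/healthy` endpoint, and Grafana exposes `/api/health`:

```bash
docker-compose up -d

# Both should answer with a short healthy/OK message
curl -s http://localhost:9090/-/healthy
curl -s http://localhost:9093/-/healthy

# Grafana reports its database status as JSON
curl -s http://localhost:3000/api/health
```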

## Configuration Files

### Prometheus Configuration (prometheus/prometheus.yml)

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alerts.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node Exporters - update with your VM IPs
  - job_name: 'node-exporter'
    static_configs:
      - targets:
          - '192.168.1.101:9100'  # VM1
          - '192.168.1.102:9100'  # VM2
          - '192.168.1.103:9100'  # VM3
          - '192.168.1.104:9100'  # VM4
          - '192.168.1.105:9100'  # VM5
          - '192.168.1.106:9100'  # VM6
          - '192.168.1.107:9100'  # VM7
          - '192.168.1.108:9100'  # VM8

  # cAdvisor for Docker metrics
  - job_name: 'cadvisor'
    static_configs:
      - targets:
          - '192.168.1.101:8080'  # VM1
          - '192.168.1.102:8080'  # VM2
          - '192.168.1.103:8080'  # VM3
          - '192.168.1.104:8080'  # VM4
          - '192.168.1.105:8080'  # VM5
          - '192.168.1.106:8080'  # VM6
          - '192.168.1.107:8080'  # VM7
          - '192.168.1.108:8080'  # VM8

  # Proxmox VE Exporter - update with your Proxmox host IP
  - job_name: 'proxmox'
    metrics_path: /pve  # pve_exporter serves metrics on /pve, not /metrics
    static_configs:
      - targets: ['192.168.1.100:9221']  # Proxmox host

  # Custom application exporters
  - job_name: 'custom-apps'
    static_configs:
      - targets:
          - '192.168.1.101:9999'  # Custom app metrics
          - '192.168.1.102:9999'  # Media server metrics

  # Home Assistant Prometheus integration
  - job_name: 'homeassistant'
    scrape_interval: 30s
    metrics_path: /api/prometheus
    bearer_token: 'YOUR_LONG_LIVED_ACCESS_TOKEN'  # Generate in HA Profile settings
    static_configs:
      - targets: ['192.168.1.XXX:8123']  # Your Home Assistant IP

  # Home Assistant API exporter (alternative)
  - job_name: 'homeassistant-api'
    static_configs:
      - targets: ['192.168.1.XXX:9998']  # Custom HA exporter

  # HomeKit via MQTT bridge (if using)
  - job_name: 'homekit'
    static_configs:
      - targets: ['192.168.1.XXX:9997']  # Custom HomeKit exporter
```
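Because the stack starts Prometheus with `--web.enable-lifecycle`, edits to this file can be applied without restarting the container, and the HTTP API shows which targets are actually being scraped:

```bash
# Apply configuration changes after editing prometheus.yml
curl -X POST http://localhost:9090/-/reload

# List scrape targets and their health
curl -s http://localhost:9090/api/v1/targets | python3 -m json.tool | grep -E '"scrapeUrl"|"health"'
```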

### Alert Rules (prometheus/alerts.yml)

```yaml
groups:
  - name: system-alerts
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} has been down for more than 1 minute."

      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 80
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is above 80% for more than 2 minutes."

      - alert: HighMemoryUsage
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 80
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is above 80% for more than 2 minutes."

      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 20
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Disk space is below 20% on the root filesystem."

  - name: docker-alerts
    rules:
      - alert: ContainerDown
        expr: absent(container_last_seen) or time() - container_last_seen > 60
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.name }} is down"
          description: "Container has been down for more than 1 minute."
```
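A stray indent silently breaks a rule file, so it is worth validating with `promtool` (bundled in the `prom/prometheus` image) before reloading:

```bash
# Run from the monitoring/ directory
docker run --rm -v "$(pwd)/prometheus:/etc/prometheus" \
  --entrypoint promtool prom/prometheus:latest \
  check rules /etc/prometheus/alerts.yml
```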

### Alertmanager Configuration (alertmanager/alertmanager.yml)

Note that `email_configs` has no `subject:` or `body:` fields; the subject goes in the `headers` map and the body in `text` (or `html`).

```yaml
global:
  smtp_smarthost: 'localhost:587'
  smtp_from: 'alertmanager@yourdomain.com'
  # Configure with your email settings

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'
  routes:
    - match:
        severity: critical
      receiver: 'critical-alerts'
    - match:
        severity: warning
      receiver: 'warning-alerts'

receivers:
  - name: 'web.hook'
    webhook_configs:
      - url: 'http://127.0.0.1:5001/'

  - name: 'critical-alerts'
    email_configs:
      - to: 'admin@yourdomain.com'
        headers:
          Subject: 'CRITICAL: {{ .GroupLabels.alertname }}'
        text: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          {{ end }}
    # Uncomment and configure for Discord/Slack webhooks
    # webhook_configs:
    #   - url: 'YOUR_DISCORD_WEBHOOK_URL'

  - name: 'warning-alerts'
    email_configs:
      - to: 'admin@yourdomain.com'
        headers:
          Subject: 'WARNING: {{ .GroupLabels.alertname }}'
        text: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          {{ end }}

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
```
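Likewise, `amtool` (bundled in the `prom/alertmanager` image) can lint this file, and posting a synthetic alert to the v2 API exercises the routing tree end to end:

```bash
# Lint the configuration
docker run --rm -v "$(pwd)/alertmanager:/etc/alertmanager" \
  --entrypoint amtool prom/alertmanager:latest \
  check-config /etc/alertmanager/alertmanager.yml

# Fire a synthetic critical alert to test routing and delivery
curl -s -X POST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{"labels":{"alertname":"RoutingTest","severity":"critical"},"annotations":{"summary":"Synthetic test alert"}}]'
```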

### Grafana Datasource Provisioning (grafana/provisioning/datasources/prometheus.yml)

```yaml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
```
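The directory tree above also lists `grafana/provisioning/dashboards/dashboard.yml`, which is not shown elsewhere in this document. A minimal provider sketch, assuming dashboard JSON files are mounted into the container at `/var/lib/grafana/dashboards` (that path would need an extra volume on the grafana service):

```yaml
# grafana/provisioning/dashboards/dashboard.yml (minimal sketch)
apiVersion: 1

providers:
  - name: 'default'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    options:
      path: /var/lib/grafana/dashboards
```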

## VM Configuration

### Node Exporter and cAdvisor Deployment

Deploy the following on each monitored VM:

```yaml
# docker-compose.yml for each VM
version: '3.8'

services:
  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    restart: unless-stopped
    ports:
      - "9100:9100"
    command:
      - '--path.procfs=/host/proc'
      - '--path.rootfs=/rootfs'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    container_name: cadvisor
    restart: unless-stopped
    ports:
      - "8080:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
    devices:
      - /dev/kmsg
```
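After `docker-compose up -d` on a VM, both exporters can be checked locally before the VM is added to the Prometheus scrape configuration:

```bash
# Node Exporter: system metrics
curl -s http://localhost:9100/metrics | grep -m 3 '^node_'

# cAdvisor: per-container metrics
curl -s http://localhost:8080/metrics | grep -m 3 '^container_'
```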

### Custom Metrics Example

For media server transcode monitoring:

```python
#!/usr/bin/env python3
# custom_exporter.py
from http.server import HTTPServer, BaseHTTPRequestHandler

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == '/metrics':
            metrics = self.get_metrics()
            self.send_response(200)
            self.send_header('Content-type', 'text/plain')
            self.end_headers()
            self.wfile.write(metrics.encode())
        else:
            self.send_response(404)
            self.end_headers()

    def get_metrics(self):
        metrics = []

        # Example: count completed and failed transcodes from a log file.
        # Read the file once; iterating the file object twice would leave
        # the second pass with an exhausted iterator and a count of zero.
        try:
            with open('/var/log/transcodes.log', 'r') as f:
                lines = f.readlines()
            completed = sum(1 for line in lines if 'COMPLETED' in line)
            failed = sum(1 for line in lines if 'FAILED' in line)

            metrics.append(f'transcodes_completed_total {completed}')
            metrics.append(f'transcodes_failed_total {failed}')
        except FileNotFoundError:
            metrics.append('transcodes_completed_total 0')
            metrics.append('transcodes_failed_total 0')

        # Add more custom metrics as needed
        return '\n'.join(metrics) + '\n'

if __name__ == '__main__':
    server = HTTPServer(('0.0.0.0', 9999), MetricsHandler)
    print("Custom metrics server running on port 9999")
    server.serve_forever()
```

Deploy it as a systemd service or a Docker container; a minimal systemd sketch follows.

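A minimal unit file sketch, assuming the script is installed at `/opt/exporters/custom_exporter.py` (paths and user are illustrative):

```bash
sudo tee /etc/systemd/system/custom-exporter.service > /dev/null << 'EOF'
[Unit]
Description=Custom Prometheus exporter
After=network.target

[Service]
ExecStart=/usr/bin/python3 /opt/exporters/custom_exporter.py
Restart=on-failure
User=nobody

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now custom-exporter.service
```
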
### Home Assistant Metrics Exporter

```python
#!/usr/bin/env python3
# homeassistant_exporter.py
import requests
from http.server import HTTPServer, BaseHTTPRequestHandler

HA_URL = "http://192.168.1.XXX:8123"  # Your HA IP
HA_TOKEN = "your-long-lived-access-token"  # Generate in HA Profile

class HAMetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == '/metrics':
            metrics = self.get_ha_metrics()
            self.send_response(200)
            self.send_header('Content-type', 'text/plain')
            self.end_headers()
            self.wfile.write(metrics.encode())
        else:
            self.send_response(404)
            self.end_headers()

    def get_ha_metrics(self):
        headers = {
            "Authorization": f"Bearer {HA_TOKEN}",
            "Content-Type": "application/json"
        }

        try:
            response = requests.get(f"{HA_URL}/api/states", headers=headers, timeout=10)
            entities = response.json()
        except Exception as e:
            return f"# Error fetching HA data: {e}\n"

        metrics = []
        metrics.append("# HELP homeassistant_entity_state Home Assistant entity states")
        metrics.append("# TYPE homeassistant_entity_state gauge")

        for entity in entities:
            entity_id = entity['entity_id']
            domain = entity_id.split('.')[0]
            state = entity['state']

            # Add labels for better organization
            labels = f'domain="{domain}",entity_id="{entity_id}"'

            try:
                # Try to convert to a numeric value
                value = float(state)
                metrics.append(f'homeassistant_entity_state{{{labels}}} {value}')
            except (ValueError, TypeError):
                # Map common boolean-like states to 0/1
                if state.lower() in ['on', 'true', 'open', 'home']:
                    metrics.append(f'homeassistant_entity_state{{{labels}}} 1')
                elif state.lower() in ['off', 'false', 'closed', 'away']:
                    metrics.append(f'homeassistant_entity_state{{{labels}}} 0')
                else:
                    # For text states, emit an info-style metric instead
                    info_labels = f'domain="{domain}",entity_id="{entity_id}",state="{state}"'
                    metrics.append(f'homeassistant_entity_info{{{info_labels}}} 1')

        return '\n'.join(metrics) + '\n'

if __name__ == '__main__':
    server = HTTPServer(('0.0.0.0', 9998), HAMetricsHandler)
    print("Home Assistant metrics exporter running on port 9998")
    server.serve_forever()
```
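The exporter needs only the `requests` library; once it is running, its output can be checked the same way Prometheus will scrape it:

```bash
pip3 install requests
python3 homeassistant_exporter.py &

# Should print homeassistant_entity_state samples
curl -s http://localhost:9998/metrics | head -n 10
```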

### HomeKit Bridge Exporter (Optional)

```python
#!/usr/bin/env python3
# homekit_exporter.py - Requires homekit2mqtt or a similar bridge
import json
import threading
from http.server import HTTPServer, BaseHTTPRequestHandler

import paho.mqtt.client as mqtt

MQTT_BROKER = "192.168.1.XXX"  # Your MQTT broker
MQTT_PORT = 1883
HOMEKIT_TOPIC = "homekit/#"

class HomeKitData:
    def __init__(self):
        self.devices = {}
        self.client = mqtt.Client()
        self.client.on_connect = self.on_connect
        self.client.on_message = self.on_message

    def on_connect(self, client, userdata, flags, rc):
        print(f"Connected to MQTT broker with result code {rc}")
        client.subscribe(HOMEKIT_TOPIC)

    def on_message(self, client, userdata, msg):
        try:
            topic_parts = msg.topic.split('/')
            device_id = topic_parts[1] if len(topic_parts) > 1 else "unknown"
            characteristic = topic_parts[2] if len(topic_parts) > 2 else "state"

            value = json.loads(msg.payload.decode())

            if device_id not in self.devices:
                self.devices[device_id] = {}
            self.devices[device_id][characteristic] = value
        except Exception as e:
            print(f"Error processing MQTT message: {e}")

homekit_data = HomeKitData()

class HomeKitMetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == '/metrics':
            metrics = self.get_homekit_metrics()
            self.send_response(200)
            self.send_header('Content-type', 'text/plain')
            self.end_headers()
            self.wfile.write(metrics.encode())
        else:
            self.send_response(404)
            self.end_headers()

    def get_homekit_metrics(self):
        metrics = []
        metrics.append("# HELP homekit_device_state HomeKit device states")
        metrics.append("# TYPE homekit_device_state gauge")

        for device_id, characteristics in homekit_data.devices.items():
            for char_name, value in characteristics.items():
                labels = f'device_id="{device_id}",characteristic="{char_name}"'
                try:
                    numeric_value = float(value)
                    metrics.append(f'homekit_device_state{{{labels}}} {numeric_value}')
                except (ValueError, TypeError):
                    # Map boolean-like values to 0/1
                    if str(value).lower() in ['true', 'on']:
                        metrics.append(f'homekit_device_state{{{labels}}} 1')
                    elif str(value).lower() in ['false', 'off']:
                        metrics.append(f'homekit_device_state{{{labels}}} 0')

        return '\n'.join(metrics) + '\n'

def start_mqtt_client():
    homekit_data.client.connect(MQTT_BROKER, MQTT_PORT, 60)
    homekit_data.client.loop_forever()

if __name__ == '__main__':
    # Run the MQTT client in a background thread
    mqtt_thread = threading.Thread(target=start_mqtt_client)
    mqtt_thread.daemon = True
    mqtt_thread.start()

    # Serve metrics over HTTP
    server = HTTPServer(('0.0.0.0', 9997), HomeKitMetricsHandler)
    print("HomeKit metrics exporter running on port 9997")
    server.serve_forever()
```

## Proxmox Host Configuration

Install the community Proxmox VE exporter (prometheus-pve-exporter) on your Proxmox host:

```bash
# On the Proxmox host: the exporter is a Python package installed via pip
# (there is no standalone binary release to download)
apt install python3-pip
pip3 install prometheus-pve-exporter

# Create a read-only monitoring user in Proxmox first, e.g.:
#   pveum user add monitoring@pve --password <password>
#   pveum acl modify / --users monitoring@pve --roles PVEAuditor

# Create the config file
cat > pve.yml << EOF
default:
  user: monitoring@pve
  password: your-password
  verify_ssl: false
EOF

# Run the exporter (listens on port 9221 by default; older versions
# take the config path as a positional argument instead of a flag)
pve_exporter --config.file pve.yml

# Verify metrics (served on the /pve path)
curl -s http://localhost:9221/pve | head
```

## Installation Steps

1. **Create the monitoring VM** with adequate resources (4 GB RAM and 20 GB disk recommended)

2. **Set up the main monitoring stack:**
   ```bash
   mkdir -p monitoring/{prometheus,alertmanager,grafana/provisioning/{datasources,dashboards}}
   cd monitoring
   # Copy all configuration files from above
   docker-compose up -d
   ```

3. **Deploy exporters on each VM:**
   ```bash
   # On each VM
   docker-compose -f node-cadvisor-compose.yml up -d
   ```

4. **Configure the Proxmox exporter** on the host

5. **Set up the Home Assistant integration:**

   **Option A: Enable Prometheus in Home Assistant**
   ```yaml
   # Add to Home Assistant configuration.yaml
   prometheus:
     namespace: homeassistant
     filter:
       include_domains:
         - sensor
         - binary_sensor
         - switch
         - light
         - climate
         - weather
   ```

   **Option B: Deploy the custom HA exporter**
   ```bash
   # Create and run the Home Assistant exporter
   python3 homeassistant_exporter.py
   ```

6. **Optional: Set up the HomeKit integration**
   ```bash
   # If using the homekit2mqtt bridge
   npm install -g homekit2mqtt
   homekit2mqtt --mqtt-url mqtt://your-mqtt-broker

   # Then run the HomeKit exporter
   python3 homekit_exporter.py
   ```

7. **Access the interfaces:**
   - Grafana: http://monitoring-vm-ip:3000 (admin/admin123)
   - Prometheus: http://monitoring-vm-ip:9090
   - Alertmanager: http://monitoring-vm-ip:9093

8. **Import dashboards** in Grafana:
   - Node Exporter Full (Dashboard ID: 1860)
   - Docker and system monitoring (Dashboard ID: 893)
   - Proxmox VE (Dashboard ID: 10347)
   - Home Assistant (Dashboard ID: 11021)

9. **Generate a Home Assistant Long-Lived Access Token:**
   - Go to HA Profile → Long-Lived Access Tokens
   - Create a new token and update the exporter configs

## Customization Notes

- Update all IP addresses in prometheus.yml to match your network
- Configure email settings in alertmanager.yml
- Adjust alert thresholds in alerts.yml based on your requirements
- Add custom exporters for application-specific monitoring
- Set up proper authentication and SSL/TLS for production use

## Maintenance

- Monitor disk space used by the time-series data (see the check below)
- Back up configuration files regularly
- Update container images periodically
- Review and tune alert rules based on false-positive rates
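
Prometheus exports its own TSDB metrics, so disk growth can be watched from the same stack, and the configuration directories are small enough to archive with plain `tar`:

```bash
# Current on-disk size of TSDB blocks, in bytes
curl -s 'http://localhost:9090/api/v1/query?query=prometheus_tsdb_storage_blocks_bytes' \
  | python3 -m json.tool

# Snapshot the configuration (data volumes are excluded deliberately)
tar czf monitoring-config-$(date +%F).tar.gz prometheus/ alertmanager/ grafana/provisioning/
```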

This setup provides comprehensive monitoring for your home lab, with room to expand as your infrastructure grows.