# Prometheus + Grafana Home Lab Monitoring Setup

## Overview

This document provides a complete setup for monitoring a Proxmox home lab with 8 Ubuntu Server VMs running Docker applications. The solution uses the Prometheus + Grafana + Alertmanager stack to provide comprehensive monitoring with custom metrics, alerting, and visualization.

## Architecture

### Components

- **Prometheus**: Time-series database for metrics collection (pull-based)
- **Grafana**: Web-based visualization and dashboards
- **Alertmanager**: Alert routing and notifications
- **Node Exporter**: System metrics (CPU, memory, disk, network)
- **cAdvisor**: Docker container metrics
- **Custom Exporters**: Application-specific metrics (transcodes, web server stats, etc.)

### Deployment Strategy

- **Main monitoring VM**: Runs Prometheus, Grafana, and Alertmanager
- **Each monitored VM**: Runs Node Exporter, cAdvisor, and any custom exporters
- **Proxmox host**: Runs the Proxmox VE exporter for hypervisor metrics

## Main Monitoring Stack Deployment

### Directory Structure

```
monitoring/
├── docker-compose.yml
├── prometheus/
│   ├── prometheus.yml
│   └── alerts.yml
├── alertmanager/
│   └── alertmanager.yml
├── grafana/
│   └── provisioning/
│       ├── datasources/
│       │   └── prometheus.yml
│       └── dashboards/
│           └── dashboard.yml
└── data/
    ├── prometheus/
    ├── grafana/
    └── alertmanager/
```

### Docker Compose Configuration

```yaml
version: '3.8'

networks:
  monitoring:
    driver: bridge

volumes:
  prometheus_data:
  grafana_data:
  alertmanager_data:

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    restart: unless-stopped
    ports:
      - "9090:9090"
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--storage.tsdb.retention.time=200h'
      - '--web.enable-lifecycle'
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus/alerts.yml:/etc/prometheus/alerts.yml
      - prometheus_data:/prometheus
    networks:
      - monitoring

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    restart: unless-stopped
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=admin123
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    networks:
      - monitoring

  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    restart: unless-stopped
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
      - alertmanager_data:/alertmanager
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
      - '--web.external-url=http://localhost:9093'
    networks:
      - monitoring
```

## Configuration Files

### Prometheus Configuration (prometheus/prometheus.yml)

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alerts.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node Exporters - update with your VM IPs
  - job_name: 'node-exporter'
    static_configs:
      - targets:
          - '192.168.1.101:9100'  # VM1
          - '192.168.1.102:9100'  # VM2
          - '192.168.1.103:9100'  # VM3
          - '192.168.1.104:9100'  # VM4
          - '192.168.1.105:9100'  # VM5
          - '192.168.1.106:9100'  # VM6
          - '192.168.1.107:9100'  # VM7
          - '192.168.1.108:9100'  # VM8

  # cAdvisor for Docker metrics
  - job_name: 'cadvisor'
    static_configs:
      - targets:
          - '192.168.1.101:8080'  # VM1
          - '192.168.1.102:8080'  # VM2
          - '192.168.1.103:8080'  # VM3
          - '192.168.1.104:8080'  # VM4
          - '192.168.1.105:8080'  # VM5
          - '192.168.1.106:8080'  # VM6
          - '192.168.1.107:8080'  # VM7
          - '192.168.1.108:8080'  # VM8

  # Proxmox VE Exporter - update with your Proxmox host IP
  - job_name: 'proxmox'
    metrics_path: /pve  # pve_exporter serves metrics at /pve, not /metrics
    static_configs:
      - targets: ['192.168.1.100:9221']  # Proxmox host

  # Custom application exporters
  - job_name: 'custom-apps'
    static_configs:
      - targets:
          - '192.168.1.101:9999'  # Custom app metrics
          - '192.168.1.102:9999'  # Media server metrics

  # Home Assistant Prometheus integration
  - job_name: 'homeassistant'
    scrape_interval: 30s
    metrics_path: /api/prometheus
    bearer_token: 'YOUR_LONG_LIVED_ACCESS_TOKEN'  # Generate in HA Profile settings
    static_configs:
      - targets: ['192.168.1.XXX:8123']  # Your Home Assistant IP

  # Home Assistant API exporter (alternative)
  - job_name: 'homeassistant-api'
    static_configs:
      - targets: ['192.168.1.XXX:9998']  # Custom HA exporter

  # HomeKit via MQTT bridge (if using)
  - job_name: 'homekit'
    static_configs:
      - targets: ['192.168.1.XXX:9997']  # Custom HomeKit exporter
```

### Alert Rules (prometheus/alerts.yml)

```yaml
groups:
  - name: system-alerts
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} has been down for more than 1 minute."

      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 80
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is above 80% for more than 2 minutes."

      - alert: HighMemoryUsage
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 80
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is above 80% for more than 2 minutes."

      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 20
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Disk space is below 20% on root filesystem."

  - name: docker-alerts
    rules:
      - alert: ContainerDown
        expr: absent(container_last_seen) or time() - container_last_seen > 60
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.name }} is down"
          description: "Container has been down for more than 1 minute."
```

### Alertmanager Configuration (alertmanager/alertmanager.yml)

```yaml
global:
  smtp_smarthost: 'localhost:587'
  smtp_from: 'alertmanager@yourdomain.com'  # Configure with your email settings

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'
  routes:
    - match:
        severity: critical
      receiver: 'critical-alerts'
    - match:
        severity: warning
      receiver: 'warning-alerts'

receivers:
  - name: 'web.hook'
    webhook_configs:
      - url: 'http://127.0.0.1:5001/'

  - name: 'critical-alerts'
    email_configs:
      - to: 'admin@yourdomain.com'
        subject: 'CRITICAL: {{ .GroupLabels.alertname }}'
        body: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          {{ end }}
    # Uncomment and configure for Discord/Slack webhooks
    # webhook_configs:
    #   - url: 'YOUR_DISCORD_WEBHOOK_URL'

  - name: 'warning-alerts'
    email_configs:
      - to: 'admin@yourdomain.com'
        subject: 'WARNING: {{ .GroupLabels.alertname }}'
        body: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          {{ end }}

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
```

### Grafana Datasource Provisioning (grafana/provisioning/datasources/prometheus.yml)

```yaml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
```

## VM Configuration

### Node Exporter and cAdvisor Deployment

Deploy this on each VM:

```yaml
# docker-compose.yml for each VM
version: '3.8'

services:
  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    restart: unless-stopped
    ports:
      - "9100:9100"
    command:
      - '--path.procfs=/host/proc'
      - '--path.rootfs=/rootfs'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    container_name: cadvisor
    restart: unless-stopped
    ports:
      - "8080:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
    devices:
      - /dev/kmsg
```

### Custom Metrics Example

For media server transcode monitoring:

```python
#!/usr/bin/env python3
# custom_exporter.py
from http.server import HTTPServer, BaseHTTPRequestHandler


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == '/metrics':
            metrics = self.get_metrics()
            self.send_response(200)
            self.send_header('Content-type', 'text/plain')
            self.end_headers()
            self.wfile.write(metrics.encode())
        else:
            self.send_response(404)
            self.end_headers()

    def get_metrics(self):
        metrics = []
        # Example: count completed/failed transcodes from a log file.
        # Read the lines once; iterating the exhausted file object a
        # second time would always yield zero matches.
        try:
            with open('/var/log/transcodes.log', 'r') as f:
                lines = f.readlines()
            completed = sum('COMPLETED' in line for line in lines)
            failed = sum('FAILED' in line for line in lines)
        except FileNotFoundError:
            completed = failed = 0
        metrics.append(f'transcodes_completed_total {completed}')
        metrics.append(f'transcodes_failed_total {failed}')
        # Add more custom metrics as needed
        return '\n'.join(metrics) + '\n'


if __name__ == '__main__':
    server = HTTPServer(('0.0.0.0', 9999), MetricsHandler)
    print("Custom metrics server running on port 9999")
    server.serve_forever()
```

Deploy as a systemd service or Docker container.
### Home Assistant Metrics Exporter

```python
#!/usr/bin/env python3
# homeassistant_exporter.py
from http.server import HTTPServer, BaseHTTPRequestHandler

import requests

HA_URL = "http://192.168.1.XXX:8123"  # Your HA IP
HA_TOKEN = "your-long-lived-access-token"  # Generate in HA Profile


class HAMetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == '/metrics':
            metrics = self.get_ha_metrics()
            self.send_response(200)
            self.send_header('Content-type', 'text/plain')
            self.end_headers()
            self.wfile.write(metrics.encode())
        else:
            self.send_response(404)
            self.end_headers()

    def get_ha_metrics(self):
        headers = {
            "Authorization": f"Bearer {HA_TOKEN}",
            "Content-Type": "application/json",
        }
        try:
            response = requests.get(f"{HA_URL}/api/states", headers=headers, timeout=10)
            entities = response.json()
        except Exception as e:
            return f"# Error fetching HA data: {e}\n"

        metrics = []
        metrics.append("# HELP homeassistant_entity_state Home Assistant entity states")
        metrics.append("# TYPE homeassistant_entity_state gauge")

        for entity in entities:
            entity_id = entity['entity_id']
            domain = entity_id.split('.')[0]
            state = entity['state']
            # Add labels for better organization
            labels = f'domain="{domain}",entity_id="{entity_id}"'
            try:
                # Try to convert to a numeric value
                value = float(state)
                metrics.append(f'homeassistant_entity_state{{{labels}}} {value}')
            except (ValueError, TypeError):
                # Handle boolean-like states
                if state.lower() in ['on', 'true', 'open', 'home']:
                    metrics.append(f'homeassistant_entity_state{{{labels}}} 1')
                elif state.lower() in ['off', 'false', 'closed', 'away']:
                    metrics.append(f'homeassistant_entity_state{{{labels}}} 0')
                else:
                    # For text states, create an info metric
                    info_labels = f'domain="{domain}",entity_id="{entity_id}",state="{state}"'
                    metrics.append(f'homeassistant_entity_info{{{info_labels}}} 1')

        return '\n'.join(metrics) + '\n'


if __name__ == '__main__':
    server = HTTPServer(('0.0.0.0', 9998), HAMetricsHandler)
    print("Home Assistant metrics exporter running on port 9998")
    server.serve_forever()
```

### HomeKit Bridge Exporter (Optional)

```python
#!/usr/bin/env python3
# homekit_exporter.py - requires homekit2mqtt or a similar bridge
import json
import threading
from http.server import HTTPServer, BaseHTTPRequestHandler

import paho.mqtt.client as mqtt

MQTT_BROKER = "192.168.1.XXX"  # Your MQTT broker
MQTT_PORT = 1883
HOMEKIT_TOPIC = "homekit/#"


class HomeKitData:
    def __init__(self):
        self.devices = {}
        self.client = mqtt.Client()
        self.client.on_connect = self.on_connect
        self.client.on_message = self.on_message

    def on_connect(self, client, userdata, flags, rc):
        print(f"Connected to MQTT broker with result code {rc}")
        client.subscribe(HOMEKIT_TOPIC)

    def on_message(self, client, userdata, msg):
        try:
            topic_parts = msg.topic.split('/')
            device_id = topic_parts[1] if len(topic_parts) > 1 else "unknown"
            characteristic = topic_parts[2] if len(topic_parts) > 2 else "state"
            value = json.loads(msg.payload.decode())
            if device_id not in self.devices:
                self.devices[device_id] = {}
            self.devices[device_id][characteristic] = value
        except Exception as e:
            print(f"Error processing MQTT message: {e}")


homekit_data = HomeKitData()


class HomeKitMetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == '/metrics':
            metrics = self.get_homekit_metrics()
            self.send_response(200)
            self.send_header('Content-type', 'text/plain')
            self.end_headers()
            self.wfile.write(metrics.encode())
        else:
            self.send_response(404)
            self.end_headers()

    def get_homekit_metrics(self):
        metrics = []
        metrics.append("# HELP homekit_device_state HomeKit device states")
        metrics.append("# TYPE homekit_device_state gauge")
        for device_id, characteristics in homekit_data.devices.items():
            for char_name, value in characteristics.items():
                labels = f'device_id="{device_id}",characteristic="{char_name}"'
                try:
                    numeric_value = float(value)
                    metrics.append(f'homekit_device_state{{{labels}}} {numeric_value}')
                except (ValueError, TypeError):
                    # Handle boolean values
                    if str(value).lower() in ['true', 'on']:
                        metrics.append(f'homekit_device_state{{{labels}}} 1')
                    elif str(value).lower() in ['false', 'off']:
                        metrics.append(f'homekit_device_state{{{labels}}} 0')
        return '\n'.join(metrics) + '\n'


def start_mqtt_client():
    homekit_data.client.connect(MQTT_BROKER, MQTT_PORT, 60)
    homekit_data.client.loop_forever()


if __name__ == '__main__':
    # Start the MQTT client in a background thread
    mqtt_thread = threading.Thread(target=start_mqtt_client)
    mqtt_thread.daemon = True
    mqtt_thread.start()

    # Start the HTTP server
    server = HTTPServer(('0.0.0.0', 9997), HomeKitMetricsHandler)
    print("HomeKit metrics exporter running on port 9997")
    server.serve_forever()
```

## Proxmox Host Configuration

Install the Proxmox VE exporter (prometheus-pve-exporter) on your Proxmox host. It is distributed as a Python package rather than a prebuilt binary:

```bash
# On the Proxmox host
pip install prometheus-pve-exporter

# Create a config file
cat > pve.yml << EOF
default:
  user: monitoring@pve
  password: your-password
  verify_ssl: false
EOF

# Run the exporter (listens on port 9221 by default; older releases
# take the config path as a positional argument instead)
pve_exporter --config.file=pve.yml
```

## Installation Steps

1. **Create a monitoring VM** with adequate resources (4 GB RAM and 20 GB disk recommended).

2. **Set up the main monitoring stack:**

   ```bash
   mkdir -p monitoring/{prometheus,alertmanager,grafana/provisioning/{datasources,dashboards}}
   cd monitoring
   # Copy all configuration files from above
   docker-compose up -d
   ```

3. **Deploy exporters on each VM:**

   ```bash
   # On each VM
   docker-compose -f node-cadvisor-compose.yml up -d
   ```

4. **Configure the Proxmox exporter** on the host.

5. **Set up the Home Assistant integration:**

   **Option A: Enable Prometheus in Home Assistant**

   ```yaml
   # Add to Home Assistant configuration.yaml
   prometheus:
     namespace: homeassistant
     filter:
       include_domains:
         - sensor
         - binary_sensor
         - switch
         - light
         - climate
         - weather
   ```

   **Option B: Deploy the custom HA exporter**

   ```bash
   # Create and run the Home Assistant exporter
   python3 homeassistant_exporter.py
   ```

6. **Optional: Set up the HomeKit integration:**

   ```bash
   # If using the homekit2mqtt bridge
   npm install -g homekit2mqtt
   homekit2mqtt --mqtt-url mqtt://your-mqtt-broker

   # Then run the HomeKit exporter
   python3 homekit_exporter.py
   ```

7. **Access the interfaces:**
   - Grafana: http://monitoring-vm-ip:3000 (admin/admin123)
   - Prometheus: http://monitoring-vm-ip:9090
   - Alertmanager: http://monitoring-vm-ip:9093

8. **Import dashboards** in Grafana:
   - Node Exporter Full (dashboard ID 1860)
   - Docker and system monitoring (dashboard ID 893)
   - Proxmox VE (dashboard ID 10347)
   - Home Assistant (dashboard ID 11021)

9. **Generate a Home Assistant long-lived access token:**
   - Go to HA Profile → Long-Lived Access Tokens
   - Create a new token and update the exporter configs

## Customization Notes

- Update all IP addresses in prometheus.yml to match your network
- Configure email settings in alertmanager.yml
- Adjust alert thresholds in alerts.yml based on your requirements
- Add custom exporters for application-specific monitoring
- Set up proper authentication and SSL before exposing anything beyond your LAN

## Maintenance

- Monitor disk space used by time-series data
- Take regular backups of configuration files
- Update container images periodically
- Review and tune alert rules based on false-positive rates

This setup provides comprehensive monitoring for your home lab, with room for expansion as your infrastructure grows.
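
On sizing that time-series disk usage: the `--storage.tsdb.retention.time=200h` flag above bounds how long Prometheus keeps data, and the space that retention consumes can be roughly estimated from the scrape interval and the number of active series. A back-of-the-envelope sketch follows; the series count is an assumption for illustration (query `prometheus_tsdb_head_series` on your instance for the real number), and compressed sample size in practice is typically 1 to 2 bytes:

```python
# Rough Prometheus TSDB disk estimate:
#   needed_disk ≈ retention_seconds * ingested_samples_per_second * bytes_per_sample

RETENTION_HOURS = 200     # matches --storage.tsdb.retention.time=200h above
SCRAPE_INTERVAL_S = 15    # matches scrape_interval: 15s above
ACTIVE_SERIES = 8_000     # assumption: ~1,000 series per VM across 8 VMs
BYTES_PER_SAMPLE = 2      # assumption: upper end of typical compressed size

samples_per_second = ACTIVE_SERIES / SCRAPE_INTERVAL_S
needed_bytes = RETENTION_HOURS * 3600 * samples_per_second * BYTES_PER_SAMPLE
print(f"Estimated TSDB size: {needed_bytes / 1024**3:.2f} GiB")
# → Estimated TSDB size: 0.72 GiB
```

Under these assumptions the 20 GB disk recommended above leaves ample headroom; rerun the estimate if you add many container-level or Home Assistant series.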