Prometheus + Grafana Home Lab Monitoring Setup
Overview
This document provides a complete setup for monitoring a Proxmox home lab with 8 Ubuntu Server VMs running Docker applications. The solution uses the Prometheus + Grafana + Alertmanager stack to provide comprehensive monitoring with custom metrics, alerting, and visualization.
Architecture
Components
- Prometheus: Time-series database for metrics collection (pull-based)
- Grafana: Web-based visualization and dashboards
- Alertmanager: Alert routing and notifications
- Node Exporter: System metrics (CPU, memory, disk, network)
- cAdvisor: Docker container metrics
- Custom Exporters: Application-specific metrics (transcodes, web server stats, etc.)
Deployment Strategy
- Main Monitoring VM: Runs Prometheus, Grafana, and Alertmanager
- Each Monitored VM: Runs Node Exporter, cAdvisor, and any custom exporters
- Proxmox Host: Runs Proxmox VE Exporter for hypervisor metrics
Main Monitoring Stack Deployment
Directory Structure
monitoring/
├── docker-compose.yml
├── prometheus/
│   ├── prometheus.yml
│   └── alerts.yml
├── alertmanager/
│   └── alertmanager.yml
├── grafana/
│   └── provisioning/
│       ├── datasources/
│       │   └── prometheus.yml
│       └── dashboards/
│           └── dashboard.yml
└── data/
    ├── prometheus/
    ├── grafana/
    └── alertmanager/
Docker Compose Configuration
version: '3.8'

networks:
  monitoring:
    driver: bridge

volumes:
  prometheus_data:
  grafana_data:
  alertmanager_data:

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    restart: unless-stopped
    ports:
      - "9090:9090"
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--storage.tsdb.retention.time=200h'
      - '--web.enable-lifecycle'
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus/alerts.yml:/etc/prometheus/alerts.yml
      - prometheus_data:/prometheus
    networks:
      - monitoring

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    restart: unless-stopped
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=admin123
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    networks:
      - monitoring

  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    restart: unless-stopped
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
      - alertmanager_data:/alertmanager
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
      - '--web.external-url=http://localhost:9093'
    networks:
      - monitoring
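After the configuration files in the next section are in place, the stack can be brought up and sanity-checked against each service's built-in health endpoint (assuming the default ports above):

docker-compose up -d

# Quick health checks
curl http://localhost:9090/-/healthy      # Prometheus
curl http://localhost:9093/-/healthy      # Alertmanager
curl http://localhost:3000/api/health     # Grafana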
Configuration Files
Prometheus Configuration (prometheus/prometheus.yml)
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alerts.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node Exporters - Update with your VM IPs
  - job_name: 'node-exporter'
    static_configs:
      - targets:
          - '192.168.1.101:9100'  # VM1
          - '192.168.1.102:9100'  # VM2
          - '192.168.1.103:9100'  # VM3
          - '192.168.1.104:9100'  # VM4
          - '192.168.1.105:9100'  # VM5
          - '192.168.1.106:9100'  # VM6
          - '192.168.1.107:9100'  # VM7
          - '192.168.1.108:9100'  # VM8

  # cAdvisor for Docker metrics
  - job_name: 'cadvisor'
    static_configs:
      - targets:
          - '192.168.1.101:8080'  # VM1
          - '192.168.1.102:8080'  # VM2
          - '192.168.1.103:8080'  # VM3
          - '192.168.1.104:8080'  # VM4
          - '192.168.1.105:8080'  # VM5
          - '192.168.1.106:8080'  # VM6
          - '192.168.1.107:8080'  # VM7
          - '192.168.1.108:8080'  # VM8

  # Proxmox VE Exporter - Update with your Proxmox host IP
  - job_name: 'proxmox'
    static_configs:
      - targets: ['192.168.1.100:9221']  # Proxmox host

  # Custom application exporters
  - job_name: 'custom-apps'
    static_configs:
      - targets:
          - '192.168.1.101:9999'  # Custom app metrics
          - '192.168.1.102:9999'  # Media server metrics

  # Home Assistant Prometheus integration
  - job_name: 'homeassistant'
    scrape_interval: 30s
    metrics_path: /api/prometheus
    bearer_token: 'YOUR_LONG_LIVED_ACCESS_TOKEN'  # Generate in HA Profile settings
    static_configs:
      - targets: ['192.168.1.XXX:8123']  # Your Home Assistant IP

  # Home Assistant API exporter (alternative)
  - job_name: 'homeassistant-api'
    static_configs:
      - targets: ['192.168.1.XXX:9998']  # Custom HA exporter

  # HomeKit via MQTT bridge (if using)
  - job_name: 'homekit'
    static_configs:
      - targets: ['192.168.1.XXX:9997']  # Custom HomeKit exporter
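Because --web.enable-lifecycle is set in the compose file above, the configuration can be validated and hot-reloaded without restarting the container (promtool ships inside the prom/prometheus image):

# Validate the configuration (also checks the referenced rule files)
docker exec prometheus promtool check config /etc/prometheus/prometheus.yml

# Reload Prometheus after editing prometheus.yml or alerts.yml
curl -X POST http://localhost:9090/-/reload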
Alert Rules (prometheus/alerts.yml)
groups:
  - name: system-alerts
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} has been down for more than 1 minute."

      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 80
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is above 80% for more than 2 minutes."

      - alert: HighMemoryUsage
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 80
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is above 80% for more than 2 minutes."

      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 20
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Disk space is below 20% on root filesystem."

  - name: docker-alerts
    rules:
      - alert: ContainerDown
        expr: absent(container_last_seen) or time() - container_last_seen > 60
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.name }} is down"
          description: "Container has been down for more than 1 minute."
Alertmanager Configuration (alertmanager/alertmanager.yml)
global:
  smtp_smarthost: 'localhost:587'
  smtp_from: 'alertmanager@yourdomain.com'
  # Configure with your email settings

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'
  routes:
    - match:
        severity: critical
      receiver: 'critical-alerts'
    - match:
        severity: warning
      receiver: 'warning-alerts'

receivers:
  - name: 'web.hook'
    webhook_configs:
      - url: 'http://127.0.0.1:5001/'

  - name: 'critical-alerts'
    email_configs:
      - to: 'admin@yourdomain.com'
        headers:
          Subject: 'CRITICAL: {{ .GroupLabels.alertname }}'
        text: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          {{ end }}
    # Uncomment and configure for Discord/Slack webhooks
    # webhook_configs:
    #   - url: 'YOUR_DISCORD_WEBHOOK_URL'

  - name: 'warning-alerts'
    email_configs:
      - to: 'admin@yourdomain.com'
        headers:
          Subject: 'WARNING: {{ .GroupLabels.alertname }}'
        text: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          {{ end }}

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
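The prom/alertmanager image includes amtool, which can validate this file and manage silences during maintenance windows. A quick sketch using the HighCPUUsage alert defined earlier (flag names are from recent amtool versions, check amtool --help on yours):

# Validate the Alertmanager configuration
docker exec alertmanager amtool check-config /etc/alertmanager/alertmanager.yml

# Silence an alert for two hours during planned maintenance
docker exec alertmanager amtool silence add alertname=HighCPUUsage \
  --alertmanager.url=http://localhost:9093 --duration=2h --comment="planned maintenance"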
Grafana Datasource Provisioning (grafana/provisioning/datasources/prometheus.yml)
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
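The directory layout above also lists grafana/provisioning/dashboards/dashboard.yml, which is not shown; a minimal file-based provider looks roughly like the sketch below. The /var/lib/grafana/dashboards path is an assumption and needs a matching volume mount for any dashboard JSON files you want auto-loaded:

apiVersion: 1

providers:
  - name: 'default'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30
    options:
      path: /var/lib/grafana/dashboards   # assumed path; place dashboard JSON files here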
VM Configuration
Node Exporter and cAdvisor Deployment
Deploy this on each VM:
# docker-compose.yml for each VM
version: '3.8'

services:
  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    restart: unless-stopped
    ports:
      - "9100:9100"
    command:
      - '--path.procfs=/host/proc'
      - '--path.rootfs=/rootfs'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    container_name: cadvisor
    restart: unless-stopped
    ports:
      - "8080:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
    devices:
      - /dev/kmsg
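Once both containers are running on a VM, the endpoints that Prometheus will scrape can be checked locally:

curl -s http://localhost:9100/metrics | head    # Node Exporter
curl -s http://localhost:8080/metrics | head    # cAdvisor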
Custom Metrics Example
For media server transcode monitoring:
#!/usr/bin/env python3
# custom_exporter.py
from http.server import HTTPServer, BaseHTTPRequestHandler


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == '/metrics':
            metrics = self.get_metrics()
            self.send_response(200)
            self.send_header('Content-type', 'text/plain')
            self.end_headers()
            self.wfile.write(metrics.encode())
        else:
            self.send_response(404)
            self.end_headers()

    def get_metrics(self):
        metrics = []
        # Example: count completed and failed transcodes from a log file
        try:
            with open('/var/log/transcodes.log', 'r') as f:
                lines = f.readlines()
            completed = len([line for line in lines if 'COMPLETED' in line])
            failed = len([line for line in lines if 'FAILED' in line])
            metrics.append(f'transcodes_completed_total {completed}')
            metrics.append(f'transcodes_failed_total {failed}')
        except FileNotFoundError:
            metrics.append('transcodes_completed_total 0')
            metrics.append('transcodes_failed_total 0')
        # Add more custom metrics as needed
        return '\n'.join(metrics) + '\n'


if __name__ == '__main__':
    server = HTTPServer(('0.0.0.0', 9999), MetricsHandler)
    print("Custom metrics server running on port 9999")
    server.serve_forever()
Deploy as a systemd service or Docker container.
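For the systemd route, a minimal unit might look like the following sketch; the /opt/exporters path and unit name are assumptions, so adjust them to wherever the script actually lives:

# /etc/systemd/system/custom-exporter.service (hypothetical path and name)
[Unit]
Description=Custom transcode metrics exporter
After=network-online.target

[Service]
ExecStart=/usr/bin/python3 /opt/exporters/custom_exporter.py
Restart=always

[Install]
WantedBy=multi-user.target

Enable it with systemctl daemon-reload followed by systemctl enable --now custom-exporter.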
Home Assistant Metrics Exporter
#!/usr/bin/env python3
# homeassistant_exporter.py
from http.server import HTTPServer, BaseHTTPRequestHandler

import requests

HA_URL = "http://192.168.1.XXX:8123"       # Your HA IP
HA_TOKEN = "your-long-lived-access-token"  # Generate in HA Profile


class HAMetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == '/metrics':
            metrics = self.get_ha_metrics()
            self.send_response(200)
            self.send_header('Content-type', 'text/plain')
            self.end_headers()
            self.wfile.write(metrics.encode())
        else:
            self.send_response(404)
            self.end_headers()

    def get_ha_metrics(self):
        headers = {
            "Authorization": f"Bearer {HA_TOKEN}",
            "Content-Type": "application/json"
        }
        try:
            response = requests.get(f"{HA_URL}/api/states", headers=headers, timeout=10)
            entities = response.json()
        except Exception as e:
            return f"# Error fetching HA data: {e}\n"

        metrics = []
        metrics.append("# HELP homeassistant_entity_state Home Assistant entity states")
        metrics.append("# TYPE homeassistant_entity_state gauge")
        for entity in entities:
            entity_id = entity['entity_id']
            domain = entity_id.split('.')[0]
            state = entity['state']
            # Add labels for better organization
            labels = f'domain="{domain}",entity_id="{entity_id}"'
            try:
                # Try to convert to a numeric value
                value = float(state)
                metrics.append(f'homeassistant_entity_state{{{labels}}} {value}')
            except (ValueError, TypeError):
                # Handle boolean-like states
                if state.lower() in ['on', 'true', 'open', 'home']:
                    metrics.append(f'homeassistant_entity_state{{{labels}}} 1')
                elif state.lower() in ['off', 'false', 'closed', 'away']:
                    metrics.append(f'homeassistant_entity_state{{{labels}}} 0')
                else:
                    # For text states, create an info metric
                    info_labels = f'domain="{domain}",entity_id="{entity_id}",state="{state}"'
                    metrics.append(f'homeassistant_entity_info{{{info_labels}}} 1')
        return '\n'.join(metrics) + '\n'


if __name__ == '__main__':
    server = HTTPServer(('0.0.0.0', 9998), HAMetricsHandler)
    print("Home Assistant metrics exporter running on port 9998")
    server.serve_forever()
HomeKit Bridge Exporter (Optional)
#!/usr/bin/env python3
# homekit_exporter.py - Requires homekit2mqtt or a similar bridge
import json
import threading
from http.server import HTTPServer, BaseHTTPRequestHandler

import paho.mqtt.client as mqtt

MQTT_BROKER = "192.168.1.XXX"  # Your MQTT broker
MQTT_PORT = 1883
HOMEKIT_TOPIC = "homekit/#"


class HomeKitData:
    def __init__(self):
        self.devices = {}
        self.client = mqtt.Client()
        self.client.on_connect = self.on_connect
        self.client.on_message = self.on_message

    def on_connect(self, client, userdata, flags, rc):
        print(f"Connected to MQTT broker with result code {rc}")
        client.subscribe(HOMEKIT_TOPIC)

    def on_message(self, client, userdata, msg):
        try:
            topic_parts = msg.topic.split('/')
            device_id = topic_parts[1] if len(topic_parts) > 1 else "unknown"
            characteristic = topic_parts[2] if len(topic_parts) > 2 else "state"
            value = json.loads(msg.payload.decode())
            if device_id not in self.devices:
                self.devices[device_id] = {}
            self.devices[device_id][characteristic] = value
        except Exception as e:
            print(f"Error processing MQTT message: {e}")


homekit_data = HomeKitData()


class HomeKitMetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == '/metrics':
            metrics = self.get_homekit_metrics()
            self.send_response(200)
            self.send_header('Content-type', 'text/plain')
            self.end_headers()
            self.wfile.write(metrics.encode())
        else:
            self.send_response(404)
            self.end_headers()

    def get_homekit_metrics(self):
        metrics = []
        metrics.append("# HELP homekit_device_state HomeKit device states")
        metrics.append("# TYPE homekit_device_state gauge")
        for device_id, characteristics in homekit_data.devices.items():
            for char_name, value in characteristics.items():
                labels = f'device_id="{device_id}",characteristic="{char_name}"'
                try:
                    numeric_value = float(value)
                    metrics.append(f'homekit_device_state{{{labels}}} {numeric_value}')
                except (ValueError, TypeError):
                    # Handle boolean values
                    if str(value).lower() in ['true', 'on']:
                        metrics.append(f'homekit_device_state{{{labels}}} 1')
                    elif str(value).lower() in ['false', 'off']:
                        metrics.append(f'homekit_device_state{{{labels}}} 0')
        return '\n'.join(metrics) + '\n'


def start_mqtt_client():
    homekit_data.client.connect(MQTT_BROKER, MQTT_PORT, 60)
    homekit_data.client.loop_forever()


if __name__ == '__main__':
    # Start the MQTT client in a background thread
    mqtt_thread = threading.Thread(target=start_mqtt_client)
    mqtt_thread.daemon = True
    mqtt_thread.start()

    # Start the HTTP server
    server = HTTPServer(('0.0.0.0', 9997), HomeKitMetricsHandler)
    print("HomeKit metrics exporter running on port 9997")
    server.serve_forever()
Proxmox Host Configuration
Install Proxmox VE Exporter on your Proxmox host:
# On the Proxmox host (the exporter is distributed as a Python package)
pip3 install prometheus-pve-exporter

# Create config file
cat > pve.yml << EOF
default:
  user: monitoring@pve
  password: your-password
  verify_ssl: false
EOF

# Run the exporter (listens on port 9221 by default; the flag syntax varies by version, see pve_exporter --help)
pve_exporter --config.file pve.yml
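The config above references a monitoring@pve user; on recent Proxmox releases it can be created and given read-only access roughly as follows (the exact pveum syntax and the PVEAuditor role assignment are a sketch, check pveum --help on your version):

# On the Proxmox host
pveum user add monitoring@pve --password 'your-password'
pveum acl modify / --users monitoring@pve --roles PVEAuditor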
Installation Steps
1. Create a monitoring VM with adequate resources (4 GB RAM, 20 GB disk recommended)

2. Set up the main monitoring stack:

   mkdir -p monitoring/{prometheus,alertmanager,grafana/provisioning/{datasources,dashboards}}
   cd monitoring
   # Copy all configuration files from above
   docker-compose up -d

3. Deploy exporters on each VM:

   # On each VM
   docker-compose -f node-cadvisor-compose.yml up -d

4. Configure the Proxmox exporter on the host

5. Set up Home Assistant integration:

   Option A: Enable the Prometheus integration in Home Assistant

   # Add to Home Assistant configuration.yaml
   prometheus:
     namespace: homeassistant
     filter:
       include_domains:
         - sensor
         - binary_sensor
         - switch
         - light
         - climate
         - weather

   Option B: Deploy the custom HA exporter

   # Create and run the Home Assistant exporter
   python3 homeassistant_exporter.py

6. Optional: Set up HomeKit integration

   # If using the homekit2mqtt bridge
   npm install -g homekit2mqtt
   homekit2mqtt --mqtt-url mqtt://your-mqtt-broker
   # Then run the HomeKit exporter
   python3 homekit_exporter.py

7. Access the interfaces:
   - Grafana: http://monitoring-vm-ip:3000 (admin/admin123)
   - Prometheus: http://monitoring-vm-ip:9090
   - Alertmanager: http://monitoring-vm-ip:9093

8. Import dashboards in Grafana:
   - Node Exporter Full (Dashboard ID: 1860)
   - Docker and system monitoring (Dashboard ID: 893)
   - Proxmox VE (Dashboard ID: 10347)
   - Home Assistant (Dashboard ID: 11021)

9. Generate a Home Assistant Long-Lived Access Token:
   - Go to HA Profile → Long-Lived Access Tokens
   - Create a new token and update the exporter configs
Customization Notes
- Update all IP addresses in prometheus.yml to match your network
- Configure email settings in alertmanager.yml
- Adjust alert thresholds in alerts.yml based on your requirements
- Add custom exporters for specific application monitoring
- Set up proper authentication and SSL for production use (see the sketch after this list)
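For the last point, recent versions of Prometheus, Node Exporter, and Alertmanager support basic auth and TLS through a web config file passed via --web.config.file. A minimal basic-auth sketch, assuming the htpasswd tool from apache2-utils is available:

# Generate a bcrypt hash for the password
htpasswd -nbB prometheus 'your-password'

# web-config.yml, passed to the service with --web.config.file=/etc/prometheus/web-config.yml
basic_auth_users:
  prometheus: <bcrypt-hash-from-above>

Scrape jobs protected this way also need a matching basic_auth block in prometheus.yml.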
Maintenance
- Monitor disk space usage for time-series data
- Regular backups of configuration files (see the example after this list)
- Update container images periodically
- Review and tune alert rules based on false positive rates
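For the backup and image-update items above, something along these lines, run from the monitoring directory, is usually enough:

# Back up the configuration (metric data lives in the named Docker volumes)
tar czf monitoring-config-$(date +%F).tar.gz docker-compose.yml prometheus/ alertmanager/ grafana/

# Pull newer images and recreate the containers
docker-compose pull && docker-compose up -d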
This setup provides comprehensive monitoring for your home lab with room for expansion as your infrastructure grows.