claude-home/monitoring/server-diagnostics/CONTEXT.md
Cal Corum 193ae68f96
docs: document per-core load threshold policy for server health monitoring (#22)
Closes #22

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 13:35:23 -05:00


---
title: "Server Diagnostics Architecture"
description: "Deployment and architecture docs for the CT 302 claude-runner two-tier health monitoring system. Covers n8n integration, cost model, repository layout, SSH auth, and adding new servers to monitoring."
type: reference
domain: monitoring
tags: [claude-runner, server-diagnostics, n8n, ssh, health-check, ct-302, gitea]
---
# Server Diagnostics — Deployment & Architecture
## Overview
Automated server health monitoring running on CT 302 (claude-runner, 10.10.0.148).
Two-tier system: Python health checks handle 99% of issues autonomously; Claude
is only invoked for complex failures that scripts can't resolve.
## Architecture
```
┌──────────────────────┐        ┌──────────────────────────────────┐
│  N8N (LXC 210)       │        │  CT 302 — claude-runner          │
│  10.10.0.210         │        │  10.10.0.148                     │
│                      │        │                                  │
│ ┌──────────────────┐ │  SSH   │ ┌──────────────────────────┐     │
│ │ Cron: */15 min   │─┼────────┼→│ health_check.py          │     │
│ │                  │ │        │ │ (exit 0/1/2)             │     │
│ │ Branch on exit:  │ │        │ └──────────────────────────┘     │
│ │  0 → stop        │ │        │                                  │
│ │  1 → stop        │ │        │ ┌──────────────────────────┐     │
│ │  2 → invoke      │─┼────────┼→│ claude --print           │     │
│ │      Claude      │ │        │ │ + client.py              │     │
│ └──────────────────┘ │        │ └──────────────────────────┘     │
│                      │        │                                  │
│ ┌──────────────────┐ │        │ SSH keys:                        │
│ │ Uptime Kuma      │ │        │  - homelab_rsa (→ target servers)│
│ │ webhook trigger  │ │        │  - n8n_runner_key (← N8N)        │
│ └──────────────────┘ │        └──────────────────────────────────┘
└──────────────────────┘                 │ SSH to target servers
          ┌────────────────┐   ┌─────────┴──────┐   ┌────────────────┐
          │ arr-stack      │   │ gitea          │   │ uptime-kuma    │
          │ 10.10.0.221    │   │ 10.10.0.225    │   │ 10.10.0.227    │
          │ Docker: sonarr │   │ systemd: gitea │   │ Docker: kuma   │
          │ radarr, etc.   │   │ Docker: runner │   │                │
          └────────────────┘   └────────────────┘   └────────────────┘
```
## Cost Model
- **Exit 0** (healthy): $0 — pure Python, no API call
- **Exit 1** (auto-remediated): $0 — Python restarts container + Discord webhook
- **Exit 2** (escalation): ~$0.10-0.15 — Claude Sonnet invoked via `claude --print`

At 96 checks/day (every 15 min), typical cost is near $0 unless something actually breaks and can't be auto-fixed.
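The exit-code branching above can be sketched in Python. This is a hypothetical stand-in for the n8n workflow node, not its actual implementation; only the SSH alias and script path come from this doc, and the `route` helper name is illustrative:

```python
import subprocess

def run_health_check() -> int:
    """SSH to CT 302 and run the tier-1 checks; returns the script's exit code."""
    return subprocess.run(
        ["ssh", "claude-runner",
         "python3 /root/.claude/skills/server-diagnostics/health_check.py"],
    ).returncode

def route(exit_code: int) -> str:
    """Mirror the n8n branch: 0 and 1 stop the workflow, 2 escalates to Claude."""
    if exit_code in (0, 1):
        return "stop"           # healthy or already auto-remediated; no API cost
    if exit_code == 2:
        return "invoke-claude"  # run `claude --print` on CT 302 (~$0.10-0.15)
    return "unexpected"
```

Exit codes 0 and 1 take the same branch because in both cases the Python tier has already done everything needed (including the Discord webhook on 1); only 2 spends money.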
## Repository
- **Gitea:** `cal/claude-runner-monitoring`
- **Deployed to:** `/root/.claude` on CT 302
- **SSH alias:** `claude-runner` (root@10.10.0.148, defined in `~/.ssh/config`)
- **Update method:** `ssh claude-runner "cd /root/.claude && git pull"`
### Git Auth on CT 302
CT 302 pushes to Gitea via HTTPS with a token auth header (embedded-credential URLs are rejected by Gitea). The token is stored locally in `~/.claude/secrets/claude_runner_monitoring_gitea_token` and configured on CT 302 via:
```
git config http.https://git.manticorum.com/.extraHeader 'Authorization: token <token>'
```
CT 302 does **not** have an SSH key registered with Gitea, so SSH git remotes won't work.
## Files
| File | Purpose |
|------|---------|
| `CLAUDE.md` | Runner-specific instructions for Claude |
| `settings.json` | Locked-down permissions (read-only + restart only) |
| `skills/server-diagnostics/health_check.py` | Tier 1: automated health checks |
| `skills/server-diagnostics/client.py` | Tier 2: Claude's diagnostic toolkit |
| `skills/server-diagnostics/notifier.py` | Discord webhook notifications |
| `skills/server-diagnostics/config.yaml` | Server inventory + security rules |
| `skills/server-diagnostics/SKILL.md` | Skill reference |
| `skills/server-diagnostics/CLAUDE.md` | Remediation methodology |
## Adding a New Server
1. Add entry to `config.yaml` under `servers:` with hostname, containers, etc.
2. Ensure CT 302 can SSH: `ssh -i /root/.ssh/homelab_rsa root@<ip> hostname`
3. Commit to Gitea, pull on CT 302
4. Add Uptime Kuma monitors if desired
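The `config.yaml` schema isn't reproduced in this doc, so as a rough sketch, a new entry might look like the following. Field names are illustrative; mirror an existing entry for the real schema. The IP, key path, and container names are taken from the arr-stack example above:

```yaml
servers:
  arr-stack:                        # hypothetical entry shape
    host: 10.10.0.221
    ssh_key: /root/.ssh/homelab_rsa
    containers:
      - sonarr
      - radarr
```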
## Health Check Thresholds
Thresholds are evaluated in `health_check.py`. All load thresholds use **per-core** metrics
to avoid false positives from LXC containers (which see the Proxmox host's aggregate load).
### Load Average
| Metric | Value | Rationale |
|--------|-------|-----------|
| `LOAD_WARN_PER_CORE` | `0.7` | Elevated — investigate if sustained |
| `LOAD_CRIT_PER_CORE` | `1.0` | Saturated — CPU is a bottleneck |
| Sample window | 5-minute | Filters transient spikes (not 1-minute) |
**Formula**: `load_per_core = load_5m / nproc`

**Why per-core?** Proxmox LXC containers see the host's aggregate load average through the shared kernel. A 32-core Proxmox host at load 9 is at 0.28/core (healthy), but a naive absolute threshold of 2× the container's core count (8 for a 4-core LXC) would fire on that same load of 9. Dividing `load_5m` by `nproc`, which reports the host's visible core count, yields the correct ratio in both cases.

**Validation examples**:
- Proxmox host: load 9 / 32 cores = 0.28/core → no alert ✓
- VM 116 at 0.75/core → warning ✓ (above 0.7 threshold)
- VM at 1.1/core → critical ✓
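The per-core rule can be sketched as follows. This is a minimal illustration, not the actual `health_check.py` code; the constant and function names are hypothetical:

```python
import os

LOAD_WARN_PER_CORE = 0.7   # elevated: investigate if sustained
LOAD_CRIT_PER_CORE = 1.0   # saturated: CPU is a bottleneck

def classify_load(load_5m: float, cores: int) -> str:
    """Classify a 5-minute load average as 'ok', 'warn', or 'crit' per core."""
    per_core = load_5m / cores
    if per_core >= LOAD_CRIT_PER_CORE:
        return "crit"
    if per_core >= LOAD_WARN_PER_CORE:
        return "warn"
    return "ok"

# Validation examples from the table above:
print(classify_load(9.0, 32))  # Proxmox host at 0.28/core
print(classify_load(3.0, 4))   # 0.75/core
print(classify_load(4.4, 4))   # 1.1/core
```

In the real script the inputs would come from the target host (e.g. `/proc/loadavg` and `nproc` over SSH), so the hard-coded arguments here are for illustration only.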
### Other Thresholds
| Check | Threshold | Notes |
|-------|-----------|-------|
| Zombie processes | 5 | Single zombies are transient noise; alert only if ≥ 5 |
| Swap usage | 30% of total swap | Percentage-based to handle varied swap sizes across hosts |
| Disk warning | 85% | |
| Disk critical | 95% | |
| Memory | 90% | |
| Uptime | n/a (informational) | Non-urgent Discord post, not a page-level alert |
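The remaining thresholds are simple comparisons. A hedged sketch, with illustrative names (the real constants and metric collection live in `health_check.py`):

```python
# Threshold constants from the table above; names are hypothetical.
DISK_WARN_PCT = 85
DISK_CRIT_PCT = 95
MEM_CRIT_PCT = 90
SWAP_CRIT_PCT = 30
ZOMBIE_ALERT_MIN = 5

def disk_status(used_pct: float) -> str:
    """Classify disk usage percentage as 'ok', 'warn', or 'crit'."""
    if used_pct >= DISK_CRIT_PCT:
        return "crit"
    if used_pct >= DISK_WARN_PCT:
        return "warn"
    return "ok"

def zombies_alert(count: int) -> bool:
    # Single zombies are transient noise; alert only at 5 or more.
    return count >= ZOMBIE_ALERT_MIN

def swap_alert(used_kib: int, total_kib: int) -> bool:
    # Percentage-based so varied swap sizes across hosts compare fairly.
    return total_kib > 0 and used_kib / total_kib * 100 >= SWAP_CRIT_PCT
```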
## Related
- [monitoring/CONTEXT.md](../CONTEXT.md) — Overall monitoring architecture
- [productivity/n8n/CONTEXT.md](../../productivity/n8n/CONTEXT.md) — N8N deployment
- Uptime Kuma status page: https://status.manticorum.com