---
title: "Server Diagnostics Architecture"
description: "Deployment and architecture docs for the CT 302 claude-runner two-tier health monitoring system. Covers n8n integration, cost model, repository layout, SSH auth, and adding new servers to monitoring."
type: reference
domain: monitoring
tags: [claude-runner, server-diagnostics, n8n, ssh, health-check, ct-302, gitea]
---

# Server Diagnostics — Deployment & Architecture

## Overview

Automated server health monitoring running on CT 302 (claude-runner, 10.10.0.148).
Two-tier system: Python health checks handle 99% of issues autonomously; Claude
is only invoked for complex failures that scripts can't resolve.

## Architecture

```
┌──────────────────────┐     ┌──────────────────────────────────┐
│  N8N (LXC 210)       │     │  CT 302 — claude-runner          │
│  10.10.0.210         │     │  10.10.0.148                     │
│                      │     │                                  │
│  ┌─────────────────┐ │ SSH │  ┌──────────────────────────┐    │
│  │ Cron: */15 min  │─┼─────┼─→│ health_check.py          │    │
│  │                 │ │     │  │ (exit 0/1/2)             │    │
│  │ Branch on exit: │ │     │  └──────────────────────────┘    │
│  │  0 → stop       │ │     │                                  │
│  │  1 → stop       │ │     │  ┌──────────────────────────┐    │
│  │  2 → invoke     │─┼─────┼─→│ claude --print           │    │
│  │      Claude     │ │     │  │ + client.py              │    │
│  └─────────────────┘ │     │  └──────────────────────────┘    │
│                      │     │                                  │
│  ┌─────────────────┐ │     │  SSH keys:                       │
│  │ Uptime Kuma     │ │     │  - homelab_rsa (→ target servers)│
│  │ webhook trigger │ │     │  - n8n_runner_key (← N8N)        │
│  └─────────────────┘ │     └──────────────────────────────────┘
└──────────────────────┘
         │ SSH to target servers
         ▼
┌────────────────┐  ┌────────────────┐  ┌────────────────┐
│ arr-stack      │  │ gitea          │  │ uptime-kuma    │
│ 10.10.0.221    │  │ 10.10.0.225    │  │ 10.10.0.227    │
│ Docker: sonarr │  │ systemd: gitea │  │ Docker: kuma   │
│   radarr, etc. │  │ Docker: runner │  │                │
└────────────────┘  └────────────────┘  └────────────────┘
```

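The branching on the health check's exit code can be sketched in Python as follows. This is only an illustration: the real branching lives in the N8N workflow, and the names `HealthExit` and `next_action` are hypothetical, not taken from the repository.

```python
from enum import IntEnum


class HealthExit(IntEnum):
    """Exit codes of health_check.py as described in this document."""

    HEALTHY = 0          # nothing to do
    AUTO_REMEDIATED = 1  # Tier 1 already handled the problem
    ESCALATE = 2         # Tier 1 could not resolve the failure


def next_action(exit_code: int) -> str:
    """Return the workflow branch for a health-check exit code.

    Hypothetical helper. Codes other than 0, 1 and 2 raise ValueError,
    because the document does not define them.
    """
    try:
        status = HealthExit(exit_code)
    except ValueError:
        raise ValueError(f"unexpected health_check.py exit code: {exit_code}") from None
    return "invoke_claude" if status is HealthExit.ESCALATE else "stop"


if __name__ == "__main__":
    for code in (0, 1, 2):
        print(code, "->", next_action(code))
```

Treating unknown exit codes as errors, rather than silently stopping, is a design choice of this sketch; the actual workflow may handle them differently.
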
## Cost Model

- **Exit 0** (healthy): $0 — pure Python, no API call
- **Exit 1** (auto-remediated): $0 — Python restarts container + Discord webhook
- **Exit 2** (escalation): ~$0.10-0.15 — Claude Sonnet invoked via `claude --print`

At 96 checks/day (every 15 min), typical cost is near $0 unless something
actually breaks and can't be auto-fixed.

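As a rough worked example of the arithmetic, the sketch below estimates daily spend from the number of escalations. The per-escalation range is the approximate figure quoted above; the function name `daily_cost_range` is hypothetical.

```python
CHECKS_PER_DAY = 24 * 60 // 15  # one run every 15 minutes -> 96

# Approximate per-escalation cost range quoted above, in USD.
ESCALATION_COST_LOW = 0.10
ESCALATION_COST_HIGH = 0.15


def daily_cost_range(escalations: int) -> tuple[float, float]:
    """Estimate daily API spend as (low, high) in USD.

    Only exit 2 runs cost anything; exit 0 and exit 1 runs are pure Python.
    """
    if not 0 <= escalations <= CHECKS_PER_DAY:
        raise ValueError("escalations must be between 0 and CHECKS_PER_DAY")
    return (escalations * ESCALATION_COST_LOW, escalations * ESCALATION_COST_HIGH)


if __name__ == "__main__":
    low, high = daily_cost_range(2)
    print(f"2 escalations/day: ${low:.2f} to ${high:.2f}")
```

Under these assumptions, even the worst case of all 96 runs escalating would come to roughly $9.60 to $14.40 per day.
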
## Repository

**Gitea:** `cal/claude-runner-monitoring`
**Deployed to:** `/root/.claude` on CT 302
**SSH alias:** `claude-runner` (root@10.10.0.148, defined in `~/.ssh/config`)
**Update method:** `ssh claude-runner "cd /root/.claude && git pull"`

### Git Auth on CT 302

CT 302 pushes to Gitea via HTTPS with a token auth header (embedded-credential URLs are rejected by Gitea). The token is stored locally in `~/.claude/secrets/claude_runner_monitoring_gitea_token` and configured on CT 302 via:

```
git config http.https://git.manticorum.com/.extraHeader 'Authorization: token <token>'
```

CT 302 does **not** have an SSH key registered with Gitea, so SSH git remotes won't work.

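If the header needs to be set from a script, the command above could be assembled like this. It is a minimal sketch: `extra_header_command` is a hypothetical helper that only builds the argument list, leaving the token lookup and the actual execution (for example via `subprocess.run`) to the caller.

```python
GITEA_URL = "https://git.manticorum.com/"  # host from the command above


def extra_header_command(token: str) -> list[str]:
    """Build the `git config` argv that sets the token auth header.

    Hypothetical helper. Rejects empty tokens and tokens containing
    whitespace so a malformed header is never written.
    """
    if not token or any(ch.isspace() for ch in token):
        raise ValueError("token must be non-empty and contain no whitespace")
    key = f"http.{GITEA_URL}.extraHeader"
    return ["git", "config", key, f"Authorization: token {token}"]


if __name__ == "__main__":
    # Placeholder value only; never print a real token.
    print(extra_header_command("EXAMPLE"))
```

Passing the command as a list rather than a shell string avoids quoting problems with the space inside the header value.
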
## Files

| File | Purpose |
|------|---------|
| `CLAUDE.md` | Runner-specific instructions for Claude |
| `settings.json` | Locked-down permissions (read-only + restart only) |
| `skills/server-diagnostics/health_check.py` | Tier 1: automated health checks |
| `skills/server-diagnostics/client.py` | Tier 2: Claude's diagnostic toolkit |
| `skills/server-diagnostics/notifier.py` | Discord webhook notifications |
| `skills/server-diagnostics/config.yaml` | Server inventory + security rules |
| `skills/server-diagnostics/SKILL.md` | Skill reference |
| `skills/server-diagnostics/CLAUDE.md` | Remediation methodology |

## Adding a New Server

1. Add entry to `config.yaml` under `servers:` with hostname, containers, etc.
2. Ensure CT 302 can SSH: `ssh -i /root/.ssh/homelab_rsa root@<ip> hostname`
3. Commit to Gitea, pull on CT 302
4. Add Uptime Kuma monitors if desired

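The SSH check in step 2 can be generated per server with a small helper. This is a sketch under assumptions: `ssh_check_command` is a hypothetical name, and validating the address with the standard `ipaddress` module is an addition of this example, not something the repository is known to do.

```python
import ipaddress
import shlex

SSH_KEY = "/root/.ssh/homelab_rsa"  # key path from step 2


def ssh_check_command(ip: str, user: str = "root") -> list[str]:
    """Build the step 2 reachability check for one target server.

    Raises ValueError if `ip` is not a valid IPv4 or IPv6 address.
    """
    ipaddress.ip_address(ip)  # validation only; the result is not used
    return ["ssh", "-i", SSH_KEY, f"{user}@{ip}", "hostname"]


if __name__ == "__main__":
    # 10.10.0.221 is the arr-stack address from the architecture diagram.
    print(shlex.join(ssh_check_command("10.10.0.221")))
```

Run on CT 302, the printed command is the same one shown in step 2 with `<ip>` filled in.
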
## Related

- [monitoring/CONTEXT.md](../CONTEXT.md) — Overall monitoring architecture
- [productivity/n8n/CONTEXT.md](../../productivity/n8n/CONTEXT.md) — N8N deployment
- Uptime Kuma status page: https://status.manticorum.com