claude-home/monitoring/server-diagnostics/CONTEXT.md
Cal Corum 4b7eca8a46
docs: add YAML frontmatter to all 151 markdown files
Adds title, description, type, domain, and tags frontmatter to every
doc for improved KB semantic search. The description field is prepended
to every search chunk, and domain/type/tags enable filtered queries.

Type values: context, guide, runbook, reference, troubleshooting
Domain values match directory structure (networking, docker, etc.)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 09:00:44 -05:00

title: Server Diagnostics Architecture
description: Deployment and architecture docs for the CT 302 claude-runner two-tier health monitoring system. Covers n8n integration, cost model, repository layout, SSH auth, and adding new servers to monitoring.
type: reference
domain: monitoring
tags: claude-runner, server-diagnostics, n8n, ssh, health-check, ct-302, gitea

Server Diagnostics — Deployment & Architecture

Overview

Automated server health monitoring runs on CT 302 (claude-runner, 10.10.0.148). It is a two-tier system: Python health checks handle 99% of issues autonomously, and Claude is invoked only for complex failures that the scripts can't resolve.

Architecture

┌──────────────────────┐     ┌──────────────────────────────────┐
│  N8N (LXC 210)       │     │  CT 302 — claude-runner          │
│  10.10.0.210         │     │  10.10.0.148                     │
│                      │     │                                  │
│  ┌─────────────────┐ │ SSH │  ┌──────────────────────────┐   │
│  │ Cron: */15 min  │─┼─────┼─→│ health_check.py          │   │
│  │                 │ │     │  │ (exit 0/1/2)             │   │
│  │ Branch on exit: │ │     │  └──────────────────────────┘   │
│  │  0 → stop       │ │     │                                  │
│  │  1 → stop       │ │     │  ┌──────────────────────────┐   │
│  │  2 → invoke     │─┼─────┼─→│ claude --print            │   │
│  │     Claude      │ │     │  │ + client.py               │   │
│  └─────────────────┘ │     │  └──────────────────────────┘   │
│                      │     │                                  │
│  ┌─────────────────┐ │     │  SSH keys:                       │
│  │ Uptime Kuma     │ │     │  - homelab_rsa (→ target servers)│
│  │ webhook trigger │ │     │  - n8n_runner_key (← N8N)        │
│  └─────────────────┘ │     └──────────────────────────────────┘
└──────────────────────┘
         │ SSH to target servers
         ▼
┌────────────────┐  ┌────────────────┐  ┌────────────────┐
│ arr-stack      │  │ gitea          │  │ uptime-kuma    │
│ 10.10.0.221    │  │ 10.10.0.225    │  │ 10.10.0.227    │
│ Docker: sonarr │  │ systemd: gitea │  │ Docker: kuma   │
│ radarr, etc.   │  │ Docker: runner │  │                │
└────────────────┘  └────────────────┘  └────────────────┘
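
The branch on the exit code shown in the diagram can be sketched in Python. This is only an illustration of the decision logic: the real branching lives in the n8n workflow, and the names HealthExit and next_action are hypothetical. Treating any code other than 0, 1, or 2 as an error is also an assumption.

```python
from enum import IntEnum


class HealthExit(IntEnum):
    """Exit codes documented for health_check.py."""
    HEALTHY = 0          # nothing to do
    AUTO_REMEDIATED = 1  # Python already fixed the issue and notified Discord
    ESCALATE = 2         # scripts could not resolve the issue


def next_action(exit_code: int) -> str:
    """Return the workflow branch to take for a health_check.py exit code."""
    if exit_code == HealthExit.ESCALATE:
        return "invoke_claude"
    if exit_code in (HealthExit.HEALTHY, HealthExit.AUTO_REMEDIATED):
        return "stop"
    # Assumption: anything outside 0/1/2 is unexpected and should fail loudly
    # rather than silently stop or silently spend money on an escalation.
    raise ValueError(f"unexpected health_check.py exit code: {exit_code}")
```

Only the ESCALATE branch reaches Claude; both other branches end the run without an API call.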

Cost Model

  • Exit 0 (healthy): $0 — pure Python, no API call
  • Exit 1 (auto-remediated): $0 — Python restarts container + Discord webhook
  • Exit 2 (escalation): ~$0.10-0.15 — Claude Sonnet invoked via claude --print

At 96 checks/day (every 15 min), typical cost is near $0 unless something actually breaks and can't be auto-fixed.
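
The arithmetic behind these figures can be checked with a short sketch. It assumes the only cost is the per-escalation range quoted above and that every escalation costs the same. The function name daily_cost_range is hypothetical.

```python
from typing import Tuple

# One scheduled run every 15 minutes.
CHECKS_PER_DAY = (24 * 60) // 15

# Approximate cost of one exit-2 escalation, in USD (low, high).
ESCALATION_COST_USD = (0.10, 0.15)


def daily_cost_range(escalations: int) -> Tuple[float, float]:
    """Estimate the daily spend for a given number of exit-2 runs.

    Exit 0 and exit 1 runs are pure Python and cost nothing, so only
    escalations contribute.
    """
    if escalations < 0:
        raise ValueError("escalations must be non-negative")
    low, high = ESCALATION_COST_USD
    return (round(escalations * low, 2), round(escalations * high, 2))
```

Under these assumptions a day with two escalations costs about $0.20 to $0.30, and a day with none costs nothing regardless of how many checks ran.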

Repository

Gitea: cal/claude-runner-monitoring
Deployed to: /root/.claude on CT 302
SSH alias: claude-runner (root@10.10.0.148, defined in ~/.ssh/config)
Update method: ssh claude-runner "cd /root/.claude && git pull"

Git Auth on CT 302

CT 302 pushes to Gitea via HTTPS with a token auth header (embedded-credential URLs are rejected by Gitea). The token is stored locally in ~/.claude/secrets/claude_runner_monitoring_gitea_token and configured on CT 302 via:

git config http.https://git.manticorum.com/.extraHeader 'Authorization: token <token>'

CT 302 does not have an SSH key registered with Gitea, so SSH git remotes won't work.
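
The constraints above (HTTPS only, no credentials in the URL, token passed as a header) can be expressed as a small pre-flight check. This is a sketch, not part of the repository: remote_url_problem and extra_header_config_cmd are hypothetical helpers, and the example remote URLs in the usage note are assumed from the host and repository name given in this document.

```python
from typing import List, Optional
from urllib.parse import urlsplit

GITEA_BASE_URL = "https://git.manticorum.com/"


def remote_url_problem(url: str) -> Optional[str]:
    """Return why a git remote URL will not work from CT 302, or None if it looks fine."""
    parts = urlsplit(url)
    if parts.scheme != "https":
        # Covers ssh:// URLs and scp-style git@host:path remotes.
        return "not an HTTPS remote (CT 302 has no SSH key registered with Gitea)"
    if parts.username or parts.password:
        return "embedded credentials in the URL (rejected by Gitea)"
    return None


def extra_header_config_cmd(token: str) -> List[str]:
    """Build the git config command as an argument list, avoiding shell quoting."""
    return [
        "git",
        "config",
        f"http.{GITEA_BASE_URL}.extraHeader",
        f"Authorization: token {token}",
    ]
```

For example, remote_url_problem("git@git.manticorum.com:cal/claude-runner-monitoring.git") reports the SSH problem, while the plain HTTPS form of the same remote returns None. The argument list from extra_header_config_cmd could be passed to subprocess.run on CT 302.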

Files

File                                        Purpose
CLAUDE.md                                   Runner-specific instructions for Claude
settings.json                               Locked-down permissions (read-only + restart only)
skills/server-diagnostics/health_check.py   Tier 1: automated health checks
skills/server-diagnostics/client.py         Tier 2: Claude's diagnostic toolkit
skills/server-diagnostics/notifier.py       Discord webhook notifications
skills/server-diagnostics/config.yaml       Server inventory + security rules
skills/server-diagnostics/SKILL.md          Skill reference
skills/server-diagnostics/CLAUDE.md         Remediation methodology

Adding a New Server

  1. Add an entry to config.yaml under servers: with hostname, containers, etc.
  2. Ensure CT 302 can SSH to the new server: ssh -i /root/.ssh/homelab_rsa root@<ip> hostname
  3. Commit the change to Gitea, then pull on CT 302
  4. Add Uptime Kuma monitors if desired
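
Steps 1 and 2 can be sanity-checked before committing. The sketch below assumes a server entry is a mapping that should carry at least the hostname and containers keys mentioned in step 1. The actual config.yaml schema may differ (a systemd-only host might not list containers, for instance). The helper names are hypothetical, and the BatchMode=yes option is an addition to the documented command so that a missing key fails fast instead of prompting for a password.

```python
import ipaddress
from typing import Any, Dict, List

# Assumption: these are the minimum keys for a server entry in config.yaml.
REQUIRED_KEYS = ("hostname", "containers")


def validate_server_entry(name: str, entry: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in one servers: entry (empty if none)."""
    problems = [f"{name}: missing '{key}'" for key in REQUIRED_KEYS if key not in entry]
    if "containers" in entry and not isinstance(entry["containers"], list):
        problems.append(f"{name}: 'containers' should be a list")
    return problems


def ssh_check_command(ip: str) -> List[str]:
    """Build the step 2 connectivity check as an argument list."""
    ipaddress.ip_address(ip)  # raises ValueError on a malformed address
    return [
        "ssh",
        "-i", "/root/.ssh/homelab_rsa",
        "-o", "BatchMode=yes",  # addition: never fall back to a password prompt
        f"root@{ip}",
        "hostname",
    ]
```

Running the returned command with subprocess.run on CT 302 and checking for a zero return code would confirm that homelab_rsa is accepted by the new server.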