claude-home/monitoring/server-diagnostics/CONTEXT.md
Cal Corum ed16fee9f7 docs: add CT 302 SSH alias and git auth details to server-diagnostics
Documents the claude-runner SSH alias, HTTPS token auth method,
and notes that SSH git remotes don't work from CT 302.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-20 09:04:33 -06:00

Server Diagnostics — Deployment & Architecture

Overview

Automated server health monitoring running on CT 302 (claude-runner, 10.10.0.148). It is a two-tier system: Python health checks handle 99% of issues autonomously, and Claude is invoked only for complex failures that the scripts can't resolve.
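
The two tiers communicate through a process exit code: 0 for healthy, 1 for auto-remediated, 2 for escalation. The internals of health_check.py are not reproduced here, so the following is only a minimal sketch of how per-check outcomes could be folded into that single code. The names ExitCode, CheckResult, and overall_exit_code are hypothetical and do not come from the repository.

```python
from dataclasses import dataclass
from enum import IntEnum


class ExitCode(IntEnum):
    HEALTHY = 0      # nothing wrong
    REMEDIATED = 1   # a problem was found and fixed by the script
    ESCALATE = 2     # a problem the script could not fix


@dataclass
class CheckResult:
    # Hypothetical record for one check; field names are assumptions.
    name: str
    healthy: bool
    remediated: bool = False  # only meaningful when healthy is False


def overall_exit_code(results: list[CheckResult]) -> ExitCode:
    """Return the worst outcome across all checks."""
    worst = ExitCode.HEALTHY
    for result in results:
        if result.healthy:
            continue
        code = ExitCode.REMEDIATED if result.remediated else ExitCode.ESCALATE
        worst = max(worst, code)
    return worst
```

Taking the maximum means a single unresolved failure is enough to escalate, even if every other check passed or was repaired.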

Architecture

┌──────────────────────┐     ┌──────────────────────────────────┐
│  N8N (LXC 210)       │     │  CT 302 — claude-runner          │
│  10.10.0.210         │     │  10.10.0.148                     │
│                      │     │                                  │
│  ┌─────────────────┐ │ SSH │  ┌──────────────────────────┐   │
│  │ Cron: */15 min  │─┼─────┼─→│ health_check.py          │   │
│  │                 │ │     │  │ (exit 0/1/2)             │   │
│  │ Branch on exit: │ │     │  └──────────────────────────┘   │
│  │  0 → stop       │ │     │                                  │
│  │  1 → stop       │ │     │  ┌──────────────────────────┐   │
│  │  2 → invoke     │─┼─────┼─→│ claude --print            │   │
│  │     Claude      │ │     │  │ + client.py               │   │
│  └─────────────────┘ │     │  └──────────────────────────┘   │
│                      │     │                                  │
│  ┌─────────────────┐ │     │  SSH keys:                       │
│  │ Uptime Kuma     │ │     │  - homelab_rsa (→ target servers)│
│  │ webhook trigger │ │     │  - n8n_runner_key (← N8N)        │
│  └─────────────────┘ │     └──────────────────────────────────┘
└──────────────────────┘
         │ SSH to target servers
         ▼
┌────────────────┐  ┌────────────────┐  ┌────────────────┐
│ arr-stack      │  │ gitea          │  │ uptime-kuma    │
│ 10.10.0.221    │  │ 10.10.0.225    │  │ 10.10.0.227    │
│ Docker: sonarr │  │ systemd: gitea │  │ Docker: kuma   │
│ radarr, etc.   │  │ Docker: runner │  │                │
└────────────────┘  └────────────────┘  └────────────────┘
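
The branch on the left side of the diagram lives in an N8N workflow, not in Python. As a rough illustration of the same decision, the sketch below maps an exit code to the next step. The function names are hypothetical, and the handling of codes other than 0, 1, and 2 is an assumption, since the diagram does not cover them.

```python
import subprocess
import sys


def next_action(exit_code: int) -> str:
    """Map a health_check.py exit code to the workflow's next step."""
    if exit_code in (0, 1):
        return "stop"            # healthy, or already auto-remediated
    if exit_code == 2:
        return "invoke_claude"   # escalate to the second tier
    # Assumption: anything else (for example, ssh exits 255 when the
    # connection itself fails) is reported separately, not escalated.
    return "error"


def run_and_branch(argv: list[str]) -> str:
    """Run a command and branch on its exit status."""
    completed = subprocess.run(argv, check=False)
    return next_action(completed.returncode)


# Demo with a stand-in command that exits 2; no SSH is involved.
print(run_and_branch([sys.executable, "-c", "raise SystemExit(2)"]))
```

In the real deployment the command would be the SSH call into CT 302 rather than a local Python process.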

Cost Model

  • Exit 0 (healthy): $0. Pure Python, no API call.
  • Exit 1 (auto-remediated): $0. Python restarts the container and posts a Discord webhook notification.
  • Exit 2 (escalation): ~$0.10-0.15 per invocation. Claude Sonnet is invoked via claude --print.

At 96 checks/day (every 15 min), typical cost is near $0 unless something actually breaks and can't be auto-fixed.
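
The arithmetic behind that estimate can be written out as a small sketch. It uses the per-escalation range quoted above and assumes every escalation costs about the same and that a check escalates at most once. The function name is hypothetical.

```python
CHECKS_PER_DAY = 24 * 60 // 15  # one check every 15 minutes


def daily_cost_range(escalations: int,
                     low: float = 0.10,
                     high: float = 0.15) -> tuple[float, float]:
    """Estimated (low, high) daily cost in dollars for a number of escalations."""
    if not 0 <= escalations <= CHECKS_PER_DAY:
        raise ValueError("escalations must be between 0 and CHECKS_PER_DAY")
    return (round(escalations * low, 2), round(escalations * high, 2))
```

With no escalations the result is (0.0, 0.0). Under these assumptions, the ceiling of all 96 checks escalating would be (9.6, 14.4) dollars per day.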

Repository

  • Gitea: cal/claude-runner-monitoring
  • Deployed to: /root/.claude on CT 302
  • SSH alias: claude-runner (root@10.10.0.148, defined in ~/.ssh/config)
  • Update method: ssh claude-runner "cd /root/.claude && git pull"

Git Auth on CT 302

CT 302 pushes to Gitea via HTTPS with a token auth header (embedded-credential URLs are rejected by Gitea). The token is stored locally in ~/.claude/secrets/claude_runner_monitoring_gitea_token and configured on CT 302 via:

git config http.https://git.manticorum.com/.extraHeader 'Authorization: token <token>'

CT 302 does not have an SSH key registered with Gitea, so SSH git remotes won't work.
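
For scripted setup, the git config call above can be assembled as an argument list so the token never passes through shell quoting. This is a minimal sketch: the helper name is hypothetical, and the token in the usage note is a placeholder.

```python
def extra_header_command(token: str,
                         base_url: str = "https://git.manticorum.com/") -> list[str]:
    """Build the `git config` argv that sets the token auth header."""
    token = token.strip()  # token files often end with a newline
    if not token:
        raise ValueError("token is empty")
    return [
        "git", "config",
        f"http.{base_url}.extraHeader",
        f"Authorization: token {token}",
    ]
```

Calling extra_header_command("EXAMPLE") returns a list that can be handed to subprocess.run from inside the repository directory.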

Files

File                                       Purpose
CLAUDE.md                                  Runner-specific instructions for Claude
settings.json                              Locked-down permissions (read-only + restart only)
skills/server-diagnostics/health_check.py  Tier 1: automated health checks
skills/server-diagnostics/client.py        Tier 2: Claude's diagnostic toolkit
skills/server-diagnostics/notifier.py      Discord webhook notifications
skills/server-diagnostics/config.yaml      Server inventory + security rules
skills/server-diagnostics/SKILL.md         Skill reference
skills/server-diagnostics/CLAUDE.md        Remediation methodology

Adding a New Server

  1. Add entry to config.yaml under servers: with hostname, containers, etc.
  2. Ensure CT 302 can SSH: ssh -i /root/.ssh/homelab_rsa root@<ip> hostname
  3. Commit to Gitea, pull on CT 302
  4. Add Uptime Kuma monitors if desired
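
Steps 1 and 2 can be checked before committing. The sketch below is illustrative only: the config.yaml schema is not shown in this document, so the required keys (hostname and containers, the two fields named in step 1) and both helper names are assumptions.

```python
import ipaddress

# Assumption: these are the minimum keys for a server entry.
REQUIRED_KEYS = ("hostname", "containers")


def missing_fields(entry: dict) -> list[str]:
    """Return the required keys that are absent from a server entry."""
    return [key for key in REQUIRED_KEYS if key not in entry]


def ssh_check_command(ip: str,
                      key_path: str = "/root/.ssh/homelab_rsa") -> list[str]:
    """Build the connectivity check from step 2 as an argv list."""
    ipaddress.ip_address(ip)  # raises ValueError on a malformed address
    return ["ssh", "-i", key_path, f"root@{ip}", "hostname"]
```

Run on CT 302, the list returned by ssh_check_command can be passed to subprocess.run. A zero exit status with the remote hostname printed suggests the homelab_rsa key is accepted by the new server.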