feat: right-size VM 115 config and add --hosts flag to audit script

Reduce VM 115 (docker-sba) from 16 vCPUs (2×8) to 8 vCPUs (1×8) to match actual workload (0.06 load/core). Add --hosts flag to homelab-audit.sh for targeted post-change audits.

Closes #18

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
docs: sync KB — maintenance-reboot.md
2026-04-03 16:03:39 -05:00 · 2026-04-03 16:00:22 -05:00 · 2026-04-03 20:22:24 +00:00 · 2026-04-03 20:08:07 +00:00 · 2026-04-03 20:01:27 +00:00 · 2026-04-03 20:01:13 +00:00
11 changed files with 704 additions and 19 deletions
--- a/legacy/headless-claude/n8n-workflow-import.json
+++ b/legacy/headless-claude/n8n-workflow-import.json
@@ -21,7 +21,7 @@
    {
      "parameters": {
        "operation": "executeCommand",
-        "command": "/root/.local/bin/claude -p \"Run python3 ~/.claude/skills/server-diagnostics/client.py health paper-dynasty and analyze the results. If any containers are not running or there are critical issues, summarize them. Otherwise just say 'All systems healthy'.\" --output-format json --json-schema '{\"type\":\"object\",\"properties\":{\"status\":{\"type\":\"string\",\"enum\":[\"healthy\",\"issues_found\"]},\"summary\":{\"type\":\"string\"},\"root_cause\":{\"type\":\"string\"},\"severity\":{\"type\":\"string\",\"enum\":[\"low\",\"medium\",\"high\",\"critical\"]},\"affected_services\":{\"type\":\"array\",\"items\":{\"type\":\"string\"}},\"actions_taken\":{\"type\":\"array\",\"items\":{\"type\":\"string\"}}},\"required\":[\"status\",\"severity\",\"summary\"]}' --allowedTools \"Read,Grep,Glob,Bash(python3 ~/.claude/skills/server-diagnostics/client.py *)\"",
+        "command": "/root/.local/bin/claude -p \"Run python3 ~/.claude/skills/server-diagnostics/client.py health paper-dynasty and analyze the results. If any containers are not running or there are critical issues, summarize them. Otherwise just say 'All systems healthy'.\" --output-format json --json-schema '{\"type\":\"object\",\"properties\":{\"status\":{\"type\":\"string\",\"enum\":[\"healthy\",\"issues_found\"]},\"summary\":{\"type\":\"string\"},\"root_cause\":{\"type\":\"string\"},\"severity\":{\"type\":\"string\",\"enum\":[\"low\",\"medium\",\"high\",\"critical\"]},\"affected_services\":{\"type\":\"array\",\"items\":{\"type\":\"string\"}},\"actions_taken\":{\"type\":\"array\",\"items\":{\"type\":\"string\"}}},\"required\":[\"status\",\"severity\",\"summary\"]}' --allowedTools \"Read,Grep,Glob,Bash(python3 ~/.claude/skills/server-diagnostics/client.py *)\" --append-system-prompt \"You are a server diagnostics agent. Use the server-diagnostics skill client.py for all operations. Never run destructive commands.\"",
        "options": {}
      },
      "id": "ssh-claude-code",
@@ -75,20 +75,48 @@
      "typeVersion": 2,
      "position": [660, 0]
    },
+    {
+      "parameters": {
+        "operation": "executeCommand",
+        "command": "=/root/.local/bin/claude -p \"The previous health check found issues. Investigate deeper: check container logs, resource usage, and recent events. Provide a detailed root cause analysis and recommended remediation steps.\" --resume \"{{ $json.session_id }}\" --output-format json --json-schema '{\"type\":\"object\",\"properties\":{\"root_cause_detail\":{\"type\":\"string\"},\"container_logs\":{\"type\":\"string\"},\"resource_status\":{\"type\":\"string\"},\"remediation_steps\":{\"type\":\"array\",\"items\":{\"type\":\"string\"}},\"requires_human\":{\"type\":\"boolean\"}},\"required\":[\"root_cause_detail\",\"remediation_steps\",\"requires_human\"]}' --allowedTools \"Read,Grep,Glob,Bash(python3 ~/.claude/skills/server-diagnostics/client.py *)\" --max-turns 15 --append-system-prompt \"You are a server diagnostics agent performing a follow-up investigation. The initial health check found issues. Dig deeper into logs and metrics. Never run destructive commands.\"",
+        "options": {}
+      },
+      "id": "ssh-followup",
+      "name": "Follow Up Diagnostics",
+      "type": "n8n-nodes-base.ssh",
+      "typeVersion": 1,
+      "position": [880, -200],
+      "credentials": {
+        "sshPassword": {
+          "id": "REPLACE_WITH_CREDENTIAL_ID",
+          "name": "Claude Code LXC"
+        }
+      }
+    },
+    {
+      "parameters": {
+        "jsCode": "// Parse follow-up diagnostics response\nconst stdout = $input.first().json.stdout || '';\nconst initial = $('Parse Claude Response').first().json;\n\ntry {\n  const response = JSON.parse(stdout);\n  const data = response.structured_output || JSON.parse(response.result || '{}');\n  \n  return [{\n    json: {\n      ...initial,\n      followup: {\n        root_cause_detail: data.root_cause_detail || 'No detail available',\n        container_logs: data.container_logs || '',\n        resource_status: data.resource_status || '',\n        remediation_steps: data.remediation_steps || [],\n        requires_human: data.requires_human || false,\n        cost_usd: response.total_cost_usd,\n        session_id: response.session_id\n      },\n      total_cost_usd: (initial.cost_usd || 0) + (response.total_cost_usd || 0)\n    }\n  }];\n} catch (e) {\n  return [{\n    json: {\n      ...initial,\n      followup: {\n        error: e.message,\n        root_cause_detail: 'Follow-up parse failed',\n        remediation_steps: [],\n        requires_human: true\n      },\n      total_cost_usd: initial.cost_usd || 0\n    }\n  }];\n}"
+      },
+      "id": "parse-followup",
+      "name": "Parse Follow-up Response",
+      "type": "n8n-nodes-base.code",
+      "typeVersion": 2,
+      "position": [1100, -200]
+    },
    {
      "parameters": {
        "method": "POST",
        "url": "https://discord.com/api/webhooks/1451783909409816763/O9PMDiNt6ZIWRf8HKocIZ_E4vMGV_lEwq50aAiZ9HVFR2UGwO6J1N9_wOm82p0MetIqT",
        "sendBody": true,
        "specifyBody": "json",
-        "jsonBody": "={\n  \"embeds\": [{\n    \"title\": \"{{ $json.severity === 'critical' ? '🔴' : $json.severity === 'high' ? '🟠' : '🟡' }} Server Alert\",\n    \"description\": {{ JSON.stringify($json.summary) }},\n    \"color\": {{ $json.severity === 'critical' ? 15158332 : $json.severity === 'high' ? 15105570 : 16776960 }},\n    \"fields\": [\n      {\n        \"name\": \"Severity\",\n        \"value\": \"{{ $json.severity.toUpperCase() }}\",\n        \"inline\": true\n      },\n      {\n        \"name\": \"Server\",\n        \"value\": \"paper-dynasty (10.10.0.88)\",\n        \"inline\": true\n      },\n      {\n        \"name\": \"Cost\",\n        \"value\": \"${{ $json.cost_usd ? $json.cost_usd.toFixed(4) : '0.0000' }}\",\n        \"inline\": true\n      },\n      {\n        \"name\": \"Root Cause\",\n        \"value\": \"{{ $json.root_cause || 'N/A' }}\",\n        \"inline\": false\n      },\n      {\n        \"name\": \"Affected Services\",\n        \"value\": \"{{ $json.affected_services.length ? $json.affected_services.join(', ') : 'None' }}\",\n        \"inline\": false\n      },\n      {\n        \"name\": \"Actions Taken\",\n        \"value\": \"{{ $json.actions_taken.length ? $json.actions_taken.join('\\n') : 'None' }}\",\n        \"inline\": false\n      }\n    ],\n    \"timestamp\": \"{{ new Date().toISOString() }}\"\n  }]\n}",
+        "jsonBody": "={\n  \"embeds\": [{\n    \"title\": \"{{ $json.severity === 'critical' ? '🔴' : $json.severity === 'high' ? '🟠' : '🟡' }} Server Alert\",\n    \"description\": {{ JSON.stringify($json.summary) }},\n    \"color\": {{ $json.severity === 'critical' ? 15158332 : $json.severity === 'high' ? 15105570 : 16776960 }},\n    \"fields\": [\n      {\n        \"name\": \"Severity\",\n        \"value\": \"{{ $json.severity.toUpperCase() }}\",\n        \"inline\": true\n      },\n      {\n        \"name\": \"Server\",\n        \"value\": \"paper-dynasty (10.10.0.88)\",\n        \"inline\": true\n      },\n      {\n        \"name\": \"Cost\",\n        \"value\": \"${{ $json.total_cost_usd ? $json.total_cost_usd.toFixed(4) : '0.0000' }}\",\n        \"inline\": true\n      },\n      {\n        \"name\": \"Root Cause\",\n        \"value\": {{ JSON.stringify(($json.followup && $json.followup.root_cause_detail) || $json.root_cause || 'N/A') }},\n        \"inline\": false\n      },\n      {\n        \"name\": \"Affected Services\",\n        \"value\": \"{{ $json.affected_services.length ? $json.affected_services.join(', ') : 'None' }}\",\n        \"inline\": false\n      },\n      {\n        \"name\": \"Remediation Steps\",\n        \"value\": {{ JSON.stringify(($json.followup && $json.followup.remediation_steps.length) ? $json.followup.remediation_steps.map((s, i) => (i+1) + '. ' + s).join('\\n') : ($json.actions_taken.length ? $json.actions_taken.join('\\n') : 'None')) }},\n        \"inline\": false\n      },\n      {\n        \"name\": \"Requires Human?\",\n        \"value\": \"{{ ($json.followup && $json.followup.requires_human) ? '⚠️ Yes' : '✅ No' }}\",\n        \"inline\": true\n      }\n    ],\n    \"timestamp\": \"{{ new Date().toISOString() }}\"\n  }]\n}",
        "options": {}
      },
      "id": "discord-alert",
      "name": "Discord Alert",
      "type": "n8n-nodes-base.httpRequest",
      "typeVersion": 4.2,
-      "position": [880, -100]
+      "position": [1320, -200]
    },
    {
      "parameters": {
@@ -145,7 +173,7 @@
      "main": [
        [
          {
-            "node": "Discord Alert",
+            "node": "Follow Up Diagnostics",
            "type": "main",
            "index": 0
          }
@@ -158,6 +186,28 @@
          }
        ]
      ]
+    },
+    "Follow Up Diagnostics": {
+      "main": [
+        [
+          {
+            "node": "Parse Follow-up Response",
+            "type": "main",
+            "index": 0
+          }
+        ]
+      ]
+    },
+    "Parse Follow-up Response": {
+      "main": [
+        [
+          {
+            "node": "Discord Alert",
+            "type": "main",
+            "index": 0
+          }
+        ]
+      ]
    }
  },
  "settings": {
--- a/monitoring/scripts/homelab-audit.sh
+++ b/monitoring/scripts/homelab-audit.sh
@@ -5,7 +5,7 @@
 # to collect system metrics, then generates a summary report.
 #
 # Usage:
-#   homelab-audit.sh [--output-dir DIR]
+#   homelab-audit.sh [--output-dir DIR] [--hosts label:ip,label:ip,...] [--json]
 #
 # Environment overrides:
 #   STUCK_PROC_CPU_WARN  CPU% at which a D-state process is flagged (default: 10)
@@ -30,6 +30,9 @@ MEM_WARN=85
 ZOMBIE_WARN=1
 SWAP_WARN=512

+HOSTS_FILTER="" # comma-separated host list from --hosts; empty = audit all
+JSON_OUTPUT=0   # set to 1 by --json
+
 while [[ $# -gt 0 ]]; do
  case "$1" in
    --output-dir)
@@ -40,6 +43,18 @@ while [[ $# -gt 0 ]]; do
      REPORT_DIR="$2"
      shift 2
      ;;
+    --hosts)
+      if [[ $# -lt 2 ]]; then
+        echo "Error: --hosts requires an argument (label:ip,label:ip,...)" >&2
+        exit 1
+      fi
+      HOSTS_FILTER="$2"
+      shift 2
+      ;;
+    --json)
+      JSON_OUTPUT=1
+      shift
+      ;;
    *)
      echo "Unknown option: $1" >&2
      exit 1
@@ -50,6 +65,7 @@ done
 mkdir -p "$REPORT_DIR"
 SSH_FAILURES_LOG="$REPORT_DIR/ssh-failures.log"
 FINDINGS_FILE="$REPORT_DIR/findings.txt"
+AUDITED_HOSTS=() # populated in main; used by generate_summary for per-host counts

 # ---------------------------------------------------------------------------
 # Remote collector script
@@ -281,6 +297,18 @@ generate_summary() {
  printf "  Critical      : %d\n" "$crit_count"
  echo "=============================="

+  if [[ ${#AUDITED_HOSTS[@]} -gt 0 ]] && ((warn_count + crit_count > 0)); then
+    echo ""
+    printf "  %-30s %8s %8s\n" "Host" "Warnings" "Critical"
+    printf "  %-30s %8s %8s\n" "----" "--------" "--------"
+    for host in "${AUDITED_HOSTS[@]}"; do
+      local hw hc
+      hw=$(grep -c "^WARN  ${host}:" "$FINDINGS_FILE" 2>/dev/null || true)
+      hc=$(grep -c "^CRIT  ${host}:" "$FINDINGS_FILE" 2>/dev/null || true)
+      ((hw + hc > 0)) && printf "  %-30s %8d %8d\n" "$host" "$hw" "$hc"
+    done
+  fi
+
  if ((warn_count + crit_count > 0)); then
    echo ""
    echo "Findings:"
@@ -293,6 +321,9 @@ generate_summary() {
    grep '^SSH_FAILURE' "$SSH_FAILURES_LOG" | awk '{print "  " $2 " (" $3 ")"}'
  fi

+  echo ""
+  printf "Total: %d warning(s), %d critical across %d host(s)\n" \
+    "$warn_count" "$crit_count" "$host_count"
  echo ""
  echo "Reports: $REPORT_DIR"
 }
@@ -383,6 +414,69 @@ check_cert_expiry() {
  done
 }

+# ---------------------------------------------------------------------------
+# JSON report — writes findings.json to $REPORT_DIR when --json is used
+# ---------------------------------------------------------------------------
+write_json_report() {
+  local host_count="$1"
+  local json_file="$REPORT_DIR/findings.json"
+  local ssh_failure_count=0
+  local warn_count=0
+  local crit_count=0
+
+  [[ -f "$SSH_FAILURES_LOG" ]] &&
+    ssh_failure_count=$(grep -c '^SSH_FAILURE' "$SSH_FAILURES_LOG" 2>/dev/null || true)
+  [[ -f "$FINDINGS_FILE" ]] &&
+    warn_count=$(grep -c '^WARN' "$FINDINGS_FILE" 2>/dev/null || true)
+  [[ -f "$FINDINGS_FILE" ]] &&
+    crit_count=$(grep -c '^CRIT' "$FINDINGS_FILE" 2>/dev/null || true)
+
+  python3 - "$json_file" "$host_count" "$ssh_failure_count" \
+    "$warn_count" "$crit_count" "$FINDINGS_FILE" <<'PYEOF'
+import sys, json, datetime
+
+json_file = sys.argv[1]
+host_count = int(sys.argv[2])
+ssh_failure_count = int(sys.argv[3])
+warn_count = int(sys.argv[4])
+crit_count = int(sys.argv[5])
+findings_file = sys.argv[6]
+
+findings = []
+try:
+    with open(findings_file) as f:
+        for line in f:
+            line = line.strip()
+            if not line:
+                continue
+            parts = line.split(None, 2)
+            if len(parts) < 3:
+                continue
+            severity, host_colon, message = parts[0], parts[1], parts[2]
+            findings.append({
+                "severity": severity,
+                "host": host_colon.rstrip(":"),
+                "message": message,
+            })
+except FileNotFoundError:
+    pass
+
+output = {
+    "timestamp": datetime.datetime.utcnow().isoformat() + "Z",
+    "hosts_audited": host_count,
+    "warnings": warn_count,
+    "critical": crit_count,
+    "ssh_failures": ssh_failure_count,
+    "total_findings": warn_count + crit_count,
+    "findings": findings,
+}
+
+with open(json_file, "w") as f:
+    json.dump(output, f, indent=2)
+print(f"JSON report: {json_file}")
+PYEOF
+}
+
 # ---------------------------------------------------------------------------
 # Main
 # ---------------------------------------------------------------------------
@@ -390,22 +484,50 @@ main() {
  echo "Starting homelab audit — $(date)"
  echo "Report dir: $REPORT_DIR"
  echo "STUCK_PROC_CPU_WARN threshold: ${STUCK_PROC_CPU_WARN}%"
+  [[ -n "$HOSTS_FILTER" ]] && echo "Host filter: $HOSTS_FILTER"
  echo ""

  >"$FINDINGS_FILE"

-  echo "  Checking Proxmox backup recency..."
-  check_backup_recency
-
  local host_count=0
-  while read -r label addr; do
-    echo "  Auditing $label ($addr)..."
-    parse_and_report "$label" "$addr"
-    check_cert_expiry "$label" "$addr"
-    ((host_count++)) || true
-  done < <(collect_inventory)
+
+  if [[ -n "$HOSTS_FILTER" ]]; then
+    # --hosts mode: audit specified hosts directly, skip Proxmox inventory
+    # Accepts comma-separated entries; each entry may be plain hostname or label:ip
+    local check_proxmox=0
+    IFS=',' read -ra filter_hosts <<<"$HOSTS_FILTER"
+    for entry in "${filter_hosts[@]}"; do
+      local label="${entry%%:*}"
+      [[ "$label" == "proxmox" ]] && check_proxmox=1
+    done
+    if ((check_proxmox)); then
+      echo "  Checking Proxmox backup recency..."
+      check_backup_recency
+    fi
+    for entry in "${filter_hosts[@]}"; do
+      local label="${entry%%:*}"
+      local addr="${entry#*:}"
+      echo "  Auditing $label ($addr)..."
+      parse_and_report "$label" "$addr"
+      check_cert_expiry "$label" "$addr"
+      AUDITED_HOSTS+=("$label")
+      ((host_count++)) || true
+    done
+  else
+    echo "  Checking Proxmox backup recency..."
+    check_backup_recency
+
+    while read -r label addr; do
+      echo "  Auditing $label ($addr)..."
+      parse_and_report "$label" "$addr"
+      check_cert_expiry "$label" "$addr"
+      AUDITED_HOSTS+=("$label")
+      ((host_count++)) || true
+    done < <(collect_inventory)
+  fi

  generate_summary "$host_count"
+  [[ "$JSON_OUTPUT" -eq 1 ]] && write_json_report "$host_count"
 }

 main "$@"
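Review note: the per-host table in `generate_summary` and the `findings.json` writer both depend on the same line shape in `findings.txt` (`SEVERITY  host: message`). A minimal Python sketch of that parsing and tallying is below, useful for checking the two outputs agree. The helper names and the sample findings are hypothetical, not taken from the script.

```python
from collections import Counter


def parse_finding(line):
    """Split 'SEVERITY  host: message' into a dict; return None if malformed."""
    parts = line.strip().split(None, 2)
    if len(parts) < 3:
        return None
    severity, host_colon, message = parts
    return {
        "severity": severity,
        "host": host_colon.rstrip(":"),
        "message": message,
    }


def per_host_counts(lines):
    """Tally findings per host and severity, skipping blank or malformed lines."""
    counts = {}
    for line in lines:
        finding = parse_finding(line)
        if finding is None:
            continue
        counts.setdefault(finding["host"], Counter())[finding["severity"]] += 1
    return counts


# Hypothetical sample lines in the findings.txt format.
sample = [
    "WARN  vm-115: load per core 0.75",
    "CRIT  vm-115: disk usage 96%",
    "WARN  lxc-225: swap in use",
    "",
]

for host, tally in per_host_counts(sample).items():
    print(f"{host:<30} {tally['WARN']:>8} {tally['CRIT']:>8}")
```

Counting from parsed records rather than with a `grep` pattern also avoids treating characters such as `.` in a host label as regex metacharacters.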
--- a/monitoring/scripts/test-audit-collectors.sh
+++ b/monitoring/scripts/test-audit-collectors.sh
@@ -93,6 +93,34 @@ else
  fail "disk_usage" "expected 'N /path', got: '$result'"
 fi

+# --- --hosts flag parsing ---
+echo ""
+echo "=== --hosts argument parsing tests ==="
+
+# Single host
+input="vm-115:10.10.0.88"
+IFS=',' read -ra entries <<<"$input"
+label="${entries[0]%%:*}"
+addr="${entries[0]#*:}"
+if [[ "$label" == "vm-115" && "$addr" == "10.10.0.88" ]]; then
+  pass "--hosts single entry parsed: $label $addr"
+else
+  fail "--hosts single" "expected 'vm-115 10.10.0.88', got: '$label $addr'"
+fi
+
+# Multiple hosts
+input="vm-115:10.10.0.88,lxc-225:10.10.0.225"
+IFS=',' read -ra entries <<<"$input"
+label1="${entries[0]%%:*}"
+addr1="${entries[0]#*:}"
+label2="${entries[1]%%:*}"
+addr2="${entries[1]#*:}"
+if [[ "$label1" == "vm-115" && "$addr1" == "10.10.0.88" && "$label2" == "lxc-225" && "$addr2" == "10.10.0.225" ]]; then
+  pass "--hosts multi entry parsed: $label1 $addr1, $label2 $addr2"
+else
+  fail "--hosts multi" "unexpected parse result"
+fi
+
 echo ""
 echo "=== Results: $PASS passed, $FAIL failed ==="
 ((FAIL == 0))
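Review note: the new tests cover `label:ip` entries but not the plain-hostname form that the `--hosts` comment in `homelab-audit.sh` also allows. The sketch below models the two parameter expansions (`${entry%%:*}` and `${entry#*:}`) in Python to show what a bare hostname resolves to. The function name is hypothetical, and it is meant to match the bash behaviour only for well-formed input (no trailing commas).

```python
def parse_hosts(spec):
    """Parse 'label:ip,label:ip' into (label, addr) pairs."""
    pairs = []
    for entry in spec.split(","):
        # ${entry%%:*} keeps everything before the first colon.
        label, sep, addr = entry.partition(":")
        # ${entry#*:} leaves the entry unchanged when there is no colon,
        # so a bare hostname serves as both label and address.
        pairs.append((label, addr if sep else entry))
    return pairs


print(parse_hosts("vm-115:10.10.0.88,lxc-225:10.10.0.225"))
print(parse_hosts("proxmox"))
```

A third bash test case with a colon-free entry would pin this behaviour down in the suite itself.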
--- a/monitoring/server-diagnostics/CONTEXT.md
+++ b/monitoring/server-diagnostics/CONTEXT.md
@@ -92,6 +92,42 @@ CT 302 does **not** have an SSH key registered with Gitea, so SSH git remotes wo
 3. Commit to Gitea, pull on CT 302
 4. Add Uptime Kuma monitors if desired

+## Health Check Thresholds
+
+Thresholds are evaluated in `health_check.py`. All load thresholds use **per-core** metrics
+to avoid false positives from LXC containers (which see the Proxmox host's aggregate load).
+
+### Load Average
+
+| Metric | Value | Rationale |
+|--------|-------|-----------|
+| `LOAD_WARN_PER_CORE` | `0.7` | Elevated — investigate if sustained |
+| `LOAD_CRIT_PER_CORE` | `1.0` | Saturated — CPU is a bottleneck |
+| Sample window | 5-minute | Filters transient spikes (not 1-minute) |
+
+**Formula**: `load_per_core = load_5m / nproc`
+
+**Why per-core?** Proxmox LXC containers see the host's aggregate load average via the
+shared kernel. A 32-core Proxmox host at load 9 is at 0.28/core (healthy), but a naive
+fixed threshold of 2× cores would fire at that same load 9 inside a 4-core LXC (limit 8).
+Dividing `load_5m` by `nproc`, the host's visible core count, gives the correct ratio.
+
+**Validation examples**:
+- Proxmox host: load 9 / 32 cores = 0.28/core → no alert ✓
+- VM 116 at 0.75/core → warning ✓ (above 0.7 threshold)
+- VM at 1.1/core → critical ✓
+
+### Other Thresholds
+
+| Check | Threshold | Notes |
+|-------|-----------|-------|
+| Zombie processes | 5 | Single zombies are transient noise; alert only if ≥ 5 |
+| Swap usage | 30% of total swap | Percentage-based to handle varied swap sizes across hosts |
+| Disk warning | 85% | |
+| Disk critical | 95% | |
+| Memory | 90% | |
+| Uptime alert | Non-urgent Discord post | Not a page-level alert |
+
 ## Related

 - [monitoring/CONTEXT.md](../CONTEXT.md) — Overall monitoring architecture
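Review note: the per-core load rule documented above can be sketched as a small classifier. This is not the code in `health_check.py`. The function name is hypothetical, and treating a value exactly equal to a threshold as a breach (`>=`) is an assumption, since the doc only gives examples strictly above each threshold.

```python
LOAD_WARN_PER_CORE = 0.7
LOAD_CRIT_PER_CORE = 1.0


def classify_load(load_5m, nproc):
    """Classify the 5-minute load average normalised by visible core count."""
    if nproc <= 0:
        raise ValueError("nproc must be positive")
    per_core = load_5m / nproc
    # Assumption: a value equal to the threshold counts as a breach.
    if per_core >= LOAD_CRIT_PER_CORE:
        return "critical"
    if per_core >= LOAD_WARN_PER_CORE:
        return "warning"
    return "ok"


# The validation examples from the doc, expressed as (load_5m, nproc).
print(classify_load(9.0, 32))  # 0.28/core
print(classify_load(3.0, 4))   # 0.75/core
print(classify_load(4.4, 4))   # 1.1/core
```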
--- a/scheduled-tasks/CONTEXT.md
+++ b/scheduled-tasks/CONTEXT.md
@@ -158,6 +158,23 @@ ls -t ~/.local/share/claude-scheduled/logs/backlog-triage/ | head -1
 ~/.config/claude-scheduled/runner.sh backlog-triage
 ```

+## Session Resumption
+
+Tasks can opt into session persistence for multi-step workflows:
+
+```json
+{
+  "session_resumable": true,
+  "resume_last_session": true
+}
+```
+
+When `session_resumable` is `true`, runner.sh saves the `session_id` to `$LOG_DIR/last_session_id` after each run. When `resume_last_session` is also `true`, the next run resumes that session with `--resume`.
+
+Issue-poller and PR-reviewer capture `session_id` in logs and result JSON for manual follow-up.
+
+See also: [Agent SDK Evaluation](agent-sdk-evaluation.md) for CLI vs SDK comparison.
+
 ## Cost Safety

 - Per-task `max_budget_usd` cap — runner.sh detects `error_max_budget_usd` and warns
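Review note: the Session Resumption flow can be sketched as below. `runner.sh` itself is bash and is not shown in this diff, so this Python model is only an illustration of the documented behaviour. The function names and the session id `abc123` are hypothetical, and only flags that appear elsewhere in this change (`-p`, `--output-format`, `--resume`) are used.

```python
import tempfile
from pathlib import Path


def build_claude_args(prompt, log_dir, settings):
    """Build the CLI argument list, adding --resume when both flags are set."""
    args = ["claude", "-p", prompt, "--output-format", "json"]
    session_file = Path(log_dir) / "last_session_id"
    if (
        settings.get("session_resumable")
        and settings.get("resume_last_session")
        and session_file.is_file()
    ):
        session_id = session_file.read_text().strip()
        if session_id:
            args += ["--resume", session_id]
    return args


def save_session_id(log_dir, settings, result):
    """Persist session_id after a run when the task opted into persistence."""
    if settings.get("session_resumable") and result.get("session_id"):
        path = Path(log_dir) / "last_session_id"
        path.write_text(result["session_id"] + "\n")


with tempfile.TemporaryDirectory() as log_dir:
    settings = {"session_resumable": True, "resume_last_session": True}
    first = build_claude_args("triage backlog", log_dir, settings)
    save_session_id(log_dir, settings, {"session_id": "abc123"})
    second = build_claude_args("triage backlog", log_dir, settings)

print(first)
print(second)
```

The first run has no saved id and starts a fresh session. The second picks up the id written after the first.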
--- a/scheduled-tasks/agent-sdk-evaluation.md
+++ b/scheduled-tasks/agent-sdk-evaluation.md
@@ -0,0 +1,175 @@
+---
+title: "Agent SDK Evaluation — CLI vs Python/TypeScript SDK"
+description: "Comparison of Claude Code CLI invocation (claude -p) vs the native Agent SDK for programmatic use in the headless-claude and claude-scheduled systems."
+type: context
+domain: scheduled-tasks
+tags: [claude-code, sdk, agent-sdk, python, typescript, headless, automation, evaluation]
+---
+
+# Agent SDK Evaluation: CLI vs Python/TypeScript SDK
+
+**Date:** 2026-04-03
+**Status:** Evaluation complete — recommendation below
+**Related:** Issue #3 (headless-claude: Additional Agent SDK improvements)
+
+## 1. Current Approach — CLI via `claude -p`
+
+All headless Claude invocations use the CLI subprocess pattern:
+
+```bash
+claude -p "<prompt>" \
+  --model sonnet \
+  --output-format json \
+  --allowedTools "Read,Grep,Glob" \
+  --append-system-prompt "..." \
+  --max-budget-usd 2.00
+```
+
+**Pros:**
+- Simple to invoke from any language (bash, n8n SSH nodes, systemd units)
+- Uses Claude Max OAuth — no API key needed, no per-token billing
+- Mature and battle-tested in our scheduled-tasks framework
+- CLAUDE.md and settings.json are loaded automatically
+- No runtime dependencies beyond the CLI binary
+
+**Cons:**
+- Structured output requires parsing JSON from stdout
+- Error handling is exit-code-based with stderr parsing
+- No mid-stream observability (streaming requires JSONL parsing)
+- Tool approval is allowlist-only — no dynamic per-call decisions
+- Session resumption requires manual `--resume` flag plumbing
+
+## 2. Python Agent SDK
+
+**Package:** `claude-agent-sdk` (renamed from `claude-code-sdk`)
+**Install:** `pip install claude-agent-sdk`
+**Requires:** Python 3.10+, `ANTHROPIC_API_KEY` env var
+
+```python
+from claude_agent_sdk import query, ClaudeAgentOptions
+
+async for message in query(
+    prompt="Diagnose server health",
+    options=ClaudeAgentOptions(
+        allowed_tools=["Read", "Grep", "Bash(python3 *)"],
+        output_format={"type": "json_schema", "schema": {...}},
+        max_budget_usd=2.00,
+    ),
+):
+    if hasattr(message, "result"):
+        print(message.result)
+```
+
+**Key features:**
+- Async generator with typed `SDKMessage` objects (User, Assistant, Result, System)
+- `ClaudeSDKClient` for stateful multi-turn conversations
+- `can_use_tool` callback for dynamic per-call tool approval
+- In-process hooks (`PreToolUse`, `PostToolUse`, `Stop`, etc.)
+- `rewindFiles()` to restore filesystem to any prior message point
+- Typed exception hierarchy (`CLINotFoundError`, `ProcessError`, etc.)
+
+**Limitation:** Shells out to the Claude Code CLI binary — it is NOT a pure HTTP client. The binary must be installed.
+
+## 3. TypeScript Agent SDK
+
+**Package:** `@anthropic-ai/claude-agent-sdk` (renamed from `@anthropic-ai/claude-code`)
+**Install:** `npm install @anthropic-ai/claude-agent-sdk`
+**Requires:** Node 18+, `ANTHROPIC_API_KEY` env var
+
+```typescript
+import { query } from "@anthropic-ai/claude-agent-sdk";
+
+for await (const message of query({
+  prompt: "Diagnose server health",
+  options: {
+    allowedTools: ["Read", "Grep", "Bash(python3 *)"],
+    maxBudgetUsd: 2.00,
+  }
+})) {
+  if ("result" in message) console.log(message.result);
+}
+```
+
+**Key features (superset of Python):**
+- Same async generator pattern
+- `"auto"` permission mode (model classifier per tool call) — TS-only
+- `spawnClaudeCodeProcess` hook for remote/containerized execution
+- `setMcpServers()` for dynamic MCP server swapping mid-session
+- V2 preview: `send()` / `stream()` patterns for simpler multi-turn
+- Bundles the Claude Code binary — no separate install needed
+
+## 4. Comparison Matrix
+
+| Capability | `claude -p` CLI | Python SDK | TypeScript SDK |
+|---|---|---|---|
+| **Auth** | OAuth (Claude Max) | API key only | API key only |
+| **Invocation** | Shell subprocess | Async generator | Async generator |
+| **Structured output** | `--json-schema` flag | Schema in options | Schema in options |
+| **Streaming** | JSONL parsing | Typed messages | Typed messages |
+| **Tool approval** | `--allowedTools` only | `can_use_tool` callback | `canUseTool` callback + auto mode |
+| **Session resume** | `--resume` flag | `resume: sessionId` | `resume: sessionId` |
+| **Cost tracking** | Parse result JSON | `ResultMessage.total_cost_usd` | Same + per-model breakdown |
+| **Error handling** | Exit codes + stderr | Typed exceptions | Typed exceptions |
+| **Hooks** | External shell scripts | In-process callbacks | In-process callbacks |
+| **Custom tools** | Not available | `tool()` decorator | `tool()` + Zod schemas |
+| **Subagents** | Not programmatic | `agents` option | `agents` option |
+| **File rewind** | Not available | `rewindFiles()` | `rewindFiles()` |
+| **MCP servers** | `--mcp-config` file | Inline config object | Inline + dynamic swap |
+| **CLAUDE.md loading** | Automatic | Must opt-in (`settingSources`) | Must opt-in |
+| **Dependencies** | CLI binary | CLI binary + Python | Node 18+ (bundles CLI) |
+
+## 5. Integration Paths
+
+### A. n8n Code Nodes
+
+The n8n Code node supports JavaScript (not TypeScript directly, but the SDK's JS output works). This would replace the current SSH → CLI pattern:
+
+```
+Schedule Trigger → Code Node (JS, uses SDK) → IF → Discord
+```
+
+**Trade-off:** Eliminates the SSH hop to CT 300, but requires `ANTHROPIC_API_KEY` and n8n to have the npm package installed. Current n8n runs in a Docker container on CT 210 — would need the SDK and CLI binary in the image.
+
+### B. Standalone Python Scripts
+
+Replace `claude -p` subprocess calls in custom dispatchers with the Python SDK:
+
+```python
+# Instead of: subprocess.run(["claude", "-p", prompt, ...])
+async for msg in query(prompt=prompt, options=opts):
+    ...
+```
+
+**Trade-off:** Richer error handling and streaming, but our dispatchers are bash scripts, not Python. Would require rewriting `runner.sh` and dispatchers in Python.
+
+### C. Systemd-triggered Tasks (Current Architecture)
+
+Keep systemd timers → bash scripts, but optionally invoke a thin Python wrapper that uses the SDK instead of `claude -p` directly.
+
+**Trade-off:** Adds Python as a dependency for scheduled tasks that currently only need bash + the CLI binary. Marginal benefit unless we need hooks or dynamic tool approval.
+
+## 6. Recommendation
+
+**Stay with CLI invocation for now. Revisit the Python SDK when we need dynamic tool approval or in-process hooks.**
+
+### Rationale
+
+1. **Auth is the blocker.** The SDK requires `ANTHROPIC_API_KEY` (API billing). Our entire scheduled-tasks framework runs on Claude Max OAuth at zero marginal cost. Switching to the SDK means paying per-token for every scheduled task, issue-worker, and PR-reviewer invocation. This alone makes the SDK non-viable for our current architecture.
+
+2. **The CLI covers our needs.** With `--append-system-prompt` (done), `--resume` (this PR), `--json-schema`, and `--allowedTools`, the CLI provides everything we currently need. Session resumption was the last missing piece.
+
+3. **Bash scripts are the right abstraction.** Our runners are launched by systemd timers. Bash + CLI is the natural fit — no runtime dependencies, no async event loops, no package management.
+
+### When to Revisit
+
+- If Anthropic adds OAuth support to the SDK (eliminating the billing difference)
+- If we need dynamic tool approval (e.g., "allow this Bash command but deny that one" at runtime)
+- If we build a long-running Python service that orchestrates multiple Claude sessions (the `ClaudeSDKClient` stateful pattern would be valuable there)
+- If we move to n8n custom nodes written in TypeScript (the TS SDK bundles the CLI binary)
+
+### Migration Path (If Needed Later)
+
+1. Start with the Python SDK in a single task (e.g., `backlog-triage`) as a proof of concept
+2. Create a thin `sdk-runner.py` wrapper that reads the same `settings.json` and `prompt.md` files
+3. Swap the systemd unit's `ExecStart` from `runner.sh` to `sdk-runner.py`
+4. Expand to other tasks if the POC proves valuable
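Review note: the evaluation lists "parsing JSON from stdout" as a CLI drawback. For scale, the sketch below ports the fallback logic of the new "Parse Follow-up Response" Code node to Python. The field names (`structured_output`, `result`, `total_cost_usd`, `session_id`) are the ones that node reads. The function name and return shape are hypothetical.

```python
import json


def parse_cli_result(stdout):
    """Parse `claude -p --output-format json` stdout into a small result dict."""
    try:
        response = json.loads(stdout)
        # Prefer structured_output; otherwise try to decode `result` as JSON.
        data = response.get("structured_output") or json.loads(
            response.get("result") or "{}"
        )
    except (json.JSONDecodeError, AttributeError, TypeError) as exc:
        return {
            "ok": False,
            "error": str(exc),
            "data": {},
            "cost_usd": 0.0,
            "session_id": None,
        }
    return {
        "ok": True,
        "data": data,
        "cost_usd": response.get("total_cost_usd") or 0.0,
        "session_id": response.get("session_id"),
    }


# Hypothetical stdout payload for illustration.
example = json.dumps(
    {
        "structured_output": {"status": "healthy"},
        "total_cost_usd": 0.0123,
        "session_id": "s1",
    }
)
print(parse_cli_result(example))
```

Roughly twenty lines cover the failure modes the workflow handles today, which fits the recommendation to stay on the CLI.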
--- a/server-configs/hosts.yml
+++ b/server-configs/hosts.yml
@@ -245,11 +245,25 @@ hosts:
      - sqlite-major-domo
      - temp-postgres

+  # Docker Home Servers VM (Proxmox) - decommission candidate
+  # VM 116: Only Jellyfin remains after 2026-04-03 cleanup (watchstate removed — duplicate of manticore's canonical instance)
+  # Jellyfin on manticore already covers this service. VM 116 + VM 110 are candidates to reclaim 8 vCPUs + 16 GB RAM.
+  # See issue #31 for cleanup details.
+  docker-home-servers:
+    type: docker
+    ip: 10.10.0.124
+    vmid: 116
+    user: cal
+    description: "Legacy home servers VM — Jellyfin only, decommission candidate"
+    config_paths:
+      docker-compose: /home/cal/container-data
+    services:
+      - jellyfin  # only remaining service; duplicate of ubuntu-manticore jellyfin
+    decommission_candidate: true
+    notes: "watchstate removed 2026-04-03 (duplicate of manticore); 3.36 GB images pruned; see issue #31"
+
 # Decommissioned hosts (kept for reference)
 # decommissioned:
 #   tdarr-old:
 #     ip: 10.10.0.43
 #     note: "Replaced by ubuntu-manticore tdarr"
-#   docker-home:
-#     ip: 10.10.0.124
-#     note: "Decommissioned"
--- a/server-configs/proxmox/maintenance-reboot.md
+++ b/server-configs/proxmox/maintenance-reboot.md
@@ -0,0 +1,210 @@
+---
+title: "Proxmox Monthly Maintenance Reboot"
+description: "Runbook for the first-Sunday-of-the-month Proxmox host reboot — dependency-aware shutdown/startup order, validation checklist, and Ansible automation."
+type: runbook
+domain: server-configs
+tags: [proxmox, maintenance, reboot, ansible, operations, systemd]
+---
+
+# Proxmox Monthly Maintenance Reboot
+
+## Overview
+
+| Detail | Value |
+|--------|-------|
+| **Schedule** | 1st Sunday of every month, 3:00 AM ET (08:00 UTC during EST, 07:00 UTC during EDT) |
+| **Expected downtime** | ~15 minutes (host reboot + VM/LXC startup) |
+| **Orchestration** | Ansible playbook on LXC 304 (ansible-controller) |
+| **Calendar** | Google Calendar recurring event: "Proxmox Monthly Maintenance Reboot" |
+| **HA DNS** | ubuntu-manticore (10.10.0.226) provides Pi-hole 2 during Proxmox downtime |
+
+## Why
+
+- Kernel updates do not take effect until the host reboots, so without a reboot they pile up unapplied
+- Long uptimes allow memory leaks and process state drift (e.g., avahi busy-loops)
+- Validates that all VMs/LXCs auto-start cleanly with `onboot: 1`
+
+## Prerequisites (Before Maintenance)
+
+- [ ] Verify no active Tdarr transcodes on ubuntu-manticore
+- [ ] Verify no running database backups
+- [ ] Switch workstation DNS to `1.1.1.1` (Pi-hole 1 on VM 106 will be offline)
+- [ ] Confirm ubuntu-manticore Pi-hole 2 is healthy: `ssh manticore "docker exec pihole pihole status"`
+
+## `onboot` Audit
+
+All production VMs and LXCs must have `onboot: 1` so they restart automatically if the playbook fails mid-sequence.
+
+**Check VMs:**
+```bash
+ssh proxmox "for id in \$(qm list | awk 'NR>1{print \$1}'); do \
+  name=\$(qm config \$id | grep '^name:' | awk '{print \$2}'); \
+  onboot=\$(qm config \$id | grep '^onboot:'); \
+  echo \"VM \$id (\$name): \${onboot:-onboot NOT SET}\"; \
+done"
+```
+
+**Check LXCs:**
+```bash
+ssh proxmox "for id in \$(pct list | awk 'NR>1{print \$1}'); do \
+  name=\$(pct config \$id | grep '^hostname:' | awk '{print \$2}'); \
+  onboot=\$(pct config \$id | grep '^onboot:'); \
+  echo \"LXC \$id (\$name): \${onboot:-onboot NOT SET}\"; \
+done"
+```
+
+**Audit results (2026-04-03):**
+
+| ID | Name | Type | `onboot` | Action needed |
+|----|------|------|----------|---------------|
+| 106 | docker-home | VM | 1 | OK |
+| 109 | homeassistant | VM | NOT SET | **Add `onboot: 1`** |
+| 110 | discord-bots | VM | 1 | OK |
+| 112 | databases-bots | VM | 1 | OK |
+| 115 | docker-sba | VM | 1 | OK |
+| 116 | docker-home-servers | VM | 1 | OK |
+| 210 | docker-n8n-lxc | LXC | 1 | OK |
+| 221 | arr-stack | LXC | NOT SET | **Add `onboot: 1`** |
+| 222 | memos | LXC | 1 | OK |
+| 223 | foundry-lxc | LXC | NOT SET | **Add `onboot: 1`** |
+| 225 | gitea | LXC | 1 | OK |
+| 227 | uptime-kuma | LXC | 1 | OK |
+| 301 | claude-discord-coordinator | LXC | 1 | OK |
+| 302 | claude-runner | LXC | 1 | OK |
+| 303 | mcp-gateway | LXC | 0 | Intentional (on-demand) |
+| 304 | ansible-controller | LXC | 1 | OK |
+
+**Fix missing `onboot`:**
+```bash
+ssh proxmox "qm set 109 --onboot 1"
+ssh proxmox "pct set 221 --onboot 1"
+ssh proxmox "pct set 223 --onboot 1"
+```
+
+## Shutdown Order (Dependency-Aware)
+
+Reverse of the validated startup sequence. Stop consumers before their dependencies.
+
+```
+Tier 4 — Media & Others (no downstream dependents)
+  VM 109  homeassistant
+  LXC 221 arr-stack
+  LXC 222 memos
+  LXC 223 foundry-lxc
+  LXC 302 claude-runner
+  LXC 303 mcp-gateway (if running)
+
+Tier 3 — Applications (depend on databases + infra)
+  VM 115  docker-sba (Paper Dynasty, Major Domo)
+  VM 110  discord-bots
+  LXC 301 claude-discord-coordinator
+
+Tier 2 — Infrastructure + DNS (depend on databases)
+  VM 106  docker-home (Pi-hole 1, NPM)
+  LXC 225 gitea
+  LXC 210 docker-n8n-lxc
+  LXC 227 uptime-kuma
+  VM 116  docker-home-servers
+
+Tier 1 — Databases (no dependencies, shut down last)
+  VM 112  databases-bots
+
+Tier 0 — Ansible controller shuts itself down last
+  LXC 304 ansible-controller
+
+→ Proxmox host reboots
+```
+
+**Known quirks:**
+- VM 112 (databases-bots) may ignore ACPI shutdown — use `--forceStop` after timeout
+- VM 109 (homeassistant) is self-managed via HA Supervisor, excluded from Ansible inventory
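The tier list and the `--forceStop` quirk can be sketched as a dry-run planner that only prints the commands it would run. This is an illustration, not the Ansible playbook itself: `TIERS`, `FORCE_STOP_IDS`, `shutdown_cmd`, and `plan_shutdown` are hypothetical names, the 120-second timeout is an assumed value, and LXC 304 is left out because it shuts itself down last.

```bash
#!/usr/bin/env bash
# Shutdown tiers in order, as "tool:id" pairs copied from the list above.
# qm manages VMs, pct manages LXCs.
TIERS=(
  "qm:109 pct:221 pct:222 pct:223 pct:302 pct:303"  # Tier 4
  "qm:115 qm:110 pct:301"                           # Tier 3
  "qm:106 pct:225 pct:210 pct:227 qm:116"           # Tier 2
  "qm:112"                                          # Tier 1
)

# Guests that may ignore ACPI shutdown and get --forceStop after the timeout.
FORCE_STOP_IDS="112"

# shutdown_cmd TOOL:ID [TIMEOUT_SECONDS] -> prints one shutdown command.
shutdown_cmd() {
  local tool="${1%%:*}" id="${1##*:}" timeout="${2:-120}"
  local cmd="$tool shutdown $id --timeout $timeout"
  case " $FORCE_STOP_IDS " in
    *" $id "*) cmd="$cmd --forceStop 1" ;;
  esac
  echo "$cmd"
}

# plan_shutdown -> prints every command, consumers before their dependencies.
plan_shutdown() {
  local tier guest
  for tier in "${TIERS[@]}"; do
    for guest in $tier; do
      shutdown_cmd "$guest"
    done
  done
}
```

Running `plan_shutdown` shows the order before anything is touched. A real run would also need to skip guests that are already stopped (LXC 303 is on-demand) and to respect the note that VM 109 is outside the Ansible inventory.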
+
+## Startup Order (Staggered)
+
+After the Proxmox host reboots, guests with `onboot: 1` will auto-start. The Ansible playbook overrides this with a controlled sequence:
+
+```
+Tier 1 — Databases first
+  VM 112  databases-bots
+  → wait 30s for DB to accept connections
+
+Tier 2 — Infrastructure + DNS
+  VM 106  docker-home (Pi-hole 1, NPM)
+  LXC 225 gitea
+  LXC 210 docker-n8n-lxc
+  LXC 227 uptime-kuma
+  VM 116  docker-home-servers
+  → wait 30s
+
+Tier 3 — Applications
+  VM 115  docker-sba
+  VM 110  discord-bots
+  LXC 301 claude-discord-coordinator
+  → wait 30s
+
+Pi-hole fix — restart container to clear UDP DNS bug
+  qm guest exec 106 -- docker restart pihole
+  → wait 10s
+
+Tier 4 — Media & Others
+  VM 109  homeassistant
+  LXC 221 arr-stack
+  LXC 222 memos
+  LXC 223 foundry-lxc
+```
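The fixed `wait 30s` steps could instead poll until a dependency actually answers. A minimal sketch, assuming bash (for `/dev/tcp`); `retry_until` and `tcp_open` are hypothetical helpers, and the runbook does not state the database VM's address or port, so those stay as placeholders.

```bash
#!/usr/bin/env bash
# retry_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS is exhausted.
# Returns 0 on the first success, 1 if every attempt failed.
retry_until() {
  local attempts="$1" delay="$2" i
  shift 2
  for ((i = 1; i <= attempts; i++)); do
    if "$@"; then
      return 0
    fi
    # Sleep between attempts, but not after the final failure.
    if ((i < attempts)); then
      sleep "$delay"
    fi
  done
  return 1
}

# tcp_open HOST PORT
# Succeeds if a TCP connection can be opened (bash /dev/tcp redirection).
# The subshell closes the descriptor again as soon as it exits.
tcp_open() {
  (exec 3<>"/dev/tcp/$1/$2") 2>/dev/null
}
```

With placeholder variables, `retry_until 30 2 tcp_open "$DB_HOST" "$DB_PORT"` would wait up to about a minute for the Tier 1 database before starting Tier 2. An open port does not prove the database is ready for queries, so a client-level check may still be worth adding.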
+
+## Post-Reboot Validation
+
+- [ ] Pi-hole 1 DNS resolving: `ssh docker-home "docker exec pihole dig google.com @127.0.0.1"`
+- [ ] Gitea accessible: `curl -sf https://git.manticorum.com/api/v1/version`
+- [ ] n8n healthy: `ssh docker-n8n-lxc "docker ps --filter name=n8n --format '{{.Status}}'"`
+- [ ] Discord bots responding (check Discord)
+- [ ] Uptime Kuma dashboard green: `curl -sf http://10.10.0.227:3001/api/status-page/homelab`
+- [ ] Home Assistant running: `curl -sf http://10.10.0.109:8123/api/ -H 'Authorization: Bearer <token>'`
+- [ ] Switch workstation DNS back from `1.1.1.1` to Pi-hole
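The scriptable items in this checklist can be wrapped in a small pass/fail runner. A sketch under the assumption that each check is a single command whose exit status signals health; `check` and `FAILED` are hypothetical names, and the Discord and DNS-switch items remain manual.

```bash
#!/usr/bin/env bash
# Count of failed checks, usable as the script's exit status.
FAILED=0

# check NAME COMMAND [ARGS...]
# Runs COMMAND quietly and prints "PASS  NAME" or "FAIL  NAME".
check() {
  local name="$1"
  shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS  $name"
  else
    echo "FAIL  $name"
    FAILED=$((FAILED + 1))
  fi
}

# Example calls, using commands from the checklist above:
#   check "Gitea API"     curl -sf https://git.manticorum.com/api/v1/version
#   check "Pi-hole 1 DNS" ssh docker-home "docker exec pihole dig google.com @127.0.0.1"
#   exit "$FAILED"
```

Because every check runs regardless of earlier failures, one broken service does not hide another.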
+
+## Automation
+
+### Ansible Playbook
+
+Located at `/opt/ansible/playbooks/monthly-reboot.yml` on LXC 304.
+
+```bash
+# Dry run (check mode)
+ssh ansible "ansible-playbook /opt/ansible/playbooks/monthly-reboot.yml --check"
+
+# Manual execution
+ssh ansible "ansible-playbook /opt/ansible/playbooks/monthly-reboot.yml"
+
+# Limit to shutdown only (skip reboot)
+ssh ansible "ansible-playbook /opt/ansible/playbooks/monthly-reboot.yml --tags shutdown"
+```
+
+### Systemd Timer
+
+The playbook runs automatically via systemd timer on LXC 304:
+
+```bash
+# Check timer status
+ssh ansible "systemctl status ansible-monthly-reboot.timer"
+
+# Next scheduled run
+ssh ansible "systemctl list-timers ansible-monthly-reboot.timer"
+
+# Skip a month (e.g., during an incident); re-enable afterwards with `systemctl enable --now`
+ssh ansible "systemctl disable --now ansible-monthly-reboot.timer"
+```
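The timer's unit file is not shown here, so its actual calendar expression is unknown. As a belt-and-braces measure, the job itself could confirm that today really is a first Sunday before doing anything. A sketch assuming GNU `date`; `is_first_sunday` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# is_first_sunday YYYY-MM-DD
# Succeeds when the date is a Sunday falling on day 1-7 of its month.
# Returns 2 if the date cannot be parsed. Requires GNU date (-d, %-d).
is_first_sunday() {
  local dow dom
  dow=$(date -u -d "$1" +%u) || return 2   # 1 = Monday ... 7 = Sunday
  dom=$(date -u -d "$1" +%-d) || return 2  # day of month, no zero padding
  [ "$dow" -eq 7 ] && [ "$dom" -le 7 ]
}
```

A guard such as `is_first_sunday "$(date -u +%F)" || exit 0` at the top of the job would turn an accidental run on any other day into a no-op. On the systemd side, a calendar expression along the lines of `Sun *-*-01..07 08:00:00 UTC` should select the same days and can be previewed with `systemd-analyze calendar`.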
+
+## Rollback
+
+If a guest fails to start after reboot:
+1. Check Proxmox web UI or `pvesh get /nodes/proxmox/qemu/<VMID>/status/current` (for LXCs, replace `qemu` with `lxc`)
+2. Review guest logs: `ssh proxmox "journalctl -u pve-guests -n 50"`
+3. Manual start: `ssh proxmox "pvesh create /nodes/proxmox/qemu/<VMID>/status/start"` (same `lxc` substitution for containers)
+4. If guest is corrupted, restore from the pre-reboot Proxmox snapshot
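Since the guest list mixes VMs and LXCs, the API path in steps 1 and 3 depends on the guest type. A small sketch that builds the path for either kind; `guest_api_path` is a hypothetical helper and the node name `proxmox` is taken from the paths above.

```bash
#!/usr/bin/env bash
# Node name as it appears in the runbook's pvesh paths.
NODE="proxmox"

# guest_api_path KIND ID ACTION
# KIND is "vm" or "lxc"; ACTION is a status endpoint such as "current" or "start".
guest_api_path() {
  local kind="$1" id="$2" action="$3"
  case "$kind" in
    vm)  echo "/nodes/$NODE/qemu/$id/status/$action" ;;
    lxc) echo "/nodes/$NODE/lxc/$id/status/$action" ;;
    *)   echo "unknown guest type: $kind" >&2; return 1 ;;
  esac
}
```

For example, `ssh proxmox "pvesh create $(guest_api_path lxc 221 start)"` would start the arr-stack container, while `pvesh get` with the `current` action reads a guest's status.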
+
+## Related Documentation
+
+- [Ansible Controller Setup](../../vm-management/ansible-controller-setup.md) — LXC 304 details and inventory
+- [Proxmox 7→9 Upgrade Plan](../../vm-management/proxmox-upgrades/proxmox-7-to-9-upgrade-plan.md) — original startup order and Phase 1 lessons
+- [VM Decommission Runbook](../../vm-management/vm-decommission-runbook.md) — removing VMs from the rotation
--- a/server-configs/proxmox/qemu/115.conf
+++ b/server-configs/proxmox/qemu/115.conf
@ -12,5 +12,5 @@ ostype: l26
 scsi0: local-lvm:vm-115-disk-0,size=256G
 scsihw: virtio-scsi-pci
 smbios1: uuid=19be98ee-f60d-473d-acd2-9164717fcd11
-sockets: 2
+sockets: 1
 vmgenid: 682dfeab-8c63-4f0b-8ed2-8828c2f808ef
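The commit message describes this hunk as going from 16 vCPUs (2×8) to 8 vCPUs (1×8), that is, sockets multiplied by cores. A sketch that derives the figure from a config file; `vcpus` is a hypothetical helper, `cores: 8` is taken from the commit message because the hunk does not show that line, and missing keys fall back to 1 as an assumption about Proxmox defaults.

```bash
#!/usr/bin/env bash
# vcpus: read a Proxmox VM config on stdin and print sockets * cores.
# Keys that are absent default to 1 (assumed Proxmox default).
vcpus() {
  awk -F': *' '
    BEGIN { sockets = 1; cores = 1 }
    $1 == "sockets" { sockets = $2 }
    $1 == "cores"   { cores = $2 }
    END { print sockets * cores }'
}
```

Something like `ssh proxmox "qm config 115" | vcpus` could be used after the change to confirm the guest now reports 8.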
--- a/vm-management/proxmox-upgrades/proxmox-7-to-9-upgrade-plan.md
+++ b/vm-management/proxmox-upgrades/proxmox-7-to-9-upgrade-plan.md
@ -28,7 +28,7 @@ tags: [proxmox, upgrade, pve, backup, rollback, infrastructure]
 **Production Services** (7 LXC + 7 VMs) — cleaned up 2026-02-19:
 - **Critical**: Paper Dynasty/Major Domo (VM 115), Discord bots (VM 110), Gitea (LXC 225), n8n (LXC 210), Home Assistant (VM 109), Databases (VM 112), docker-home/Pi-hole 1 (VM 106)
 - **Important**: Claude Discord Coordinator (LXC 301), arr-stack (LXC 221), Uptime Kuma (LXC 227), Foundry VTT (LXC 223), Memos (LXC 222)
- **Stopped/Investigate**: docker-home-servers (VM 116, needs investigation)
+- **Decommission Candidate**: docker-home-servers (VM 116) — Jellyfin-only after 2026-04-03 cleanup; watchstate removed (duplicate of manticore); see issue #31
 - **Removed (2026-02-19)**: 108 (ansible), 224 (openclaw), 300 (openclaw-migrated), 101/102/104/111/211 (game servers), 107 (plex), 113 (tdarr - moved to .226), 114 (duplicate arr-stack), 117 (unused), 100/103 (old templates), 105 (docker-vpn - decommissioned 2026-04)

 **Key Constraints**:
--- a/workstation/troubleshooting.md
+++ b/workstation/troubleshooting.md
@ -0,0 +1,33 @@
+---
+title: "Workstation Troubleshooting"
+description: "Troubleshooting notes for Nobara/KDE Wayland workstation issues."
+type: troubleshooting
+domain: workstation
+tags: [troubleshooting, wayland, kde]
+---
+
+# Workstation Troubleshooting
+
+## Discord screen sharing shows no windows on KDE Wayland (2026-04-03)
+
+**Severity:** Medium — cannot share screen via Discord desktop app
+
+**Problem:** Clicking "Share Your Screen" in the Discord desktop app (v0.0.131, Electron 37) opens the Discord picker but lists zero windows or screens. The behavior is the same in the desktop app and the web app whenever Discord's own picker is used, and it affects both native Wayland and XWayland modes.
+
+**Root Cause:** Discord's built-in screen picker uses Electron's `desktopCapturer.getSources()` which relies on X11 window enumeration. On KDE Wayland:
+- In native Wayland mode: no X11 windows exist, so the picker is empty
+- In forced X11/XWayland mode (`ELECTRON_OZONE_PLATFORM_HINT=x11`): Discord can only see other XWayland windows (itself, Android emulator), not native Wayland apps
+- Discord ignores `--use-fake-ui-for-media-stream` and other Chromium flags that should force portal usage
+- The `discord-flags.conf` file is **not read** by the Nobara/RPM Discord package — flags must go in the `.desktop` file `Exec=` line
+
+**Fix:** Use **Discord web app in Firefox** for screen sharing. Firefox natively delegates to the XDG Desktop Portal via PipeWire, which shows the KDE screen picker with all windows. The desktop app's own picker remains broken on Wayland as of v0.0.131.
+
+Configuration applied (for general Discord Wayland support):
+- `~/.local/share/applications/discord.desktop` — overrides system `.desktop` with Wayland flags
+- `~/.config/discord-flags.conf` — created but not read by this Discord build
+
+**Lesson:**
+- Discord desktop on Linux Wayland cannot do screen sharing through its own picker — always use the web app in Firefox for this
+- Electron's `desktopCapturer` API is fundamentally X11-only; the PipeWire/portal path requires the app to use `getDisplayMedia()` instead, which Discord's desktop app does not do
+- `discord-flags.conf` is unreliable across distros — always verify flags landed in `/proc/<pid>/cmdline`
+- Vesktop (community client) is an alternative that properly implements portal-based screen sharing, if the web app is insufficient
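The advice to verify flags in `/proc/<pid>/cmdline` can be sketched as a small check. That file separates arguments with NUL bytes, so it has to be split before matching. `cmdline_has_flag` is a hypothetical helper, and the process name used to find the PID is an assumption.

```bash
#!/usr/bin/env bash
# cmdline_has_flag FILE FLAG
# FILE is a /proc/<pid>/cmdline-style file (arguments separated by NUL bytes).
# Succeeds only if FLAG appears as one complete argument.
cmdline_has_flag() {
  tr '\0' '\n' < "$1" | grep -qxF -- "$2"
}
```

A possible call is `cmdline_has_flag "/proc/$PID/cmdline" --use-fake-ui-for-media-stream`, with `$PID` found via something like `pgrep -o Discord`. Electron apps spawn several processes, so the main (oldest) one is usually the one to inspect.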
Author	SHA1	Message	Date
Cal Corum	ffb1eaef7f	feat: right-size VM 115 config and add --hosts flag to audit script. Reduce VM 115 (docker-sba) from 16 vCPUs (2×8) to 8 vCPUs (1×8) to match actual workload (0.06 load/core). Add --hosts flag to homelab-audit.sh for targeted post-change audits. Closes #18 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-03 16:03:39 -05:00
Cal Corum	64f299aa1a	docs: sync KB — maintenance-reboot.md	2026-04-03 16:00:22 -05:00
cal	a9a778f53c	Merge pull request 'feat: dynamic summary, --hosts filter, and --json output (#24)' (#38) from issue/24-homelab-audit-sh-dynamic-summary-and-hosts-filter into main	2026-04-03 20:22:24 +00:00
Cal Corum	1a3785f01a	feat: dynamic summary, --hosts filter, and --json output (#24). Closes #24 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-03 20:08:07 +00:00
cal	938240e1f9	Merge pull request 'fix: clean up VM 116 watchstate duplicate and document decommission candidacy (#31)' (#41) from issue/31-vm-116-resolve-watchstate-duplicate-and-clean-up-r into main. Reviewed-on: #41	2026-04-03 20:01:27 +00:00
Cal Corum	66143f6090	fix: clean up VM 116 watchstate duplicate and document decommission candidacy (#31). - Removed stopped watchstate container from VM 116 (duplicate of manticore's canonical instance) - Pruned 5 orphan images (watchstate, freetube, pihole, hello-world): 3.36 GB reclaimed - Confirmed manticore watchstate is healthy and syncing Jellyfin state - VM 116 now runs only Jellyfin (also runs on manticore) - Added VM 116 (docker-home-servers) to hosts.yml as decommission candidate - Updated proxmox-7-to-9-upgrade-plan.md status from Stopped/Investigate to Decommission Candidate Closes #31 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-03 20:01:13 +00:00
cal	13483157a9	Merge pull request 'feat: session resumption + Agent SDK evaluation' (#43) from feature/3-agent-sdk-improvements into main. Reviewed-on: #43	2026-04-03 20:00:12 +00:00
Cal Corum	e321e7bd47	feat: add session resumption and Agent SDK evaluation. - runner.sh: opt-in session persistence via session_resumable and resume_last_session settings; fix read_setting to normalize booleans - issue-poller.sh: capture and log session_id from worker invocations, include in result JSON - pr-reviewer-dispatcher.sh: capture and log session_id from reviews - n8n workflow: add --append-system-prompt to initial SSH node, add Follow Up Diagnostics node using --resume for deeper investigation, update Discord Alert with remediation details - Add Agent SDK evaluation doc (CLI vs Python/TS SDK comparison) - Update CONTEXT.md with session resumption documentation Closes #3 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-03 19:59:44 +00:00
cal	4e33e1cae3	Merge pull request 'fix: document per-core load threshold policy for health monitoring (#22)' (#42) from issue/22-tune-n8n-alert-thresholds-to-per-core-load-metrics into main	2026-04-03 18:36:14 +00:00
Cal Corum	193ae68f96	docs: document per-core load threshold policy for server health monitoring (#22). Closes #22 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-04-03 13:35:23 -05:00
Cal Corum	7c9c96eb52	docs: sync KB — troubleshooting.md	2026-04-03 12:00:22 -05:00