Compare commits

...

38 Commits

Author SHA1 Message Date
Cal Corum
1a83c863cb docs: sync KB — autonomous-nightly-2026-04-10-run2.md
All checks were successful
Reindex Knowledge Base / reindex (push) Successful in 2s
2026-04-10 12:00:47 -05:00
Cal Corum
1ab0bf27ed docs: sync KB — pr-reviewer-ai-reviewing-label-stuck.md
All checks were successful
Reindex Knowledge Base / reindex (push) Successful in 3s
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 10:38:45 -05:00
Cal Corum
87aeaf3309 docs: sync KB — autonomous-nightly-2026-04-10.md,autonomous-pipeline-session-2026-04-10.md 2026-04-10 10:38:45 -05:00
Cal Corum
8d165efbe6 docs: sync KB — 2026-04-08-home-network-review.md,2026-04-08-home-network-review-design.md 2026-04-10 10:38:45 -05:00
Cal Corum
a307e4dcb7 docs: sync KB — database-release-2026.4.7.md,release-2026.4.7.md 2026-04-10 10:38:45 -05:00
Cal Corum
d3b9e43016 docs: sync KB — backlog-triage-sandbox-fix.md 2026-04-10 10:38:45 -05:00
Cal Corum
92c5ce0ebb docs: sync KB — apcupsd-ups-monitoring.md,llama-cpp-setup.md 2026-04-10 10:38:45 -05:00
Cal Corum
ffb036042c docs: sync KB — claude-code-multi-account.md 2026-04-10 10:38:45 -05:00
cal
d34bc01305 Merge pull request 'feat: right-size VM 115 (docker-sba) 16→8 vCPUs' (#44) from enhancement/18-rightsize-vm115-vcpus into main
Reviewed-on: #44
Reviewed-by: Claude <cal.corum+openclaw@gmail.com>
2026-04-06 15:41:34 +00:00
Cal Corum
01e6302709 feat: right-size VM 115 config and add --hosts flag to audit script
All checks were successful
Auto-merge docs-only PRs / auto-merge-docs (pull_request) Successful in 1s
Reduce VM 115 (docker-sba) from 16 vCPUs (2×8) to 8 vCPUs (1×8) to
match actual workload (0.06 load/core). Add --hosts flag to
homelab-audit.sh for targeted post-change audits.

Closes #18

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-06 15:41:16 +00:00
cal
024aea82c4 Merge pull request 'feat: add monthly Docker prune cron Ansible playbook (#29)' (#45) from issue/29-docker-image-prune-cron-on-all-docker-hosts into main
Reviewed-on: #45
2026-04-06 15:41:04 +00:00
Cal Corum
d4ee899c1d feat: add monthly Docker prune cron Ansible playbook (#29)
All checks were successful
Auto-merge docs-only PRs / auto-merge-docs (pull_request) Successful in 1s
Closes #29

Deploys /etc/cron.monthly/docker-prune to all six Docker hosts via
Ansible. Uses a 720h (30-day) age filter on containers and images;
`keep`-labeled volumes are exempt from volume pruning.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 15:40:33 +00:00
cal
d7987a90ff Merge pull request 'docs: right-size VM 106 (docker-home) — 16 GB/8 vCPU → 6 GB/4 vCPU (#19)' (#47) from issue/19-right-size-vm-106-docker-home-16-gb-6-8-gb-ram into main
All checks were successful
Reindex Knowledge Base / reindex (push) Successful in 2s
Reviewed-on: #47
2026-04-06 15:40:20 +00:00
cal
5b23d92435 Merge branch 'main' into issue/19-right-size-vm-106-docker-home-16-gb-6-8-gb-ram
All checks were successful
Auto-merge docs-only PRs / auto-merge-docs (pull_request) Successful in 3s
2026-04-06 15:40:07 +00:00
cal
29238f3ddf Merge pull request 'feat: weekly Proxmox backup verification → Discord (#27)' (#48) from issue/27-set-up-weekly-proxmox-backup-verification-discord into main
All checks were successful
Reindex Knowledge Base / reindex (push) Successful in 3s
Reviewed-on: #48
2026-04-06 15:39:53 +00:00
Cal Corum
dd7c68c13a docs: sync KB — discord-browser-testing-workflow.md 2026-04-06 02:00:38 -05:00
Cal Corum
acb8fef084 docs: sync KB — database-deployment-guide.md,refractor-in-app-test-plan.md 2026-04-06 00:00:03 -05:00
Cal Corum
cacf4a9043 feat: add weekly Gitea disk cleanup Ansible playbook
Gitea LXC 225 hit 100% disk from accumulated Docker buildx volumes,
repo-archive cache, and journal logs. Adds automated weekly cleanup
managed by systemd timer on the Ansible controller (Wed 04:00 UTC).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-05 19:24:59 -05:00
Cal Corum
95bae33309 feat: add weekly Proxmox backup verification and CT 302 self-health check (#27)
All checks were successful
Auto-merge docs-only PRs / auto-merge-docs (pull_request) Successful in 2s
Closes #27

- proxmox-backup-check.sh: SSHes to Proxmox, queries pvesh task history,
  classifies each running VM/CT as green/yellow/red by backup recency,
  posts a Discord embed summary. Designed for weekly cron on CT 302.

- ct302-self-health.sh: Checks disk usage on CT 302 itself, silently
  exits when healthy, posts a Discord alert when any filesystem exceeds
  80% threshold. Closes the blind spot where the monitoring system
  cannot monitor itself externally.

- Updated monitoring/scripts/CONTEXT.md with full operational docs,
  install instructions, and cron schedules for both new scripts.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 06:07:57 -05:00
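proxmox-backup-check.sh itself is not part of this diff; as a sketch of the green/yellow/red recency classification the commit describes, with hypothetical thresholds (the script's actual cutoffs are not shown in this log):

```python
from datetime import datetime, timedelta, timezone

# Hypothetical thresholds; the real script's cutoffs are not shown here.
GREEN_MAX = timedelta(hours=24)   # backed up within the last day
YELLOW_MAX = timedelta(hours=48)  # stale but not yet alarming

def classify_backup(last_backup: datetime, now: datetime) -> str:
    """Classify a guest by how recently its last successful backup ran."""
    age = now - last_backup
    if age <= GREEN_MAX:
        return "green"
    if age <= YELLOW_MAX:
        return "yellow"
    return "red"

now = datetime(2026, 4, 4, 6, 0, tzinfo=timezone.utc)
print(classify_backup(now - timedelta(hours=3), now))   # → green
print(classify_backup(now - timedelta(hours=30), now))  # → yellow
print(classify_backup(now - timedelta(days=5), now))    # → red
```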
Cal Corum
29a20fbe06 feat: add monthly Proxmox maintenance reboot automation (#26)
All checks were successful
Reindex Knowledge Base / reindex (push) Successful in 2s
Establishes a first-Sunday-of-the-month maintenance window orchestrated
by Ansible on LXC 304. Split into two playbooks to handle the self-reboot
paradox (the controller is a guest on the host being rebooted):

- monthly-reboot.yml: snapshots, tiered shutdown with per-guest polling,
  fire-and-forget host reboot
- post-reboot-startup.yml: controlled tiered startup with staggered delays,
  Pi-hole UDP DNS fix, validation, and snapshot cleanup

Also fixes onboot:1 on VM 109, LXC 221, LXC 223 and creates a recurring
Google Calendar event for the maintenance window.

Closes #26

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 23:33:59 -05:00
Cal Corum
9b47f0c027 docs: right-size VM 106 (docker-home) — 16 GB/8 vCPU → 6 GB/4 vCPU (#19)
All checks were successful
Auto-merge docs-only PRs / auto-merge-docs (pull_request) Successful in 5s
Pre-checks confirmed safe to right-size: no container --memory limits,
no Docker Compose memory reservations. Live usage 1.1 GB / 15 GB (7%).

- Update 106.conf: memory 16384 → 6144, sockets 2 → 1 (8 → 4 vCPUs)
- Add right-sizing-vm-106.md runbook with pre-check results and resize commands

Closes #19

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 23:05:43 -05:00
cal
fdc44acb28 Merge pull request 'chore: add --hosts test coverage and right-size VM 115 socket config' (#46) from chore/26-proxmox-monthly-maintenance-reboot into main 2026-04-04 00:35:31 +00:00
Cal Corum
48a804dda2 feat: right-size VM 115 config and add --hosts flag to audit script
All checks were successful
Auto-merge docs-only PRs / auto-merge-docs (pull_request) Successful in 2s
Reduce VM 115 (docker-sba) from 16 vCPUs (2×8) to 8 vCPUs (1×8) to
match actual workload (0.06 load/core). Add --hosts flag to
homelab-audit.sh for targeted post-change audits.

Closes #18

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 17:33:01 -05:00
Cal Corum
7a0c264f27 feat: add monthly Proxmox maintenance reboot automation (#26)
Establishes a first-Sunday-of-the-month maintenance window orchestrated
by Ansible on LXC 304. Split into two playbooks to handle the self-reboot
paradox (the controller is a guest on the host being rebooted):

- monthly-reboot.yml: snapshots, tiered shutdown with per-guest polling,
  fire-and-forget host reboot
- post-reboot-startup.yml: controlled tiered startup with staggered delays,
  Pi-hole UDP DNS fix, validation, and snapshot cleanup

Also fixes onboot:1 on VM 109, LXC 221, LXC 223 and creates a recurring
Google Calendar event for the maintenance window.

Closes #26

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 16:17:55 -05:00
Cal Corum
64f299aa1a docs: sync KB — maintenance-reboot.md
All checks were successful
Reindex Knowledge Base / reindex (push) Successful in 2s
2026-04-03 16:00:22 -05:00
cal
a9a778f53c Merge pull request 'feat: dynamic summary, --hosts filter, and --json output (#24)' (#38) from issue/24-homelab-audit-sh-dynamic-summary-and-hosts-filter into main 2026-04-03 20:22:24 +00:00
Cal Corum
1a3785f01a feat: dynamic summary, --hosts filter, and --json output (#24)
All checks were successful
Auto-merge docs-only PRs / auto-merge-docs (pull_request) Successful in 2s
Closes #24

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 20:08:07 +00:00
cal
938240e1f9 Merge pull request 'fix: clean up VM 116 watchstate duplicate and document decommission candidacy (#31)' (#41) from issue/31-vm-116-resolve-watchstate-duplicate-and-clean-up-r into main
All checks were successful
Reindex Knowledge Base / reindex (push) Successful in 1s
Reviewed-on: #41
2026-04-03 20:01:27 +00:00
Cal Corum
66143f6090 fix: clean up VM 116 watchstate duplicate and document decommission candidacy (#31)
All checks were successful
Auto-merge docs-only PRs / auto-merge-docs (pull_request) Successful in 2s
- Removed stopped watchstate container from VM 116 (duplicate of manticore's canonical instance)
- Pruned 5 orphan images (watchstate, freetube, pihole, hello-world): 3.36 GB reclaimed
- Confirmed manticore watchstate is healthy and syncing Jellyfin state
- VM 116 now runs only Jellyfin (also runs on manticore)
- Added VM 116 (docker-home-servers) to hosts.yml as decommission candidate
- Updated proxmox-7-to-9-upgrade-plan.md status from Stopped/Investigate to Decommission Candidate

Closes #31

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 20:01:13 +00:00
cal
13483157a9 Merge pull request 'feat: session resumption + Agent SDK evaluation' (#43) from feature/3-agent-sdk-improvements into main
All checks were successful
Reindex Knowledge Base / reindex (push) Successful in 3s
Reviewed-on: #43
2026-04-03 20:00:12 +00:00
Cal Corum
e321e7bd47 feat: add session resumption and Agent SDK evaluation
All checks were successful
Auto-merge docs-only PRs / auto-merge-docs (pull_request) Successful in 2s
- runner.sh: opt-in session persistence via session_resumable and
  resume_last_session settings; fix read_setting to normalize booleans
- issue-poller.sh: capture and log session_id from worker invocations,
  include in result JSON
- pr-reviewer-dispatcher.sh: capture and log session_id from reviews
- n8n workflow: add --append-system-prompt to initial SSH node, add
  Follow Up Diagnostics node using --resume for deeper investigation,
  update Discord Alert with remediation details
- Add Agent SDK evaluation doc (CLI vs Python/TS SDK comparison)
- Update CONTEXT.md with session resumption documentation

Closes #3

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 19:59:44 +00:00
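The read_setting boolean fix mentioned above is not included in this diff; runner.sh is Bash, but the normalization it describes amounts to mapping the usual truthy spellings onto one canonical value, sketched here in Python (function name and accepted spellings are assumptions):

```python
def normalize_bool(value, default=False):
    """Map common boolean spellings (true/yes/1/on, case-insensitive) to bool."""
    if value is None:
        return default
    text = str(value).strip().lower()
    if text in {"true", "yes", "1", "on"}:
        return True
    if text in {"false", "no", "0", "off", ""}:
        return False
    return default  # unrecognized spelling: fall back rather than guess

print(normalize_bool("True"))  # → True
print(normalize_bool("no"))    # → False
```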
cal
4e33e1cae3 Merge pull request 'fix: document per-core load threshold policy for health monitoring (#22)' (#42) from issue/22-tune-n8n-alert-thresholds-to-per-core-load-metrics into main
All checks were successful
Reindex Knowledge Base / reindex (push) Successful in 2s
2026-04-03 18:36:14 +00:00
Cal Corum
193ae68f96 docs: document per-core load threshold policy for server health monitoring (#22)
All checks were successful
Auto-merge docs-only PRs / auto-merge-docs (pull_request) Successful in 5s
Closes #22

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 13:35:23 -05:00
Cal Corum
7c9c96eb52 docs: sync KB — troubleshooting.md
All checks were successful
Reindex Knowledge Base / reindex (push) Successful in 3s
2026-04-03 12:00:22 -05:00
cal
a8c85a8d91 Merge pull request 'chore: decommission VM 105 (docker-vpn) — repo cleanup' (#40) from chore/20-decommission-vm-105-docker-vpn into main
Some checks failed
Reindex Knowledge Base / reindex (push) Failing after 17s
2026-04-03 12:56:43 +00:00
Cal Corum
9e8346a8ab chore: decommission VM 105 (docker-vpn) — repo cleanup (#20)
All checks were successful
Auto-merge docs-only PRs / auto-merge-docs (pull_request) Successful in 2s
VM 105 was already destroyed on Proxmox. This removes stale references:
- Delete server-configs/proxmox/qemu/105.conf
- Comment out docker-vpn entries in example SSH config and server inventory
- Move VM 105 from Stopped/Investigate to Removed in upgrade plan
- Check off decommission task in wave2 migration results

Closes #20

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 23:57:55 -05:00
Cal Corum
4234351cfa feat: add Ansible playbook to mask avahi-daemon on all Ubuntu VMs (#28)
All checks were successful
Auto-merge docs-only PRs / auto-merge-docs (pull_request) Successful in 2s
Closes #28

Adds mask-avahi.yml targeting the vms:physical inventory groups (all
Ubuntu QEMU VMs + ubuntu-manticore). Also adds avahi masking to the
cloud-init template so future VMs are hardened from first boot.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-02 23:32:47 -05:00
Cal Corum
a97f443f60 docs: sync KB — vm-decommission-runbook.md
All checks were successful
Reindex Knowledge Base / reindex (push) Successful in 3s
2026-04-02 22:00:04 -05:00
45 changed files with 4910 additions and 56 deletions

docker-prune.yml

@@ -0,0 +1,55 @@
---
# Monthly Docker Prune — Deploy Cleanup Cron to All Docker Hosts
#
# Deploys /etc/cron.monthly/docker-prune to each VM running Docker.
# The script prunes stopped containers and unused images older than 30 days
# (720h), plus orphaned volumes. Volumes labeled `keep` are exempt.
#
# Resolves accumulated disk waste from stopped containers and stale images.
# The `--filter "until=720h"` age gate prevents removing recently-pulled
# images that haven't started yet. `docker image prune -a` only removes
# images not referenced by any container (running or stopped), so the
# age filter adds an extra safety margin.
#
# Hosts: VM 106 (docker-home), VM 110 (discord-bots), VM 112 (databases-bots),
# VM 115 (docker-sba), VM 116 (docker-home-servers), manticore
#
# Controller: LXC 304 (ansible-controller) at 10.10.0.232
#
# Usage:
# # Dry run (shows what would change, skips writes)
# ansible-playbook /opt/ansible/playbooks/docker-prune.yml --check
#
# # Single host
# ansible-playbook /opt/ansible/playbooks/docker-prune.yml --limit docker-sba
#
# # All Docker hosts
# ansible-playbook /opt/ansible/playbooks/docker-prune.yml
#
# To undo: rm /etc/cron.monthly/docker-prune on target hosts
- name: Deploy Docker monthly prune cron to all Docker hosts
  hosts: docker-home:discord-bots:databases-bots:docker-sba:docker-home-servers:manticore
  become: true
  tasks:
    - name: Deploy docker-prune cron script
      ansible.builtin.copy:
        dest: /etc/cron.monthly/docker-prune
        owner: root
        group: root
        mode: "0755"
        content: |
          #!/bin/bash
          # Monthly Docker cleanup — deployed by Ansible (issue #29)
          # Prunes stopped containers, unused images (>30 days), and orphaned volumes.
          # Volumes labeled `keep` are exempt from volume pruning.
          set -euo pipefail
          docker container prune -f --filter "until=720h"
          docker image prune -a -f --filter "until=720h"
          docker volume prune -f --filter "label!=keep"
    - name: Verify docker-prune script is executable
      ansible.builtin.command: test -x /etc/cron.monthly/docker-prune
      changed_when: false
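The `until=720h` filters above act as an age gate: only containers and images created more than 720 hours (30 days) before the run are prune candidates, which is what protects recently-pulled images. The rule in isolation, applied to made-up image records:

```python
from datetime import datetime, timedelta

PRUNE_AGE = timedelta(hours=720)  # mirrors --filter "until=720h"

def prunable(created: datetime, now: datetime) -> bool:
    """True when an object is old enough to pass the until=720h gate."""
    return now - created > PRUNE_AGE

now = datetime(2026, 4, 6)
images = {
    "app:old": now - timedelta(days=45),  # hypothetical stale image
    "app:new": now - timedelta(days=2),   # recently pulled, kept
}
candidates = [name for name, created in images.items() if prunable(created, now)]
print(candidates)  # → ['app:old']
```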

gitea-cleanup.yml

@@ -0,0 +1,80 @@
---
# gitea-cleanup.yml — Weekly cleanup of Gitea server disk space
#
# Removes stale Docker buildx volumes, unused images, Gitea repo-archive
# cache, and vacuums journal logs to prevent disk exhaustion on LXC 225.
#
# Schedule: Weekly via systemd timer on LXC 304 (ansible-controller)
#
# Usage:
# ansible-playbook /opt/ansible/playbooks/gitea-cleanup.yml # full run
# ansible-playbook /opt/ansible/playbooks/gitea-cleanup.yml --check # dry run
- name: Gitea server disk cleanup
  hosts: gitea
  gather_facts: false
  tasks:
    - name: Check current disk usage
      ansible.builtin.shell: df --output=pcent / | tail -1
      register: disk_before
      changed_when: false
    - name: Display current disk usage
      ansible.builtin.debug:
        msg: "Disk usage before cleanup: {{ disk_before.stdout | trim }}"
    - name: Clear Gitea repo-archive cache
      ansible.builtin.find:
        paths: /var/lib/gitea/data/repo-archive
        file_type: any
      register: repo_archive_files
    - name: Remove repo-archive files
      ansible.builtin.file:
        path: "{{ item.path }}"
        state: absent
      loop: "{{ repo_archive_files.files }}"
      loop_control:
        label: "{{ item.path | basename }}"
      when: repo_archive_files.files | length > 0
    - name: Remove orphaned Docker buildx volumes
      ansible.builtin.shell: |
        volumes=$(docker volume ls -q --filter name=buildx_buildkit)
        if [ -n "$volumes" ]; then
          echo "$volumes" | xargs docker volume rm 2>&1
        else
          echo "No buildx volumes to remove"
        fi
      register: buildx_cleanup
      changed_when: "'No buildx volumes' not in buildx_cleanup.stdout"
    - name: Prune unused Docker images
      ansible.builtin.command: docker image prune -af
      register: image_prune
      changed_when: "'Total reclaimed space: 0B' not in image_prune.stdout"
    - name: Prune unused Docker volumes
      ansible.builtin.command: docker volume prune -f
      register: volume_prune
      changed_when: "'Total reclaimed space: 0B' not in volume_prune.stdout"
    - name: Vacuum journal logs to 500M
      ansible.builtin.command: journalctl --vacuum-size=500M
      register: journal_vacuum
      changed_when: "'freed 0B' not in journal_vacuum.stderr"
    - name: Check disk usage after cleanup
      ansible.builtin.shell: df --output=pcent / | tail -1
      register: disk_after
      changed_when: false
    - name: Display cleanup summary
      ansible.builtin.debug:
        msg: >-
          Cleanup complete.
          Disk: {{ disk_before.stdout | default('N/A') | trim }} → {{ disk_after.stdout | default('N/A') | trim }}.
          Buildx: {{ (buildx_cleanup.stdout_lines | default(['N/A'])) | last }}.
          Images: {{ (image_prune.stdout_lines | default(['N/A'])) | last }}.
          Journal: {{ (journal_vacuum.stderr_lines | default(['N/A'])) | last }}.
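The `changed_when` expressions above derive idempotence from Docker's own output: a prune counts as changed unless it reports zero reclaimed space. The predicate in isolation, applied to sample prune output:

```python
def prune_changed(output: str) -> bool:
    """Report 'changed' unless docker says nothing was reclaimed."""
    return "Total reclaimed space: 0B" not in output

# Sample outputs in the shape docker prune commands print.
print(prune_changed("Deleted Images:\nsha256:abc...\nTotal reclaimed space: 1.2GB"))  # → True
print(prune_changed("Total reclaimed space: 0B"))  # → False
```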

mask-avahi.yml

@@ -0,0 +1,43 @@
---
# Mask avahi-daemon on all Ubuntu hosts
#
# Avahi (mDNS/Bonjour) is not needed in a static-IP homelab with Pi-hole DNS.
# A kernel busy-loop bug in avahi-daemon was found consuming ~1.7 CPU cores
# across 5 VMs. Masking prevents it from ever starting again, surviving reboots.
#
# Targets: vms + physical (all Ubuntu QEMU VMs and ubuntu-manticore)
# Controller: ansible-controller (LXC 304 at 10.10.0.232)
#
# Usage:
# # Dry run
# ansible-playbook /opt/ansible/playbooks/mask-avahi.yml --check
#
# # Test on a single host first
# ansible-playbook /opt/ansible/playbooks/mask-avahi.yml --limit discord-bots
#
# # Roll out to all Ubuntu hosts
# ansible-playbook /opt/ansible/playbooks/mask-avahi.yml
#
# To undo: systemctl unmask avahi-daemon
- name: Mask avahi-daemon on all Ubuntu hosts
  hosts: vms:physical
  become: true
  tasks:
    - name: Stop avahi-daemon
      ansible.builtin.systemd:
        name: avahi-daemon
        state: stopped
      ignore_errors: true
    - name: Mask avahi-daemon
      ansible.builtin.systemd:
        name: avahi-daemon
        masked: true
    - name: Verify avahi is masked
      ansible.builtin.command: systemctl is-enabled avahi-daemon
      register: avahi_status
      changed_when: false
      failed_when: avahi_status.stdout | trim != 'masked'

monthly-reboot.yml

@@ -0,0 +1,265 @@
---
# Monthly Proxmox Maintenance Reboot — Shutdown & Reboot
#
# Orchestrates a graceful shutdown of all guests in dependency order,
# then issues a fire-and-forget reboot to the Proxmox host.
#
# After the host reboots, LXC 304 auto-starts via onboot:1 and the
# post-reboot-startup.yml playbook runs automatically via the
# ansible-post-reboot.service systemd unit (triggered by @reboot).
#
# Schedule: 1st Sunday of each month, 08:00 UTC (3 AM ET)
# Controller: LXC 304 (ansible-controller) at 10.10.0.232
#
# Usage:
# # Dry run
# ansible-playbook /opt/ansible/playbooks/monthly-reboot.yml --check
#
# # Full execution
# ansible-playbook /opt/ansible/playbooks/monthly-reboot.yml
#
# # Shutdown only (skip the host reboot)
# ansible-playbook /opt/ansible/playbooks/monthly-reboot.yml --tags shutdown
#
# Note: VM 109 (homeassistant) is excluded from Ansible inventory
# (self-managed via HA Supervisor) but is included in pvesh start/stop.
- name: Pre-reboot health check and snapshots
  hosts: pve-node
  gather_facts: false
  tags: [pre-reboot, shutdown]
  tasks:
    - name: Check Proxmox cluster health
      ansible.builtin.command: pvesh get /cluster/status --output-format json
      register: cluster_status
      changed_when: false
    - name: Get list of running QEMU VMs
      ansible.builtin.shell: >
        pvesh get /nodes/proxmox/qemu --output-format json |
        python3 -c "import sys,json; [print(vm['vmid']) for vm in json.load(sys.stdin) if vm.get('status')=='running']"
      register: running_vms
      changed_when: false
    - name: Get list of running LXC containers
      ansible.builtin.shell: >
        pvesh get /nodes/proxmox/lxc --output-format json |
        python3 -c "import sys,json; [print(ct['vmid']) for ct in json.load(sys.stdin) if ct.get('status')=='running']"
      register: running_lxcs
      changed_when: false
    - name: Display running guests
      ansible.builtin.debug:
        msg: "Running VMs: {{ running_vms.stdout_lines }} | Running LXCs: {{ running_lxcs.stdout_lines }}"
    - name: Snapshot running VMs
      ansible.builtin.command: >
        pvesh create /nodes/proxmox/qemu/{{ item }}/snapshot
        --snapname pre-maintenance-{{ lookup('pipe', 'date +%Y-%m-%d') }}
        --description "Auto snapshot before monthly maintenance reboot"
      loop: "{{ running_vms.stdout_lines }}"
      when: running_vms.stdout_lines | length > 0
      ignore_errors: true
    - name: Snapshot running LXCs
      ansible.builtin.command: >
        pvesh create /nodes/proxmox/lxc/{{ item }}/snapshot
        --snapname pre-maintenance-{{ lookup('pipe', 'date +%Y-%m-%d') }}
        --description "Auto snapshot before monthly maintenance reboot"
      loop: "{{ running_lxcs.stdout_lines }}"
      when: running_lxcs.stdout_lines | length > 0
      ignore_errors: true

- name: "Shutdown Tier 4 — Media & Others"
  hosts: pve-node
  gather_facts: false
  tags: [shutdown]
  vars:
    tier4_vms: [109]
    # LXC 303 (mcp-gateway) is onboot=0 and operator-managed — not included here
    tier4_lxcs: [221, 222, 223, 302]
  tasks:
    - name: Shutdown Tier 4 VMs
      ansible.builtin.command: pvesh create /nodes/proxmox/qemu/{{ item }}/status/shutdown
      loop: "{{ tier4_vms }}"
      ignore_errors: true
    - name: Shutdown Tier 4 LXCs
      ansible.builtin.command: pvesh create /nodes/proxmox/lxc/{{ item }}/status/shutdown
      loop: "{{ tier4_lxcs }}"
      ignore_errors: true
    - name: Wait for Tier 4 VMs to stop
      ansible.builtin.shell: >
        pvesh get /nodes/proxmox/qemu/{{ item }}/status/current --output-format json |
        python3 -c "import sys,json; print(json.load(sys.stdin).get('status','unknown'))"
      register: t4_vm_status
      until: t4_vm_status.stdout.strip() == "stopped"
      retries: 12
      delay: 5
      loop: "{{ tier4_vms }}"
      ignore_errors: true
    - name: Wait for Tier 4 LXCs to stop
      ansible.builtin.shell: >
        pvesh get /nodes/proxmox/lxc/{{ item }}/status/current --output-format json |
        python3 -c "import sys,json; print(json.load(sys.stdin).get('status','unknown'))"
      register: t4_lxc_status
      until: t4_lxc_status.stdout.strip() == "stopped"
      retries: 12
      delay: 5
      loop: "{{ tier4_lxcs }}"
      ignore_errors: true

- name: "Shutdown Tier 3 — Applications"
  hosts: pve-node
  gather_facts: false
  tags: [shutdown]
  vars:
    tier3_vms: [115, 110]
    tier3_lxcs: [301]
  tasks:
    - name: Shutdown Tier 3 VMs
      ansible.builtin.command: pvesh create /nodes/proxmox/qemu/{{ item }}/status/shutdown
      loop: "{{ tier3_vms }}"
      ignore_errors: true
    - name: Shutdown Tier 3 LXCs
      ansible.builtin.command: pvesh create /nodes/proxmox/lxc/{{ item }}/status/shutdown
      loop: "{{ tier3_lxcs }}"
      ignore_errors: true
    - name: Wait for Tier 3 VMs to stop
      ansible.builtin.shell: >
        pvesh get /nodes/proxmox/qemu/{{ item }}/status/current --output-format json |
        python3 -c "import sys,json; print(json.load(sys.stdin).get('status','unknown'))"
      register: t3_vm_status
      until: t3_vm_status.stdout.strip() == "stopped"
      retries: 12
      delay: 5
      loop: "{{ tier3_vms }}"
      ignore_errors: true
    - name: Wait for Tier 3 LXCs to stop
      ansible.builtin.shell: >
        pvesh get /nodes/proxmox/lxc/{{ item }}/status/current --output-format json |
        python3 -c "import sys,json; print(json.load(sys.stdin).get('status','unknown'))"
      register: t3_lxc_status
      until: t3_lxc_status.stdout.strip() == "stopped"
      retries: 12
      delay: 5
      loop: "{{ tier3_lxcs }}"
      ignore_errors: true

- name: "Shutdown Tier 2 — Infrastructure"
  hosts: pve-node
  gather_facts: false
  tags: [shutdown]
  vars:
    tier2_vms: [106, 116]
    tier2_lxcs: [225, 210, 227]
  tasks:
    - name: Shutdown Tier 2 VMs
      ansible.builtin.command: pvesh create /nodes/proxmox/qemu/{{ item }}/status/shutdown
      loop: "{{ tier2_vms }}"
      ignore_errors: true
    - name: Shutdown Tier 2 LXCs
      ansible.builtin.command: pvesh create /nodes/proxmox/lxc/{{ item }}/status/shutdown
      loop: "{{ tier2_lxcs }}"
      ignore_errors: true
    - name: Wait for Tier 2 VMs to stop
      ansible.builtin.shell: >
        pvesh get /nodes/proxmox/qemu/{{ item }}/status/current --output-format json |
        python3 -c "import sys,json; print(json.load(sys.stdin).get('status','unknown'))"
      register: t2_vm_status
      until: t2_vm_status.stdout.strip() == "stopped"
      retries: 12
      delay: 5
      loop: "{{ tier2_vms }}"
      ignore_errors: true
    - name: Wait for Tier 2 LXCs to stop
      ansible.builtin.shell: >
        pvesh get /nodes/proxmox/lxc/{{ item }}/status/current --output-format json |
        python3 -c "import sys,json; print(json.load(sys.stdin).get('status','unknown'))"
      register: t2_lxc_status
      until: t2_lxc_status.stdout.strip() == "stopped"
      retries: 12
      delay: 5
      loop: "{{ tier2_lxcs }}"
      ignore_errors: true

- name: "Shutdown Tier 1 — Databases"
  hosts: pve-node
  gather_facts: false
  tags: [shutdown]
  vars:
    tier1_vms: [112]
  tasks:
    - name: Shutdown database VMs
      ansible.builtin.command: pvesh create /nodes/proxmox/qemu/{{ item }}/status/shutdown
      loop: "{{ tier1_vms }}"
      ignore_errors: true
    - name: Wait for database VMs to stop (up to 90s)
      ansible.builtin.shell: >
        pvesh get /nodes/proxmox/qemu/{{ item }}/status/current --output-format json |
        python3 -c "import sys,json; print(json.load(sys.stdin).get('status','unknown'))"
      register: t1_vm_status
      until: t1_vm_status.stdout.strip() == "stopped"
      retries: 18
      delay: 5
      loop: "{{ tier1_vms }}"
      ignore_errors: true
    - name: Force stop database VMs if still running
      ansible.builtin.shell: >
        status=$(pvesh get /nodes/proxmox/qemu/{{ item }}/status/current --output-format json |
        python3 -c "import sys,json; print(json.load(sys.stdin).get('status','unknown'))");
        if [ "$status" = "running" ]; then
        pvesh create /nodes/proxmox/qemu/{{ item }}/status/stop;
        echo "Force stopped VM {{ item }}";
        else
        echo "VM {{ item }} already stopped";
        fi
      loop: "{{ tier1_vms }}"
      register: force_stop_result
      changed_when: "'Force stopped' in force_stop_result.stdout"

- name: "Verify and reboot Proxmox host"
  hosts: pve-node
  gather_facts: false
  tags: [reboot]
  tasks:
    - name: Verify all guests are stopped (excluding LXC 304)
      ansible.builtin.shell: >
        running_vms=$(pvesh get /nodes/proxmox/qemu --output-format json |
        python3 -c "import sys,json; vms=[v for v in json.load(sys.stdin) if v.get('status')=='running']; print(len(vms))");
        running_lxcs=$(pvesh get /nodes/proxmox/lxc --output-format json |
        python3 -c "import sys,json; cts=[c for c in json.load(sys.stdin) if c.get('status')=='running' and c['vmid'] != 304]; print(len(cts))");
        echo "Running VMs: $running_vms, Running LXCs: $running_lxcs";
        if [ "$running_vms" != "0" ] || [ "$running_lxcs" != "0" ]; then exit 1; fi
      register: verify_stopped
    - name: Issue fire-and-forget reboot (controller will be killed)
      ansible.builtin.shell: >
        nohup bash -c 'sleep 10 && reboot' &>/dev/null &
        echo "Reboot scheduled in 10 seconds"
      register: reboot_issued
      when: not ansible_check_mode
    - name: Log reboot issued
      ansible.builtin.debug:
        msg: "{{ reboot_issued.stdout | default('Reboot skipped (check mode)') }} — Ansible process will terminate when host reboots. Post-reboot startup handled by ansible-post-reboot.service on LXC 304."
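Every "Wait for … to stop" task above is the same poll loop: read the guest status, succeed on `stopped`, otherwise retry with a fixed delay until the attempts run out. A stubbed sketch of that control flow (the attempt accounting here is approximate; Ansible's own retries bookkeeping may differ by one):

```python
import time

def wait_for_stopped(get_status, retries=12, delay=5, sleep=time.sleep):
    """Poll get_status() until it returns 'stopped'; mirrors the retries/delay pattern above."""
    for attempt in range(retries):
        if get_status() == "stopped":
            return True
        if attempt < retries - 1:
            sleep(delay)  # fixed back-off between polls, like `delay: 5`
    return False

# Stub: guest reports 'running' twice, then 'stopped'.
states = iter(["running", "running", "stopped"])
print(wait_for_stopped(lambda: next(states), sleep=lambda _: None))  # → True
```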

post-reboot-startup.yml

@@ -0,0 +1,214 @@
---
# Post-Reboot Startup — Controlled Guest Startup After Proxmox Reboot
#
# Starts all guests in dependency order with staggered delays to avoid
# I/O storms. Runs automatically via ansible-post-reboot.service on
# LXC 304 after the Proxmox host reboots.
#
# Can also be run manually:
# ansible-playbook /opt/ansible/playbooks/post-reboot-startup.yml
#
# Note: VM 109 (homeassistant) is excluded from Ansible inventory
# (self-managed via HA Supervisor) but is included in pvesh start/stop.
- name: Wait for Proxmox API to be ready
hosts: pve-node
gather_facts: false
tags: [startup]
tasks:
- name: Wait for Proxmox API
ansible.builtin.command: pvesh get /version --output-format json
register: pve_version
until: pve_version.rc == 0
retries: 30
delay: 10
changed_when: false
- name: Display Proxmox version
ansible.builtin.debug:
msg: "Proxmox API ready: {{ pve_version.stdout | from_json | json_query('version') | default('unknown') }}"
- name: "Startup Tier 1 — Databases"
hosts: pve-node
gather_facts: false
tags: [startup]
tasks:
- name: Start database VM (112)
ansible.builtin.command: pvesh create /nodes/proxmox/qemu/112/status/start
ignore_errors: true
- name: Wait for VM 112 to be running
ansible.builtin.shell: >
pvesh get /nodes/proxmox/qemu/112/status/current --output-format json |
python3 -c "import sys,json; print(json.load(sys.stdin).get('status','unknown'))"
register: db_status
until: db_status.stdout.strip() == "running"
retries: 12
delay: 5
changed_when: false
- name: Wait for database services to initialize
ansible.builtin.pause:
seconds: 30
- name: "Startup Tier 2 — Infrastructure"
hosts: pve-node
gather_facts: false
tags: [startup]
vars:
tier2_vms: [106, 116]
tier2_lxcs: [225, 210, 227]
tasks:
- name: Start Tier 2 VMs
ansible.builtin.command: pvesh create /nodes/proxmox/qemu/{{ item }}/status/start
loop: "{{ tier2_vms }}"
ignore_errors: true
- name: Start Tier 2 LXCs
ansible.builtin.command: pvesh create /nodes/proxmox/lxc/{{ item }}/status/start
loop: "{{ tier2_lxcs }}"
ignore_errors: true
- name: Wait for infrastructure to come up
ansible.builtin.pause:
seconds: 30
- name: "Startup Tier 3 — Applications"
hosts: pve-node
gather_facts: false
tags: [startup]
vars:
tier3_vms: [115, 110]
tier3_lxcs: [301]
tasks:
- name: Start Tier 3 VMs
ansible.builtin.command: pvesh create /nodes/proxmox/qemu/{{ item }}/status/start
loop: "{{ tier3_vms }}"
ignore_errors: true
- name: Start Tier 3 LXCs
ansible.builtin.command: pvesh create /nodes/proxmox/lxc/{{ item }}/status/start
loop: "{{ tier3_lxcs }}"
ignore_errors: true
- name: Wait for applications to start
ansible.builtin.pause:
seconds: 30
- name: Restart Pi-hole container via SSH (UDP DNS fix)
ansible.builtin.command: ssh docker-home "docker restart pihole"
ignore_errors: true
- name: Wait for Pi-hole to stabilize
ansible.builtin.pause:
seconds: 10
- name: "Startup Tier 4 — Media & Others"
hosts: pve-node
gather_facts: false
tags: [startup]
vars:
tier4_vms: [109]
tier4_lxcs: [221, 222, 223, 302]
tasks:
- name: Start Tier 4 VMs
ansible.builtin.command: pvesh create /nodes/proxmox/qemu/{{ item }}/status/start
loop: "{{ tier4_vms }}"
ignore_errors: true
- name: Start Tier 4 LXCs
ansible.builtin.command: pvesh create /nodes/proxmox/lxc/{{ item }}/status/start
loop: "{{ tier4_lxcs }}"
ignore_errors: true
- name: Post-reboot validation
  hosts: pve-node
  gather_facts: false
  tags: [startup, validate]
  tasks:
    - name: Wait for all services to initialize
      ansible.builtin.pause:
        seconds: 60
    # Note: shell blocks use "|" (literal scalar), not ">" — folding would
    # collapse the embedded Python onto one line and break it.
    - name: Check all expected VMs are running
      ansible.builtin.shell: |
        pvesh get /nodes/proxmox/qemu --output-format json |
        python3 -c "
        import sys, json
        vms = json.load(sys.stdin)
        expected = {106, 109, 110, 112, 115, 116}
        running = {v['vmid'] for v in vms if v.get('status') == 'running'}
        missing = expected - running
        if missing:
            print(f'WARN: VMs not running: {missing}')
            sys.exit(1)
        print(f'All expected VMs running: {running & expected}')
        "
      register: vm_check
      ignore_errors: true
    - name: Check all expected LXCs are running
      ansible.builtin.shell: |
        pvesh get /nodes/proxmox/lxc --output-format json |
        python3 -c "
        import sys, json
        cts = json.load(sys.stdin)
        # LXC 303 (mcp-gateway) intentionally excluded — onboot=0, operator-managed
        expected = {210, 221, 222, 223, 225, 227, 301, 302, 304}
        running = {c['vmid'] for c in cts if c.get('status') == 'running'}
        missing = expected - running
        if missing:
            print(f'WARN: LXCs not running: {missing}')
            sys.exit(1)
        print(f'All expected LXCs running: {running & expected}')
        "
      register: lxc_check
      ignore_errors: true
    - name: Clean up old maintenance snapshots (older than 7 days)
      ansible.builtin.shell: |
        cutoff=$(date -d '7 days ago' +%s)
        for vmid in $(pvesh get /nodes/proxmox/qemu --output-format json |
            python3 -c "import sys,json; [print(v['vmid']) for v in json.load(sys.stdin)]"); do
          for snap in $(pvesh get /nodes/proxmox/qemu/$vmid/snapshot --output-format json |
              python3 -c "import sys,json; [print(s['name']) for s in json.load(sys.stdin) if s['name'].startswith('pre-maintenance-')]" 2>/dev/null); do
            snap_date=$(echo $snap | sed 's/pre-maintenance-//')
            snap_epoch=$(date -d "$snap_date" +%s 2>/dev/null)
            if [ -z "$snap_epoch" ]; then
              echo "WARN: could not parse date for snapshot $snap on VM $vmid"
            elif [ "$snap_epoch" -lt "$cutoff" ]; then
              pvesh delete /nodes/proxmox/qemu/$vmid/snapshot/$snap && echo "Deleted $snap from VM $vmid"
            fi
          done
        done
        for ctid in $(pvesh get /nodes/proxmox/lxc --output-format json |
            python3 -c "import sys,json; [print(c['vmid']) for c in json.load(sys.stdin)]"); do
          for snap in $(pvesh get /nodes/proxmox/lxc/$ctid/snapshot --output-format json |
              python3 -c "import sys,json; [print(s['name']) for s in json.load(sys.stdin) if s['name'].startswith('pre-maintenance-')]" 2>/dev/null); do
            snap_date=$(echo $snap | sed 's/pre-maintenance-//')
            snap_epoch=$(date -d "$snap_date" +%s 2>/dev/null)
            if [ -z "$snap_epoch" ]; then
              echo "WARN: could not parse date for snapshot $snap on LXC $ctid"
            elif [ "$snap_epoch" -lt "$cutoff" ]; then
              pvesh delete /nodes/proxmox/lxc/$ctid/snapshot/$snap && echo "Deleted $snap from LXC $ctid"
            fi
          done
        done
        echo "Snapshot cleanup complete"
      ignore_errors: true
    - name: Display validation results
      ansible.builtin.debug:
        msg:
          - "VM status: {{ vm_check.stdout }}"
          - "LXC status: {{ lxc_check.stdout }}"
          - "Maintenance reboot complete — post-reboot startup finished"
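The retention logic in the cleanup task is easy to sanity-check in isolation. A minimal Python sketch, assuming snapshot names of the form `pre-maintenance-YYYY-MM-DD` (the real names may carry a different date suffix):

```python
from datetime import datetime, timedelta

def is_stale(snap_name: str, now: datetime, max_age_days: int = 7) -> bool:
    """Mirror the playbook's cutoff check: strip the prefix, parse the
    date, and report whether the snapshot exceeds the retention window."""
    prefix = "pre-maintenance-"
    if not snap_name.startswith(prefix):
        return False
    try:
        snap_date = datetime.strptime(snap_name[len(prefix):], "%Y-%m-%d")
    except ValueError:
        # unparseable dates are skipped, matching the playbook's WARN branch
        return False
    return snap_date < now - timedelta(days=max_age_days)

now = datetime(2026, 4, 10)
print(is_stale("pre-maintenance-2026-04-01", now))  # → True
print(is_stale("pre-maintenance-2026-04-09", now))  # → False
```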

@@ -0,0 +1,15 @@
[Unit]
Description=Monthly Proxmox maintenance reboot (Ansible)
After=network-online.target
Wants=network-online.target
[Service]
Type=oneshot
User=cal
WorkingDirectory=/opt/ansible
ExecStart=/usr/bin/ansible-playbook /opt/ansible/playbooks/monthly-reboot.yml
StandardOutput=append:/opt/ansible/logs/monthly-reboot.log
StandardError=append:/opt/ansible/logs/monthly-reboot.log
TimeoutStartSec=900
# No [Install] section — this service is activated exclusively by ansible-monthly-reboot.timer

@@ -0,0 +1,13 @@
[Unit]
Description=Monthly Proxmox maintenance reboot timer
Documentation=https://git.manticorum.com/cal/claude-home/src/branch/main/server-configs/proxmox/maintenance-reboot.md
[Timer]
# First Sunday of the month at 08:00 UTC (3:00 AM ET during EDT)
# Day range 01-07 ensures it's always the first occurrence of that weekday
OnCalendar=Sun *-*-01..07 08:00:00
Persistent=true
RandomizedDelaySec=600
[Install]
WantedBy=timers.target
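The comment's day-range claim holds in general: the first occurrence of any weekday in a month always lands on days 1-7, so `Sun *-*-01..07` always matches exactly the first Sunday. A quick verification:

```python
import calendar

def first_sunday(year: int, month: int) -> int:
    # day numbers of the month's Sundays; take the first one
    return [d for d in calendar.Calendar().itermonthdays(year, month)
            if d and calendar.weekday(year, month, d) == calendar.SUNDAY][0]

# The first Sunday can never fall later than day 7.
assert all(1 <= first_sunday(2026, m) <= 7 for m in range(1, 13))
print(first_sunday(2026, 4))  # → 5
```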

@@ -0,0 +1,21 @@
[Unit]
Description=Post-reboot controlled guest startup (Ansible)
After=network-online.target
Wants=network-online.target
# Only run after a fresh boot — not on service restart
ConditionUpTimeSec=600
[Service]
Type=oneshot
User=cal
WorkingDirectory=/opt/ansible
# Delay 120s to let Proxmox API stabilize and onboot guests settle
ExecStartPre=/bin/sleep 120
ExecStart=/usr/bin/ansible-playbook /opt/ansible/playbooks/post-reboot-startup.yml
StandardOutput=append:/opt/ansible/logs/post-reboot-startup.log
StandardError=append:/opt/ansible/logs/post-reboot-startup.log
TimeoutStartSec=1800
[Install]
# Runs automatically on every boot of LXC 304
WantedBy=multi-user.target

File diff suppressed because it is too large

@@ -0,0 +1,297 @@
# Home Network Review — Design Spec
**Date:** 2026-04-08
**Approach:** Hybrid Layer-by-Layer (discover-then-fix per layer, bottom-up)
**Execution model:** Sub-agent driven — parallel agents within each layer's discovery/analysis phases, sequential remediation
## Context
### Current Infrastructure
- **Router/Gateway:** UniFi UDM Pro
- **Switch:** US-24-PoE (250W)
- **Access Points:** 3x UAP-AC-Lite (Office, First Floor, Upper Floor)
- **Hypervisor:** Proxmox at `10.10.0.10`
- **Physical server:** ubuntu-manticore (`10.10.0.226`) — Pi-hole, Jellyfin, Tdarr, KB RAG stack
- **VM 115:** docker-sba (`10.10.0.88`) — Paper Dynasty, SBA services
- **NAS:** TrueNAS at `10.10.0.35`
- **Reverse proxy:** Nginx Proxy Manager — external access via `*.manticorum.com`
- **DNS:** Dual Pi-hole HA — primary `10.10.0.16` (npm-pihole LXC), secondary `10.10.0.226` (manticore), synced via Orbital Sync + NPM DNS sync cron
### Current Network Topology
| Network | Subnet | Purpose |
|---------|--------|---------|
| Home | `10.0.0.0/23` | Personal devices |
| Lab | `10.10.0.0/24` | Homelab infrastructure |
### Known Issues & Goals (Priority Order)
1. **Performance (C):** Roku on Upper Floor AP has 6 Mbps Rx rate despite -44 dBm signal. 1x1 MIMO, AP/Client Signal Balance: Poor. Likely AP TX power asymmetry with weak client radio.
2. **Cleanup (D):** A handful of custom firewall rules need a sanity check. The internal `.homelab.local` domain may not be functional — `.local` conflicts with mDNS (RFC 6762).
3. **Security (A):** Many services exposed via `*.manticorum.com` through NPM. Need WAN exposure audit.
4. **Reliability (B):** Validate Pi-hole HA failover, identify single points of failure.
5. **Expansion (E):** Add guest WiFi, expand Tailscale to full mesh, build smart home foundation.
### Additional Requirements
- **Guest WiFi:** New VLAN, isolated, internet-only
- **Tailscale:** Currently on phones with exit nodes on both networks. Goal: universal reachability — all devices can reach each other whether on home/lab network, cellular, or cloud
- **Smart Home:** Home Assistant antenna installed, not migrated. Previous Matter/HomeKit attempts failed. Want solid network foundation (IoT VLAN, mDNS) before going deeper
- **IoT VLAN:** Default-deny internet access. Per-device exceptions if needed.
## Design
### Agent Assignments
| Layer | Lead Agent(s) | Support |
|-------|---------------|---------|
| 1. WiFi & Physical | `network-engineer` | |
| 2. Network Architecture | `network-engineer` | `it-ops-orchestrator` |
| 3. DNS | `network-engineer` | |
| 4. Firewall & Security | `security-engineer`, `security-auditor` | |
| 5. Overlay & Remote Access | `network-engineer` | |
| 6. Smart Home Foundation | `iot-engineer` | `network-engineer` |
| Final Pass | `security-auditor` | `pentester` |
### Per-Layer Workflow
Each layer follows the same three-phase cycle:
1. **Discover** — export configs, scan current state, document baseline (parallel sub-agents)
2. **Analyze** — review findings, identify issues, produce recommendations (parallel sub-agents)
3. **Remediate** — implement changes, validate, document new state (sequential)
---
### Layer 1: WiFi & Physical
**Goal:** Optimize wireless performance, diagnose Roku issue, establish baseline RF environment.
**Discovery (parallel):**
- Export AP configs from UniFi (channels, power levels, band steering, DTIM, minimum RSSI)
- Pull client device list with signal/rate/retry stats
- Document AP placement (floor, room, mounting)
- Check for channel conflicts — 3 APs on 5GHz 80MHz channels could overlap
**Analysis (parallel):**
- Evaluate channel plan — non-overlapping channels? DFS channels available?
- Review AP power levels — high TX power on AC Lites causes asymmetry with weak client radios
- Assess band steering config — is 2.4GHz available as fallback?
- Roku-specific: determine if lowering AP-Upper Floor TX power or moving Roku to 2.4GHz improves Rx rate
**Remediation (sequential):**
- Apply optimized channel plan
- Adjust TX power levels per AP
- Configure minimum RSSI thresholds if not set
- Validate Roku improvement
- Document new baseline
**Key insight:** The Roku's 1x1 radio with 6 Mbps Rx rate at -44 dBm signal strongly suggests AP TX power is too high relative to what the Roku can transmit back. Lowering AP power or moving to 2.4GHz are the likely fixes.
---
### Layer 2: Network Architecture
**Goal:** Expand from 2 VLANs to 4, supporting guest WiFi and IoT isolation.
**Target VLAN layout:**
| VLAN | Name | Subnet | Purpose |
|------|------|--------|---------|
| Existing | Home | `10.0.0.0/23` | Trusted personal devices |
| Existing | Lab | `10.10.0.0/24` | Homelab servers, Proxmox, infrastructure |
| New | Guest | TBD (e.g., `10.20.0.0/24`) | Guest WiFi — internet only, no local access |
| New | IoT | TBD (e.g., `10.30.0.0/24`) | Smart devices — no internet by default |
**Discovery (parallel):**
- Export current VLAN config (VLAN IDs, DHCP scopes, assignments)
- Inventory all devices and current network placement
- Document inter-VLAN routing rules
- Check switch port VLAN assignments (tagged/untagged)
**Analysis (parallel):**
- Determine which devices move to IoT VLAN (Roku, smart bulbs, switches, HA hub)
- Design DHCP scopes for new VLANs
- Plan inter-VLAN access: IoT reaches HA only, HA reaches into IoT, no IoT internet
- WiFi SSIDs: one per VLAN or shared SSID with VLAN assignment?
**Remediation (sequential):**
- Create Guest and IoT VLANs in UniFi
- Configure DHCP for new VLANs
- Create WiFi networks (Guest SSID, IoT SSID)
- Migrate devices to appropriate VLANs
- Validate connectivity per VLAN
- Document new topology
---
### Layer 3: DNS
**Goal:** Validate Pi-hole HA, plan mDNS for smart home, ensure DNS works across all four VLANs.
**Discovery (parallel):**
- Validate Orbital Sync (matching blocklists, custom entries on both Pi-holes)
- Check NPM DNS sync cron — is `custom.list` consistent?
- Document current DNS records in `homelab.local` zone
- Check DHCP DNS server advertisements on both existing VLANs
**Analysis (parallel):**
- Verify failover: what happens when primary (`10.10.0.16`) goes down?
- DNS per VLAN: Guest gets Pi-hole (ad blocking) but NOT internal name resolution. IoT resolves HA only.
- mDNS for smart home — Matter/HomeKit use mDNS for discovery, which doesn't cross VLAN boundaries. Options:
- UniFi mDNS reflector (built-in, simple, reflects everything)
- Avahi reflector on a host (more granular)
- Explicit HA configuration for IoT VLAN discovery
- Check if iOS DNS bypass issue (from KB) is still relevant
**Remediation (sequential):**
- Configure DNS for Guest and IoT VLANs
- Set up mDNS reflection (method TBD)
- Fix any Orbital Sync or failover gaps
- Validate DNS resolution from each VLAN
- Document DNS architecture
---
### Layer 4: Firewall & Security
**Goal:** Clean up rules, audit WAN exposure, validate internal domain, harden perimeter.
**Discovery (parallel):**
- Export all UniFi firewall rules (WAN/LAN/Guest, in/out/local)
- Inventory all NPM proxy hosts — which services exposed on `*.manticorum.com`
- Test internal domain resolution: does `.homelab.local` work from each network?
- Check NPM SSL cert status and auto-renewal
- Document port forwards on UDM Pro
- Check UDM Pro WAN-facing services (remote management, STUN, UPnP)
**Analysis (parallel):**
- **Firewall rule audit:** Redundant, conflicting, or overly broad rules? Missing rules (e.g., IoT→Lab block)?
- **NPM exposure review:** Per proxy host — does it need to be internet-facing? Auth configured? Security headers (HSTS, X-Frame-Options, CSP)?
- **Internal domain strategy:** `.local` conflicts with mDNS. Options:
- Keep `.homelab.local` with Pi-hole handling (risk of mDNS collision)
- Switch to `lab.manticorum.com` with split DNS (recommended — you own the domain, no mDNS conflict, clean)
- Use `.home.arpa` (RFC 8375, purpose-built for home networks)
- **Inter-VLAN rules:** Guest = internet-only. IoT = no internet, HA access only. Lab = reachable from Home, not from Guest/IoT.
- **WAN hardening:** UPnP status, unnecessary exposure
**Remediation (sequential):**
- Remove/consolidate stale firewall rules
- Harden NPM proxy hosts (auth, headers, prune unnecessary exposure)
- Implement chosen internal domain strategy (recommendation: `lab.manticorum.com` split DNS)
- Create inter-VLAN firewall rules for Guest and IoT
- Disable UPnP if enabled, close unnecessary WAN exposure
- External port scan validation
- Document final ruleset and NPM inventory
---
### Layer 5: Overlay & Remote Access
**Goal:** Tailscale full mesh — universal reachability across home, cellular, and cloud.
**Discovery (parallel):**
- Document current Tailscale setup (devices, exit nodes, ACL policy)
- Check for subnet router usage vs exit-node-only
- Identify all devices for the mesh (workstation, phones, laptops, servers, cloud VMs)
- Check if OpenVPN is active or legacy
**Analysis (parallel):**
- **Architecture options:**
- Subnet routers: Tailscale on 1-2 hosts advertising home + lab subnets. Simpler, fewer installs.
- Full mesh: Tailscale on every server. Direct reachability, no SPOF, more to manage.
- Hybrid (recommended): Tailscale on key servers + subnet router for the rest.
- **DNS integration:** Tailscale MagicDNS vs Pi-hole coexistence
- **ACL policy:** Which devices reach which? Phones get everything? Cloud VMs lab-only?
- **Exit node strategy:** Keep current phone exit nodes? Add workstation?
- **OpenVPN decommission:** If Tailscale covers all use cases, remove it
**Remediation (sequential):**
- Install/configure Tailscale on chosen devices
- Set up subnet routes or direct mesh
- Configure Tailscale ACLs
- Integrate DNS (MagicDNS + Pi-hole)
- Test: home→cloud, cellular→lab, cloud→home
- Decommission OpenVPN if replaced
- Document mesh topology and ACLs
---
### Layer 6: Smart Home Foundation
**Goal:** IoT VLAN ready (from Layer 2), Home Assistant deployed, Matter/Thread infrastructure in place.
**Discovery (parallel):**
- Inventory smart devices — protocols (WiFi, Zigbee, Z-Wave, Matter, Thread)
- Document HA hardware (antenna type — Zigbee coordinator? Thread border router? SkyConnect?)
- Document previous HomeKit/Matter attempts — what failed and why
- Identify devices for HA migration
**Analysis (parallel):**
- **Protocol strategy:**
- Which devices support Matter (firmware update path)?
- WiFi-only devices → IoT VLAN, managed through HA
- Zigbee/Thread devices → HA radio, no VLAN needed
- **HA network placement:** Must reach IoT VLAN, be reachable from Home VLAN (UI), handle mDNS. Options: dedicated VM, container on manticore, dedicated hardware.
- **Matter/Thread specifics:**
- Thread border routers: same segment as HA coordinator
- Matter commissioning uses BLE + WiFi — which VLAN?
- Apple Home: HA HomeKit bridge vs replace HomeKit entirely
- **Migration path:** Phased, validate each batch
**Remediation (sequential):**
- Deploy Home Assistant (if not already running)
- Configure HA network access (IoT VLAN reach, Home VLAN UI)
- Set up Zigbee/Thread coordinator
- Migrate devices in phases
- Test Matter commissioning end-to-end
- Document device inventory, protocols, HA architecture
---
### Final Pass: Cross-Cutting Security Audit
**Goal:** Holistic review after all layers complete — catch anything missed or introduced.
**Agent:** `security-auditor` lead, `pentester` assist.
**Tasks:**
- Port scan from WAN — verify only intended services reachable
- Inter-VLAN isolation verification — Guest can't reach Lab/Home/IoT, IoT can't reach internet or Lab
- NPM proxy hosts: SSL + headers validated
- No default credentials on network gear or exposed services
- Tailscale ACLs match actual reachability
- Produce final network topology document
---
## Dependencies
```
Layer 1 (WiFi) ─────────────────────────────────────────────┐
│ │
Layer 2 (VLANs) ────────────────────────────────────────────┤
│ │
Layer 3 (DNS) ──────────────────────────────────────────────┤
│ │
Layer 4 (Firewall) ─────────────────────────────────────────┤
│ │
Layer 5 (Tailscale) ────────────────────────────────────────┤
│ │
Layer 6 (Smart Home) ───────────────────────────────────────┤
Final Pass
```
Layers are sequential — each builds on the one below. Within each layer, discovery and analysis phases run parallel sub-agents. Remediation is sequential within a layer.
## Deliverables
Per layer:
- Baseline snapshot (current state before changes)
- Changes made (with rationale)
- Validation results
- Updated documentation
Final:
- Complete network topology document
- Firewall rule inventory
- NPM proxy host inventory with security status
- Tailscale mesh diagram and ACL policy
- Smart home device inventory and protocol map
- Security audit report

@@ -21,7 +21,7 @@
{
"parameters": {
"operation": "executeCommand",
"command": "/root/.local/bin/claude -p \"Run python3 ~/.claude/skills/server-diagnostics/client.py health paper-dynasty and analyze the results. If any containers are not running or there are critical issues, summarize them. Otherwise just say 'All systems healthy'.\" --output-format json --json-schema '{\"type\":\"object\",\"properties\":{\"status\":{\"type\":\"string\",\"enum\":[\"healthy\",\"issues_found\"]},\"summary\":{\"type\":\"string\"},\"root_cause\":{\"type\":\"string\"},\"severity\":{\"type\":\"string\",\"enum\":[\"low\",\"medium\",\"high\",\"critical\"]},\"affected_services\":{\"type\":\"array\",\"items\":{\"type\":\"string\"}},\"actions_taken\":{\"type\":\"array\",\"items\":{\"type\":\"string\"}}},\"required\":[\"status\",\"severity\",\"summary\"]}' --allowedTools \"Read,Grep,Glob,Bash(python3 ~/.claude/skills/server-diagnostics/client.py *)\"",
"command": "/root/.local/bin/claude -p \"Run python3 ~/.claude/skills/server-diagnostics/client.py health paper-dynasty and analyze the results. If any containers are not running or there are critical issues, summarize them. Otherwise just say 'All systems healthy'.\" --output-format json --json-schema '{\"type\":\"object\",\"properties\":{\"status\":{\"type\":\"string\",\"enum\":[\"healthy\",\"issues_found\"]},\"summary\":{\"type\":\"string\"},\"root_cause\":{\"type\":\"string\"},\"severity\":{\"type\":\"string\",\"enum\":[\"low\",\"medium\",\"high\",\"critical\"]},\"affected_services\":{\"type\":\"array\",\"items\":{\"type\":\"string\"}},\"actions_taken\":{\"type\":\"array\",\"items\":{\"type\":\"string\"}}},\"required\":[\"status\",\"severity\",\"summary\"]}' --allowedTools \"Read,Grep,Glob,Bash(python3 ~/.claude/skills/server-diagnostics/client.py *)\" --append-system-prompt \"You are a server diagnostics agent. Use the server-diagnostics skill client.py for all operations. Never run destructive commands.\"",
"options": {}
},
"id": "ssh-claude-code",
@@ -75,20 +75,48 @@
"typeVersion": 2,
"position": [660, 0]
},
{
"parameters": {
"operation": "executeCommand",
"command": "=/root/.local/bin/claude -p \"The previous health check found issues. Investigate deeper: check container logs, resource usage, and recent events. Provide a detailed root cause analysis and recommended remediation steps.\" --resume \"{{ $json.session_id }}\" --output-format json --json-schema '{\"type\":\"object\",\"properties\":{\"root_cause_detail\":{\"type\":\"string\"},\"container_logs\":{\"type\":\"string\"},\"resource_status\":{\"type\":\"string\"},\"remediation_steps\":{\"type\":\"array\",\"items\":{\"type\":\"string\"}},\"requires_human\":{\"type\":\"boolean\"}},\"required\":[\"root_cause_detail\",\"remediation_steps\",\"requires_human\"]}' --allowedTools \"Read,Grep,Glob,Bash(python3 ~/.claude/skills/server-diagnostics/client.py *)\" --max-turns 15 --append-system-prompt \"You are a server diagnostics agent performing a follow-up investigation. The initial health check found issues. Dig deeper into logs and metrics. Never run destructive commands.\"",
"options": {}
},
"id": "ssh-followup",
"name": "Follow Up Diagnostics",
"type": "n8n-nodes-base.ssh",
"typeVersion": 1,
"position": [880, -200],
"credentials": {
"sshPassword": {
"id": "REPLACE_WITH_CREDENTIAL_ID",
"name": "Claude Code LXC"
}
}
},
{
"parameters": {
"jsCode": "// Parse follow-up diagnostics response\nconst stdout = $input.first().json.stdout || '';\nconst initial = $('Parse Claude Response').first().json;\n\ntry {\n const response = JSON.parse(stdout);\n const data = response.structured_output || JSON.parse(response.result || '{}');\n \n return [{\n json: {\n ...initial,\n followup: {\n root_cause_detail: data.root_cause_detail || 'No detail available',\n container_logs: data.container_logs || '',\n resource_status: data.resource_status || '',\n remediation_steps: data.remediation_steps || [],\n requires_human: data.requires_human || false,\n cost_usd: response.total_cost_usd,\n session_id: response.session_id\n },\n total_cost_usd: (initial.cost_usd || 0) + (response.total_cost_usd || 0)\n }\n }];\n} catch (e) {\n return [{\n json: {\n ...initial,\n followup: {\n error: e.message,\n root_cause_detail: 'Follow-up parse failed',\n remediation_steps: [],\n requires_human: true\n },\n total_cost_usd: initial.cost_usd || 0\n }\n }];\n}"
},
"id": "parse-followup",
"name": "Parse Follow-up Response",
"type": "n8n-nodes-base.code",
"typeVersion": 2,
"position": [1100, -200]
},
{
"parameters": {
"method": "POST",
"url": "https://discord.com/api/webhooks/1451783909409816763/O9PMDiNt6ZIWRf8HKocIZ_E4vMGV_lEwq50aAiZ9HVFR2UGwO6J1N9_wOm82p0MetIqT",
"sendBody": true,
"specifyBody": "json",
"jsonBody": "={\n \"embeds\": [{\n \"title\": \"{{ $json.severity === 'critical' ? '🔴' : $json.severity === 'high' ? '🟠' : '🟡' }} Server Alert\",\n \"description\": {{ JSON.stringify($json.summary) }},\n \"color\": {{ $json.severity === 'critical' ? 15158332 : $json.severity === 'high' ? 15105570 : 16776960 }},\n \"fields\": [\n {\n \"name\": \"Severity\",\n \"value\": \"{{ $json.severity.toUpperCase() }}\",\n \"inline\": true\n },\n {\n \"name\": \"Server\",\n \"value\": \"paper-dynasty (10.10.0.88)\",\n \"inline\": true\n },\n {\n \"name\": \"Cost\",\n \"value\": \"${{ $json.cost_usd ? $json.cost_usd.toFixed(4) : '0.0000' }}\",\n \"inline\": true\n },\n {\n \"name\": \"Root Cause\",\n \"value\": \"{{ $json.root_cause || 'N/A' }}\",\n \"inline\": false\n },\n {\n \"name\": \"Affected Services\",\n \"value\": \"{{ $json.affected_services.length ? $json.affected_services.join(', ') : 'None' }}\",\n \"inline\": false\n },\n {\n \"name\": \"Actions Taken\",\n \"value\": \"{{ $json.actions_taken.length ? $json.actions_taken.join('\\n') : 'None' }}\",\n \"inline\": false\n }\n ],\n \"timestamp\": \"{{ new Date().toISOString() }}\"\n }]\n}",
"jsonBody": "={\n \"embeds\": [{\n \"title\": \"{{ $json.severity === 'critical' ? '🔴' : $json.severity === 'high' ? '🟠' : '🟡' }} Server Alert\",\n \"description\": {{ JSON.stringify($json.summary) }},\n \"color\": {{ $json.severity === 'critical' ? 15158332 : $json.severity === 'high' ? 15105570 : 16776960 }},\n \"fields\": [\n {\n \"name\": \"Severity\",\n \"value\": \"{{ $json.severity.toUpperCase() }}\",\n \"inline\": true\n },\n {\n \"name\": \"Server\",\n \"value\": \"paper-dynasty (10.10.0.88)\",\n \"inline\": true\n },\n {\n \"name\": \"Cost\",\n \"value\": \"${{ $json.total_cost_usd ? $json.total_cost_usd.toFixed(4) : '0.0000' }}\",\n \"inline\": true\n },\n {\n \"name\": \"Root Cause\",\n \"value\": {{ JSON.stringify(($json.followup && $json.followup.root_cause_detail) || $json.root_cause || 'N/A') }},\n \"inline\": false\n },\n {\n \"name\": \"Affected Services\",\n \"value\": \"{{ $json.affected_services.length ? $json.affected_services.join(', ') : 'None' }}\",\n \"inline\": false\n },\n {\n \"name\": \"Remediation Steps\",\n \"value\": {{ JSON.stringify(($json.followup && $json.followup.remediation_steps.length) ? $json.followup.remediation_steps.map((s, i) => (i+1) + '. ' + s).join('\\n') : ($json.actions_taken.length ? $json.actions_taken.join('\\n') : 'None')) }},\n \"inline\": false\n },\n {\n \"name\": \"Requires Human?\",\n \"value\": \"{{ ($json.followup && $json.followup.requires_human) ? '⚠️ Yes' : '✅ No' }}\",\n \"inline\": true\n }\n ],\n \"timestamp\": \"{{ new Date().toISOString() }}\"\n }]\n}",
"options": {}
},
"id": "discord-alert",
"name": "Discord Alert",
"type": "n8n-nodes-base.httpRequest",
"typeVersion": 4.2,
"position": [880, -100]
"position": [1320, -200]
},
{
"parameters": {
@@ -145,7 +173,7 @@
"main": [
[
{
"node": "Discord Alert",
"node": "Follow Up Diagnostics",
"type": "main",
"index": 0
}
@@ -158,6 +186,28 @@
}
]
]
},
"Follow Up Diagnostics": {
"main": [
[
{
"node": "Parse Follow-up Response",
"type": "main",
"index": 0
}
]
]
},
"Parse Follow-up Response": {
"main": [
[
{
"node": "Discord Alert",
"type": "main",
"index": 0
}
]
]
}
},
"settings": {

@@ -0,0 +1,69 @@
---
title: "Database API Release — 2026.4.7"
description: "Major cleanup: middleware connection management, security hardening, performance fixes, and Pydantic/Docker upgrades."
type: reference
domain: major-domo
tags: [release-notes, deployment, database, major-domo]
---
# Database API Release — 2026.4.7
**Date:** 2026-04-07
**Tag:** TBD (next CalVer tag after `2026.4.5`)
**Image:** `manticorum67/major-domo-database:{tag}` + `:latest`
**Server:** `ssh akamai` (`~/container-data/sba-database`)
**Deploy method:** `git tag -a YYYY.M.BUILD -m "description" && git push origin YYYY.M.BUILD` → CI builds Docker image → pull + restart on akamai
## Release Summary
Large batch merge of 22 PRs covering connection management, security hardening, query performance, code cleanup, and infrastructure upgrades. The headline change is middleware-based DB connection management replacing 177+ manual `db.close()` calls across all routers.
## Changes
### Architecture
- **Middleware connection management** — replaced all manual `db.close()` calls with HTTP middleware that opens connections before requests and closes after responses (PR #97)
- **Disabled autoconnect + pool timeout**`PooledPostgresqlDatabase` now uses `autoconnect=False` and `timeout=5` for tighter connection lifecycle control (PR #87)
- **Migration tracking system** — new system for tracking applied database migrations (PR #96)
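The middleware pattern from PR #97 can be sketched generically — this is illustrative, not the actual code, with a stub standing in for the real peewee database:

```python
# Open the DB connection before each request and close it after the
# response, instead of scattering db.close() through the route handlers.
class FakeDB:
    """Stub with the connect/close surface the middleware relies on."""
    def __init__(self): self.closed = True
    def connect(self): self.closed = False
    def close(self): self.closed = True
    def is_closed(self): return self.closed

def connection_middleware(db, handler):
    def wrapped(request):
        db.connect()
        try:
            return handler(request)      # connection is open for the handler
        finally:
            if not db.is_closed():
                db.close()               # always closed after the response
    return wrapped

db = FakeDB()
handler = connection_middleware(db, lambda req: f"db open={not db.closed}")
print(handler("GET /players"))  # → db open=True
print(db.closed)                # → True
```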
### Security
- **Removed hardcoded webhook URL** — Discord webhook URL moved to `DISCORD_WEBHOOK_URL` env var (PR #83). Old token is in git history — rotate it.
- **Removed hardcoded fallback DB password** — no more default password in `db_engine.py` (PR #55)
- **Removed token from log warnings** — Bad Token log messages no longer include the raw token value (PR #85)
### Performance
- **Batch standings updates** — eliminated N+1 queries in `recalculate_standings` (PR #93)
- **Bulk DELETE in career recalculation** — replaced row-by-row DELETE with single bulk operation (PR #92)
- **Added missing FK indexes** — indexes on FK columns in `stratplay` and `stratgame` tables (PR #95)
- **Fixed total_count in get_totalstats** — count no longer overwritten with page length (PR #102)
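The batch pattern behind the standings fix can be illustrated generically — compute every row in memory, then persist with a single bulk write instead of one UPDATE per team (model and field names here are invented, not the actual schema):

```python
def recalc_standings_batch(games: list[dict]) -> dict[str, dict]:
    """Accumulate all win/loss rows in one pass; the caller persists the
    result with a single bulk write rather than per-team UPDATEs."""
    standings: dict[str, dict] = {}
    for g in games:
        for team, won in ((g["winner"], True), (g["loser"], False)):
            row = standings.setdefault(team, {"wins": 0, "losses": 0})
            row["wins" if won else "losses"] += 1
    return standings

print(recalc_standings_batch([
    {"winner": "A", "loser": "B"},
    {"winner": "A", "loser": "C"},
]))
```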
### Bug Fixes
- **Boolean field comparisons** — replaced integer comparisons (`== 1`) with proper `True`/`False` (PR #94)
- **CustomCommandCreator.discord_id** — aligned model field with BIGINT column type (PR #88)
- **Literal validation on sort param**`GET /api/v3/players` now validates sort values (PR #68)
- **PitchingStat combined_season** — added missing classmethod for combined season stats (PR #67)
### Code Cleanup
- Removed SQLite fallback code from `db_engine.py` (PR #89)
- Replaced deprecated `.dict()` with `.model_dump()` across all Pydantic models (PR #90)
- Added type annotations to untyped query parameters (PR #86)
- Removed commented-out dead code blocks (PR #48)
- Replaced `print()` debug statements with `logger` calls in `db_engine.py` (PR #53)
- Removed unimplemented `is_trade` parameter from transactions endpoint (PR #57)
- Eliminated N+1 queries in `get_custom_commands` (PR #51)
### Infrastructure
- **Docker base image upgraded** from Python 3.11 to 3.12 (PR #91)
- **CI switched to tag-triggered builds** (PR #107)
## Known Issues
- ~20 unit tests broken by SQLite fallback removal — tests relied on SQLite that no longer exists (issue #108)
- `test_get_nonexistent_play` returns 500 instead of 404 (issue #109)
- `test_batting_sbaplayer_career_totals` returns 422 instead of 200 (issue #110)
## Deployment Notes
- **New env var required:** `DISCORD_WEBHOOK_URL` must be set in the container environment. Check that `docker-compose.yml` passes it through.
- **Rotate webhook token** — the old hardcoded token is in git history.
- **Migration tracking:** new migration table will be created on first run.
- **Rollback:** pin the image to `manticorum67/major-domo-database:2026.4.5` in `docker-compose.yml`, then `docker compose pull && docker compose up -d` (`docker compose pull` takes service names, not image references)

@@ -0,0 +1,37 @@
---
title: "Discord Bot Release — 2026.4.7"
description: "Minor fix: add missing logger to SubmitConfirmationModal."
type: reference
domain: major-domo
tags: [release-notes, deployment, discord, major-domo]
---
# Discord Bot Release — 2026.4.7
**Date:** 2026-04-07
**Tag:** TBD (next CalVer tag after `2026.3.13`)
**Image:** `manticorum67/major-domo-discordapp:{tag}` + `:production`
**Server:** `ssh akamai` (`~/container-data/major-domo`)
**Deploy method:** `.scripts/release.sh` → CI builds Docker image → `.scripts/deploy.sh`
## Release Summary
Minimal release with a single logging fix. Previous releases (2026.3.12 and 2026.3.13) included the larger performance and feature work (FA lock enforcement, trade view optimization, parallel lookups).
## Changes
### Bug Fixes
- **Missing logger in SubmitConfirmationModal** — added logger initialization that was absent, preventing proper error logging in transaction confirmation flows
## Not Included (PR #120)
PR #120 (caching for stable data) remains open with two unfixed issues:
1. `_channel_color_cache` cross-user contamination — cache keyed by channel only, user-specific colors bleed across users
2. `recalculate_standings()` doesn't invalidate standings cache
These must be addressed before PR #120 can merge.
## Deployment Notes
- No new env vars or config changes required
- **Rollback:** `.scripts/deploy.sh` with previous image tag, or `ssh akamai``docker compose pull manticorum67/major-domo-discordapp:2026.3.13 && docker compose up -d`

@@ -0,0 +1,128 @@
---
title: "APC UPS Monitoring with apcupsd and Discord Alerts"
description: "Setup guide for apcupsd on nobara-pc workstation with Discord webhook alerts for power events (on battery, off battery, battery replace, comm failure/restore)."
type: guide
domain: monitoring
tags: [apcupsd, ups, discord, webhook, power, alerts, usb]
---
# APC UPS Monitoring with apcupsd
## Overview
apcupsd monitors the APC Back-UPS RS 1500MS2 connected via USB to the workstation (nobara-pc). Discord alerts fire automatically on power events via webhook scripts in `/etc/apcupsd/`.
## Hardware
- **UPS Model**: Back-UPS RS 1500MS2
- **Connection**: USB (vendor ID `051d:0002`)
- **Nominal Power**: 900W
- **Nominal Battery Voltage**: 24V
- **Serial**: 0B2544L30372
## Configuration
**Config file**: `/etc/apcupsd/apcupsd.conf`
Key settings:
```
UPSNAME WS-UPS
UPSCABLE usb
UPSTYPE usb
DEVICE # blank = USB autodetect
POLLTIME 15 # poll every 15 seconds
SENSE Medium # UPS-side sensitivity (set in EEPROM)
LOTRANS 88.0 # switch to battery below this voltage
HITRANS 144.0 # switch to battery above this voltage
BATTERYLEVEL 5 # shutdown at 5% charge
MINUTES 3 # shutdown at 3 min remaining
```
## Service
```bash
sudo systemctl enable --now apcupsd
systemctl status apcupsd
```
## Useful Commands
```bash
# Full status dump
apcaccess status
# Single field (no parsing needed)
apcaccess -p LINEV
apcaccess -p LASTXFER
apcaccess -p BCHARGE
# View event log
cat /var/log/apcupsd.events
# Watch events in real-time
tail -f /var/log/apcupsd.events
```
## Discord Alerts
Five event scripts in `/etc/apcupsd/` send Discord embeds to the `#homelab-alerts` webhook:
| Script | Trigger | Embed Color |
|--------|---------|-------------|
| `onbattery` | UPS switches to battery | Red (0xFF6B6B) |
| `offbattery` | Line power restored | Green (0x57F287) |
| `changeme` | Battery needs replacement | Yellow (0xFFFF00) |
| `commfailure` | USB communication lost | Red (0xFF6B6B) |
| `commok` | USB communication restored | Green (0x57F287) |
All scripts use the same webhook URL as other monitoring scripts (jellyfin_gpu_monitor, nvidia_update_checker).
The `onbattery` alert includes line voltage, load percentage, battery charge, and time remaining — useful for diagnosing whether transfers are caused by voltage sags vs other issues.
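The embed itself is plain JSON built from `apcaccess` fields, so the payload logic can be sketched independently of the UPS. A minimal sketch, assuming `jq` is available; the function name and sample values are illustrative, not the deployed script source (only the title and color follow the table above):

```bash
# Hypothetical sketch of the onbattery payload (not the deployed script).
# On the workstation the four arguments come from apcaccess -p.
build_onbattery_payload() {
  # $1=LINEV $2=LOADPCT $3=BCHARGE $4=TIMELEFT
  jq -n \
    --arg desc "Line: $1 | Load: $2 | Charge: $3 | Left: $4" \
    '{embeds: [{title: "🔴 WS-UPS on battery",
                description: $desc,
                color: 16739179}]}'   # 16739179 = 0xFF6B6B (red)
}

payload=$(build_onbattery_payload "87.0 Volts" "49.0 Percent" "100.0 Percent" "18.2 Minutes")
echo "$payload"
# Deployed behavior: build from "$(apcaccess -p LINEV)" etc., then
# curl -s -X POST "$WEBHOOK_URL" -H "Content-Type: application/json" -d "$payload"
```

Swap in live `apcaccess -p` values and the `#homelab-alerts` webhook URL to reproduce what the real script posts.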
## Troubleshooting
### UPS not detected
```bash
# Check USB connection
lsusb | grep 051d
# If missing, try a different USB port or cable
# The UPS uses vendor ID 051d:0002
```
### No Discord alerts on power event
```bash
# Test the script manually
sudo /etc/apcupsd/onbattery WS-UPS
# Check that curl is available at /usr/bin/curl
which curl
# Verify webhook URL is still valid
curl -s -o /dev/null -w "%{http_code}" -H "Content-Type: application/json" \
-X POST "WEBHOOK_URL" -d '{"content":"test"}'
# Should return 204
```
### LASTXFER shows "Low line voltage"
This means input voltage is dropping below the LOTRANS threshold (88V). Common causes:
- Heavy appliance on the same circuit (HVAC, fridge compressor)
- Loose wiring/outlet connection
- Utility-side voltage sags
- Overloaded circuit
Correlate event timestamps from `/var/log/apcupsd.events` with appliance cycling to identify the source.
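A quick way to pull only the transfer events for that correlation (the log phrasing below is typical apcupsd wording; verify it against your own events file):

```bash
# Filter transfer events out of the apcupsd event log for timestamp
# correlation. Log phrasing and sample lines are assumed/illustrative.
filter_transfers() {
  grep -E 'Power failure|Power is back'
}

printf '%s\n' \
  '2026-04-06 14:02:11 -0500  Power failure.' \
  '2026-04-06 14:02:17 -0500  Power is back. UPS running on mains.' \
  '2026-04-06 18:00:01 -0500  apcupsd 3.14.14 startup succeeded' |
  filter_transfers
```

On the workstation: `filter_transfers < /var/log/apcupsd.events`.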
### Frequent unnecessary transfers
If sensitivity is too high, the UPS transfers on minor sags that don't affect equipment:
- Check current: `apcaccess -p SENSE`
- Lower via `apctest` EEPROM menu (requires stopping apcupsd first)
- Options: High → Medium → Low
## Initial Diagnostics (2026-04-06)
- Two different APC UPS units exhibited the same on_batt/on_line bouncing behavior
- `LASTXFER: Low line voltage` confirmed voltage sags as the cause
- Sensitivity already at Medium — transfers are from real sags below 88V
- Load at 49% (441W of 900W capacity) — not overloaded
- Next steps: correlate event timestamps with appliance activity, try different circuit, electrician inspection


@@ -1,9 +1,9 @@
---
title: "Monitoring Scripts Context"
description: "Operational context for all monitoring scripts: Jellyfin GPU health monitor, NVIDIA driver update checker, Tdarr API/file monitors, and Windows reboot detection. Includes cron schedules, Discord integration patterns, and troubleshooting."
description: "Operational context for all monitoring scripts: Proxmox backup checker, CT 302 self-health, Jellyfin GPU health monitor, NVIDIA driver update checker, Tdarr API/file monitors, and Windows reboot detection. Includes cron schedules, Discord integration patterns, and troubleshooting."
type: context
domain: monitoring
tags: [jellyfin, gpu, nvidia, tdarr, discord, cron, python, windows, scripts]
tags: [proxmox, backup, jellyfin, gpu, nvidia, tdarr, discord, cron, python, bash, windows, scripts]
---
# Monitoring Scripts - Operational Context
@@ -13,6 +13,77 @@ This directory contains active operational scripts for system monitoring, health
## Core Monitoring Scripts
### Proxmox Backup Verification
**Script**: `proxmox-backup-check.sh`
**Purpose**: Weekly check that every running VM/CT has a successful vzdump backup within 7 days. Posts a color-coded Discord embed with per-guest status.
**Key Features**:
- SSHes to Proxmox host and queries `pvesh` task history + guest lists via API
- Categorizes each guest: 🟢 green (backed up), 🟡 yellow (overdue), 🔴 red (no backup)
- Sorts output by VMID; only posts to Discord — no local side effects
- `--dry-run` mode prints the Discord payload without sending
- `--days N` overrides the default 7-day window
**Schedule**: Weekly on Monday 08:00 UTC (CT 302 cron)
```bash
0 8 * * 1 DISCORD_WEBHOOK="<url>" /root/scripts/proxmox-backup-check.sh >> /var/log/proxmox-backup-check.log 2>&1
```
**Usage**:
```bash
# Dry run (no Discord)
proxmox-backup-check.sh --dry-run
# Post to Discord
DISCORD_WEBHOOK="https://discord.com/api/webhooks/..." proxmox-backup-check.sh
# Custom window
proxmox-backup-check.sh --days 14 --discord-webhook "https://..."
```
**Dependencies**: `jq`, `curl`, SSH access to Proxmox host alias `proxmox`
**Install on CT 302**:
```bash
cp proxmox-backup-check.sh /root/scripts/
chmod +x /root/scripts/proxmox-backup-check.sh
```
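The per-guest status reduces to comparing each guest's last successful vzdump timestamp against the window cutoff. A standalone sketch of that classification with illustrative epoch timestamps (the real script derives them from `pvesh` task history):

```bash
# green: last OK backup within the window; yellow: older than the window;
# red: no successful backup found (timestamp 0). Timestamps illustrative.
now=1770000000
cutoff=$((now - 7 * 86400))
statuses=$(jq -cn --argjson cutoff "$cutoff" '
  [1769900000, 1769000000, 0] | map(
    if . >= $cutoff then "green" elif . > 0 then "yellow" else "red" end
  )')
echo "$statuses"   # ["green","yellow","red"]
```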
### CT 302 Self-Health Monitor
**Script**: `ct302-self-health.sh`
**Purpose**: Monitors disk usage on CT 302 (claude-runner) itself. Alerts to Discord when any filesystem exceeds the threshold (default 80%). Runs silently when healthy — no Discord spam on green.
**Key Features**:
- Checks all non-virtual filesystems (`df`, excludes tmpfs/devtmpfs/overlay)
- Only sends a Discord alert when a filesystem is at or above threshold
- `--always-post` flag forces a post even when healthy (useful for testing)
- `--dry-run` mode prints payload without sending
**Schedule**: Daily at 07:00 UTC (CT 302 cron)
```bash
0 7 * * * DISCORD_WEBHOOK="<url>" /root/scripts/ct302-self-health.sh >> /var/log/ct302-self-health.log 2>&1
```
**Usage**:
```bash
# Check and alert if over 80%
DISCORD_WEBHOOK="https://discord.com/api/webhooks/..." ct302-self-health.sh
# Lower threshold test
ct302-self-health.sh --threshold 50 --dry-run
# Always post (weekly status report pattern)
ct302-self-health.sh --always-post --discord-webhook "https://..."
```
**Dependencies**: `jq`, `curl`, `df`
**Install on CT 302**:
```bash
cp ct302-self-health.sh /root/scripts/
chmod +x /root/scripts/ct302-self-health.sh
```
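The alert condition itself is a plain integer comparison on the `%` column. A minimal sketch of the per-filesystem check, using an input line in the script's three-column `source pcent target` shape (helper name and sample lines are illustrative, not real CT 302 filesystems):

```bash
# Strip the % sign and compare against the threshold.
DISK_THRESHOLD=80
check_fs() {
  local pct
  pct=$(echo "$1" | awk '{print $2}' | tr -d '%')
  if [[ "$pct" -ge "$DISK_THRESHOLD" ]]; then
    echo "ALERT ${pct}%"
  else
    echo "OK ${pct}%"
  fi
}
check_fs '/dev/mapper/pve-root 84% /'   # over threshold: ALERT 84%
check_fs '/dev/sdb1 41% /mnt/data'      # healthy: OK 41%
```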
### Jellyfin GPU Health Monitor
**Script**: `jellyfin_gpu_monitor.py`
**Purpose**: Monitor Jellyfin container GPU access with Discord alerts and auto-restart capability
@@ -235,6 +306,17 @@ python3 tdarr_file_monitor.py >> /mnt/NV2/Development/claude-home/logs/tdarr-fil
0 9 * * 1 /usr/bin/python3 /home/cal/scripts/nvidia_update_checker.py --check --discord-alerts >> /home/cal/logs/nvidia-update-checker.log 2>&1
```
**Active Cron Jobs** (on CT 302 / claude-runner, root user):
```bash
# Proxmox backup verification - Weekly (Mondays at 8 AM UTC)
0 8 * * 1 DISCORD_WEBHOOK="<homelab-alerts-webhook>" /root/scripts/proxmox-backup-check.sh >> /var/log/proxmox-backup-check.log 2>&1
# CT 302 self-health disk check - Daily at 7 AM UTC (alerts only when >80%)
0 7 * * * DISCORD_WEBHOOK="<homelab-alerts-webhook>" /root/scripts/ct302-self-health.sh >> /var/log/ct302-self-health.log 2>&1
```
**Note**: Scripts must be installed manually on CT 302. Source of truth is `monitoring/scripts/` in this repo — copy to `/root/scripts/` on CT 302 to deploy.
**Manual/On-Demand**:
- `tdarr_monitor.py` - Run as needed for Tdarr health checks
- `tdarr_file_monitor.py` - Can be scheduled if automatic backup needed


@@ -0,0 +1,158 @@
#!/usr/bin/env bash
# ct302-self-health.sh — CT 302 (claude-runner) disk self-check → Discord
#
# Monitors disk usage on CT 302 itself and alerts to Discord when any
# filesystem exceeds the threshold. Closes the blind spot where the
# monitoring system cannot monitor itself via external health checks.
#
# Designed to run silently when healthy (no Discord spam on green).
# Only posts when a filesystem is at or above THRESHOLD.
#
# Usage:
# ct302-self-health.sh [--discord-webhook URL] [--threshold N] [--dry-run] [--always-post]
#
# Environment overrides:
# DISCORD_WEBHOOK Discord webhook URL (required unless --dry-run)
# DISK_THRESHOLD Disk usage % alert threshold (default: 80)
#
# Install on CT 302 (daily, 07:00 UTC):
# 0 7 * * * /root/scripts/ct302-self-health.sh >> /var/log/ct302-self-health.log 2>&1
set -uo pipefail
DISK_THRESHOLD="${DISK_THRESHOLD:-80}"
DISCORD_WEBHOOK="${DISCORD_WEBHOOK:-}"
DRY_RUN=0
ALWAYS_POST=0
while [[ $# -gt 0 ]]; do
case "$1" in
--discord-webhook)
if [[ $# -lt 2 ]]; then
echo "Error: --discord-webhook requires a value" >&2
exit 1
fi
DISCORD_WEBHOOK="$2"
shift 2
;;
--threshold)
if [[ $# -lt 2 ]]; then
echo "Error: --threshold requires a value" >&2
exit 1
fi
DISK_THRESHOLD="$2"
shift 2
;;
--dry-run)
DRY_RUN=1
shift
;;
--always-post)
ALWAYS_POST=1
shift
;;
*)
echo "Unknown option: $1" >&2
exit 1
;;
esac
done
if [[ "$DRY_RUN" -eq 0 && -z "$DISCORD_WEBHOOK" ]]; then
echo "Error: DISCORD_WEBHOOK not set. Use --discord-webhook URL or set env var." >&2
exit 1
fi
log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*"; }
# ---------------------------------------------------------------------------
# Check disk usage on all real filesystems
# ---------------------------------------------------------------------------
# df output: Filesystem Use% Mounted-on (skipping tmpfs, devtmpfs, overlay)
TRIGGERED=()
ALL_FS=()
while IFS= read -r line; do
fs=$(echo "$line" | awk '{print $1}')
pct=$(echo "$line" | awk '{print $2}' | tr -d '%')   # input is 3 columns: source pcent target
mount=$(echo "$line" | awk '{print $3}')
ALL_FS+=("${pct}% ${mount} (${fs})")
if [[ "$pct" -ge "$DISK_THRESHOLD" ]]; then
TRIGGERED+=("${pct}% used — ${mount} (${fs})")
fi
done < <(df -h --output=source,size,used,avail,pcent,target |
tail -n +2 |
awk '$1 !~ /^(tmpfs|devtmpfs|overlay|udev)/' |
awk '{print $1, $5, $6}')
HOSTNAME=$(hostname -s)
TRIGGERED_COUNT=${#TRIGGERED[@]}
log "Disk check complete: ${TRIGGERED_COUNT} filesystem(s) above ${DISK_THRESHOLD}%"
# Exit cleanly with no Discord post if everything is healthy
if [[ "$TRIGGERED_COUNT" -eq 0 && "$ALWAYS_POST" -eq 0 && "$DRY_RUN" -eq 0 ]]; then
log "All filesystems healthy — no alert needed."
exit 0
fi
# ---------------------------------------------------------------------------
# Build Discord payload
# ---------------------------------------------------------------------------
if [[ "$TRIGGERED_COUNT" -gt 0 ]]; then
EMBED_COLOR=15548997 # 0xED4245 red
TITLE="🔴 ${HOSTNAME}: Disk usage above ${DISK_THRESHOLD}%"
alert_lines=$(printf '⚠️ %s\n' "${TRIGGERED[@]}")
FIELDS=$(jq -n \
--arg name "Filesystems Over Threshold" \
--arg value "$alert_lines" \
'[{"name": $name, "value": $value, "inline": false}]')
else
EMBED_COLOR=5763719 # 0x57F287 green
TITLE="🟢 ${HOSTNAME}: All filesystems healthy"
FIELDS='[]'
fi
# Add summary of all filesystems
all_lines=$(printf '%s\n' "${ALL_FS[@]}")
FIELDS=$(echo "$FIELDS" | jq \
--arg name "All Filesystems" \
--arg value "$all_lines" \
'. + [{"name": $name, "value": $value, "inline": false}]')
FOOTER="$(date -u '+%Y-%m-%d %H:%M UTC') · CT 302 self-health · threshold: ${DISK_THRESHOLD}%"
PAYLOAD=$(jq -n \
--arg title "$TITLE" \
--argjson color "$EMBED_COLOR" \
--argjson fields "$FIELDS" \
--arg footer "$FOOTER" \
'{
"embeds": [{
"title": $title,
"color": $color,
"fields": $fields,
"footer": {"text": $footer}
}]
}')
if [[ "$DRY_RUN" -eq 1 ]]; then
log "DRY RUN — Discord payload:"
echo "$PAYLOAD" | jq .
exit 0
fi
log "Posting to Discord..."
HTTP_STATUS=$(curl -s -o /tmp/ct302-self-health-discord.out \
-w "%{http_code}" \
-X POST "$DISCORD_WEBHOOK" \
-H "Content-Type: application/json" \
-d "$PAYLOAD")
if [[ "$HTTP_STATUS" -ge 200 && "$HTTP_STATUS" -lt 300 ]]; then
log "Discord notification sent (HTTP ${HTTP_STATUS})."
else
log "Warning: Discord returned HTTP ${HTTP_STATUS}."
cat /tmp/ct302-self-health-discord.out >&2
exit 1
fi


@@ -5,7 +5,7 @@
# to collect system metrics, then generates a summary report.
#
# Usage:
# homelab-audit.sh [--output-dir DIR]
# homelab-audit.sh [--output-dir DIR] [--hosts label:ip,label:ip,...]
#
# Environment overrides:
# STUCK_PROC_CPU_WARN CPU% at which a D-state process is flagged (default: 10)
@@ -29,6 +29,8 @@ LOAD_WARN=2.0
MEM_WARN=85
ZOMBIE_WARN=1
SWAP_WARN=512
HOSTS_FILTER="" # comma-separated host list from --hosts; empty = audit all
JSON_OUTPUT=0 # set to 1 by --json
while [[ $# -gt 0 ]]; do
case "$1" in
@@ -40,6 +42,18 @@ while [[ $# -gt 0 ]]; do
REPORT_DIR="$2"
shift 2
;;
--hosts)
if [[ $# -lt 2 ]]; then
echo "Error: --hosts requires an argument (label:ip,label:ip,...)" >&2
exit 1
fi
HOSTS_FILTER="$2"
shift 2
;;
--json)
JSON_OUTPUT=1
shift
;;
*)
echo "Unknown option: $1" >&2
exit 1
@@ -50,6 +64,7 @@ done
mkdir -p "$REPORT_DIR"
SSH_FAILURES_LOG="$REPORT_DIR/ssh-failures.log"
FINDINGS_FILE="$REPORT_DIR/findings.txt"
AUDITED_HOSTS=() # populated in main; used by generate_summary for per-host counts
# ---------------------------------------------------------------------------
# Remote collector script
@@ -281,6 +296,18 @@ generate_summary() {
printf " Critical : %d\n" "$crit_count"
echo "=============================="
if [[ ${#AUDITED_HOSTS[@]} -gt 0 ]] && ((warn_count + crit_count > 0)); then
echo ""
printf " %-30s %8s %8s\n" "Host" "Warnings" "Critical"
printf " %-30s %8s %8s\n" "----" "--------" "--------"
for host in "${AUDITED_HOSTS[@]}"; do
local hw hc
hw=$(grep -c "^WARN ${host}:" "$FINDINGS_FILE" 2>/dev/null || true)
hc=$(grep -c "^CRIT ${host}:" "$FINDINGS_FILE" 2>/dev/null || true)
((hw + hc > 0)) && printf " %-30s %8d %8d\n" "$host" "$hw" "$hc"
done
fi
if ((warn_count + crit_count > 0)); then
echo ""
echo "Findings:"
@@ -293,6 +320,9 @@
grep '^SSH_FAILURE' "$SSH_FAILURES_LOG" | awk '{print " " $2 " (" $3 ")"}'
fi
echo ""
printf "Total: %d warning(s), %d critical across %d host(s)\n" \
"$warn_count" "$crit_count" "$host_count"
echo ""
echo "Reports: $REPORT_DIR"
}
@@ -383,6 +413,69 @@ check_cert_expiry() {
done
}
# ---------------------------------------------------------------------------
# JSON report — writes findings.json to $REPORT_DIR when --json is used
# ---------------------------------------------------------------------------
write_json_report() {
local host_count="$1"
local json_file="$REPORT_DIR/findings.json"
local ssh_failure_count=0
local warn_count=0
local crit_count=0
[[ -f "$SSH_FAILURES_LOG" ]] &&
ssh_failure_count=$(grep -c '^SSH_FAILURE' "$SSH_FAILURES_LOG" 2>/dev/null || true)
[[ -f "$FINDINGS_FILE" ]] &&
warn_count=$(grep -c '^WARN' "$FINDINGS_FILE" 2>/dev/null || true)
[[ -f "$FINDINGS_FILE" ]] &&
crit_count=$(grep -c '^CRIT' "$FINDINGS_FILE" 2>/dev/null || true)
python3 - "$json_file" "$host_count" "$ssh_failure_count" \
"$warn_count" "$crit_count" "$FINDINGS_FILE" <<'PYEOF'
import sys, json, datetime
json_file = sys.argv[1]
host_count = int(sys.argv[2])
ssh_failure_count = int(sys.argv[3])
warn_count = int(sys.argv[4])
crit_count = int(sys.argv[5])
findings_file = sys.argv[6]
findings = []
try:
with open(findings_file) as f:
for line in f:
line = line.strip()
if not line:
continue
parts = line.split(None, 2)
if len(parts) < 3:
continue
severity, host_colon, message = parts[0], parts[1], parts[2]
findings.append({
"severity": severity,
"host": host_colon.rstrip(":"),
"message": message,
})
except FileNotFoundError:
pass
output = {
"timestamp": datetime.datetime.utcnow().isoformat() + "Z",
"hosts_audited": host_count,
"warnings": warn_count,
"critical": crit_count,
"ssh_failures": ssh_failure_count,
"total_findings": warn_count + crit_count,
"findings": findings,
}
with open(json_file, "w") as f:
json.dump(output, f, indent=2)
print(f"JSON report: {json_file}")
PYEOF
}
# ---------------------------------------------------------------------------
# Main
# ---------------------------------------------------------------------------
@@ -390,22 +483,50 @@ main() {
echo "Starting homelab audit — $(date)"
echo "Report dir: $REPORT_DIR"
echo "STUCK_PROC_CPU_WARN threshold: ${STUCK_PROC_CPU_WARN}%"
[[ -n "$HOSTS_FILTER" ]] && echo "Host filter: $HOSTS_FILTER"
echo ""
>"$FINDINGS_FILE"
echo " Checking Proxmox backup recency..."
check_backup_recency
local host_count=0
while read -r label addr; do
echo " Auditing $label ($addr)..."
parse_and_report "$label" "$addr"
check_cert_expiry "$label" "$addr"
((host_count++)) || true
done < <(collect_inventory)
if [[ -n "$HOSTS_FILTER" ]]; then
# --hosts mode: audit specified hosts directly, skip Proxmox inventory
# Accepts comma-separated entries; each entry may be plain hostname or label:ip
local check_proxmox=0
IFS=',' read -ra filter_hosts <<<"$HOSTS_FILTER"
for entry in "${filter_hosts[@]}"; do
local label="${entry%%:*}"
[[ "$label" == "proxmox" ]] && check_proxmox=1
done
if ((check_proxmox)); then
echo " Checking Proxmox backup recency..."
check_backup_recency
fi
for entry in "${filter_hosts[@]}"; do
local label="${entry%%:*}"
local addr="${entry#*:}"
echo " Auditing $label ($addr)..."
parse_and_report "$label" "$addr"
check_cert_expiry "$label" "$addr"
AUDITED_HOSTS+=("$label")
((host_count++)) || true
done
else
echo " Checking Proxmox backup recency..."
check_backup_recency
while read -r label addr; do
echo " Auditing $label ($addr)..."
parse_and_report "$label" "$addr"
check_cert_expiry "$label" "$addr"
AUDITED_HOSTS+=("$label")
((host_count++)) || true
done < <(collect_inventory)
fi
generate_summary "$host_count"
[[ "$JSON_OUTPUT" -eq 1 ]] && write_json_report "$host_count"
}
main "$@"


@@ -0,0 +1,230 @@
#!/usr/bin/env bash
# proxmox-backup-check.sh — Weekly Proxmox backup verification → Discord
#
# SSHes to the Proxmox host and checks that every running VM/CT has a
# successful vzdump backup within the last 7 days. Posts a color-coded
# Discord summary with per-guest status.
#
# Usage:
# proxmox-backup-check.sh [--discord-webhook URL] [--days N] [--dry-run]
#
# Environment overrides:
# DISCORD_WEBHOOK Discord webhook URL (required unless --dry-run)
# PROXMOX_NODE Proxmox node name (default: proxmox)
# PROXMOX_SSH SSH alias or host for Proxmox (default: proxmox)
# WINDOW_DAYS Backup recency window in days (default: 7)
#
# Install on CT 302 (weekly, Monday 08:00 UTC):
# 0 8 * * 1 /root/scripts/proxmox-backup-check.sh >> /var/log/proxmox-backup-check.log 2>&1
set -uo pipefail
PROXMOX_NODE="${PROXMOX_NODE:-proxmox}"
PROXMOX_SSH="${PROXMOX_SSH:-proxmox}"
WINDOW_DAYS="${WINDOW_DAYS:-7}"
DISCORD_WEBHOOK="${DISCORD_WEBHOOK:-}"
DRY_RUN=0
while [[ $# -gt 0 ]]; do
case "$1" in
--discord-webhook)
if [[ $# -lt 2 ]]; then
echo "Error: --discord-webhook requires a value" >&2
exit 1
fi
DISCORD_WEBHOOK="$2"
shift 2
;;
--days)
if [[ $# -lt 2 ]]; then
echo "Error: --days requires a value" >&2
exit 1
fi
WINDOW_DAYS="$2"
shift 2
;;
--dry-run)
DRY_RUN=1
shift
;;
*)
echo "Unknown option: $1" >&2
exit 1
;;
esac
done
if [[ "$DRY_RUN" -eq 0 && -z "$DISCORD_WEBHOOK" ]]; then
echo "Error: DISCORD_WEBHOOK not set. Use --discord-webhook URL or set env var." >&2
exit 1
fi
if ! command -v jq &>/dev/null; then
echo "Error: jq is required but not installed." >&2
exit 1
fi
SSH_OPTS="-o StrictHostKeyChecking=accept-new -o ConnectTimeout=10 -o BatchMode=yes"
CUTOFF=$(date -d "-${WINDOW_DAYS} days" +%s)
NOW=$(date +%s)
log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*"; }
# ---------------------------------------------------------------------------
# Fetch data from Proxmox
# ---------------------------------------------------------------------------
log "Fetching VM and CT list from Proxmox node '${PROXMOX_NODE}'..."
VMS_JSON=$(ssh $SSH_OPTS "$PROXMOX_SSH" \
"pvesh get /nodes/${PROXMOX_NODE}/qemu --output-format json 2>/dev/null" || echo "[]")
CTS_JSON=$(ssh $SSH_OPTS "$PROXMOX_SSH" \
"pvesh get /nodes/${PROXMOX_NODE}/lxc --output-format json 2>/dev/null" || echo "[]")
log "Fetching recent vzdump task history (limit 200)..."
TASKS_JSON=$(ssh $SSH_OPTS "$PROXMOX_SSH" \
"pvesh get /nodes/${PROXMOX_NODE}/tasks --typefilter vzdump --limit 200 --output-format json 2>/dev/null" || echo "[]")
# ---------------------------------------------------------------------------
# Build per-guest backup status
# ---------------------------------------------------------------------------
# Merge VMs and CTs into one list: [{vmid, name, type}]
GUESTS_JSON=$(jq -n \
--argjson vms "$VMS_JSON" \
--argjson cts "$CTS_JSON" '
($vms | map(select(.status == "running") | {vmid: (.vmid | tostring), name, type: "VM"})) +
($cts | map(select(.status == "running") | {vmid: (.vmid | tostring), name, type: "CT"}))
')
GUEST_COUNT=$(echo "$GUESTS_JSON" | jq 'length')
log "Found ${GUEST_COUNT} running guests."
# For each guest, find the most recent successful (status == "OK") vzdump task
RESULTS=$(jq -n \
--argjson guests "$GUESTS_JSON" \
--argjson tasks "$TASKS_JSON" \
--argjson cutoff "$CUTOFF" \
--argjson now "$NOW" \
--argjson window "$WINDOW_DAYS" '
$guests | map(
. as $g |
($tasks | map(
select(
(.vmid | tostring) == $g.vmid
and .status == "OK"
) | .starttime
) | max // 0) as $last_ts |
{
vmid: $g.vmid,
name: $g.name,
type: $g.type,
last_backup_ts: $last_ts,
age_days: (if $last_ts > 0 then (($now - $last_ts) / 86400 | floor) else -1 end),
status: (
if $last_ts >= $cutoff then "green"
elif $last_ts > 0 then "yellow"
else "red"
end
)
}
) | sort_by(.vmid | tonumber)
')
GREEN_GUESTS=$(echo "$RESULTS" | jq '[.[] | select(.status == "green")]')
YELLOW_GUESTS=$(echo "$RESULTS" | jq '[.[] | select(.status == "yellow")]')
RED_GUESTS=$(echo "$RESULTS" | jq '[.[] | select(.status == "red")]')
GREEN_COUNT=$(echo "$GREEN_GUESTS" | jq 'length')
YELLOW_COUNT=$(echo "$YELLOW_GUESTS" | jq 'length')
RED_COUNT=$(echo "$RED_GUESTS" | jq 'length')
log "Results: ${GREEN_COUNT} green, ${YELLOW_COUNT} yellow, ${RED_COUNT} red"
# ---------------------------------------------------------------------------
# Build Discord payload
# ---------------------------------------------------------------------------
if [[ "$RED_COUNT" -gt 0 ]]; then
EMBED_COLOR=15548997 # 0xED4245 red
STATUS_LINE="🔴 Backup issues detected — action required"
elif [[ "$YELLOW_COUNT" -gt 0 ]]; then
EMBED_COLOR=16705372 # 0xFF851C orange
STATUS_LINE="🟡 Some backups are overdue (>${WINDOW_DAYS}d)"
else
EMBED_COLOR=5763719 # 0x57F287 green
STATUS_LINE="🟢 All ${GUEST_COUNT} guests backed up within ${WINDOW_DAYS} days"
fi
# Format guest lines: "VM 116 (plex) — 2d ago" or "CT 302 (claude-runner) — NO BACKUPS"
format_guest() {
local prefix="$1" guests="$2"
echo "$guests" | jq -r '.[] | "\(.type) \(.vmid) (\(.name))"' |
while IFS= read -r line; do echo "${prefix} ${line}"; done
}
format_guest_with_age() {
local prefix="$1" guests="$2"
echo "$guests" | jq -r '.[] | "\(.type) \(.vmid) (\(.name)) — \(.age_days)d ago"' |
while IFS= read -r line; do echo "${prefix} ${line}"; done
}
# Build fields array
fields='[]'
if [[ "$GREEN_COUNT" -gt 0 ]]; then
green_lines=$(format_guest_with_age "✅" "$GREEN_GUESTS")
fields=$(echo "$fields" | jq \
--arg name "🟢 Healthy (${GREEN_COUNT})" \
--arg value "$green_lines" \
'. + [{"name": $name, "value": $value, "inline": false}]')
fi
if [[ "$YELLOW_COUNT" -gt 0 ]]; then
yellow_lines=$(format_guest_with_age "⚠️" "$YELLOW_GUESTS")
fields=$(echo "$fields" | jq \
--arg name "🟡 Overdue — last backup >${WINDOW_DAYS}d ago (${YELLOW_COUNT})" \
--arg value "$yellow_lines" \
'. + [{"name": $name, "value": $value, "inline": false}]')
fi
if [[ "$RED_COUNT" -gt 0 ]]; then
red_lines=$(format_guest "❌" "$RED_GUESTS")
fields=$(echo "$fields" | jq \
--arg name "🔴 No Successful Backups Found (${RED_COUNT})" \
--arg value "$red_lines" \
'. + [{"name": $name, "value": $value, "inline": false}]')
fi
FOOTER="$(date -u '+%Y-%m-%d %H:%M UTC') · ${GUEST_COUNT} guests · window: ${WINDOW_DAYS}d"
PAYLOAD=$(jq -n \
--arg title "Proxmox Backup Check — ${STATUS_LINE}" \
--argjson color "$EMBED_COLOR" \
--argjson fields "$fields" \
--arg footer "$FOOTER" \
'{
"embeds": [{
"title": $title,
"color": $color,
"fields": $fields,
"footer": {"text": $footer}
}]
}')
if [[ "$DRY_RUN" -eq 1 ]]; then
log "DRY RUN — Discord payload:"
echo "$PAYLOAD" | jq .
exit 0
fi
log "Posting to Discord..."
HTTP_STATUS=$(curl -s -o /tmp/proxmox-backup-check-discord.out \
-w "%{http_code}" \
-X POST "$DISCORD_WEBHOOK" \
-H "Content-Type: application/json" \
-d "$PAYLOAD")
if [[ "$HTTP_STATUS" -ge 200 && "$HTTP_STATUS" -lt 300 ]]; then
log "Discord notification sent (HTTP ${HTTP_STATUS})."
else
log "Warning: Discord returned HTTP ${HTTP_STATUS}."
cat /tmp/proxmox-backup-check-discord.out >&2
exit 1
fi


@@ -93,6 +93,34 @@ else
fail "disk_usage" "expected 'N /path', got: '$result'"
fi
# --- --hosts flag parsing ---
echo ""
echo "=== --hosts argument parsing tests ==="
# Single host
input="vm-115:10.10.0.88"
IFS=',' read -ra entries <<<"$input"
label="${entries[0]%%:*}"
addr="${entries[0]#*:}"
if [[ "$label" == "vm-115" && "$addr" == "10.10.0.88" ]]; then
pass "--hosts single entry parsed: $label $addr"
else
fail "--hosts single" "expected 'vm-115 10.10.0.88', got: '$label $addr'"
fi
# Multiple hosts
input="vm-115:10.10.0.88,lxc-225:10.10.0.225"
IFS=',' read -ra entries <<<"$input"
label1="${entries[0]%%:*}"
addr1="${entries[0]#*:}"
label2="${entries[1]%%:*}"
addr2="${entries[1]#*:}"
if [[ "$label1" == "vm-115" && "$addr1" == "10.10.0.88" && "$label2" == "lxc-225" && "$addr2" == "10.10.0.225" ]]; then
pass "--hosts multi entry parsed: $label1 $addr1, $label2 $addr2"
else
fail "--hosts multi" "unexpected parse result"
fi
echo ""
echo "=== Results: $PASS passed, $FAIL failed ==="
((FAIL == 0))


@@ -92,6 +92,42 @@ CT 302 does **not** have an SSH key registered with Gitea, so SSH git remotes wo
3. Commit to Gitea, pull on CT 302
4. Add Uptime Kuma monitors if desired
## Health Check Thresholds
Thresholds are evaluated in `health_check.py`. All load thresholds use **per-core** metrics
to avoid false positives from LXC containers (which see the Proxmox host's aggregate load).
### Load Average
| Metric | Value | Rationale |
|--------|-------|-----------|
| `LOAD_WARN_PER_CORE` | `0.7` | Elevated — investigate if sustained |
| `LOAD_CRIT_PER_CORE` | `1.0` | Saturated — CPU is a bottleneck |
| Sample window | 5-minute | Filters transient spikes (not 1-minute) |
**Formula**: `load_per_core = load_5m / nproc`
**Why per-core?** Proxmox LXC containers see the host's aggregate load average via the
shared kernel. A 32-core Proxmox host at load 9 is at 0.28/core (healthy), but a naive
absolute threshold of 2.0 would still fire inside a 4-core LXC, which sees that same
load of 9. Using `load_5m / nproc`, where `nproc` returns the host's visible core count,
gives the correct ratio.
**Validation examples**:
- Proxmox host: load 9 / 32 cores = 0.28/core → no alert ✓
- VM 116 at 0.75/core → warning ✓ (above 0.7 threshold)
- VM at 1.1/core → critical ✓
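The classification can be sketched as a small awk helper (hypothetical function name; the production check lives in `health_check.py`):

```bash
# Classify a 5-minute load average against the per-core thresholds above.
classify_load() {
  # $1 = load_5m, $2 = core count (nproc)
  awk -v load="$1" -v cores="$2" 'BEGIN {
    pc = load / cores
    if      (pc >= 1.0) print "CRIT"
    else if (pc >= 0.7) print "WARN"
    else                print "OK"
  }'
}
classify_load 9 32    # Proxmox host: 0.28/core, prints OK
classify_load 3 4     # 0.75/core, prints WARN
classify_load 4.4 4   # 1.1/core, prints CRIT
```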
### Other Thresholds
| Check | Threshold | Notes |
|-------|-----------|-------|
| Zombie processes | 5 | Single zombies are transient noise; alert only if ≥ 5 |
| Swap usage | 30% of total swap | Percentage-based to handle varied swap sizes across hosts |
| Disk warning | 85% | |
| Disk critical | 95% | |
| Memory | 90% | |
| Uptime alert | Non-urgent Discord post | Not a page-level alert |
## Related
- [monitoring/CONTEXT.md](../CONTEXT.md) — Overall monitoring architecture


@@ -47,12 +47,13 @@ home_network:
services: ["media", "transcoding"]
description: "Tdarr media transcoding"
vpn_docker:
hostname: "10.10.0.121"
port: 22
user: "cal"
services: ["vpn", "docker"]
description: "VPN and Docker services"
# DECOMMISSIONED: vpn_docker (10.10.0.121) - VM 105 destroyed 2026-04
# vpn_docker:
# hostname: "10.10.0.121"
# port: 22
# user: "cal"
# services: ["vpn", "docker"]
# description: "VPN and Docker services"
remote_servers:
akamai_nano:


@@ -23,7 +23,7 @@ servers:
pihole: 10.10.0.16 # Pi-hole DNS and ad blocking
sba_pd_bots: 10.10.0.88 # SBa and PD bot services
tdarr: 10.10.0.43 # Media transcoding
vpn_docker: 10.10.0.121 # VPN and Docker services
# vpn_docker: 10.10.0.121 # DECOMMISSIONED — VM 105 destroyed, migrated to arr-stack LXC 221
```
### Cloud Servers
@@ -175,11 +175,12 @@ Host tdarr media
Port 22
IdentityFile ~/.ssh/homelab_rsa
Host docker-vpn
HostName 10.10.0.121
User cal
Port 22
IdentityFile ~/.ssh/homelab_rsa
# DECOMMISSIONED: docker-vpn (10.10.0.121) - VM 105 destroyed, migrated to arr-stack LXC 221
# Host docker-vpn
# HostName 10.10.0.121
# User cal
# Port 22
# IdentityFile ~/.ssh/homelab_rsa
# Remote Cloud Servers
Host akamai-nano akamai


@@ -0,0 +1,95 @@
---
title: "Autonomous Nightly Run — 2026-04-10 (run 2)"
description: "Second autonomous pipeline run of the day: 4 PRs created (1 APPROVED, 3 REQUEST_CHANGES), 11 items queued to pd-plan, 0 rejections"
type: context
domain: paper-dynasty
tags: [autonomous-pipeline, nightly-run]
---
## Run Metadata
- Date: 2026-04-10 (second run of the day; see autonomous-nightly-2026-04-10.md for run 1)
- Duration: ~25 minutes wall clock
- Slots before: 0/10 S, 0/5 M (no prior autonomous PRs open from run 1)
- Slots after: 4/10 S, 0/5 M (4 S items in_progress pending merge)
## Findings
- Analyst produced 5 findings
- Growth-po produced 10 findings
- Dedup filtered: 0 duplicates, 0 partial overlaps (haiku call skipped — both comparison lists were empty, making all 15 findings trivially novel)
## PO Decisions
| Finding ID | PO | Decision | Size | Notes |
|---|---|---|---|---|
| analyst-2026-04-10-001 | database-po | approved | M | HTTPException-200 sweep — consumer audit required |
| analyst-2026-04-10-002 | database-po | reshaped | S | Drop premature empty-table 404s; do NOT materialize large querysets |
| analyst-2026-04-10-003 | discord-po | approved | S | Bare except narrowing (high severity) |
| analyst-2026-04-10-004 | database-po | approved | M | Packs beachhead tests — sequence after 001/002 |
| analyst-2026-04-10-005 | (autonomous) | auto-approved | S | Structured rejection parser |
| growth-sweep-2026-04-10-001 | discord-po | reshaped | M | Command logging — split into db endpoint + bot middleware |
| growth-sweep-2026-04-10-002 | database-po | approved | S | Card of the week endpoint |
| growth-sweep-2026-04-10-003 | discord-po | approved | S | Gauntlet results recap |
| growth-sweep-2026-04-10-004 | discord-po | approved | S | /compare command |
| growth-sweep-2026-04-10-005 | discord-po | approved | M | /profile command — needs aggregate endpoint |
| growth-sweep-2026-04-10-006 | discord-po | approved | S | Rarity celebration embeds — use canonical rarity names |
| growth-sweep-2026-04-10-007 | discord-po | approved | S | Gauntlet schedule + reminder |
| growth-sweep-2026-04-10-008 | discord-po | approved | M | Starter pack grant — idempotent, onboarding critical |
| growth-sweep-2026-04-10-009 | discord-po | approved | M | /pack history with pack_log table |
| growth-sweep-2026-04-10-010 | database-po | reshaped | M | Webhook infra first, cardset hook as consumer |
## PRs Created
| PR | Repo | Title | Tests | Review |
|---|---|---|---|---|
| #163 | discord-app | fix(gameplay): replace bare except with NoResultFound | pre-existing collection failures (testcontainers missing locally) | **REQUEST_CHANGES** — cache_player uses session.get which returns None, not raises; new except NoResultFound is unreachable and caller crashes with AttributeError |
| #164 | discord-app | feat(gauntlet): auto-post results recap embed | PASS (14 new tests) | **REQUEST_CHANGES** — `loss_max or 99` treats loss_max=0 as falsy, causing the perfect-run bonus tier to show ⬜ instead of ❌ on a 10-1 finish |
| #212 | database | feat(api): card of the week featured endpoint | PASS (6 new tests) | **APPROVED** — joins, AI exclusion, tiebreak, 404 handling all correct. Merge via `pd-pr merge --no-approve` |
| #165 | discord-app | feat(cogs): /compare slash command | PASS (30 new tests) | **REQUEST_CHANGES** — `_is_pitcher` omits CP (Closing Pitcher), silently misclassifies closers as batters |
## Mix Ratio
- Recent history: insufficient data (first full pipeline run after the 2-PR morning run); skipped the bash ratio check to conserve budget
- Bias applied this run: none (interleaved stability/feature manually)
- Dispatched mix: 1 stability (analyst-003) + 3 feature (growth-002/003/004). 1:3 is feature-heavy; balance the next run toward stability if this trend continues
## Wishlist Additions
None. All Large items were scoped as M or smaller by POs — nothing escalated to the L wishlist this run.
## Queued to pd-plan (waiting for slot)
Added as `status=active`, `slot=autonomous`:
- #20: Sweep HTTPException(status_code=200) in routers (M, database)
- #21: Remove double-count and premature empty-table 404s (S, database)
- #22: Beachhead integration tests for packs router (M, database)
- #23: Structured rejection parser for autonomous pipeline (S, autonomous)
- #24: Command usage logging — bot middleware + db endpoint (M, multi-repo)
- #25: Player profile command /profile (M, multi-repo)
- #26: Rarity celebration embeds for pack pulls (S, discord-app)
- #27: Gauntlet schedule + reminder task (S, discord-app)
- #28: Starter pack grant for new players (M, multi-repo)
- #29: Pack opening history command /pack history (M, multi-repo)
- #30: Outbound webhook dispatcher + cardset publish hook (M, database)
Shipped as in_progress linked to PRs:
- #31 → discord-app#163
- #32 → discord-app#164
- #33 → database#212
- #34 → discord-app#165
## Rejections
None. All 15 findings passed PO review (5 reshaped, 10 approved as-is).
## Self-Improvement Notes
1. **pr-reviewer caught real bugs in 3 of the 4 PRs.** This is exactly the value the review gate is supposed to provide. Notably, tests passed on all three REQUEST_CHANGES PRs — the bugs lived in code paths the authors' own tests didn't exercise:
- PR #163: author didn't test a session.get cache-miss; the narrowed exception class doesn't actually match the real "not found" signal in that function
- PR #164: test asserted absence of ✅ but not presence of ❌, missing the falsy-zero substitution bug
- PR #165: test suite didn't include a CP (closer) case, so the position-gate gap was invisible
Engineer prompts should explicitly require adversarial tests that exercise the exact code path the change modifies, including zero/empty/None boundary values.
2. **Worktree contamination on PR #165.** The /compare PR diff included `gauntlets.py` and `tests/test_gauntlet_recap.py` changes from PR #164, plus a `gameplay_queries.py` formatting touch from PR #163. Parallel worktrees branching from the same mainline apparently picked up each other's state. Investigate whether `isolation: "worktree"` in the Agent tool produces a fully isolated checkout or whether engineers need to explicitly branch from `origin/main`. Git worktrees share one backing `.git` object store by design (each gets its own index and HEAD), so if contamination persists, sequential dispatch may be safer for tighter commit isolation.
3. **Budget headroom tight at scale.** Dispatched only 4 of 15 approved items due to budget caution. 4 engineers + 4 reviewers consumed ~$10 (~$1.20/agent). At this rate, filling all 15 slots would require a ~$30 budget ceiling. Options: (a) use Haiku for engineers on mechanical changes like the HTTPException sweep, (b) batch multiple small fixes into one engineer invocation when they touch the same file, (c) cache common context via a prewarm step.
4. **Rejection parser finding is legit.** analyst-005's observation about rejection markdown blobs in dedup input is correct — when the rejection list grows, raw markdown will poison semantic matching quality. Auto-approved to the queue (#23). Self-improving the pipeline itself is exactly the kind of work the `autonomous` repo scope was added for.
5. **Empty dedup lists mean haiku call was dead weight.** Implement a preflight short-circuit: if `open_autonomous_prs` AND `recent_rejections` are both empty, skip the dedup haiku call entirely. Saves ~$0.05 and a few seconds per clean-slate run.
6. **Database-po reshape was substantive.** Both reshape decisions from database-po (analyst-002, growth-010) were correct and saved bad PRs. The original analyst recommendation for analyst-002 (materialize large querysets) would have regressed performance; the PO catch saved a regression. Growth-010's reshape correctly identified that the real cost of 2.6a is the webhook dispatcher plumbing, not the hook site. Keep POs in the loop for all findings — the cost is justified.
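Note 5's short-circuit is a one-line guard. Sketched here with assumed inputs — the variable names are illustrative, not the orchestrator's actual step outputs:

```python
def should_run_dedup(open_autonomous_prs, recent_rejections):
    """Skip the haiku dedup call when there is nothing to deduplicate against.

    Both lists come from the bash preflight (PR inventory + rejection query).
    Empty + empty means every finding is novel by construction, so the
    semantic-matching call would be pure overhead.
    """
    return bool(open_autonomous_prs) or bool(recent_rejections)
```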


@ -0,0 +1,95 @@
---
title: "Autonomous Nightly Run — 2026-04-10"
description: "First autonomous nightly run: 2 PRs shipped, 7 items queued, 0 rejections. Budget-constrained dispatch."
type: context
domain: paper-dynasty
tags: [autonomous-pipeline, nightly-run]
---
## Run Metadata
- Date: 2026-04-10
- Slots before: 10/10 S, 5/5 M (no active autonomous work)
- Slots after: 8/10 S, 5/5 M (2 S slots now in-flight via PRs)
- Open autonomous PRs before run: 0
- Recent rejections: 0
- Budget constraint: run hit the $5 USD ceiling early due to broad analyst sweep; dispatched 2 engineers instead of full slot fill.
## Findings
- Analyst produced 8 findings across database, discord-app, and autonomous pipeline
- Growth-po produced 5 findings (all discord-app, all S-sized, all Phase 2 roadmap items)
- Dedup haiku: **skipped** (0 open PRs + 0 rejections = no possible duplicates; all findings novel by construction)
## PO Decisions
### Database-po (4 findings)
| Finding ID | Decision | Size | Notes |
|---|---|---|---|
| analyst-2026-04-10-002 | approved | S | HTTPException(200) sweep across ~10 routers |
| analyst-2026-04-10-004 | approved | S | N+1 Paperdex fix; add query-count regression test |
| analyst-2026-04-10-006 | reshaped | M | Split into 3 S tickets, start with pack-opening tests |
| analyst-2026-04-10-008 | approved | S | Remove unfiltered pre-count in GET /packs **→ shipped** |
### Discord-po (8 findings)
| Finding ID | Decision | Size | Notes |
|---|---|---|---|
| analyst-2026-04-10-001 | approved | S | Delete dead gameplay_legacy.py **→ shipped** |
| analyst-2026-04-10-003 | approved | S | Economy tree.on_error override (play-lock bug) — **high priority** |
| analyst-2026-04-10-005 | reshaped | M | Two-phase cutover for economy_new/packs.py migration |
| growth-sweep-2026-04-10-001 | approved | S | Rarity celebration embeds — use canonical rarity vocab |
| growth-sweep-2026-04-10-002 | approved | S | /compare command — ephemeral by default, LHP/RHP split |
| growth-sweep-2026-04-10-003 | approved | S | Gauntlet results recap embed |
| growth-sweep-2026-04-10-004 | reshaped | M | Command usage telemetry — cross-repo, needs privacy review |
| growth-sweep-2026-04-10-005 | reshaped | S+M | Split: /gauntlet schedule (S) first, reminder scheduler (M) after scheduler approach specced |
### Self-improvement (auto-approved, no PO gate)
| Finding ID | Decision | Size | Notes |
|---|---|---|---|
| analyst-2026-04-10-007 | approved | S | Split run-nightly.sh stdout/stderr, write last-run-result.json, voice-notify on failure |
## PRs Created
- **discord-app#162** — `chore(cogs): remove dead gameplay_legacy cog (4,723 lines, zero references)` — tests PASS (no new failures; 2 pre-existing SQLite path issues unchanged), labels applied, **pr-reviewer dispatch skipped (budget)** — https://git.manticorum.com/cal/paper-dynasty-discord/pulls/162
- **database#211** — `fix(packs): remove unfiltered pre-count in GET /packs (3 round-trips → 2)` — tests PASS (266 passed, 13 pre-existing failures unchanged), consumer check clean (no 404 handlers in discord-app), labels applied, **pr-reviewer dispatch skipped (budget)** — https://git.manticorum.com/cal/paper-dynasty-database/pulls/211
- **Post-run diagnostic:** Pyright flagged 4 `Pack.id` attribute access errors after ruff reformatted the file. These are Peewee ORM false positives (`id` is added dynamically by Peewee's Model metaclass) and are pre-existing elsewhere in the codebase. Not a regression from this change.
## Mix Ratio
- No prior digests — this is the first autonomous nightly run. Default 1:1 interleave applied.
- This run shipped 2 stability items and 0 features. Next run should bias toward feature dispatches if budget permits.
## Wishlist Additions
- None. All approved items are S or M and could fit within a normal slot budget — no L-sized items surfaced in this sweep.
## Queued for Next Run (approved but not dispatched due to budget)
The following items are **approved and ready to ship** but were not dispatched this run. They should be picked up first thing next run:
**High priority (stability, real user impact):**
1. `analyst-2026-04-10-003` (S) — Economy cog overwrites global tree.on_error, bypassing play-lock release. **Players are getting stuck due to this bug.** Should be the first item dispatched next run.
2. `analyst-2026-04-10-002` (S) — HTTPException(200) sweep across ~10 DB routers.
3. `analyst-2026-04-10-004` (S) — N+1 Paperdex fix in players endpoints.
**Self-improvement:**
4. `analyst-2026-04-10-007` (S) — run-nightly.sh stdout/stderr split + last-run-result.json. This is a *prerequisite* for reliable future runs; should be prioritized.
**Features (growth):**
5. `growth-sweep-2026-04-10-001` (S) — Rarity celebration embeds.
6. `growth-sweep-2026-04-10-003` (S) — Gauntlet results recap embed.
7. `growth-sweep-2026-04-10-002` (S) — /compare command.
**Reshaped (needs spec work before dispatch):**
- `analyst-2026-04-10-006` (M) — first of 3 split tickets: pack-opening happy path + insufficient funds + duplicate handling.
- `analyst-2026-04-10-005` (M) — Phase 1 spec of economy.py vs economy_new/packs.py drift.
- `growth-sweep-2026-04-10-004` (M) — Cross-repo telemetry; needs privacy posture confirmation.
- `growth-sweep-2026-04-10-005` Issue A (S) — /gauntlet schedule command (pure read).
## Rejections
- None this run.
## Self-Improvement Notes
**The pipeline hit its $5 budget ceiling after dispatching analyst + growth-po + 2 POs + 2 engineers.** The spend breakdown was top-heavy: the analyst agent alone consumed roughly half the budget on a 411-second, 104-tool-use deep audit. Observations for future runs:
1. **Analyst cap**: Consider passing a stricter cap (e.g., "limit to top 5 findings, max 30 tool uses") to the analyst to keep its spend predictable.
2. **Dedup skip was correct**: With 0 open PRs and 0 rejections, the dedup haiku call would have been pure overhead. Encoding this as an orchestrator shortcut (skip dedup when both inputs are empty) would save ~$0.10 per first-run scenario.
3. **pr-reviewer was skipped**: Engineer PRs #162 and #211 did not receive an automated review pass. Cal should manually review these before merge. Future runs should reserve ~$0.30 per PR for pr-reviewer.
4. **pd-plan CLI skipped**: Approved-but-queued items are documented in this digest only, not in the pd-plan database. Next run's preflight should parse this digest's "Queued for Next Run" section and dispatch those items first before generating new findings.
5. **Budget-aware slot filling**: Orchestrator should compute a rough budget forecast (analyst ~$2, each PO ~$0.30, each engineer ~$0.60, each pr-reviewer ~$0.30) before dispatching engineers, and cap engineer count at `(remaining_budget - digest_reserve) / (engineer_cost + reviewer_cost)`.
6. **The `analyst-2026-04-10-007` self-improvement item directly addresses observability gaps that made this digest harder to write** — prioritize it next run.
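Note 5's cap formula, sketched with the per-agent cost estimates given there. The $0.50 digest reserve is an assumed value, and integer cents avoid float-floor surprises:

```python
def max_engineers(remaining_budget, digest_reserve=0.50,
                  engineer_cost=0.60, reviewer_cost=0.30):
    """Cap engineer dispatches at (remaining - reserve) / (engineer + reviewer).

    Costs are the rough per-agent estimates from note 5; digest_reserve is
    an assumption. Working in cents keeps the floor division exact.
    """
    spendable = round((remaining_budget - digest_reserve) * 100)
    per_item = round((engineer_cost + reviewer_cost) * 100)
    return max(0, spendable // per_item)
```

With a $5 remaining budget this allows 5 engineer+reviewer pairs; below the reserve it dispatches none.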


@ -0,0 +1,145 @@
---
title: "Autonomous Improvement Pipeline — Build Session 2026-04-09/10"
description: "Single-session design + implementation + first smoke test of the Paper Dynasty autonomous improvement pipeline. 2 PRs shipped, system ready to run nightly pending one more test."
type: context
domain: paper-dynasty
tags: [autonomous-pipeline, session-summary, paper-dynasty, architecture]
---
## Summary
In a single session spanning 2026-04-09 evening through 2026-04-10 early morning, Cal and Claude designed, specced, planned, implemented, merged, and ran the first smoke test of a nightly autonomous improvement pipeline for the Paper Dynasty ecosystem. The goal: a system where Cal wakes up to a Monday-morning queue of "here's what Claude did for you" PRs he can review and merge, keeping momentum even when he's unavailable.
The system ships. It produced 2 real, mergeable PRs on its first run before hitting a budget ceiling. Post-run fixes are in. The systemd timer is installed but not enabled pending one more validation run.
## The arc of the session
### Phase 1 — Brainstorming (spec)
Cal arrived with a two-part idea: (1) introspection on the codebase to recommend updates, (2) recommendations for workflow/tooling optimization. Through ~15 clarifying exchanges, we landed on this shape:
- **Nightly scheduled** (not on-demand) — moves forward despite Cal's schedule
- **Autonomous PR dispatch** (not just reports) — Monday morning review queue
- **WIP slot limits** to prevent overwhelm: 10 S, 5 M, no autonomous L; L items go to a wishlist
- **1:1 stability/feature bias** — mix both types of work
- **Three repos in scope:** database, discord-app, card-creation (card-creation has its own autonomous dynamic now)
- **Separation of concerns:**
- New **analyst agent** does code audits with fresh eyes (no ownership bias)
- **growth-po** does product/roadmap sweeps in a new "sweep mode"
- **Domain POs** (database-po, discord-po, cards-po) gate findings with go/no-go decisions
- **Engineer agents** build approved S/M work in isolated worktrees
- **pr-reviewer** gates PRs before Cal sees them
- **Rolling 30-day rejection log** so the pipeline doesn't re-suggest rejected ideas
- **Hybrid tracking:** pd-plan for slot counts + wishlist, KB for digests + rejection log
- **Transparency as a core value** — every decision, rejection, and action documented so both humans and future agents have full context
### Phase 2 — Plan
20-task implementation plan written and self-reviewed against the spec. Caught one gap during self-review: the mix ratio (§9) wasn't explicitly implemented anywhere. Added a step 6b to the orchestrator prompt. Another round of refinements during plan review:
1. Wishlist → Run Digest connection (L items should appear in nightly digest)
2. Rolling 30-day rejection context fed to analyst + growth-po to avoid re-discovery
3. Pure-bash preflight for pure data lookups (slot check, git pull, PR inventory, rejection query) — no LLM spin-up on "no slots" nights
4. Dedup as a haiku call (not a script) — semantic matching catches rewording
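The preflight itself is bash (refinement 3); its go/no-go decision is sketched in Python here for clarity, using the WIP caps from the spec. The production check lives in `autonomous/lib/check_slots.py` and `preflight.sh`, not this sketch:

```python
def preflight_ok(s_used, m_used, s_cap=10, m_cap=5):
    """No-LLM early exit: proceed only if at least one S or M slot is free.

    Caps mirror the spec's WIP limits (10 S, 5 M). On a "no slots" night
    this returns False and the wrapper exits before any agent spins up.
    """
    return s_used < s_cap or m_used < m_cap
```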
### Phase 3 — Implementation (subagent-driven)
Created worktree `.worktrees/autonomous-pipeline` on branch `feat/autonomous-pipeline`. Executed plan via subagent-driven-development skill:
- **Task 1** (inline): scaffolded `autonomous/` directory with README
- **Batch A** (sonnet subagent, Tasks 2-5): extended `pd-plan` CLI with `slot`/`wishlist` schema columns, `slots`/`wishlist` subcommands, `--slot`/`--wishlist` flags on `add`/`update`, new summary section. 8 pytest tests, all passing.
- **Task 6** (sonnet subagent): `autonomous/lib/check_slots.py` with 3 pytest tests
- **Batch B** (sonnet subagent, Tasks 7-9): bash scripts `inventory_prs.sh`, `query_rejections.sh`, `preflight.sh`. Notable: switched from `tea pulls list` to `tea api` because the former returns labels as a flat string (not objects).
- **Batch C** (sonnet subagent, Tasks 10-14): `.claude/agents/analyst.md`, sweep-mode append to `growth-po.md`, `dedup-haiku.md`, `orchestrator.md` (284 lines), `run-nightly.sh` wrapper
- **Task 18** (inline): preflight skip smoke test — added 15 dummy initiatives, verified `preflight.sh` exits 1, cleaned up
11 commits on the feature branch. Fast-forward merged to main. Worktree force-removed. Branch deleted. Pushed to origin.
One snag worth noting: the first subagent dispatch hit a wall of permission prompts Cal had to click through. Existing memory already had the rule "code-writing subagents MUST use mode: acceptEdits" — I'd just failed to apply it. Fixed for all subsequent dispatches.
### Phase 4 — Integration (Gitea + systemd)
- **Gitea labels** created via pd-ops agent in all 3 sub-project repos: `autonomous`, `size:S`, `size:M`, `type:stability`, `type:feature` (colors: `#6366f1`, `#10b981`, `#f59e0b`, `#0891b2`, `#ec4899`). Umbrella repo got its own set later when the observability ticket was filed.
- **Scheduled task** at `~/.config/claude-scheduled/tasks/autonomous-nightly/` — settings.json (haiku outer, $1 budget, 3600s timeout), prompt.md (just runs the wrapper), mcp.json (empty; the inner claude inherits Cal's global MCP config including gitea-mcp)
- **Systemd timer** at `~/.config/systemd/user/claude-scheduled@autonomous-nightly.timer` — nightly 02:00 with 15-min random delay, Persistent=true. Registered but NOT enabled.
### Phase 5 — First smoke test
Kicked off `autonomous/run-nightly.sh` at 02:40:07 local. Ran 15 minutes. Terminated at 02:55:47 by the $5 budget ceiling.
**Despite the budget hit, the pipeline actually worked:**
- Preflight ran cleanly (slots 10S/5M free, 0 open PRs, 0 rejections)
- Analyst produced 8 findings across database, discord-app, autonomous (self-improvement)
- Growth-po produced 5 findings (all discord Phase 2 roadmap items, all S-sized)
- Dedup correctly skipped (empty inputs = no possible dupes)
- POs made real decisions: many approved, several thoughtfully reshaped
- 2 PRs shipped before budget ran out, both correctly labeled and mergeable
**PRs shipped:**
- **discord-app#162** — `chore(cogs): remove dead gameplay_legacy cog (4,723 lines, zero references)` — caught that `cogs/gameplay_legacy.py` was 4,723 lines of dead code with zero inbound references
- **database#211** — `fix(packs): remove unfiltered pre-count in GET /packs (3 round-trips to 2)` — caught a real correctness bug: unfiltered `Pack.select().count()` was returning 404 when no packs existed globally instead of returning empty filter results
**What went wrong:**
1. Analyst alone consumed ~$2.50 with a 411s, 104-tool-use deep sweep
2. `pr-reviewer` dispatch was skipped — budget ran out
3. Digest Write was permission-denied (inner claude wasn't running with --dangerously-skip-permissions) — manually extracted and saved from the JSON output
4. pd-plan integration skipped — approved queued items only in the digest
5. 7 approved items never dispatched, including a high-priority real bug (economy cog overwriting `tree.on_error` causing stuck play-lock)
6. Multiple Bash tool denials wasted budget on retries (compound commands, venv activation, `source`, curl, `diff <()`)
### Phase 6 — Post-run fixes
Spun up a yolo-mode `claude -p` agent to apply three critical fixes. Commit `a79efb2`:
1. Inner claude budget: $5 → $20
2. Added `--dangerously-skip-permissions` to inner claude in `run-nightly.sh`
3. Analyst scope tightened in `.claude/agents/analyst.md`: max findings 15 → 5, added 30 tool-use cap with budget starvation rationale
Also filed `cal/paper-dynasty-umbrella#3` (labels: `autonomous`, `size:S`, `type:stability`) for the observability self-improvement (split stdout/stderr, write `last-run-result.json`, voice-notify on failure). This is exactly the kind of ticket the pipeline could pick up on a future autonomous run.
## Current state (as of 2026-04-10)
- ✅ All code merged to main and pushed to origin
- ✅ 15 Gitea labels created across 4 repos (3 sub-projects + umbrella)
- ✅ Scheduled task installed
- ✅ Systemd timer unit installed
- ✅ 2 real PRs shipped (pending Cal review / reviewer pipeline)
- ✅ Observability ticket filed
- ✅ Post-run fixes applied
- ⏸️ Systemd timer **NOT ENABLED** — pending one more validation smoke test with the $20 budget + tightened analyst
## Queued work for next run
See `project_autonomous_first_run.md` memory file for the full list. Headline items:
1. `analyst-2026-04-10-003` — Economy cog `tree.on_error` bug (real stuck-user impact) — dispatch first
2. `cal/paper-dynasty-umbrella#3` — Observability improvement (unblocks future debugging) — dispatch early
3. 5 other approved items from the first run (3 features, 2 stability)
4. 4 reshaped items that need additional spec work before dispatch
## Why this matters
This was a meta-accomplishment: building the tooling that builds the tooling. The pipeline is now a standing autonomous capability in the Paper Dynasty ecosystem. Cal's availability is no longer the bottleneck for routine stability fixes, small features, and dead-code cleanup. As confidence builds, the slot limits can rise, the budget can expand, and the scope can broaden.
The first run also validated a deeper question: **can agents produce genuinely useful work without human guidance on what to build?** The answer, based on these 2 PRs, is yes — the pipeline caught a real correctness bug and a real dead-code pile that Cal had not flagged. That's the whole value proposition working on night one.
## Next session pickup
When resuming:
1. Check status of `cal/paper-dynasty-discord#162` and `cal/paper-dynasty-database#211` — merged? closed? pending?
2. Check status of `cal/paper-dynasty-umbrella#3` — has it been picked up?
3. Decide: enable the systemd timer, or run another manual smoke test first
4. If running another smoke test: expect ~$7-10 with the new config (analyst $2, growth-po $0.30, 2 POs × $0.30, 5 engineers × $0.80, 5 pr-reviewers × $0.30)
5. See `project_autonomous_pipeline.md` and `project_autonomous_first_run.md` in memory for full context
## References
- Spec: `docs/superpowers/specs/2026-04-09-autonomous-improvement-pipeline-design.md`
- Plan: `docs/superpowers/plans/2026-04-09-autonomous-improvement-pipeline.md`
- Commit log: `git log --oneline --grep='autonomous'` in paper-dynasty-umbrella
- First run digest: `autonomous-nightly-2026-04-10.md` (this same domain)
- Live system: `/mnt/NV2/Development/paper-dynasty/autonomous/`


@ -178,7 +178,7 @@ When merging many PRs at once (e.g., batch pagination PRs), branch protection ru
| `LOG_LEVEL` | Logging verbosity (default: INFO) |
| `DATABASE_TYPE` | `postgresql` |
| `POSTGRES_HOST` | Container name of PostgreSQL |
| `POSTGRES_DB` | Database name (`pd_master`) |
| `POSTGRES_DB` | Database name `pd_master` (prod) / `paperdynasty_dev` (dev) |
| `POSTGRES_USER` | DB username |
| `POSTGRES_PASSWORD` | DB password |
@ -189,4 +189,6 @@ When merging many PRs at once (e.g., batch pagination PRs), branch protection ru
| Database API (prod) | `ssh akamai` | `pd_api` | 815 |
| Database API (dev) | `ssh pd-database` | `dev_pd_database` | 813 |
| PostgreSQL (prod) | `ssh akamai` | `pd_postgres` | 5432 |
| PostgreSQL (dev) | `ssh pd-database` | `pd_postgres` | 5432 |
| PostgreSQL (dev) | `ssh pd-database` | `sba_postgres` | 5432 |
**Dev database credentials:** container `sba_postgres`, database `paperdynasty_dev`, user `sba_admin`. Prod uses `pd_postgres`, database `pd_master`.


@ -0,0 +1,170 @@
---
title: "Discord Bot Browser Testing via Playwright + CDP"
description: "Step-by-step workflow for automated Discord bot testing using Playwright connected to Brave browser via Chrome DevTools Protocol. Covers setup, slash command execution, and screenshot capture."
type: runbook
domain: paper-dynasty
tags: [paper-dynasty, discord, testing, playwright, automation]
---
# Discord Bot Browser Testing via Playwright + CDP
Automated testing of Paper Dynasty Discord bot commands by connecting Playwright to a running Brave browser instance with Discord open.
## Prerequisites
- Brave browser installed (`brave-browser-stable`)
- Playwright installed (`pip install playwright && playwright install chromium`)
- Discord logged in via browser (not desktop app)
- Discord bot running (locally via docker-compose or on remote host)
- Bot's `API_TOKEN` must match the target API environment
## Setup
### 1. Launch Brave with CDP enabled
Brave must be started with `--remote-debugging-port`. If Brave is already running, **kill it first** — otherwise the flag is ignored and the new process merges into the existing one.
```bash
killall brave 2>/dev/null; sleep 2; brave-browser-stable --remote-debugging-port=9222 &
```
### 2. Verify CDP is responding
```bash
curl -s http://localhost:9222/json/version | python3 -m json.tool
```
Should return JSON with `Browser`, `webSocketDebuggerUrl`, etc.
### 3. Open Discord in browser
Navigate to `https://discord.com/channels/<server_id>/<channel_id>` in Brave.
**Paper Dynasty test server:**
- Server: Cals Test Server (`669356687294988350`)
- Channel: #pd-game-test (`982850262903451658`)
- URL: `https://discord.com/channels/669356687294988350/982850262903451658`
### 4. Verify bot is running with correct API token
```bash
# Check docker-compose.yml has the right API_TOKEN for the target environment
grep API_TOKEN /mnt/NV2/Development/paper-dynasty/discord-app/docker-compose.yml
# Verify the dev database on the dev host is reachable:
ssh pd-database "docker exec sba_postgres psql -U sba_admin -d paperdynasty_dev -c \"SELECT 1;\""
# Restart bot if token was changed:
cd /mnt/NV2/Development/paper-dynasty/discord-app && docker compose up -d
```
## Running Commands
### Find the Discord tab
```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.connect_over_cdp('http://localhost:9222')
    for ctx in browser.contexts:
        for page in ctx.pages:
            if 'discord' in page.url.lower():
                print(f'Found: {page.url}')
                break
    browser.close()
```
### Execute a slash command and capture result
```python
from playwright.sync_api import sync_playwright
import time

def run_slash_command(command: str, wait_seconds: int = 5, screenshot_path: str = '/tmp/discord_result.png'):
    """
    Type a slash command in Discord, select the top autocomplete option,
    submit it, wait for the bot response, and take a screenshot.
    """
    with sync_playwright() as p:
        browser = p.chromium.connect_over_cdp('http://localhost:9222')
        for ctx in browser.contexts:
            for page in ctx.pages:
                if 'discord' in page.url.lower():
                    msg_box = page.locator('[role="textbox"][data-slate-editor="true"]')
                    msg_box.click()
                    time.sleep(0.3)
                    # Type the command (delay simulates human typing for autocomplete)
                    msg_box.type(command, delay=80)
                    time.sleep(2)
                    # Tab selects the top autocomplete option
                    page.keyboard.press('Tab')
                    time.sleep(1)
                    # Enter submits the command
                    page.keyboard.press('Enter')
                    time.sleep(wait_seconds)
                    page.screenshot(path=screenshot_path)
                    print(f'Screenshot saved to {screenshot_path}')
                    break
        browser.close()

# Example usage:
run_slash_command('/refractor status')
```
### Commands with parameters
After pressing Tab to select the command, Discord shows an options panel. To fill parameters:
1. The first parameter input is auto-focused after Tab
2. Type the value, then Tab to move to the next parameter
3. Press Enter when ready to submit
```python
# Example: /refractor status with tier filter (msg_box and page as in run_slash_command above)
msg_box.type('/refractor status', delay=80)
time.sleep(2)
page.keyboard.press('Tab') # Select command from autocomplete
time.sleep(1)
# Now fill parameters if needed, or just submit
page.keyboard.press('Enter')
```
## Key Selectors
| Element | Selector |
|---------|----------|
| Message input box | `[role="textbox"][data-slate-editor="true"]` |
| Autocomplete popup | `[class*="autocomplete"]` |
## Gotchas
- **Brave must be killed before relaunch** — if an instance is already running, `--remote-debugging-port` is silently ignored
- **Bot token mismatch** — the bot's `API_TOKEN` in `docker-compose.yml` must match the target API (dev or prod). Symptoms: `{"detail":"Unauthorized"}` in bot logs
- **Viewport is None** — when connecting via CDP, `page.viewport_size` returns None. Use `page.evaluate('() => ({w: window.innerWidth, h: window.innerHeight})')` instead
- **Autocomplete timing** — typing too fast may not trigger Discord's autocomplete. The `delay=80` on `msg_box.type()` simulates human speed
- **Multiple bots** — if multiple bots register the same slash command (e.g. MantiTestBot and PucklTestBot), Tab selects the top option. Verify the correct bot name in the autocomplete popup before proceeding
## Test Plan Reference
The Refractor integration test plan is at:
`discord-app/tests/refractor-integration-test-plan.md`
Key test case groups:
- REF-01 to REF-06: Tier badges and display
- REF-10 to REF-15: Progress bars and filtering
- REF-40 to REF-42: Cross-command badges (card, roster)
- REF-70 to REF-72: Cross-command badge propagation (the current priority)
## Verified On
- **Date:** 2026-04-06
- **Browser:** Brave 146.0.7680.178 (Chromium-based)
- **Playwright:** Node.js driver via Python sync API
- **Bot:** MantiTestBot on Cals Test Server, #pd-game-test channel
- **API:** pddev.manticorum.com (dev environment)


@ -0,0 +1,107 @@
---
title: "Refractor In-App Test Plan"
description: "Comprehensive manual test plan for the Refractor card evolution system — covers /refractor status, tier badges, post-game hooks, tier-up notifications, card art tiers, and known issues."
type: guide
domain: paper-dynasty
tags: [paper-dynasty, testing, refractor, discord, database]
---
# Refractor In-App Test Plan
Manual test plan for the Refractor (card evolution) system. All testing targets **dev** environment (`pddev.manticorum.com` / dev Discord bot).
## Prerequisites
- Dev bot running on `sba-bots`
- Dev API at `pddev.manticorum.com` (port 813)
- Team with seeded refractor data (team 31 from prior session)
- At least one game playable to trigger post-game hooks
---
## REF-10: `/refractor status` — Basic Display
| # | Test | Steps | Expected |
|---|---|---|---|
| 10 | No filters | `/refractor status` | Ephemeral embed with team branding, tier summary line, 10 cards sorted by tier DESC, pagination buttons if >10 cards |
| 11 | Card type filter | `/refractor status card_type:Batter` | Only batter cards shown, count matches |
| 12 | Tier filter | `/refractor status tier:T2—Refractor` | Only T2 cards, embed color changes to tier color |
| 13 | Progress filter | `/refractor status progress:Close to next tier` | Only cards >=80% to next threshold, fully evolved excluded |
| 14 | Combined filters | `/refractor status card_type:Batter tier:T1—Base Chrome` | Intersection of both filters |
| 15 | Empty result | `/refractor status tier:T4—Superfractor` (if none exist) | "No cards match your filters..." message with filter details |
## REF-20: `/refractor status` — Pagination
| # | Test | Steps | Expected |
|---|---|---|---|
| 20 | Page buttons appear | `/refractor status` with >10 cards | Prev/Next buttons visible |
| 21 | Next page | Click `Next >` | Page 2 shown, footer updates to "Page 2/N" |
| 22 | Prev page | From page 2, click `< Prev` | Back to page 1 |
| 23 | First page prev | On page 1, click `< Prev` | Nothing happens / stays on page 1 |
| 24 | Last page next | On last page, click `Next >` | Nothing happens / stays on last page |
| 25 | Button timeout | Wait 120s after command | Buttons become unresponsive |
| 26 | Wrong user clicks | Another user clicks buttons | Silently ignored |
## REF-30: Tier Badges in Card Embeds
| # | Test | Steps | Expected |
|---|---|---|---|
| 30 | T0 card display | View a T0 card via `/myteam` or `/roster` | No badge prefix, just player name |
| 31 | T1 badge | View a T1 card | Title shows `[BC] Player Name` |
| 32 | T2 badge | View a T2 card | Title shows `[R] Player Name` |
| 33 | T3 badge | View a T3 card | Title shows `[GR] Player Name` |
| 34 | T4 badge | View a T4 card (if exists) | Title shows `[SF] Player Name` |
| 35 | Badge in pack open | Open a pack with an evolved card | Badge appears in pack embed |
| 36 | API down gracefully | (hard to test) | Card displays normally with no badge, no error |
## REF-50: Post-Game Hook & Tier-Up Notifications
| # | Test | Steps | Expected |
|---|---|---|---|
| 50 | Game completes normally | Play a full game | No errors in bot logs; refractor evaluate-game fires after season-stats update |
| 51 | Tier-up notification | Play game where a card crosses a threshold | Embed in game channel: "Refractor Tier Up!", player name, tier name, correct color |
| 52 | No tier-up | Play game where no thresholds crossed | No refractor embed posted, game completes normally |
| 53 | Multiple tier-ups | Game where 2+ players tier up | One embed per tier-up, all posted |
| 54 | Auto-init new card | Play game with a card that has no RefractorCardState | State created automatically, player evaluated, no error |
| 55 | Superfractor notification | (may need forced data) | "SUPERFRACTOR!" title, teal color |
## REF-60: Card Art with Tiers (API-level)
| # | Test | Steps | Expected |
|---|---|---|---|
| 60 | T0 card image | `GET /api/v2/players/{id}/card-image?card_type=batting` | Base card, no tier styling |
| 61 | Tier override | `GET ...?card_type=batting&tier=2` | Refractor styling visible (border, diamond indicator) |
| 62 | Each tier visual | `?tier=1` through `?tier=4` | Correct border colors, diamond fill, header gradients per tier |
| 63 | Pitcher card | `?card_type=pitching&tier=2` | Tier styling applies correctly to pitcher layout |
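The REF-60 checks are scriptable. A minimal URL builder against the dev host follows — player id 7 is a placeholder, and no auth is shown (add whatever token header the dev API expects):

```python
from urllib.parse import urlencode

DEV_BASE = "https://pddev.manticorum.com"  # dev API from the prerequisites above

def card_image_url(player_id, card_type="batting", tier=None):
    """Build a REF-60 card-art URL; omit tier for the base (T0) render."""
    params = {"card_type": card_type}
    if tier is not None:
        params["tier"] = tier  # tier override, e.g. 1-4
    return f"{DEV_BASE}/api/v2/players/{player_id}/card-image?{urlencode(params)}"
```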
## REF-70: Known Issues to Verify
| # | Issue | Check | Status |
|---|---|---|---|
| 70 | Superfractor embed says "Rating boosts coming in a future update!" | Verify — boosts ARE implemented now, text is stale | **Fix needed** |
| 71 | `on_timeout` doesn't edit message | Buttons stay visually active after 120s | **Known, low priority** |
| 72 | Card embed perf (1 API call per card) | Note latency on roster views with 10+ cards | **Monitor** |
| 73 | Season-stats failure kills refractor eval | Both in same try/except | **Known risk, verify logging** |
---
## API Endpoints Under Test
| Method | Endpoint | Used By |
|---|---|---|
| GET | `/api/v2/refractor/tracks` | Track listing |
| GET | `/api/v2/refractor/cards?team_id=X` | `/refractor status` command |
| GET | `/api/v2/refractor/cards/{card_id}` | Tier badge in card embeds |
| POST | `/api/v2/refractor/cards/{card_id}/evaluate` | Force re-evaluation |
| POST | `/api/v2/refractor/evaluate-game/{game_id}` | Post-game hook |
| GET | `/api/v2/teams/{team_id}/refractors` | Teams alias endpoint |
| GET | `/api/v2/players/{id}/card-image?tier=N` | Card art tier preview |
## Notification Embed Colors
| Tier | Name | Color |
|---|---|---|
| T1 | Base Chrome | Green (0x2ECC71) |
| T2 | Refractor | Gold (0xF1C40F) |
| T3 | Gold Refractor | Purple (0x9B59B6) |
| T4 | Superfractor | Teal (0x1ABC9C) |


@ -158,6 +158,23 @@ ls -t ~/.local/share/claude-scheduled/logs/backlog-triage/ | head -1
~/.config/claude-scheduled/runner.sh backlog-triage
```
## Session Resumption
Tasks can opt into session persistence for multi-step workflows:
```json
{
"session_resumable": true,
"resume_last_session": true
}
```
When `session_resumable` is `true`, runner.sh saves the `session_id` to `$LOG_DIR/last_session_id` after each run. When `resume_last_session` is also `true`, the next run resumes that session with `--resume`.
Issue-poller and PR-reviewer capture `session_id` in logs and result JSON for manual follow-up.
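The save/resume flow can be sketched in shell (illustrative only; the helper names, JSON shape, and parsing here are assumptions, not the actual runner.sh code):
```bash
# Hypothetical sketch of the session save/resume flow, not actual runner.sh code.
LOG_DIR="${LOG_DIR:-$(mktemp -d)}"
SESSION_FILE="$LOG_DIR/last_session_id"

# Emit "--resume <id>" when resume_last_session is true and an id was saved
build_resume_args() {
  if [ "$1" = "true" ] && [ -s "$SESSION_FILE" ]; then
    printf '%s' "--resume $(cat "$SESSION_FILE")"
  fi
}

# After a run: persist session_id from the result JSON when resumable
save_session_id() {
  if [ "$1" = "true" ]; then
    printf '%s' "$2" | sed -n 's/.*"session_id" *: *"\([^"]*\)".*/\1/p' \
      > "$SESSION_FILE"
  fi
}

save_session_id true '{"session_id":"abc-123","result":"ok"}'
build_resume_args true   # prints: --resume abc-123
```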
See also: [Agent SDK Evaluation](agent-sdk-evaluation.md) for CLI vs SDK comparison.
## Cost Safety
- Per-task `max_budget_usd` cap — runner.sh detects `error_max_budget_usd` and warns
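A minimal sketch of that detection (the JSON shape below is illustrative; only the `error_max_budget_usd` marker comes from this doc):
```bash
# Illustrative result JSON; real claude -p output has more fields.
result='{"subtype":"error_max_budget_usd","total_cost_usd":2.01}'
if printf '%s' "$result" | grep -q 'error_max_budget_usd'; then
  echo "WARNING: run stopped by max_budget_usd cap"
fi
```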


@ -0,0 +1,175 @@
---
title: "Agent SDK Evaluation — CLI vs Python/TypeScript SDK"
description: "Comparison of Claude Code CLI invocation (claude -p) vs the native Agent SDK for programmatic use in the headless-claude and claude-scheduled systems."
type: context
domain: scheduled-tasks
tags: [claude-code, sdk, agent-sdk, python, typescript, headless, automation, evaluation]
---
# Agent SDK Evaluation: CLI vs Python/TypeScript SDK
**Date:** 2026-04-03
**Status:** Evaluation complete — recommendation below
**Related:** Issue #3 (headless-claude: Additional Agent SDK improvements)
## 1. Current Approach — CLI via `claude -p`
All headless Claude invocations use the CLI subprocess pattern:
```bash
claude -p "<prompt>" \
--model sonnet \
--output-format json \
--allowedTools "Read,Grep,Glob" \
--append-system-prompt "..." \
--max-budget-usd 2.00
```
**Pros:**
- Simple to invoke from any language (bash, n8n SSH nodes, systemd units)
- Uses Claude Max OAuth — no API key needed, no per-token billing
- Mature and battle-tested in our scheduled-tasks framework
- CLAUDE.md and settings.json are loaded automatically
- No runtime dependencies beyond the CLI binary
**Cons:**
- Structured output requires parsing JSON from stdout
- Error handling is exit-code-based with stderr parsing
- No mid-stream observability (streaming requires JSONL parsing)
- Tool approval is allowlist-only — no dynamic per-call decisions
- Session resumption requires manual `--resume` flag plumbing
## 2. Python Agent SDK
**Package:** `claude-agent-sdk` (renamed from `claude-code`)
**Install:** `pip install claude-agent-sdk`
**Requires:** Python 3.10+, `ANTHROPIC_API_KEY` env var
```python
import asyncio
from claude_agent_sdk import query, ClaudeAgentOptions

async def main():
    # the query() async generator must be consumed inside an event loop
    async for message in query(
        prompt="Diagnose server health",
        options=ClaudeAgentOptions(
            allowed_tools=["Read", "Grep", "Bash(python3 *)"],
            output_format={"type": "json_schema", "schema": {...}},
            max_budget_usd=2.00,
        ),
    ):
        if hasattr(message, "result"):
            print(message.result)

asyncio.run(main())
```
**Key features:**
- Async generator with typed `SDKMessage` objects (User, Assistant, Result, System)
- `ClaudeSDKClient` for stateful multi-turn conversations
- `can_use_tool` callback for dynamic per-call tool approval
- In-process hooks (`PreToolUse`, `PostToolUse`, `Stop`, etc.)
- `rewindFiles()` to restore filesystem to any prior message point
- Typed exception hierarchy (`CLINotFoundError`, `ProcessError`, etc.)
**Limitation:** Shells out to the Claude Code CLI binary — it is NOT a pure HTTP client. The binary must be installed.
## 3. TypeScript Agent SDK
**Package:** `@anthropic-ai/claude-agent-sdk` (renamed from `@anthropic-ai/claude-code`)
**Install:** `npm install @anthropic-ai/claude-agent-sdk`
**Requires:** Node 18+, `ANTHROPIC_API_KEY` env var
```typescript
import { query } from "@anthropic-ai/claude-agent-sdk";
for await (const message of query({
prompt: "Diagnose server health",
options: {
allowedTools: ["Read", "Grep", "Bash(python3 *)"],
maxBudgetUsd: 2.00,
}
})) {
if ("result" in message) console.log(message.result);
}
```
**Key features (superset of Python):**
- Same async generator pattern
- `"auto"` permission mode (model classifier per tool call) — TS-only
- `spawnClaudeCodeProcess` hook for remote/containerized execution
- `setMcpServers()` for dynamic MCP server swapping mid-session
- V2 preview: `send()` / `stream()` patterns for simpler multi-turn
- Bundles the Claude Code binary — no separate install needed
## 4. Comparison Matrix
| Capability | `claude -p` CLI | Python SDK | TypeScript SDK |
|---|---|---|---|
| **Auth** | OAuth (Claude Max) | API key only | API key only |
| **Invocation** | Shell subprocess | Async generator | Async generator |
| **Structured output** | `--json-schema` flag | Schema in options | Schema in options |
| **Streaming** | JSONL parsing | Typed messages | Typed messages |
| **Tool approval** | `--allowedTools` only | `can_use_tool` callback | `canUseTool` callback + auto mode |
| **Session resume** | `--resume` flag | `resume: sessionId` | `resume: sessionId` |
| **Cost tracking** | Parse result JSON | `ResultMessage.total_cost_usd` | Same + per-model breakdown |
| **Error handling** | Exit codes + stderr | Typed exceptions | Typed exceptions |
| **Hooks** | External shell scripts | In-process callbacks | In-process callbacks |
| **Custom tools** | Not available | `tool()` decorator | `tool()` + Zod schemas |
| **Subagents** | Not programmatic | `agents` option | `agents` option |
| **File rewind** | Not available | `rewindFiles()` | `rewindFiles()` |
| **MCP servers** | `--mcp-config` file | Inline config object | Inline + dynamic swap |
| **CLAUDE.md loading** | Automatic | Must opt-in (`settingSources`) | Must opt-in |
| **Dependencies** | CLI binary | CLI binary + Python | Node 18+ (bundles CLI) |
## 5. Integration Paths
### A. n8n Code Nodes
The n8n Code node supports JavaScript (not TypeScript directly, but the SDK's JS output works). This would replace the current SSH → CLI pattern:
```
Schedule Trigger → Code Node (JS, uses SDK) → IF → Discord
```
**Trade-off:** Eliminates the SSH hop to CT 300, but requires `ANTHROPIC_API_KEY` and n8n to have the npm package installed. Current n8n runs in a Docker container on CT 210 — would need the SDK and CLI binary in the image.
### B. Standalone Python Scripts
Replace `claude -p` subprocess calls in custom dispatchers with the Python SDK:
```python
# Instead of: subprocess.run(["claude", "-p", prompt, ...])
async for msg in query(prompt=prompt, options=opts):
...
```
**Trade-off:** Richer error handling and streaming, but our dispatchers are bash scripts, not Python. Would require rewriting `runner.sh` and dispatchers in Python.
### C. Systemd-triggered Tasks (Current Architecture)
Keep systemd timers → bash scripts, but optionally invoke a thin Python wrapper that uses the SDK instead of `claude -p` directly.
**Trade-off:** Adds Python as a dependency for scheduled tasks that currently only need bash + the CLI binary. Marginal benefit unless we need hooks or dynamic tool approval.
## 6. Recommendation
**Stay with CLI invocation for now. Revisit the Python SDK when we need dynamic tool approval or in-process hooks.**
### Rationale
1. **Auth is the blocker.** The SDK requires `ANTHROPIC_API_KEY` (API billing). Our entire scheduled-tasks framework runs on Claude Max OAuth at zero marginal cost. Switching to the SDK means paying per-token for every scheduled task, issue-worker, and PR-reviewer invocation. This alone makes the SDK non-viable for our current architecture.
2. **The CLI covers our needs.** With `--append-system-prompt` (done), `--resume` (this PR), `--json-schema`, and `--allowedTools`, the CLI provides everything we currently need. Session resumption was the last missing piece.
3. **Bash scripts are the right abstraction.** Our runners are launched by systemd timers. Bash + CLI is the natural fit — no runtime dependencies, no async event loops, no package management.
### When to Revisit
- If Anthropic adds OAuth support to the SDK (eliminating the billing difference)
- If we need dynamic tool approval (e.g., "allow this Bash command but deny that one" at runtime)
- If we build a long-running Python service that orchestrates multiple Claude sessions (the `ClaudeSDKClient` stateful pattern would be valuable there)
- If we move to n8n custom nodes written in TypeScript (the TS SDK bundles the CLI binary)
### Migration Path (If Needed Later)
1. Start with the Python SDK in a single task (e.g., `backlog-triage`) as a proof of concept
2. Create a thin `sdk-runner.py` wrapper that reads the same `settings.json` and `prompt.md` files
3. Swap the systemd unit's `ExecStart` from `runner.sh` to `sdk-runner.py`
4. Expand to other tasks if the POC proves valuable


@ -0,0 +1,46 @@
---
title: "Backlog triage sandbox fix — repos.json outside working directory"
description: "Fix for backlog-triage scheduled task failing to read repos.json because the file was outside the claude -p sandbox (working_dir). Resolved by symlinking into the working directory."
type: troubleshooting
domain: scheduled-tasks
tags: [claude-code, backlog-triage, sandbox, runner, troubleshooting]
---
# Backlog Triage — repos.json Outside Sandbox
**Date**: 2026-04-07
## Problem
The `backlog-triage` scheduled task reported:
> `~/.config/claude-scheduled/repos.json` is outside the allowed session directories and couldn't be read.
The task fell back to querying all discoverable repos via Gitea instead of using the curated repo list.
## Root Cause
`claude -p` sandboxes file access to the **working directory** (`/mnt/NV2/Development/claude-home`). The `repos.json` file lives at `~/.config/claude-scheduled/repos.json` (`/home/cal/`), which is outside the sandbox.
The `--allowedTools "Read(~/.config/claude-scheduled/repos.json)"` flag controls **tool permissions** (which tools the session can call), not **filesystem access**. The sandbox boundary is set by the working directory, and `allowedTools` cannot override it.
## Fix
1. **Symlinked** `repos.json` into the working directory:
```bash
ln -sf /home/cal/.config/claude-scheduled/repos.json \
/mnt/NV2/Development/claude-home/.claude/repos.json
```
2. **Updated** `tasks/backlog-triage/prompt.md` to reference `.claude/repos.json` instead of the absolute home-dir path.
3. **Updated** `tasks/backlog-triage/settings.json` allowed_tools to `Read(.claude/repos.json)`.
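The fix can be rehearsed with throwaway paths (the real paths are `~/.config/claude-scheduled/repos.json` and the claude-home working directory):
```bash
# Rehearse the symlink fix in temporary directories.
workdir=$(mktemp -d)   # stand-in for the claude -p working directory
confdir=$(mktemp -d)   # stand-in for ~/.config/claude-scheduled
echo '{"repos": []}' > "$confdir/repos.json"
mkdir -p "$workdir/.claude"
ln -sf "$confdir/repos.json" "$workdir/.claude/repos.json"
cd "$workdir"
# The symlink resolves and is readable from inside the working directory
test -r .claude/repos.json && echo "repos.json readable from working dir"
```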
## Key Lesson
For `runner.sh` template tasks, any file the task needs to read **must be inside the working directory** or reachable via a symlink within it. The `--allowedTools` flag is a permissions layer on top of the sandbox — it cannot grant access to paths outside the sandbox.
## Also Changed (same session)
- Removed `cognitive-memory` MCP from backlog-triage; replaced with `kb-search` (HTTP MCP at `10.10.0.226:8001/mcp`) for cross-referencing issue context against the knowledge base.
- Removed all `mcp__cognitive-memory__*` tools from allowed_tools; added `mcp__kb-search__search` and `mcp__kb-search__get_document`.


@ -0,0 +1,81 @@
---
title: "Fix: pr-reviewer leaving ai-reviewing label stuck on PRs"
description: "Duplicate Gitea labels caused _get_label_id to SIGPIPE under pipefail, making remove_label silently bail and orphaning the ai-reviewing tag across 6 PRs."
type: troubleshooting
domain: scheduled-tasks
tags:
- troubleshooting
- pr-reviewer
- gitea
- labels
- bash
- pipefail
- claude-scheduled
---
# Fix: pr-reviewer leaving `ai-reviewing` label stuck on PRs
**Date:** 2026-04-10
**Severity:** Medium — pr-reviewer-dispatcher skipped any PR that already carried `ai-reviewing`, so stuck PRs were never re-reviewed. Six PRs across `major-domo-database` and `paper-dynasty-database` were silently wedged for weeks.
## Problem
Open PRs across the tracked repos accumulated the orange `ai-reviewing` label with no corresponding `ai-reviewed` / `ai-changes-requested` outcome. Because `pr-reviewer-dispatcher.sh` filters out any PR that already has one of those three labels, stuck PRs stayed invisible to future runs.
Two distinct stuck patterns were observable:
1. **Both labels attached** (`major-domo-database` #128, #124): the reviewer clearly ran to completion — `ai-reviewed` was added — but `ai-reviewing` was never removed.
2. **Only `ai-reviewing` attached** (`paper-dynasty-database` #207, #126, #125; `major-domo-database` #122): no review outcome label at all. Looked like a mid-run crash.
## Root Cause
Two compounding bugs in `~/.config/claude-scheduled/gitea-lib.sh`.
### 1. Duplicate labels accumulated in repos
`ensure_label` had no de-duplication check. Any transient failure in `_get_label_id` (bad response, jq parse, pipeline issue) fell through and created a *new* label with the same name. Over time two `ai-reviewing` rows existed in both `major-domo-database` (ids 30, 31) and `paper-dynasty-database` (ids 60, 35); `paper-dynasty-discord` had the same issue with `ai-working`.
### 2. `_get_label_id` SIGPIPE under pipefail
The original helper was:
```bash
_get_label_id() {
gitea_get "repos/$owner/$repo/labels?limit=50" |
jq -r --arg name "$name" '.[] | select(.name == $name) | .id' 2>/dev/null |
head -1
}
```
The dispatcher runs under `set -euo pipefail`. With duplicate labels present, `jq` emits multiple id lines. `head -1` closes the pipe after the first line → `jq` hits SIGPIPE on the next write → pipeline exits non-zero → `pipefail` propagates. Inside `remove_label`:
```bash
label_id=$(_get_label_id ...) || return 0
```
…the `|| return 0` guard then **silently returned without ever calling DELETE**. The reviewer continued on and added `ai-reviewed`, leaving `ai-reviewing` orphaned. Same mechanism in the cleanup trap meant crashed runs also couldn't remove the label.
Additionally, even when the pipe didn't fire SIGPIPE, `remove_label` was resolving the label id against the *repo label catalog* rather than the labels actually attached to the PR — so for `paper-dynasty-database` #125 (which had id=35 attached), `head -1` returned id=60 and the DELETE was a no-op on an id that wasn't even there.
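The SIGPIPE mechanism is easy to reproduce in isolation, using `seq` as a stand-in for `jq` emitting multiple id lines:
```bash
set -o pipefail
# head closes the pipe after one line; seq dies with SIGPIPE on a later
# write, and pipefail surfaces its exit status (128 + SIGPIPE = 141 on Linux).
status_head=0
seq 1 100000 | head -n 1 > /dev/null || status_head=$?
# sed -n '1p' reads the whole stream before exiting, so nothing closes
# the pipe early and the pipeline succeeds.
status_sed=0
seq 1 100000 | sed -n '1p' > /dev/null || status_sed=$?
echo "head=$status_head sed=$status_sed"
```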
## Fix
**`gitea-lib.sh` hardened (three helpers):**
- **`_get_label_id`** — replaced `head -1` with `jq 'sort_by(.id) | .[0].id // empty'`. No pipeline truncation → no SIGPIPE. Also bumped `limit=50` → `limit=200` so large repos aren't silently truncated. On duplicates the *oldest* id is returned (the canonical row).
- **`remove_label`** — now queries `repos/{o}/{r}/issues/{n}/labels` (labels actually attached to the PR), matches by name, and deletes every matching id. Can no longer DELETE the wrong id, and handles the theoretical case where both duplicates got attached.
- **`ensure_label`** — counts existing labels with the target name before lookup, logs `WARNING: $repo has N labels named '$name' — reusing oldest` so the dispatcher log surfaces future dupes.
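The replacement filter can be exercised standalone. Given a catalog with duplicate names (the sample data below is illustrative), it consumes the whole stream and emits exactly one id, the oldest:
```bash
labels='[{"id":31,"name":"ai-reviewing"},{"id":30,"name":"ai-reviewing"},{"id":7,"name":"bug"}]'
printf '%s' "$labels" | jq -r --arg name "ai-reviewing" \
  '[.[] | select(.name == $name)] | sort_by(.id) | .[0].id // empty'
# prints: 30 (the oldest duplicate)
```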
**Repo cleanup:**
- Cleared stale `ai-reviewing` from the 6 stuck PRs via the patched `remove_label`.
- Deleted duplicate label rows (kept the oldest id in each repo): `major-domo-database` id 31, `paper-dynasty-database` id 60, `paper-dynasty-discord` id 52 (`ai-working`).
- Swept all tracked repos for other `ai-*` label dupes — none remaining.
**Verification:** `bash -n`, then `pr-reviewer-dispatcher.sh --dry-run` — correctly re-discovered the 5 PRs that had only `ai-reviewing` (now clean) and properly skipped the 2 that already had `ai-reviewed`.
## Lessons
- **`set -o pipefail` + `head -N` is a foot-gun.** Whenever a downstream stage can close the pipe early, upstream producers will get SIGPIPE and fail the pipeline. Use a filter that consumes the whole stream (`jq '.[0]'`, `sed -n '1p'`) or read the output into a variable and slice it — never `| head -1` in a pipefail script. Note that `awk 'NR==1{print; exit}'` also exits early and can trip the same SIGPIPE.
- **Resolve label ids from the issue, not the repo catalog.** Gitea allows duplicate label names per repo. Any helper that maps a name → id from the repo catalog and then acts on an issue is ambiguous. Always query `issues/{n}/labels` when you need to mutate an attachment.
- **"Get or create" helpers need a de-dup guard.** `ensure_label` should either tolerate duplicates by reusing the oldest (what we did) or hard-error and force a human to clean up; silently creating a new row on any transient failure accumulates garbage state over weeks.
- **Skip-label dispatchers need a staleness timeout.** The dispatcher currently treats `ai-reviewing` as a permanent skip signal. A stuck label wedges a PR forever. Consider adding a timestamp check (e.g., `ai-reviewing` older than 1 hour → force re-review) as a belt-and-suspenders guard against future variants of this bug.


@ -245,11 +245,25 @@ hosts:
- sqlite-major-domo
- temp-postgres
# Docker Home Servers VM (Proxmox) - decommission candidate
# VM 116: Only Jellyfin remains after 2026-04-03 cleanup (watchstate removed — duplicate of manticore's canonical instance)
# Jellyfin on manticore already covers this service. VM 116 + VM 110 are candidates to reclaim 8 vCPUs + 16 GB RAM.
# See issue #31 for cleanup details.
docker-home-servers:
type: docker
ip: 10.10.0.124
vmid: 116
user: cal
description: "Legacy home servers VM — Jellyfin only, decommission candidate"
config_paths:
docker-compose: /home/cal/container-data
services:
- jellyfin # only remaining service; duplicate of ubuntu-manticore jellyfin
decommission_candidate: true
notes: "watchstate removed 2026-04-03 (duplicate of manticore); 3.36 GB images pruned; see issue #31"
# Decommissioned hosts (kept for reference)
# decommissioned:
# tdarr-old:
# ip: 10.10.0.43
# note: "Replaced by ubuntu-manticore tdarr"
# docker-home:
# ip: 10.10.0.124
# note: "Decommissioned"


@ -0,0 +1,246 @@
---
title: "Proxmox Monthly Maintenance Reboot"
description: "Runbook for the first-Sunday-of-the-month Proxmox host reboot — dependency-aware shutdown/startup order, validation checklist, and Ansible automation."
type: runbook
domain: server-configs
tags: [proxmox, maintenance, reboot, ansible, operations, systemd]
---
# Proxmox Monthly Maintenance Reboot
## Overview
| Detail | Value |
|--------|-------|
| **Schedule** | 1st Sunday of every month, 08:00 UTC (3:00 AM EST, 4:00 AM EDT) |
| **Expected downtime** | ~15 minutes (host reboot + VM/LXC startup) |
| **Orchestration** | Ansible on LXC 304 — shutdown playbook → host reboot → post-reboot startup playbook |
| **Calendar** | Google Calendar recurring event: "Proxmox Monthly Maintenance Reboot" |
| **HA DNS** | ubuntu-manticore (10.10.0.226) provides Pi-hole 2 during Proxmox downtime |
## Why
- Kernel updates accumulate without reboot and never take effect
- Long uptimes allow memory leaks and process state drift (e.g., avahi busy-loops)
- Validates that all VMs/LXCs auto-start cleanly with `onboot: 1`
## Architecture
The reboot is split into two playbooks because LXC 304 (the Ansible controller) is itself a guest on the Proxmox host being rebooted:
1. **`monthly-reboot.yml`** — Snapshots all guests, shuts them down in dependency order, issues a fire-and-forget `reboot` to the Proxmox host, then exits. LXC 304 is killed when the host reboots.
2. **`post-reboot-startup.yml`** — After the host reboots, LXC 304 auto-starts via `onboot: 1`. A systemd service (`ansible-post-reboot.service`) waits 120 seconds for the Proxmox API to stabilize, then starts all guests in dependency order with staggered delays.
The `onboot: 1` flag on all production guests acts as a safety net — even if the post-reboot playbook fails, Proxmox will start everything (though without controlled ordering).
## Prerequisites (Before Maintenance)
- [ ] Verify no active Tdarr transcodes on ubuntu-manticore
- [ ] Verify no running database backups
- [ ] Ensure workstation has Pi-hole 2 (10.10.0.226) as a fallback DNS server so it fails over automatically during downtime
- [ ] Confirm ubuntu-manticore Pi-hole 2 is healthy: `ssh manticore "docker exec pihole pihole status"`
## `onboot` Audit
All production VMs and LXCs must have `onboot: 1` so they restart automatically as a safety net.
**Check VMs:**
```bash
ssh proxmox "for id in \$(qm list | awk 'NR>1{print \$1}'); do \
name=\$(qm config \$id | grep '^name:' | awk '{print \$2}'); \
onboot=\$(qm config \$id | grep '^onboot:'); \
echo \"VM \$id (\$name): \${onboot:-onboot NOT SET}\"; \
done"
```
**Check LXCs:**
```bash
ssh proxmox "for id in \$(pct list | awk 'NR>1{print \$1}'); do \
name=\$(pct config \$id | grep '^hostname:' | awk '{print \$2}'); \
onboot=\$(pct config \$id | grep '^onboot:'); \
echo \"LXC \$id (\$name): \${onboot:-onboot NOT SET}\"; \
done"
```
**Audit results (2026-04-03):**
| ID | Name | Type | `onboot` | Status |
|----|------|------|----------|--------|
| 106 | docker-home | VM | 1 | OK |
| 109 | homeassistant | VM | 1 | OK (fixed 2026-04-03) |
| 110 | discord-bots | VM | 1 | OK |
| 112 | databases-bots | VM | 1 | OK |
| 115 | docker-sba | VM | 1 | OK |
| 116 | docker-home-servers | VM | 1 | OK |
| 210 | docker-n8n-lxc | LXC | 1 | OK |
| 221 | arr-stack | LXC | 1 | OK (fixed 2026-04-03) |
| 222 | memos | LXC | 1 | OK |
| 223 | foundry-lxc | LXC | 1 | OK (fixed 2026-04-03) |
| 225 | gitea | LXC | 1 | OK |
| 227 | uptime-kuma | LXC | 1 | OK |
| 301 | claude-discord-coordinator | LXC | 1 | OK |
| 302 | claude-runner | LXC | 1 | OK |
| 303 | mcp-gateway | LXC | 0 | Intentional (on-demand) |
| 304 | ansible-controller | LXC | 1 | OK |
**If any production guest is missing `onboot: 1`:**
```bash
ssh proxmox "qm set <VMID> --onboot 1" # for VMs
ssh proxmox "pct set <CTID> --onboot 1" # for LXCs
```
## Shutdown Order (Dependency-Aware)
Reverse of the validated startup sequence. Stop consumers before their dependencies. Each tier polls per-guest status rather than using fixed waits.
```
Tier 4 — Media & Others (no downstream dependents)
VM 109 homeassistant
LXC 221 arr-stack
LXC 222 memos
LXC 223 foundry-lxc
LXC 302 claude-runner
Tier 3 — Applications (depend on databases + infra)
VM 115 docker-sba (Paper Dynasty, Major Domo)
VM 110 discord-bots
LXC 301 claude-discord-coordinator
Tier 2 — Infrastructure + DNS (depend on databases)
VM 106 docker-home (Pi-hole 1, NPM)
LXC 225 gitea
LXC 210 docker-n8n-lxc
LXC 227 uptime-kuma
VM 116 docker-home-servers
Tier 1 — Databases (no dependencies, shut down last)
VM 112 databases-bots (force-stop after 90s if ACPI ignored)
→ LXC 304 issues fire-and-forget reboot to Proxmox host, then is killed
```
**Known quirks:**
- VM 112 (databases-bots) may ignore ACPI shutdown — playbook force-stops after 90s
- VM 109 (homeassistant) is self-managed via HA Supervisor, excluded from Ansible inventory
- LXC 303 (mcp-gateway) has `onboot: 0` and is operator-managed — not included in shutdown/startup. If it was running before maintenance, bring it up manually afterward
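One shutdown tier might take this shape in `monthly-reboot.yml` (an illustrative sketch; the task names, inventory target, and retry values are assumptions, not the actual playbook):
```yaml
# Hypothetical sketch: poll each guest to 'stopped' instead of fixed sleeps
- name: Shut down Tier 3 application VMs
  hosts: proxmox
  tasks:
    - name: Request ACPI shutdown
      ansible.builtin.command: "qm shutdown {{ item }} --timeout 90"
      loop: [115, 110]

    - name: Poll until each VM reports stopped
      ansible.builtin.command: "qm status {{ item }}"
      register: vm_status
      until: "'status: stopped' in vm_status.stdout"
      retries: 18
      delay: 10
      changed_when: false
      loop: [115, 110]
```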
## Startup Order (Staggered)
After the Proxmox host reboots, LXC 304 auto-starts and the `ansible-post-reboot.service` waits 120s before running the controlled startup:
```
Tier 1 — Databases first
VM 112 databases-bots
→ wait 30s for DB to accept connections
Tier 2 — Infrastructure + DNS
VM 106 docker-home (Pi-hole 1, NPM)
LXC 225 gitea
LXC 210 docker-n8n-lxc
LXC 227 uptime-kuma
VM 116 docker-home-servers
→ wait 30s
Tier 3 — Applications
VM 115 docker-sba
VM 110 discord-bots
LXC 301 claude-discord-coordinator
→ wait 30s
Pi-hole fix — restart container via SSH to clear UDP DNS bug
ssh docker-home "docker restart pihole"
→ wait 10s
Tier 4 — Media & Others
VM 109 homeassistant
LXC 221 arr-stack
LXC 222 memos
LXC 223 foundry-lxc
LXC 302 claude-runner
```
## Post-Reboot Validation
- [ ] Pi-hole 1 DNS resolving: `ssh docker-home "docker exec pihole dig google.com @127.0.0.1"`
- [ ] Gitea accessible: `curl -sf https://git.manticorum.com/api/v1/version`
- [ ] n8n healthy: `ssh docker-n8n-lxc "docker ps --filter name=n8n --format '{{.Status}}'"`
- [ ] Discord bots responding (check Discord)
- [ ] Uptime Kuma dashboard green: `curl -sf http://10.10.0.227:3001/api/status-page/homelab`
- [ ] Home Assistant running: `curl -sf http://10.10.0.109:8123/api/ -H 'Authorization: Bearer <token>'`
- [ ] Maintenance snapshots cleaned up (auto, 7-day retention)
## Automation
### Ansible Playbooks
Both located at `/opt/ansible/playbooks/` on LXC 304.
```bash
# Dry run — shutdown only
ssh ansible "ansible-playbook /opt/ansible/playbooks/monthly-reboot.yml --check"
# Manual full execution — shutdown + reboot
ssh ansible "ansible-playbook /opt/ansible/playbooks/monthly-reboot.yml"
# Manual post-reboot startup (if automatic startup failed)
ssh ansible "ansible-playbook /opt/ansible/playbooks/post-reboot-startup.yml"
# Shutdown only — skip the host reboot
ssh ansible "ansible-playbook /opt/ansible/playbooks/monthly-reboot.yml --tags shutdown"
```
### Systemd Units (on LXC 304)
| Unit | Purpose | Schedule |
|------|---------|----------|
| `ansible-monthly-reboot.timer` | Triggers shutdown + reboot playbook | 1st Sunday of month, 08:00 UTC |
| `ansible-monthly-reboot.service` | Runs `monthly-reboot.yml` | Activated by timer |
| `ansible-post-reboot.service` | Runs `post-reboot-startup.yml` | On boot (multi-user.target), only if uptime < 10 min |
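The post-reboot unit's uptime guard and stabilization delay can be sketched as follows (a hypothetical unit file; the actual service on LXC 304 may differ):
```ini
# ansible-post-reboot.service (sketch). The first ExecStartPre aborts unless
# the container booted within the last 10 minutes, so a manual systemctl start
# on a long-running host is a no-op.
[Unit]
Description=Post-reboot guest startup orchestration
After=network-online.target
Wants=network-online.target

[Service]
Type=oneshot
ExecStartPre=/bin/sh -c 'up=$$(cut -d. -f1 /proc/uptime); [ "$$up" -lt 600 ]'
ExecStartPre=/bin/sleep 120
ExecStart=/usr/bin/ansible-playbook /opt/ansible/playbooks/post-reboot-startup.yml

[Install]
WantedBy=multi-user.target
```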
```bash
# Check timer status
ssh ansible "systemctl status ansible-monthly-reboot.timer"
# Next scheduled run
ssh ansible "systemctl list-timers ansible-monthly-reboot.timer"
# Check post-reboot service status
ssh ansible "systemctl status ansible-post-reboot.service"
# Disable for a month (e.g., during an incident)
ssh ansible "systemctl stop ansible-monthly-reboot.timer"
```
### Deployment (one-time setup on LXC 304)
```bash
# Copy playbooks
scp ansible/playbooks/monthly-reboot.yml ansible:/opt/ansible/playbooks/
scp ansible/playbooks/post-reboot-startup.yml ansible:/opt/ansible/playbooks/
# Copy and enable systemd units
scp ansible/systemd/ansible-monthly-reboot.timer ansible:/etc/systemd/system/
scp ansible/systemd/ansible-monthly-reboot.service ansible:/etc/systemd/system/
scp ansible/systemd/ansible-post-reboot.service ansible:/etc/systemd/system/
ssh ansible "sudo systemctl daemon-reload && \
sudo systemctl enable --now ansible-monthly-reboot.timer && \
sudo systemctl enable ansible-post-reboot.service"
# Verify SSH key access from LXC 304 to docker-home (needed for Pi-hole restart)
ssh ansible "ssh -o BatchMode=yes docker-home 'echo ok'"
```
## Rollback
If a guest fails to start after reboot:
1. Check Proxmox web UI or `pvesh get /nodes/proxmox/qemu/<VMID>/status/current`
2. Review guest logs: `ssh proxmox "journalctl -u pve-guests -n 50"`
3. Manual start: `ssh proxmox "pvesh create /nodes/proxmox/qemu/<VMID>/status/start"`
4. If guest is corrupted, restore from the pre-reboot Proxmox snapshot
5. If post-reboot startup failed entirely, run manually: `ssh ansible "ansible-playbook /opt/ansible/playbooks/post-reboot-startup.yml"`
## Related Documentation
- [Ansible Controller Setup](../../vm-management/ansible-controller-setup.md) — LXC 304 details and inventory
- [Proxmox 7→9 Upgrade Plan](../../vm-management/proxmox-upgrades/proxmox-7-to-9-upgrade-plan.md) — original startup order and Phase 1 lessons
- [VM Decommission Runbook](../../vm-management/vm-decommission-runbook.md) — removing VMs from the rotation


@ -1,15 +0,0 @@
agent: 1
boot: order=scsi0;net0
cores: 8
memory: 16384
meta: creation-qemu=6.1.0,ctime=1646688596
name: docker-vpn
net0: virtio=76:36:85:A7:6A:A3,bridge=vmbr0,firewall=1
numa: 0
onboot: 1
ostype: l26
scsi0: local-lvm:vm-105-disk-0,size=256G
scsihw: virtio-scsi-pci
smbios1: uuid=55061264-b9b1-4ce4-8d44-9c187affcb1d
sockets: 1
vmgenid: 30878bdf-66f9-41bf-be34-c31b400340f9


@ -1,7 +1,7 @@
agent: 1
boot: order=scsi0;net0
cores: 4
memory: 16384
memory: 6144
meta: creation-qemu=6.1.0,ctime=1646083628
name: docker-home
net0: virtio=BA:65:DF:88:85:4C,bridge=vmbr0,firewall=1
@ -11,5 +11,5 @@ ostype: l26
scsi0: local-lvm:vm-106-disk-0,size=256G
scsihw: virtio-scsi-pci
smbios1: uuid=54ef12fc-edcc-4744-a109-dd2de9a6dc03
sockets: 2
sockets: 1
vmgenid: a13c92a2-a955-485e-a80e-391e99b19fbd


@ -12,5 +12,5 @@ ostype: l26
scsi0: local-lvm:vm-115-disk-0,size=256G
scsihw: virtio-scsi-pci
smbios1: uuid=19be98ee-f60d-473d-acd2-9164717fcd11
sockets: 2
sockets: 1
vmgenid: 682dfeab-8c63-4f0b-8ed2-8828c2f808ef


@ -0,0 +1,141 @@
---
title: "VM 106 (docker-home) Right-Sizing Runbook"
description: "Runbook for right-sizing VM 106 from 16 GB/8 vCPU to 6 GB/4 vCPU — pre-checks, resize commands, and post-resize validation."
type: runbook
domain: server-configs
tags: [proxmox, infra-audit, right-sizing, docker-home]
---
# VM 106 (docker-home) Right-Sizing Runbook
## Context
Infrastructure audit (2026-04-02) found VM 106 severely overprovisioned:
| Resource | Allocated | Actual Usage | Target |
|----------|-----------|--------------|--------|
| RAM | 16 GB | 1.1-1.5 GB | 6 GB (4× headroom) |
| vCPUs | 8 (2 sockets × 4 cores) | load 0.12/core | 4 (1 socket × 4 cores) |
**Services**: Pi-hole, Nginx Proxy Manager, Portainer
## Pre-Check Results (2026-04-03)
Automated checks were run before resizing. **All clear.**
### Container memory limits
```bash
docker inspect pihole nginx-proxy-manager_app_1 portainer \
| python3 -c "import json,sys; c=json.load(sys.stdin); \
[print(x['Name'], 'MemoryLimit:', x['HostConfig']['Memory']) for x in c]"
```
Result:
```
/pihole MemoryLimit: 0
/nginx-proxy-manager_app_1 MemoryLimit: 0
/portainer MemoryLimit: 0
```
`0` = no limit — no containers will OOM at 6 GB.
### Docker Compose memory reservations
```bash
grep -rn 'memory\|mem_limit\|memswap' /home/cal/container-data/*/docker-compose.yml
```
Result: **no matches** — no compose-level memory reservations.
### Live memory usage at audit time
```
total: 15 GiB used: 1.1 GiB free: 6.8 GiB buff/cache: 7.7 GiB
Pi-hole: 463 MiB
NPM: 367 MiB
Portainer: 12 MiB
Total containers: ~842 MiB
```
## Resize Procedure
Brief downtime: Pi-hole and NPM will be unavailable during shutdown.
Manticore runs Pi-hole 2 (10.10.0.226) for HA DNS — clients fail over automatically.
### Step 1 — Shut down the VM
```bash
ssh proxmox "qm shutdown 106 --timeout 60"
# Wait for shutdown
ssh proxmox "qm status 106" # Should show: status: stopped
```
### Step 2 — Apply new hardware config
```bash
# Reduce RAM: 16384 MB → 6144 MB
ssh proxmox "qm set 106 --memory 6144"
# Reduce vCPUs: 2 sockets × 4 cores → 1 socket × 4 cores (8 → 4 vCPUs)
ssh proxmox "qm set 106 --sockets 1 --cores 4"
# Verify
ssh proxmox "qm config 106 | grep -E 'memory|cores|sockets'"
```
Expected output:
```
cores: 4
memory: 6144
sockets: 1
```
### Step 3 — Start the VM
```bash
ssh proxmox "qm start 106"
```
Wait ~30 seconds for Docker to come up.
### Step 4 — Verify services
```bash
# Pi-hole DNS resolution
ssh pihole "docker exec pihole dig google.com @127.0.0.1 | grep -E 'SERVER|ANSWER'"
# NPM — check it's running
ssh pihole "docker ps --filter name=nginx-proxy-manager --format '{{.Status}}'"
# Portainer
ssh pihole "docker ps --filter name=portainer --format '{{.Status}}'"
# Memory usage post-resize
ssh pihole "free -h"
```
### Step 5 — Monitor for 24h
Check memory doesn't approach the 6 GB limit:
```bash
ssh pihole "free -h && docker stats --no-stream --format 'table {{.Name}}\t{{.MemUsage}}'"
```
Alert threshold: if `used` exceeds 4.5 GB (75% of 6 GB), consider increasing to 8 GB.
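The 75% check can be scripted for cron or ad-hoc runs — a minimal sketch, run on the VM itself (the threshold mirrors the 4.5 GB / 6 GB rule of thumb above):

```shell
#!/bin/sh
# Warn when used memory crosses 75% of total.
used_kb=$(free | awk '/^Mem:/ {print $3}')
total_kb=$(free | awk '/^Mem:/ {print $2}')
pct=$(( used_kb * 100 / total_kb ))
if [ "$pct" -ge 75 ]; then
    echo "WARN: memory at ${pct}% of total, consider resizing to 8 GB"
else
    echo "OK: memory at ${pct}% of total"
fi
```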
## Rollback
If services fail to come up after resizing:
```bash
# Restore original allocation
ssh proxmox "qm set 106 --memory 16384 --sockets 2 --cores 4"
ssh proxmox "qm start 106"
```
## Related
- [Maintenance Reboot Runbook](maintenance-reboot.md) — VM 106 is Tier 2 (shut down after apps, before databases)
- Issue: cal/claude-home#19


@ -28,8 +28,8 @@ tags: [proxmox, upgrade, pve, backup, rollback, infrastructure]
**Production Services** (7 LXC + 7 VMs) — cleaned up 2026-02-19:
- **Critical**: Paper Dynasty/Major Domo (VM 115), Discord bots (VM 110), Gitea (LXC 225), n8n (LXC 210), Home Assistant (VM 109), Databases (VM 112), docker-home/Pi-hole 1 (VM 106)
- **Important**: Claude Discord Coordinator (LXC 301), arr-stack (LXC 221), Uptime Kuma (LXC 227), Foundry VTT (LXC 223), Memos (LXC 222)
- **Stopped/Investigate**: docker-vpn (VM 105, decommissioning), docker-home-servers (VM 116, needs investigation)
- **Removed (2026-02-19)**: 108 (ansible), 224 (openclaw), 300 (openclaw-migrated), 101/102/104/111/211 (game servers), 107 (plex), 113 (tdarr - moved to .226), 114 (duplicate arr-stack), 117 (unused), 100/103 (old templates)
- **Decommission Candidate**: docker-home-servers (VM 116) — Jellyfin-only after 2026-04-03 cleanup; watchstate removed (duplicate of manticore); see issue #31
- **Removed (2026-02-19)**: 108 (ansible), 224 (openclaw), 300 (openclaw-migrated), 101/102/104/111/211 (game servers), 107 (plex), 113 (tdarr - moved to .226), 114 (duplicate arr-stack), 117 (unused), 100/103 (old templates), 105 (docker-vpn - decommissioned 2026-04)
**Key Constraints**:
- Home Assistant VM 109 requires dual network (vmbr1 for Matter support)


@ -67,10 +67,15 @@ runcmd:
# Add cal user to docker group (will take effect after next login)
- usermod -aG docker cal
# Test Docker installation
- docker run --rm hello-world
# Mask avahi-daemon — not needed in a static-IP homelab with Pi-hole DNS,
# and has a known kernel busy-loop bug that wastes CPU
- systemctl stop avahi-daemon || true
- systemctl mask avahi-daemon
# Write configuration files
write_files:
# SSH hardening configuration


@ -0,0 +1,163 @@
---
title: "VM Decommission Runbook"
description: "Step-by-step procedure for safely decommissioning a Proxmox VM — dependency checks, destruction, and repo cleanup."
type: runbook
domain: vm-management
tags: [proxmox, decommission, infrastructure, cleanup]
---
# VM Decommission Runbook
Procedure for safely removing a stopped Proxmox VM and reclaiming its disk space. Derived from the VM 105 (docker-vpn) decommission (2026-04-02, issue #20).
## Prerequisites
- VM must already be **stopped** on Proxmox
- Services previously running on the VM must be confirmed migrated or no longer needed
- SSH access to Proxmox host (`ssh proxmox`)
## Phase 1 — Dependency Verification
Run all checks before destroying anything. A clean result on all five means it is safe to proceed.
### 1.1 Pi-hole DNS
Check both primary and secondary Pi-hole for DNS records pointing to the VM's IP:
```bash
ssh pihole "grep '<VM_IP>' /etc/pihole/custom.list || echo 'No DNS entries'"
ssh pihole "pihole -q <VM_HOSTNAME>"
```
### 1.2 Nginx Proxy Manager (NPM)
Check NPM for any proxy hosts with the VM's IP as an upstream:
- NPM UI: https://npm.manticorum.com → Proxy Hosts → search for VM IP
- Or via API: `ssh npm-pihole "curl -s http://localhost:81/api/nginx/proxy-hosts" | grep <VM_IP>`
### 1.3 Proxmox Firewall Rules
```bash
ssh proxmox "cat /etc/pve/firewall/<VMID>.fw 2>/dev/null || echo 'No firewall rules'"
```
### 1.4 Backup Existence
```bash
ssh proxmox "ls -la /var/lib/vz/dump/ | grep <VMID>"
```
### 1.5 VPN / Tunnel References
Check if any WireGuard or VPN configs on other hosts reference this VM:
```bash
ssh proxmox "grep -r '<VM_IP>' /etc/wireguard/ 2>/dev/null || echo 'No WireGuard refs'"
```
Also check SSH config and any automation scripts in the claude-home repo:
```bash
grep -r '<VM_IP>\|<VM_HOSTNAME>' ~/Development/claude-home/
```
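The five checks can be wrapped in a small helper that prints the exact commands for a given VM, so nothing gets skipped. A sketch — `vm_checks` is a hypothetical helper, not an existing script in the repo, and the VMID/IP/hostname below are placeholders:

```shell
# Print the Phase 1 dependency checks for one VM; paste the output or pipe to sh.
vm_checks() {
    vmid=$1; ip=$2; host=$3
    cat <<EOF
ssh pihole "grep '$ip' /etc/pihole/custom.list || echo 'No DNS entries'"
ssh pihole "pihole -q $host"
ssh proxmox "cat /etc/pve/firewall/$vmid.fw 2>/dev/null || echo 'No firewall rules'"
ssh proxmox "ls -la /var/lib/vz/dump/ | grep $vmid"
ssh proxmox "grep -r '$ip' /etc/wireguard/ 2>/dev/null || echo 'No WireGuard refs'"
EOF
}

# Example usage (placeholder values, substitute your own)
vm_checks 105 10.10.0.105 docker-vpn
```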
## Phase 2 — Safety Measures
### 2.1 Disable Auto-Start
Prevent the VM from starting on Proxmox reboot while you work:
```bash
ssh proxmox "qm set <VMID> --onboot 0"
```
### 2.2 Record Disk Space (Before)
```bash
ssh proxmox "lvs | grep pve"
```
Save this output for comparison after destruction.
### 2.3 Optional: Take a Final Backup
If the VM might contain anything worth preserving:
```bash
ssh proxmox "vzdump <VMID> --mode snapshot --storage home-truenas --compress zstd"
```
Skip if the VM has been stopped for a long time and all services are confirmed migrated.
## Phase 3 — Destroy
```bash
ssh proxmox "qm destroy <VMID> --purge"
```
The `--purge` flag removes the disk along with the VM config. Verify:
```bash
ssh proxmox "qm list | grep <VMID>" # Should return nothing
ssh proxmox "lvs | grep vm-<VMID>-disk" # Should return nothing
ssh proxmox "lvs | grep pve" # Compare with Phase 2.2
```
## Phase 4 — Repo Cleanup
Update these files in the `claude-home` repo:
| File | Action |
|------|--------|
| `~/.ssh/config` | Comment out Host block, add `# DECOMMISSIONED: <name> (<IP>) - <reason>` |
| `server-configs/proxmox/qemu/<VMID>.conf` | Delete the file |
| Migration results (if applicable) | Check off decommission tasks |
| `vm-management/proxmox-upgrades/proxmox-7-to-9-upgrade-plan.md` | Move from Stopped/Investigate to Decommissioned |
| `networking/examples/ssh-homelab-setup.md` | Comment out or remove entry |
| `networking/examples/server_inventory.yaml` | Comment out or remove entry |
Leave historical/planning docs (migration plans, wave results) as-is — they serve as historical records.
## Phase 5 — Commit and PR
Branch naming: `chore/<ISSUE_NUMBER>-decommission-<vm-name>`
Commit message format:
```
chore: decommission VM <VMID> (<name>) — reclaim <SIZE> disk (#<ISSUE>)
Closes #<ISSUE>
```
This is typically a docs-only PR (all `.md` and config files), which is auto-approved by the `auto-merge-docs` workflow.
## Checklist Template
Copy this for each decommission:
```markdown
### VM <VMID> (<name>) Decommission
**Pre-deletion verification:**
- [ ] Pi-hole DNS — no records
- [ ] NPM upstreams — no proxy hosts
- [ ] Proxmox firewall — no rules
- [ ] Backup status — verified
- [ ] VPN/tunnel references — none
**Execution:**
- [ ] Disabled onboot
- [ ] Recorded disk space before
- [ ] Took backup (or confirmed skip)
- [ ] Destroyed VM with --purge
- [ ] Verified disk space reclaimed
**Cleanup:**
- [ ] SSH config updated
- [ ] VM config file deleted from repo
- [ ] Migration docs updated
- [ ] Upgrade plan updated
- [ ] Example files updated
- [ ] Committed, pushed, PR created
```


@ -262,7 +262,7 @@ When connecting Jellyseerr to arr apps, be careful with tag configurations - inv
- [x] Test movie/show requests through Jellyseerr
### After 48 Hours
- [ ] Decommission VM 121 (docker-vpn)
- [x] Decommission VM 121 (docker-vpn)
- [ ] Clean up local migration temp files (`/tmp/arr-config-migration/`)
---


@ -152,11 +152,13 @@ Both accounts can run simultaneously in separate terminal windows.
## Current Configuration on This Workstation
**Status: DISABLED** (as of 2026-04-06). The `.envrc` file is still in place but direnv has been denied (`direnv deny ~/work`). To re-enable: `direnv allow ~/work`.
| Location | Account | Purpose |
|----------|---------|---------|
| `~/.claude` | Primary (cal.corum@gmail.com) | All projects except ~/work |
| `~/.claude-ac` | Alternate | ~/work projects |
| `~/work/.envrc` | — | direnv trigger for CLAUDE_CONFIG_DIR |
| `~/work/.envrc` | — | direnv trigger for CLAUDE_CONFIG_DIR (currently denied) |
## How It All Fits Together


@ -0,0 +1,67 @@
---
title: "llama.cpp Installation and Setup"
description: "llama.cpp b8680 Vulkan build installation on workstation with RTX 4080 Super, including model download workflow."
type: reference
domain: workstation
tags: [llama-cpp, vulkan, nvidia, gguf, local-inference]
---
## Installation
Installed from the pre-built release binary (no CUDA build is published for Linux; Vulkan is the correct choice for NVIDIA GPUs):
```bash
# Extract to /opt
sudo mkdir -p /opt/llama.cpp
sudo tar -xzf llama-b8680-bin-ubuntu-vulkan-x64.tar.gz -C /opt/llama.cpp --strip-components=1
# Symlink all binaries to PATH
for bin in /opt/llama.cpp/llama-*; do
  sudo ln -sf "$bin" /usr/local/bin/"$(basename "$bin")"
done
```
**Version**: b8680
**Backends loaded**: Vulkan (GPU), CPU (Zen4 for 7800X3D), RPC
**Source**: https://github.com/ggml-org/llama.cpp/releases
## Release Binary Options (Linux x64)
| Build | Use case |
|-------|----------|
| `ubuntu-x64` | CPU only |
| `ubuntu-vulkan-x64` | NVIDIA/AMD GPU via Vulkan |
| `ubuntu-rocm-x64` | AMD GPU via ROCm |
| `ubuntu-openvino-x64` | Intel CPU/GPU/NPU |
No pre-built CUDA binary exists — Vulkan is the NVIDIA option. For native CUDA, build from source with `-DGGML_CUDA=ON`.
## Models
Stored in `/home/cal/Models/`.
| Model | File | Size |
|-------|------|------|
| Qwen3.5-9B Q4_K_M | `Qwen3.5-9B-Q4_K_M.gguf` | 5.3 GB |
## Downloading Models
The built-in `-hf` downloader can stall. Use `curl` with resume support instead:
```bash
curl -L -C - --progress-bar \
-o /home/cal/Models/<model>.gguf \
"https://huggingface.co/<org>/<repo>/resolve/main/<model>.gguf"
```
`-C -` enables resume if the download is interrupted.
## Running
```bash
# Full GPU offload
llama-cli -m /home/cal/Models/Qwen3.5-9B-Q4_K_M.gguf -ngl 99
# Server mode
llama-server -m /home/cal/Models/Qwen3.5-9B-Q4_K_M.gguf -ngl 99 --port 8080
```
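Once `llama-server` is up, it exposes an OpenAI-compatible HTTP API. A quick smoke test, assuming the server command above is already running on port 8080:

```shell
# POST a minimal chat request to llama-server's OpenAI-compatible endpoint.
payload='{"messages":[{"role":"user","content":"Say hello in one word."}],"max_tokens":16}'
curl -s http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d "$payload" || echo "server not reachable"
```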


@ -0,0 +1,33 @@
---
title: "Workstation Troubleshooting"
description: "Troubleshooting notes for Nobara/KDE Wayland workstation issues."
type: troubleshooting
domain: workstation
tags: [troubleshooting, wayland, kde]
---
# Workstation Troubleshooting
## Discord screen sharing shows no windows on KDE Wayland (2026-04-03)
**Severity:** Medium — cannot share screen via Discord desktop app
**Problem:** Clicking "Share Your Screen" in Discord desktop app (v0.0.131, Electron 37) opens the Discord picker but shows zero windows/screens. Same behavior in both the desktop app and the web app when using Discord's own picker. Affects both native Wayland and XWayland modes.
**Root Cause:** Discord's built-in screen picker uses Electron's `desktopCapturer.getSources()` which relies on X11 window enumeration. On KDE Wayland:
- In native Wayland mode: no X11 windows exist, so the picker is empty
- In forced X11/XWayland mode (`ELECTRON_OZONE_PLATFORM_HINT=x11`): Discord can only see other XWayland windows (itself, Android emulator), not native Wayland apps
- Discord ignores `--use-fake-ui-for-media-stream` and other Chromium flags that should force portal usage
- The `discord-flags.conf` file is **not read** by the Nobara/RPM Discord package — flags must go in the `.desktop` file `Exec=` line
**Fix:** Use the **Discord web app in Firefox** for screen sharing. Firefox natively delegates to the XDG Desktop Portal via PipeWire, which shows the KDE screen picker with all windows. The desktop app's own picker remains broken on Wayland as of v0.0.131.
Configuration applied (for general Discord Wayland support):
- `~/.local/share/applications/discord.desktop` — overrides system `.desktop` with Wayland flags
- `~/.config/discord-flags.conf` — created but not read by this Discord build
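For reference, the override `.desktop` puts the Wayland flags on the `Exec=` line. A sketch of the relevant entries — the flags shown are the standard Electron/Chromium Ozone switches and the binary path is illustrative; the exact set applied on this machine may differ:

```ini
# ~/.local/share/applications/discord.desktop (abridged)
[Desktop Entry]
Name=Discord
Type=Application
Exec=/usr/bin/Discord --ozone-platform-hint=auto --enable-features=WaylandWindowDecorations
```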
**Lesson:**
- Discord desktop on Linux Wayland cannot do screen sharing through its own picker — always use the web app in Firefox for this
- Electron's `desktopCapturer` API is fundamentally X11-only; the PipeWire/portal path requires the app to use `getDisplayMedia()` instead, which Discord's desktop app does not do
- `discord-flags.conf` is unreliable across distros — always verify flags landed in `/proc/<pid>/cmdline`
- Vesktop (community client) is an alternative that properly implements portal-based screen sharing, if the web app is insufficient