Compare commits

...

38 Commits

Author SHA1 Message Date
Cal Corum
1a83c863cb docs: sync KB — autonomous-nightly-2026-04-10-run2.md
All checks were successful
Reindex Knowledge Base / reindex (push) Successful in 2s
2026-04-10 12:00:47 -05:00
Cal Corum
1ab0bf27ed docs: sync KB — pr-reviewer-ai-reviewing-label-stuck.md
All checks were successful
Reindex Knowledge Base / reindex (push) Successful in 3s
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 10:38:45 -05:00
Cal Corum
87aeaf3309 docs: sync KB — autonomous-nightly-2026-04-10.md,autonomous-pipeline-session-2026-04-10.md 2026-04-10 10:38:45 -05:00
Cal Corum
8d165efbe6 docs: sync KB — 2026-04-08-home-network-review.md,2026-04-08-home-network-review-design.md 2026-04-10 10:38:45 -05:00
Cal Corum
a307e4dcb7 docs: sync KB — database-release-2026.4.7.md,release-2026.4.7.md 2026-04-10 10:38:45 -05:00
Cal Corum
d3b9e43016 docs: sync KB — backlog-triage-sandbox-fix.md 2026-04-10 10:38:45 -05:00
Cal Corum
92c5ce0ebb docs: sync KB — apcupsd-ups-monitoring.md,llama-cpp-setup.md 2026-04-10 10:38:45 -05:00
Cal Corum
ffb036042c docs: sync KB — claude-code-multi-account.md 2026-04-10 10:38:45 -05:00
cal
d34bc01305 Merge pull request 'feat: right-size VM 115 (docker-sba) 16→8 vCPUs' (#44) from enhancement/18-rightsize-vm115-vcpus into main
Reviewed-on: #44
Reviewed-by: Claude <cal.corum+openclaw@gmail.com>
2026-04-06 15:41:34 +00:00
Cal Corum
01e6302709 feat: right-size VM 115 config and add --hosts flag to audit script
All checks were successful
Auto-merge docs-only PRs / auto-merge-docs (pull_request) Successful in 1s
Reduce VM 115 (docker-sba) from 16 vCPUs (2×8) to 8 vCPUs (1×8) to
match actual workload (0.06 load/core). Add --hosts flag to
homelab-audit.sh for targeted post-change audits.

Closes #18

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-06 15:41:16 +00:00
cal
024aea82c4 Merge pull request 'feat: add monthly Docker prune cron Ansible playbook (#29)' (#45) from issue/29-docker-image-prune-cron-on-all-docker-hosts into main
Reviewed-on: #45
2026-04-06 15:41:04 +00:00
Cal Corum
d4ee899c1d feat: add monthly Docker prune cron Ansible playbook (#29)
All checks were successful
Auto-merge docs-only PRs / auto-merge-docs (pull_request) Successful in 1s
Closes #29

Deploys /etc/cron.monthly/docker-prune to all six Docker hosts via
Ansible. Uses a 720h (30-day) age filter on containers and images;
`keep`-labeled volumes are exempt from volume pruning.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 15:40:33 +00:00
cal
d7987a90ff Merge pull request 'docs: right-size VM 106 (docker-home) — 16 GB/8 vCPU → 6 GB/4 vCPU (#19)' (#47) from issue/19-right-size-vm-106-docker-home-16-gb-6-8-gb-ram into main
All checks were successful
Reindex Knowledge Base / reindex (push) Successful in 2s
Reviewed-on: #47
2026-04-06 15:40:20 +00:00
cal
5b23d92435 Merge branch 'main' into issue/19-right-size-vm-106-docker-home-16-gb-6-8-gb-ram
All checks were successful
Auto-merge docs-only PRs / auto-merge-docs (pull_request) Successful in 3s
2026-04-06 15:40:07 +00:00
cal
29238f3ddf Merge pull request 'feat: weekly Proxmox backup verification → Discord (#27)' (#48) from issue/27-set-up-weekly-proxmox-backup-verification-discord into main
All checks were successful
Reindex Knowledge Base / reindex (push) Successful in 3s
Reviewed-on: #48
2026-04-06 15:39:53 +00:00
Cal Corum
dd7c68c13a docs: sync KB — discord-browser-testing-workflow.md 2026-04-06 02:00:38 -05:00
Cal Corum
acb8fef084 docs: sync KB — database-deployment-guide.md,refractor-in-app-test-plan.md 2026-04-06 00:00:03 -05:00
Cal Corum
cacf4a9043 feat: add weekly Gitea disk cleanup Ansible playbook
Gitea LXC 225 hit 100% disk from accumulated Docker buildx volumes,
repo-archive cache, and journal logs. Adds automated weekly cleanup
managed by systemd timer on the Ansible controller (Wed 04:00 UTC).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-05 19:24:59 -05:00
Cal Corum
95bae33309 feat: add weekly Proxmox backup verification and CT 302 self-health check (#27)
All checks were successful
Auto-merge docs-only PRs / auto-merge-docs (pull_request) Successful in 2s
Closes #27

- proxmox-backup-check.sh: SSHes to Proxmox, queries pvesh task history,
  classifies each running VM/CT as green/yellow/red by backup recency,
  posts a Discord embed summary. Designed for weekly cron on CT 302.

- ct302-self-health.sh: Checks disk usage on CT 302 itself, silently
  exits when healthy, posts a Discord alert when any filesystem exceeds
  80% threshold. Closes the blind spot where the monitoring system
  cannot monitor itself externally.

- Updated monitoring/scripts/CONTEXT.md with full operational docs,
  install instructions, and cron schedules for both new scripts.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 06:07:57 -05:00
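proxmox-backup-check.sh itself is not part of this diff; as a sketch of the green/yellow/red recency classification the commit describes, with hypothetical thresholds (the script's actual cutoffs are not shown in this log):

```python
from datetime import datetime, timedelta, timezone

# Hypothetical thresholds; the real script's cutoffs are not shown here.
GREEN_MAX = timedelta(hours=24)   # backed up within the last day
YELLOW_MAX = timedelta(hours=48)  # stale but not yet alarming

def classify_backup(last_backup: datetime, now: datetime) -> str:
    """Classify a guest by how recently its last successful backup ran."""
    age = now - last_backup
    if age <= GREEN_MAX:
        return "green"
    if age <= YELLOW_MAX:
        return "yellow"
    return "red"

now = datetime(2026, 4, 4, 6, 0, tzinfo=timezone.utc)
print(classify_backup(now - timedelta(hours=3), now))   # → green
print(classify_backup(now - timedelta(hours=30), now))  # → yellow
print(classify_backup(now - timedelta(days=5), now))    # → red
```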
Cal Corum
29a20fbe06 feat: add monthly Proxmox maintenance reboot automation (#26)
All checks were successful
Reindex Knowledge Base / reindex (push) Successful in 2s
Establishes a first-Sunday-of-the-month maintenance window orchestrated
by Ansible on LXC 304. Split into two playbooks to handle the self-reboot
paradox (the controller is a guest on the host being rebooted):

- monthly-reboot.yml: snapshots, tiered shutdown with per-guest polling,
  fire-and-forget host reboot
- post-reboot-startup.yml: controlled tiered startup with staggered delays,
  Pi-hole UDP DNS fix, validation, and snapshot cleanup

Also fixes onboot:1 on VM 109, LXC 221, LXC 223 and creates a recurring
Google Calendar event for the maintenance window.

Closes #26

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 23:33:59 -05:00
Cal Corum
9b47f0c027 docs: right-size VM 106 (docker-home) — 16 GB/8 vCPU → 6 GB/4 vCPU (#19)
All checks were successful
Auto-merge docs-only PRs / auto-merge-docs (pull_request) Successful in 5s
Pre-checks confirmed safe to right-size: no container --memory limits,
no Docker Compose memory reservations. Live usage 1.1 GB / 15 GB (7%).

- Update 106.conf: memory 16384 → 6144, sockets 2 → 1 (8 → 4 vCPUs)
- Add right-sizing-vm-106.md runbook with pre-check results and resize commands

Closes #19

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 23:05:43 -05:00
cal
fdc44acb28 Merge pull request 'chore: add --hosts test coverage and right-size VM 115 socket config' (#46) from chore/26-proxmox-monthly-maintenance-reboot into main 2026-04-04 00:35:31 +00:00
Cal Corum
48a804dda2 feat: right-size VM 115 config and add --hosts flag to audit script
All checks were successful
Auto-merge docs-only PRs / auto-merge-docs (pull_request) Successful in 2s
Reduce VM 115 (docker-sba) from 16 vCPUs (2×8) to 8 vCPUs (1×8) to
match actual workload (0.06 load/core). Add --hosts flag to
homelab-audit.sh for targeted post-change audits.

Closes #18

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 17:33:01 -05:00
Cal Corum
7a0c264f27 feat: add monthly Proxmox maintenance reboot automation (#26)
Establishes a first-Sunday-of-the-month maintenance window orchestrated
by Ansible on LXC 304. Split into two playbooks to handle the self-reboot
paradox (the controller is a guest on the host being rebooted):

- monthly-reboot.yml: snapshots, tiered shutdown with per-guest polling,
  fire-and-forget host reboot
- post-reboot-startup.yml: controlled tiered startup with staggered delays,
  Pi-hole UDP DNS fix, validation, and snapshot cleanup

Also fixes onboot:1 on VM 109, LXC 221, LXC 223 and creates a recurring
Google Calendar event for the maintenance window.

Closes #26

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 16:17:55 -05:00
Cal Corum
64f299aa1a docs: sync KB — maintenance-reboot.md
All checks were successful
Reindex Knowledge Base / reindex (push) Successful in 2s
2026-04-03 16:00:22 -05:00
cal
a9a778f53c Merge pull request 'feat: dynamic summary, --hosts filter, and --json output (#24)' (#38) from issue/24-homelab-audit-sh-dynamic-summary-and-hosts-filter into main 2026-04-03 20:22:24 +00:00
Cal Corum
1a3785f01a feat: dynamic summary, --hosts filter, and --json output (#24)
All checks were successful
Auto-merge docs-only PRs / auto-merge-docs (pull_request) Successful in 2s
Closes #24

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 20:08:07 +00:00
cal
938240e1f9 Merge pull request 'fix: clean up VM 116 watchstate duplicate and document decommission candidacy (#31)' (#41) from issue/31-vm-116-resolve-watchstate-duplicate-and-clean-up-r into main
All checks were successful
Reindex Knowledge Base / reindex (push) Successful in 1s
Reviewed-on: #41
2026-04-03 20:01:27 +00:00
Cal Corum
66143f6090 fix: clean up VM 116 watchstate duplicate and document decommission candidacy (#31)
All checks were successful
Auto-merge docs-only PRs / auto-merge-docs (pull_request) Successful in 2s
- Removed stopped watchstate container from VM 116 (duplicate of manticore's canonical instance)
- Pruned 5 orphan images (watchstate, freetube, pihole, hello-world): 3.36 GB reclaimed
- Confirmed manticore watchstate is healthy and syncing Jellyfin state
- VM 116 now runs only Jellyfin (also runs on manticore)
- Added VM 116 (docker-home-servers) to hosts.yml as decommission candidate
- Updated proxmox-7-to-9-upgrade-plan.md status from Stopped/Investigate to Decommission Candidate

Closes #31

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 20:01:13 +00:00
cal
13483157a9 Merge pull request 'feat: session resumption + Agent SDK evaluation' (#43) from feature/3-agent-sdk-improvements into main
All checks were successful
Reindex Knowledge Base / reindex (push) Successful in 3s
Reviewed-on: #43
2026-04-03 20:00:12 +00:00
Cal Corum
e321e7bd47 feat: add session resumption and Agent SDK evaluation
All checks were successful
Auto-merge docs-only PRs / auto-merge-docs (pull_request) Successful in 2s
- runner.sh: opt-in session persistence via session_resumable and
  resume_last_session settings; fix read_setting to normalize booleans
- issue-poller.sh: capture and log session_id from worker invocations,
  include in result JSON
- pr-reviewer-dispatcher.sh: capture and log session_id from reviews
- n8n workflow: add --append-system-prompt to initial SSH node, add
  Follow Up Diagnostics node using --resume for deeper investigation,
  update Discord Alert with remediation details
- Add Agent SDK evaluation doc (CLI vs Python/TS SDK comparison)
- Update CONTEXT.md with session resumption documentation

Closes #3

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 19:59:44 +00:00
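The read_setting boolean fix mentioned above is not included in this diff; runner.sh is Bash, but the normalization it describes amounts to mapping the usual truthy spellings onto one canonical value, sketched here in Python (function name and accepted spellings are assumptions):

```python
def normalize_bool(value, default=False):
    """Map common boolean spellings (true/yes/1/on, case-insensitive) to bool."""
    if value is None:
        return default
    text = str(value).strip().lower()
    if text in {"true", "yes", "1", "on"}:
        return True
    if text in {"false", "no", "0", "off", ""}:
        return False
    return default  # unrecognized spelling: fall back rather than guess

print(normalize_bool("True"))  # → True
print(normalize_bool("no"))    # → False
```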
cal
4e33e1cae3 Merge pull request 'fix: document per-core load threshold policy for health monitoring (#22)' (#42) from issue/22-tune-n8n-alert-thresholds-to-per-core-load-metrics into main
All checks were successful
Reindex Knowledge Base / reindex (push) Successful in 2s
2026-04-03 18:36:14 +00:00
Cal Corum
193ae68f96 docs: document per-core load threshold policy for server health monitoring (#22)
All checks were successful
Auto-merge docs-only PRs / auto-merge-docs (pull_request) Successful in 5s
Closes #22

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 13:35:23 -05:00
Cal Corum
7c9c96eb52 docs: sync KB — troubleshooting.md
All checks were successful
Reindex Knowledge Base / reindex (push) Successful in 3s
2026-04-03 12:00:22 -05:00
cal
a8c85a8d91 Merge pull request 'chore: decommission VM 105 (docker-vpn) — repo cleanup' (#40) from chore/20-decommission-vm-105-docker-vpn into main
Some checks failed
Reindex Knowledge Base / reindex (push) Failing after 17s
2026-04-03 12:56:43 +00:00
Cal Corum
9e8346a8ab chore: decommission VM 105 (docker-vpn) — repo cleanup (#20)
All checks were successful
Auto-merge docs-only PRs / auto-merge-docs (pull_request) Successful in 2s
VM 105 was already destroyed on Proxmox. This removes stale references:
- Delete server-configs/proxmox/qemu/105.conf
- Comment out docker-vpn entries in example SSH config and server inventory
- Move VM 105 from Stopped/Investigate to Removed in upgrade plan
- Check off decommission task in wave2 migration results

Closes #20

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 23:57:55 -05:00
Cal Corum
4234351cfa feat: add Ansible playbook to mask avahi-daemon on all Ubuntu VMs (#28)
All checks were successful
Auto-merge docs-only PRs / auto-merge-docs (pull_request) Successful in 2s
Closes #28

Adds mask-avahi.yml targeting the vms:physical inventory groups (all
Ubuntu QEMU VMs + ubuntu-manticore). Also adds avahi masking to the
cloud-init template so future VMs are hardened from first boot.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-02 23:32:47 -05:00
Cal Corum
a97f443f60 docs: sync KB — vm-decommission-runbook.md
All checks were successful
Reindex Knowledge Base / reindex (push) Successful in 3s
2026-04-02 22:00:04 -05:00
45 changed files with 4910 additions and 56 deletions

docker-prune.yml

@@ -0,0 +1,55 @@
---
# Monthly Docker Prune — Deploy Cleanup Cron to All Docker Hosts
#
# Deploys /etc/cron.monthly/docker-prune to each VM running Docker.
# The script prunes stopped containers and unused images older than 30 days
# (720h), plus orphaned volumes. Volumes labeled `keep` are exempt.
#
# Resolves accumulated disk waste from stopped containers and stale images.
# The `--filter "until=720h"` age gate prevents removing recently-pulled
# images that haven't started yet. `docker image prune -a` only removes
# images not referenced by any container (running or stopped), so the
# age filter adds an extra safety margin.
#
# Hosts: VM 106 (docker-home), VM 110 (discord-bots), VM 112 (databases-bots),
# VM 115 (docker-sba), VM 116 (docker-home-servers), manticore
#
# Controller: LXC 304 (ansible-controller) at 10.10.0.232
#
# Usage:
# # Dry run (shows what would change, skips writes)
# ansible-playbook /opt/ansible/playbooks/docker-prune.yml --check
#
# # Single host
# ansible-playbook /opt/ansible/playbooks/docker-prune.yml --limit docker-sba
#
# # All Docker hosts
# ansible-playbook /opt/ansible/playbooks/docker-prune.yml
#
# To undo: rm /etc/cron.monthly/docker-prune on target hosts
- name: Deploy Docker monthly prune cron to all Docker hosts
  hosts: docker-home:discord-bots:databases-bots:docker-sba:docker-home-servers:manticore
  become: true
  tasks:
    - name: Deploy docker-prune cron script
      ansible.builtin.copy:
        dest: /etc/cron.monthly/docker-prune
        owner: root
        group: root
        mode: "0755"
        content: |
          #!/bin/bash
          # Monthly Docker cleanup — deployed by Ansible (issue #29)
          # Prunes stopped containers, unused images (>30 days), and orphaned volumes.
          # Volumes labeled `keep` are exempt from volume pruning.
          set -euo pipefail
          docker container prune -f --filter "until=720h"
          docker image prune -a -f --filter "until=720h"
          docker volume prune -f --filter "label!=keep"
    - name: Verify docker-prune script is executable
      ansible.builtin.command: test -x /etc/cron.monthly/docker-prune
      changed_when: false
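The `until=720h` filters above act as an age gate: only containers and images created more than 720 hours (30 days) before the run are prune candidates, which is what protects recently-pulled images. The rule in isolation, applied to made-up image records:

```python
from datetime import datetime, timedelta

PRUNE_AGE = timedelta(hours=720)  # mirrors --filter "until=720h"

def prunable(created: datetime, now: datetime) -> bool:
    """True when an object is old enough to pass the until=720h gate."""
    return now - created > PRUNE_AGE

now = datetime(2026, 4, 6)
images = {
    "app:old": now - timedelta(days=45),  # hypothetical stale image
    "app:new": now - timedelta(days=2),   # recently pulled, kept
}
candidates = [name for name, created in images.items() if prunable(created, now)]
print(candidates)  # → ['app:old']
```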

gitea-cleanup.yml

@@ -0,0 +1,80 @@
---
# gitea-cleanup.yml — Weekly cleanup of Gitea server disk space
#
# Removes stale Docker buildx volumes, unused images, Gitea repo-archive
# cache, and vacuums journal logs to prevent disk exhaustion on LXC 225.
#
# Schedule: Weekly via systemd timer on LXC 304 (ansible-controller)
#
# Usage:
# ansible-playbook /opt/ansible/playbooks/gitea-cleanup.yml # full run
# ansible-playbook /opt/ansible/playbooks/gitea-cleanup.yml --check # dry run
- name: Gitea server disk cleanup
  hosts: gitea
  gather_facts: false
  tasks:
    - name: Check current disk usage
      ansible.builtin.shell: df --output=pcent / | tail -1
      register: disk_before
      changed_when: false
    - name: Display current disk usage
      ansible.builtin.debug:
        msg: "Disk usage before cleanup: {{ disk_before.stdout | trim }}"
    - name: Clear Gitea repo-archive cache
      ansible.builtin.find:
        paths: /var/lib/gitea/data/repo-archive
        file_type: any
      register: repo_archive_files
    - name: Remove repo-archive files
      ansible.builtin.file:
        path: "{{ item.path }}"
        state: absent
      loop: "{{ repo_archive_files.files }}"
      loop_control:
        label: "{{ item.path | basename }}"
      when: repo_archive_files.files | length > 0
    - name: Remove orphaned Docker buildx volumes
      ansible.builtin.shell: |
        volumes=$(docker volume ls -q --filter name=buildx_buildkit)
        if [ -n "$volumes" ]; then
          echo "$volumes" | xargs docker volume rm 2>&1
        else
          echo "No buildx volumes to remove"
        fi
      register: buildx_cleanup
      changed_when: "'No buildx volumes' not in buildx_cleanup.stdout"
    - name: Prune unused Docker images
      ansible.builtin.command: docker image prune -af
      register: image_prune
      changed_when: "'Total reclaimed space: 0B' not in image_prune.stdout"
    - name: Prune unused Docker volumes
      ansible.builtin.command: docker volume prune -f
      register: volume_prune
      changed_when: "'Total reclaimed space: 0B' not in volume_prune.stdout"
    - name: Vacuum journal logs to 500M
      ansible.builtin.command: journalctl --vacuum-size=500M
      register: journal_vacuum
      changed_when: "'freed 0B' not in journal_vacuum.stderr"
    - name: Check disk usage after cleanup
      ansible.builtin.shell: df --output=pcent / | tail -1
      register: disk_after
      changed_when: false
    - name: Display cleanup summary
      ansible.builtin.debug:
        msg: >-
          Cleanup complete.
          Disk: {{ disk_before.stdout | default('N/A') | trim }} → {{ disk_after.stdout | default('N/A') | trim }}.
          Buildx: {{ (buildx_cleanup.stdout_lines | default(['N/A'])) | last }}.
          Images: {{ (image_prune.stdout_lines | default(['N/A'])) | last }}.
          Journal: {{ (journal_vacuum.stderr_lines | default(['N/A'])) | last }}.
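The `changed_when` expressions above derive idempotence from Docker's own output: a prune counts as changed unless it reports zero reclaimed space. The predicate in isolation, applied to sample prune output:

```python
def prune_changed(output: str) -> bool:
    """Report 'changed' unless docker says nothing was reclaimed."""
    return "Total reclaimed space: 0B" not in output

# Sample outputs in the shape docker prune commands print.
print(prune_changed("Deleted Images:\nsha256:abc...\nTotal reclaimed space: 1.2GB"))  # → True
print(prune_changed("Total reclaimed space: 0B"))  # → False
```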

mask-avahi.yml

@@ -0,0 +1,43 @@
---
# Mask avahi-daemon on all Ubuntu hosts
#
# Avahi (mDNS/Bonjour) is not needed in a static-IP homelab with Pi-hole DNS.
# A kernel busy-loop bug in avahi-daemon was found consuming ~1.7 CPU cores
# across 5 VMs. Masking prevents it from ever starting again, surviving reboots.
#
# Targets: vms + physical (all Ubuntu QEMU VMs and ubuntu-manticore)
# Controller: ansible-controller (LXC 304 at 10.10.0.232)
#
# Usage:
# # Dry run
# ansible-playbook /opt/ansible/playbooks/mask-avahi.yml --check
#
# # Test on a single host first
# ansible-playbook /opt/ansible/playbooks/mask-avahi.yml --limit discord-bots
#
# # Roll out to all Ubuntu hosts
# ansible-playbook /opt/ansible/playbooks/mask-avahi.yml
#
# To undo: systemctl unmask avahi-daemon
- name: Mask avahi-daemon on all Ubuntu hosts
  hosts: vms:physical
  become: true
  tasks:
    - name: Stop avahi-daemon
      ansible.builtin.systemd:
        name: avahi-daemon
        state: stopped
      ignore_errors: true
    - name: Mask avahi-daemon
      ansible.builtin.systemd:
        name: avahi-daemon
        masked: true
    - name: Verify avahi is masked
      ansible.builtin.command: systemctl is-enabled avahi-daemon
      register: avahi_status
      changed_when: false
      failed_when: avahi_status.stdout | trim != 'masked'

monthly-reboot.yml

@@ -0,0 +1,265 @@
---
# Monthly Proxmox Maintenance Reboot — Shutdown & Reboot
#
# Orchestrates a graceful shutdown of all guests in dependency order,
# then issues a fire-and-forget reboot to the Proxmox host.
#
# After the host reboots, LXC 304 auto-starts via onboot:1 and the
# post-reboot-startup.yml playbook runs automatically via the
# ansible-post-reboot.service systemd unit (triggered by @reboot).
#
# Schedule: 1st Sunday of each month, 08:00 UTC (3 AM ET)
# Controller: LXC 304 (ansible-controller) at 10.10.0.232
#
# Usage:
# # Dry run
# ansible-playbook /opt/ansible/playbooks/monthly-reboot.yml --check
#
# # Full execution
# ansible-playbook /opt/ansible/playbooks/monthly-reboot.yml
#
# # Shutdown only (skip the host reboot)
# ansible-playbook /opt/ansible/playbooks/monthly-reboot.yml --tags shutdown
#
# Note: VM 109 (homeassistant) is excluded from Ansible inventory
# (self-managed via HA Supervisor) but is included in pvesh start/stop.
- name: Pre-reboot health check and snapshots
  hosts: pve-node
  gather_facts: false
  tags: [pre-reboot, shutdown]
  tasks:
    - name: Check Proxmox cluster health
      ansible.builtin.command: pvesh get /cluster/status --output-format json
      register: cluster_status
      changed_when: false
    - name: Get list of running QEMU VMs
      ansible.builtin.shell: >
        pvesh get /nodes/proxmox/qemu --output-format json |
        python3 -c "import sys,json; [print(vm['vmid']) for vm in json.load(sys.stdin) if vm.get('status')=='running']"
      register: running_vms
      changed_when: false
    - name: Get list of running LXC containers
      ansible.builtin.shell: >
        pvesh get /nodes/proxmox/lxc --output-format json |
        python3 -c "import sys,json; [print(ct['vmid']) for ct in json.load(sys.stdin) if ct.get('status')=='running']"
      register: running_lxcs
      changed_when: false
    - name: Display running guests
      ansible.builtin.debug:
        msg: "Running VMs: {{ running_vms.stdout_lines }} | Running LXCs: {{ running_lxcs.stdout_lines }}"
    - name: Snapshot running VMs
      ansible.builtin.command: >
        pvesh create /nodes/proxmox/qemu/{{ item }}/snapshot
        --snapname pre-maintenance-{{ lookup('pipe', 'date +%Y-%m-%d') }}
        --description "Auto snapshot before monthly maintenance reboot"
      loop: "{{ running_vms.stdout_lines }}"
      when: running_vms.stdout_lines | length > 0
      ignore_errors: true
    - name: Snapshot running LXCs
      ansible.builtin.command: >
        pvesh create /nodes/proxmox/lxc/{{ item }}/snapshot
        --snapname pre-maintenance-{{ lookup('pipe', 'date +%Y-%m-%d') }}
        --description "Auto snapshot before monthly maintenance reboot"
      loop: "{{ running_lxcs.stdout_lines }}"
      when: running_lxcs.stdout_lines | length > 0
      ignore_errors: true

- name: "Shutdown Tier 4 — Media & Others"
  hosts: pve-node
  gather_facts: false
  tags: [shutdown]
  vars:
    tier4_vms: [109]
    # LXC 303 (mcp-gateway) is onboot=0 and operator-managed — not included here
    tier4_lxcs: [221, 222, 223, 302]
  tasks:
    - name: Shutdown Tier 4 VMs
      ansible.builtin.command: pvesh create /nodes/proxmox/qemu/{{ item }}/status/shutdown
      loop: "{{ tier4_vms }}"
      ignore_errors: true
    - name: Shutdown Tier 4 LXCs
      ansible.builtin.command: pvesh create /nodes/proxmox/lxc/{{ item }}/status/shutdown
      loop: "{{ tier4_lxcs }}"
      ignore_errors: true
    - name: Wait for Tier 4 VMs to stop
      ansible.builtin.shell: >
        pvesh get /nodes/proxmox/qemu/{{ item }}/status/current --output-format json |
        python3 -c "import sys,json; print(json.load(sys.stdin).get('status','unknown'))"
      register: t4_vm_status
      until: t4_vm_status.stdout.strip() == "stopped"
      retries: 12
      delay: 5
      loop: "{{ tier4_vms }}"
      ignore_errors: true
    - name: Wait for Tier 4 LXCs to stop
      ansible.builtin.shell: >
        pvesh get /nodes/proxmox/lxc/{{ item }}/status/current --output-format json |
        python3 -c "import sys,json; print(json.load(sys.stdin).get('status','unknown'))"
      register: t4_lxc_status
      until: t4_lxc_status.stdout.strip() == "stopped"
      retries: 12
      delay: 5
      loop: "{{ tier4_lxcs }}"
      ignore_errors: true

- name: "Shutdown Tier 3 — Applications"
  hosts: pve-node
  gather_facts: false
  tags: [shutdown]
  vars:
    tier3_vms: [115, 110]
    tier3_lxcs: [301]
  tasks:
    - name: Shutdown Tier 3 VMs
      ansible.builtin.command: pvesh create /nodes/proxmox/qemu/{{ item }}/status/shutdown
      loop: "{{ tier3_vms }}"
      ignore_errors: true
    - name: Shutdown Tier 3 LXCs
      ansible.builtin.command: pvesh create /nodes/proxmox/lxc/{{ item }}/status/shutdown
      loop: "{{ tier3_lxcs }}"
      ignore_errors: true
    - name: Wait for Tier 3 VMs to stop
      ansible.builtin.shell: >
        pvesh get /nodes/proxmox/qemu/{{ item }}/status/current --output-format json |
        python3 -c "import sys,json; print(json.load(sys.stdin).get('status','unknown'))"
      register: t3_vm_status
      until: t3_vm_status.stdout.strip() == "stopped"
      retries: 12
      delay: 5
      loop: "{{ tier3_vms }}"
      ignore_errors: true
    - name: Wait for Tier 3 LXCs to stop
      ansible.builtin.shell: >
        pvesh get /nodes/proxmox/lxc/{{ item }}/status/current --output-format json |
        python3 -c "import sys,json; print(json.load(sys.stdin).get('status','unknown'))"
      register: t3_lxc_status
      until: t3_lxc_status.stdout.strip() == "stopped"
      retries: 12
      delay: 5
      loop: "{{ tier3_lxcs }}"
      ignore_errors: true

- name: "Shutdown Tier 2 — Infrastructure"
  hosts: pve-node
  gather_facts: false
  tags: [shutdown]
  vars:
    tier2_vms: [106, 116]
    tier2_lxcs: [225, 210, 227]
  tasks:
    - name: Shutdown Tier 2 VMs
      ansible.builtin.command: pvesh create /nodes/proxmox/qemu/{{ item }}/status/shutdown
      loop: "{{ tier2_vms }}"
      ignore_errors: true
    - name: Shutdown Tier 2 LXCs
      ansible.builtin.command: pvesh create /nodes/proxmox/lxc/{{ item }}/status/shutdown
      loop: "{{ tier2_lxcs }}"
      ignore_errors: true
    - name: Wait for Tier 2 VMs to stop
      ansible.builtin.shell: >
        pvesh get /nodes/proxmox/qemu/{{ item }}/status/current --output-format json |
        python3 -c "import sys,json; print(json.load(sys.stdin).get('status','unknown'))"
      register: t2_vm_status
      until: t2_vm_status.stdout.strip() == "stopped"
      retries: 12
      delay: 5
      loop: "{{ tier2_vms }}"
      ignore_errors: true
    - name: Wait for Tier 2 LXCs to stop
      ansible.builtin.shell: >
        pvesh get /nodes/proxmox/lxc/{{ item }}/status/current --output-format json |
        python3 -c "import sys,json; print(json.load(sys.stdin).get('status','unknown'))"
      register: t2_lxc_status
      until: t2_lxc_status.stdout.strip() == "stopped"
      retries: 12
      delay: 5
      loop: "{{ tier2_lxcs }}"
      ignore_errors: true

- name: "Shutdown Tier 1 — Databases"
  hosts: pve-node
  gather_facts: false
  tags: [shutdown]
  vars:
    tier1_vms: [112]
  tasks:
    - name: Shutdown database VMs
      ansible.builtin.command: pvesh create /nodes/proxmox/qemu/{{ item }}/status/shutdown
      loop: "{{ tier1_vms }}"
      ignore_errors: true
    - name: Wait for database VMs to stop (up to 90s)
      ansible.builtin.shell: >
        pvesh get /nodes/proxmox/qemu/{{ item }}/status/current --output-format json |
        python3 -c "import sys,json; print(json.load(sys.stdin).get('status','unknown'))"
      register: t1_vm_status
      until: t1_vm_status.stdout.strip() == "stopped"
      retries: 18
      delay: 5
      loop: "{{ tier1_vms }}"
      ignore_errors: true
    - name: Force stop database VMs if still running
      ansible.builtin.shell: >
        status=$(pvesh get /nodes/proxmox/qemu/{{ item }}/status/current --output-format json |
        python3 -c "import sys,json; print(json.load(sys.stdin).get('status','unknown'))");
        if [ "$status" = "running" ]; then
        pvesh create /nodes/proxmox/qemu/{{ item }}/status/stop;
        echo "Force stopped VM {{ item }}";
        else
        echo "VM {{ item }} already stopped";
        fi
      loop: "{{ tier1_vms }}"
      register: force_stop_result
      changed_when: "'Force stopped' in force_stop_result.stdout"

- name: "Verify and reboot Proxmox host"
  hosts: pve-node
  gather_facts: false
  tags: [reboot]
  tasks:
    - name: Verify all guests are stopped (excluding LXC 304)
      ansible.builtin.shell: >
        running_vms=$(pvesh get /nodes/proxmox/qemu --output-format json |
        python3 -c "import sys,json; vms=[v for v in json.load(sys.stdin) if v.get('status')=='running']; print(len(vms))");
        running_lxcs=$(pvesh get /nodes/proxmox/lxc --output-format json |
        python3 -c "import sys,json; cts=[c for c in json.load(sys.stdin) if c.get('status')=='running' and c['vmid'] != 304]; print(len(cts))");
        echo "Running VMs: $running_vms, Running LXCs: $running_lxcs";
        if [ "$running_vms" != "0" ] || [ "$running_lxcs" != "0" ]; then exit 1; fi
      register: verify_stopped
    - name: Issue fire-and-forget reboot (controller will be killed)
      ansible.builtin.shell: >
        nohup bash -c 'sleep 10 && reboot' &>/dev/null &
        echo "Reboot scheduled in 10 seconds"
      register: reboot_issued
      when: not ansible_check_mode
    - name: Log reboot issued
      ansible.builtin.debug:
        msg: "{{ reboot_issued.stdout | default('Reboot skipped (check mode)') }} — Ansible process will terminate when host reboots. Post-reboot startup handled by ansible-post-reboot.service on LXC 304."
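Every "Wait for … to stop" task above is the same poll loop: read the guest status, succeed on `stopped`, otherwise retry with a fixed delay until the attempts run out. A stubbed sketch of that control flow (the attempt accounting here is approximate; Ansible's own retries bookkeeping may differ by one):

```python
import time

def wait_for_stopped(get_status, retries=12, delay=5, sleep=time.sleep):
    """Poll get_status() until it returns 'stopped'; mirrors the retries/delay pattern above."""
    for attempt in range(retries):
        if get_status() == "stopped":
            return True
        if attempt < retries - 1:
            sleep(delay)  # fixed back-off between polls, like `delay: 5`
    return False

# Stub: guest reports 'running' twice, then 'stopped'.
states = iter(["running", "running", "stopped"])
print(wait_for_stopped(lambda: next(states), sleep=lambda _: None))  # → True
```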

post-reboot-startup.yml

@@ -0,0 +1,214 @@
---
# Post-Reboot Startup — Controlled Guest Startup After Proxmox Reboot
#
# Starts all guests in dependency order with staggered delays to avoid
# I/O storms. Runs automatically via ansible-post-reboot.service on
# LXC 304 after the Proxmox host reboots.
#
# Can also be run manually:
# ansible-playbook /opt/ansible/playbooks/post-reboot-startup.yml
#
# Note: VM 109 (homeassistant) is excluded from Ansible inventory
# (self-managed via HA Supervisor) but is included in pvesh start/stop.
- name: Wait for Proxmox API to be ready
hosts: pve-node
gather_facts: false
tags: [startup]
tasks:
- name: Wait for Proxmox API
ansible.builtin.command: pvesh get /version --output-format json
register: pve_version
until: pve_version.rc == 0
retries: 30
delay: 10
changed_when: false
- name: Display Proxmox version
ansible.builtin.debug:
msg: "Proxmox API ready: {{ pve_version.stdout | from_json | json_query('version') | default('unknown') }}"
- name: "Startup Tier 1 — Databases"
hosts: pve-node
gather_facts: false
tags: [startup]
tasks:
- name: Start database VM (112)
ansible.builtin.command: pvesh create /nodes/proxmox/qemu/112/status/start
ignore_errors: true
- name: Wait for VM 112 to be running
ansible.builtin.shell: >
pvesh get /nodes/proxmox/qemu/112/status/current --output-format json |
python3 -c "import sys,json; print(json.load(sys.stdin).get('status','unknown'))"
register: db_status
until: db_status.stdout.strip() == "running"
retries: 12
delay: 5
changed_when: false
- name: Wait for database services to initialize
ansible.builtin.pause:
seconds: 30
- name: "Startup Tier 2 — Infrastructure"
hosts: pve-node
gather_facts: false
tags: [startup]
vars:
tier2_vms: [106, 116]
tier2_lxcs: [225, 210, 227]
tasks:
- name: Start Tier 2 VMs
ansible.builtin.command: pvesh create /nodes/proxmox/qemu/{{ item }}/status/start
loop: "{{ tier2_vms }}"
ignore_errors: true
- name: Start Tier 2 LXCs
ansible.builtin.command: pvesh create /nodes/proxmox/lxc/{{ item }}/status/start
loop: "{{ tier2_lxcs }}"
ignore_errors: true
- name: Wait for infrastructure to come up
ansible.builtin.pause:
seconds: 30
- name: "Startup Tier 3 — Applications"
hosts: pve-node
gather_facts: false
tags: [startup]
vars:
tier3_vms: [115, 110]
tier3_lxcs: [301]
tasks:
- name: Start Tier 3 VMs
ansible.builtin.command: pvesh create /nodes/proxmox/qemu/{{ item }}/status/start
loop: "{{ tier3_vms }}"
ignore_errors: true
- name: Start Tier 3 LXCs
ansible.builtin.command: pvesh create /nodes/proxmox/lxc/{{ item }}/status/start
loop: "{{ tier3_lxcs }}"
ignore_errors: true
- name: Wait for applications to start
ansible.builtin.pause:
seconds: 30
- name: Restart Pi-hole container via SSH (UDP DNS fix)
ansible.builtin.command: ssh docker-home "docker restart pihole"
ignore_errors: true
- name: Wait for Pi-hole to stabilize
ansible.builtin.pause:
seconds: 10
- name: "Startup Tier 4 — Media & Others"
hosts: pve-node
gather_facts: false
tags: [startup]
vars:
tier4_vms: [109]
tier4_lxcs: [221, 222, 223, 302]
tasks:
- name: Start Tier 4 VMs
ansible.builtin.command: pvesh create /nodes/proxmox/qemu/{{ item }}/status/start
loop: "{{ tier4_vms }}"
ignore_errors: true
- name: Start Tier 4 LXCs
ansible.builtin.command: pvesh create /nodes/proxmox/lxc/{{ item }}/status/start
loop: "{{ tier4_lxcs }}"
ignore_errors: true
- name: Post-reboot validation
  hosts: pve-node
  gather_facts: false
  tags: [startup, validate]
  tasks:
    - name: Wait for all services to initialize
      ansible.builtin.pause:
        seconds: 60
    # Note: shell blocks use "|" (literal scalar), not ">" — folding would
    # collapse the embedded Python onto one line and break it.
    - name: Check all expected VMs are running
      ansible.builtin.shell: |
        pvesh get /nodes/proxmox/qemu --output-format json |
        python3 -c "
        import sys, json
        vms = json.load(sys.stdin)
        expected = {106, 109, 110, 112, 115, 116}
        running = {v['vmid'] for v in vms if v.get('status') == 'running'}
        missing = expected - running
        if missing:
            print(f'WARN: VMs not running: {missing}')
            sys.exit(1)
        print(f'All expected VMs running: {running & expected}')
        "
      register: vm_check
      ignore_errors: true
    - name: Check all expected LXCs are running
      ansible.builtin.shell: |
        pvesh get /nodes/proxmox/lxc --output-format json |
        python3 -c "
        import sys, json
        cts = json.load(sys.stdin)
        # LXC 303 (mcp-gateway) intentionally excluded — onboot=0, operator-managed
        expected = {210, 221, 222, 223, 225, 227, 301, 302, 304}
        running = {c['vmid'] for c in cts if c.get('status') == 'running'}
        missing = expected - running
        if missing:
            print(f'WARN: LXCs not running: {missing}')
            sys.exit(1)
        print(f'All expected LXCs running: {running & expected}')
        "
      register: lxc_check
      ignore_errors: true
    - name: Clean up old maintenance snapshots (older than 7 days)
      ansible.builtin.shell: |
        cutoff=$(date -d '7 days ago' +%s)
        for vmid in $(pvesh get /nodes/proxmox/qemu --output-format json |
            python3 -c "import sys,json; [print(v['vmid']) for v in json.load(sys.stdin)]"); do
          for snap in $(pvesh get /nodes/proxmox/qemu/$vmid/snapshot --output-format json |
              python3 -c "import sys,json; [print(s['name']) for s in json.load(sys.stdin) if s['name'].startswith('pre-maintenance-')]" 2>/dev/null); do
            snap_date=$(echo $snap | sed 's/pre-maintenance-//')
            snap_epoch=$(date -d "$snap_date" +%s 2>/dev/null)
            if [ -z "$snap_epoch" ]; then
              echo "WARN: could not parse date for snapshot $snap on VM $vmid"
            elif [ "$snap_epoch" -lt "$cutoff" ]; then
              pvesh delete /nodes/proxmox/qemu/$vmid/snapshot/$snap && echo "Deleted $snap from VM $vmid"
            fi
          done
        done
        for ctid in $(pvesh get /nodes/proxmox/lxc --output-format json |
            python3 -c "import sys,json; [print(c['vmid']) for c in json.load(sys.stdin)]"); do
          for snap in $(pvesh get /nodes/proxmox/lxc/$ctid/snapshot --output-format json |
              python3 -c "import sys,json; [print(s['name']) for s in json.load(sys.stdin) if s['name'].startswith('pre-maintenance-')]" 2>/dev/null); do
            snap_date=$(echo $snap | sed 's/pre-maintenance-//')
            snap_epoch=$(date -d "$snap_date" +%s 2>/dev/null)
            if [ -z "$snap_epoch" ]; then
              echo "WARN: could not parse date for snapshot $snap on LXC $ctid"
            elif [ "$snap_epoch" -lt "$cutoff" ]; then
              pvesh delete /nodes/proxmox/lxc/$ctid/snapshot/$snap && echo "Deleted $snap from LXC $ctid"
            fi
          done
        done
        echo "Snapshot cleanup complete"
      ignore_errors: true
    - name: Display validation results
      ansible.builtin.debug:
        msg:
          - "VM status: {{ vm_check.stdout }}"
          - "LXC status: {{ lxc_check.stdout }}"
          - "Maintenance reboot complete — post-reboot startup finished"
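The retention logic in the cleanup task is easy to sanity-check in isolation. A minimal Python sketch, assuming snapshot names of the form `pre-maintenance-YYYY-MM-DD` (the real names may carry a different date suffix):

```python
from datetime import datetime, timedelta

def is_stale(snap_name: str, now: datetime, max_age_days: int = 7) -> bool:
    """Mirror the playbook's cutoff check: strip the prefix, parse the
    date, and report whether the snapshot exceeds the retention window."""
    prefix = "pre-maintenance-"
    if not snap_name.startswith(prefix):
        return False
    try:
        snap_date = datetime.strptime(snap_name[len(prefix):], "%Y-%m-%d")
    except ValueError:
        # unparseable dates are skipped, matching the playbook's WARN branch
        return False
    return snap_date < now - timedelta(days=max_age_days)

now = datetime(2026, 4, 10)
print(is_stale("pre-maintenance-2026-04-01", now))  # → True
print(is_stale("pre-maintenance-2026-04-09", now))  # → False
```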

@@ -0,0 +1,15 @@
[Unit]
Description=Monthly Proxmox maintenance reboot (Ansible)
After=network-online.target
Wants=network-online.target
[Service]
Type=oneshot
User=cal
WorkingDirectory=/opt/ansible
ExecStart=/usr/bin/ansible-playbook /opt/ansible/playbooks/monthly-reboot.yml
StandardOutput=append:/opt/ansible/logs/monthly-reboot.log
StandardError=append:/opt/ansible/logs/monthly-reboot.log
TimeoutStartSec=900
# No [Install] section — this service is activated exclusively by ansible-monthly-reboot.timer

@@ -0,0 +1,13 @@
[Unit]
Description=Monthly Proxmox maintenance reboot timer
Documentation=https://git.manticorum.com/cal/claude-home/src/branch/main/server-configs/proxmox/maintenance-reboot.md
[Timer]
# First Sunday of the month at 08:00 UTC (3:00 AM ET during EDT)
# Day range 01-07 ensures it's always the first occurrence of that weekday
OnCalendar=Sun *-*-01..07 08:00:00
Persistent=true
RandomizedDelaySec=600
[Install]
WantedBy=timers.target
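The comment's day-range claim holds in general: the first occurrence of any weekday in a month always lands on days 1-7, so `Sun *-*-01..07` always matches exactly the first Sunday. A quick verification:

```python
import calendar

def first_sunday(year: int, month: int) -> int:
    # day numbers of the month's Sundays; take the first one
    return [d for d in calendar.Calendar().itermonthdays(year, month)
            if d and calendar.weekday(year, month, d) == calendar.SUNDAY][0]

# The first Sunday can never fall later than day 7.
assert all(1 <= first_sunday(2026, m) <= 7 for m in range(1, 13))
print(first_sunday(2026, 4))  # → 5
```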

@@ -0,0 +1,21 @@
[Unit]
Description=Post-reboot controlled guest startup (Ansible)
After=network-online.target
Wants=network-online.target
# Only run after a fresh boot — not on service restart
ConditionUpTimeSec=600
[Service]
Type=oneshot
User=cal
WorkingDirectory=/opt/ansible
# Delay 120s to let Proxmox API stabilize and onboot guests settle
ExecStartPre=/bin/sleep 120
ExecStart=/usr/bin/ansible-playbook /opt/ansible/playbooks/post-reboot-startup.yml
StandardOutput=append:/opt/ansible/logs/post-reboot-startup.log
StandardError=append:/opt/ansible/logs/post-reboot-startup.log
TimeoutStartSec=1800
[Install]
# Runs automatically on every boot of LXC 304
WantedBy=multi-user.target

File diff suppressed because it is too large

@@ -0,0 +1,297 @@
# Home Network Review — Design Spec
**Date:** 2026-04-08
**Approach:** Hybrid Layer-by-Layer (discover-then-fix per layer, bottom-up)
**Execution model:** Sub-agent driven — parallel agents within each layer's discovery/analysis phases, sequential remediation
## Context
### Current Infrastructure
- **Router/Gateway:** UniFi UDM Pro
- **Switch:** US-24-PoE (250W)
- **Access Points:** 3x UAP-AC-Lite (Office, First Floor, Upper Floor)
- **Hypervisor:** Proxmox at `10.10.0.10`
- **Physical server:** ubuntu-manticore (`10.10.0.226`) — Pi-hole, Jellyfin, Tdarr, KB RAG stack
- **VM 115:** docker-sba (`10.10.0.88`) — Paper Dynasty, SBA services
- **NAS:** TrueNAS at `10.10.0.35`
- **Reverse proxy:** Nginx Proxy Manager — external access via `*.manticorum.com`
- **DNS:** Dual Pi-hole HA — primary `10.10.0.16` (npm-pihole LXC), secondary `10.10.0.226` (manticore), synced via Orbital Sync + NPM DNS sync cron
### Current Network Topology
| Network | Subnet | Purpose |
|---------|--------|---------|
| Home | `10.0.0.0/23` | Personal devices |
| Lab | `10.10.0.0/24` | Homelab infrastructure |
### Known Issues & Goals (Priority Order)
1. **Performance (C):** Roku on Upper Floor AP has 6 Mbps Rx rate despite -44 dBm signal. 1x1 MIMO, AP/Client Signal Balance: Poor. Likely AP TX power asymmetry with weak client radio.
2. **Cleanup (D):** A handful of custom firewall rules need a sanity check. The internal `.homelab.local` domain may not be functional — `.local` conflicts with mDNS (RFC 6762).
3. **Security (A):** Many services exposed via `*.manticorum.com` through NPM. Need WAN exposure audit.
4. **Reliability (B):** Validate Pi-hole HA failover, identify single points of failure.
5. **Expansion (E):** Add guest WiFi, expand Tailscale to full mesh, build smart home foundation.
### Additional Requirements
- **Guest WiFi:** New VLAN, isolated, internet-only
- **Tailscale:** Currently on phones with exit nodes on both networks. Goal: universal reachability — all devices can reach each other whether on home/lab network, cellular, or cloud
- **Smart Home:** Home Assistant antenna installed, not migrated. Previous Matter/HomeKit attempts failed. Want solid network foundation (IoT VLAN, mDNS) before going deeper
- **IoT VLAN:** Default-deny internet access. Per-device exceptions if needed.
## Design
### Agent Assignments
| Layer | Lead Agent(s) | Support |
|-------|---------------|---------|
| 1. WiFi & Physical | `network-engineer` | |
| 2. Network Architecture | `network-engineer` | `it-ops-orchestrator` |
| 3. DNS | `network-engineer` | |
| 4. Firewall & Security | `security-engineer`, `security-auditor` | |
| 5. Overlay & Remote Access | `network-engineer` | |
| 6. Smart Home Foundation | `iot-engineer` | `network-engineer` |
| Final Pass | `security-auditor` | `pentester` |
### Per-Layer Workflow
Each layer follows the same three-phase cycle:
1. **Discover** — export configs, scan current state, document baseline (parallel sub-agents)
2. **Analyze** — review findings, identify issues, produce recommendations (parallel sub-agents)
3. **Remediate** — implement changes, validate, document new state (sequential)
---
### Layer 1: WiFi & Physical
**Goal:** Optimize wireless performance, diagnose Roku issue, establish baseline RF environment.
**Discovery (parallel):**
- Export AP configs from UniFi (channels, power levels, band steering, DTIM, minimum RSSI)
- Pull client device list with signal/rate/retry stats
- Document AP placement (floor, room, mounting)
- Check for channel conflicts — 3 APs on 5GHz 80MHz channels could overlap
**Analysis (parallel):**
- Evaluate channel plan — non-overlapping channels? DFS channels available?
- Review AP power levels — high TX power on AC Lites causes asymmetry with weak client radios
- Assess band steering config — is 2.4GHz available as fallback?
- Roku-specific: determine if lowering AP-Upper Floor TX power or moving Roku to 2.4GHz improves Rx rate
**Remediation (sequential):**
- Apply optimized channel plan
- Adjust TX power levels per AP
- Configure minimum RSSI thresholds if not set
- Validate Roku improvement
- Document new baseline
**Key insight:** The Roku's 1x1 radio with 6 Mbps Rx rate at -44 dBm signal strongly suggests AP TX power is too high relative to what the Roku can transmit back. Lowering AP power or moving to 2.4GHz are the likely fixes.
---
### Layer 2: Network Architecture
**Goal:** Expand from 2 VLANs to 4, supporting guest WiFi and IoT isolation.
**Target VLAN layout:**
| VLAN | Name | Subnet | Purpose |
|------|------|--------|---------|
| Existing | Home | `10.0.0.0/23` | Trusted personal devices |
| Existing | Lab | `10.10.0.0/24` | Homelab servers, Proxmox, infrastructure |
| New | Guest | TBD (e.g., `10.20.0.0/24`) | Guest WiFi — internet only, no local access |
| New | IoT | TBD (e.g., `10.30.0.0/24`) | Smart devices — no internet by default |
**Discovery (parallel):**
- Export current VLAN config (VLAN IDs, DHCP scopes, assignments)
- Inventory all devices and current network placement
- Document inter-VLAN routing rules
- Check switch port VLAN assignments (tagged/untagged)
**Analysis (parallel):**
- Determine which devices move to IoT VLAN (Roku, smart bulbs, switches, HA hub)
- Design DHCP scopes for new VLANs
- Plan inter-VLAN access: IoT reaches HA only, HA reaches into IoT, no IoT internet
- WiFi SSIDs: one per VLAN or shared SSID with VLAN assignment?
**Remediation (sequential):**
- Create Guest and IoT VLANs in UniFi
- Configure DHCP for new VLANs
- Create WiFi networks (Guest SSID, IoT SSID)
- Migrate devices to appropriate VLANs
- Validate connectivity per VLAN
- Document new topology
---
### Layer 3: DNS
**Goal:** Validate Pi-hole HA, plan mDNS for smart home, ensure DNS works across all four VLANs.
**Discovery (parallel):**
- Validate Orbital Sync (matching blocklists, custom entries on both Pi-holes)
- Check NPM DNS sync cron — is `custom.list` consistent?
- Document current DNS records in `homelab.local` zone
- Check DHCP DNS server advertisements on both existing VLANs
**Analysis (parallel):**
- Verify failover: what happens when primary (`10.10.0.16`) goes down?
- DNS per VLAN: Guest gets Pi-hole (ad blocking) but NOT internal name resolution. IoT resolves HA only.
- mDNS for smart home — Matter/HomeKit use mDNS for discovery, which doesn't cross VLAN boundaries. Options:
- UniFi mDNS reflector (built-in, simple, reflects everything)
- Avahi reflector on a host (more granular)
- Explicit HA configuration for IoT VLAN discovery
- Check if iOS DNS bypass issue (from KB) is still relevant
**Remediation (sequential):**
- Configure DNS for Guest and IoT VLANs
- Set up mDNS reflection (method TBD)
- Fix any Orbital Sync or failover gaps
- Validate DNS resolution from each VLAN
- Document DNS architecture
---
### Layer 4: Firewall & Security
**Goal:** Clean up rules, audit WAN exposure, validate internal domain, harden perimeter.
**Discovery (parallel):**
- Export all UniFi firewall rules (WAN/LAN/Guest, in/out/local)
- Inventory all NPM proxy hosts — which services exposed on `*.manticorum.com`
- Test internal domain resolution: does `.homelab.local` work from each network?
- Check NPM SSL cert status and auto-renewal
- Document port forwards on UDM Pro
- Check UDM Pro WAN-facing services (remote management, STUN, UPnP)
**Analysis (parallel):**
- **Firewall rule audit:** Redundant, conflicting, or overly broad rules? Missing rules (e.g., IoT→Lab block)?
- **NPM exposure review:** Per proxy host — does it need to be internet-facing? Auth configured? Security headers (HSTS, X-Frame-Options, CSP)?
- **Internal domain strategy:** `.local` conflicts with mDNS. Options:
- Keep `.homelab.local` with Pi-hole handling (risk of mDNS collision)
- Switch to `lab.manticorum.com` with split DNS (recommended — you own the domain, no mDNS conflict, clean)
- Use `.home.arpa` (RFC 8375, purpose-built for home networks)
- **Inter-VLAN rules:** Guest = internet-only. IoT = no internet, HA access only. Lab = reachable from Home, not from Guest/IoT.
- **WAN hardening:** UPnP status, unnecessary exposure
**Remediation (sequential):**
- Remove/consolidate stale firewall rules
- Harden NPM proxy hosts (auth, headers, prune unnecessary exposure)
- Implement chosen internal domain strategy (recommendation: `lab.manticorum.com` split DNS)
- Create inter-VLAN firewall rules for Guest and IoT
- Disable UPnP if enabled, close unnecessary WAN exposure
- External port scan validation
- Document final ruleset and NPM inventory
---
### Layer 5: Overlay & Remote Access
**Goal:** Tailscale full mesh — universal reachability across home, cellular, and cloud.
**Discovery (parallel):**
- Document current Tailscale setup (devices, exit nodes, ACL policy)
- Check for subnet router usage vs exit-node-only
- Identify all devices for the mesh (workstation, phones, laptops, servers, cloud VMs)
- Check if OpenVPN is active or legacy
**Analysis (parallel):**
- **Architecture options:**
- Subnet routers: Tailscale on 1-2 hosts advertising home + lab subnets. Simpler, fewer installs.
- Full mesh: Tailscale on every server. Direct reachability, no SPOF, more to manage.
- Hybrid (recommended): Tailscale on key servers + subnet router for the rest.
- **DNS integration:** Tailscale MagicDNS vs Pi-hole coexistence
- **ACL policy:** Which devices reach which? Phones get everything? Cloud VMs lab-only?
- **Exit node strategy:** Keep current phone exit nodes? Add workstation?
- **OpenVPN decommission:** If Tailscale covers all use cases, remove it
**Remediation (sequential):**
- Install/configure Tailscale on chosen devices
- Set up subnet routes or direct mesh
- Configure Tailscale ACLs
- Integrate DNS (MagicDNS + Pi-hole)
- Test: home→cloud, cellular→lab, cloud→home
- Decommission OpenVPN if replaced
- Document mesh topology and ACLs
---
### Layer 6: Smart Home Foundation
**Goal:** IoT VLAN ready (from Layer 2), Home Assistant deployed, Matter/Thread infrastructure in place.
**Discovery (parallel):**
- Inventory smart devices — protocols (WiFi, Zigbee, Z-Wave, Matter, Thread)
- Document HA hardware (antenna type — Zigbee coordinator? Thread border router? SkyConnect?)
- Document previous HomeKit/Matter attempts — what failed and why
- Identify devices for HA migration
**Analysis (parallel):**
- **Protocol strategy:**
- Which devices support Matter (firmware update path)?
- WiFi-only devices → IoT VLAN, managed through HA
- Zigbee/Thread devices → HA radio, no VLAN needed
- **HA network placement:** Must reach IoT VLAN, be reachable from Home VLAN (UI), handle mDNS. Options: dedicated VM, container on manticore, dedicated hardware.
- **Matter/Thread specifics:**
- Thread border routers: same segment as HA coordinator
- Matter commissioning uses BLE + WiFi — which VLAN?
- Apple Home: HA HomeKit bridge vs replace HomeKit entirely
- **Migration path:** Phased, validate each batch
**Remediation (sequential):**
- Deploy Home Assistant (if not already running)
- Configure HA network access (IoT VLAN reach, Home VLAN UI)
- Set up Zigbee/Thread coordinator
- Migrate devices in phases
- Test Matter commissioning end-to-end
- Document device inventory, protocols, HA architecture
---
### Final Pass: Cross-Cutting Security Audit
**Goal:** Holistic review after all layers complete — catch anything missed or introduced.
**Agent:** `security-auditor` lead, `pentester` assist.
**Tasks:**
- Port scan from WAN — verify only intended services reachable
- Inter-VLAN isolation verification — Guest can't reach Lab/Home/IoT, IoT can't reach internet or Lab
- NPM proxy hosts: SSL + headers validated
- No default credentials on network gear or exposed services
- Tailscale ACLs match actual reachability
- Produce final network topology document
---
## Dependencies
```
Layer 1 (WiFi) ─────────────────────────────────────────────┐
│ │
Layer 2 (VLANs) ────────────────────────────────────────────┤
│ │
Layer 3 (DNS) ──────────────────────────────────────────────┤
│ │
Layer 4 (Firewall) ─────────────────────────────────────────┤
│ │
Layer 5 (Tailscale) ────────────────────────────────────────┤
│ │
Layer 6 (Smart Home) ───────────────────────────────────────┤
Final Pass
```
Layers are sequential — each builds on the one below. Within each layer, discovery and analysis phases run parallel sub-agents. Remediation is sequential within a layer.
## Deliverables
Per layer:
- Baseline snapshot (current state before changes)
- Changes made (with rationale)
- Validation results
- Updated documentation
Final:
- Complete network topology document
- Firewall rule inventory
- NPM proxy host inventory with security status
- Tailscale mesh diagram and ACL policy
- Smart home device inventory and protocol map
- Security audit report

@@ -21,7 +21,7 @@
{
"parameters": {
"operation": "executeCommand",
"command": "/root/.local/bin/claude -p \"Run python3 ~/.claude/skills/server-diagnostics/client.py health paper-dynasty and analyze the results. If any containers are not running or there are critical issues, summarize them. Otherwise just say 'All systems healthy'.\" --output-format json --json-schema '{\"type\":\"object\",\"properties\":{\"status\":{\"type\":\"string\",\"enum\":[\"healthy\",\"issues_found\"]},\"summary\":{\"type\":\"string\"},\"root_cause\":{\"type\":\"string\"},\"severity\":{\"type\":\"string\",\"enum\":[\"low\",\"medium\",\"high\",\"critical\"]},\"affected_services\":{\"type\":\"array\",\"items\":{\"type\":\"string\"}},\"actions_taken\":{\"type\":\"array\",\"items\":{\"type\":\"string\"}}},\"required\":[\"status\",\"severity\",\"summary\"]}' --allowedTools \"Read,Grep,Glob,Bash(python3 ~/.claude/skills/server-diagnostics/client.py *)\"",
"command": "/root/.local/bin/claude -p \"Run python3 ~/.claude/skills/server-diagnostics/client.py health paper-dynasty and analyze the results. If any containers are not running or there are critical issues, summarize them. Otherwise just say 'All systems healthy'.\" --output-format json --json-schema '{\"type\":\"object\",\"properties\":{\"status\":{\"type\":\"string\",\"enum\":[\"healthy\",\"issues_found\"]},\"summary\":{\"type\":\"string\"},\"root_cause\":{\"type\":\"string\"},\"severity\":{\"type\":\"string\",\"enum\":[\"low\",\"medium\",\"high\",\"critical\"]},\"affected_services\":{\"type\":\"array\",\"items\":{\"type\":\"string\"}},\"actions_taken\":{\"type\":\"array\",\"items\":{\"type\":\"string\"}}},\"required\":[\"status\",\"severity\",\"summary\"]}' --allowedTools \"Read,Grep,Glob,Bash(python3 ~/.claude/skills/server-diagnostics/client.py *)\" --append-system-prompt \"You are a server diagnostics agent. Use the server-diagnostics skill client.py for all operations. Never run destructive commands.\"",
"options": {}
},
"id": "ssh-claude-code",
@@ -75,20 +75,48 @@
"typeVersion": 2,
"position": [660, 0]
},
{
"parameters": {
"operation": "executeCommand",
"command": "=/root/.local/bin/claude -p \"The previous health check found issues. Investigate deeper: check container logs, resource usage, and recent events. Provide a detailed root cause analysis and recommended remediation steps.\" --resume \"{{ $json.session_id }}\" --output-format json --json-schema '{\"type\":\"object\",\"properties\":{\"root_cause_detail\":{\"type\":\"string\"},\"container_logs\":{\"type\":\"string\"},\"resource_status\":{\"type\":\"string\"},\"remediation_steps\":{\"type\":\"array\",\"items\":{\"type\":\"string\"}},\"requires_human\":{\"type\":\"boolean\"}},\"required\":[\"root_cause_detail\",\"remediation_steps\",\"requires_human\"]}' --allowedTools \"Read,Grep,Glob,Bash(python3 ~/.claude/skills/server-diagnostics/client.py *)\" --max-turns 15 --append-system-prompt \"You are a server diagnostics agent performing a follow-up investigation. The initial health check found issues. Dig deeper into logs and metrics. Never run destructive commands.\"",
"options": {}
},
"id": "ssh-followup",
"name": "Follow Up Diagnostics",
"type": "n8n-nodes-base.ssh",
"typeVersion": 1,
"position": [880, -200],
"credentials": {
"sshPassword": {
"id": "REPLACE_WITH_CREDENTIAL_ID",
"name": "Claude Code LXC"
}
}
},
{
"parameters": {
"jsCode": "// Parse follow-up diagnostics response\nconst stdout = $input.first().json.stdout || '';\nconst initial = $('Parse Claude Response').first().json;\n\ntry {\n const response = JSON.parse(stdout);\n const data = response.structured_output || JSON.parse(response.result || '{}');\n \n return [{\n json: {\n ...initial,\n followup: {\n root_cause_detail: data.root_cause_detail || 'No detail available',\n container_logs: data.container_logs || '',\n resource_status: data.resource_status || '',\n remediation_steps: data.remediation_steps || [],\n requires_human: data.requires_human || false,\n cost_usd: response.total_cost_usd,\n session_id: response.session_id\n },\n total_cost_usd: (initial.cost_usd || 0) + (response.total_cost_usd || 0)\n }\n }];\n} catch (e) {\n return [{\n json: {\n ...initial,\n followup: {\n error: e.message,\n root_cause_detail: 'Follow-up parse failed',\n remediation_steps: [],\n requires_human: true\n },\n total_cost_usd: initial.cost_usd || 0\n }\n }];\n}"
},
"id": "parse-followup",
"name": "Parse Follow-up Response",
"type": "n8n-nodes-base.code",
"typeVersion": 2,
"position": [1100, -200]
},
{
"parameters": {
"method": "POST",
"url": "https://discord.com/api/webhooks/1451783909409816763/O9PMDiNt6ZIWRf8HKocIZ_E4vMGV_lEwq50aAiZ9HVFR2UGwO6J1N9_wOm82p0MetIqT",
"sendBody": true,
"specifyBody": "json",
"jsonBody": "={\n \"embeds\": [{\n \"title\": \"{{ $json.severity === 'critical' ? '🔴' : $json.severity === 'high' ? '🟠' : '🟡' }} Server Alert\",\n \"description\": {{ JSON.stringify($json.summary) }},\n \"color\": {{ $json.severity === 'critical' ? 15158332 : $json.severity === 'high' ? 15105570 : 16776960 }},\n \"fields\": [\n {\n \"name\": \"Severity\",\n \"value\": \"{{ $json.severity.toUpperCase() }}\",\n \"inline\": true\n },\n {\n \"name\": \"Server\",\n \"value\": \"paper-dynasty (10.10.0.88)\",\n \"inline\": true\n },\n {\n \"name\": \"Cost\",\n \"value\": \"${{ $json.cost_usd ? $json.cost_usd.toFixed(4) : '0.0000' }}\",\n \"inline\": true\n },\n {\n \"name\": \"Root Cause\",\n \"value\": \"{{ $json.root_cause || 'N/A' }}\",\n \"inline\": false\n },\n {\n \"name\": \"Affected Services\",\n \"value\": \"{{ $json.affected_services.length ? $json.affected_services.join(', ') : 'None' }}\",\n \"inline\": false\n },\n {\n \"name\": \"Actions Taken\",\n \"value\": \"{{ $json.actions_taken.length ? $json.actions_taken.join('\\n') : 'None' }}\",\n \"inline\": false\n }\n ],\n \"timestamp\": \"{{ new Date().toISOString() }}\"\n }]\n}",
"jsonBody": "={\n \"embeds\": [{\n \"title\": \"{{ $json.severity === 'critical' ? '🔴' : $json.severity === 'high' ? '🟠' : '🟡' }} Server Alert\",\n \"description\": {{ JSON.stringify($json.summary) }},\n \"color\": {{ $json.severity === 'critical' ? 15158332 : $json.severity === 'high' ? 15105570 : 16776960 }},\n \"fields\": [\n {\n \"name\": \"Severity\",\n \"value\": \"{{ $json.severity.toUpperCase() }}\",\n \"inline\": true\n },\n {\n \"name\": \"Server\",\n \"value\": \"paper-dynasty (10.10.0.88)\",\n \"inline\": true\n },\n {\n \"name\": \"Cost\",\n \"value\": \"${{ $json.total_cost_usd ? $json.total_cost_usd.toFixed(4) : '0.0000' }}\",\n \"inline\": true\n },\n {\n \"name\": \"Root Cause\",\n \"value\": {{ JSON.stringify(($json.followup && $json.followup.root_cause_detail) || $json.root_cause || 'N/A') }},\n \"inline\": false\n },\n {\n \"name\": \"Affected Services\",\n \"value\": \"{{ $json.affected_services.length ? $json.affected_services.join(', ') : 'None' }}\",\n \"inline\": false\n },\n {\n \"name\": \"Remediation Steps\",\n \"value\": {{ JSON.stringify(($json.followup && $json.followup.remediation_steps.length) ? $json.followup.remediation_steps.map((s, i) => (i+1) + '. ' + s).join('\\n') : ($json.actions_taken.length ? $json.actions_taken.join('\\n') : 'None')) }},\n \"inline\": false\n },\n {\n \"name\": \"Requires Human?\",\n \"value\": \"{{ ($json.followup && $json.followup.requires_human) ? '⚠️ Yes' : '✅ No' }}\",\n \"inline\": true\n }\n ],\n \"timestamp\": \"{{ new Date().toISOString() }}\"\n }]\n}",
"options": {}
},
"id": "discord-alert",
"name": "Discord Alert",
"type": "n8n-nodes-base.httpRequest",
"typeVersion": 4.2,
"position": [880, -100]
"position": [1320, -200]
},
{
"parameters": {
@@ -145,7 +173,7 @@
"main": [
[
{
"node": "Discord Alert",
"node": "Follow Up Diagnostics",
"type": "main",
"index": 0
}
@@ -158,6 +186,28 @@
}
]
]
},
"Follow Up Diagnostics": {
"main": [
[
{
"node": "Parse Follow-up Response",
"type": "main",
"index": 0
}
]
]
},
"Parse Follow-up Response": {
"main": [
[
{
"node": "Discord Alert",
"type": "main",
"index": 0
}
]
]
}
},
"settings": {

@@ -0,0 +1,69 @@
---
title: "Database API Release — 2026.4.7"
description: "Major cleanup: middleware connection management, security hardening, performance fixes, and Pydantic/Docker upgrades."
type: reference
domain: major-domo
tags: [release-notes, deployment, database, major-domo]
---
# Database API Release — 2026.4.7
**Date:** 2026-04-07
**Tag:** TBD (next CalVer tag after `2026.4.5`)
**Image:** `manticorum67/major-domo-database:{tag}` + `:latest`
**Server:** `ssh akamai` (`~/container-data/sba-database`)
**Deploy method:** `git tag -a YYYY.M.BUILD -m "description" && git push origin YYYY.M.BUILD` → CI builds Docker image → pull + restart on akamai
## Release Summary
Large batch merge of 22 PRs covering connection management, security hardening, query performance, code cleanup, and infrastructure upgrades. The headline change is middleware-based DB connection management replacing 177+ manual `db.close()` calls across all routers.
## Changes
### Architecture
- **Middleware connection management** — replaced all manual `db.close()` calls with HTTP middleware that opens connections before requests and closes after responses (PR #97)
- **Disabled autoconnect + pool timeout**`PooledPostgresqlDatabase` now uses `autoconnect=False` and `timeout=5` for tighter connection lifecycle control (PR #87)
- **Migration tracking system** — new system for tracking applied database migrations (PR #96)
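The middleware pattern from PR #97 can be sketched generically — this is illustrative, not the actual code, with a stub standing in for the real peewee database:

```python
# Open the DB connection before each request and close it after the
# response, instead of scattering db.close() through the route handlers.
class FakeDB:
    """Stub with the connect/close surface the middleware relies on."""
    def __init__(self): self.closed = True
    def connect(self): self.closed = False
    def close(self): self.closed = True
    def is_closed(self): return self.closed

def connection_middleware(db, handler):
    def wrapped(request):
        db.connect()
        try:
            return handler(request)      # connection is open for the handler
        finally:
            if not db.is_closed():
                db.close()               # always closed after the response
    return wrapped

db = FakeDB()
handler = connection_middleware(db, lambda req: f"db open={not db.closed}")
print(handler("GET /players"))  # → db open=True
print(db.closed)                # → True
```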
### Security
- **Removed hardcoded webhook URL** — Discord webhook URL moved to `DISCORD_WEBHOOK_URL` env var (PR #83). Old token is in git history — rotate it.
- **Removed hardcoded fallback DB password** — no more default password in `db_engine.py` (PR #55)
- **Removed token from log warnings** — Bad Token log messages no longer include the raw token value (PR #85)
### Performance
- **Batch standings updates** — eliminated N+1 queries in `recalculate_standings` (PR #93)
- **Bulk DELETE in career recalculation** — replaced row-by-row DELETE with single bulk operation (PR #92)
- **Added missing FK indexes** — indexes on FK columns in `stratplay` and `stratgame` tables (PR #95)
- **Fixed total_count in get_totalstats** — count no longer overwritten with page length (PR #102)
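The batch pattern behind the standings fix can be illustrated generically — compute every row in memory, then persist with a single bulk write instead of one UPDATE per team (model and field names here are invented, not the actual schema):

```python
def recalc_standings_batch(games: list[dict]) -> dict[str, dict]:
    """Accumulate all win/loss rows in one pass; the caller persists the
    result with a single bulk write rather than per-team UPDATEs."""
    standings: dict[str, dict] = {}
    for g in games:
        for team, won in ((g["winner"], True), (g["loser"], False)):
            row = standings.setdefault(team, {"wins": 0, "losses": 0})
            row["wins" if won else "losses"] += 1
    return standings

print(recalc_standings_batch([
    {"winner": "A", "loser": "B"},
    {"winner": "A", "loser": "C"},
]))
```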
### Bug Fixes
- **Boolean field comparisons** — replaced integer comparisons (`== 1`) with proper `True`/`False` (PR #94)
- **CustomCommandCreator.discord_id** — aligned model field with BIGINT column type (PR #88)
- **Literal validation on sort param**`GET /api/v3/players` now validates sort values (PR #68)
- **PitchingStat combined_season** — added missing classmethod for combined season stats (PR #67)
### Code Cleanup
- Removed SQLite fallback code from `db_engine.py` (PR #89)
- Replaced deprecated `.dict()` with `.model_dump()` across all Pydantic models (PR #90)
- Added type annotations to untyped query parameters (PR #86)
- Removed commented-out dead code blocks (PR #48)
- Replaced `print()` debug statements with `logger` calls in `db_engine.py` (PR #53)
- Removed unimplemented `is_trade` parameter from transactions endpoint (PR #57)
- Eliminated N+1 queries in `get_custom_commands` (PR #51)
### Infrastructure
- **Docker base image upgraded** from Python 3.11 to 3.12 (PR #91)
- **CI switched to tag-triggered builds** (PR #107)
## Known Issues
- ~20 unit tests broken by SQLite fallback removal — tests relied on SQLite that no longer exists (issue #108)
- `test_get_nonexistent_play` returns 500 instead of 404 (issue #109)
- `test_batting_sbaplayer_career_totals` returns 422 instead of 200 (issue #110)
## Deployment Notes
- **New env var required:** `DISCORD_WEBHOOK_URL` must be set in the container environment. Check that `docker-compose.yml` passes it through.
- **Rotate webhook token** — the old hardcoded token is in git history.
- **Migration tracking:** new migration table will be created on first run.
- **Rollback:** pin the image to `manticorum67/major-domo-database:2026.4.5` in `docker-compose.yml`, then `docker compose pull && docker compose up -d` (`docker compose pull` takes service names, not image references)

@@ -0,0 +1,37 @@
---
title: "Discord Bot Release — 2026.4.7"
description: "Minor fix: add missing logger to SubmitConfirmationModal."
type: reference
domain: major-domo
tags: [release-notes, deployment, discord, major-domo]
---
# Discord Bot Release — 2026.4.7
**Date:** 2026-04-07
**Tag:** TBD (next CalVer tag after `2026.3.13`)
**Image:** `manticorum67/major-domo-discordapp:{tag}` + `:production`
**Server:** `ssh akamai` (`~/container-data/major-domo`)
**Deploy method:** `.scripts/release.sh` → CI builds Docker image → `.scripts/deploy.sh`
## Release Summary
Minimal release with a single logging fix. Previous releases (2026.3.12 and 2026.3.13) included the larger performance and feature work (FA lock enforcement, trade view optimization, parallel lookups).
## Changes
### Bug Fixes
- **Missing logger in SubmitConfirmationModal** — added logger initialization that was absent, preventing proper error logging in transaction confirmation flows
## Not Included (PR #120)
PR #120 (caching for stable data) remains open with two unfixed issues:
1. `_channel_color_cache` cross-user contamination — cache keyed by channel only, user-specific colors bleed across users
2. `recalculate_standings()` doesn't invalidate standings cache
These must be addressed before PR #120 can merge.
## Deployment Notes
- No new env vars or config changes required
- **Rollback:** `.scripts/deploy.sh` with previous image tag, or `ssh akamai``docker compose pull manticorum67/major-domo-discordapp:2026.3.13 && docker compose up -d`

@@ -0,0 +1,128 @@
---
title: "APC UPS Monitoring with apcupsd and Discord Alerts"
description: "Setup guide for apcupsd on nobara-pc workstation with Discord webhook alerts for power events (on battery, off battery, battery replace, comm failure/restore)."
type: guide
domain: monitoring
tags: [apcupsd, ups, discord, webhook, power, alerts, usb]
---
# APC UPS Monitoring with apcupsd
## Overview
apcupsd monitors the APC Back-UPS RS 1500MS2 connected via USB to the workstation (nobara-pc). Discord alerts fire automatically on power events via webhook scripts in `/etc/apcupsd/`.
## Hardware
- **UPS Model**: Back-UPS RS 1500MS2
- **Connection**: USB (vendor ID `051d:0002`)
- **Nominal Power**: 900W
- **Nominal Battery Voltage**: 24V
- **Serial**: 0B2544L30372
## Configuration
**Config file**: `/etc/apcupsd/apcupsd.conf`
Key settings:
```
UPSNAME WS-UPS
UPSCABLE usb
UPSTYPE usb
DEVICE # blank = USB autodetect
POLLTIME 15 # poll every 15 seconds
SENSE Medium # UPS-side sensitivity (set in EEPROM)
LOTRANS 88.0 # switch to battery below this voltage
HITRANS 144.0 # switch to battery above this voltage
BATTERYLEVEL 5 # shutdown at 5% charge
MINUTES 3 # shutdown at 3 min remaining
```
## Service
```bash
sudo systemctl enable --now apcupsd
systemctl status apcupsd
```
## Useful Commands
```bash
# Full status dump
apcaccess status
# Single field (no parsing needed)
apcaccess -p LINEV
apcaccess -p LASTXFER
apcaccess -p BCHARGE
# View event log
cat /var/log/apcupsd.events
# Watch events in real-time
tail -f /var/log/apcupsd.events
```
## Discord Alerts
Five event scripts in `/etc/apcupsd/` send Discord embeds to the `#homelab-alerts` webhook:
| Script | Trigger | Embed Color |
|--------|---------|-------------|
| `onbattery` | UPS switches to battery | Red (0xFF6B6B) |
| `offbattery` | Line power restored | Green (0x57F287) |
| `changeme` | Battery needs replacement | Yellow (0xFFFF00) |
| `commfailure` | USB communication lost | Red (0xFF6B6B) |
| `commok` | USB communication restored | Green (0x57F287) |
All scripts use the same webhook URL as other monitoring scripts (jellyfin_gpu_monitor, nvidia_update_checker).
The `onbattery` alert includes line voltage, load percentage, battery charge, and time remaining — useful for diagnosing whether transfers are caused by voltage sags vs other issues.
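The embed itself is plain JSON built from `apcaccess` fields, so the payload logic can be sketched independently of the UPS. A minimal sketch, assuming `jq` is available; the function name and sample values are illustrative, not the deployed script source (only the title and color follow the table above):

```bash
# Hypothetical sketch of the onbattery payload (not the deployed script).
# On the workstation the four arguments come from apcaccess -p.
build_onbattery_payload() {
  # $1=LINEV $2=LOADPCT $3=BCHARGE $4=TIMELEFT
  jq -n \
    --arg desc "Line: $1 | Load: $2 | Charge: $3 | Left: $4" \
    '{embeds: [{title: "🔴 WS-UPS on battery",
                description: $desc,
                color: 16739179}]}'   # 16739179 = 0xFF6B6B (red)
}

payload=$(build_onbattery_payload "87.0 Volts" "49.0 Percent" "100.0 Percent" "18.2 Minutes")
echo "$payload"
# Deployed behavior: build from "$(apcaccess -p LINEV)" etc., then
# curl -s -X POST "$WEBHOOK_URL" -H "Content-Type: application/json" -d "$payload"
```

Swap in live `apcaccess -p` values and the `#homelab-alerts` webhook URL to reproduce what the real script posts.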
## Troubleshooting
### UPS not detected
```bash
# Check USB connection
lsusb | grep 051d
# If missing, try a different USB port or cable
# The UPS uses vendor ID 051d:0002
```
### No Discord alerts on power event
```bash
# Test the script manually
sudo /etc/apcupsd/onbattery WS-UPS
# Check that curl is available at /usr/bin/curl
which curl
# Verify webhook URL is still valid
curl -s -o /dev/null -w "%{http_code}" -H "Content-Type: application/json" \
-X POST "WEBHOOK_URL" -d '{"content":"test"}'
# Should return 204
```
### LASTXFER shows "Low line voltage"
This means input voltage is dropping below the LOTRANS threshold (88V). Common causes:
- Heavy appliance on the same circuit (HVAC, fridge compressor)
- Loose wiring/outlet connection
- Utility-side voltage sags
- Overloaded circuit
Correlate event timestamps from `/var/log/apcupsd.events` with appliance cycling to identify the source.
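A quick way to pull only the transfer events for that correlation (the log phrasing below is typical apcupsd wording; verify it against your own events file):

```bash
# Filter transfer events out of the apcupsd event log for timestamp
# correlation. Log phrasing and sample lines are assumed/illustrative.
filter_transfers() {
  grep -E 'Power failure|Power is back'
}

printf '%s\n' \
  '2026-04-06 14:02:11 -0500  Power failure.' \
  '2026-04-06 14:02:17 -0500  Power is back. UPS running on mains.' \
  '2026-04-06 18:00:01 -0500  apcupsd 3.14.14 startup succeeded' |
  filter_transfers
```

On the workstation: `filter_transfers < /var/log/apcupsd.events`.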
### Frequent unnecessary transfers
If sensitivity is too high, the UPS transfers on minor sags that don't affect equipment:
- Check current: `apcaccess -p SENSE`
- Lower via `apctest` EEPROM menu (requires stopping apcupsd first)
- Options: High → Medium → Low
## Initial Diagnostics (2026-04-06)
- Two different APC UPS units exhibited the same on_batt/on_line bouncing behavior
- `LASTXFER: Low line voltage` confirmed voltage sags as the cause
- Sensitivity already at Medium — transfers are from real sags below 88V
- Load at 49% (441W of 900W capacity) — not overloaded
- Next steps: correlate event timestamps with appliance activity, try different circuit, electrician inspection


@@ -1,9 +1,9 @@
---
title: "Monitoring Scripts Context"
description: "Operational context for all monitoring scripts: Jellyfin GPU health monitor, NVIDIA driver update checker, Tdarr API/file monitors, and Windows reboot detection. Includes cron schedules, Discord integration patterns, and troubleshooting."
description: "Operational context for all monitoring scripts: Proxmox backup checker, CT 302 self-health, Jellyfin GPU health monitor, NVIDIA driver update checker, Tdarr API/file monitors, and Windows reboot detection. Includes cron schedules, Discord integration patterns, and troubleshooting."
type: context
domain: monitoring
tags: [jellyfin, gpu, nvidia, tdarr, discord, cron, python, windows, scripts]
tags: [proxmox, backup, jellyfin, gpu, nvidia, tdarr, discord, cron, python, bash, windows, scripts]
---
# Monitoring Scripts - Operational Context
@@ -13,6 +13,77 @@ This directory contains active operational scripts for system monitoring, health
## Core Monitoring Scripts
### Proxmox Backup Verification
**Script**: `proxmox-backup-check.sh`
**Purpose**: Weekly check that every running VM/CT has a successful vzdump backup within 7 days. Posts a color-coded Discord embed with per-guest status.
**Key Features**:
- SSHes to Proxmox host and queries `pvesh` task history + guest lists via API
- Categorizes each guest: 🟢 green (backed up), 🟡 yellow (overdue), 🔴 red (no backup)
- Sorts output by VMID; only posts to Discord — no local side effects
- `--dry-run` mode prints the Discord payload without sending
- `--days N` overrides the default 7-day window
**Schedule**: Weekly on Monday 08:00 UTC (CT 302 cron)
```bash
0 8 * * 1 DISCORD_WEBHOOK="<url>" /root/scripts/proxmox-backup-check.sh >> /var/log/proxmox-backup-check.log 2>&1
```
**Usage**:
```bash
# Dry run (no Discord)
proxmox-backup-check.sh --dry-run
# Post to Discord
DISCORD_WEBHOOK="https://discord.com/api/webhooks/..." proxmox-backup-check.sh
# Custom window
proxmox-backup-check.sh --days 14 --discord-webhook "https://..."
```
**Dependencies**: `jq`, `curl`, SSH access to Proxmox host alias `proxmox`
**Install on CT 302**:
```bash
cp proxmox-backup-check.sh /root/scripts/
chmod +x /root/scripts/proxmox-backup-check.sh
```
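The per-guest status reduces to comparing each guest's last successful vzdump timestamp against the window cutoff. A standalone sketch of that classification with illustrative epoch timestamps (the real script derives them from `pvesh` task history):

```bash
# green: last OK backup within the window; yellow: older than the window;
# red: no successful backup found (timestamp 0). Timestamps illustrative.
now=1770000000
cutoff=$((now - 7 * 86400))
statuses=$(jq -cn --argjson cutoff "$cutoff" '
  [1769900000, 1769000000, 0] | map(
    if . >= $cutoff then "green" elif . > 0 then "yellow" else "red" end
  )')
echo "$statuses"   # ["green","yellow","red"]
```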
### CT 302 Self-Health Monitor
**Script**: `ct302-self-health.sh`
**Purpose**: Monitors disk usage on CT 302 (claude-runner) itself. Alerts to Discord when any filesystem exceeds the threshold (default 80%). Runs silently when healthy — no Discord spam on green.
**Key Features**:
- Checks all non-virtual filesystems (`df`, excludes tmpfs/devtmpfs/overlay)
- Only sends a Discord alert when a filesystem is at or above threshold
- `--always-post` flag forces a post even when healthy (useful for testing)
- `--dry-run` mode prints payload without sending
**Schedule**: Daily at 07:00 UTC (CT 302 cron)
```bash
0 7 * * * DISCORD_WEBHOOK="<url>" /root/scripts/ct302-self-health.sh >> /var/log/ct302-self-health.log 2>&1
```
**Usage**:
```bash
# Check and alert if over 80%
DISCORD_WEBHOOK="https://discord.com/api/webhooks/..." ct302-self-health.sh
# Lower threshold test
ct302-self-health.sh --threshold 50 --dry-run
# Always post (weekly status report pattern)
ct302-self-health.sh --always-post --discord-webhook "https://..."
```
**Dependencies**: `jq`, `curl`, `df`
**Install on CT 302**:
```bash
cp ct302-self-health.sh /root/scripts/
chmod +x /root/scripts/ct302-self-health.sh
```
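The alert condition itself is a plain integer comparison on the `%` column. A minimal sketch of the per-filesystem check, using an input line in the script's three-column `source pcent target` shape (helper name and sample lines are illustrative, not real CT 302 filesystems):

```bash
# Strip the % sign and compare against the threshold.
DISK_THRESHOLD=80
check_fs() {
  local pct
  pct=$(echo "$1" | awk '{print $2}' | tr -d '%')
  if [[ "$pct" -ge "$DISK_THRESHOLD" ]]; then
    echo "ALERT ${pct}%"
  else
    echo "OK ${pct}%"
  fi
}
check_fs '/dev/mapper/pve-root 84% /'   # over threshold: ALERT 84%
check_fs '/dev/sdb1 41% /mnt/data'      # healthy: OK 41%
```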
### Jellyfin GPU Health Monitor
**Script**: `jellyfin_gpu_monitor.py`
**Purpose**: Monitor Jellyfin container GPU access with Discord alerts and auto-restart capability
@@ -235,6 +306,17 @@ python3 tdarr_file_monitor.py >> /mnt/NV2/Development/claude-home/logs/tdarr-fil
0 9 * * 1 /usr/bin/python3 /home/cal/scripts/nvidia_update_checker.py --check --discord-alerts >> /home/cal/logs/nvidia-update-checker.log 2>&1
```
**Active Cron Jobs** (on CT 302 / claude-runner, root user):
```bash
# Proxmox backup verification - Weekly (Mondays at 8 AM UTC)
0 8 * * 1 DISCORD_WEBHOOK="<homelab-alerts-webhook>" /root/scripts/proxmox-backup-check.sh >> /var/log/proxmox-backup-check.log 2>&1
# CT 302 self-health disk check - Daily at 7 AM UTC (alerts only when >80%)
0 7 * * * DISCORD_WEBHOOK="<homelab-alerts-webhook>" /root/scripts/ct302-self-health.sh >> /var/log/ct302-self-health.log 2>&1
```
**Note**: Scripts must be installed manually on CT 302. Source of truth is `monitoring/scripts/` in this repo — copy to `/root/scripts/` on CT 302 to deploy.
**Manual/On-Demand**:
- `tdarr_monitor.py` - Run as needed for Tdarr health checks
- `tdarr_file_monitor.py` - Can be scheduled if automatic backup needed


@@ -0,0 +1,158 @@
#!/usr/bin/env bash
# ct302-self-health.sh — CT 302 (claude-runner) disk self-check → Discord
#
# Monitors disk usage on CT 302 itself and alerts to Discord when any
# filesystem exceeds the threshold. Closes the blind spot where the
# monitoring system cannot monitor itself via external health checks.
#
# Designed to run silently when healthy (no Discord spam on green).
# Only posts when a filesystem is at or above THRESHOLD.
#
# Usage:
# ct302-self-health.sh [--discord-webhook URL] [--threshold N] [--dry-run] [--always-post]
#
# Environment overrides:
# DISCORD_WEBHOOK Discord webhook URL (required unless --dry-run)
# DISK_THRESHOLD Disk usage % alert threshold (default: 80)
#
# Install on CT 302 (daily, 07:00 UTC):
# 0 7 * * * /root/scripts/ct302-self-health.sh >> /var/log/ct302-self-health.log 2>&1
set -uo pipefail
DISK_THRESHOLD="${DISK_THRESHOLD:-80}"
DISCORD_WEBHOOK="${DISCORD_WEBHOOK:-}"
DRY_RUN=0
ALWAYS_POST=0
while [[ $# -gt 0 ]]; do
case "$1" in
--discord-webhook)
if [[ $# -lt 2 ]]; then
echo "Error: --discord-webhook requires a value" >&2
exit 1
fi
DISCORD_WEBHOOK="$2"
shift 2
;;
--threshold)
if [[ $# -lt 2 ]]; then
echo "Error: --threshold requires a value" >&2
exit 1
fi
DISK_THRESHOLD="$2"
shift 2
;;
--dry-run)
DRY_RUN=1
shift
;;
--always-post)
ALWAYS_POST=1
shift
;;
*)
echo "Unknown option: $1" >&2
exit 1
;;
esac
done
if [[ "$DRY_RUN" -eq 0 && -z "$DISCORD_WEBHOOK" ]]; then
echo "Error: DISCORD_WEBHOOK not set. Use --discord-webhook URL or set env var." >&2
exit 1
fi
log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*"; }
# ---------------------------------------------------------------------------
# Check disk usage on all real filesystems
# ---------------------------------------------------------------------------
# df output: Filesystem Use% Mounted-on (skipping tmpfs, devtmpfs, overlay)
TRIGGERED=()
ALL_FS=()
while IFS= read -r line; do
fs=$(echo "$line" | awk '{print $1}')
pct=$(echo "$line" | awk '{print $2}' | tr -d '%')   # input is 3 columns: source pcent target
mount=$(echo "$line" | awk '{print $3}')
ALL_FS+=("${pct}% ${mount} (${fs})")
if [[ "$pct" -ge "$DISK_THRESHOLD" ]]; then
TRIGGERED+=("${pct}% used — ${mount} (${fs})")
fi
done < <(df -h --output=source,size,used,avail,pcent,target |
tail -n +2 |
awk '$1 !~ /^(tmpfs|devtmpfs|overlay|udev)/' |
awk '{print $1, $5, $6}')
HOSTNAME=$(hostname -s)
TRIGGERED_COUNT=${#TRIGGERED[@]}
log "Disk check complete: ${TRIGGERED_COUNT} filesystem(s) above ${DISK_THRESHOLD}%"
# Exit cleanly with no Discord post if everything is healthy
if [[ "$TRIGGERED_COUNT" -eq 0 && "$ALWAYS_POST" -eq 0 && "$DRY_RUN" -eq 0 ]]; then
log "All filesystems healthy — no alert needed."
exit 0
fi
# ---------------------------------------------------------------------------
# Build Discord payload
# ---------------------------------------------------------------------------
if [[ "$TRIGGERED_COUNT" -gt 0 ]]; then
EMBED_COLOR=15548997 # 0xED4245 red
TITLE="🔴 ${HOSTNAME}: Disk usage above ${DISK_THRESHOLD}%"
alert_lines=$(printf '⚠️ %s\n' "${TRIGGERED[@]}")
FIELDS=$(jq -n \
--arg name "Filesystems Over Threshold" \
--arg value "$alert_lines" \
'[{"name": $name, "value": $value, "inline": false}]')
else
EMBED_COLOR=5763719 # 0x57F287 green
TITLE="🟢 ${HOSTNAME}: All filesystems healthy"
FIELDS='[]'
fi
# Add summary of all filesystems
all_lines=$(printf '%s\n' "${ALL_FS[@]}")
FIELDS=$(echo "$FIELDS" | jq \
--arg name "All Filesystems" \
--arg value "$all_lines" \
'. + [{"name": $name, "value": $value, "inline": false}]')
FOOTER="$(date -u '+%Y-%m-%d %H:%M UTC') · CT 302 self-health · threshold: ${DISK_THRESHOLD}%"
PAYLOAD=$(jq -n \
--arg title "$TITLE" \
--argjson color "$EMBED_COLOR" \
--argjson fields "$FIELDS" \
--arg footer "$FOOTER" \
'{
"embeds": [{
"title": $title,
"color": $color,
"fields": $fields,
"footer": {"text": $footer}
}]
}')
if [[ "$DRY_RUN" -eq 1 ]]; then
log "DRY RUN — Discord payload:"
echo "$PAYLOAD" | jq .
exit 0
fi
log "Posting to Discord..."
HTTP_STATUS=$(curl -s -o /tmp/ct302-self-health-discord.out \
-w "%{http_code}" \
-X POST "$DISCORD_WEBHOOK" \
-H "Content-Type: application/json" \
-d "$PAYLOAD")
if [[ "$HTTP_STATUS" -ge 200 && "$HTTP_STATUS" -lt 300 ]]; then
log "Discord notification sent (HTTP ${HTTP_STATUS})."
else
log "Warning: Discord returned HTTP ${HTTP_STATUS}."
cat /tmp/ct302-self-health-discord.out >&2
exit 1
fi


@@ -5,7 +5,7 @@
# to collect system metrics, then generates a summary report.
#
# Usage:
# homelab-audit.sh [--output-dir DIR]
# homelab-audit.sh [--output-dir DIR] [--hosts label:ip,label:ip,...]
#
# Environment overrides:
# STUCK_PROC_CPU_WARN CPU% at which a D-state process is flagged (default: 10)
@@ -29,6 +29,8 @@ LOAD_WARN=2.0
MEM_WARN=85
ZOMBIE_WARN=1
SWAP_WARN=512
HOSTS_FILTER="" # comma-separated host list from --hosts; empty = audit all
JSON_OUTPUT=0 # set to 1 by --json
while [[ $# -gt 0 ]]; do
case "$1" in
@@ -40,6 +42,18 @@ while [[ $# -gt 0 ]]; do
REPORT_DIR="$2"
shift 2
;;
--hosts)
if [[ $# -lt 2 ]]; then
echo "Error: --hosts requires an argument (label:ip,label:ip,...)" >&2
exit 1
fi
HOSTS_FILTER="$2"
shift 2
;;
--json)
JSON_OUTPUT=1
shift
;;
*)
echo "Unknown option: $1" >&2
exit 1
@@ -50,6 +64,7 @@ done
mkdir -p "$REPORT_DIR"
SSH_FAILURES_LOG="$REPORT_DIR/ssh-failures.log"
FINDINGS_FILE="$REPORT_DIR/findings.txt"
AUDITED_HOSTS=() # populated in main; used by generate_summary for per-host counts
# ---------------------------------------------------------------------------
# Remote collector script
@@ -281,6 +296,18 @@ generate_summary() {
printf " Critical : %d\n" "$crit_count"
echo "=============================="
if [[ ${#AUDITED_HOSTS[@]} -gt 0 ]] && ((warn_count + crit_count > 0)); then
echo ""
printf " %-30s %8s %8s\n" "Host" "Warnings" "Critical"
printf " %-30s %8s %8s\n" "----" "--------" "--------"
for host in "${AUDITED_HOSTS[@]}"; do
local hw hc
hw=$(grep -c "^WARN ${host}:" "$FINDINGS_FILE" 2>/dev/null || true)
hc=$(grep -c "^CRIT ${host}:" "$FINDINGS_FILE" 2>/dev/null || true)
((hw + hc > 0)) && printf " %-30s %8d %8d\n" "$host" "$hw" "$hc"
done
fi
if ((warn_count + crit_count > 0)); then
echo ""
echo "Findings:"
@@ -293,6 +320,9 @@
grep '^SSH_FAILURE' "$SSH_FAILURES_LOG" | awk '{print " " $2 " (" $3 ")"}'
fi
echo ""
printf "Total: %d warning(s), %d critical across %d host(s)\n" \
"$warn_count" "$crit_count" "$host_count"
echo ""
echo "Reports: $REPORT_DIR"
}
@@ -383,6 +413,69 @@ check_cert_expiry() {
done
}
# ---------------------------------------------------------------------------
# JSON report — writes findings.json to $REPORT_DIR when --json is used
# ---------------------------------------------------------------------------
write_json_report() {
local host_count="$1"
local json_file="$REPORT_DIR/findings.json"
local ssh_failure_count=0
local warn_count=0
local crit_count=0
[[ -f "$SSH_FAILURES_LOG" ]] &&
ssh_failure_count=$(grep -c '^SSH_FAILURE' "$SSH_FAILURES_LOG" 2>/dev/null || true)
[[ -f "$FINDINGS_FILE" ]] &&
warn_count=$(grep -c '^WARN' "$FINDINGS_FILE" 2>/dev/null || true)
[[ -f "$FINDINGS_FILE" ]] &&
crit_count=$(grep -c '^CRIT' "$FINDINGS_FILE" 2>/dev/null || true)
python3 - "$json_file" "$host_count" "$ssh_failure_count" \
"$warn_count" "$crit_count" "$FINDINGS_FILE" <<'PYEOF'
import sys, json, datetime
json_file = sys.argv[1]
host_count = int(sys.argv[2])
ssh_failure_count = int(sys.argv[3])
warn_count = int(sys.argv[4])
crit_count = int(sys.argv[5])
findings_file = sys.argv[6]
findings = []
try:
with open(findings_file) as f:
for line in f:
line = line.strip()
if not line:
continue
parts = line.split(None, 2)
if len(parts) < 3:
continue
severity, host_colon, message = parts[0], parts[1], parts[2]
findings.append({
"severity": severity,
"host": host_colon.rstrip(":"),
"message": message,
})
except FileNotFoundError:
pass
output = {
"timestamp": datetime.datetime.utcnow().isoformat() + "Z",
"hosts_audited": host_count,
"warnings": warn_count,
"critical": crit_count,
"ssh_failures": ssh_failure_count,
"total_findings": warn_count + crit_count,
"findings": findings,
}
with open(json_file, "w") as f:
json.dump(output, f, indent=2)
print(f"JSON report: {json_file}")
PYEOF
}
# ---------------------------------------------------------------------------
# Main
# ---------------------------------------------------------------------------
@@ -390,22 +483,50 @@ main() {
echo "Starting homelab audit — $(date)"
echo "Report dir: $REPORT_DIR"
echo "STUCK_PROC_CPU_WARN threshold: ${STUCK_PROC_CPU_WARN}%"
[[ -n "$HOSTS_FILTER" ]] && echo "Host filter: $HOSTS_FILTER"
echo ""
>"$FINDINGS_FILE"
echo " Checking Proxmox backup recency..."
check_backup_recency
local host_count=0
while read -r label addr; do
echo " Auditing $label ($addr)..."
parse_and_report "$label" "$addr"
check_cert_expiry "$label" "$addr"
((host_count++)) || true
done < <(collect_inventory)
if [[ -n "$HOSTS_FILTER" ]]; then
# --hosts mode: audit specified hosts directly, skip Proxmox inventory
# Accepts comma-separated entries; each entry may be plain hostname or label:ip
local check_proxmox=0
IFS=',' read -ra filter_hosts <<<"$HOSTS_FILTER"
for entry in "${filter_hosts[@]}"; do
local label="${entry%%:*}"
[[ "$label" == "proxmox" ]] && check_proxmox=1
done
if ((check_proxmox)); then
echo " Checking Proxmox backup recency..."
check_backup_recency
fi
for entry in "${filter_hosts[@]}"; do
local label="${entry%%:*}"
local addr="${entry#*:}"
echo " Auditing $label ($addr)..."
parse_and_report "$label" "$addr"
check_cert_expiry "$label" "$addr"
AUDITED_HOSTS+=("$label")
((host_count++)) || true
done
else
echo " Checking Proxmox backup recency..."
check_backup_recency
while read -r label addr; do
echo " Auditing $label ($addr)..."
parse_and_report "$label" "$addr"
check_cert_expiry "$label" "$addr"
AUDITED_HOSTS+=("$label")
((host_count++)) || true
done < <(collect_inventory)
fi
generate_summary "$host_count"
[[ "$JSON_OUTPUT" -eq 1 ]] && write_json_report "$host_count"
}
main "$@"


@@ -0,0 +1,230 @@
#!/usr/bin/env bash
# proxmox-backup-check.sh — Weekly Proxmox backup verification → Discord
#
# SSHes to the Proxmox host and checks that every running VM/CT has a
# successful vzdump backup within the last 7 days. Posts a color-coded
# Discord summary with per-guest status.
#
# Usage:
# proxmox-backup-check.sh [--discord-webhook URL] [--days N] [--dry-run]
#
# Environment overrides:
# DISCORD_WEBHOOK Discord webhook URL (required unless --dry-run)
# PROXMOX_NODE Proxmox node name (default: proxmox)
# PROXMOX_SSH SSH alias or host for Proxmox (default: proxmox)
# WINDOW_DAYS Backup recency window in days (default: 7)
#
# Install on CT 302 (weekly, Monday 08:00 UTC):
# 0 8 * * 1 /root/scripts/proxmox-backup-check.sh >> /var/log/proxmox-backup-check.log 2>&1
set -uo pipefail
PROXMOX_NODE="${PROXMOX_NODE:-proxmox}"
PROXMOX_SSH="${PROXMOX_SSH:-proxmox}"
WINDOW_DAYS="${WINDOW_DAYS:-7}"
DISCORD_WEBHOOK="${DISCORD_WEBHOOK:-}"
DRY_RUN=0
while [[ $# -gt 0 ]]; do
case "$1" in
--discord-webhook)
if [[ $# -lt 2 ]]; then
echo "Error: --discord-webhook requires a value" >&2
exit 1
fi
DISCORD_WEBHOOK="$2"
shift 2
;;
--days)
if [[ $# -lt 2 ]]; then
echo "Error: --days requires a value" >&2
exit 1
fi
WINDOW_DAYS="$2"
shift 2
;;
--dry-run)
DRY_RUN=1
shift
;;
*)
echo "Unknown option: $1" >&2
exit 1
;;
esac
done
if [[ "$DRY_RUN" -eq 0 && -z "$DISCORD_WEBHOOK" ]]; then
echo "Error: DISCORD_WEBHOOK not set. Use --discord-webhook URL or set env var." >&2
exit 1
fi
if ! command -v jq &>/dev/null; then
echo "Error: jq is required but not installed." >&2
exit 1
fi
SSH_OPTS="-o StrictHostKeyChecking=accept-new -o ConnectTimeout=10 -o BatchMode=yes"
CUTOFF=$(date -d "-${WINDOW_DAYS} days" +%s)
NOW=$(date +%s)
log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*"; }
# ---------------------------------------------------------------------------
# Fetch data from Proxmox
# ---------------------------------------------------------------------------
log "Fetching VM and CT list from Proxmox node '${PROXMOX_NODE}'..."
VMS_JSON=$(ssh $SSH_OPTS "$PROXMOX_SSH" \
"pvesh get /nodes/${PROXMOX_NODE}/qemu --output-format json 2>/dev/null" || echo "[]")
CTS_JSON=$(ssh $SSH_OPTS "$PROXMOX_SSH" \
"pvesh get /nodes/${PROXMOX_NODE}/lxc --output-format json 2>/dev/null" || echo "[]")
log "Fetching recent vzdump task history (limit 200)..."
TASKS_JSON=$(ssh $SSH_OPTS "$PROXMOX_SSH" \
"pvesh get /nodes/${PROXMOX_NODE}/tasks --typefilter vzdump --limit 200 --output-format json 2>/dev/null" || echo "[]")
# ---------------------------------------------------------------------------
# Build per-guest backup status
# ---------------------------------------------------------------------------
# Merge VMs and CTs into one list: [{vmid, name, type}]
GUESTS_JSON=$(jq -n \
--argjson vms "$VMS_JSON" \
--argjson cts "$CTS_JSON" '
($vms | map(select(.status == "running") | {vmid: (.vmid | tostring), name, type: "VM"})) +
($cts | map(select(.status == "running") | {vmid: (.vmid | tostring), name, type: "CT"}))
')
GUEST_COUNT=$(echo "$GUESTS_JSON" | jq 'length')
log "Found ${GUEST_COUNT} running guests."
# For each guest, find the most recent successful (status == "OK") vzdump task
RESULTS=$(jq -n \
--argjson guests "$GUESTS_JSON" \
--argjson tasks "$TASKS_JSON" \
--argjson cutoff "$CUTOFF" \
--argjson now "$NOW" \
--argjson window "$WINDOW_DAYS" '
$guests | map(
. as $g |
($tasks | map(
select(
(.vmid | tostring) == $g.vmid
and .status == "OK"
) | .starttime
) | max // 0) as $last_ts |
{
vmid: $g.vmid,
name: $g.name,
type: $g.type,
last_backup_ts: $last_ts,
age_days: (if $last_ts > 0 then (($now - $last_ts) / 86400 | floor) else -1 end),
status: (
if $last_ts >= $cutoff then "green"
elif $last_ts > 0 then "yellow"
else "red"
end
)
}
) | sort_by(.vmid | tonumber)
')
GREEN_GUESTS=$(echo "$RESULTS" | jq '[.[] | select(.status == "green")]')
YELLOW_GUESTS=$(echo "$RESULTS" | jq '[.[] | select(.status == "yellow")]')
RED_GUESTS=$(echo "$RESULTS" | jq '[.[] | select(.status == "red")]')
GREEN_COUNT=$(echo "$GREEN_GUESTS" | jq 'length')
YELLOW_COUNT=$(echo "$YELLOW_GUESTS" | jq 'length')
RED_COUNT=$(echo "$RED_GUESTS" | jq 'length')
log "Results: ${GREEN_COUNT} green, ${YELLOW_COUNT} yellow, ${RED_COUNT} red"
# ---------------------------------------------------------------------------
# Build Discord payload
# ---------------------------------------------------------------------------
if [[ "$RED_COUNT" -gt 0 ]]; then
EMBED_COLOR=15548997 # 0xED4245 red
STATUS_LINE="🔴 Backup issues detected — action required"
elif [[ "$YELLOW_COUNT" -gt 0 ]]; then
EMBED_COLOR=16705372 # 0xFF851C orange
STATUS_LINE="🟡 Some backups are overdue (>${WINDOW_DAYS}d)"
else
EMBED_COLOR=5763719 # 0x57F287 green
STATUS_LINE="🟢 All ${GUEST_COUNT} guests backed up within ${WINDOW_DAYS} days"
fi
# Format guest lines: "VM 116 (plex) — 2d ago" or "CT 302 (claude-runner) — NO BACKUPS"
format_guest() {
local prefix="$1" guests="$2"
echo "$guests" | jq -r '.[] | "\(.type) \(.vmid) (\(.name))"' |
while IFS= read -r line; do echo "${prefix} ${line}"; done
}
format_guest_with_age() {
local prefix="$1" guests="$2"
echo "$guests" | jq -r '.[] | "\(.type) \(.vmid) (\(.name)) — \(.age_days)d ago"' |
while IFS= read -r line; do echo "${prefix} ${line}"; done
}
# Build fields array
fields='[]'
if [[ "$GREEN_COUNT" -gt 0 ]]; then
green_lines=$(format_guest_with_age "✅" "$GREEN_GUESTS")
fields=$(echo "$fields" | jq \
--arg name "🟢 Healthy (${GREEN_COUNT})" \
--arg value "$green_lines" \
'. + [{"name": $name, "value": $value, "inline": false}]')
fi
if [[ "$YELLOW_COUNT" -gt 0 ]]; then
yellow_lines=$(format_guest_with_age "⚠️" "$YELLOW_GUESTS")
fields=$(echo "$fields" | jq \
--arg name "🟡 Overdue — last backup >${WINDOW_DAYS}d ago (${YELLOW_COUNT})" \
--arg value "$yellow_lines" \
'. + [{"name": $name, "value": $value, "inline": false}]')
fi
if [[ "$RED_COUNT" -gt 0 ]]; then
red_lines=$(format_guest "❌" "$RED_GUESTS")
fields=$(echo "$fields" | jq \
--arg name "🔴 No Successful Backups Found (${RED_COUNT})" \
--arg value "$red_lines" \
'. + [{"name": $name, "value": $value, "inline": false}]')
fi
FOOTER="$(date -u '+%Y-%m-%d %H:%M UTC') · ${GUEST_COUNT} guests · window: ${WINDOW_DAYS}d"
PAYLOAD=$(jq -n \
--arg title "Proxmox Backup Check — ${STATUS_LINE}" \
--argjson color "$EMBED_COLOR" \
--argjson fields "$fields" \
--arg footer "$FOOTER" \
'{
"embeds": [{
"title": $title,
"color": $color,
"fields": $fields,
"footer": {"text": $footer}
}]
}')
if [[ "$DRY_RUN" -eq 1 ]]; then
log "DRY RUN — Discord payload:"
echo "$PAYLOAD" | jq .
exit 0
fi
log "Posting to Discord..."
HTTP_STATUS=$(curl -s -o /tmp/proxmox-backup-check-discord.out \
-w "%{http_code}" \
-X POST "$DISCORD_WEBHOOK" \
-H "Content-Type: application/json" \
-d "$PAYLOAD")
if [[ "$HTTP_STATUS" -ge 200 && "$HTTP_STATUS" -lt 300 ]]; then
log "Discord notification sent (HTTP ${HTTP_STATUS})."
else
log "Warning: Discord returned HTTP ${HTTP_STATUS}."
cat /tmp/proxmox-backup-check-discord.out >&2
exit 1
fi


@@ -93,6 +93,34 @@ else
fail "disk_usage" "expected 'N /path', got: '$result'"
fi
# --- --hosts flag parsing ---
echo ""
echo "=== --hosts argument parsing tests ==="
# Single host
input="vm-115:10.10.0.88"
IFS=',' read -ra entries <<<"$input"
label="${entries[0]%%:*}"
addr="${entries[0]#*:}"
if [[ "$label" == "vm-115" && "$addr" == "10.10.0.88" ]]; then
pass "--hosts single entry parsed: $label $addr"
else
fail "--hosts single" "expected 'vm-115 10.10.0.88', got: '$label $addr'"
fi
# Multiple hosts
input="vm-115:10.10.0.88,lxc-225:10.10.0.225"
IFS=',' read -ra entries <<<"$input"
label1="${entries[0]%%:*}"
addr1="${entries[0]#*:}"
label2="${entries[1]%%:*}"
addr2="${entries[1]#*:}"
if [[ "$label1" == "vm-115" && "$addr1" == "10.10.0.88" && "$label2" == "lxc-225" && "$addr2" == "10.10.0.225" ]]; then
pass "--hosts multi entry parsed: $label1 $addr1, $label2 $addr2"
else
fail "--hosts multi" "unexpected parse result"
fi
echo ""
echo "=== Results: $PASS passed, $FAIL failed ==="
((FAIL == 0))


@@ -92,6 +92,42 @@ CT 302 does **not** have an SSH key registered with Gitea, so SSH git remotes wo
3. Commit to Gitea, pull on CT 302
4. Add Uptime Kuma monitors if desired
## Health Check Thresholds
Thresholds are evaluated in `health_check.py`. All load thresholds use **per-core** metrics
to avoid false positives from LXC containers (which see the Proxmox host's aggregate load).
### Load Average
| Metric | Value | Rationale |
|--------|-------|-----------|
| `LOAD_WARN_PER_CORE` | `0.7` | Elevated — investigate if sustained |
| `LOAD_CRIT_PER_CORE` | `1.0` | Saturated — CPU is a bottleneck |
| Sample window | 5-minute | Filters transient spikes (not 1-minute) |
**Formula**: `load_per_core = load_5m / nproc`
**Why per-core?** Proxmox LXC containers see the host's aggregate load average via the
shared kernel. A 32-core Proxmox host at load 9 is at 0.28/core (healthy), but a naive
absolute threshold of 2.0 would still fire inside a 4-core LXC, which sees that same
load of 9. Using `load_5m / nproc`, where `nproc` returns the host's visible core count,
gives the correct ratio.
**Validation examples**:
- Proxmox host: load 9 / 32 cores = 0.28/core → no alert ✓
- VM 116 at 0.75/core → warning ✓ (above 0.7 threshold)
- VM at 1.1/core → critical ✓
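The classification can be sketched as a small awk helper (hypothetical function name; the production check lives in `health_check.py`):

```bash
# Classify a 5-minute load average against the per-core thresholds above.
classify_load() {
  # $1 = load_5m, $2 = core count (nproc)
  awk -v load="$1" -v cores="$2" 'BEGIN {
    pc = load / cores
    if      (pc >= 1.0) print "CRIT"
    else if (pc >= 0.7) print "WARN"
    else                print "OK"
  }'
}
classify_load 9 32    # Proxmox host: 0.28/core, prints OK
classify_load 3 4     # 0.75/core, prints WARN
classify_load 4.4 4   # 1.1/core, prints CRIT
```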
### Other Thresholds
| Check | Threshold | Notes |
|-------|-----------|-------|
| Zombie processes | 5 | Single zombies are transient noise; alert only if ≥ 5 |
| Swap usage | 30% of total swap | Percentage-based to handle varied swap sizes across hosts |
| Disk warning | 85% | |
| Disk critical | 95% | |
| Memory | 90% | |
| Uptime alert | Non-urgent Discord post | Not a page-level alert |
## Related
- [monitoring/CONTEXT.md](../CONTEXT.md) — Overall monitoring architecture


@@ -47,12 +47,13 @@ home_network:
services: ["media", "transcoding"]
description: "Tdarr media transcoding"
vpn_docker:
hostname: "10.10.0.121"
port: 22
user: "cal"
services: ["vpn", "docker"]
description: "VPN and Docker services"
# DECOMMISSIONED: vpn_docker (10.10.0.121) - VM 105 destroyed 2026-04
# vpn_docker:
# hostname: "10.10.0.121"
# port: 22
# user: "cal"
# services: ["vpn", "docker"]
# description: "VPN and Docker services"
remote_servers:
akamai_nano:


@@ -23,7 +23,7 @@ servers:
pihole: 10.10.0.16 # Pi-hole DNS and ad blocking
sba_pd_bots: 10.10.0.88 # SBa and PD bot services
tdarr: 10.10.0.43 # Media transcoding
vpn_docker: 10.10.0.121 # VPN and Docker services
# vpn_docker: 10.10.0.121 # DECOMMISSIONED — VM 105 destroyed, migrated to arr-stack LXC 221
```
### Cloud Servers
@@ -175,11 +175,12 @@ Host tdarr media
Port 22
IdentityFile ~/.ssh/homelab_rsa
Host docker-vpn
HostName 10.10.0.121
User cal
Port 22
IdentityFile ~/.ssh/homelab_rsa
# DECOMMISSIONED: docker-vpn (10.10.0.121) - VM 105 destroyed, migrated to arr-stack LXC 221
# Host docker-vpn
# HostName 10.10.0.121
# User cal
# Port 22
# IdentityFile ~/.ssh/homelab_rsa
# Remote Cloud Servers
Host akamai-nano akamai


@@ -0,0 +1,95 @@
---
title: "Autonomous Nightly Run — 2026-04-10 (run 2)"
description: "Second autonomous pipeline run of the day: 4 PRs created (1 APPROVED, 3 REQUEST_CHANGES), 11 items queued to pd-plan, 0 rejections"
type: context
domain: paper-dynasty
tags: [autonomous-pipeline, nightly-run]
---
## Run Metadata
- Date: 2026-04-10 (second run of the day; see autonomous-nightly-2026-04-10.md for run 1)
- Duration: ~25 minutes wall clock
- Slots before: 0/10 S, 0/5 M (no prior autonomous PRs open from run 1)
- Slots after: 4/10 S, 0/5 M (4 S items in_progress pending merge)
## Findings
- Analyst produced 5 findings
- Growth-po produced 10 findings
- Dedup filtered: 0 duplicates, 0 partial overlaps (haiku call skipped — both comparison lists were empty, making all 15 findings trivially novel)
## PO Decisions
| Finding ID | PO | Decision | Size | Notes |
|---|---|---|---|---|
| analyst-2026-04-10-001 | database-po | approved | M | HTTPException-200 sweep — consumer audit required |
| analyst-2026-04-10-002 | database-po | reshaped | S | Drop premature empty-table 404s; do NOT materialize large querysets |
| analyst-2026-04-10-003 | discord-po | approved | S | Bare except narrowing (high severity) |
| analyst-2026-04-10-004 | database-po | approved | M | Packs beachhead tests — sequence after 001/002 |
| analyst-2026-04-10-005 | (autonomous) | auto-approved | S | Structured rejection parser |
| growth-sweep-2026-04-10-001 | discord-po | reshaped | M | Command logging — split into db endpoint + bot middleware |
| growth-sweep-2026-04-10-002 | database-po | approved | S | Card of the week endpoint |
| growth-sweep-2026-04-10-003 | discord-po | approved | S | Gauntlet results recap |
| growth-sweep-2026-04-10-004 | discord-po | approved | S | /compare command |
| growth-sweep-2026-04-10-005 | discord-po | approved | M | /profile command — needs aggregate endpoint |
| growth-sweep-2026-04-10-006 | discord-po | approved | S | Rarity celebration embeds — use canonical rarity names |
| growth-sweep-2026-04-10-007 | discord-po | approved | S | Gauntlet schedule + reminder |
| growth-sweep-2026-04-10-008 | discord-po | approved | M | Starter pack grant — idempotent, onboarding critical |
| growth-sweep-2026-04-10-009 | discord-po | approved | M | /pack history with pack_log table |
| growth-sweep-2026-04-10-010 | database-po | reshaped | M | Webhook infra first, cardset hook as consumer |
## PRs Created
| PR | Repo | Title | Tests | Review |
|---|---|---|---|---|
| #163 | discord-app | fix(gameplay): replace bare except with NoResultFound | pre-existing collection failures (testcontainers missing locally) | **REQUEST_CHANGES** — cache_player uses session.get which returns None, not raises; new except NoResultFound is unreachable and caller crashes with AttributeError |
| #164 | discord-app | feat(gauntlet): auto-post results recap embed | PASS (14 new tests) | **REQUEST_CHANGES** — `loss_max or 99` treats loss_max=0 as falsy, causing the perfect-run bonus tier to show ⬜ instead of ❌ on a 10-1 finish |
| #212 | database | feat(api): card of the week featured endpoint | PASS (6 new tests) | **APPROVED** — joins, AI exclusion, tiebreak, 404 handling all correct. Merge via `pd-pr merge --no-approve` |
| #165 | discord-app | feat(cogs): /compare slash command | PASS (30 new tests) | **REQUEST_CHANGES** — `_is_pitcher` omits CP (Closing Pitcher), silently misclassifies closers as batters |
## Mix Ratio
- Recent history: insufficient data (first full pipeline run after the 2-PR morning run); skipped the bash ratio check to conserve budget
- Bias applied this run: none (interleaved stability/feature manually)
- Dispatched mix: 1 stability (analyst-003) + 3 feature (growth-002/003/004). 1:3 is feature-heavy; balance the next run toward stability if this trend continues
## Wishlist Additions
None. All Large items were scoped as M or smaller by POs — nothing escalated to the L wishlist this run.
## Queued to pd-plan (waiting for slot)
Added as `status=active`, `slot=autonomous`:
- #20: Sweep HTTPException(status_code=200) in routers (M, database)
- #21: Remove double-count and premature empty-table 404s (S, database)
- #22: Beachhead integration tests for packs router (M, database)
- #23: Structured rejection parser for autonomous pipeline (S, autonomous)
- #24: Command usage logging — bot middleware + db endpoint (M, multi-repo)
- #25: Player profile command /profile (M, multi-repo)
- #26: Rarity celebration embeds for pack pulls (S, discord-app)
- #27: Gauntlet schedule + reminder task (S, discord-app)
- #28: Starter pack grant for new players (M, multi-repo)
- #29: Pack opening history command /pack history (M, multi-repo)
- #30: Outbound webhook dispatcher + cardset publish hook (M, database)
Shipped as in_progress linked to PRs:
- #31 → discord-app#163
- #32 → discord-app#164
- #33 → database#212
- #34 → discord-app#165
## Rejections
None. All 15 findings passed PO review (5 reshaped, 10 approved as-is).
## Self-Improvement Notes
1. **pr-reviewer caught real bugs in 3 of the 4 PRs.** This is exactly the value the review gate is supposed to provide. Notably, tests passed on all three REQUEST_CHANGES PRs — the bugs lived in code paths the authors' own tests didn't exercise:
- PR #163: author didn't test a session.get cache-miss; the narrowed exception class doesn't actually match the real "not found" signal in that function
- PR #164: test asserted absence of ✅ but not presence of ❌, missing the falsy-zero substitution bug
- PR #165: test suite didn't include a CP (closer) case, so the position-gate gap was invisible
Engineer prompts should explicitly require adversarial tests that exercise the exact code path the change modifies, including zero/empty/None boundary values.
2. **Worktree contamination on PR #165.** The /compare PR diff included `gauntlets.py` and `tests/test_gauntlet_recap.py` changes from PR #164, plus a `gameplay_queries.py` formatting touch from PR #163. Parallel worktrees branching from the same mainline apparently picked up each other's state. Investigate whether `isolation: "worktree"` in the Agent tool produces a fully isolated checkout or whether engineers need to explicitly branch from `origin/main`. Git worktrees share one backing `.git` object store by design (each gets its own index and HEAD), so if contamination persists, sequential dispatch may be safer for tighter commit isolation.
3. **Budget headroom tight at scale.** Dispatched only 4 of 15 approved items due to budget caution. 4 engineers + 4 reviewers consumed ~$10 (~$1.20/agent). At this rate, filling all 15 slots would require a ~$30 budget ceiling. Options: (a) use Haiku for engineers on mechanical changes like the HTTPException sweep, (b) batch multiple small fixes into one engineer invocation when they touch the same file, (c) cache common context via a prewarm step.
4. **Rejection parser finding is legit.** analyst-005's observation about rejection markdown blobs in dedup input is correct — when the rejection list grows, raw markdown will poison semantic matching quality. Auto-approved to the queue (#23). Self-improving the pipeline itself is exactly the kind of work the `autonomous` repo scope was added for.
5. **Empty dedup lists mean haiku call was dead weight.** Implement a preflight short-circuit: if `open_autonomous_prs` AND `recent_rejections` are both empty, skip the dedup haiku call entirely. Saves ~$0.05 and a few seconds per clean-slate run.
6. **Database-po reshape was substantive.** Both reshape decisions from database-po (analyst-002, growth-010) were correct and saved bad PRs. The original analyst recommendation for analyst-002 (materialize large querysets) would have regressed performance; the PO catch saved a regression. Growth-010's reshape correctly identified that the real cost of 2.6a is the webhook dispatcher plumbing, not the hook site. Keep POs in the loop for all findings — the cost is justified.
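Note 5's short-circuit is a one-line guard. Sketched here with assumed inputs — the variable names are illustrative, not the orchestrator's actual step outputs:

```python
def should_run_dedup(open_autonomous_prs, recent_rejections):
    """Skip the haiku dedup call when there is nothing to deduplicate against.

    Both lists come from the bash preflight (PR inventory + rejection query).
    Empty + empty means every finding is novel by construction, so the
    semantic-matching call would be pure overhead.
    """
    return bool(open_autonomous_prs) or bool(recent_rejections)
```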


@ -0,0 +1,95 @@
---
title: "Autonomous Nightly Run — 2026-04-10"
description: "First autonomous nightly run: 2 PRs shipped, 7 items queued, 0 rejections. Budget-constrained dispatch."
type: context
domain: paper-dynasty
tags: [autonomous-pipeline, nightly-run]
---
## Run Metadata
- Date: 2026-04-10
- Slots before: 10/10 S, 5/5 M (no active autonomous work)
- Slots after: 8/10 S, 5/5 M (2 S slots now in-flight via PRs)
- Open autonomous PRs before run: 0
- Recent rejections: 0
- Budget constraint: run hit the $5 USD ceiling early due to broad analyst sweep; dispatched 2 engineers instead of full slot fill.
## Findings
- Analyst produced 8 findings across database, discord-app, and autonomous pipeline
- Growth-po produced 5 findings (all discord-app, all S-sized, all Phase 2 roadmap items)
- Dedup haiku: **skipped** (0 open PRs + 0 rejections = no possible duplicates; all findings novel by construction)
## PO Decisions
### Database-po (4 findings)
| Finding ID | Decision | Size | Notes |
|---|---|---|---|
| analyst-2026-04-10-002 | approved | S | HTTPException(200) sweep across ~10 routers |
| analyst-2026-04-10-004 | approved | S | N+1 Paperdex fix; add query-count regression test |
| analyst-2026-04-10-006 | reshaped | M | Split into 3 S tickets, start with pack-opening tests |
| analyst-2026-04-10-008 | approved | S | Remove unfiltered pre-count in GET /packs **→ shipped** |
### Discord-po (8 findings)
| Finding ID | Decision | Size | Notes |
|---|---|---|---|
| analyst-2026-04-10-001 | approved | S | Delete dead gameplay_legacy.py **→ shipped** |
| analyst-2026-04-10-003 | approved | S | Economy tree.on_error override (play-lock bug) — **high priority** |
| analyst-2026-04-10-005 | reshaped | M | Two-phase cutover for economy_new/packs.py migration |
| growth-sweep-2026-04-10-001 | approved | S | Rarity celebration embeds — use canonical rarity vocab |
| growth-sweep-2026-04-10-002 | approved | S | /compare command — ephemeral by default, LHP/RHP split |
| growth-sweep-2026-04-10-003 | approved | S | Gauntlet results recap embed |
| growth-sweep-2026-04-10-004 | reshaped | M | Command usage telemetry — cross-repo, needs privacy review |
| growth-sweep-2026-04-10-005 | reshaped | S+M | Split: /gauntlet schedule (S) first, reminder scheduler (M) after scheduler approach specced |
### Self-improvement (auto-approved, no PO gate)
| Finding ID | Decision | Size | Notes |
|---|---|---|---|
| analyst-2026-04-10-007 | approved | S | Split run-nightly.sh stdout/stderr, write last-run-result.json, voice-notify on failure |
## PRs Created
- **discord-app#162** — `chore(cogs): remove dead gameplay_legacy cog (4,723 lines, zero references)` — tests PASS (no new failures; 2 pre-existing SQLite path issues unchanged), labels applied, **pr-reviewer dispatch skipped (budget)** — https://git.manticorum.com/cal/paper-dynasty-discord/pulls/162
- **database#211** — `fix(packs): remove unfiltered pre-count in GET /packs (3 round-trips → 2)` — tests PASS (266 passed, 13 pre-existing failures unchanged), consumer check clean (no 404 handlers in discord-app), labels applied, **pr-reviewer dispatch skipped (budget)** — https://git.manticorum.com/cal/paper-dynasty-database/pulls/211
- **Post-run diagnostic:** Pyright flagged 4 `Pack.id` attribute access errors after ruff reformatted the file. These are Peewee ORM false positives (`id` is added dynamically by Peewee's Model metaclass) and are pre-existing elsewhere in the codebase. Not a regression from this change.
## Mix Ratio
- No prior digests — this is the first autonomous nightly run. Default 1:1 interleave applied.
- This run shipped 2 stability items and 0 features. Next run should bias toward feature dispatches if budget permits.
## Wishlist Additions
- None. All approved items are S or M and could fit within a normal slot budget — no L-sized items surfaced in this sweep.
## Queued for Next Run (approved but not dispatched due to budget)
The following items are **approved and ready to ship** but were not dispatched this run. They should be picked up first thing next run:
**High priority (stability, real user impact):**
1. `analyst-2026-04-10-003` (S) — Economy cog overwrites global tree.on_error, bypassing play-lock release. **Players are getting stuck due to this bug.** Should be the first item dispatched next run.
2. `analyst-2026-04-10-002` (S) — HTTPException(200) sweep across ~10 DB routers.
3. `analyst-2026-04-10-004` (S) — N+1 Paperdex fix in players endpoints.
**Self-improvement:**
4. `analyst-2026-04-10-007` (S) — run-nightly.sh stdout/stderr split + last-run-result.json. This is a *prerequisite* for reliable future runs; should be prioritized.
**Features (growth):**
5. `growth-sweep-2026-04-10-001` (S) — Rarity celebration embeds.
6. `growth-sweep-2026-04-10-003` (S) — Gauntlet results recap embed.
7. `growth-sweep-2026-04-10-002` (S) — /compare command.
**Reshaped (needs spec work before dispatch):**
- `analyst-2026-04-10-006` (M) — first of 3 split tickets: pack-opening happy path + insufficient funds + duplicate handling.
- `analyst-2026-04-10-005` (M) — Phase 1 spec of economy.py vs economy_new/packs.py drift.
- `growth-sweep-2026-04-10-004` (M) — Cross-repo telemetry; needs privacy posture confirmation.
- `growth-sweep-2026-04-10-005` Issue A (S) — /gauntlet schedule command (pure read).
## Rejections
- None this run.
## Self-Improvement Notes
**The pipeline hit its $5 budget ceiling after dispatching analyst + growth-po + 2 POs + 2 engineers.** The spend breakdown was top-heavy: the analyst agent alone consumed roughly half the budget on a 411-second, 104-tool-use deep audit. Observations for future runs:
1. **Analyst cap**: Consider passing a stricter cap (e.g., "limit to top 5 findings, max 30 tool uses") to the analyst to keep its spend predictable.
2. **Dedup skip was correct**: With 0 open PRs and 0 rejections, the dedup haiku call would have been pure overhead. Encoding this as an orchestrator shortcut (skip dedup when both inputs are empty) would save ~$0.10 per first-run scenario.
3. **pr-reviewer was skipped**: Engineer PRs #162 and #211 did not receive an automated review pass. Cal should manually review these before merge. Future runs should reserve ~$0.30 per PR for pr-reviewer.
4. **pd-plan CLI skipped**: Approved-but-queued items are documented in this digest only, not in the pd-plan database. Next run's preflight should parse this digest's "Queued for Next Run" section and dispatch those items first before generating new findings.
5. **Budget-aware slot filling**: Orchestrator should compute a rough budget forecast (analyst ~$2, each PO ~$0.30, each engineer ~$0.60, each pr-reviewer ~$0.30) before dispatching engineers, and cap engineer count at `(remaining_budget - digest_reserve) / (engineer_cost + reviewer_cost)`.
6. **The `analyst-2026-04-10-007` self-improvement item directly addresses observability gaps that made this digest harder to write** — prioritize it next run.
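Note 5's cap formula, sketched with the per-agent cost estimates given there. The $0.50 digest reserve is an assumed value, and integer cents avoid float-floor surprises:

```python
def max_engineers(remaining_budget, digest_reserve=0.50,
                  engineer_cost=0.60, reviewer_cost=0.30):
    """Cap engineer dispatches at (remaining - reserve) / (engineer + reviewer).

    Costs are the rough per-agent estimates from note 5; digest_reserve is
    an assumption. Working in cents keeps the floor division exact.
    """
    spendable = round((remaining_budget - digest_reserve) * 100)
    per_item = round((engineer_cost + reviewer_cost) * 100)
    return max(0, spendable // per_item)
```

With a $5 remaining budget this allows 5 engineer+reviewer pairs; below the reserve it dispatches none.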


@ -0,0 +1,145 @@
---
title: "Autonomous Improvement Pipeline — Build Session 2026-04-09/10"
description: "Single-session design + implementation + first smoke test of the Paper Dynasty autonomous improvement pipeline. 2 PRs shipped, system ready to run nightly pending one more test."
type: context
domain: paper-dynasty
tags: [autonomous-pipeline, session-summary, paper-dynasty, architecture]
---
## Summary
In a single session spanning 2026-04-09 evening through 2026-04-10 early morning, Cal and Claude designed, specced, planned, implemented, merged, and ran the first smoke test of a nightly autonomous improvement pipeline for the Paper Dynasty ecosystem. The goal: a system where Cal wakes up to a Monday-morning queue of "here's what Claude did for you" PRs he can review and merge, keeping momentum even when he's unavailable.
The system ships. It produced 2 real, mergeable PRs on its first run before hitting a budget ceiling. Post-run fixes are in. The systemd timer is installed but not enabled pending one more validation run.
## The arc of the session
### Phase 1 — Brainstorming (spec)
Cal arrived with a two-part idea: (1) introspection on the codebase to recommend updates, (2) recommendations for workflow/tooling optimization. Through ~15 clarifying exchanges, we landed on this shape:
- **Nightly scheduled** (not on-demand) — moves forward despite Cal's schedule
- **Autonomous PR dispatch** (not just reports) — Monday morning review queue
- **WIP slot limits** to prevent overwhelm: 10 S, 5 M, no autonomous L; L items go to a wishlist
- **1:1 stability/feature bias** — mix both types of work
- **Three repos in scope:** database, discord-app, card-creation (card-creation has its own autonomous dynamic now)
- **Separation of concerns:**
- New **analyst agent** does code audits with fresh eyes (no ownership bias)
- **growth-po** does product/roadmap sweeps in a new "sweep mode"
- **Domain POs** (database-po, discord-po, cards-po) gate findings with go/no-go decisions
- **Engineer agents** build approved S/M work in isolated worktrees
- **pr-reviewer** gates PRs before Cal sees them
- **Rolling 30-day rejection log** so the pipeline doesn't re-suggest rejected ideas
- **Hybrid tracking:** pd-plan for slot counts + wishlist, KB for digests + rejection log
- **Transparency as a core value** — every decision, rejection, and action documented so both humans and future agents have full context
### Phase 2 — Plan
20-task implementation plan written and self-reviewed against the spec. Caught one gap during self-review: the mix ratio (§9) wasn't explicitly implemented anywhere. Added a step 6b to the orchestrator prompt. Another round of refinements during plan review:
1. Wishlist → Run Digest connection (L items should appear in nightly digest)
2. Rolling 30-day rejection context fed to analyst + growth-po to avoid re-discovery
3. Pure-bash preflight for pure data lookups (slot check, git pull, PR inventory, rejection query) — no LLM spin-up on "no slots" nights
4. Dedup as a haiku call (not a script) — semantic matching catches rewording
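The preflight itself is bash (refinement 3); its go/no-go decision is sketched in Python here for clarity, using the WIP caps from the spec. The production check lives in `autonomous/lib/check_slots.py` and `preflight.sh`, not this sketch:

```python
def preflight_ok(s_used, m_used, s_cap=10, m_cap=5):
    """No-LLM early exit: proceed only if at least one S or M slot is free.

    Caps mirror the spec's WIP limits (10 S, 5 M). On a "no slots" night
    this returns False and the wrapper exits before any agent spins up.
    """
    return s_used < s_cap or m_used < m_cap
```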
### Phase 3 — Implementation (subagent-driven)
Created worktree `.worktrees/autonomous-pipeline` on branch `feat/autonomous-pipeline`. Executed plan via subagent-driven-development skill:
- **Task 1** (inline): scaffolded `autonomous/` directory with README
- **Batch A** (sonnet subagent, Tasks 2-5): extended `pd-plan` CLI with `slot`/`wishlist` schema columns, `slots`/`wishlist` subcommands, `--slot`/`--wishlist` flags on `add`/`update`, new summary section. 8 pytest tests, all passing.
- **Task 6** (sonnet subagent): `autonomous/lib/check_slots.py` with 3 pytest tests
- **Batch B** (sonnet subagent, Tasks 7-9): bash scripts `inventory_prs.sh`, `query_rejections.sh`, `preflight.sh`. Notable: switched from `tea pulls list` to `tea api` because the former returns labels as a flat string (not objects).
- **Batch C** (sonnet subagent, Tasks 10-14): `.claude/agents/analyst.md`, sweep-mode append to `growth-po.md`, `dedup-haiku.md`, `orchestrator.md` (284 lines), `run-nightly.sh` wrapper
- **Task 18** (inline): preflight skip smoke test — added 15 dummy initiatives, verified `preflight.sh` exits 1, cleaned up
11 commits on the feature branch. Fast-forward merged to main. Worktree force-removed. Branch deleted. Pushed to origin.
One snag worth noting: the first subagent dispatch hit a wall of permission prompts Cal had to click through. Existing memory already had the rule "code-writing subagents MUST use mode: acceptEdits" — I'd just failed to apply it. Fixed for all subsequent dispatches.
### Phase 4 — Integration (Gitea + systemd)
- **Gitea labels** created via pd-ops agent in all 3 sub-project repos: `autonomous`, `size:S`, `size:M`, `type:stability`, `type:feature` (colors: `#6366f1`, `#10b981`, `#f59e0b`, `#0891b2`, `#ec4899`). Umbrella repo got its own set later when the observability ticket was filed.
- **Scheduled task** at `~/.config/claude-scheduled/tasks/autonomous-nightly/` — settings.json (haiku outer, $1 budget, 3600s timeout), prompt.md (just runs the wrapper), mcp.json (empty; the inner claude inherits Cal's global MCP config including gitea-mcp)
- **Systemd timer** at `~/.config/systemd/user/claude-scheduled@autonomous-nightly.timer` — nightly 02:00 with 15-min random delay, Persistent=true. Registered but NOT enabled.
### Phase 5 — First smoke test
Kicked off `autonomous/run-nightly.sh` at 02:40:07 local. Ran 15 minutes. Terminated at 02:55:47 by the $5 budget ceiling.
**Despite the budget hit, the pipeline actually worked:**
- Preflight ran cleanly (slots 10S/5M free, 0 open PRs, 0 rejections)
- Analyst produced 8 findings across database, discord-app, autonomous (self-improvement)
- Growth-po produced 5 findings (all discord Phase 2 roadmap items, all S-sized)
- Dedup correctly skipped (empty inputs = no possible dupes)
- POs made real decisions: many approved, several thoughtfully reshaped
- 2 PRs shipped before budget ran out, both correctly labeled and mergeable
**PRs shipped:**
- **discord-app#162** — `chore(cogs): remove dead gameplay_legacy cog (4,723 lines, zero references)` — caught that `cogs/gameplay_legacy.py` was 4,723 lines of dead code with zero inbound references
- **database#211** — `fix(packs): remove unfiltered pre-count in GET /packs (3 round-trips to 2)` — caught a real correctness bug: unfiltered `Pack.select().count()` was returning 404 when no packs existed globally instead of returning empty filter results
**What went wrong:**
1. Analyst alone consumed ~$2.50 with a 411s, 104-tool-use deep sweep
2. `pr-reviewer` dispatch was skipped — budget ran out
3. Digest Write was permission-denied (inner claude wasn't running with --dangerously-skip-permissions) — manually extracted and saved from the JSON output
4. pd-plan integration skipped — approved queued items only in the digest
5. 7 approved items never dispatched, including a high-priority real bug (economy cog overwriting `tree.on_error` causing stuck play-lock)
6. Multiple Bash tool denials wasted budget on retries (compound commands, venv activation, `source`, curl, `diff <()`)
### Phase 6 — Post-run fixes
Spun up a yolo-mode `claude -p` agent to apply three critical fixes. Commit `a79efb2`:
1. Inner claude budget: $5 → $20
2. Added `--dangerously-skip-permissions` to inner claude in `run-nightly.sh`
3. Analyst scope tightened in `.claude/agents/analyst.md`: max findings 15 → 5, added 30 tool-use cap with budget starvation rationale
Also filed `cal/paper-dynasty-umbrella#3` (labels: `autonomous`, `size:S`, `type:stability`) for the observability self-improvement (split stdout/stderr, write `last-run-result.json`, voice-notify on failure). This is exactly the kind of ticket the pipeline could pick up on a future autonomous run.
## Current state (as of 2026-04-10)
- ✅ All code merged to main and pushed to origin
- ✅ 15 Gitea labels created across 4 repos (3 sub-projects + umbrella)
- ✅ Scheduled task installed
- ✅ Systemd timer unit installed
- ✅ 2 real PRs shipped (pending Cal review / reviewer pipeline)
- ✅ Observability ticket filed
- ✅ Post-run fixes applied
- ⏸️ Systemd timer **NOT ENABLED** — pending one more validation smoke test with the $20 budget + tightened analyst
## Queued work for next run
See `project_autonomous_first_run.md` memory file for the full list. Headline items:
1. `analyst-2026-04-10-003` — Economy cog `tree.on_error` bug (real stuck-user impact) — dispatch first
2. `cal/paper-dynasty-umbrella#3` — Observability improvement (unblocks future debugging) — dispatch early
3. 5 other approved items from the first run (3 features, 2 stability)
4. 4 reshaped items that need additional spec work before dispatch
## Why this matters
This was a meta-accomplishment: building the tooling that builds the tooling. The pipeline is now a standing autonomous capability in the Paper Dynasty ecosystem. Cal's availability is no longer the bottleneck for routine stability fixes, small features, and dead-code cleanup. As confidence builds, the slot limits can rise, the budget can expand, and the scope can broaden.
The first run also validated a deeper question: **can agents produce genuinely useful work without human guidance on what to build?** The answer, based on these 2 PRs, is yes — the pipeline caught a real correctness bug and a real dead-code pile that Cal had not flagged. That's the whole value proposition working on night one.
## Next session pickup
When resuming:
1. Check status of `cal/paper-dynasty-discord#162` and `cal/paper-dynasty-database#211` — merged? closed? pending?
2. Check status of `cal/paper-dynasty-umbrella#3` — has it been picked up?
3. Decide: enable the systemd timer, or run another manual smoke test first
4. If running another smoke test: expect ~$7-10 with the new config (analyst $2, growth-po $0.30, 2 POs × $0.30, 5 engineers × $0.80, 5 pr-reviewers × $0.30)
5. See `project_autonomous_pipeline.md` and `project_autonomous_first_run.md` in memory for full context
## References
- Spec: `docs/superpowers/specs/2026-04-09-autonomous-improvement-pipeline-design.md`
- Plan: `docs/superpowers/plans/2026-04-09-autonomous-improvement-pipeline.md`
- Commit log: `git log --oneline --grep='autonomous'` in paper-dynasty-umbrella
- First run digest: `autonomous-nightly-2026-04-10.md` (this same domain)
- Live system: `/mnt/NV2/Development/paper-dynasty/autonomous/`


@ -178,7 +178,7 @@ When merging many PRs at once (e.g., batch pagination PRs), branch protection ru
| `LOG_LEVEL` | Logging verbosity (default: INFO) |
| `DATABASE_TYPE` | `postgresql` |
| `POSTGRES_HOST` | Container name of PostgreSQL |
| `POSTGRES_DB` | Database name (`pd_master`) |
| `POSTGRES_DB` | Database name `pd_master` (prod) / `paperdynasty_dev` (dev) |
| `POSTGRES_USER` | DB username |
| `POSTGRES_PASSWORD` | DB password |
@ -189,4 +189,6 @@ When merging many PRs at once (e.g., batch pagination PRs), branch protection ru
| Database API (prod) | `ssh akamai` | `pd_api` | 815 |
| Database API (dev) | `ssh pd-database` | `dev_pd_database` | 813 |
| PostgreSQL (prod) | `ssh akamai` | `pd_postgres` | 5432 |
| PostgreSQL (dev) | `ssh pd-database` | `pd_postgres` | 5432 |
| PostgreSQL (dev) | `ssh pd-database` | `sba_postgres` | 5432 |
**Dev database credentials:** container `sba_postgres`, database `paperdynasty_dev`, user `sba_admin`. Prod uses `pd_postgres`, database `pd_master`.


@ -0,0 +1,170 @@
---
title: "Discord Bot Browser Testing via Playwright + CDP"
description: "Step-by-step workflow for automated Discord bot testing using Playwright connected to Brave browser via Chrome DevTools Protocol. Covers setup, slash command execution, and screenshot capture."
type: runbook
domain: paper-dynasty
tags: [paper-dynasty, discord, testing, playwright, automation]
---
# Discord Bot Browser Testing via Playwright + CDP
Automated testing of Paper Dynasty Discord bot commands by connecting Playwright to a running Brave browser instance with Discord open.
## Prerequisites
- Brave browser installed (`brave-browser-stable`)
- Playwright installed (`pip install playwright && playwright install chromium`)
- Discord logged in via browser (not desktop app)
- Discord bot running (locally via docker-compose or on remote host)
- Bot's `API_TOKEN` must match the target API environment
## Setup
### 1. Launch Brave with CDP enabled
Brave must be started with `--remote-debugging-port`. If Brave is already running, **kill it first** — otherwise the flag is ignored and the new process merges into the existing one.
```bash
killall brave 2>/dev/null; sleep 2; brave-browser-stable --remote-debugging-port=9222 &
```
### 2. Verify CDP is responding
```bash
curl -s http://localhost:9222/json/version | python3 -m json.tool
```
Should return JSON with `Browser`, `webSocketDebuggerUrl`, etc.
### 3. Open Discord in browser
Navigate to `https://discord.com/channels/<server_id>/<channel_id>` in Brave.
**Paper Dynasty test server:**
- Server: Cals Test Server (`669356687294988350`)
- Channel: #pd-game-test (`982850262903451658`)
- URL: `https://discord.com/channels/669356687294988350/982850262903451658`
### 4. Verify bot is running with correct API token
```bash
# Check docker-compose.yml has the right API_TOKEN for the target environment
grep API_TOKEN /mnt/NV2/Development/paper-dynasty/discord-app/docker-compose.yml
# Verify the dev database on the dev host is reachable:
ssh pd-database "docker exec sba_postgres psql -U sba_admin -d paperdynasty_dev -c \"SELECT 1;\""
# Restart bot if token was changed:
cd /mnt/NV2/Development/paper-dynasty/discord-app && docker compose up -d
```
## Running Commands
### Find the Discord tab
```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.connect_over_cdp('http://localhost:9222')
    for ctx in browser.contexts:
        for page in ctx.pages:
            if 'discord' in page.url.lower():
                print(f'Found: {page.url}')
                break
    browser.close()
```
### Execute a slash command and capture result
```python
from playwright.sync_api import sync_playwright
import time

def run_slash_command(command: str, wait_seconds: int = 5, screenshot_path: str = '/tmp/discord_result.png'):
    """
    Type a slash command in Discord, select the top autocomplete option,
    submit it, wait for the bot response, and take a screenshot.
    """
    with sync_playwright() as p:
        browser = p.chromium.connect_over_cdp('http://localhost:9222')
        for ctx in browser.contexts:
            for page in ctx.pages:
                if 'discord' in page.url.lower():
                    msg_box = page.locator('[role="textbox"][data-slate-editor="true"]')
                    msg_box.click()
                    time.sleep(0.3)
                    # Type the command (delay simulates human typing for autocomplete)
                    msg_box.type(command, delay=80)
                    time.sleep(2)
                    # Tab selects the top autocomplete option
                    page.keyboard.press('Tab')
                    time.sleep(1)
                    # Enter submits the command
                    page.keyboard.press('Enter')
                    time.sleep(wait_seconds)
                    page.screenshot(path=screenshot_path)
                    print(f'Screenshot saved to {screenshot_path}')
                    break
        browser.close()

# Example usage:
run_slash_command('/refractor status')
```
### Commands with parameters
After pressing Tab to select the command, Discord shows an options panel. To fill parameters:
1. The first parameter input is auto-focused after Tab
2. Type the value, then Tab to move to the next parameter
3. Press Enter when ready to submit
```python
# Example: /refractor status with tier filter (msg_box and page as in run_slash_command above)
msg_box.type('/refractor status', delay=80)
time.sleep(2)
page.keyboard.press('Tab') # Select command from autocomplete
time.sleep(1)
# Now fill parameters if needed, or just submit
page.keyboard.press('Enter')
```
## Key Selectors
| Element | Selector |
|---------|----------|
| Message input box | `[role="textbox"][data-slate-editor="true"]` |
| Autocomplete popup | `[class*="autocomplete"]` |
## Gotchas
- **Brave must be killed before relaunch** — if an instance is already running, `--remote-debugging-port` is silently ignored
- **Bot token mismatch** — the bot's `API_TOKEN` in `docker-compose.yml` must match the target API (dev or prod). Symptoms: `{"detail":"Unauthorized"}` in bot logs
- **Viewport is None** — when connecting via CDP, `page.viewport_size` returns None. Use `page.evaluate('() => ({w: window.innerWidth, h: window.innerHeight})')` instead
- **Autocomplete timing** — typing too fast may not trigger Discord's autocomplete. The `delay=80` on `msg_box.type()` simulates human speed
- **Multiple bots** — if multiple bots register the same slash command (e.g. MantiTestBot and PucklTestBot), Tab selects the top option. Verify the correct bot name in the autocomplete popup before proceeding
## Test Plan Reference
The Refractor integration test plan is at:
`discord-app/tests/refractor-integration-test-plan.md`
Key test case groups:
- REF-01 to REF-06: Tier badges and display
- REF-10 to REF-15: Progress bars and filtering
- REF-40 to REF-42: Cross-command badges (card, roster)
- REF-70 to REF-72: Cross-command badge propagation (the current priority)
## Verified On
- **Date:** 2026-04-06
- **Browser:** Brave 146.0.7680.178 (Chromium-based)
- **Playwright:** Node.js driver via Python sync API
- **Bot:** MantiTestBot on Cals Test Server, #pd-game-test channel
- **API:** pddev.manticorum.com (dev environment)


@ -0,0 +1,107 @@
---
title: "Refractor In-App Test Plan"
description: "Comprehensive manual test plan for the Refractor card evolution system — covers /refractor status, tier badges, post-game hooks, tier-up notifications, card art tiers, and known issues."
type: guide
domain: paper-dynasty
tags: [paper-dynasty, testing, refractor, discord, database]
---
# Refractor In-App Test Plan
Manual test plan for the Refractor (card evolution) system. All testing targets **dev** environment (`pddev.manticorum.com` / dev Discord bot).
## Prerequisites
- Dev bot running on `sba-bots`
- Dev API at `pddev.manticorum.com` (port 813)
- Team with seeded refractor data (team 31 from prior session)
- At least one game playable to trigger post-game hooks
---
## REF-10: `/refractor status` — Basic Display
| # | Test | Steps | Expected |
|---|---|---|---|
| 10 | No filters | `/refractor status` | Ephemeral embed with team branding, tier summary line, 10 cards sorted by tier DESC, pagination buttons if >10 cards |
| 11 | Card type filter | `/refractor status card_type:Batter` | Only batter cards shown, count matches |
| 12 | Tier filter | `/refractor status tier:T2—Refractor` | Only T2 cards, embed color changes to tier color |
| 13 | Progress filter | `/refractor status progress:Close to next tier` | Only cards >=80% to next threshold, fully evolved excluded |
| 14 | Combined filters | `/refractor status card_type:Batter tier:T1—Base Chrome` | Intersection of both filters |
| 15 | Empty result | `/refractor status tier:T4—Superfractor` (if none exist) | "No cards match your filters..." message with filter details |
## REF-20: `/refractor status` — Pagination
| # | Test | Steps | Expected |
|---|---|---|---|
| 20 | Page buttons appear | `/refractor status` with >10 cards | Prev/Next buttons visible |
| 21 | Next page | Click `Next >` | Page 2 shown, footer updates to "Page 2/N" |
| 22 | Prev page | From page 2, click `< Prev` | Back to page 1 |
| 23 | First page prev | On page 1, click `< Prev` | Nothing happens / stays on page 1 |
| 24 | Last page next | On last page, click `Next >` | Nothing happens / stays on last page |
| 25 | Button timeout | Wait 120s after command | Buttons become unresponsive |
| 26 | Wrong user clicks | Another user clicks buttons | Silently ignored |
## REF-30: Tier Badges in Card Embeds
| # | Test | Steps | Expected |
|---|---|---|---|
| 30 | T0 card display | View a T0 card via `/myteam` or `/roster` | No badge prefix, just player name |
| 31 | T1 badge | View a T1 card | Title shows `[BC] Player Name` |
| 32 | T2 badge | View a T2 card | Title shows `[R] Player Name` |
| 33 | T3 badge | View a T3 card | Title shows `[GR] Player Name` |
| 34 | T4 badge | View a T4 card (if exists) | Title shows `[SF] Player Name` |
| 35 | Badge in pack open | Open a pack with an evolved card | Badge appears in pack embed |
| 36 | API down gracefully | (hard to test) | Card displays normally with no badge, no error |
## REF-50: Post-Game Hook & Tier-Up Notifications
| # | Test | Steps | Expected |
|---|---|---|---|
| 50 | Game completes normally | Play a full game | No errors in bot logs; refractor evaluate-game fires after season-stats update |
| 51 | Tier-up notification | Play game where a card crosses a threshold | Embed in game channel: "Refractor Tier Up!", player name, tier name, correct color |
| 52 | No tier-up | Play game where no thresholds crossed | No refractor embed posted, game completes normally |
| 53 | Multiple tier-ups | Game where 2+ players tier up | One embed per tier-up, all posted |
| 54 | Auto-init new card | Play game with a card that has no RefractorCardState | State created automatically, player evaluated, no error |
| 55 | Superfractor notification | (may need forced data) | "SUPERFRACTOR!" title, teal color |
## REF-60: Card Art with Tiers (API-level)
| # | Test | Steps | Expected |
|---|---|---|---|
| 60 | T0 card image | `GET /api/v2/players/{id}/card-image?card_type=batting` | Base card, no tier styling |
| 61 | Tier override | `GET ...?card_type=batting&tier=2` | Refractor styling visible (border, diamond indicator) |
| 62 | Each tier visual | `?tier=1` through `?tier=4` | Correct border colors, diamond fill, header gradients per tier |
| 63 | Pitcher card | `?card_type=pitching&tier=2` | Tier styling applies correctly to pitcher layout |
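The REF-60 checks are scriptable. A minimal URL builder against the dev host follows — player id 7 is a placeholder, and no auth is shown (add whatever token header the dev API expects):

```python
from urllib.parse import urlencode

DEV_BASE = "https://pddev.manticorum.com"  # dev API from the prerequisites above

def card_image_url(player_id, card_type="batting", tier=None):
    """Build a REF-60 card-art URL; omit tier for the base (T0) render."""
    params = {"card_type": card_type}
    if tier is not None:
        params["tier"] = tier  # tier override, e.g. 1-4
    return f"{DEV_BASE}/api/v2/players/{player_id}/card-image?{urlencode(params)}"
```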
## REF-70: Known Issues to Verify
| # | Issue | Check | Status |
|---|---|---|---|
| 70 | Superfractor embed says "Rating boosts coming in a future update!" | Verify — boosts ARE implemented now, text is stale | **Fix needed** |
| 71 | `on_timeout` doesn't edit message | Buttons stay visually active after 120s | **Known, low priority** |
| 72 | Card embed perf (1 API call per card) | Note latency on roster views with 10+ cards | **Monitor** |
| 73 | Season-stats failure kills refractor eval | Both in same try/except | **Known risk, verify logging** |
---
## API Endpoints Under Test
| Method | Endpoint | Used By |
|---|---|---|
| GET | `/api/v2/refractor/tracks` | Track listing |
| GET | `/api/v2/refractor/cards?team_id=X` | `/refractor status` command |
| GET | `/api/v2/refractor/cards/{card_id}` | Tier badge in card embeds |
| POST | `/api/v2/refractor/cards/{card_id}/evaluate` | Force re-evaluation |
| POST | `/api/v2/refractor/evaluate-game/{game_id}` | Post-game hook |
| GET | `/api/v2/teams/{team_id}/refractors` | Teams alias endpoint |
| GET | `/api/v2/players/{id}/card-image?tier=N` | Card art tier preview |
## Notification Embed Colors
| Tier | Name | Color |
|---|---|---|
| T1 | Base Chrome | Green (0x2ECC71) |
| T2 | Refractor | Gold (0xF1C40F) |
| T3 | Gold Refractor | Purple (0x9B59B6) |
| T4 | Superfractor | Teal (0x1ABC9C) |


@ -158,6 +158,23 @@ ls -t ~/.local/share/claude-scheduled/logs/backlog-triage/ | head -1
~/.config/claude-scheduled/runner.sh backlog-triage
```
## Session Resumption
Tasks can opt into session persistence for multi-step workflows:
```json
{
"session_resumable": true,
"resume_last_session": true
}
```
When `session_resumable` is `true`, runner.sh saves the `session_id` to `$LOG_DIR/last_session_id` after each run. When `resume_last_session` is also `true`, the next run resumes that session with `--resume`.
Issue-poller and PR-reviewer capture `session_id` in logs and result JSON for manual follow-up.
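The save/resume flow can be sketched in shell (illustrative only; the helper names, JSON shape, and parsing here are assumptions, not the actual runner.sh code):
```bash
# Hypothetical sketch of the session save/resume flow, not actual runner.sh code.
LOG_DIR="${LOG_DIR:-$(mktemp -d)}"
SESSION_FILE="$LOG_DIR/last_session_id"

# Emit "--resume <id>" when resume_last_session is true and an id was saved
build_resume_args() {
  if [ "$1" = "true" ] && [ -s "$SESSION_FILE" ]; then
    printf '%s' "--resume $(cat "$SESSION_FILE")"
  fi
}

# After a run: persist session_id from the result JSON when resumable
save_session_id() {
  if [ "$1" = "true" ]; then
    printf '%s' "$2" | sed -n 's/.*"session_id" *: *"\([^"]*\)".*/\1/p' \
      > "$SESSION_FILE"
  fi
}

save_session_id true '{"session_id":"abc-123","result":"ok"}'
build_resume_args true   # prints: --resume abc-123
```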
See also: [Agent SDK Evaluation](agent-sdk-evaluation.md) for CLI vs SDK comparison.
## Cost Safety
- Per-task `max_budget_usd` cap — runner.sh detects `error_max_budget_usd` and warns
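A minimal sketch of that detection (the JSON shape below is illustrative; only the `error_max_budget_usd` marker comes from this doc):
```bash
# Illustrative result JSON; real claude -p output has more fields.
result='{"subtype":"error_max_budget_usd","total_cost_usd":2.01}'
if printf '%s' "$result" | grep -q 'error_max_budget_usd'; then
  echo "WARNING: run stopped by max_budget_usd cap"
fi
```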


@ -0,0 +1,175 @@
---
title: "Agent SDK Evaluation — CLI vs Python/TypeScript SDK"
description: "Comparison of Claude Code CLI invocation (claude -p) vs the native Agent SDK for programmatic use in the headless-claude and claude-scheduled systems."
type: context
domain: scheduled-tasks
tags: [claude-code, sdk, agent-sdk, python, typescript, headless, automation, evaluation]
---
# Agent SDK Evaluation: CLI vs Python/TypeScript SDK
**Date:** 2026-04-03
**Status:** Evaluation complete — recommendation below
**Related:** Issue #3 (headless-claude: Additional Agent SDK improvements)
## 1. Current Approach — CLI via `claude -p`
All headless Claude invocations use the CLI subprocess pattern:
```bash
claude -p "<prompt>" \
--model sonnet \
--output-format json \
--allowedTools "Read,Grep,Glob" \
--append-system-prompt "..." \
--max-budget-usd 2.00
```
**Pros:**
- Simple to invoke from any language (bash, n8n SSH nodes, systemd units)
- Uses Claude Max OAuth — no API key needed, no per-token billing
- Mature and battle-tested in our scheduled-tasks framework
- CLAUDE.md and settings.json are loaded automatically
- No runtime dependencies beyond the CLI binary
**Cons:**
- Structured output requires parsing JSON from stdout
- Error handling is exit-code-based with stderr parsing
- No mid-stream observability (streaming requires JSONL parsing)
- Tool approval is allowlist-only — no dynamic per-call decisions
- Session resumption requires manual `--resume` flag plumbing
## 2. Python Agent SDK
**Package:** `claude-agent-sdk` (renamed from `claude-code`)
**Install:** `pip install claude-agent-sdk`
**Requires:** Python 3.10+, `ANTHROPIC_API_KEY` env var
```python
import asyncio
from claude_agent_sdk import query, ClaudeAgentOptions

async def main():
    # the query() async generator must be consumed inside an event loop
    async for message in query(
        prompt="Diagnose server health",
        options=ClaudeAgentOptions(
            allowed_tools=["Read", "Grep", "Bash(python3 *)"],
            output_format={"type": "json_schema", "schema": {...}},
            max_budget_usd=2.00,
        ),
    ):
        if hasattr(message, "result"):
            print(message.result)

asyncio.run(main())
```
**Key features:**
- Async generator with typed `SDKMessage` objects (User, Assistant, Result, System)
- `ClaudeSDKClient` for stateful multi-turn conversations
- `can_use_tool` callback for dynamic per-call tool approval
- In-process hooks (`PreToolUse`, `PostToolUse`, `Stop`, etc.)
- `rewindFiles()` to restore filesystem to any prior message point
- Typed exception hierarchy (`CLINotFoundError`, `ProcessError`, etc.)
**Limitation:** Shells out to the Claude Code CLI binary — it is NOT a pure HTTP client. The binary must be installed.
## 3. TypeScript Agent SDK
**Package:** `@anthropic-ai/claude-agent-sdk` (renamed from `@anthropic-ai/claude-code`)
**Install:** `npm install @anthropic-ai/claude-agent-sdk`
**Requires:** Node 18+, `ANTHROPIC_API_KEY` env var
```typescript
import { query } from "@anthropic-ai/claude-agent-sdk";
for await (const message of query({
prompt: "Diagnose server health",
options: {
allowedTools: ["Read", "Grep", "Bash(python3 *)"],
maxBudgetUsd: 2.00,
}
})) {
if ("result" in message) console.log(message.result);
}
```
**Key features (superset of Python):**
- Same async generator pattern
- `"auto"` permission mode (model classifier per tool call) — TS-only
- `spawnClaudeCodeProcess` hook for remote/containerized execution
- `setMcpServers()` for dynamic MCP server swapping mid-session
- V2 preview: `send()` / `stream()` patterns for simpler multi-turn
- Bundles the Claude Code binary — no separate install needed
## 4. Comparison Matrix
| Capability | `claude -p` CLI | Python SDK | TypeScript SDK |
|---|---|---|---|
| **Auth** | OAuth (Claude Max) | API key only | API key only |
| **Invocation** | Shell subprocess | Async generator | Async generator |
| **Structured output** | `--json-schema` flag | Schema in options | Schema in options |
| **Streaming** | JSONL parsing | Typed messages | Typed messages |
| **Tool approval** | `--allowedTools` only | `can_use_tool` callback | `canUseTool` callback + auto mode |
| **Session resume** | `--resume` flag | `resume: sessionId` | `resume: sessionId` |
| **Cost tracking** | Parse result JSON | `ResultMessage.total_cost_usd` | Same + per-model breakdown |
| **Error handling** | Exit codes + stderr | Typed exceptions | Typed exceptions |
| **Hooks** | External shell scripts | In-process callbacks | In-process callbacks |
| **Custom tools** | Not available | `tool()` decorator | `tool()` + Zod schemas |
| **Subagents** | Not programmatic | `agents` option | `agents` option |
| **File rewind** | Not available | `rewindFiles()` | `rewindFiles()` |
| **MCP servers** | `--mcp-config` file | Inline config object | Inline + dynamic swap |
| **CLAUDE.md loading** | Automatic | Must opt-in (`settingSources`) | Must opt-in |
| **Dependencies** | CLI binary | CLI binary + Python | Node 18+ (bundles CLI) |
## 5. Integration Paths
### A. n8n Code Nodes
The n8n Code node supports JavaScript (not TypeScript directly, but the SDK's JS output works). This would replace the current SSH → CLI pattern:
```
Schedule Trigger → Code Node (JS, uses SDK) → IF → Discord
```
**Trade-off:** Eliminates the SSH hop to CT 300, but requires `ANTHROPIC_API_KEY` and n8n to have the npm package installed. Current n8n runs in a Docker container on CT 210 — would need the SDK and CLI binary in the image.
### B. Standalone Python Scripts
Replace `claude -p` subprocess calls in custom dispatchers with the Python SDK:
```python
# Instead of: subprocess.run(["claude", "-p", prompt, ...])
async for msg in query(prompt=prompt, options=opts):
...
```
**Trade-off:** Richer error handling and streaming, but our dispatchers are bash scripts, not Python. Would require rewriting `runner.sh` and dispatchers in Python.
### C. Systemd-triggered Tasks (Current Architecture)
Keep systemd timers → bash scripts, but optionally invoke a thin Python wrapper that uses the SDK instead of `claude -p` directly.
**Trade-off:** Adds Python as a dependency for scheduled tasks that currently only need bash + the CLI binary. Marginal benefit unless we need hooks or dynamic tool approval.
## 6. Recommendation
**Stay with CLI invocation for now. Revisit the Python SDK when we need dynamic tool approval or in-process hooks.**
### Rationale
1. **Auth is the blocker.** The SDK requires `ANTHROPIC_API_KEY` (API billing). Our entire scheduled-tasks framework runs on Claude Max OAuth at zero marginal cost. Switching to the SDK means paying per-token for every scheduled task, issue-worker, and PR-reviewer invocation. This alone makes the SDK non-viable for our current architecture.
2. **The CLI covers our needs.** With `--append-system-prompt` (done), `--resume` (this PR), `--json-schema`, and `--allowedTools`, the CLI provides everything we currently need. Session resumption was the last missing piece.
3. **Bash scripts are the right abstraction.** Our runners are launched by systemd timers. Bash + CLI is the natural fit — no runtime dependencies, no async event loops, no package management.
### When to Revisit
- If Anthropic adds OAuth support to the SDK (eliminating the billing difference)
- If we need dynamic tool approval (e.g., "allow this Bash command but deny that one" at runtime)
- If we build a long-running Python service that orchestrates multiple Claude sessions (the `ClaudeSDKClient` stateful pattern would be valuable there)
- If we move to n8n custom nodes written in TypeScript (the TS SDK bundles the CLI binary)
### Migration Path (If Needed Later)
1. Start with the Python SDK in a single task (e.g., `backlog-triage`) as a proof of concept
2. Create a thin `sdk-runner.py` wrapper that reads the same `settings.json` and `prompt.md` files
3. Swap the systemd unit's `ExecStart` from `runner.sh` to `sdk-runner.py`
4. Expand to other tasks if the POC proves valuable


@ -0,0 +1,46 @@
---
title: "Backlog triage sandbox fix — repos.json outside working directory"
description: "Fix for backlog-triage scheduled task failing to read repos.json because the file was outside the claude -p sandbox (working_dir). Resolved by symlinking into the working directory."
type: troubleshooting
domain: scheduled-tasks
tags: [claude-code, backlog-triage, sandbox, runner, troubleshooting]
---
# Backlog Triage — repos.json Outside Sandbox
**Date**: 2026-04-07
## Problem
The `backlog-triage` scheduled task reported:
> `~/.config/claude-scheduled/repos.json` is outside the allowed session directories and couldn't be read.
The task fell back to querying all discoverable repos via Gitea instead of using the curated repo list.
## Root Cause
`claude -p` sandboxes file access to the **working directory** (`/mnt/NV2/Development/claude-home`). The `repos.json` file lives at `~/.config/claude-scheduled/repos.json` (`/home/cal/`), which is outside the sandbox.
The `--allowedTools "Read(~/.config/claude-scheduled/repos.json)"` flag controls **tool permissions** (which tools the session can call), not **filesystem access**. The sandbox boundary is set by the working directory, and `allowedTools` cannot override it.
## Fix
1. **Symlinked** `repos.json` into the working directory:
```bash
ln -sf /home/cal/.config/claude-scheduled/repos.json \
/mnt/NV2/Development/claude-home/.claude/repos.json
```
2. **Updated** `tasks/backlog-triage/prompt.md` to reference `.claude/repos.json` instead of the absolute home-dir path.
3. **Updated** `tasks/backlog-triage/settings.json` allowed_tools to `Read(.claude/repos.json)`.
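The fix can be rehearsed with throwaway paths (the real paths are `~/.config/claude-scheduled/repos.json` and the claude-home working directory):
```bash
# Rehearse the symlink fix in temporary directories.
workdir=$(mktemp -d)   # stand-in for the claude -p working directory
confdir=$(mktemp -d)   # stand-in for ~/.config/claude-scheduled
echo '{"repos": []}' > "$confdir/repos.json"
mkdir -p "$workdir/.claude"
ln -sf "$confdir/repos.json" "$workdir/.claude/repos.json"
cd "$workdir"
# The symlink resolves and is readable from inside the working directory
test -r .claude/repos.json && echo "repos.json readable from working dir"
```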
## Key Lesson
For `runner.sh` template tasks, any file the task needs to read **must be inside the working directory** or reachable via a symlink within it. The `--allowedTools` flag is a permissions layer on top of the sandbox — it cannot grant access to paths outside the sandbox.
## Also Changed (same session)
- Removed `cognitive-memory` MCP from backlog-triage; replaced with `kb-search` (HTTP MCP at `10.10.0.226:8001/mcp`) for cross-referencing issue context against the knowledge base.
- Removed all `mcp__cognitive-memory__*` tools from allowed_tools; added `mcp__kb-search__search` and `mcp__kb-search__get_document`.


@ -0,0 +1,81 @@
---
title: "Fix: pr-reviewer leaving ai-reviewing label stuck on PRs"
description: "Duplicate Gitea labels caused _get_label_id to SIGPIPE under pipefail, making remove_label silently bail and orphaning the ai-reviewing tag across 6 PRs."
type: troubleshooting
domain: scheduled-tasks
tags:
- troubleshooting
- pr-reviewer
- gitea
- labels
- bash
- pipefail
- claude-scheduled
---
# Fix: pr-reviewer leaving `ai-reviewing` label stuck on PRs
**Date:** 2026-04-10
**Severity:** Medium — pr-reviewer-dispatcher skipped any PR that already carried `ai-reviewing`, so stuck PRs were never re-reviewed. Six PRs across `major-domo-database` and `paper-dynasty-database` were silently wedged for weeks.
## Problem
Open PRs across the tracked repos accumulated the orange `ai-reviewing` label with no corresponding `ai-reviewed` / `ai-changes-requested` outcome. Because `pr-reviewer-dispatcher.sh` filters out any PR that already has one of those three labels, stuck PRs stayed invisible to future runs.
Two distinct stuck patterns were observable:
1. **Both labels attached** (`major-domo-database` #128, #124): the reviewer clearly ran to completion — `ai-reviewed` was added — but `ai-reviewing` was never removed.
2. **Only `ai-reviewing` attached** (`paper-dynasty-database` #207, #126, #125; `major-domo-database` #122): no review outcome label at all. Looked like a mid-run crash.
## Root Cause
Two compounding bugs in `~/.config/claude-scheduled/gitea-lib.sh`.
### 1. Duplicate labels accumulated in repos
`ensure_label` had no de-duplication check. Any transient failure in `_get_label_id` (bad response, jq parse, pipeline issue) fell through and created a *new* label with the same name. Over time two `ai-reviewing` rows existed in both `major-domo-database` (ids 30, 31) and `paper-dynasty-database` (ids 60, 35); `paper-dynasty-discord` had the same issue with `ai-working`.
### 2. `_get_label_id` SIGPIPE under pipefail
The original helper was:
```bash
_get_label_id() {
gitea_get "repos/$owner/$repo/labels?limit=50" |
jq -r --arg name "$name" '.[] | select(.name == $name) | .id' 2>/dev/null |
head -1
}
```
The dispatcher runs under `set -euo pipefail`. With duplicate labels present, `jq` emits multiple id lines. `head -1` closes the pipe after the first line → `jq` hits SIGPIPE on the next write → pipeline exits non-zero → `pipefail` propagates. Inside `remove_label`:
```bash
label_id=$(_get_label_id ...) || return 0
```
…the `|| return 0` guard then **silently returned without ever calling DELETE**. The reviewer continued on and added `ai-reviewed`, leaving `ai-reviewing` orphaned. Same mechanism in the cleanup trap meant crashed runs also couldn't remove the label.
Additionally, even when the pipe didn't fire SIGPIPE, `remove_label` was resolving the label id against the *repo label catalog* rather than the labels actually attached to the PR — so for `paper-dynasty-database` #125 (which had id=35 attached), `head -1` returned id=60 and the DELETE was a no-op on an id that wasn't even there.
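The SIGPIPE mechanism is easy to reproduce in isolation, using `seq` as a stand-in for `jq` emitting multiple id lines:
```bash
set -o pipefail
# head closes the pipe after one line; seq dies with SIGPIPE on a later
# write, and pipefail surfaces its exit status (128 + SIGPIPE = 141 on Linux).
status_head=0
seq 1 100000 | head -n 1 > /dev/null || status_head=$?
# sed -n '1p' reads the whole stream before exiting, so nothing closes
# the pipe early and the pipeline succeeds.
status_sed=0
seq 1 100000 | sed -n '1p' > /dev/null || status_sed=$?
echo "head=$status_head sed=$status_sed"
```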
## Fix
**`gitea-lib.sh` hardened (three helpers):**
- **`_get_label_id`** — replaced `head -1` with `jq 'sort_by(.id) | .[0].id // empty'`. No pipeline truncation → no SIGPIPE. Also bumped `limit=50` → `limit=200` so large repos aren't silently truncated. On duplicates the *oldest* id is returned (the canonical row).
- **`remove_label`** — now queries `repos/{o}/{r}/issues/{n}/labels` (labels actually attached to the PR), matches by name, and deletes every matching id. Can no longer DELETE the wrong id, and handles the theoretical case where both duplicates got attached.
- **`ensure_label`** — counts existing labels with the target name before lookup, logs `WARNING: $repo has N labels named '$name' — reusing oldest` so the dispatcher log surfaces future dupes.
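The replacement filter can be exercised standalone. Given a catalog with duplicate names (the sample data below is illustrative), it consumes the whole stream and emits exactly one id, the oldest:
```bash
labels='[{"id":31,"name":"ai-reviewing"},{"id":30,"name":"ai-reviewing"},{"id":7,"name":"bug"}]'
printf '%s' "$labels" | jq -r --arg name "ai-reviewing" \
  '[.[] | select(.name == $name)] | sort_by(.id) | .[0].id // empty'
# prints: 30 (the oldest duplicate)
```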
**Repo cleanup:**
- Cleared stale `ai-reviewing` from the 6 stuck PRs via the patched `remove_label`.
- Deleted duplicate label rows (kept the oldest id in each repo): `major-domo-database` id 31, `paper-dynasty-database` id 60, `paper-dynasty-discord` id 52 (`ai-working`).
- Swept all tracked repos for other `ai-*` label dupes — none remaining.
**Verification:** `bash -n`, then `pr-reviewer-dispatcher.sh --dry-run` — correctly re-discovered the 5 PRs that had only `ai-reviewing` (now clean) and properly skipped the 2 that already had `ai-reviewed`.
## Lessons
- **`set -o pipefail` + `head -N` is a foot-gun.** Whenever a downstream stage can close the pipe early, upstream producers will get SIGPIPE and fail the pipeline. Use a filter that consumes the whole stream (`jq '.[0]'`, `sed -n '1p'`) or read the output into a variable and slice it — never `| head -1` in a pipefail script. Note that `awk 'NR==1{print; exit}'` also exits early and can trip the same SIGPIPE.
- **Resolve label ids from the issue, not the repo catalog.** Gitea allows duplicate label names per repo. Any helper that maps a name → id from the repo catalog and then acts on an issue is ambiguous. Always query `issues/{n}/labels` when you need to mutate an attachment.
- **"Get or create" helpers need a de-dup guard.** `ensure_label` should either tolerate duplicates by reusing the oldest (what we did) or hard-error and force a human to clean up; silently creating a new row on any transient failure accumulates garbage state over weeks.
- **Skip-label dispatchers need a staleness timeout.** The dispatcher currently treats `ai-reviewing` as a permanent skip signal. A stuck label wedges a PR forever. Consider adding a timestamp check (e.g., `ai-reviewing` older than 1 hour → force re-review) as a belt-and-suspenders guard against future variants of this bug.


@ -245,11 +245,25 @@ hosts:
- sqlite-major-domo
- temp-postgres
# Docker Home Servers VM (Proxmox) - decommission candidate
# VM 116: Only Jellyfin remains after 2026-04-03 cleanup (watchstate removed — duplicate of manticore's canonical instance)
# Jellyfin on manticore already covers this service. VM 116 + VM 110 are candidates to reclaim 8 vCPUs + 16 GB RAM.
# See issue #31 for cleanup details.
docker-home-servers:
type: docker
ip: 10.10.0.124
vmid: 116
user: cal
description: "Legacy home servers VM — Jellyfin only, decommission candidate"
config_paths:
docker-compose: /home/cal/container-data
services:
- jellyfin # only remaining service; duplicate of ubuntu-manticore jellyfin
decommission_candidate: true
notes: "watchstate removed 2026-04-03 (duplicate of manticore); 3.36 GB images pruned; see issue #31"
# Decommissioned hosts (kept for reference)
# decommissioned:
# tdarr-old:
# ip: 10.10.0.43
# note: "Replaced by ubuntu-manticore tdarr"
# docker-home:
# ip: 10.10.0.124
# note: "Decommissioned"


@ -0,0 +1,246 @@
---
title: "Proxmox Monthly Maintenance Reboot"
description: "Runbook for the first-Sunday-of-the-month Proxmox host reboot — dependency-aware shutdown/startup order, validation checklist, and Ansible automation."
type: runbook
domain: server-configs
tags: [proxmox, maintenance, reboot, ansible, operations, systemd]
---
# Proxmox Monthly Maintenance Reboot
## Overview
| Detail | Value |
|--------|-------|
| **Schedule** | 1st Sunday of every month, 08:00 UTC (3:00 AM EST, 4:00 AM EDT) |
| **Expected downtime** | ~15 minutes (host reboot + VM/LXC startup) |
| **Orchestration** | Ansible on LXC 304 — shutdown playbook → host reboot → post-reboot startup playbook |
| **Calendar** | Google Calendar recurring event: "Proxmox Monthly Maintenance Reboot" |
| **HA DNS** | ubuntu-manticore (10.10.0.226) provides Pi-hole 2 during Proxmox downtime |
## Why
- Kernel updates accumulate without reboot and never take effect
- Long uptimes allow memory leaks and process state drift (e.g., avahi busy-loops)
- Validates that all VMs/LXCs auto-start cleanly with `onboot: 1`
## Architecture
The reboot is split into two playbooks because LXC 304 (the Ansible controller) is itself a guest on the Proxmox host being rebooted:
1. **`monthly-reboot.yml`** — Snapshots all guests, shuts them down in dependency order, issues a fire-and-forget `reboot` to the Proxmox host, then exits. LXC 304 is killed when the host reboots.
2. **`post-reboot-startup.yml`** — After the host reboots, LXC 304 auto-starts via `onboot: 1`. A systemd service (`ansible-post-reboot.service`) waits 120 seconds for the Proxmox API to stabilize, then starts all guests in dependency order with staggered delays.
The `onboot: 1` flag on all production guests acts as a safety net — even if the post-reboot playbook fails, Proxmox will start everything (though without controlled ordering).
## Prerequisites (Before Maintenance)
- [ ] Verify no active Tdarr transcodes on ubuntu-manticore
- [ ] Verify no running database backups
- [ ] Ensure workstation has Pi-hole 2 (10.10.0.226) as a fallback DNS server so it fails over automatically during downtime
- [ ] Confirm ubuntu-manticore Pi-hole 2 is healthy: `ssh manticore "docker exec pihole pihole status"`
## `onboot` Audit
All production VMs and LXCs must have `onboot: 1` so they restart automatically as a safety net.
**Check VMs:**
```bash
ssh proxmox "for id in \$(qm list | awk 'NR>1{print \$1}'); do \
name=\$(qm config \$id | grep '^name:' | awk '{print \$2}'); \
onboot=\$(qm config \$id | grep '^onboot:'); \
echo \"VM \$id (\$name): \${onboot:-onboot NOT SET}\"; \
done"
```
**Check LXCs:**
```bash
ssh proxmox "for id in \$(pct list | awk 'NR>1{print \$1}'); do \
name=\$(pct config \$id | grep '^hostname:' | awk '{print \$2}'); \
onboot=\$(pct config \$id | grep '^onboot:'); \
echo \"LXC \$id (\$name): \${onboot:-onboot NOT SET}\"; \
done"
```
**Audit results (2026-04-03):**
| ID | Name | Type | `onboot` | Status |
|----|------|------|----------|--------|
| 106 | docker-home | VM | 1 | OK |
| 109 | homeassistant | VM | 1 | OK (fixed 2026-04-03) |
| 110 | discord-bots | VM | 1 | OK |
| 112 | databases-bots | VM | 1 | OK |
| 115 | docker-sba | VM | 1 | OK |
| 116 | docker-home-servers | VM | 1 | OK |
| 210 | docker-n8n-lxc | LXC | 1 | OK |
| 221 | arr-stack | LXC | 1 | OK (fixed 2026-04-03) |
| 222 | memos | LXC | 1 | OK |
| 223 | foundry-lxc | LXC | 1 | OK (fixed 2026-04-03) |
| 225 | gitea | LXC | 1 | OK |
| 227 | uptime-kuma | LXC | 1 | OK |
| 301 | claude-discord-coordinator | LXC | 1 | OK |
| 302 | claude-runner | LXC | 1 | OK |
| 303 | mcp-gateway | LXC | 0 | Intentional (on-demand) |
| 304 | ansible-controller | LXC | 1 | OK |
**If any production guest is missing `onboot: 1`:**
```bash
ssh proxmox "qm set <VMID> --onboot 1" # for VMs
ssh proxmox "pct set <CTID> --onboot 1" # for LXCs
```
## Shutdown Order (Dependency-Aware)
Reverse of the validated startup sequence. Stop consumers before their dependencies. Each tier polls per-guest status rather than using fixed waits.
```
Tier 4 — Media & Others (no downstream dependents)
VM 109 homeassistant
LXC 221 arr-stack
LXC 222 memos
LXC 223 foundry-lxc
LXC 302 claude-runner
Tier 3 — Applications (depend on databases + infra)
VM 115 docker-sba (Paper Dynasty, Major Domo)
VM 110 discord-bots
LXC 301 claude-discord-coordinator
Tier 2 — Infrastructure + DNS (depend on databases)
VM 106 docker-home (Pi-hole 1, NPM)
LXC 225 gitea
LXC 210 docker-n8n-lxc
LXC 227 uptime-kuma
VM 116 docker-home-servers
Tier 1 — Databases (no dependencies, shut down last)
VM 112 databases-bots (force-stop after 90s if ACPI ignored)
→ LXC 304 issues fire-and-forget reboot to Proxmox host, then is killed
```
**Known quirks:**
- VM 112 (databases-bots) may ignore ACPI shutdown — playbook force-stops after 90s
- VM 109 (homeassistant) is self-managed via HA Supervisor, excluded from Ansible inventory
- LXC 303 (mcp-gateway) has `onboot: 0` and is operator-managed — not included in shutdown/startup. If it was running before maintenance, bring it up manually afterward
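One shutdown tier might take this shape in `monthly-reboot.yml` (an illustrative sketch; the task names, inventory target, and retry values are assumptions, not the actual playbook):
```yaml
# Hypothetical sketch: poll each guest to 'stopped' instead of fixed sleeps
- name: Shut down Tier 3 application VMs
  hosts: proxmox
  tasks:
    - name: Request ACPI shutdown
      ansible.builtin.command: "qm shutdown {{ item }} --timeout 90"
      loop: [115, 110]

    - name: Poll until each VM reports stopped
      ansible.builtin.command: "qm status {{ item }}"
      register: vm_status
      until: "'status: stopped' in vm_status.stdout"
      retries: 18
      delay: 10
      changed_when: false
      loop: [115, 110]
```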
## Startup Order (Staggered)
After the Proxmox host reboots, LXC 304 auto-starts and the `ansible-post-reboot.service` waits 120s before running the controlled startup:
```
Tier 1 — Databases first
VM 112 databases-bots
→ wait 30s for DB to accept connections
Tier 2 — Infrastructure + DNS
VM 106 docker-home (Pi-hole 1, NPM)
LXC 225 gitea
LXC 210 docker-n8n-lxc
LXC 227 uptime-kuma
VM 116 docker-home-servers
→ wait 30s
Tier 3 — Applications
VM 115 docker-sba
VM 110 discord-bots
LXC 301 claude-discord-coordinator
→ wait 30s
Pi-hole fix — restart container via SSH to clear UDP DNS bug
ssh docker-home "docker restart pihole"
→ wait 10s
Tier 4 — Media & Others
VM 109 homeassistant
LXC 221 arr-stack
LXC 222 memos
LXC 223 foundry-lxc
LXC 302 claude-runner
```
## Post-Reboot Validation
- [ ] Pi-hole 1 DNS resolving: `ssh docker-home "docker exec pihole dig google.com @127.0.0.1"`
- [ ] Gitea accessible: `curl -sf https://git.manticorum.com/api/v1/version`
- [ ] n8n healthy: `ssh docker-n8n-lxc "docker ps --filter name=n8n --format '{{.Status}}'"`
- [ ] Discord bots responding (check Discord)
- [ ] Uptime Kuma dashboard green: `curl -sf http://10.10.0.227:3001/api/status-page/homelab`
- [ ] Home Assistant running: `curl -sf http://10.10.0.109:8123/api/ -H 'Authorization: Bearer <token>'`
- [ ] Maintenance snapshots cleaned up (auto, 7-day retention)
## Automation
### Ansible Playbooks
Both located at `/opt/ansible/playbooks/` on LXC 304.
```bash
# Dry run — shutdown only
ssh ansible "ansible-playbook /opt/ansible/playbooks/monthly-reboot.yml --check"
# Manual full execution — shutdown + reboot
ssh ansible "ansible-playbook /opt/ansible/playbooks/monthly-reboot.yml"
# Manual post-reboot startup (if automatic startup failed)
ssh ansible "ansible-playbook /opt/ansible/playbooks/post-reboot-startup.yml"
# Shutdown only — skip the host reboot
ssh ansible "ansible-playbook /opt/ansible/playbooks/monthly-reboot.yml --tags shutdown"
```
### Systemd Units (on LXC 304)
| Unit | Purpose | Schedule |
|------|---------|----------|
| `ansible-monthly-reboot.timer` | Triggers shutdown + reboot playbook | 1st Sunday of month, 08:00 UTC |
| `ansible-monthly-reboot.service` | Runs `monthly-reboot.yml` | Activated by timer |
| `ansible-post-reboot.service` | Runs `post-reboot-startup.yml` | On boot (multi-user.target), only if uptime < 10 min |
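The post-reboot unit's uptime guard and stabilization delay can be sketched as follows (a hypothetical unit file; the actual service on LXC 304 may differ):
```ini
# ansible-post-reboot.service (sketch). The first ExecStartPre aborts unless
# the container booted within the last 10 minutes, so a manual systemctl start
# on a long-running host is a no-op.
[Unit]
Description=Post-reboot guest startup orchestration
After=network-online.target
Wants=network-online.target

[Service]
Type=oneshot
ExecStartPre=/bin/sh -c 'up=$$(cut -d. -f1 /proc/uptime); [ "$$up" -lt 600 ]'
ExecStartPre=/bin/sleep 120
ExecStart=/usr/bin/ansible-playbook /opt/ansible/playbooks/post-reboot-startup.yml

[Install]
WantedBy=multi-user.target
```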
```bash
# Check timer status
ssh ansible "systemctl status ansible-monthly-reboot.timer"
# Next scheduled run
ssh ansible "systemctl list-timers ansible-monthly-reboot.timer"
# Check post-reboot service status
ssh ansible "systemctl status ansible-post-reboot.service"
# Disable for a month (e.g., during an incident)
ssh ansible "systemctl stop ansible-monthly-reboot.timer"
```
### Deployment (one-time setup on LXC 304)
```bash
# Copy playbooks
scp ansible/playbooks/monthly-reboot.yml ansible:/opt/ansible/playbooks/
scp ansible/playbooks/post-reboot-startup.yml ansible:/opt/ansible/playbooks/
# Copy and enable systemd units
scp ansible/systemd/ansible-monthly-reboot.timer ansible:/etc/systemd/system/
scp ansible/systemd/ansible-monthly-reboot.service ansible:/etc/systemd/system/
scp ansible/systemd/ansible-post-reboot.service ansible:/etc/systemd/system/
ssh ansible "sudo systemctl daemon-reload && \
sudo systemctl enable --now ansible-monthly-reboot.timer && \
sudo systemctl enable ansible-post-reboot.service"
# Verify SSH key access from LXC 304 to docker-home (needed for Pi-hole restart)
ssh ansible "ssh -o BatchMode=yes docker-home 'echo ok'"
```
## Rollback
If a guest fails to start after reboot:
1. Check Proxmox web UI or `pvesh get /nodes/proxmox/qemu/<VMID>/status/current`
2. Review guest logs: `ssh proxmox "journalctl -u pve-guests -n 50"`
3. Manual start: `ssh proxmox "pvesh create /nodes/proxmox/qemu/<VMID>/status/start"`
4. If guest is corrupted, restore from the pre-reboot Proxmox snapshot
5. If post-reboot startup failed entirely, run manually: `ssh ansible "ansible-playbook /opt/ansible/playbooks/post-reboot-startup.yml"`
## Related Documentation
- [Ansible Controller Setup](../../vm-management/ansible-controller-setup.md) — LXC 304 details and inventory
- [Proxmox 7→9 Upgrade Plan](../../vm-management/proxmox-upgrades/proxmox-7-to-9-upgrade-plan.md) — original startup order and Phase 1 lessons
- [VM Decommission Runbook](../../vm-management/vm-decommission-runbook.md) — removing VMs from the rotation


@ -1,15 +0,0 @@
agent: 1
boot: order=scsi0;net0
cores: 8
memory: 16384
meta: creation-qemu=6.1.0,ctime=1646688596
name: docker-vpn
net0: virtio=76:36:85:A7:6A:A3,bridge=vmbr0,firewall=1
numa: 0
onboot: 1
ostype: l26
scsi0: local-lvm:vm-105-disk-0,size=256G
scsihw: virtio-scsi-pci
smbios1: uuid=55061264-b9b1-4ce4-8d44-9c187affcb1d
sockets: 1
vmgenid: 30878bdf-66f9-41bf-be34-c31b400340f9


@ -1,7 +1,7 @@
agent: 1
boot: order=scsi0;net0
cores: 4
memory: 16384
memory: 6144
meta: creation-qemu=6.1.0,ctime=1646083628
name: docker-home
net0: virtio=BA:65:DF:88:85:4C,bridge=vmbr0,firewall=1
@ -11,5 +11,5 @@ ostype: l26
scsi0: local-lvm:vm-106-disk-0,size=256G
scsihw: virtio-scsi-pci
smbios1: uuid=54ef12fc-edcc-4744-a109-dd2de9a6dc03
sockets: 2
sockets: 1
vmgenid: a13c92a2-a955-485e-a80e-391e99b19fbd


@ -12,5 +12,5 @@ ostype: l26
scsi0: local-lvm:vm-115-disk-0,size=256G
scsihw: virtio-scsi-pci
smbios1: uuid=19be98ee-f60d-473d-acd2-9164717fcd11
sockets: 2
sockets: 1
vmgenid: 682dfeab-8c63-4f0b-8ed2-8828c2f808ef


@ -0,0 +1,141 @@
---
title: "VM 106 (docker-home) Right-Sizing Runbook"
description: "Runbook for right-sizing VM 106 from 16 GB/8 vCPU to 6 GB/4 vCPU — pre-checks, resize commands, and post-resize validation."
type: runbook
domain: server-configs
tags: [proxmox, infra-audit, right-sizing, docker-home]
---
# VM 106 (docker-home) Right-Sizing Runbook
## Context
Infrastructure audit (2026-04-02) found VM 106 severely overprovisioned:
| Resource | Allocated | Actual Usage | Target |
|----------|-----------|--------------|--------|
| RAM | 16 GB | 1.1-1.5 GB | 6 GB (4× headroom) |
| vCPUs | 8 (2 sockets × 4 cores) | load 0.12/core | 4 (1 socket × 4 cores) |
**Services**: Pi-hole, Nginx Proxy Manager, Portainer
## Pre-Check Results (2026-04-03)
Automated checks were run before resizing. **All clear.**
### Container memory limits
```bash
docker inspect pihole nginx-proxy-manager_app_1 portainer \
| python3 -c "import json,sys; c=json.load(sys.stdin); \
[print(x['Name'], 'MemoryLimit:', x['HostConfig']['Memory']) for x in c]"
```
Result:
```
/pihole MemoryLimit: 0
/nginx-proxy-manager_app_1 MemoryLimit: 0
/portainer MemoryLimit: 0
```
`0` = no limit — no containers will OOM at 6 GB.
### Docker Compose memory reservations
```bash
grep -rn 'memory\|mem_limit\|memswap' /home/cal/container-data/*/docker-compose.yml
```
Result: **no matches** — no compose-level memory reservations.
### Live memory usage at audit time
```
total: 15 GiB used: 1.1 GiB free: 6.8 GiB buff/cache: 7.7 GiB
Pi-hole: 463 MiB
NPM: 367 MiB
Portainer: 12 MiB
Total containers: ~842 MiB
```
## Resize Procedure
Brief downtime: Pi-hole and NPM will be unavailable during shutdown.
Manticore runs Pi-hole 2 (10.10.0.226) for HA DNS — clients fail over automatically.
### Step 1 — Shut down the VM
```bash
ssh proxmox "qm shutdown 106 --timeout 60"
# Wait for shutdown
ssh proxmox "qm status 106" # Should show: status: stopped
```
### Step 2 — Apply new hardware config
```bash
# Reduce RAM: 16384 MB → 6144 MB
ssh proxmox "qm set 106 --memory 6144"
# Reduce vCPUs: 2 sockets × 4 cores → 1 socket × 4 cores (8 → 4 vCPUs)
ssh proxmox "qm set 106 --sockets 1 --cores 4"
# Verify
ssh proxmox "qm config 106 | grep -E 'memory|cores|sockets'"
```
Expected output:
```
cores: 4
memory: 6144
sockets: 1
```
### Step 3 — Start the VM
```bash
ssh proxmox "qm start 106"
```
Wait ~30 seconds for Docker to come up.
### Step 4 — Verify services
```bash
# Pi-hole DNS resolution
ssh pihole "docker exec pihole dig google.com @127.0.0.1 | grep -E 'SERVER|ANSWER'"
# NPM — check it's running
ssh pihole "docker ps --filter name=nginx-proxy-manager --format '{{.Status}}'"
# Portainer
ssh pihole "docker ps --filter name=portainer --format '{{.Status}}'"
# Memory usage post-resize
ssh pihole "free -h"
```
### Step 5 — Monitor for 24h
Check memory doesn't approach the 6 GB limit:
```bash
ssh pihole "free -h && docker stats --no-stream --format 'table {{.Name}}\t{{.MemUsage}}'"
```
Alert threshold: if `used` exceeds 4.5 GB (75% of 6 GB), consider increasing to 8 GB.
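The 75% check can be scripted for cron or ad-hoc runs — a minimal sketch, run on the VM itself (the threshold mirrors the 4.5 GB / 6 GB rule of thumb above):

```shell
#!/bin/sh
# Warn when used memory crosses 75% of total.
used_kb=$(free | awk '/^Mem:/ {print $3}')
total_kb=$(free | awk '/^Mem:/ {print $2}')
pct=$(( used_kb * 100 / total_kb ))
if [ "$pct" -ge 75 ]; then
    echo "WARN: memory at ${pct}% of total, consider resizing to 8 GB"
else
    echo "OK: memory at ${pct}% of total"
fi
```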
## Rollback
If services fail to come up after resizing:
```bash
# Restore original allocation
ssh proxmox "qm set 106 --memory 16384 --sockets 2 --cores 4"
ssh proxmox "qm start 106"
```
## Related
- [Maintenance Reboot Runbook](maintenance-reboot.md) — VM 106 is Tier 2 (shut down after apps, before databases)
- Issue: cal/claude-home#19


@ -28,8 +28,8 @@ tags: [proxmox, upgrade, pve, backup, rollback, infrastructure]
**Production Services** (7 LXC + 7 VMs) — cleaned up 2026-02-19:
- **Critical**: Paper Dynasty/Major Domo (VM 115), Discord bots (VM 110), Gitea (LXC 225), n8n (LXC 210), Home Assistant (VM 109), Databases (VM 112), docker-home/Pi-hole 1 (VM 106)
- **Important**: Claude Discord Coordinator (LXC 301), arr-stack (LXC 221), Uptime Kuma (LXC 227), Foundry VTT (LXC 223), Memos (LXC 222)
- **Stopped/Investigate**: docker-vpn (VM 105, decommissioning), docker-home-servers (VM 116, needs investigation)
- **Removed (2026-02-19)**: 108 (ansible), 224 (openclaw), 300 (openclaw-migrated), 101/102/104/111/211 (game servers), 107 (plex), 113 (tdarr - moved to .226), 114 (duplicate arr-stack), 117 (unused), 100/103 (old templates)
- **Decommission Candidate**: docker-home-servers (VM 116) — Jellyfin-only after 2026-04-03 cleanup; watchstate removed (duplicate of manticore); see issue #31
- **Removed (2026-02-19)**: 108 (ansible), 224 (openclaw), 300 (openclaw-migrated), 101/102/104/111/211 (game servers), 107 (plex), 113 (tdarr - moved to .226), 114 (duplicate arr-stack), 117 (unused), 100/103 (old templates), 105 (docker-vpn - decommissioned 2026-04)
**Key Constraints**:
- Home Assistant VM 109 requires dual network (vmbr1 for Matter support)


@ -67,10 +67,15 @@ runcmd:
# Add cal user to docker group (will take effect after next login)
- usermod -aG docker cal
# Test Docker installation
- docker run --rm hello-world
# Mask avahi-daemon — not needed in a static-IP homelab with Pi-hole DNS,
# and has a known kernel busy-loop bug that wastes CPU
- systemctl stop avahi-daemon || true
- systemctl mask avahi-daemon
# Write configuration files
write_files:
# SSH hardening configuration


@ -0,0 +1,163 @@
---
title: "VM Decommission Runbook"
description: "Step-by-step procedure for safely decommissioning a Proxmox VM — dependency checks, destruction, and repo cleanup."
type: runbook
domain: vm-management
tags: [proxmox, decommission, infrastructure, cleanup]
---
# VM Decommission Runbook
Procedure for safely removing a stopped Proxmox VM and reclaiming its disk space. Derived from the VM 105 (docker-vpn) decommission (2026-04-02, issue #20).
## Prerequisites
- VM must already be **stopped** on Proxmox
- Services previously running on the VM must be confirmed migrated or no longer needed
- SSH access to Proxmox host (`ssh proxmox`)
## Phase 1 — Dependency Verification
Run all checks before destroying anything. A clean result on all five means it is safe to proceed.
### 1.1 Pi-hole DNS
Check both primary and secondary Pi-hole for DNS records pointing to the VM's IP:
```bash
ssh pihole "grep '<VM_IP>' /etc/pihole/custom.list || echo 'No DNS entries'"
ssh pihole "pihole -q <VM_HOSTNAME>"
```
### 1.2 Nginx Proxy Manager (NPM)
Check NPM for any proxy hosts with the VM's IP as an upstream:
- NPM UI: https://npm.manticorum.com → Proxy Hosts → search for VM IP
- Or via API: `ssh npm-pihole "curl -s http://localhost:81/api/nginx/proxy-hosts" | grep <VM_IP>`
### 1.3 Proxmox Firewall Rules
```bash
ssh proxmox "cat /etc/pve/firewall/<VMID>.fw 2>/dev/null || echo 'No firewall rules'"
```
### 1.4 Backup Existence
```bash
ssh proxmox "ls -la /var/lib/vz/dump/ | grep <VMID>"
```
### 1.5 VPN / Tunnel References
Check if any WireGuard or VPN configs on other hosts reference this VM:
```bash
ssh proxmox "grep -r '<VM_IP>' /etc/wireguard/ 2>/dev/null || echo 'No WireGuard refs'"
```
Also check SSH config and any automation scripts in the claude-home repo:
```bash
grep -r '<VM_IP>\|<VM_HOSTNAME>' ~/Development/claude-home/
```
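The five checks can be wrapped in a small helper that prints the exact commands for a given VM, so nothing gets skipped. A sketch — `vm_checks` is a hypothetical helper, not an existing script in the repo, and the VMID/IP/hostname below are placeholders:

```shell
# Print the Phase 1 dependency checks for one VM; paste the output or pipe to sh.
vm_checks() {
    vmid=$1; ip=$2; host=$3
    cat <<EOF
ssh pihole "grep '$ip' /etc/pihole/custom.list || echo 'No DNS entries'"
ssh pihole "pihole -q $host"
ssh proxmox "cat /etc/pve/firewall/$vmid.fw 2>/dev/null || echo 'No firewall rules'"
ssh proxmox "ls -la /var/lib/vz/dump/ | grep $vmid"
ssh proxmox "grep -r '$ip' /etc/wireguard/ 2>/dev/null || echo 'No WireGuard refs'"
EOF
}

# Example usage (placeholder values, substitute your own)
vm_checks 105 10.10.0.105 docker-vpn
```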
## Phase 2 — Safety Measures
### 2.1 Disable Auto-Start
Prevent the VM from starting on Proxmox reboot while you work:
```bash
ssh proxmox "qm set <VMID> --onboot 0"
```
### 2.2 Record Disk Space (Before)
```bash
ssh proxmox "lvs | grep pve"
```
Save this output for comparison after destruction.
### 2.3 Optional: Take a Final Backup
If the VM might contain anything worth preserving:
```bash
ssh proxmox "vzdump <VMID> --mode snapshot --storage home-truenas --compress zstd"
```
Skip if the VM has been stopped for a long time and all services are confirmed migrated.
## Phase 3 — Destroy
```bash
ssh proxmox "qm destroy <VMID> --purge"
```
The `--purge` flag removes the disk along with the VM config. Verify:
```bash
ssh proxmox "qm list | grep <VMID>" # Should return nothing
ssh proxmox "lvs | grep vm-<VMID>-disk" # Should return nothing
ssh proxmox "lvs | grep pve" # Compare with Phase 2.2
```
## Phase 4 — Repo Cleanup
Update these files in the `claude-home` repo:
| File | Action |
|------|--------|
| `~/.ssh/config` | Comment out Host block, add `# DECOMMISSIONED: <name> (<IP>) - <reason>` |
| `server-configs/proxmox/qemu/<VMID>.conf` | Delete the file |
| Migration results (if applicable) | Check off decommission tasks |
| `vm-management/proxmox-upgrades/proxmox-7-to-9-upgrade-plan.md` | Move from Stopped/Investigate to Decommissioned |
| `networking/examples/ssh-homelab-setup.md` | Comment out or remove entry |
| `networking/examples/server_inventory.yaml` | Comment out or remove entry |
Leave historical/planning docs (migration plans, wave results) as-is — they serve as historical records.
## Phase 5 — Commit and PR
Branch naming: `chore/<ISSUE_NUMBER>-decommission-<vm-name>`
Commit message format:
```
chore: decommission VM <VMID> (<name>) — reclaim <SIZE> disk (#<ISSUE>)
Closes #<ISSUE>
```
This is typically a docs-only PR (all `.md` and config files), which is auto-approved by the `auto-merge-docs` workflow.
## Checklist Template
Copy this for each decommission:
```markdown
### VM <VMID> (<name>) Decommission
**Pre-deletion verification:**
- [ ] Pi-hole DNS — no records
- [ ] NPM upstreams — no proxy hosts
- [ ] Proxmox firewall — no rules
- [ ] Backup status — verified
- [ ] VPN/tunnel references — none
**Execution:**
- [ ] Disabled onboot
- [ ] Recorded disk space before
- [ ] Took backup (or confirmed skip)
- [ ] Destroyed VM with --purge
- [ ] Verified disk space reclaimed
**Cleanup:**
- [ ] SSH config updated
- [ ] VM config file deleted from repo
- [ ] Migration docs updated
- [ ] Upgrade plan updated
- [ ] Example files updated
- [ ] Committed, pushed, PR created
```


@ -262,7 +262,7 @@ When connecting Jellyseerr to arr apps, be careful with tag configurations - inv
- [x] Test movie/show requests through Jellyseerr
### After 48 Hours
- [ ] Decommission VM 121 (docker-vpn)
- [x] Decommission VM 121 (docker-vpn)
- [ ] Clean up local migration temp files (`/tmp/arr-config-migration/`)
---


@ -152,11 +152,13 @@ Both accounts can run simultaneously in separate terminal windows.
## Current Configuration on This Workstation
**Status: DISABLED** (as of 2026-04-06). The `.envrc` file is still in place but direnv has been denied (`direnv deny ~/work`). To re-enable: `direnv allow ~/work`.
| Location | Account | Purpose |
|----------|---------|---------|
| `~/.claude` | Primary (cal.corum@gmail.com) | All projects except ~/work |
| `~/.claude-ac` | Alternate | ~/work projects |
| `~/work/.envrc` | — | direnv trigger for CLAUDE_CONFIG_DIR |
| `~/work/.envrc` | — | direnv trigger for CLAUDE_CONFIG_DIR (currently denied) |
## How It All Fits Together


@ -0,0 +1,67 @@
---
title: "llama.cpp Installation and Setup"
description: "llama.cpp b8680 Vulkan build installation on workstation with RTX 4080 Super, including model download workflow."
type: reference
domain: workstation
tags: [llama-cpp, vulkan, nvidia, gguf, local-inference]
---
## Installation
Installed from the pre-built release binary (no CUDA build is published for Linux; Vulkan is the correct choice for NVIDIA GPUs):
```bash
# Extract to /opt
sudo mkdir -p /opt/llama.cpp
sudo tar -xzf llama-b8680-bin-ubuntu-vulkan-x64.tar.gz -C /opt/llama.cpp --strip-components=1
# Symlink all binaries to PATH
for bin in /opt/llama.cpp/llama-*; do
  sudo ln -sf "$bin" /usr/local/bin/"$(basename "$bin")"
done
```
**Version**: b8680
**Backends loaded**: Vulkan (GPU), CPU (Zen4 for 7800X3D), RPC
**Source**: https://github.com/ggml-org/llama.cpp/releases
## Release Binary Options (Linux x64)
| Build | Use case |
|-------|----------|
| `ubuntu-x64` | CPU only |
| `ubuntu-vulkan-x64` | NVIDIA/AMD GPU via Vulkan |
| `ubuntu-rocm-x64` | AMD GPU via ROCm |
| `ubuntu-openvino-x64` | Intel CPU/GPU/NPU |
No pre-built CUDA binary exists — Vulkan is the NVIDIA option. For native CUDA, build from source with `-DGGML_CUDA=ON`.
## Models
Stored in `/home/cal/Models/`.
| Model | File | Size |
|-------|------|------|
| Qwen3.5-9B Q4_K_M | `Qwen3.5-9B-Q4_K_M.gguf` | 5.3 GB |
## Downloading Models
The built-in `-hf` downloader can stall. Use `curl` with resume support instead:
```bash
curl -L -C - --progress-bar \
-o /home/cal/Models/<model>.gguf \
"https://huggingface.co/<org>/<repo>/resolve/main/<model>.gguf"
```
`-C -` enables resume if the download is interrupted.
## Running
```bash
# Full GPU offload
llama-cli -m /home/cal/Models/Qwen3.5-9B-Q4_K_M.gguf -ngl 99
# Server mode
llama-server -m /home/cal/Models/Qwen3.5-9B-Q4_K_M.gguf -ngl 99 --port 8080
```
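Once `llama-server` is up, it exposes an OpenAI-compatible HTTP API. A quick smoke test, assuming the server command above is already running on port 8080:

```shell
# POST a minimal chat request to llama-server's OpenAI-compatible endpoint.
payload='{"messages":[{"role":"user","content":"Say hello in one word."}],"max_tokens":16}'
curl -s http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d "$payload" || echo "server not reachable"
```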


@ -0,0 +1,33 @@
---
title: "Workstation Troubleshooting"
description: "Troubleshooting notes for Nobara/KDE Wayland workstation issues."
type: troubleshooting
domain: workstation
tags: [troubleshooting, wayland, kde]
---
# Workstation Troubleshooting
## Discord screen sharing shows no windows on KDE Wayland (2026-04-03)
**Severity:** Medium — cannot share screen via Discord desktop app
**Problem:** Clicking "Share Your Screen" in Discord desktop app (v0.0.131, Electron 37) opens the Discord picker but shows zero windows/screens. Same behavior in both the desktop app and the web app when using Discord's own picker. Affects both native Wayland and XWayland modes.
**Root Cause:** Discord's built-in screen picker uses Electron's `desktopCapturer.getSources()` which relies on X11 window enumeration. On KDE Wayland:
- In native Wayland mode: no X11 windows exist, so the picker is empty
- In forced X11/XWayland mode (`ELECTRON_OZONE_PLATFORM_HINT=x11`): Discord can only see other XWayland windows (itself, Android emulator), not native Wayland apps
- Discord ignores `--use-fake-ui-for-media-stream` and other Chromium flags that should force portal usage
- The `discord-flags.conf` file is **not read** by the Nobara/RPM Discord package — flags must go in the `.desktop` file `Exec=` line
**Fix:** Use the **Discord web app in Firefox** for screen sharing. Firefox natively delegates to the XDG Desktop Portal via PipeWire, which shows the KDE screen picker with all windows. The desktop app's own picker remains broken on Wayland as of v0.0.131.
Configuration applied (for general Discord Wayland support):
- `~/.local/share/applications/discord.desktop` — overrides system `.desktop` with Wayland flags
- `~/.config/discord-flags.conf` — created but not read by this Discord build
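For reference, the override `.desktop` puts the Wayland flags on the `Exec=` line. A sketch of the relevant entries — the flags shown are the standard Electron/Chromium Ozone switches and the binary path is illustrative; the exact set applied on this machine may differ:

```ini
# ~/.local/share/applications/discord.desktop (abridged)
[Desktop Entry]
Name=Discord
Type=Application
Exec=/usr/bin/Discord --ozone-platform-hint=auto --enable-features=WaylandWindowDecorations
```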
**Lesson:**
- Discord desktop on Linux Wayland cannot do screen sharing through its own picker — always use the web app in Firefox for this
- Electron's `desktopCapturer` API is fundamentally X11-only; the PipeWire/portal path requires the app to use `getDisplayMedia()` instead, which Discord's desktop app does not do
- `discord-flags.conf` is unreliable across distros — always verify flags landed in `/proc/<pid>/cmdline`
- Vesktop (community client) is an alternative that properly implements portal-based screen sharing, if the web app is insufficient