Compare commits

...

55 Commits

Author SHA1 Message Date
cal
d34bc01305 Merge pull request 'feat: right-size VM 115 (docker-sba) 16→8 vCPUs' (#44) from enhancement/18-rightsize-vm115-vcpus into main
Reviewed-on: #44
Reviewed-by: Claude <cal.corum+openclaw@gmail.com>
2026-04-06 15:41:34 +00:00
Cal Corum
01e6302709 feat: right-size VM 115 config and add --hosts flag to audit script
All checks were successful
Auto-merge docs-only PRs / auto-merge-docs (pull_request) Successful in 1s
Reduce VM 115 (docker-sba) from 16 vCPUs (2×8) to 8 vCPUs (1×8) to
match actual workload (0.06 load/core). Add --hosts flag to
homelab-audit.sh for targeted post-change audits.

Closes #18

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-06 15:41:16 +00:00
cal
024aea82c4 Merge pull request 'feat: add monthly Docker prune cron Ansible playbook (#29)' (#45) from issue/29-docker-image-prune-cron-on-all-docker-hosts into main
Reviewed-on: #45
2026-04-06 15:41:04 +00:00
Cal Corum
d4ee899c1d feat: add monthly Docker prune cron Ansible playbook (#29)
Closes #29

Deploys /etc/cron.monthly/docker-prune to all six Docker hosts via
Ansible. Uses a 720h (30-day) age filter on containers and images;
volumes labeled `keep` are exempt from volume pruning.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-06 15:40:33 +00:00
cal
d7987a90ff Merge pull request 'docs: right-size VM 106 (docker-home) — 16 GB/8 vCPU → 6 GB/4 vCPU (#19)' (#47) from issue/19-right-size-vm-106-docker-home-16-gb-6-8-gb-ram into main
Reviewed-on: #47
2026-04-06 15:40:20 +00:00
cal
5b23d92435 Merge branch 'main' into issue/19-right-size-vm-106-docker-home-16-gb-6-8-gb-ram
2026-04-06 15:40:07 +00:00
cal
29238f3ddf Merge pull request 'feat: weekly Proxmox backup verification → Discord (#27)' (#48) from issue/27-set-up-weekly-proxmox-backup-verification-discord into main
Reviewed-on: #48
2026-04-06 15:39:53 +00:00
Cal Corum
dd7c68c13a docs: sync KB — discord-browser-testing-workflow.md 2026-04-06 02:00:38 -05:00
Cal Corum
acb8fef084 docs: sync KB — database-deployment-guide.md,refractor-in-app-test-plan.md 2026-04-06 00:00:03 -05:00
Cal Corum
cacf4a9043 feat: add weekly Gitea disk cleanup Ansible playbook
Gitea LXC 225 hit 100% disk from accumulated Docker buildx volumes,
repo-archive cache, and journal logs. Adds automated weekly cleanup
managed by systemd timer on the Ansible controller (Wed 04:00 UTC).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-05 19:24:59 -05:00
Cal Corum
95bae33309 feat: add weekly Proxmox backup verification and CT 302 self-health check (#27)
Closes #27

- proxmox-backup-check.sh: SSHes to Proxmox, queries pvesh task history,
  classifies each running VM/CT as green/yellow/red by backup recency,
  posts a Discord embed summary. Designed for weekly cron on CT 302.

- ct302-self-health.sh: Checks disk usage on CT 302 itself, silently
  exits when healthy, posts a Discord alert when any filesystem exceeds
  80% threshold. Closes the blind spot where the monitoring system
  cannot monitor itself externally.

- Updated monitoring/scripts/CONTEXT.md with full operational docs,
  install instructions, and cron schedules for both new scripts.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-04 06:07:57 -05:00
Cal Corum
29a20fbe06 feat: add monthly Proxmox maintenance reboot automation (#26)
Establishes a first-Sunday-of-the-month maintenance window orchestrated
by Ansible on LXC 304. Split into two playbooks to handle the self-reboot
paradox (the controller is a guest on the host being rebooted):

- monthly-reboot.yml: snapshots, tiered shutdown with per-guest polling,
  fire-and-forget host reboot
- post-reboot-startup.yml: controlled tiered startup with staggered delays,
  Pi-hole UDP DNS fix, validation, and snapshot cleanup

Also fixes onboot:1 on VM 109, LXC 221, LXC 223 and creates a recurring
Google Calendar event for the maintenance window.

Closes #26

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 23:33:59 -05:00
Cal Corum
9b47f0c027 docs: right-size VM 106 (docker-home) — 16 GB/8 vCPU → 6 GB/4 vCPU (#19)
Pre-checks confirmed safe to right-size: no container --memory limits,
no Docker Compose memory reservations. Live usage 1.1 GB / 15 GB (7%).

- Update 106.conf: memory 16384 → 6144, sockets 2 → 1 (8 → 4 vCPUs)
- Add right-sizing-vm-106.md runbook with pre-check results and resize commands

Closes #19

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 23:05:43 -05:00
cal
fdc44acb28 Merge pull request 'chore: add --hosts test coverage and right-size VM 115 socket config' (#46) from chore/26-proxmox-monthly-maintenance-reboot into main 2026-04-04 00:35:31 +00:00
Cal Corum
48a804dda2 feat: right-size VM 115 config and add --hosts flag to audit script
Reduce VM 115 (docker-sba) from 16 vCPUs (2×8) to 8 vCPUs (1×8) to
match actual workload (0.06 load/core). Add --hosts flag to
homelab-audit.sh for targeted post-change audits.

Closes #18

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 17:33:01 -05:00
Cal Corum
7a0c264f27 feat: add monthly Proxmox maintenance reboot automation (#26)
Establishes a first-Sunday-of-the-month maintenance window orchestrated
by Ansible on LXC 304. Split into two playbooks to handle the self-reboot
paradox (the controller is a guest on the host being rebooted):

- monthly-reboot.yml: snapshots, tiered shutdown with per-guest polling,
  fire-and-forget host reboot
- post-reboot-startup.yml: controlled tiered startup with staggered delays,
  Pi-hole UDP DNS fix, validation, and snapshot cleanup

Also fixes onboot:1 on VM 109, LXC 221, LXC 223 and creates a recurring
Google Calendar event for the maintenance window.

Closes #26

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 16:17:55 -05:00
Cal Corum
64f299aa1a docs: sync KB — maintenance-reboot.md
2026-04-03 16:00:22 -05:00
cal
a9a778f53c Merge pull request 'feat: dynamic summary, --hosts filter, and --json output (#24)' (#38) from issue/24-homelab-audit-sh-dynamic-summary-and-hosts-filter into main 2026-04-03 20:22:24 +00:00
Cal Corum
1a3785f01a feat: dynamic summary, --hosts filter, and --json output (#24)
Closes #24

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 20:08:07 +00:00
cal
938240e1f9 Merge pull request 'fix: clean up VM 116 watchstate duplicate and document decommission candidacy (#31)' (#41) from issue/31-vm-116-resolve-watchstate-duplicate-and-clean-up-r into main
Reviewed-on: #41
2026-04-03 20:01:27 +00:00
Cal Corum
66143f6090 fix: clean up VM 116 watchstate duplicate and document decommission candidacy (#31)
- Removed stopped watchstate container from VM 116 (duplicate of manticore's canonical instance)
- Pruned 5 orphan images (watchstate, freetube, pihole, hello-world): 3.36 GB reclaimed
- Confirmed manticore watchstate is healthy and syncing Jellyfin state
- VM 116 now runs only Jellyfin (also runs on manticore)
- Added VM 116 (docker-home-servers) to hosts.yml as decommission candidate
- Updated proxmox-7-to-9-upgrade-plan.md status from Stopped/Investigate to Decommission Candidate

Closes #31

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 20:01:13 +00:00
cal
13483157a9 Merge pull request 'feat: session resumption + Agent SDK evaluation' (#43) from feature/3-agent-sdk-improvements into main
Reviewed-on: #43
2026-04-03 20:00:12 +00:00
Cal Corum
e321e7bd47 feat: add session resumption and Agent SDK evaluation
- runner.sh: opt-in session persistence via session_resumable and
  resume_last_session settings; fix read_setting to normalize booleans
- issue-poller.sh: capture and log session_id from worker invocations,
  include in result JSON
- pr-reviewer-dispatcher.sh: capture and log session_id from reviews
- n8n workflow: add --append-system-prompt to initial SSH node, add
  Follow Up Diagnostics node using --resume for deeper investigation,
  update Discord Alert with remediation details
- Add Agent SDK evaluation doc (CLI vs Python/TS SDK comparison)
- Update CONTEXT.md with session resumption documentation

Closes #3

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 19:59:44 +00:00
cal
4e33e1cae3 Merge pull request 'fix: document per-core load threshold policy for health monitoring (#22)' (#42) from issue/22-tune-n8n-alert-thresholds-to-per-core-load-metrics into main
2026-04-03 18:36:14 +00:00
Cal Corum
193ae68f96 docs: document per-core load threshold policy for server health monitoring (#22)
Closes #22

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 13:35:23 -05:00
Cal Corum
7c9c96eb52 docs: sync KB — troubleshooting.md
2026-04-03 12:00:22 -05:00
cal
a8c85a8d91 Merge pull request 'chore: decommission VM 105 (docker-vpn) — repo cleanup' (#40) from chore/20-decommission-vm-105-docker-vpn into main
Some checks failed
Reindex Knowledge Base / reindex (push) Failing after 17s
2026-04-03 12:56:43 +00:00
Cal Corum
9e8346a8ab chore: decommission VM 105 (docker-vpn) — repo cleanup (#20)
VM 105 was already destroyed on Proxmox. This removes stale references:
- Delete server-configs/proxmox/qemu/105.conf
- Comment out docker-vpn entries in example SSH config and server inventory
- Move VM 105 from Stopped/Investigate to Removed in upgrade plan
- Check off decommission task in wave2 migration results

Closes #20

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 23:57:55 -05:00
Cal Corum
4234351cfa feat: add Ansible playbook to mask avahi-daemon on all Ubuntu VMs (#28)
Closes #28

Adds mask-avahi.yml targeting the vms:physical inventory groups (all
Ubuntu QEMU VMs + ubuntu-manticore). Also adds avahi masking to the
cloud-init template so future VMs are hardened from first boot.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-02 23:32:47 -05:00
Cal Corum
a97f443f60 docs: sync KB — vm-decommission-runbook.md
2026-04-02 22:00:04 -05:00
cal
1db2c2b168 Merge pull request 'feat: add backup recency, cert expiry, and I/O wait checks (#25)' (#36) from issue/25-homelab-audit-sh-add-backup-recency-and-certificat into main 2026-04-03 02:15:41 +00:00
Cal Corum
ae5da035f6 feat: add backup recency, cert expiry, OOM, and I/O wait checks (#25)
Closes #25

- check_backup_recency(): queries pvesh vzdump task history; flags VMs
  with no backup (CRIT) or no backup in 7 days (WARN)
- check_cert_expiry(): probes ports 443/8443 per host via openssl;
  flags certs expiring ≤14 days (WARN) or ≤7 days (CRIT)
- io_wait_pct() in COLLECTOR_SCRIPT: uses vmstat 1 2 to sample I/O
  wait; flagged as WARN when > 20%
- OOM kill history was already collected via journalctl; no changes needed

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-02 21:06:44 -05:00
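[Editor's sketch] The cert-expiry check described above can be approximated in a few lines of shell. This is a simplified standalone version, not the audit script's actual code: the function names are hypothetical, the openssl probe and the 7/14-day cutoffs follow the bullet points.

```shell
#!/bin/bash
# Simplified sketch of a cert-expiry probe (hypothetical standalone
# version; not check_cert_expiry() from homelab-audit.sh).

# Days until the cert served on host:port expires (requires openssl,
# GNU date). Returns non-zero if no cert could be fetched.
cert_days_left() {
    local host=$1 port=${2:-443} end
    end=$(echo | openssl s_client -connect "${host}:${port}" -servername "$host" 2>/dev/null \
          | openssl x509 -noout -enddate 2>/dev/null | cut -d= -f2)
    [ -n "$end" ] || return 1
    echo $(( ($(date -d "$end" +%s) - $(date +%s)) / 86400 ))
}

# Classification mirrors the audit thresholds: ≤7 days CRIT,
# ≤14 days WARN, otherwise OK.
classify_cert() {
    if   [ "$1" -le 7 ];  then echo CRIT
    elif [ "$1" -le 14 ]; then echo WARN
    else                       echo OK
    fi
}
```

The network probe obviously depends on the host being reachable; the classifier is pure arithmetic and can be tested in isolation.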
cal
3e3d2ada31 Merge pull request 'feat: zombie parent, swap, and OOM metrics + Tdarr hardening' (#35) from chore/30-investigate-manticore-zombies-swap into main 2026-04-03 02:05:46 +00:00
Cal Corum
e58c5b8cc1 fix: address PR review — move memory limits to deploy block, handle swap-less hosts
Move mem_limit/memswap_limit to deploy.resources.limits.memory so the
constraint is actually enforced under Compose v3. Add END clause to
swap_mb() so hosts without a Swap line report 0 instead of empty output.
Fix an inaccurate header comment in the test script.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 21:05:12 -05:00
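[Editor's sketch] The swap-less-host fix above boils down to awk's END block always executing, even when no pattern matched. A minimal illustration (parsing /proc/meminfo here is an assumption; the script's actual swap_mb() may read `free` output instead):

```shell
#!/bin/bash
# Sketch of the swap_mb() fix: the END block runs even when no Swap
# lines matched, so a swap-less host prints 0 instead of empty output.
# (Illustrative only; the real swap_mb() may parse a different source.)
swap_mb() {
    awk '/^SwapTotal:/ {t=$2} /^SwapFree:/ {f=$2}
         END {print int((t - f) / 1024)}' "$1"
}

# Two synthetic meminfo files: one with swap, one without.
printf 'SwapTotal: 2048 kB\nSwapFree: 1024 kB\n' > /tmp/meminfo.swap
printf 'MemTotal: 8192 kB\n' > /tmp/meminfo.noswap
swap_mb /tmp/meminfo.swap     # (2048-1024)/1024 kB in use -> prints 1
swap_mb /tmp/meminfo.noswap   # no Swap lines matched -> prints 0
```

Without the END clause, the second call would print nothing, which is exactly the empty-output bug the commit fixes.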
Cal Corum
f28dfeb4bf feat: add zombie parent, swap, and OOM metrics to audit; harden Tdarr containers
Extend homelab-audit.sh collector with zombie_parents(), swap_mb(), and
oom_events() functions so the audit identifies which process spawns zombies,
flags high swap usage, and reports recent OOM kills. Add init: true to both
Tdarr docker-compose services so tini reaps orphaned ffmpeg children, and
cap tdarr-node at 28g RAM / 30g total to prevent unbounded memory use.

Closes #30

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 21:02:05 -05:00
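[Editor's sketch] The diff for the collector isn't shown here, but a zombie_parents()-style check can be sketched as follows; the `ps` field choices and output format are assumptions, not the script's actual implementation:

```shell
#!/bin/bash
# Sketch of a zombie-parent finder: group zombie processes by parent
# PID so the audit can name the process that fails to reap children.
# (Illustrative only; the real zombie_parents() may differ.)
zombie_parents() {
    ps -eo stat=,ppid= | awk '$1 ~ /^Z/ {print $2}' | sort | uniq -c | sort -rn |
    while read -r count ppid; do
        comm=$(ps -o comm= -p "$ppid" 2>/dev/null || echo '?')
        printf '%s zombie(s) from PPID %s (%s)\n' "$count" "$ppid" "$comm"
    done
}

zombie_parents   # prints nothing on a healthy host
```

This is also why `init: true` helps the Tdarr containers: tini becomes PID 1 inside the container and reaps orphaned ffmpeg children, so they never linger as zombies for a check like this to find.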
Cal Corum
1ed911e61b fix: single-quote awk program in stuck_procs() collector
The awk program was double-quoted inside the single-quoted
COLLECTOR_SCRIPT, causing $1/$2/$3 to be expanded by the remote
shell as empty positional parameters instead of awk field references.
This made the D-state process filter silently match nothing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 20:48:56 -05:00
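[Editor's sketch] The quoting pitfall fixed here is easy to reproduce in miniature. The data below is hypothetical, not the real COLLECTOR_SCRIPT; the point is the single- vs double-quoted awk program:

```shell
#!/bin/bash
# Minimal reproduction of the quoting bug: in a single-quoted program,
# $1/$2 reach awk as field references; in a double-quoted one the shell
# substitutes its own (empty) positional parameters first and awk gets
# a mangled program.
line="4321 D ffmpeg"

# Correct: single quotes, so awk sees $1 and $2 as fields.
pid=$(echo "$line" | awk '$2 == "D" {print $1}')
echo "single-quoted: pid=$pid"

# Broken: double quotes. With no positional parameters set, $2 and $1
# expand to nothing, leaving awk the invalid program ' == "D" {print }'.
# awk errors out (suppressed here) and the filter matches nothing.
broken=$(echo "$line" | awk "$2 == \"D\" {print $1}" 2>/dev/null || true)
echo "double-quoted: pid=$broken"
```

The same failure mode applies one level up in the audit script: the awk program was double-quoted inside a single-quoted heredoc, so it was the *remote* shell that performed the empty expansion.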
Cal Corum
7c801f6c3b fix: guard --output-dir arg and use configurable ZOMBIE_WARN threshold
- Validate --output-dir has a following argument before accessing $2
  (prevents unbound variable crash under set -u)
- Add ZOMBIE_WARN config variable (default: 1) and use it in the zombie
  check instead of hardcoding 0

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 20:48:56 -05:00
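[Editor's sketch] The unbound-variable hazard is specific to `set -u`: referencing `$2` when `--output-dir` is the final argument aborts the whole script. A minimal sketch of the guard (simplified parser; option and variable names follow the commit message, the rest is hypothetical):

```shell
#!/bin/bash
# Sketch of the --output-dir guard under `set -u` (not the audit
# script's full option handling).
set -u
ZOMBIE_WARN="${ZOMBIE_WARN:-1}"   # configurable threshold, default 1
OUTPUT_DIR=""

parse_args() {
    OUTPUT_DIR=""
    while [ "$#" -gt 0 ]; do
        case "$1" in
            --output-dir)
                # Check that a value follows BEFORE touching $2; under
                # set -u an unguarded "$2" here would kill the script.
                if [ "$#" -lt 2 ]; then
                    echo "error: --output-dir requires a value" >&2
                    return 1
                fi
                OUTPUT_DIR=$2
                shift 2
                ;;
            *)
                shift
                ;;
        esac
    done
}
```

With the guard, `--output-dir` as a trailing flag produces a clean error message and a non-zero return instead of an "unbound variable" crash.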
Cal Corum
9a39abd64c fix: add homelab-audit.sh with variable interpolation and collector fixes (#23)
Closes #23

- Fix STUCK_PROC_CPU_WARN not reaching remote collector: COLLECTOR_SCRIPT
  heredoc stays single-quoted; threshold is passed as $1 to the remote
  bash session so it is evaluated correctly on the collecting host
- Fix LXC IP discovery for static-IP containers: lxc-info result now falls
  back to parsing pct config when lxc-info returns empty
- Fix SSH failures being silently dropped: stderr is now redirected to
  $REPORT_DIR/ssh-failures.log, and SSH_FAILURE entries are counted and
  printed in the summary
- Add explicit comment explaining why -e is omitted from set options

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-02 20:48:56 -05:00
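[Editor's sketch] The threshold-passing fix relies on `bash -s` turning trailing arguments into positional parameters for the script it reads on stdin. A local stand-in for the SSH invocation (the real script pipes this over `ssh host bash -s -- …`; only STUCK_PROC_CPU_WARN is taken from the commit message):

```shell
#!/bin/bash
# Stand-in for the collector invocation: the heredoc stays
# single-quoted (nothing expands locally), and the threshold travels
# as $1 to the bash that consumes stdin.
STUCK_PROC_CPU_WARN=50

run_collector() {
    bash -s -- "$STUCK_PROC_CPU_WARN" <<'EOF'
warn_threshold=$1   # expanded by the receiving bash, not the caller
echo "threshold=$warn_threshold"
EOF
}

run_collector   # prints: threshold=50
```

This is why the heredoc can stay fully single-quoted: the local shell never needs to interpolate the threshold into the script text at all.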
Cal Corum
def437f0cb docs: sync KB — troubleshooting.md 2026-04-02 20:48:39 -05:00
Cal Corum
2e86864e94 docs: sync KB — ace-step-local-network.md 2026-04-02 20:48:06 -05:00
Cal Corum
016683cc35 docs: sync KB — release-2026.4.02.md 2026-04-02 20:48:06 -05:00
Cal Corum
51389c612a docs: sync KB — database-release-2026.4.1.md 2026-04-02 20:48:06 -05:00
Cal Corum
98c69617ff docs: sync KB — troubleshooting-gunicorn-worker-timeouts.md 2026-04-02 20:48:06 -05:00
Cal Corum
50125d8b39 docs: sync KB — release-2026.3.31.md,release-2026.4.01.md 2026-04-02 20:48:06 -05:00
Cal Corum
7bdaa0e002 docs: sync KB — troubleshooting.md 2026-04-02 20:48:06 -05:00
Cal Corum
2cb1ced842 docs: sync KB — troubleshooting.md 2026-04-02 20:48:06 -05:00
Cal Corum
ad6adf7a4c docs: sync KB — release-2026.3.31-2.md 2026-04-02 20:48:06 -05:00
Cal Corum
acb1a35170 docs: sync KB — release-2026.3.31.md 2026-04-02 20:48:06 -05:00
Cal Corum
1d85ed26b9 docs: sync KB — release-2026.3.31.md 2026-04-02 20:48:06 -05:00
Cal Corum
1e7f99269e docs: sync KB — 2026-03-30.md 2026-04-02 20:48:06 -05:00
Cal Corum
f5eab93f7b docs: sync KB — subagent-write-permission-blocked.md,release-2026.3.28.md 2026-04-02 20:48:06 -05:00
Cal Corum
bf4b7dc8b7 docs: sync KB — codex-agents-marketplace.md 2026-04-02 20:48:06 -05:00
Cal Corum
3ac33d0046 docs: sync KB — open-packs-checkin-crash.md 2026-04-02 20:48:06 -05:00
cal
d730ea28bc Merge pull request 'docs: Roku WiFi buffering fix in troubleshooting' (#17) from docs/roku-wifi-buffering-fix into main
2026-03-26 11:55:30 +00:00
cal
43e72fc1b6 Merge pull request 'docs: add Ansible controller LXC setup guide' (#16) from docs/ansible-controller-setup into main
2026-03-26 03:27:29 +00:00
49 changed files with 3818 additions and 45 deletions


@@ -0,0 +1,55 @@
---
# Monthly Docker Prune — Deploy Cleanup Cron to All Docker Hosts
#
# Deploys /etc/cron.monthly/docker-prune to each VM running Docker.
# The script prunes stopped containers and unused images older than 30 days
# (720h), plus orphaned volumes. Volumes labeled `keep` are exempt from
# volume pruning.
#
# Resolves accumulated disk waste from stopped containers and stale images.
# The `--filter "until=720h"` age gate prevents removing recently-pulled
# images that haven't started yet. `docker image prune -a` only removes
# images not referenced by any container (running or stopped), so the
# age filter adds an extra safety margin.
#
# Hosts: VM 106 (docker-home), VM 110 (discord-bots), VM 112 (databases-bots),
# VM 115 (docker-sba), VM 116 (docker-home-servers), manticore
#
# Controller: LXC 304 (ansible-controller) at 10.10.0.232
#
# Usage:
# # Dry run (shows what would change, skips writes)
# ansible-playbook /opt/ansible/playbooks/docker-prune.yml --check
#
# # Single host
# ansible-playbook /opt/ansible/playbooks/docker-prune.yml --limit docker-sba
#
# # All Docker hosts
# ansible-playbook /opt/ansible/playbooks/docker-prune.yml
#
# To undo: rm /etc/cron.monthly/docker-prune on target hosts
- name: Deploy Docker monthly prune cron to all Docker hosts
  hosts: docker-home:discord-bots:databases-bots:docker-sba:docker-home-servers:manticore
  become: true
  tasks:
    - name: Deploy docker-prune cron script
      ansible.builtin.copy:
        dest: /etc/cron.monthly/docker-prune
        owner: root
        group: root
        mode: "0755"
        content: |
          #!/bin/bash
          # Monthly Docker cleanup — deployed by Ansible (issue #29)
          # Prunes stopped containers, unused images (>30 days), and orphaned volumes.
          # Volumes labeled `keep` are exempt from volume pruning.
          set -euo pipefail
          docker container prune -f --filter "until=720h"
          docker image prune -a -f --filter "until=720h"
          docker volume prune -f --filter "label!=keep"

    - name: Verify docker-prune script is executable
      ansible.builtin.command: test -x /etc/cron.monthly/docker-prune
      changed_when: false


@@ -0,0 +1,80 @@
---
# gitea-cleanup.yml — Weekly cleanup of Gitea server disk space
#
# Removes stale Docker buildx volumes, unused images, Gitea repo-archive
# cache, and vacuums journal logs to prevent disk exhaustion on LXC 225.
#
# Schedule: Weekly via systemd timer on LXC 304 (ansible-controller)
#
# Usage:
# ansible-playbook /opt/ansible/playbooks/gitea-cleanup.yml # full run
# ansible-playbook /opt/ansible/playbooks/gitea-cleanup.yml --check # dry run
- name: Gitea server disk cleanup
  hosts: gitea
  gather_facts: false
  tasks:
    - name: Check current disk usage
      ansible.builtin.shell: df --output=pcent / | tail -1
      register: disk_before
      changed_when: false

    - name: Display current disk usage
      ansible.builtin.debug:
        msg: "Disk usage before cleanup: {{ disk_before.stdout | trim }}"

    - name: Clear Gitea repo-archive cache
      ansible.builtin.find:
        paths: /var/lib/gitea/data/repo-archive
        file_type: any
      register: repo_archive_files

    - name: Remove repo-archive files
      ansible.builtin.file:
        path: "{{ item.path }}"
        state: absent
      loop: "{{ repo_archive_files.files }}"
      loop_control:
        label: "{{ item.path | basename }}"
      when: repo_archive_files.files | length > 0

    - name: Remove orphaned Docker buildx volumes
      ansible.builtin.shell: |
        volumes=$(docker volume ls -q --filter name=buildx_buildkit)
        if [ -n "$volumes" ]; then
          echo "$volumes" | xargs docker volume rm 2>&1
        else
          echo "No buildx volumes to remove"
        fi
      register: buildx_cleanup
      changed_when: "'No buildx volumes' not in buildx_cleanup.stdout"

    - name: Prune unused Docker images
      ansible.builtin.command: docker image prune -af
      register: image_prune
      changed_when: "'Total reclaimed space: 0B' not in image_prune.stdout"

    - name: Prune unused Docker volumes
      ansible.builtin.command: docker volume prune -f
      register: volume_prune
      changed_when: "'Total reclaimed space: 0B' not in volume_prune.stdout"

    - name: Vacuum journal logs to 500M
      ansible.builtin.command: journalctl --vacuum-size=500M
      register: journal_vacuum
      changed_when: "'freed 0B' not in journal_vacuum.stderr"

    - name: Check disk usage after cleanup
      ansible.builtin.shell: df --output=pcent / | tail -1
      register: disk_after
      changed_when: false

    - name: Display cleanup summary
      ansible.builtin.debug:
        msg: >-
          Cleanup complete.
          Disk: {{ disk_before.stdout | default('N/A') | trim }} → {{ disk_after.stdout | default('N/A') | trim }}.
          Buildx: {{ (buildx_cleanup.stdout_lines | default(['N/A'])) | last }}.
          Images: {{ (image_prune.stdout_lines | default(['N/A'])) | last }}.
          Journal: {{ (journal_vacuum.stderr_lines | default(['N/A'])) | last }}.


@@ -0,0 +1,43 @@
---
# Mask avahi-daemon on all Ubuntu hosts
#
# Avahi (mDNS/Bonjour) is not needed in a static-IP homelab with Pi-hole DNS.
# A kernel busy-loop bug in avahi-daemon was found consuming ~1.7 CPU cores
# across 5 VMs. Masking prevents the daemon from ever starting again and
# persists across reboots.
#
# Targets: vms + physical (all Ubuntu QEMU VMs and ubuntu-manticore)
# Controller: ansible-controller (LXC 304 at 10.10.0.232)
#
# Usage:
# # Dry run
# ansible-playbook /opt/ansible/playbooks/mask-avahi.yml --check
#
# # Test on a single host first
# ansible-playbook /opt/ansible/playbooks/mask-avahi.yml --limit discord-bots
#
# # Roll out to all Ubuntu hosts
# ansible-playbook /opt/ansible/playbooks/mask-avahi.yml
#
# To undo: systemctl unmask avahi-daemon
- name: Mask avahi-daemon on all Ubuntu hosts
  hosts: vms:physical
  become: true
  tasks:
    - name: Stop avahi-daemon
      ansible.builtin.systemd:
        name: avahi-daemon
        state: stopped
      ignore_errors: true

    - name: Mask avahi-daemon
      ansible.builtin.systemd:
        name: avahi-daemon
        masked: true

    - name: Verify avahi is masked
      ansible.builtin.command: systemctl is-enabled avahi-daemon
      register: avahi_status
      changed_when: false
      failed_when: avahi_status.stdout | trim != 'masked'


@@ -0,0 +1,265 @@
---
# Monthly Proxmox Maintenance Reboot — Shutdown & Reboot
#
# Orchestrates a graceful shutdown of all guests in dependency order,
# then issues a fire-and-forget reboot to the Proxmox host.
#
# After the host reboots, LXC 304 auto-starts via onboot:1 and the
# post-reboot-startup.yml playbook runs automatically via the
# ansible-post-reboot.service systemd unit (triggered by @reboot).
#
# Schedule: 1st Sunday of each month, 08:00 UTC (3 AM ET)
# Controller: LXC 304 (ansible-controller) at 10.10.0.232
#
# Usage:
# # Dry run
# ansible-playbook /opt/ansible/playbooks/monthly-reboot.yml --check
#
# # Full execution
# ansible-playbook /opt/ansible/playbooks/monthly-reboot.yml
#
# # Shutdown only (skip the host reboot)
# ansible-playbook /opt/ansible/playbooks/monthly-reboot.yml --tags shutdown
#
# Note: VM 109 (homeassistant) is excluded from Ansible inventory
# (self-managed via HA Supervisor) but is included in pvesh start/stop.
- name: Pre-reboot health check and snapshots
  hosts: pve-node
  gather_facts: false
  tags: [pre-reboot, shutdown]
  tasks:
    - name: Check Proxmox cluster health
      ansible.builtin.command: pvesh get /cluster/status --output-format json
      register: cluster_status
      changed_when: false

    - name: Get list of running QEMU VMs
      ansible.builtin.shell: >
        pvesh get /nodes/proxmox/qemu --output-format json |
        python3 -c "import sys,json; [print(vm['vmid']) for vm in json.load(sys.stdin) if vm.get('status')=='running']"
      register: running_vms
      changed_when: false

    - name: Get list of running LXC containers
      ansible.builtin.shell: >
        pvesh get /nodes/proxmox/lxc --output-format json |
        python3 -c "import sys,json; [print(ct['vmid']) for ct in json.load(sys.stdin) if ct.get('status')=='running']"
      register: running_lxcs
      changed_when: false

    - name: Display running guests
      ansible.builtin.debug:
        msg: "Running VMs: {{ running_vms.stdout_lines }} | Running LXCs: {{ running_lxcs.stdout_lines }}"

    - name: Snapshot running VMs
      ansible.builtin.command: >
        pvesh create /nodes/proxmox/qemu/{{ item }}/snapshot
        --snapname pre-maintenance-{{ lookup('pipe', 'date +%Y-%m-%d') }}
        --description "Auto snapshot before monthly maintenance reboot"
      loop: "{{ running_vms.stdout_lines }}"
      when: running_vms.stdout_lines | length > 0
      ignore_errors: true

    - name: Snapshot running LXCs
      ansible.builtin.command: >
        pvesh create /nodes/proxmox/lxc/{{ item }}/snapshot
        --snapname pre-maintenance-{{ lookup('pipe', 'date +%Y-%m-%d') }}
        --description "Auto snapshot before monthly maintenance reboot"
      loop: "{{ running_lxcs.stdout_lines }}"
      when: running_lxcs.stdout_lines | length > 0
      ignore_errors: true
- name: "Shutdown Tier 4 — Media & Others"
  hosts: pve-node
  gather_facts: false
  tags: [shutdown]
  vars:
    tier4_vms: [109]
    # LXC 303 (mcp-gateway) is onboot=0 and operator-managed — not included here
    tier4_lxcs: [221, 222, 223, 302]
  tasks:
    - name: Shutdown Tier 4 VMs
      ansible.builtin.command: pvesh create /nodes/proxmox/qemu/{{ item }}/status/shutdown
      loop: "{{ tier4_vms }}"
      ignore_errors: true

    - name: Shutdown Tier 4 LXCs
      ansible.builtin.command: pvesh create /nodes/proxmox/lxc/{{ item }}/status/shutdown
      loop: "{{ tier4_lxcs }}"
      ignore_errors: true

    - name: Wait for Tier 4 VMs to stop
      ansible.builtin.shell: >
        pvesh get /nodes/proxmox/qemu/{{ item }}/status/current --output-format json |
        python3 -c "import sys,json; print(json.load(sys.stdin).get('status','unknown'))"
      register: t4_vm_status
      until: t4_vm_status.stdout.strip() == "stopped"
      retries: 12
      delay: 5
      loop: "{{ tier4_vms }}"
      ignore_errors: true

    - name: Wait for Tier 4 LXCs to stop
      ansible.builtin.shell: >
        pvesh get /nodes/proxmox/lxc/{{ item }}/status/current --output-format json |
        python3 -c "import sys,json; print(json.load(sys.stdin).get('status','unknown'))"
      register: t4_lxc_status
      until: t4_lxc_status.stdout.strip() == "stopped"
      retries: 12
      delay: 5
      loop: "{{ tier4_lxcs }}"
      ignore_errors: true
- name: "Shutdown Tier 3 — Applications"
  hosts: pve-node
  gather_facts: false
  tags: [shutdown]
  vars:
    tier3_vms: [115, 110]
    tier3_lxcs: [301]
  tasks:
    - name: Shutdown Tier 3 VMs
      ansible.builtin.command: pvesh create /nodes/proxmox/qemu/{{ item }}/status/shutdown
      loop: "{{ tier3_vms }}"
      ignore_errors: true

    - name: Shutdown Tier 3 LXCs
      ansible.builtin.command: pvesh create /nodes/proxmox/lxc/{{ item }}/status/shutdown
      loop: "{{ tier3_lxcs }}"
      ignore_errors: true

    - name: Wait for Tier 3 VMs to stop
      ansible.builtin.shell: >
        pvesh get /nodes/proxmox/qemu/{{ item }}/status/current --output-format json |
        python3 -c "import sys,json; print(json.load(sys.stdin).get('status','unknown'))"
      register: t3_vm_status
      until: t3_vm_status.stdout.strip() == "stopped"
      retries: 12
      delay: 5
      loop: "{{ tier3_vms }}"
      ignore_errors: true

    - name: Wait for Tier 3 LXCs to stop
      ansible.builtin.shell: >
        pvesh get /nodes/proxmox/lxc/{{ item }}/status/current --output-format json |
        python3 -c "import sys,json; print(json.load(sys.stdin).get('status','unknown'))"
      register: t3_lxc_status
      until: t3_lxc_status.stdout.strip() == "stopped"
      retries: 12
      delay: 5
      loop: "{{ tier3_lxcs }}"
      ignore_errors: true
- name: "Shutdown Tier 2 — Infrastructure"
  hosts: pve-node
  gather_facts: false
  tags: [shutdown]
  vars:
    tier2_vms: [106, 116]
    tier2_lxcs: [225, 210, 227]
  tasks:
    - name: Shutdown Tier 2 VMs
      ansible.builtin.command: pvesh create /nodes/proxmox/qemu/{{ item }}/status/shutdown
      loop: "{{ tier2_vms }}"
      ignore_errors: true

    - name: Shutdown Tier 2 LXCs
      ansible.builtin.command: pvesh create /nodes/proxmox/lxc/{{ item }}/status/shutdown
      loop: "{{ tier2_lxcs }}"
      ignore_errors: true

    - name: Wait for Tier 2 VMs to stop
      ansible.builtin.shell: >
        pvesh get /nodes/proxmox/qemu/{{ item }}/status/current --output-format json |
        python3 -c "import sys,json; print(json.load(sys.stdin).get('status','unknown'))"
      register: t2_vm_status
      until: t2_vm_status.stdout.strip() == "stopped"
      retries: 12
      delay: 5
      loop: "{{ tier2_vms }}"
      ignore_errors: true

    - name: Wait for Tier 2 LXCs to stop
      ansible.builtin.shell: >
        pvesh get /nodes/proxmox/lxc/{{ item }}/status/current --output-format json |
        python3 -c "import sys,json; print(json.load(sys.stdin).get('status','unknown'))"
      register: t2_lxc_status
      until: t2_lxc_status.stdout.strip() == "stopped"
      retries: 12
      delay: 5
      loop: "{{ tier2_lxcs }}"
      ignore_errors: true
- name: "Shutdown Tier 1 — Databases"
  hosts: pve-node
  gather_facts: false
  tags: [shutdown]
  vars:
    tier1_vms: [112]
  tasks:
    - name: Shutdown database VMs
      ansible.builtin.command: pvesh create /nodes/proxmox/qemu/{{ item }}/status/shutdown
      loop: "{{ tier1_vms }}"
      ignore_errors: true

    - name: Wait for database VMs to stop (up to 90s)
      ansible.builtin.shell: >
        pvesh get /nodes/proxmox/qemu/{{ item }}/status/current --output-format json |
        python3 -c "import sys,json; print(json.load(sys.stdin).get('status','unknown'))"
      register: t1_vm_status
      until: t1_vm_status.stdout.strip() == "stopped"
      retries: 18
      delay: 5
      loop: "{{ tier1_vms }}"
      ignore_errors: true

    - name: Force stop database VMs if still running
      ansible.builtin.shell: >
        status=$(pvesh get /nodes/proxmox/qemu/{{ item }}/status/current --output-format json |
        python3 -c "import sys,json; print(json.load(sys.stdin).get('status','unknown'))");
        if [ "$status" = "running" ]; then
        pvesh create /nodes/proxmox/qemu/{{ item }}/status/stop;
        echo "Force stopped VM {{ item }}";
        else
        echo "VM {{ item }} already stopped";
        fi
      loop: "{{ tier1_vms }}"
      register: force_stop_result
      changed_when: force_stop_result.results | default([]) | selectattr('stdout', 'defined') | selectattr('stdout', 'search', 'Force stopped') | list | length > 0
- name: "Verify and reboot Proxmox host"
  hosts: pve-node
  gather_facts: false
  tags: [reboot]
  tasks:
    - name: Verify all guests are stopped (excluding LXC 304)
      ansible.builtin.shell: >
        running_vms=$(pvesh get /nodes/proxmox/qemu --output-format json |
        python3 -c "import sys,json; vms=[v for v in json.load(sys.stdin) if v.get('status')=='running']; print(len(vms))");
        running_lxcs=$(pvesh get /nodes/proxmox/lxc --output-format json |
        python3 -c "import sys,json; cts=[c for c in json.load(sys.stdin) if c.get('status')=='running' and c['vmid'] != 304]; print(len(cts))");
        echo "Running VMs: $running_vms, Running LXCs: $running_lxcs";
        if [ "$running_vms" != "0" ] || [ "$running_lxcs" != "0" ]; then exit 1; fi
      register: verify_stopped

    - name: Issue fire-and-forget reboot (controller will be killed)
      ansible.builtin.shell: >
        nohup bash -c 'sleep 10 && reboot' &>/dev/null &
        echo "Reboot scheduled in 10 seconds"
      register: reboot_issued
      when: not ansible_check_mode

    - name: Log reboot issued
      ansible.builtin.debug:
        msg: "{{ reboot_issued.stdout }} — Ansible process will terminate when host reboots. Post-reboot startup handled by ansible-post-reboot.service on LXC 304."


@ -0,0 +1,214 @@
---
# Post-Reboot Startup — Controlled Guest Startup After Proxmox Reboot
#
# Starts all guests in dependency order with staggered delays to avoid
# I/O storms. Runs automatically via ansible-post-reboot.service on
# LXC 304 after the Proxmox host reboots.
#
# Can also be run manually:
# ansible-playbook /opt/ansible/playbooks/post-reboot-startup.yml
#
# Note: VM 109 (homeassistant) is excluded from Ansible inventory
# (self-managed via HA Supervisor) but is included in pvesh start/stop.
- name: Wait for Proxmox API to be ready
  hosts: pve-node
  gather_facts: false
  tags: [startup]
  tasks:
    - name: Wait for Proxmox API
      ansible.builtin.command: pvesh get /version --output-format json
      register: pve_version
      until: pve_version.rc == 0
      retries: 30
      delay: 10
      changed_when: false

    - name: Display Proxmox version
      ansible.builtin.debug:
        msg: "Proxmox API ready: {{ pve_version.stdout | from_json | json_query('version') | default('unknown') }}"
- name: "Startup Tier 1 — Databases"
  hosts: pve-node
  gather_facts: false
  tags: [startup]
  tasks:
    - name: Start database VM (112)
      ansible.builtin.command: pvesh create /nodes/proxmox/qemu/112/status/start
      ignore_errors: true

    - name: Wait for VM 112 to be running
      ansible.builtin.shell: >
        pvesh get /nodes/proxmox/qemu/112/status/current --output-format json |
        python3 -c "import sys,json; print(json.load(sys.stdin).get('status','unknown'))"
      register: db_status
      until: db_status.stdout.strip() == "running"
      retries: 12
      delay: 5
      changed_when: false

    - name: Wait for database services to initialize
      ansible.builtin.pause:
        seconds: 30
- name: "Startup Tier 2 — Infrastructure"
  hosts: pve-node
  gather_facts: false
  tags: [startup]
  vars:
    tier2_vms: [106, 116]
    tier2_lxcs: [225, 210, 227]
  tasks:
    - name: Start Tier 2 VMs
      ansible.builtin.command: pvesh create /nodes/proxmox/qemu/{{ item }}/status/start
      loop: "{{ tier2_vms }}"
      ignore_errors: true

    - name: Start Tier 2 LXCs
      ansible.builtin.command: pvesh create /nodes/proxmox/lxc/{{ item }}/status/start
      loop: "{{ tier2_lxcs }}"
      ignore_errors: true

    - name: Wait for infrastructure to come up
      ansible.builtin.pause:
        seconds: 30
- name: "Startup Tier 3 — Applications"
  hosts: pve-node
  gather_facts: false
  tags: [startup]
  vars:
    tier3_vms: [115, 110]
    tier3_lxcs: [301]
  tasks:
    - name: Start Tier 3 VMs
      ansible.builtin.command: pvesh create /nodes/proxmox/qemu/{{ item }}/status/start
      loop: "{{ tier3_vms }}"
      ignore_errors: true

    - name: Start Tier 3 LXCs
      ansible.builtin.command: pvesh create /nodes/proxmox/lxc/{{ item }}/status/start
      loop: "{{ tier3_lxcs }}"
      ignore_errors: true

    - name: Wait for applications to start
      ansible.builtin.pause:
        seconds: 30

    - name: Restart Pi-hole container via SSH (UDP DNS fix)
      ansible.builtin.command: ssh docker-home "docker restart pihole"
      ignore_errors: true

    - name: Wait for Pi-hole to stabilize
      ansible.builtin.pause:
        seconds: 10
- name: "Startup Tier 4 — Media & Others"
  hosts: pve-node
  gather_facts: false
  tags: [startup]
  vars:
    tier4_vms: [109]
    tier4_lxcs: [221, 222, 223, 302]
  tasks:
    - name: Start Tier 4 VMs
      ansible.builtin.command: pvesh create /nodes/proxmox/qemu/{{ item }}/status/start
      loop: "{{ tier4_vms }}"
      ignore_errors: true

    - name: Start Tier 4 LXCs
      ansible.builtin.command: pvesh create /nodes/proxmox/lxc/{{ item }}/status/start
      loop: "{{ tier4_lxcs }}"
      ignore_errors: true
- name: Post-reboot validation
  hosts: pve-node
  gather_facts: false
  tags: [startup, validate]
  tasks:
    - name: Wait for all services to initialize
      ansible.builtin.pause:
        seconds: 60

    - name: Check all expected VMs are running
      ansible.builtin.shell: |
        pvesh get /nodes/proxmox/qemu --output-format json |
        python3 -c "
        import sys, json
        vms = json.load(sys.stdin)
        expected = {106, 109, 110, 112, 115, 116}
        running = {v['vmid'] for v in vms if v.get('status') == 'running'}
        missing = expected - running
        if missing:
            print(f'WARN: VMs not running: {missing}')
            sys.exit(1)
        print(f'All expected VMs running: {running & expected}')
        "
      register: vm_check
      ignore_errors: true

    - name: Check all expected LXCs are running
      ansible.builtin.shell: |
        pvesh get /nodes/proxmox/lxc --output-format json |
        python3 -c "
        import sys, json
        cts = json.load(sys.stdin)
        # LXC 303 (mcp-gateway) intentionally excluded — onboot=0, operator-managed
        expected = {210, 221, 222, 223, 225, 227, 301, 302, 304}
        running = {c['vmid'] for c in cts if c.get('status') == 'running'}
        missing = expected - running
        if missing:
            print(f'WARN: LXCs not running: {missing}')
            sys.exit(1)
        print(f'All expected LXCs running: {running & expected}')
        "
      register: lxc_check
      ignore_errors: true
    - name: Clean up old maintenance snapshots (older than 7 days)
      ansible.builtin.shell: |
        cutoff=$(date -d '7 days ago' +%s);
        for vmid in $(pvesh get /nodes/proxmox/qemu --output-format json |
        python3 -c "import sys,json; [print(v['vmid']) for v in json.load(sys.stdin)]"); do
          for snap in $(pvesh get /nodes/proxmox/qemu/$vmid/snapshot --output-format json |
          python3 -c "import sys,json; [print(s['name']) for s in json.load(sys.stdin) if s['name'].startswith('pre-maintenance-')]" 2>/dev/null); do
            snap_date=$(echo $snap | sed 's/pre-maintenance-//');
            snap_epoch=$(date -d "$snap_date" +%s 2>/dev/null);
            if [ -z "$snap_epoch" ]; then
              echo "WARN: could not parse date for snapshot $snap on VM $vmid";
            elif [ "$snap_epoch" -lt "$cutoff" ]; then
              pvesh delete /nodes/proxmox/qemu/$vmid/snapshot/$snap && echo "Deleted $snap from VM $vmid";
            fi
          done
        done;
        for ctid in $(pvesh get /nodes/proxmox/lxc --output-format json |
        python3 -c "import sys,json; [print(c['vmid']) for c in json.load(sys.stdin)]"); do
          for snap in $(pvesh get /nodes/proxmox/lxc/$ctid/snapshot --output-format json |
          python3 -c "import sys,json; [print(s['name']) for s in json.load(sys.stdin) if s['name'].startswith('pre-maintenance-')]" 2>/dev/null); do
            snap_date=$(echo $snap | sed 's/pre-maintenance-//');
            snap_epoch=$(date -d "$snap_date" +%s 2>/dev/null);
            if [ -z "$snap_epoch" ]; then
              echo "WARN: could not parse date for snapshot $snap on LXC $ctid";
            elif [ "$snap_epoch" -lt "$cutoff" ]; then
              pvesh delete /nodes/proxmox/lxc/$ctid/snapshot/$snap && echo "Deleted $snap from LXC $ctid";
            fi
          done
        done;
        echo "Snapshot cleanup complete"
      ignore_errors: true

    - name: Display validation results
      ansible.builtin.debug:
        msg:
          - "VM status: {{ vm_check.stdout }}"
          - "LXC status: {{ lxc_check.stdout }}"
          - "Maintenance reboot complete — post-reboot startup finished"


@ -0,0 +1,15 @@
[Unit]
Description=Monthly Proxmox maintenance reboot (Ansible)
After=network-online.target
Wants=network-online.target

[Service]
Type=oneshot
User=cal
WorkingDirectory=/opt/ansible
ExecStart=/usr/bin/ansible-playbook /opt/ansible/playbooks/monthly-reboot.yml
StandardOutput=append:/opt/ansible/logs/monthly-reboot.log
StandardError=append:/opt/ansible/logs/monthly-reboot.log
TimeoutStartSec=900

# No [Install] section — this service is activated exclusively by ansible-monthly-reboot.timer


@ -0,0 +1,13 @@
[Unit]
Description=Monthly Proxmox maintenance reboot timer
Documentation=https://git.manticorum.com/cal/claude-home/src/branch/main/server-configs/proxmox/maintenance-reboot.md

[Timer]
# First Sunday of the month at 08:00 UTC (3:00 AM ET during EDT)
# Day range 01-07 ensures it's always the first occurrence of that weekday
OnCalendar=Sun *-*-01..07 08:00:00
Persistent=true
RandomizedDelaySec=600

[Install]
WantedBy=timers.target


@ -0,0 +1,21 @@
[Unit]
Description=Post-reboot controlled guest startup (Ansible)
After=network-online.target
Wants=network-online.target
# Only run after a fresh boot — not on service restart
ConditionUpTimeSec=600

[Service]
Type=oneshot
User=cal
WorkingDirectory=/opt/ansible
# Delay 120s to let Proxmox API stabilize and onboot guests settle
ExecStartPre=/bin/sleep 120
ExecStart=/usr/bin/ansible-playbook /opt/ansible/playbooks/post-reboot-startup.yml
StandardOutput=append:/opt/ansible/logs/post-reboot-startup.log
StandardError=append:/opt/ansible/logs/post-reboot-startup.log
TimeoutStartSec=1800

[Install]
# Runs automatically on every boot of LXC 304
WantedBy=multi-user.target


@ -0,0 +1,95 @@
---
title: "ACE-Step 1.5 — Local Network Setup Guide"
description: "How to run ACE-Step AI music generator on the local network via Gradio UI or REST API, including .env configuration and startup notes."
type: guide
domain: development
tags: [ace-step, ai, music-generation, gradio, gpu, cuda]
---
# ACE-Step 1.5 — Local Network Setup
ACE-Step is an open-source AI music generation model. This guide covers running it on the workstation and serving the Gradio web UI to the local network.
## Location
```
/mnt/NV2/Development/ACE-Step-1.5/
```
Cloned from GitHub. Uses `uv` for dependency management — the `.venv` is created automatically on first run.
## Quick Start (Gradio UI)
```bash
cd /mnt/NV2/Development/ACE-Step-1.5
./start_gradio_ui.sh
```
Accessible from any device on the network at **http://10.10.0.41:7860** (or whatever the workstation IP is).
## .env Configuration
The `.env` file in the project root persists settings across git updates. Current config:
```env
SERVER_NAME=0.0.0.0
PORT=7860
LANGUAGE=en
```
### Key Settings
| Variable | Default | Description |
|----------|---------|-------------|
| `SERVER_NAME` | `127.0.0.1` | Set to `0.0.0.0` for LAN access |
| `PORT` | `7860` | Gradio UI port |
| `LANGUAGE` | `en` | UI language (`en`, `zh`, `he`, `ja`). **Must be set** — empty value causes `unbound variable` error with the launcher's `set -u` |
| `ACESTEP_CONFIG_PATH` | `acestep-v15-turbo` | DiT model variant |
| `ACESTEP_LM_MODEL_PATH` | `acestep-5Hz-lm-0.6B` | Language model for lyrics/prompts |
| `ACESTEP_INIT_LLM` | `auto` | `auto` / `true` / `false` — auto detects based on VRAM |
| `CHECK_UPDATE` | `true` | Set to `false` to skip interactive update prompt (useful for background/automated starts) |
See `.env.example` for the full list.
## REST API Server (Alternative)
For programmatic access instead of the web UI:
```bash
cd /mnt/NV2/Development/ACE-Step-1.5
./start_api_server.sh
```
Default: `http://127.0.0.1:8001`. To serve on LAN, edit `start_api_server.sh` line 12:
```bash
HOST="0.0.0.0"
```
API docs available at `http://<ip>:8001/docs`.
## Hardware Profile (Workstation)
- **GPU**: NVIDIA RTX 4080 SUPER (16 GB VRAM)
- **Tier**: 16 GB class — auto-enables CPU offload, INT8 quantization, LLM
- **Max batch (with LM)**: 4
- **Max batch (without LM)**: 8
- **Max duration (with LM)**: 480s (8 min)
- **Max duration (without LM)**: 600s (10 min)
## Startup Behavior
1. Loads `.env` configuration
2. Checks for git updates (interactive prompt — set `CHECK_UPDATE=false` to skip)
3. Creates `.venv` via `uv sync` if missing (slow on first run)
4. Runs legacy NVIDIA torch compatibility check
5. Loads DiT model → quantizes to INT8 → loads LM → allocates KV cache
6. Launches Gradio with queue for multi-user support
Full startup takes ~30-40 seconds after first run.
## Gotchas
- **LANGUAGE must be set in `.env`**: The system `$LANGUAGE` locale variable can be empty, causing the launcher to crash with `unbound variable` due to `set -u`. Always include `LANGUAGE=en` in `.env`.
- **Update prompt blocks background execution**: If running headlessly or from a script, set `CHECK_UPDATE=false` to avoid the interactive Y/N prompt.
- **Model downloads**: First run downloads ~4-5 GB of model weights from HuggingFace. Subsequent runs use cached checkpoints in `./checkpoints/`.
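The `LANGUAGE` gotcha is easy to reproduce in isolation. A minimal sketch of the `set -u` failure mode (illustrative, not the actual launcher code):

```shell
set -u                # the launcher's strict mode
unset LANGUAGE        # simulate a host where the locale variable is unset/empty
# Referencing "$LANGUAGE" bare here would abort with "unbound variable".
# Defaulting it first (the effect of LANGUAGE=en in .env) avoids the crash:
LANGUAGE="${LANGUAGE:-en}"
echo "UI language: $LANGUAGE"
```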


@ -0,0 +1,80 @@
---
title: "Fix: Subagent Write/Edit tools blocked by permission mode mismatch"
description: "Claude Code subagents cannot use Write or Edit tools unless spawned with mode: acceptEdits — other permission modes (dontAsk, auto, bypassPermissions) do not grant file-write capability."
type: troubleshooting
domain: development
tags: [troubleshooting, claude-code, permissions, agents, subagents]
---
# Fix: Subagent Write/Edit tools blocked by permission mode mismatch
**Date:** 2026-03-28
**Severity:** Medium — blocks all agent-driven code generation workflows until identified
## Problem
When orchestrating multi-agent code generation (spawning engineer agents to write code in parallel), all subagents could Read/Glob/Grep files but Write and Edit tool calls were silently denied. Agents would complete their analysis, prepare the full file content, then report "blocked on Write/Edit permission."
This happened across **every** permission mode tried:
- `mode: bypassPermissions` — denied (with worktree isolation)
- `mode: auto` — denied (with and without worktree isolation)
- `mode: dontAsk` — denied (with and without worktree isolation)
## Root Cause
Claude Code's Agent tool has multiple permission modes that control different things:

| Mode | What it controls | Grants Write/Edit? |
|------|-----------------|-------------------|
| `default` | User prompted for each tool call | No — user must approve each |
| `dontAsk` | Suppresses user prompts | **No** — suppresses prompts but doesn't grant capability |
| `auto` | Auto-approves based on context | **No** — same issue |
| `bypassPermissions` | Skips permission-manager hooks | **No** — only bypasses plugin hooks, not tool-level gates |
| `acceptEdits` | Grants file modification capability | **Yes** — this is the correct mode |
The key distinction: `dontAsk`/`auto`/`bypassPermissions` control the **user-facing permission prompt** (whether the user gets asked to approve). But Write/Edit tools have an **internal capability gate** that checks whether the agent was explicitly authorized to modify files. Only `acceptEdits` provides that authorization.
## Additional Complication: permission-manager plugin
The `permission-manager@agent-toolkit` plugin (`cmd-gate` PreToolUse hook) adds a second layer that blocks Bash-based file writes (output redirection `>`, `tee`, etc.). When agents fell back to Bash after Write/Edit failed, the plugin caught those too.
- `bypassPermissions` mode is documented to skip cmd-gate entirely, but this didn't work reliably in worktree isolation
- Disabling the plugin (`/plugin` → toggle off `permission-manager@agent-toolkit`, then `/reload-plugins`) removed the Bash-level blocks but did NOT fix Write/Edit
## Fix
**Use `mode: acceptEdits`** when spawning any agent that needs to create or modify files:
```
Agent(
subagent_type="engineer",
mode="acceptEdits", # <-- This is the critical setting
prompt="..."
)
```
**Additional recommendations:**
- Worktree isolation (`isolation: "worktree"`) may compound permission issues — avoid it unless the agents genuinely need isolation (e.g., conflicting file edits)
- For agents that only read (reviewers, validators), any mode works
- If the permission-manager plugin is also blocking Bash fallbacks, disable it temporarily or add classifiers for the specific commands needed
## Reproduction
1. Spawn an engineer agent with `mode: dontAsk` and a prompt to create a new file
2. Agent will Read reference files successfully, prepare content, then report Write tool denied
3. Change to `mode: acceptEdits` — same prompt succeeds immediately
## Environment
- Claude Code CLI on Linux (Nobara/Fedora)
- Plugins: permission-manager@agent-toolkit (St0nefish/agent-toolkit)
- Agent types tested: engineer, general-purpose
- Models tested: sonnet subagents
## Lessons
- **Always use `acceptEdits` for code-writing agents.** The mode name is the clue — it's not just "accepting" edits from the user, it's granting the agent the capability to make edits.
- **`dontAsk` ≠ "can do anything."** It means "don't prompt the user" — but the capability to write files is a separate authorization layer.
- **Test agent permissions early.** When building a multi-agent orchestration workflow, verify the first agent can actually write before launching a full wave. A quick single-file test agent saves time.
- **Worktree isolation adds complexity.** Only use it when agents would genuinely conflict on the same files. For non-overlapping file changes, skip isolation.
- **The permission-manager plugin is a separate concern.** It blocks Bash file-write commands (>, tee, cat heredoc). Disabling it fixes Bash fallbacks but not Write/Edit tool calls. Both layers must be addressed independently.
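Per the "test agent permissions early" lesson, a cheap preflight before launching a full wave might look like this (prompt and filename purely illustrative):

```
Agent(
    subagent_type="engineer",
    mode="acceptEdits",
    prompt="Create scratch/write-probe.txt containing the single word 'ok', then stop."
)
```

If the probe reports a Write denial, fix the mode before spawning the rest of the wave.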


@ -0,0 +1,50 @@
---
title: "MLB The Show Grind — 2026.4.02"
description: "Pack opening command, full cycle orchestrator, keyboard dismiss fix, package split."
type: reference
domain: gaming
tags: [release-notes, deployment, mlb-the-show, automation]
---
# MLB The Show Grind — 2026.4.02
**Date:** 2026-04-02
**Project:** mlb-the-show (`/mnt/NV2/Development/mlb-the-show`)
## Release Summary
Added pack opening automation and a full buy→exchange→open cycle command. Fixed a critical bug where KEYCODE_BACK was closing the buy order modal instead of dismissing the keyboard, preventing all order placement. Split the 1600-line single-file script into a proper Python package.
## Changes
### New Features
- **`open-packs` command** — navigates to My Packs, finds the target pack by name (default: Exchange - Live Series Gold), rapid-taps Open Next at ~0.3s/pack with periodic verification
- **`cycle` command** — full orchestrated flow: buy silvers for specified OVR tiers → exchange all dupes into gold packs → open all gold packs
- **`DEFAULT_PACK_NAME` constant** — `"Exchange - Live Series Gold"` extracted from inline strings
### Bug Fixes
- **Keyboard dismiss fix** — `KEYCODE_BACK` was closing the entire buy order modal instead of just dismissing the numeric keyboard. Replaced with `tap(540, 900)` to tap a neutral area. This was the root cause of all buy orders silently failing (0 orders placed despite cards having room).
- **`full_cycle` passed no args to `open_packs()`** — now passes `packs_exchanged` count to bound the open loop
- **`isinstance(result, dict)` dead code** removed from `full_cycle` — `grind_exchange` always returns `int`
- **`_find_nearest_open_button`** — added x-column constraint (200px) and zero-width element filtering to prevent matching ghost buttons from collapsed packs
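The `_find_nearest_open_button` fix can be sketched as follows (element shape, function name, and thresholds are illustrative of the described behavior, not the actual grind code):

```python
def find_nearest_open_button(elements, anchor_x, anchor_y, x_tolerance=200):
    """Pick the closest 'Open' button to the anchor point, skipping ghost
    (zero-width) elements and anything outside the anchor's x column."""
    candidates = [
        e for e in elements
        if e["text"] == "Open"
        and e["width"] > 0                         # zero-width element filtering
        and abs(e["x"] - anchor_x) <= x_tolerance  # x-column constraint (200px)
    ]
    if not candidates:
        return None
    # Nearest by vertical distance among the surviving candidates
    return min(candidates, key=lambda e: abs(e["y"] - anchor_y))
```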
### Refactoring
- **Package split** — `scripts/grind.py` (1611 lines) → `scripts/grind/` package:
- `constants.py` (104 lines) — coordinates, price gates, UI element maps
- `adb_utils.py` (125 lines) — ADB shell, tap, swipe, dump_ui, element finders
- `navigation.py` (107 lines) — screen navigation (nav_to, nav_tab, FAB)
- `exchange.py` (283 lines) — gold exchange logic
- `market.py` (469 lines) — market scanning and buy order placement
- `packs.py` (131 lines) — pack opening
- `__main__.py` (390 lines) — CLI entry point and orchestrators (grind_loop, full_cycle)
- `scripts/grind.py` retained as a thin wrapper for `uv run` backward compatibility
- Invocation changed from `uv run scripts/grind.py` to `PYTHONPATH=scripts python3 -m grind`
- Raw `adb("input swipe ...")` calls replaced with `swipe()` helper
## Session Stats
- **Buy orders placed:** 532 orders across two runs (474 + 58)
- **Stubs spent:** ~63,655
- **Gold packs exchanged:** 155 (94 + 61)
- **Gold packs opened:** 275
- **OVR tiers worked:** 77 (primary), 78 (all above max price)


@ -214,6 +214,58 @@ For full HDR setup (vk-hdr-layer, KDE config, per-API env vars), see the **steam
**Diagnostic tip**: Look for rapid retry patterns in Pi-hole logs (same domain queried every 1-3s from the Xbox IP) — this signals a blocked domain causing timeout loops.
## Gray Zone Warfare — EAC Failures on Proton (2026-03-31) [RESOLVED]
**Severity:** High — game unplayable online
**Status:** RESOLVED — corrupted prebuild world cache file
**Problem:** EAC errors when connecting to servers on Linux/Proton. Three error codes observed across attempts:
- `0x0002000A` — "The client failed an anti-cheat client runtime check" (the actual root cause)
- `0x0002000F` — "The client failed to register in time" (downstream timeout)
- `0x00020011` — "The client failed to start the session" (downstream session failure)
Game launches fine, EAC bootstrapper reports success, but fails when joining a server at "Synchronizing Live Data".
**Root Cause:** A corrupted/stale prebuild world cache file that EAC flagged during runtime checks:
```
LogEOSAntiCheat: [AntiCheatClient] [PollStatusInternal] Client Violation with Type: 5
Message: Unknown file version (GZW/Content/SKALLA/PrebuildWorldData/World/cache/0xb9af63cee2e43b6c_0x3cb3b3354fb31606.dat)
```
EAC scanned this file, found an unrecognized version, and flagged a client violation. The other errors (`0x0002000F`, `0x00020011`) were downstream consequences — EAC couldn't complete session registration after the violation.
Compounding factors that made diagnosis harder:
- Epic EOS scheduled maintenance (Fortnite v40.10, Apr 1 08:00-09:30 UTC) returned 503s from `api.epicgames.dev/auth/v1/oauth/token`, masking the real issue
- `steam_api64.dll` EOS SDK errors at startup are **benign noise** under Proton — red herring
- Nuking the compatdata prefix and upgrading Proton happened concurrently, adding confusion
**Fix:**
1. Delete the specific cache file: `rm "GZW/Content/SKALLA/PrebuildWorldData/World/cache/0xb9af63cee2e43b6c_0x3cb3b3354fb31606.dat"`
2. Verify game files in Steam — Steam redownloads a fresh copy with different hash
3. Launch game — clean logs, no EAC errors
Key detail: the file was the same size (60.7MB) before and after, but different md5 hash — Steam's verify replaced it with a corrected version.
**Log locations:**
- EAC bootstrapper: `compatdata/2479810/pfx/drive_c/users/steamuser/AppData/Roaming/EasyAntiCheat/.../anticheatlauncher.log`
- Game log: `compatdata/2479810/pfx/drive_c/users/steamuser/AppData/Local/GZW/Saved/Logs/GZW.log`
- STL launch log: `~/.config/steamtinkerlaunch/logs/gamelaunch/id/2479810.log`
**What did NOT fix it (for reference):**
1. Installing Proton EasyAntiCheat Runtime (AppID 1826330) — good to have but not the issue
2. Deleting the entire cache directory without re-verifying — Steam verify re-downloaded the same bad file the first time (20 files fixed); needed a second targeted delete + verify
3. Nuking compatdata prefix for clean rebuild
4. Switching Proton versions (GE-Proton9-25 ↔ GE-Proton10-25)
**Lessons:**
- When EAC logs show "Unknown file version" for a specific `.dat` file, delete that file and verify — don't nuke the whole cache or prefix
- `steam_api64.dll` EOS errors are benign under Proton and not related to EAC failures
- Check Epic's status page for scheduled maintenance before deep-diving Proton issues
- Multiple verify-and-fix cycles may be needed — the first verify can redownload a stale cached version from Steam's CDN
**Game version:** 0.4.0.0-231948-H (EA Pre-Alpha)
**Working Proton:** GE-Proton10-25
**STL config:** `~/.config/steamtinkerlaunch/gamecfgs/id/2479810.conf`
## Useful Commands
### Check Running Game Process


@ -21,7 +21,7 @@
{
"parameters": {
"operation": "executeCommand",
"command": "/root/.local/bin/claude -p \"Run python3 ~/.claude/skills/server-diagnostics/client.py health paper-dynasty and analyze the results. If any containers are not running or there are critical issues, summarize them. Otherwise just say 'All systems healthy'.\" --output-format json --json-schema '{\"type\":\"object\",\"properties\":{\"status\":{\"type\":\"string\",\"enum\":[\"healthy\",\"issues_found\"]},\"summary\":{\"type\":\"string\"},\"root_cause\":{\"type\":\"string\"},\"severity\":{\"type\":\"string\",\"enum\":[\"low\",\"medium\",\"high\",\"critical\"]},\"affected_services\":{\"type\":\"array\",\"items\":{\"type\":\"string\"}},\"actions_taken\":{\"type\":\"array\",\"items\":{\"type\":\"string\"}}},\"required\":[\"status\",\"severity\",\"summary\"]}' --allowedTools \"Read,Grep,Glob,Bash(python3 ~/.claude/skills/server-diagnostics/client.py *)\"",
"command": "/root/.local/bin/claude -p \"Run python3 ~/.claude/skills/server-diagnostics/client.py health paper-dynasty and analyze the results. If any containers are not running or there are critical issues, summarize them. Otherwise just say 'All systems healthy'.\" --output-format json --json-schema '{\"type\":\"object\",\"properties\":{\"status\":{\"type\":\"string\",\"enum\":[\"healthy\",\"issues_found\"]},\"summary\":{\"type\":\"string\"},\"root_cause\":{\"type\":\"string\"},\"severity\":{\"type\":\"string\",\"enum\":[\"low\",\"medium\",\"high\",\"critical\"]},\"affected_services\":{\"type\":\"array\",\"items\":{\"type\":\"string\"}},\"actions_taken\":{\"type\":\"array\",\"items\":{\"type\":\"string\"}}},\"required\":[\"status\",\"severity\",\"summary\"]}' --allowedTools \"Read,Grep,Glob,Bash(python3 ~/.claude/skills/server-diagnostics/client.py *)\" --append-system-prompt \"You are a server diagnostics agent. Use the server-diagnostics skill client.py for all operations. Never run destructive commands.\"",
"options": {}
},
"id": "ssh-claude-code",
@ -75,20 +75,48 @@
"typeVersion": 2,
"position": [660, 0]
},
{
"parameters": {
"operation": "executeCommand",
"command": "=/root/.local/bin/claude -p \"The previous health check found issues. Investigate deeper: check container logs, resource usage, and recent events. Provide a detailed root cause analysis and recommended remediation steps.\" --resume \"{{ $json.session_id }}\" --output-format json --json-schema '{\"type\":\"object\",\"properties\":{\"root_cause_detail\":{\"type\":\"string\"},\"container_logs\":{\"type\":\"string\"},\"resource_status\":{\"type\":\"string\"},\"remediation_steps\":{\"type\":\"array\",\"items\":{\"type\":\"string\"}},\"requires_human\":{\"type\":\"boolean\"}},\"required\":[\"root_cause_detail\",\"remediation_steps\",\"requires_human\"]}' --allowedTools \"Read,Grep,Glob,Bash(python3 ~/.claude/skills/server-diagnostics/client.py *)\" --max-turns 15 --append-system-prompt \"You are a server diagnostics agent performing a follow-up investigation. The initial health check found issues. Dig deeper into logs and metrics. Never run destructive commands.\"",
"options": {}
},
"id": "ssh-followup",
"name": "Follow Up Diagnostics",
"type": "n8n-nodes-base.ssh",
"typeVersion": 1,
"position": [880, -200],
"credentials": {
"sshPassword": {
"id": "REPLACE_WITH_CREDENTIAL_ID",
"name": "Claude Code LXC"
}
}
},
{
"parameters": {
"jsCode": "// Parse follow-up diagnostics response\nconst stdout = $input.first().json.stdout || '';\nconst initial = $('Parse Claude Response').first().json;\n\ntry {\n const response = JSON.parse(stdout);\n const data = response.structured_output || JSON.parse(response.result || '{}');\n \n return [{\n json: {\n ...initial,\n followup: {\n root_cause_detail: data.root_cause_detail || 'No detail available',\n container_logs: data.container_logs || '',\n resource_status: data.resource_status || '',\n remediation_steps: data.remediation_steps || [],\n requires_human: data.requires_human || false,\n cost_usd: response.total_cost_usd,\n session_id: response.session_id\n },\n total_cost_usd: (initial.cost_usd || 0) + (response.total_cost_usd || 0)\n }\n }];\n} catch (e) {\n return [{\n json: {\n ...initial,\n followup: {\n error: e.message,\n root_cause_detail: 'Follow-up parse failed',\n remediation_steps: [],\n requires_human: true\n },\n total_cost_usd: initial.cost_usd || 0\n }\n }];\n}"
},
"id": "parse-followup",
"name": "Parse Follow-up Response",
"type": "n8n-nodes-base.code",
"typeVersion": 2,
"position": [1100, -200]
},
{
"parameters": {
"method": "POST",
"url": "https://discord.com/api/webhooks/1451783909409816763/O9PMDiNt6ZIWRf8HKocIZ_E4vMGV_lEwq50aAiZ9HVFR2UGwO6J1N9_wOm82p0MetIqT",
"sendBody": true,
"specifyBody": "json",
"jsonBody": "={\n \"embeds\": [{\n \"title\": \"{{ $json.severity === 'critical' ? '🔴' : $json.severity === 'high' ? '🟠' : '🟡' }} Server Alert\",\n \"description\": {{ JSON.stringify($json.summary) }},\n \"color\": {{ $json.severity === 'critical' ? 15158332 : $json.severity === 'high' ? 15105570 : 16776960 }},\n \"fields\": [\n {\n \"name\": \"Severity\",\n \"value\": \"{{ $json.severity.toUpperCase() }}\",\n \"inline\": true\n },\n {\n \"name\": \"Server\",\n \"value\": \"paper-dynasty (10.10.0.88)\",\n \"inline\": true\n },\n {\n \"name\": \"Cost\",\n \"value\": \"${{ $json.cost_usd ? $json.cost_usd.toFixed(4) : '0.0000' }}\",\n \"inline\": true\n },\n {\n \"name\": \"Root Cause\",\n \"value\": \"{{ $json.root_cause || 'N/A' }}\",\n \"inline\": false\n },\n {\n \"name\": \"Affected Services\",\n \"value\": \"{{ $json.affected_services.length ? $json.affected_services.join(', ') : 'None' }}\",\n \"inline\": false\n },\n {\n \"name\": \"Actions Taken\",\n \"value\": \"{{ $json.actions_taken.length ? $json.actions_taken.join('\\n') : 'None' }}\",\n \"inline\": false\n }\n ],\n \"timestamp\": \"{{ new Date().toISOString() }}\"\n }]\n}",
"jsonBody": "={\n \"embeds\": [{\n \"title\": \"{{ $json.severity === 'critical' ? '🔴' : $json.severity === 'high' ? '🟠' : '🟡' }} Server Alert\",\n \"description\": {{ JSON.stringify($json.summary) }},\n \"color\": {{ $json.severity === 'critical' ? 15158332 : $json.severity === 'high' ? 15105570 : 16776960 }},\n \"fields\": [\n {\n \"name\": \"Severity\",\n \"value\": \"{{ $json.severity.toUpperCase() }}\",\n \"inline\": true\n },\n {\n \"name\": \"Server\",\n \"value\": \"paper-dynasty (10.10.0.88)\",\n \"inline\": true\n },\n {\n \"name\": \"Cost\",\n \"value\": \"${{ $json.total_cost_usd ? $json.total_cost_usd.toFixed(4) : '0.0000' }}\",\n \"inline\": true\n },\n {\n \"name\": \"Root Cause\",\n \"value\": {{ JSON.stringify(($json.followup && $json.followup.root_cause_detail) || $json.root_cause || 'N/A') }},\n \"inline\": false\n },\n {\n \"name\": \"Affected Services\",\n \"value\": \"{{ $json.affected_services.length ? $json.affected_services.join(', ') : 'None' }}\",\n \"inline\": false\n },\n {\n \"name\": \"Remediation Steps\",\n \"value\": {{ JSON.stringify(($json.followup && $json.followup.remediation_steps.length) ? $json.followup.remediation_steps.map((s, i) => (i+1) + '. ' + s).join('\\n') : ($json.actions_taken.length ? $json.actions_taken.join('\\n') : 'None')) }},\n \"inline\": false\n },\n {\n \"name\": \"Requires Human?\",\n \"value\": \"{{ ($json.followup && $json.followup.requires_human) ? '⚠️ Yes' : '✅ No' }}\",\n \"inline\": true\n }\n ],\n \"timestamp\": \"{{ new Date().toISOString() }}\"\n }]\n}",
"options": {}
},
"id": "discord-alert",
"name": "Discord Alert",
"type": "n8n-nodes-base.httpRequest",
"typeVersion": 4.2,
"position": [880, -100]
"position": [1320, -200]
},
{
"parameters": {
@ -145,7 +173,7 @@
"main": [
[
{
"node": "Discord Alert",
"node": "Follow Up Diagnostics",
"type": "main",
"index": 0
}
@ -158,6 +186,28 @@
}
]
]
},
"Follow Up Diagnostics": {
"main": [
[
{
"node": "Parse Follow-up Response",
"type": "main",
"index": 0
}
]
]
},
"Parse Follow-up Response": {
"main": [
[
{
"node": "Discord Alert",
"type": "main",
"index": 0
}
]
]
}
},
"settings": {


@ -0,0 +1,34 @@
---
title: "Database API Release — 2026.4.1"
description: "Query limit caps to prevent worker timeouts, plus hotfix to exempt /players endpoint."
type: reference
domain: major-domo
tags: [release-notes, deployment, database, hotfix]
---
# Database API Release — 2026.4.1
**Date:** 2026-04-01
**Tag:** `2026.3.7` + 3 post-tag commits (CI auto-generates CalVer on merge)
**Image:** `manticorum67/major-domo-database`
**Server:** akamai (`~/container-data/sba-database`)
**Deploy method:** `docker compose pull && docker compose down && docker compose up -d`
## Release Summary
Added bounded pagination (`MAX_LIMIT=500`, `DEFAULT_LIMIT=200`) to all list endpoints to prevent Gunicorn worker timeouts caused by unbounded queries. Two follow-up fixes corrected response `count` fields in fieldingstats that were computed after the limit was applied. A hotfix (PR #103) then removed the caps from the `/players` endpoint specifically, since the bot and website depend on fetching full player lists.
## Changes
### Bug Fixes
- **PR #99** — Fix unbounded API queries causing Gunicorn worker timeouts. Added `MAX_LIMIT=500` and `DEFAULT_LIMIT=200` constants in `dependencies.py`, enforced `le=MAX_LIMIT` on all list endpoints. Added middleware to strip empty query params, preventing validation bypass.
- **PR #100** — Fix fieldingstats `get_fieldingstats` count: captured `total_count` before `.limit()` so the response reflects total rows, not page size.
- **PR #101** — Fix fieldingstats `get_totalstats`: removed line that overwrote `count` with `len(page)` after it was correctly set from `total_count`.
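PRs #100 and #101 are the same ordering bug in two places; a generic illustration (plain Python, not the actual ORM code):

```python
# Generic illustration of the count-before-limit ordering bug fixed in
# PRs #100/#101 — names here are illustrative, not the real code.
rows = list(range(1000))   # stand-in for the full matching result set
limit = 200

total_count = len(rows)    # capture the total BEFORE applying the limit
page = rows[:limit]        # then paginate

# If count were computed after slicing, it would report 200, not 1000.
response = {"count": total_count, "results": page}
```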
### Hotfix
- **PR #103** — Remove output caps from `GET /api/v3/players`. Reverted `limit` param to `Optional[int] = Query(default=None, ge=1)` (no ceiling). The `/players` table is a bounded dataset (~1500 rows/season) and consumers depend on uncapped results. All other endpoints retain their caps.
## Deployment Notes
- No migrations required
- No config changes
- Rollback: `docker compose pull manticorum67/major-domo-database:<previous-tag> && docker compose down && docker compose up -d`

View File

@ -0,0 +1,38 @@
---
title: "Discord Bot Release — 2026.3.13"
description: "Enforce free agency lock deadline — block /dropadd FA pickups after week 14, plus performance batch from backlog issues."
type: reference
domain: major-domo
tags: [release-notes, deployment, discord, major-domo]
---
# Discord Bot Release — 2026.3.13
**Date:** 2026-03-31
**Tag:** `2026.3.13`
**Image:** `manticorum67/major-domo-discordapp:2026.3.13` / `:production`
**Server:** akamai (`~/container-data/major-domo`)
**Deploy method:** `.scripts/deploy.sh -y` (docker compose pull + up)
## Release Summary
Enforces the previously unused `fa_lock_week` config (week 14) in the transaction builder. After the deadline, `/dropadd` blocks adding players FROM Free Agency while still allowing drops TO FA. Also includes a batch of performance PRs from the backlog that were merged between 2026.3.12 and this tag.
## Changes
### New Features
- **Free agency lock enforcement** — `TransactionBuilder.add_move()` now checks `current_week >= fa_lock_week` and rejects FA pickups after the deadline. Dropping to FA remains allowed. Config already existed at `fa_lock_week = 14` but was never enforced. (PR #122)
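A minimal sketch of the guard (names assumed from the description above, not the actual bot code):

```python
FA_LOCK_WEEK = 14  # existing config value, now actually enforced

def fa_move_allowed(current_week: int, from_fa: bool) -> bool:
    """After the lock week, adding FROM Free Agency is blocked;
    dropping TO Free Agency is always allowed."""
    if from_fa and current_week >= FA_LOCK_WEEK:
        return False
    return True
```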
### Performance
- Eliminate redundant API calls in trade views (PR #116, issue #94)
- Eliminate redundant GET after create/update and parallelize stats (PR #112, issue #95)
- Parallelize N+1 player/creator lookups with `asyncio.gather()` (PR #118, issue #89)
- Consolidate duplicate `league_service.get_current_state()` calls in `add_move()` into a single shared fetch (PR #122)
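The N+1 parallelization pattern from PR #118 looks roughly like this (stand-in coroutine, not the real service code):

```python
import asyncio

async def fetch_player(pid: int) -> dict:
    # stand-in for the real per-player API call
    await asyncio.sleep(0)
    return {"id": pid}

async def fetch_players(player_ids: list[int]) -> list[dict]:
    # issue all lookups concurrently instead of awaiting them one by one
    return await asyncio.gather(*(fetch_player(p) for p in player_ids))

players = asyncio.run(fetch_players([1, 2, 3]))
```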
### Bug Fixes
- Fix race condition: use per-user dict for `_checked_teams` in trade views (PR #116)
## Deployment Notes
- No migrations required
- No config changes needed — `fa_lock_week = 14` already existed in config
- Rollback: `ssh akamai "cd ~/container-data/major-domo && docker pull manticorum67/major-domo-discordapp@sha256:94d59135f127d5863b142136aeeec9d63b06ee63e214ef59f803cedbd92b473e && docker tag manticorum67/major-domo-discordapp@sha256:94d59135f127d5863b142136aeeec9d63b06ee63e214ef59f803cedbd92b473e manticorum67/major-domo-discordapp:production && docker compose up -d discord-app"`

View File

@ -0,0 +1,86 @@
---
title: "Discord Bot Release — 2026.3.12"
description: "Major catch-up release: trade deadline enforcement, performance parallelization, security fixes, CI/CD migration to CalVer, and 148 commits of accumulated improvements."
type: reference
domain: major-domo
tags: [release-notes, deployment, discord, major-domo]
---
# Discord Bot Release — 2026.3.12
**Date:** 2026-03-31
**Tag:** `2026.3.12`
**Image:** `manticorum67/major-domo-discordapp:2026.3.12` / `:production`
**Server:** akamai (`~/container-data/major-domo`)
**Deploy method:** `.scripts/deploy.sh -y` (docker compose pull + up)
**Previous tag:** `v2.29.4` (148 commits behind)
## Release Summary
Large catch-up release covering months of accumulated work since the last tag. The headline feature is trade deadline enforcement — `/trade` commands are now blocked after the configured deadline week, with fail-closed behavior when API data is unavailable. Also includes significant performance improvements (parallelized API calls, cached signatures, Redis SCAN), security hardening, dependency pinning, and a full CI/CD migration from version-file bumps to CalVer tag-triggered builds.
## Changes
### New Features
- **Trade deadline enforcement** — `is_past_trade_deadline` property on Current model; guards on `/trade initiate`, submit button, and `_finalize_trade`. Fail-closed when API returns no data. 4 new tests. (PR #121)
- `is_admin()` helper in `utils/permissions.py` (#55)
- Team ownership verification on `/injury set-new` and `/injury clear` (#18)
- Current week number included in weekly-info channel posts
- Local deploy script for production deploys
### Performance
- Parallelize independent API calls with `asyncio.gather()` (#90)
- Cache `inspect.signature()` at decoration time (#97)
- Replace `json.dumps` serialization test with `isinstance` fast path (#96)
- Use `channel.purge()` instead of per-message delete loops (#93)
- Parallelize schedule_service week fetches (#88)
- Replace Redis `KEYS` with `SCAN` in `clear_prefix` (#98)
- Reuse persistent `aiohttp.ClientSession` in GiphyService (#26)
- Cache user team lookup in player_autocomplete, reduce limit to 25
### Bug Fixes
- Fix chart_service path from `data/` to `storage/`
- Make ScorecardTracker methods async to match await callers
- Prevent partial DB writes and show detailed errors on scorecard submission failure
- Add trailing slashes to API URLs to prevent 307 redirects dropping POST bodies
- Trade validation: check against next week's projected roster, include pending trades and org affiliate transactions
- Prefix trade validation errors with team abbreviation
- Auto-detect player roster type in trade commands instead of assuming ML
- Fix key plays score text ("tied at X" instead of "Team up X-X") (#48)
- Fix scorebug stale data, win probability parsing, and read-failure tolerance (#39, #40)
- Batch quick-wins: 4 issues resolved (#37, #27, #25, #38)
- Fix ContextualLogger crash when callers pass `exc_info=True`
- Fix thaw report posting to use channel ID instead of channel names
- Use explicit America/Chicago timezone for freeze/thaw scheduling
- Replace broken `@self.tree.interaction_check` with MaintenanceAwareTree subclass
- Implement actual maintenance mode flag in `/admin-maintenance` (#28)
- Validate and sanitize pitching decision data from Google Sheets
- Fix `/player` autocomplete timeout by using current season only
- Split read-only data volume to allow state file writes (#85)
- Update roster labels to use Minor League and Injured List (#59)
### Security
- Address 7 security issues across the codebase
- Remove 226 unused imports (#33)
- Pin all Python dependency versions in `requirements.txt` (#76)
### Refactoring & Cleanup
- Extract duplicate command hash logic into `_compute_command_hash` (#31)
- Move 42 unnecessary lazy imports to top-level
- Remove dead maintenance mode artifacts in bot.py (#104)
- Remove unused `weeks_ahead` parameter from `get_upcoming_games`
- Invalidate roster cache after submission instead of force-refreshing
## Infrastructure Changes
- **CI/CD migration**: Switched from version-file bumps to CalVer tag-triggered Docker builds
- Added `.scripts/release.sh` for creating CalVer tags
- Updated `.scripts/deploy.sh` for tag-triggered releases
- Docker build cache switched from `type=gha` to `type=registry`
- Used `docker-tags` composite action for multi-channel release support
- Fixed act_runner auth with short-form local actions + full GitHub URLs
- Use Gitea API for tag creation to avoid branch protection failures
## Deployment Notes
- No migrations required
- No config changes needed
- Rollback: `ssh akamai "cd ~/container-data/major-domo && docker pull manticorum67/major-domo-discordapp@<previous-digest> && docker tag <digest> manticorum67/major-domo-discordapp:production && docker compose up -d discord-app"`

View File

@ -0,0 +1,59 @@
---
title: "Fix: Gunicorn Worker Timeouts from Unbounded API Queries"
description: "External clients sent limit=99999 and empty filter params through the reverse proxy, causing API workers to timeout and get killed."
type: troubleshooting
domain: major-domo
tags: [troubleshooting, major-domo, database, deployment, docker]
---
# Fix: Gunicorn Worker Timeouts from Unbounded API Queries
**Date:** 2026-04-01
**PR:** cal/major-domo-database#99
**Issues:** #98 (main), #100 (fieldingstats count bug), #101 (totalstats count overwrite, pre-existing)
**Severity:** Critical — active production instability during Season 12, 12 worker timeouts in 2 days and accelerating
## Problem
The monitoring app kept flagging the SBA API container (`sba_db_api`) as unhealthy and restarting it. Container logs showed repeated `CRITICAL WORKER TIMEOUT` and `WARNING Worker was sent SIGABRT` messages from Gunicorn. The container itself wasn't restarting (0 Docker restarts, up 2 weeks), but individual workers were being killed and respawned, causing brief API unavailability windows.
## Root Cause
External clients (via nginx-proxy-manager at `172.25.0.3`) were sending requests with `limit=99999` and empty filter parameters (e.g., `?game_id=&pitcher_id=`). The API had no defenses:
- **No max limit cap** on any endpoint except `/players/search` (which had `le=50`). Clients could request 99,999 rows.
- **Empty string params passed validation** — FastAPI parsed `game_id=` as `['']`, which passed `if param is not None` checks but generated wasteful full-table-scan queries.
- **`/transactions` had no limit parameter at all** — always returned every matching row with recursive serialization (`model_to_dict(recurse=True)`).
- **Recursive serialization amplified cost** — each row triggered additional DB lookups for FK relations (player, team, etc.).
Combined, these caused queries to exceed the 120-second Gunicorn timeout, killing the worker.
### IP Attribution Gotcha
Initial assumption was the Discord bot was the source (IP `172.25.0.3` was assumed to be the bot container). Docker IP mapping revealed `172.25.0.3` was actually **nginx-proxy-manager** — the queries came from external clients through the reverse proxy. The Discord bot is at `172.18.0.2` on a completely separate Docker network and generates none of these queries.
```bash
# Command to map container IPs
docker inspect --format='{{.Name}} {{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' $(docker ps -q)
```
## Fix
PR #99 merged into main with the following changes (27 files, 503 insertions):
1. **`MAX_LIMIT=500` and `DEFAULT_LIMIT=200` constants** in `app/dependencies.py`, enforced with `le=MAX_LIMIT` across all list endpoints
2. **`strip_empty_query_params` middleware** in `app/main.py` — strips empty string values from query params before FastAPI parses them, so `?game_id=` is treated as absent
3. **`limit`/`offset` added to `/transactions`** — previously returned all rows; now defaults to 200, max 500, with `total_count` computed before pagination
4. **11 existing limit params capped** with `le=MAX_LIMIT`
5. **13 endpoints with no limit** received `limit`/`offset` params
6. **Manual `if limit < 1` guards removed** — now handled by FastAPI's `ge=1` validation
7. **5 unit tests** covering limit validation (422 on exceeding max, zero, negative), transaction response shape, and empty string stripping
8. **fieldingstats count bug fixed** — `.count()` was being called after `.limit()`, capping the reported count at the page size instead of total matching rows (#100)
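The core of the empty-param middleware (item 2) reduces to a pure filtering step. A stdlib-only sketch of that step (the real middleware rewrites the request's query string before FastAPI parses it):

```python
from urllib.parse import parse_qsl, urlencode

def strip_empty_query_params(query_string: str) -> str:
    """Drop params with empty values so `?game_id=&pitcher_id=7`
    is parsed as if only pitcher_id were present."""
    pairs = parse_qsl(query_string, keep_blank_values=True)
    return urlencode([(k, v) for k, v in pairs if v != ""])
```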
## Lessons
- **Always verify container IP attribution** before committing to a suspect. `docker inspect` with a format string is the canonical way to map IPs to container names; don't assume based on Docker network proximity.
- **APIs should never trust client-provided limits** — enforce `le=MAX_LIMIT` on every list endpoint. The only endpoint already protected was `/players/search`, which had been capped at `le=50`.
- **Empty string params are a silent danger** — FastAPI parses `?param=` as `['']`, not `None`. A global middleware is the right fix since it protects all endpoints including future ones.
- **Recursive serialization (`model_to_dict(recurse=True)`) is O(n * related_objects)** — dangerous on unbounded queries. Consider forcing `short_output=True` for large result sets.
- **Heavy reformatting mixed with functional changes obscures bugs** — the fieldingstats count bug was missed in review because the file had 262 lines of diff from quote/formatting changes. Separate cosmetic and functional changes into different commits.

View File

@ -562,6 +562,20 @@ tar -czf ~/jellyfin-config-backup-$(date +%Y%m%d).tar.gz ~/docker/jellyfin/confi
---
## PGS Subtitle Default Flags Causing Roku Playback Hang (2026-04-01)
**Severity:** Medium — affects all Roku/Apple TV clients attempting to play remuxes with PGS subtitles
**Problem:** Playback on Roku hangs at "Loading" and stops at 0 ms. Jellyfin logs show ffmpeg extracting all subtitle streams (including PGS) from the full-length movie before playback can begin. User Staci reported Jurassic Park (1993) taking forever to start on the living room Roku.
**Root Cause:** PGS (hdmv_pgs_subtitle) tracks flagged as `default` in MKV files cause the Roku client to auto-select them. Roku can't decode PGS natively, so Jellyfin must burn them in — triggering a full subtitle extraction pass and video transcode before any data reaches the client. 178 out of ~400 movies in the library had this flag set, mostly remuxes that predate the Tdarr `clrSubDef` flow plugin.
**Fix:**
1. **Batch fix (existing library):** Wrote `fix-pgs-defaults.sh` — scans all MKVs with `mkvmerge -J`, finds PGS tracks with `default_track: true`, clears via `mkvpropedit --edit track:N --set flag-default=0`. Key gotcha: mkvpropedit uses 1-indexed track numbers (`track_id + 1`), NOT `track:=ID` (which matches by UID). Script is on manticore at `/tmp/fix-pgs-defaults.sh`. Fixed 178 files, no re-encoding needed.
2. **Going forward (Tdarr):** The flow already has a "Clear Subtitle Default Flags" custom function plugin (`clrSubDef`) that clears default disposition on non-forced subtitle tracks during transcoding. New files processed by Tdarr are handled automatically.
**Lesson:** Remux files from automated downloaders almost always have PGS defaults set. Any bulk import of remuxes should be followed by a PGS default flag sweep. The CIFS media mount on manticore is read-only inside the Jellyfin container — mkvpropedit must run from the host against `/mnt/truenas/media/Movies`.
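The batch script's decision logic can be isolated from the shelling-out. A sketch (hypothetical helper operating on `mkvmerge -J` output) that returns the `mkvpropedit` arguments, including the 1-indexing gotcha:

```python
import json

def pgs_default_edits(mkvmerge_json: str) -> list[list[str]]:
    """Return mkvpropedit argument lists that clear the default flag on
    PGS subtitle tracks found in `mkvmerge -J` output. Hypothetical
    helper illustrating the logic of fix-pgs-defaults.sh."""
    info = json.loads(mkvmerge_json)
    edits = []
    for track in info.get("tracks", []):
        props = track.get("properties", {})
        if props.get("codec_id") == "S_HDMV/PGS" and props.get("default_track"):
            # mkvpropedit track selectors are 1-indexed: track:(id + 1)
            edits.append(["--edit", f"track:{track['id'] + 1}",
                          "--set", "flag-default=0"])
    return edits
```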
## Related Documentation
- **Setup Guide**: `/media-servers/jellyfin-ubuntu-manticore.md`
- **NVIDIA Driver Management**: See jellyfin-ubuntu-manticore.md

View File

@ -0,0 +1,37 @@
---
title: "MLB The Show Market Tracker — 0.1.0"
description: "Initial release of the CLI market scanner with flip scanning and exchange program support."
type: reference
domain: gaming
tags: [release-notes, deployment, mlb-the-show, rust]
---
# MLB The Show Market Tracker — 0.1.0
**Date:** 2026-03-28
**Version:** `0.1.0`
**Repo:** `cal/mlb-the-show-market-tracker` on Gitea
**Deploy method:** Local CLI tool — `cargo build --release` on workstation
## Release Summary
Initial release of `showflip`, a Rust CLI tool for scanning the MLB The Show 26 Community Market. Supports finding profitable card flips and identifying silver cards at target buy-order prices for the gold pack exchange program.
## Changes
### New Features
- **`scan` command** — Concurrent market scanner that finds profitable flip opportunities. Supports filters for rarity, team, position, budget, and sorting by profit/margin. Includes watch mode for repeated scans and optional Discord webhook alerts.
- **`exchange` command** — Scans for silver cards (OVR 77-79) priced within configurable buy-order gates for the gold pack exchange program. Tiers: 79 OVR (target 170/max 175), 78 OVR (target 140/max 145), 77 OVR (target 117/max 122). Groups results by OVR with color-coded target/OK status.
- **`detail` command** — Shows price history and recent sales for a specific card by name or UUID.
- **`meta` command** — Lists available series, brands, and sets for use as filter values.
- OVR-based price floor calculation for live and non-live series cards
- 10% Community Market tax built into all profit calculations
- Handles API price format inconsistencies (integers vs comma-formatted strings)
- HTTP client with 429 retry handling
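The tax handling above amounts to the following (illustrative integer math; the tool's exact rounding may differ):

```python
def flip_profit(buy_price: int, sell_price: int) -> int:
    """Net stubs from a flip: sale price minus the 10% Community Market
    tax, minus the purchase price. Illustrative, not the tool's code."""
    proceeds = sell_price - sell_price // 10   # 10% tax withheld on sale
    return proceeds - buy_price
```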
## Deployment Notes
- No server deployment — runs locally via `cargo run -- <subcommand>`
- API is public at `https://mlb26.theshow.com/apis/` — no auth required
- No tests or CI configured yet

View File

@ -0,0 +1,45 @@
---
title: "MLB The Show Companion Automation — 2026.3.31"
description: "Fix gold exchange navigation, add grind harness for automated buy→exchange loops, CLI cleanup."
type: reference
domain: gaming
tags: [release-notes, deployment, mlb-the-show, python, automation]
---
# MLB The Show Companion Automation — 2026.3.31
**Date:** 2026-03-31
**Repo:** `cal/mlb-the-show-market-tracker` on Gitea
**Branch:** `main` (merge commit `ea66e2c`)
**Deploy method:** Local script — `uv run scripts/grind.py`
## Release Summary
Major fixes to the companion app automation (`grind.py`). The gold exchange navigation was broken — the script thought it had entered the card grid when it was still on the exchange selection list. Added a new `grind` command that orchestrates the full buy→exchange loop with multi-tier OVR rotation.
## Changes
### Bug Fixes
- Fixed `_is_on_exchange_grid()` to require `Exchange Value` card labels, distinguishing the card grid from the Exchange Players list page (`d4c038b`)
- Added retry loop (3 attempts, 2s apart) in `ensure_on_exchange_grid()` for variable load times
- Added `time.sleep(2)` after tapping into the Gold Exchange grid
- Removed low-OVR bail logic — the grid is sorted ascending, so bail fired on first screen before scrolling to profitable cards
- Fixed buy-orders market scroll — retry loop now attempts up to 10 scrolls before giving up (was 1) (`6912a7e`). Note: the scroll method itself was still broken (KEYCODE_PAGE_DOWN); fixed in the 2026.4.01 release.
- Restored `_has_low_ovr_cards` fix lost during PR #2 merge (`c29af78`)
### New Features
- **`grind` command** — automated buy→exchange loop with OVR tier rotation (`6912a7e`)
- Rotates through OVR tiers in descending order (default: 79, 78, 77)
- Buys 2 tiers per round, then exchanges all available dupes
- Flags: `--ovrs`, `--rounds`, `--max-players`, `--max-price`, `--budget`, `--max-packs`
- Per-round and cumulative summary output
- Clean Ctrl+C handling with final totals
### CLI Changes
- Renamed `grind` → `exchange` (bulk exchange command)
- Removed redundant single-exchange command (use `exchange 1` instead)
- `grind` now refers to the full buy→exchange orchestration loop
## Known Issues
- Default price gates (`MAX_BUY_PRICES`) may be too low during market inflation periods. Current gates: 79→170, 78→140, 77→125. Use `--max-price` to override.
- No order fulfillment polling — the grind loop relies on natural timing (2 buy rounds ≈ 2-5 min gives orders time to fill)

View File

@ -0,0 +1,26 @@
---
title: "MLB The Show Companion Automation — 2026.4.01"
description: "Fix buy-orders scroll to use touch swipes, optimize exchange card selection."
type: reference
domain: gaming
tags: [release-notes, deployment, mlb-the-show, python, automation]
---
# MLB The Show Companion Automation — 2026.4.01
**Date:** 2026-04-01
**Repo:** `cal/mlb-the-show-market-tracker` on Gitea
**Branch:** `main` (latest `f15e98a`)
**Deploy method:** Local script — `uv run scripts/grind.py`
## Release Summary
Two fixes to the companion app automation. The buy-orders command couldn't scroll through the market list because it used keyboard events instead of touch swipes. The exchange command now stops selecting cards once it has enough points for a pack.
## Changes
### Bug Fixes
- **Fixed buy-orders market scrolling** — replaced `KEYCODE_PAGE_DOWN` (keyboard event ignored by WebView) with `scroll_load_jiggle()` which uses touch swipes + a reverse micro-swipe to trigger lazy loading. This matches the working exchange scroll strategy. (`49fe7b6`)
### Optimizations
- **Early break in exchange card selection** — the selection loop now stops as soon as accumulated points meet the exchange threshold, avoiding unnecessary taps on additional card types the app won't consume. (`f15e98a`)
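The early-break selection can be sketched as (generic illustration, not the companion app's code):

```python
def select_for_exchange(cards: list[tuple[str, int]], threshold: int):
    """Pick cards until accumulated points meet the exchange threshold,
    then stop tapping — the app won't consume the extras anyway."""
    picked, points = [], 0
    for name, value in cards:
        if points >= threshold:
            break
        picked.append(name)
        points += value
    return picked, points
```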

View File

@ -1,9 +1,9 @@
---
title: "Monitoring Scripts Context"
description: "Operational context for all monitoring scripts: Jellyfin GPU health monitor, NVIDIA driver update checker, Tdarr API/file monitors, and Windows reboot detection. Includes cron schedules, Discord integration patterns, and troubleshooting."
description: "Operational context for all monitoring scripts: Proxmox backup checker, CT 302 self-health, Jellyfin GPU health monitor, NVIDIA driver update checker, Tdarr API/file monitors, and Windows reboot detection. Includes cron schedules, Discord integration patterns, and troubleshooting."
type: context
domain: monitoring
tags: [jellyfin, gpu, nvidia, tdarr, discord, cron, python, windows, scripts]
tags: [proxmox, backup, jellyfin, gpu, nvidia, tdarr, discord, cron, python, bash, windows, scripts]
---
# Monitoring Scripts - Operational Context
@ -13,6 +13,77 @@ This directory contains active operational scripts for system monitoring, health
## Core Monitoring Scripts
### Proxmox Backup Verification
**Script**: `proxmox-backup-check.sh`
**Purpose**: Weekly check that every running VM/CT has a successful vzdump backup within 7 days. Posts a color-coded Discord embed with per-guest status.
**Key Features**:
- SSHes to Proxmox host and queries `pvesh` task history + guest lists via API
- Categorizes each guest: 🟢 green (backed up), 🟡 yellow (overdue), 🔴 red (no backup)
- Sorts output by VMID; only posts to Discord — no local side effects
- `--dry-run` mode prints the Discord payload without sending
- `--days N` overrides the default 7-day window
**Schedule**: Weekly on Monday 08:00 UTC (CT 302 cron)
```bash
0 8 * * 1 DISCORD_WEBHOOK="<url>" /root/scripts/proxmox-backup-check.sh >> /var/log/proxmox-backup-check.log 2>&1
```
**Usage**:
```bash
# Dry run (no Discord)
proxmox-backup-check.sh --dry-run
# Post to Discord
DISCORD_WEBHOOK="https://discord.com/api/webhooks/..." proxmox-backup-check.sh
# Custom window
proxmox-backup-check.sh --days 14 --discord-webhook "https://..."
```
**Dependencies**: `jq`, `curl`, SSH access to Proxmox host alias `proxmox`
**Install on CT 302**:
```bash
cp proxmox-backup-check.sh /root/scripts/
chmod +x /root/scripts/proxmox-backup-check.sh
```
### CT 302 Self-Health Monitor
**Script**: `ct302-self-health.sh`
**Purpose**: Monitors disk usage on CT 302 (claude-runner) itself. Alerts to Discord when any filesystem exceeds the threshold (default 80%). Runs silently when healthy — no Discord spam on green.
**Key Features**:
- Checks all non-virtual filesystems (`df`, excludes tmpfs/devtmpfs/overlay)
- Only sends a Discord alert when a filesystem is at or above threshold
- `--always-post` flag forces a post even when healthy (useful for testing)
- `--dry-run` mode prints payload without sending
**Schedule**: Daily at 07:00 UTC (CT 302 cron)
```bash
0 7 * * * DISCORD_WEBHOOK="<url>" /root/scripts/ct302-self-health.sh >> /var/log/ct302-self-health.log 2>&1
```
**Usage**:
```bash
# Check and alert if over 80%
DISCORD_WEBHOOK="https://discord.com/api/webhooks/..." ct302-self-health.sh
# Lower threshold test
ct302-self-health.sh --threshold 50 --dry-run
# Always post (weekly status report pattern)
ct302-self-health.sh --always-post --discord-webhook "https://..."
```
**Dependencies**: `jq`, `curl`, `df`
**Install on CT 302**:
```bash
cp ct302-self-health.sh /root/scripts/
chmod +x /root/scripts/ct302-self-health.sh
```
### Jellyfin GPU Health Monitor
**Script**: `jellyfin_gpu_monitor.py`
**Purpose**: Monitor Jellyfin container GPU access with Discord alerts and auto-restart capability
@ -235,6 +306,17 @@ python3 tdarr_file_monitor.py >> /mnt/NV2/Development/claude-home/logs/tdarr-fil
0 9 * * 1 /usr/bin/python3 /home/cal/scripts/nvidia_update_checker.py --check --discord-alerts >> /home/cal/logs/nvidia-update-checker.log 2>&1
```
**Active Cron Jobs** (on CT 302 / claude-runner, root user):
```bash
# Proxmox backup verification - Weekly (Mondays at 8 AM UTC)
0 8 * * 1 DISCORD_WEBHOOK="<homelab-alerts-webhook>" /root/scripts/proxmox-backup-check.sh >> /var/log/proxmox-backup-check.log 2>&1
# CT 302 self-health disk check - Daily at 7 AM UTC (alerts only when >80%)
0 7 * * * DISCORD_WEBHOOK="<homelab-alerts-webhook>" /root/scripts/ct302-self-health.sh >> /var/log/ct302-self-health.log 2>&1
```
**Note**: Scripts must be installed manually on CT 302. Source of truth is `monitoring/scripts/` in this repo — copy to `/root/scripts/` on CT 302 to deploy.
**Manual/On-Demand**:
- `tdarr_monitor.py` - Run as needed for Tdarr health checks
- `tdarr_file_monitor.py` - Can be scheduled if automatic backup needed

View File

@ -0,0 +1,158 @@
#!/usr/bin/env bash
# ct302-self-health.sh — CT 302 (claude-runner) disk self-check → Discord
#
# Monitors disk usage on CT 302 itself and alerts to Discord when any
# filesystem exceeds the threshold. Closes the blind spot where the
# monitoring system cannot monitor itself via external health checks.
#
# Designed to run silently when healthy (no Discord spam on green).
# Only posts when a filesystem is at or above THRESHOLD.
#
# Usage:
# ct302-self-health.sh [--discord-webhook URL] [--threshold N] [--dry-run] [--always-post]
#
# Environment overrides:
# DISCORD_WEBHOOK Discord webhook URL (required unless --dry-run)
# DISK_THRESHOLD Disk usage % alert threshold (default: 80)
#
# Install on CT 302 (daily, 07:00 UTC):
# 0 7 * * * /root/scripts/ct302-self-health.sh >> /var/log/ct302-self-health.log 2>&1
set -uo pipefail
DISK_THRESHOLD="${DISK_THRESHOLD:-80}"
DISCORD_WEBHOOK="${DISCORD_WEBHOOK:-}"
DRY_RUN=0
ALWAYS_POST=0
while [[ $# -gt 0 ]]; do
case "$1" in
--discord-webhook)
if [[ $# -lt 2 ]]; then
echo "Error: --discord-webhook requires a value" >&2
exit 1
fi
DISCORD_WEBHOOK="$2"
shift 2
;;
--threshold)
if [[ $# -lt 2 ]]; then
echo "Error: --threshold requires a value" >&2
exit 1
fi
DISK_THRESHOLD="$2"
shift 2
;;
--dry-run)
DRY_RUN=1
shift
;;
--always-post)
ALWAYS_POST=1
shift
;;
*)
echo "Unknown option: $1" >&2
exit 1
;;
esac
done
if [[ "$DRY_RUN" -eq 0 && -z "$DISCORD_WEBHOOK" ]]; then
echo "Error: DISCORD_WEBHOOK not set. Use --discord-webhook URL or set env var." >&2
exit 1
fi
log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*"; }
# ---------------------------------------------------------------------------
# Check disk usage on all real filesystems
# ---------------------------------------------------------------------------
# df output: Filesystem Use% Mounted-on (skipping tmpfs, devtmpfs, overlay)
TRIGGERED=()
ALL_FS=()
while IFS= read -r line; do
fs=$(echo "$line" | awk '{print $1}')
pct=$(echo "$line" | awk '{print $5}' | tr -d '%')
mount=$(echo "$line" | awk '{print $6}')
ALL_FS+=("${pct}% ${mount} (${fs})")
if [[ "$pct" -ge "$DISK_THRESHOLD" ]]; then
TRIGGERED+=("${pct}% used — ${mount} (${fs})")
fi
done < <(df -h --output=source,size,used,avail,pcent,target |
tail -n +2 |
awk '$1 !~ /^(tmpfs|devtmpfs|overlay|udev)/' |
awk '{print $1, $5, $6}')
HOSTNAME=$(hostname -s)
TRIGGERED_COUNT=${#TRIGGERED[@]}
log "Disk check complete: ${TRIGGERED_COUNT} filesystem(s) above ${DISK_THRESHOLD}%"
# Exit cleanly with no Discord post if everything is healthy
if [[ "$TRIGGERED_COUNT" -eq 0 && "$ALWAYS_POST" -eq 0 && "$DRY_RUN" -eq 0 ]]; then
log "All filesystems healthy — no alert needed."
exit 0
fi
# ---------------------------------------------------------------------------
# Build Discord payload
# ---------------------------------------------------------------------------
if [[ "$TRIGGERED_COUNT" -gt 0 ]]; then
EMBED_COLOR=15548997 # 0xED4245 red
TITLE="🔴 ${HOSTNAME}: Disk usage above ${DISK_THRESHOLD}%"
alert_lines=$(printf '⚠️ %s\n' "${TRIGGERED[@]}")
FIELDS=$(jq -n \
--arg name "Filesystems Over Threshold" \
--arg value "$alert_lines" \
'[{"name": $name, "value": $value, "inline": false}]')
else
EMBED_COLOR=5763719 # 0x57F287 green
TITLE="🟢 ${HOSTNAME}: All filesystems healthy"
FIELDS='[]'
fi
# Add summary of all filesystems
all_lines=$(printf '%s\n' "${ALL_FS[@]}")
FIELDS=$(echo "$FIELDS" | jq \
--arg name "All Filesystems" \
--arg value "$all_lines" \
'. + [{"name": $name, "value": $value, "inline": false}]')
FOOTER="$(date -u '+%Y-%m-%d %H:%M UTC') · CT 302 self-health · threshold: ${DISK_THRESHOLD}%"
PAYLOAD=$(jq -n \
--arg title "$TITLE" \
--argjson color "$EMBED_COLOR" \
--argjson fields "$FIELDS" \
--arg footer "$FOOTER" \
'{
"embeds": [{
"title": $title,
"color": $color,
"fields": $fields,
"footer": {"text": $footer}
}]
}')
if [[ "$DRY_RUN" -eq 1 ]]; then
log "DRY RUN — Discord payload:"
echo "$PAYLOAD" | jq .
exit 0
fi
log "Posting to Discord..."
HTTP_STATUS=$(curl -s -o /tmp/ct302-self-health-discord.out \
-w "%{http_code}" \
-X POST "$DISCORD_WEBHOOK" \
-H "Content-Type: application/json" \
-d "$PAYLOAD")
if [[ "$HTTP_STATUS" -ge 200 && "$HTTP_STATUS" -lt 300 ]]; then
log "Discord notification sent (HTTP ${HTTP_STATUS})."
else
log "Warning: Discord returned HTTP ${HTTP_STATUS}."
cat /tmp/ct302-self-health-discord.out >&2
exit 1
fi

View File

@ -0,0 +1,532 @@
#!/usr/bin/env bash
# homelab-audit.sh — SSH-based homelab health audit
#
# Runs on the Proxmox host. Discovers running LXCs and VMs, SSHes into each
# to collect system metrics, then generates a summary report.
#
# Usage:
# homelab-audit.sh [--output-dir DIR] [--hosts label:ip,label:ip,...]
#
# Environment overrides:
# STUCK_PROC_CPU_WARN CPU% at which a D-state process is flagged (default: 10)
# REPORT_DIR Output directory for per-host reports and logs
# SSH_USER Remote user (default: root)
# -e omitted intentionally — unreachable hosts should not abort the full audit
set -uo pipefail
# ---------------------------------------------------------------------------
# Configuration
# ---------------------------------------------------------------------------
STUCK_PROC_CPU_WARN="${STUCK_PROC_CPU_WARN:-10}"
REPORT_DIR="${REPORT_DIR:-/tmp/homelab-audit-$(date +%Y%m%d-%H%M%S)}"
SSH_USER="${SSH_USER:-root}"
SSH_OPTS="-o StrictHostKeyChecking=accept-new -o ConnectTimeout=10 -o BatchMode=yes"
DISK_WARN=80
DISK_CRIT=90
LOAD_WARN=2.0
MEM_WARN=85
ZOMBIE_WARN=1
SWAP_WARN=512
HOSTS_FILTER="" # comma-separated host list from --hosts; empty = audit all
JSON_OUTPUT=0 # set to 1 by --json
while [[ $# -gt 0 ]]; do
case "$1" in
--output-dir)
if [[ $# -lt 2 ]]; then
echo "Error: --output-dir requires an argument" >&2
exit 1
fi
REPORT_DIR="$2"
shift 2
;;
--hosts)
if [[ $# -lt 2 ]]; then
echo "Error: --hosts requires an argument (label:ip,label:ip,...)" >&2
exit 1
fi
HOSTS_FILTER="$2"
shift 2
;;
--json)
JSON_OUTPUT=1
shift
;;
*)
echo "Unknown option: $1" >&2
exit 1
;;
esac
done
mkdir -p "$REPORT_DIR"
SSH_FAILURES_LOG="$REPORT_DIR/ssh-failures.log"
FINDINGS_FILE="$REPORT_DIR/findings.txt"
AUDITED_HOSTS=() # populated in main; used by generate_summary for per-host counts
# ---------------------------------------------------------------------------
# Remote collector script
#
# Kept single-quoted so no local variables are interpolated into the heredoc.
# STUCK_PROC_CPU_WARN is passed as $1 when invoking the remote bash session,
# so the configurable threshold reaches the collector without escaping issues.
# ---------------------------------------------------------------------------
COLLECTOR_SCRIPT='#!/usr/bin/env bash
STUCK_PROC_CPU_WARN="${1:-10}"
cpu_load() {
uptime | awk -F"load average:" "{print \$2}" | awk -F"[, ]+" "{print \$2}"
}
mem_pct() {
free | awk "/^Mem:/ {printf \"%.0f\", \$3/\$2*100}"
}
disk_usage() {
df --output=pcent,target -x tmpfs -x devtmpfs 2>/dev/null | tail -n +2 | \
while read -r pct mnt; do echo "${pct%%%} $mnt"; done
}
zombie_count() {
ps -eo stat= | grep -c "^Z" || true
}
stuck_procs() {
ps -eo stat=,pcpu=,comm= | \
awk -v t="$STUCK_PROC_CPU_WARN" '\''$1 ~ /^D/ && $2+0 >= t+0 {print $3}'\'' | \
paste -sd,
}
zombie_parents() {
ps -eo pid=,ppid=,stat= | awk '\''$3 ~ /^Z/ {print $2}'\'' | sort -u | \
xargs -I{} ps -o comm= -p {} 2>/dev/null | paste -sd,
}
swap_mb() {
free | awk '\''/^Swap:/ {printf "%.0f", $3/1024; found=1} END {if (!found) print "0"}'\''
}
oom_events() {
local count
count=$(journalctl -k --since "7 days ago" 2>/dev/null | grep -ci "out of memory") || true
echo "${count:-0}"
}
io_wait_pct() {
vmstat 1 2 2>/dev/null | tail -1 | awk '\''{print $16}'\''
}
echo "CPU_LOAD=$(cpu_load)"
echo "MEM_PCT=$(mem_pct)"
echo "ZOMBIES=$(zombie_count)"
echo "STUCK_PROCS=$(stuck_procs)"
echo "ZOMBIE_PARENTS=$(zombie_parents)"
echo "SWAP_MB=$(swap_mb)"
echo "OOM_EVENTS=$(oom_events)"
echo "IO_WAIT=$(io_wait_pct)"
disk_usage | while read -r pct mnt; do
echo "DISK $pct $mnt"
done
'
# ---------------------------------------------------------------------------
# SSH helper — logs stderr to ssh-failures.log instead of silently discarding
# ---------------------------------------------------------------------------
ssh_cmd() {
local host="$1"
shift
# shellcheck disable=SC2086
ssh $SSH_OPTS "${SSH_USER}@${host}" "$@" 2>>"$SSH_FAILURES_LOG"
}
# ---------------------------------------------------------------------------
# LXC IP discovery
#
# lxc-info only returns IPs for containers using Proxmox-managed DHCP bridges.
# Containers with static IPs defined inside the container (not via Proxmox
# network config) return nothing. Fall back to parsing `pct config` in that
# case to find the ip= field from the container's network interface config.
# ---------------------------------------------------------------------------
get_lxc_ip() {
local ctid="$1"
local ip
ip=$(lxc-info -n "$ctid" -iH 2>/dev/null | head -1)
if [[ -z "$ip" ]]; then
ip=$(pct config "$ctid" 2>/dev/null | grep -oP '(?<=ip=)[^/,]+' | head -1)
fi
echo "$ip"
}
# ---------------------------------------------------------------------------
# Inventory: running LXCs and VMs
# Returns lines of "label ip"
# ---------------------------------------------------------------------------
collect_inventory() {
# LXCs
pct list 2>/dev/null | tail -n +2 | while read -r ctid status _name; do
[[ "$status" != "running" ]] && continue
local ip
ip=$(get_lxc_ip "$ctid")
[[ -n "$ip" ]] && echo "lxc-${ctid} $ip"
done
# VMs — use agent network info if available, fall back to qm config
qm list 2>/dev/null | tail -n +2 | while read -r vmid _name status _mem _bootdisk _pid; do
[[ "$status" != "running" ]] && continue
local ip
ip=$(qm guest cmd "$vmid" network-get-interfaces 2>/dev/null |
python3 -c "
import sys, json
try:
data = json.load(sys.stdin)
for iface in data:
for addr in iface.get('ip-addresses', []):
if addr['ip-address-type'] == 'ipv4' and not addr['ip-address'].startswith('127.'):
print(addr['ip-address'])
raise SystemExit
except Exception:
pass
" 2>/dev/null)
[[ -n "$ip" ]] && echo "vm-${vmid} $ip"
done
}
# ---------------------------------------------------------------------------
# Collect metrics from one host and record findings
# ---------------------------------------------------------------------------
parse_and_report() {
local label="$1"
local addr="$2"
local raw
if ! raw=$(echo "$COLLECTOR_SCRIPT" | ssh_cmd "$addr" bash -s -- "$STUCK_PROC_CPU_WARN"); then
echo "SSH_FAILURE $label $addr" >>"$SSH_FAILURES_LOG"
echo "WARN $label: SSH connection failed" >>"$FINDINGS_FILE"
return
fi
while IFS= read -r line; do
case "$line" in
CPU_LOAD=*)
local load="${line#CPU_LOAD=}"
if [[ -n "$load" ]] && awk "BEGIN{exit !($load > $LOAD_WARN)}"; then
echo "WARN $label: load average ${load} > ${LOAD_WARN}" >>"$FINDINGS_FILE"
fi
;;
MEM_PCT=*)
local mem="${line#MEM_PCT=}"
if [[ -n "$mem" ]] && ((mem >= MEM_WARN)); then
echo "WARN $label: memory ${mem}% >= ${MEM_WARN}%" >>"$FINDINGS_FILE"
fi
;;
ZOMBIES=*)
local zombies="${line#ZOMBIES=}"
if [[ -n "$zombies" ]] && ((zombies >= ZOMBIE_WARN)); then
echo "WARN $label: ${zombies} zombie process(es)" >>"$FINDINGS_FILE"
fi
;;
STUCK_PROCS=*)
local procs="${line#STUCK_PROCS=}"
if [[ -n "$procs" ]]; then
echo "WARN $label: D-state procs with CPU>=${STUCK_PROC_CPU_WARN}%: ${procs}" >>"$FINDINGS_FILE"
fi
;;
ZOMBIE_PARENTS=*)
local zparents="${line#ZOMBIE_PARENTS=}"
if [[ -n "$zparents" ]]; then
echo "INFO $label: zombie parent process(es): ${zparents}" >>"$FINDINGS_FILE"
fi
;;
SWAP_MB=*)
local swap="${line#SWAP_MB=}"
if [[ -n "$swap" ]] && ((swap >= SWAP_WARN)); then
echo "WARN $label: swap usage ${swap} MB >= ${SWAP_WARN} MB" >>"$FINDINGS_FILE"
fi
;;
OOM_EVENTS=*)
local ooms="${line#OOM_EVENTS=}"
if [[ -n "$ooms" ]] && ((ooms > 0)); then
echo "WARN $label: ${ooms} OOM kill event(s) in last 7 days" >>"$FINDINGS_FILE"
fi
;;
IO_WAIT=*)
local iowait="${line#IO_WAIT=}"
if [[ -n "$iowait" ]] && ((iowait > 20)); then
echo "WARN $label: I/O wait ${iowait}% > 20%" >>"$FINDINGS_FILE"
fi
;;
DISK\ *)
local pct mnt
read -r _ pct mnt <<<"$line"
if ((pct >= DISK_CRIT)); then
echo "CRIT $label: disk ${mnt} at ${pct}% >= ${DISK_CRIT}%" >>"$FINDINGS_FILE"
elif ((pct >= DISK_WARN)); then
echo "WARN $label: disk ${mnt} at ${pct}% >= ${DISK_WARN}%" >>"$FINDINGS_FILE"
fi
;;
esac
done <<<"$raw"
}
# ---------------------------------------------------------------------------
# Summary — driven by actual findings in findings.txt and ssh-failures.log
# ---------------------------------------------------------------------------
generate_summary() {
local host_count="$1"
local ssh_failure_count=0
local warn_count=0
local crit_count=0
[[ -f "$SSH_FAILURES_LOG" ]] &&
ssh_failure_count=$(grep -c '^SSH_FAILURE' "$SSH_FAILURES_LOG" 2>/dev/null || true)
[[ -f "$FINDINGS_FILE" ]] &&
warn_count=$(grep -c '^WARN' "$FINDINGS_FILE" 2>/dev/null || true)
[[ -f "$FINDINGS_FILE" ]] &&
crit_count=$(grep -c '^CRIT' "$FINDINGS_FILE" 2>/dev/null || true)
echo ""
echo "=============================="
echo " HOMELAB AUDIT SUMMARY"
echo "=============================="
printf " Hosts audited : %d\n" "$host_count"
printf " SSH failures : %d\n" "$ssh_failure_count"
printf " Warnings : %d\n" "$warn_count"
printf " Critical : %d\n" "$crit_count"
echo "=============================="
if [[ ${#AUDITED_HOSTS[@]} -gt 0 ]] && ((warn_count + crit_count > 0)); then
echo ""
printf " %-30s %8s %8s\n" "Host" "Warnings" "Critical"
printf " %-30s %8s %8s\n" "----" "--------" "--------"
for host in "${AUDITED_HOSTS[@]}"; do
local hw hc
hw=$(grep -c "^WARN ${host}:" "$FINDINGS_FILE" 2>/dev/null || true)
hc=$(grep -c "^CRIT ${host}:" "$FINDINGS_FILE" 2>/dev/null || true)
((hw + hc > 0)) && printf " %-30s %8d %8d\n" "$host" "$hw" "$hc"
done
fi
if ((warn_count + crit_count > 0)); then
echo ""
echo "Findings:"
sort "$FINDINGS_FILE"
fi
if ((ssh_failure_count > 0)); then
echo ""
echo "SSH failures (see $SSH_FAILURES_LOG for details):"
grep '^SSH_FAILURE' "$SSH_FAILURES_LOG" | awk '{print " " $2 " (" $3 ")"}'
fi
echo ""
    printf "Total: %d warning(s), %d critical finding(s) across %d host(s)\n" \
        "$warn_count" "$crit_count" "$host_count"
echo ""
echo "Reports: $REPORT_DIR"
}
# ---------------------------------------------------------------------------
# Proxmox backup recency — queries vzdump task history via pvesh (runs locally)
# ---------------------------------------------------------------------------
check_backup_recency() {
local tasks_json_file="$REPORT_DIR/vzdump-tasks.json"
pvesh get /nodes/proxmox/tasks --typefilter vzdump --limit 50 --output-format json \
>"$tasks_json_file" 2>/dev/null || {
echo "WARN proxmox: failed to query vzdump task history" >>"$FINDINGS_FILE"
return
}
[[ ! -s "$tasks_json_file" ]] && return
local running_ids=()
while read -r ctid; do
running_ids+=("$ctid")
done < <(pct list 2>/dev/null | awk 'NR>1 && $2=="running"{print $1}')
while read -r vmid; do
running_ids+=("$vmid")
done < <(qm list 2>/dev/null | awk 'NR>1 && $3=="running"{print $1}')
[[ ${#running_ids[@]} -eq 0 ]] && return
local week_ago
week_ago=$(($(date +%s) - 7 * 86400))
python3 - "$tasks_json_file" "$week_ago" "${running_ids[@]}" <<'PYEOF' >>"$FINDINGS_FILE"
import sys, json, datetime
tasks_file, week_ago = sys.argv[1], int(sys.argv[2])
running_ids = set(sys.argv[3:])
try:
tasks = json.load(open(tasks_file))
except Exception:
sys.exit(0)
last_backup = {}
for task in tasks:
if task.get("type") != "vzdump" or task.get("status") != "OK":
continue
vmid = str(task.get("id", ""))
endtime = int(task.get("endtime", 0))
if vmid and endtime and endtime > last_backup.get(vmid, 0):
last_backup[vmid] = endtime
for vmid in sorted(running_ids):
ts = last_backup.get(vmid)
if ts and ts >= week_ago:
pass
elif ts:
dt = datetime.datetime.fromtimestamp(ts).strftime("%Y-%m-%d")
print(f"WARN proxmox/vm-{vmid}: last backup {dt} is older than 7 days")
else:
print(f"CRIT proxmox/vm-{vmid}: no backup found in task history")
PYEOF
}
# ---------------------------------------------------------------------------
# Certificate expiry check — runs from the audit host via openssl
# ---------------------------------------------------------------------------
check_cert_expiry() {
local label="$1"
local addr="$2"
local now
now=$(date +%s)
for port in 443 8443; do
local enddate
enddate=$(echo | timeout 10 openssl s_client -connect "${addr}:${port}" 2>/dev/null |
openssl x509 -noout -enddate 2>/dev/null) || continue
[[ -z "$enddate" ]] && continue
local expiry_str="${enddate#notAfter=}"
local expiry_epoch
expiry_epoch=$(date -d "$expiry_str" +%s 2>/dev/null) || continue
local days_left=$(((expiry_epoch - now) / 86400))
if ((days_left <= 7)); then
echo "CRIT $label: TLS cert on :${port} expires in ${days_left} days" >>"$FINDINGS_FILE"
elif ((days_left <= 14)); then
echo "WARN $label: TLS cert on :${port} expires in ${days_left} days" >>"$FINDINGS_FILE"
fi
done
}
# ---------------------------------------------------------------------------
# JSON report — writes findings.json to $REPORT_DIR when --json is used
# ---------------------------------------------------------------------------
write_json_report() {
local host_count="$1"
local json_file="$REPORT_DIR/findings.json"
local ssh_failure_count=0
local warn_count=0
local crit_count=0
[[ -f "$SSH_FAILURES_LOG" ]] &&
ssh_failure_count=$(grep -c '^SSH_FAILURE' "$SSH_FAILURES_LOG" 2>/dev/null || true)
[[ -f "$FINDINGS_FILE" ]] &&
warn_count=$(grep -c '^WARN' "$FINDINGS_FILE" 2>/dev/null || true)
[[ -f "$FINDINGS_FILE" ]] &&
crit_count=$(grep -c '^CRIT' "$FINDINGS_FILE" 2>/dev/null || true)
python3 - "$json_file" "$host_count" "$ssh_failure_count" \
"$warn_count" "$crit_count" "$FINDINGS_FILE" <<'PYEOF'
import sys, json, datetime
json_file = sys.argv[1]
host_count = int(sys.argv[2])
ssh_failure_count = int(sys.argv[3])
warn_count = int(sys.argv[4])
crit_count = int(sys.argv[5])
findings_file = sys.argv[6]
findings = []
try:
with open(findings_file) as f:
for line in f:
line = line.strip()
if not line:
continue
parts = line.split(None, 2)
if len(parts) < 3:
continue
severity, host_colon, message = parts[0], parts[1], parts[2]
findings.append({
"severity": severity,
"host": host_colon.rstrip(":"),
"message": message,
})
except FileNotFoundError:
pass
output = {
"timestamp": datetime.datetime.utcnow().isoformat() + "Z",
"hosts_audited": host_count,
"warnings": warn_count,
"critical": crit_count,
"ssh_failures": ssh_failure_count,
"total_findings": warn_count + crit_count,
"findings": findings,
}
with open(json_file, "w") as f:
json.dump(output, f, indent=2)
print(f"JSON report: {json_file}")
PYEOF
}
# ---------------------------------------------------------------------------
# Main
# ---------------------------------------------------------------------------
main() {
echo "Starting homelab audit — $(date)"
echo "Report dir: $REPORT_DIR"
echo "STUCK_PROC_CPU_WARN threshold: ${STUCK_PROC_CPU_WARN}%"
[[ -n "$HOSTS_FILTER" ]] && echo "Host filter: $HOSTS_FILTER"
echo ""
>"$FINDINGS_FILE"
local host_count=0
if [[ -n "$HOSTS_FILTER" ]]; then
# --hosts mode: audit specified hosts directly, skip Proxmox inventory
# Accepts comma-separated entries; each entry may be plain hostname or label:ip
local check_proxmox=0
IFS=',' read -ra filter_hosts <<<"$HOSTS_FILTER"
for entry in "${filter_hosts[@]}"; do
local label="${entry%%:*}"
[[ "$label" == "proxmox" ]] && check_proxmox=1
done
if ((check_proxmox)); then
echo " Checking Proxmox backup recency..."
check_backup_recency
fi
for entry in "${filter_hosts[@]}"; do
local label="${entry%%:*}"
local addr="${entry#*:}"
echo " Auditing $label ($addr)..."
parse_and_report "$label" "$addr"
check_cert_expiry "$label" "$addr"
AUDITED_HOSTS+=("$label")
((host_count++)) || true
done
else
echo " Checking Proxmox backup recency..."
check_backup_recency
while read -r label addr; do
echo " Auditing $label ($addr)..."
parse_and_report "$label" "$addr"
check_cert_expiry "$label" "$addr"
AUDITED_HOSTS+=("$label")
((host_count++)) || true
done < <(collect_inventory)
fi
generate_summary "$host_count"
[[ "$JSON_OUTPUT" -eq 1 ]] && write_json_report "$host_count"
}
main "$@"


@ -0,0 +1,230 @@
#!/usr/bin/env bash
# proxmox-backup-check.sh — Weekly Proxmox backup verification → Discord
#
# SSHes to the Proxmox host and checks that every running VM/CT has a
# successful vzdump backup within the last 7 days. Posts a color-coded
# Discord summary with per-guest status.
#
# Usage:
# proxmox-backup-check.sh [--discord-webhook URL] [--days N] [--dry-run]
#
# Environment overrides:
# DISCORD_WEBHOOK Discord webhook URL (required unless --dry-run)
# PROXMOX_NODE Proxmox node name (default: proxmox)
# PROXMOX_SSH SSH alias or host for Proxmox (default: proxmox)
# WINDOW_DAYS Backup recency window in days (default: 7)
#
# Install on CT 302 (weekly, Monday 08:00 UTC):
# 0 8 * * 1 /root/scripts/proxmox-backup-check.sh >> /var/log/proxmox-backup-check.log 2>&1
set -uo pipefail
PROXMOX_NODE="${PROXMOX_NODE:-proxmox}"
PROXMOX_SSH="${PROXMOX_SSH:-proxmox}"
WINDOW_DAYS="${WINDOW_DAYS:-7}"
DISCORD_WEBHOOK="${DISCORD_WEBHOOK:-}"
DRY_RUN=0
while [[ $# -gt 0 ]]; do
case "$1" in
--discord-webhook)
if [[ $# -lt 2 ]]; then
echo "Error: --discord-webhook requires a value" >&2
exit 1
fi
DISCORD_WEBHOOK="$2"
shift 2
;;
--days)
if [[ $# -lt 2 ]]; then
echo "Error: --days requires a value" >&2
exit 1
fi
WINDOW_DAYS="$2"
shift 2
;;
--dry-run)
DRY_RUN=1
shift
;;
*)
echo "Unknown option: $1" >&2
exit 1
;;
esac
done
if [[ "$DRY_RUN" -eq 0 && -z "$DISCORD_WEBHOOK" ]]; then
echo "Error: DISCORD_WEBHOOK not set. Use --discord-webhook URL or set env var." >&2
exit 1
fi
if ! command -v jq &>/dev/null; then
echo "Error: jq is required but not installed." >&2
exit 1
fi
SSH_OPTS="-o StrictHostKeyChecking=accept-new -o ConnectTimeout=10 -o BatchMode=yes"
CUTOFF=$(date -d "-${WINDOW_DAYS} days" +%s)
NOW=$(date +%s)
log() { echo "[$(date '+%Y-%m-%d %H:%M:%S')] $*"; }
# ---------------------------------------------------------------------------
# Fetch data from Proxmox
# ---------------------------------------------------------------------------
log "Fetching VM and CT list from Proxmox node '${PROXMOX_NODE}'..."
VMS_JSON=$(ssh $SSH_OPTS "$PROXMOX_SSH" \
"pvesh get /nodes/${PROXMOX_NODE}/qemu --output-format json 2>/dev/null" || echo "[]")
CTS_JSON=$(ssh $SSH_OPTS "$PROXMOX_SSH" \
"pvesh get /nodes/${PROXMOX_NODE}/lxc --output-format json 2>/dev/null" || echo "[]")
log "Fetching recent vzdump task history (limit 200)..."
TASKS_JSON=$(ssh $SSH_OPTS "$PROXMOX_SSH" \
"pvesh get /nodes/${PROXMOX_NODE}/tasks --typefilter vzdump --limit 200 --output-format json 2>/dev/null" || echo "[]")
# ---------------------------------------------------------------------------
# Build per-guest backup status
# ---------------------------------------------------------------------------
# Merge VMs and CTs into one list: [{vmid, name, type}]
GUESTS_JSON=$(jq -n \
--argjson vms "$VMS_JSON" \
--argjson cts "$CTS_JSON" '
($vms | map(select(.status == "running") | {vmid: (.vmid | tostring), name, type: "VM"})) +
($cts | map(select(.status == "running") | {vmid: (.vmid | tostring), name, type: "CT"}))
')
GUEST_COUNT=$(echo "$GUESTS_JSON" | jq 'length')
log "Found ${GUEST_COUNT} running guests."
# For each guest, find the most recent successful (status == "OK") vzdump task
RESULTS=$(jq -n \
--argjson guests "$GUESTS_JSON" \
--argjson tasks "$TASKS_JSON" \
--argjson cutoff "$CUTOFF" \
--argjson now "$NOW" \
--argjson window "$WINDOW_DAYS" '
$guests | map(
. as $g |
    ($tasks | map(
      select(
        # vzdump entries in the task history report the guest ID in the "id"
        # field (matching the Python check above); fall back to .vmid if present
        ((.id // .vmid) | tostring) == $g.vmid
        and .status == "OK"
      ) | .starttime
) | max // 0) as $last_ts |
{
vmid: $g.vmid,
name: $g.name,
type: $g.type,
last_backup_ts: $last_ts,
age_days: (if $last_ts > 0 then (($now - $last_ts) / 86400 | floor) else -1 end),
status: (
if $last_ts >= $cutoff then "green"
elif $last_ts > 0 then "yellow"
else "red"
end
)
}
) | sort_by(.vmid | tonumber)
')
GREEN_GUESTS=$(echo "$RESULTS" | jq '[.[] | select(.status == "green")]')
YELLOW_GUESTS=$(echo "$RESULTS" | jq '[.[] | select(.status == "yellow")]')
RED_GUESTS=$(echo "$RESULTS" | jq '[.[] | select(.status == "red")]')
GREEN_COUNT=$(echo "$GREEN_GUESTS" | jq 'length')
YELLOW_COUNT=$(echo "$YELLOW_GUESTS" | jq 'length')
RED_COUNT=$(echo "$RED_GUESTS" | jq 'length')
log "Results: ${GREEN_COUNT} green, ${YELLOW_COUNT} yellow, ${RED_COUNT} red"
# ---------------------------------------------------------------------------
# Build Discord payload
# ---------------------------------------------------------------------------
if [[ "$RED_COUNT" -gt 0 ]]; then
EMBED_COLOR=15548997 # 0xED4245 red
STATUS_LINE="🔴 Backup issues detected — action required"
elif [[ "$YELLOW_COUNT" -gt 0 ]]; then
  EMBED_COLOR=16705372 # 0xFEE75C yellow
STATUS_LINE="🟡 Some backups are overdue (>${WINDOW_DAYS}d)"
else
EMBED_COLOR=5763719 # 0x57F287 green
STATUS_LINE="🟢 All ${GUEST_COUNT} guests backed up within ${WINDOW_DAYS} days"
fi
# Format guest lines: "VM 116 (plex) — 2d ago" or "CT 302 (claude-runner) — NO BACKUPS"
format_guest() {
local prefix="$1" guests="$2"
echo "$guests" | jq -r '.[] | "\(.type) \(.vmid) (\(.name))"' |
while IFS= read -r line; do echo "${prefix} ${line}"; done
}
format_guest_with_age() {
local prefix="$1" guests="$2"
echo "$guests" | jq -r '.[] | "\(.type) \(.vmid) (\(.name)) — \(.age_days)d ago"' |
while IFS= read -r line; do echo "${prefix} ${line}"; done
}
# Build fields array
fields='[]'
if [[ "$GREEN_COUNT" -gt 0 ]]; then
green_lines=$(format_guest_with_age "✅" "$GREEN_GUESTS")
fields=$(echo "$fields" | jq \
--arg name "🟢 Healthy (${GREEN_COUNT})" \
--arg value "$green_lines" \
'. + [{"name": $name, "value": $value, "inline": false}]')
fi
if [[ "$YELLOW_COUNT" -gt 0 ]]; then
yellow_lines=$(format_guest_with_age "⚠️" "$YELLOW_GUESTS")
fields=$(echo "$fields" | jq \
--arg name "🟡 Overdue — last backup >${WINDOW_DAYS}d ago (${YELLOW_COUNT})" \
--arg value "$yellow_lines" \
'. + [{"name": $name, "value": $value, "inline": false}]')
fi
if [[ "$RED_COUNT" -gt 0 ]]; then
red_lines=$(format_guest "❌" "$RED_GUESTS")
fields=$(echo "$fields" | jq \
--arg name "🔴 No Successful Backups Found (${RED_COUNT})" \
--arg value "$red_lines" \
'. + [{"name": $name, "value": $value, "inline": false}]')
fi
FOOTER="$(date -u '+%Y-%m-%d %H:%M UTC') · ${GUEST_COUNT} guests · window: ${WINDOW_DAYS}d"
PAYLOAD=$(jq -n \
--arg title "Proxmox Backup Check — ${STATUS_LINE}" \
--argjson color "$EMBED_COLOR" \
--argjson fields "$fields" \
--arg footer "$FOOTER" \
'{
"embeds": [{
"title": $title,
"color": $color,
"fields": $fields,
"footer": {"text": $footer}
}]
}')
if [[ "$DRY_RUN" -eq 1 ]]; then
log "DRY RUN — Discord payload:"
echo "$PAYLOAD" | jq .
exit 0
fi
log "Posting to Discord..."
HTTP_STATUS=$(curl -s -o /tmp/proxmox-backup-check-discord.out \
-w "%{http_code}" \
-X POST "$DISCORD_WEBHOOK" \
-H "Content-Type: application/json" \
-d "$PAYLOAD")
if [[ "$HTTP_STATUS" -ge 200 && "$HTTP_STATUS" -lt 300 ]]; then
log "Discord notification sent (HTTP ${HTTP_STATUS})."
else
log "Warning: Discord returned HTTP ${HTTP_STATUS}."
cat /tmp/proxmox-backup-check-discord.out >&2
exit 1
fi


@ -0,0 +1,126 @@
#!/usr/bin/env bash
# test-audit-collectors.sh — validates homelab-audit.sh collector output format
#
# Re-implements each collector function inline and runs it locally, checking
# that output matches the expected format. Exits non-zero on any failure.
set -euo pipefail
PASS=0
FAIL=0
pass() {
((PASS++)) || true
echo " PASS: $1"
}
fail() {
  ((FAIL++)) || true
  echo "  FAIL: $1: $2"
}
echo "=== Collector output format tests ==="
# Run each collector function locally and validate output format
# These functions are designed to work on any Linux host
# --- cpu_load ---
result=$(uptime | awk -F'load average:' '{print $2}' | awk -F'[, ]+' '{print $2}')
if [[ "$result" =~ ^[0-9]+\.?[0-9]*$ ]]; then
pass "cpu_load returns numeric value: $result"
else
fail "cpu_load" "expected numeric, got: '$result'"
fi
# --- mem_pct ---
result=$(free | awk '/^Mem:/ {printf "%.0f", $3/$2*100}')
if [[ "$result" =~ ^[0-9]+$ ]] && ((result >= 0 && result <= 100)); then
pass "mem_pct returns percentage: $result"
else
fail "mem_pct" "expected 0-100, got: '$result'"
fi
# --- zombie_count ---
result=$(ps -eo stat= | grep -c "^Z" || true)
if [[ "$result" =~ ^[0-9]+$ ]]; then
pass "zombie_count returns integer: $result"
else
fail "zombie_count" "expected integer, got: '$result'"
fi
# --- zombie_parents ---
# May be empty if no zombies — that's valid
result=$(ps -eo pid=,ppid=,stat= | awk '$3 ~ /^Z/ {print $2}' | sort -u |
xargs -I{} ps -o comm= -p {} 2>/dev/null | paste -sd, || true)
if [[ -z "$result" || "$result" =~ ^[a-zA-Z0-9_.,/-]+$ ]]; then
pass "zombie_parents returns csv or empty: '${result:-<empty>}'"
else
fail "zombie_parents" "unexpected format: '$result'"
fi
# --- swap_mb ---
result=$(free | awk '/^Swap:/ {printf "%.0f", $3/1024}')
if [[ "$result" =~ ^[0-9]+$ ]]; then
pass "swap_mb returns integer MB: $result"
else
fail "swap_mb" "expected integer, got: '$result'"
fi
# --- oom_events ---
result=$(journalctl -k --since "7 days ago" 2>/dev/null | grep -ci "out of memory") || true
result="${result:-0}"
if [[ "$result" =~ ^[0-9]+$ ]]; then
pass "oom_events returns integer: $result"
else
fail "oom_events" "expected integer, got: '$result'"
fi
# --- stuck_procs ---
# May be empty — that's valid
result=$(ps -eo stat=,pcpu=,comm= |
awk '$1 ~ /^D/ && $2+0 >= 10 {print $3}' | paste -sd, || true)
if [[ -z "$result" || "$result" =~ ^[a-zA-Z0-9_.,/-]+$ ]]; then
pass "stuck_procs returns csv or empty: '${result:-<empty>}'"
else
fail "stuck_procs" "unexpected format: '$result'"
fi
# --- disk_usage format ---
result=$(df --output=pcent,target -x tmpfs -x devtmpfs 2>/dev/null | tail -n +2 | head -1 |
while read -r pct mnt; do echo "${pct%%%} $mnt"; done)
if [[ "$result" =~ ^[0-9]+\ / ]]; then
pass "disk_usage returns 'pct mount' format: $result"
else
fail "disk_usage" "expected 'N /path', got: '$result'"
fi
# --- --hosts flag parsing ---
echo ""
echo "=== --hosts argument parsing tests ==="
# Single host
input="vm-115:10.10.0.88"
IFS=',' read -ra entries <<<"$input"
label="${entries[0]%%:*}"
addr="${entries[0]#*:}"
if [[ "$label" == "vm-115" && "$addr" == "10.10.0.88" ]]; then
pass "--hosts single entry parsed: $label $addr"
else
fail "--hosts single" "expected 'vm-115 10.10.0.88', got: '$label $addr'"
fi
# Multiple hosts
input="vm-115:10.10.0.88,lxc-225:10.10.0.225"
IFS=',' read -ra entries <<<"$input"
label1="${entries[0]%%:*}"
addr1="${entries[0]#*:}"
label2="${entries[1]%%:*}"
addr2="${entries[1]#*:}"
if [[ "$label1" == "vm-115" && "$addr1" == "10.10.0.88" && "$label2" == "lxc-225" && "$addr2" == "10.10.0.225" ]]; then
pass "--hosts multi entry parsed: $label1 $addr1, $label2 $addr2"
else
fail "--hosts multi" "unexpected parse result"
fi
echo ""
echo "=== Results: $PASS passed, $FAIL failed ==="
((FAIL == 0))


@ -92,6 +92,42 @@ CT 302 does **not** have an SSH key registered with Gitea, so SSH git remotes wo
3. Commit to Gitea, pull on CT 302
4. Add Uptime Kuma monitors if desired
## Health Check Thresholds
Thresholds are evaluated in `health_check.py`. All load thresholds use **per-core** metrics
to avoid false positives from LXC containers (which see the Proxmox host's aggregate load).
### Load Average
| Metric | Value | Rationale |
|--------|-------|-----------|
| `LOAD_WARN_PER_CORE` | `0.7` | Elevated — investigate if sustained |
| `LOAD_CRIT_PER_CORE` | `1.0` | Saturated — CPU is a bottleneck |
| Sample window | 5-minute | Filters transient spikes (not 1-minute) |
**Formula**: `load_per_core = load_5m / nproc`
**Why per-core?** Proxmox LXC containers see the host's aggregate load average through the
shared kernel. A 32-core Proxmox host at load 9 is at 0.28/core (healthy), but a naive
absolute threshold of 2.0 would fire inside a 4-core LXC that sees that same load of 9.
Dividing `load_5m` by the core count from `nproc` gives the correct per-core ratio.
**Validation examples**:
- Proxmox host: load 9 / 32 cores = 0.28/core → no alert ✓
- VM 116 at 0.75/core → warning ✓ (above 0.7 threshold)
- VM at 1.1/core → critical ✓
### Other Thresholds
| Check | Threshold | Notes |
|-------|-----------|-------|
| Zombie processes | 5 | Single zombies are transient noise; alert only if ≥ 5 |
| Swap usage | 30% of total swap | Percentage-based to handle varied swap sizes across hosts |
| Disk warning | 85% | |
| Disk critical | 95% | |
| Memory | 90% | |
| Uptime alert | Non-urgent Discord post | Not a page-level alert |
## Related
- [monitoring/CONTEXT.md](../CONTEXT.md) — Overall monitoring architecture


@ -47,12 +47,13 @@ home_network:
services: ["media", "transcoding"]
description: "Tdarr media transcoding"
vpn_docker:
hostname: "10.10.0.121"
port: 22
user: "cal"
services: ["vpn", "docker"]
description: "VPN and Docker services"
# DECOMMISSIONED: vpn_docker (10.10.0.121) - VM 105 destroyed 2026-04
# vpn_docker:
# hostname: "10.10.0.121"
# port: 22
# user: "cal"
# services: ["vpn", "docker"]
# description: "VPN and Docker services"
remote_servers:
akamai_nano:


@ -23,7 +23,7 @@ servers:
pihole: 10.10.0.16 # Pi-hole DNS and ad blocking
sba_pd_bots: 10.10.0.88 # SBa and PD bot services
tdarr: 10.10.0.43 # Media transcoding
vpn_docker: 10.10.0.121 # VPN and Docker services
# vpn_docker: 10.10.0.121 # DECOMMISSIONED — VM 105 destroyed, migrated to arr-stack LXC 221
```
### Cloud Servers
@ -175,11 +175,12 @@ Host tdarr media
Port 22
IdentityFile ~/.ssh/homelab_rsa
Host docker-vpn
HostName 10.10.0.121
User cal
Port 22
IdentityFile ~/.ssh/homelab_rsa
# DECOMMISSIONED: docker-vpn (10.10.0.121) - VM 105 destroyed, migrated to arr-stack LXC 221
# Host docker-vpn
# HostName 10.10.0.121
# User cal
# Port 22
# IdentityFile ~/.ssh/homelab_rsa
# Remote Cloud Servers
Host akamai-nano akamai


@ -0,0 +1,63 @@
---
title: "Refractor Phase 2: Integration — boost wiring, tests, and review"
description: "Implemented apply_tier_boost orchestration, dry_run evaluator, evaluate-game wiring with kill switch, and 51 new tests across paper-dynasty-database. PRs #176 and #177 merged."
type: context
domain: paper-dynasty
tags: [paper-dynasty-database, refractor, phase-2, testing]
---
# Refractor Phase 2: Integration — boost wiring, tests, and review
**Date:** 2026-03-30
**Branch:** `feature/refractor-phase2-integration` (merged to `main`)
**Repo:** paper-dynasty-database
## What Was Done
Full implementation of Refractor Phase 2 Integration — wiring the Phase 2 Foundation boost functions (PR #176) into the live evaluate-game endpoint so that tier-ups actually create boosted variant cards with modified ratings.
1. **PR #176 merged (Foundation)** — Review findings fixed (renamed `evolution_tier` to `refractor_tier`, removed redundant parens), then merged via pd-ops
2. **`evaluate_card(dry_run=True)`** — Added dry_run parameter to separate tier detection from tier write. `apply_tier_boost()` becomes the sole writer of `current_tier`, ensuring atomicity with variant creation. Added `computed_tier` and `computed_fully_evolved` to return dict.
3. **`apply_tier_boost()` orchestration** — Full flow: source card lookup, boost application per vs_hand split, variant card + ratings creation with idempotency guards, audit record with idempotency guard, atomic state mutations via `db.atomic()`. Display stat helpers compute fresh avg/obp/slg.
4. **`evaluate_game()` wiring** — Calls evaluate_card with dry_run=True, loops through intermediate tiers on tier-up, handles partial multi-tier failures (reports last successful tier), `REFRACTOR_BOOST_ENABLED` env var kill switch, suppresses false notifications when boost is disabled or card_type is missing.
5. **79-sum documentation fix** — Clarified all references to "79-sum" across code, tests, and docs to note the 108-total card invariant (79 variable + 29 x-check for pitchers).
6. **51 new tests** — Display stat unit tests (12), integration tests for orchestration (27), HTTP endpoint tests (7), dry_run evaluator tests (6). Total suite: 223 passed.
7. **Five rounds of swarm reviews** — Each change reviewed individually by swarm-reviewer agents. All findings addressed: false notification on null card_type, wrong tier in log message, partial multi-tier failure reporting, atomicity test accuracy, audit idempotency gap, import os placement.
8. **PR #177 merged** — Review found two issues (import os inside function, audit idempotency gap on PostgreSQL UNIQUE constraint). Both fixed, pushed, approved by Claude, merged via pd-ops.
## Decisions
### Display stats computed fresh, not set to None
The original PO review note suggested setting avg/obp/slg to None on variant cards and deferring recalculation. Cal decided to compute them fresh using the exact Pydantic validator formulas instead — strictly better than stale or missing values. Design doc updated to reflect this.
### Card/ratings creation outside db.atomic()
The design doc specified all writes inside `db.atomic()`. Implementation splits card/ratings creation outside (idempotent, retry-safe via get_or_none guards) with only state mutations (audit, tier write, Card.variant propagation) inside the atomic block. This is pragmatically correct — on retry, existing card/ratings are reused. Design doc updated.
### Kill switch suppresses notifications entirely
When `REFRACTOR_BOOST_ENABLED=false`, the router skips both the boost AND the tier_up notification (via `continue`). This prevents false notifications to the Discord bot during maintenance windows. Initially the code fell through and emitted a notification without a variant — caught during coverage gap analysis and fixed.
### Audit idempotency guard added
PR review identified that `RefractorBoostAudit` has a `UNIQUE(card_state_id, tier)` constraint in PostgreSQL (from the migration) that the SQLite test DB doesn't enforce. Added `get_or_none` before `create` to prevent IntegrityError on retry.
## Follow-Up
- Phase 3: Documentation updates in `card-creation` repo (docs only, no code)
- Phase 4a: Validation test cases in `database` repo
- Phase 4b: Discord bot tier-up notification fix (must ship alongside or after Phase 2 deploy)
- Deploy Phase 2 to dev: run migration `2026-03-28_refractor_phase2_boost.sql` on dev DB
- Stale branches to clean up in database repo: `feat/evolution-refractor-schema-migration`, `test/refractor-tier3`
## Files Changed
**paper-dynasty-database:**
- `app/services/refractor_boost.py` — apply_tier_boost orchestration, display stat helpers, card_type validation, audit idempotency guard
- `app/services/refractor_evaluator.py` — dry_run parameter, computed_tier/computed_fully_evolved in return dict
- `app/routers_v2/refractor.py` — evaluate_game wiring, kill switch, partial multi-tier failure, isoformat crash fix
- `tests/test_refractor_boost.py` — 12 new display stat tests, 79-sum comment fixes
- `tests/test_refractor_boost_integration.py` — 27 new integration tests (new file)
- `tests/test_postgame_refractor.py` — 7 new HTTP endpoint tests
- `tests/test_refractor_evaluator.py` — 6 new dry_run unit tests
**paper-dynasty (parent repo):**
- `docs/refractor-phase2/01-phase1-foundation.md` — 79-sum clarifications
- `docs/refractor-phase2/02-phase2-integration.md` — atomicity boundary, display stats updates


@ -178,7 +178,7 @@ When merging many PRs at once (e.g., batch pagination PRs), branch protection ru
| `LOG_LEVEL` | Logging verbosity (default: INFO) |
| `DATABASE_TYPE` | `postgresql` |
| `POSTGRES_HOST` | Container name of PostgreSQL |
| `POSTGRES_DB` | Database name (`pd_master`) |
| `POSTGRES_DB` | Database name `pd_master` (prod) / `paperdynasty_dev` (dev) |
| `POSTGRES_USER` | DB username |
| `POSTGRES_PASSWORD` | DB password |
@ -189,4 +189,6 @@ When merging many PRs at once (e.g., batch pagination PRs), branch protection ru
| Database API (prod) | `ssh akamai` | `pd_api` | 815 |
| Database API (dev) | `ssh pd-database` | `dev_pd_database` | 813 |
| PostgreSQL (prod) | `ssh akamai` | `pd_postgres` | 5432 |
| PostgreSQL (dev) | `ssh pd-database` | `pd_postgres` | 5432 |
| PostgreSQL (dev) | `ssh pd-database` | `sba_postgres` | 5432 |
**Dev database credentials:** container `sba_postgres`, database `paperdynasty_dev`, user `sba_admin`. Prod uses `pd_postgres`, database `pd_master`.


@ -0,0 +1,170 @@
---
title: "Discord Bot Browser Testing via Playwright + CDP"
description: "Step-by-step workflow for automated Discord bot testing using Playwright connected to Brave browser via Chrome DevTools Protocol. Covers setup, slash command execution, and screenshot capture."
type: runbook
domain: paper-dynasty
tags: [paper-dynasty, discord, testing, playwright, automation]
---
# Discord Bot Browser Testing via Playwright + CDP
Automated testing of Paper Dynasty Discord bot commands by connecting Playwright to a running Brave browser instance with Discord open.
## Prerequisites
- Brave browser installed (`brave-browser-stable`)
- Playwright installed (`pip install playwright && playwright install chromium`)
- Discord logged in via browser (not desktop app)
- Discord bot running (locally via docker-compose or on remote host)
- Bot's `API_TOKEN` must match the target API environment
## Setup
### 1. Launch Brave with CDP enabled
Brave must be started with `--remote-debugging-port`. If Brave is already running, **kill it first** — otherwise the flag is ignored and the new process merges into the existing one.
```bash
killall brave 2>/dev/null; sleep 2; brave-browser-stable --remote-debugging-port=9222 &
```
### 2. Verify CDP is responding
```bash
curl -s http://localhost:9222/json/version | python3 -m json.tool
```
Should return JSON with `Browser`, `webSocketDebuggerUrl`, etc.
### 3. Open Discord in browser
Navigate to `https://discord.com/channels/<server_id>/<channel_id>` in Brave.
**Paper Dynasty test server:**
- Server: Cals Test Server (`669356687294988350`)
- Channel: #pd-game-test (`982850262903451658`)
- URL: `https://discord.com/channels/669356687294988350/982850262903451658`
### 4. Verify bot is running with correct API token
```bash
# Check docker-compose.yml has the right API_TOKEN for the target environment
grep API_TOKEN /mnt/NV2/Development/paper-dynasty/discord-app/docker-compose.yml
# Dev API token lives on the dev host:
ssh pd-database "docker exec sba_postgres psql -U sba_admin -d paperdynasty_dev -c \"SELECT 1;\""
# Restart bot if token was changed:
cd /mnt/NV2/Development/paper-dynasty/discord-app && docker compose up -d
```
## Running Commands
### Find the Discord tab
```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.connect_over_cdp('http://localhost:9222')
    for ctx in browser.contexts:
        for page in ctx.pages:
            if 'discord' in page.url.lower():
                print(f'Found: {page.url}')
                break
    browser.close()
```
### Execute a slash command and capture result
```python
from playwright.sync_api import sync_playwright
import time

def run_slash_command(command: str, wait_seconds: int = 5, screenshot_path: str = '/tmp/discord_result.png'):
    """
    Type a slash command in Discord, select the top autocomplete option,
    submit it, wait for the bot response, and take a screenshot.
    """
    with sync_playwright() as p:
        browser = p.chromium.connect_over_cdp('http://localhost:9222')
        for ctx in browser.contexts:
            for page in ctx.pages:
                if 'discord' in page.url.lower():
                    msg_box = page.locator('[role="textbox"][data-slate-editor="true"]')
                    msg_box.click()
                    time.sleep(0.3)
                    # Type the command (delay simulates human typing for autocomplete)
                    msg_box.type(command, delay=80)
                    time.sleep(2)
                    # Tab selects the top autocomplete option
                    page.keyboard.press('Tab')
                    time.sleep(1)
                    # Enter submits the command
                    page.keyboard.press('Enter')
                    time.sleep(wait_seconds)
                    page.screenshot(path=screenshot_path)
                    print(f'Screenshot saved to {screenshot_path}')
                    break
        browser.close()

# Example usage:
run_slash_command('/refractor status')
```
### Commands with parameters
After pressing Tab to select the command, Discord shows an options panel. To fill parameters:
1. The first parameter input is auto-focused after Tab
2. Type the value, then Tab to move to the next parameter
3. Press Enter when ready to submit
```python
# Example: /refractor status with tier filter
msg_box.type('/refractor status', delay=80)
time.sleep(2)
page.keyboard.press('Tab') # Select command from autocomplete
time.sleep(1)
# Now fill parameters if needed, or just submit
page.keyboard.press('Enter')
```
## Key Selectors
| Element | Selector |
|---------|----------|
| Message input box | `[role="textbox"][data-slate-editor="true"]` |
| Autocomplete popup | `[class*="autocomplete"]` |
## Gotchas
- **Brave must be killed before relaunch** — if an instance is already running, `--remote-debugging-port` is silently ignored
- **Bot token mismatch** — the bot's `API_TOKEN` in `docker-compose.yml` must match the target API (dev or prod). Symptoms: `{"detail":"Unauthorized"}` in bot logs
- **Viewport is None** — when connecting via CDP, `page.viewport_size` returns None. Use `page.evaluate('() => ({w: window.innerWidth, h: window.innerHeight})')` instead
- **Autocomplete timing** — typing too fast may not trigger Discord's autocomplete. The `delay=80` on `msg_box.type()` simulates human speed
- **Multiple bots** — if multiple bots register the same slash command (e.g. MantiTestBot and PucklTestBot), Tab selects the top option. Verify the correct bot name in the autocomplete popup before proceeding
## Test Plan Reference
The Refractor integration test plan is at:
`discord-app/tests/refractor-integration-test-plan.md`
Key test case groups:
- REF-01 to REF-06: Tier badges and display
- REF-10 to REF-15: Progress bars and filtering
- REF-40 to REF-42: Cross-command badges (card, roster)
- REF-70 to REF-72: Cross-command badge propagation (the current priority)
## Verified On
- **Date:** 2026-04-06
- **Browser:** Brave 146.0.7680.178 (Chromium-based)
- **Playwright:** Node.js driver via Python sync API
- **Bot:** MantiTestBot on Cals Test Server, #pd-game-test channel
- **API:** pddev.manticorum.com (dev environment)


@ -0,0 +1,48 @@
---
title: "Fix: /open-packs crash from orphaned Check-In Player packs"
description: "Check-In Player packs with hyphenated name caused empty Discord select menu (400 Bad Request) and KeyError in callback."
type: troubleshooting
domain: paper-dynasty
tags: [troubleshooting, discord, paper-dynasty, packs, hotfix]
---
# Fix: /open-packs crash from orphaned Check-In Player packs
**Date:** 2026-03-26
**PR:** #134 (hotfix branch based on prod tag 2026.3.4, merged to main)
**Tag:** 2026.3.8
**Severity:** High — any user with an orphaned Check-In Player pack could not open any packs at all
## Problem
Running `/open-packs` returned: `HTTPException: 400 Bad Request (error code: 50035): Invalid Form Body --- In data.components.0.components.0.options: This field is required`
Discord rejected the message because the select menu had zero options.
## Root Cause
Two cascading bugs triggered by the "Check-In Player" pack type name containing a hyphen:
1. **Empty select menu:** The `pretty_name` logic used `'-' not in key` to identify bare pack type names. "Check-In Player" contains a hyphen, so it fell into the `elif 'Team' in key` / `elif 'Cardset' in key` chain — matching neither. `pretty_name` stayed `None`, no `SelectOption` was created, and Discord rejected the empty options list.
2. **KeyError in callback (secondary):** Even if displayed, selecting "Check-In Player" would call `self.values[0].split('-')` producing `['Check', 'In Player']`, which matched none of the pack type tokens in the `if/elif` chain, raising `KeyError`.
Check-In Player packs are normally auto-opened during the daily check-in (`/comeonmanineedthis`). An orphaned pack existed because `roll_for_cards` had previously failed mid-flow, leaving an unopened pack in inventory.
## Fix
Three-layer fix applied to both `cogs/economy.py` (production) and `cogs/economy_new/packs.py` (main):
1. **Filter at source:** Added `AUTO_OPEN_TYPES = {"Check-In Player"}` set. Packs with these types are skipped during grouping with `continue`, so they never reach the select menu.
2. **Fallback for hyphenated names:** Added `else: pretty_name = key` after the `Team`/`Cardset` checks, so any future hyphenated pack type names still get a display label.
3. **Graceful error in callback:** Replaced `raise KeyError` with a user-facing ephemeral message ("This pack type cannot be opened manually. Please contact Cal.") and `return`.
Also changed all "contact an admin" strings to "contact Cal" in `discord_ui/selectors.py`.
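Layers 1 and 2 can be sketched as follows (the real `Team`/`Cardset` formatting is elided; `replace('-', ' ')` below is a stand-in, not the actual display logic):

```python
AUTO_OPEN_TYPES = {"Check-In Player"}

def select_menu_labels(pack_keys):
    labels = []
    for key in pack_keys:
        if key in AUTO_OPEN_TYPES:
            continue                            # layer 1: filter at source
        if '-' not in key:
            pretty_name = key                   # bare pack type name
        elif 'Team' in key:
            pretty_name = key.replace('-', ' ')  # stand-in for real formatting
        elif 'Cardset' in key:
            pretty_name = key.replace('-', ' ')  # stand-in for real formatting
        else:
            pretty_name = key                   # layer 2: fallback for hyphenated names
        labels.append(pretty_name)
    return labels
```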
## Lessons
- **Production loads `cogs/economy.py`, not `cogs/economy_new/packs.py`.** The initial fix was applied to the wrong file. Always check which cogs are actually loaded by inspecting the bot startup logs (`Loaded cog: ...`) before assuming which file handles a command.
- **Hotfix branches based on old tags may have stale CI workflows.** The `docker-build.yml` at the tagged commit had an older trigger config (branch push, not tag push), so the CalVer tag silently failed to trigger CI. Cherry-pick the current workflow into hotfix branches.
- **Pack type names are used as dict keys and split on hyphens** throughout the open-packs flow. Any new pack type with a hyphen in its name will hit similar issues unless the grouping/parsing logic is refactored to stop using hyphen-delimited strings as composite keys.


@ -0,0 +1,107 @@
---
title: "Refractor In-App Test Plan"
description: "Comprehensive manual test plan for the Refractor card evolution system — covers /refractor status, tier badges, post-game hooks, tier-up notifications, card art tiers, and known issues."
type: guide
domain: paper-dynasty
tags: [paper-dynasty, testing, refractor, discord, database]
---
# Refractor In-App Test Plan
Manual test plan for the Refractor (card evolution) system. All testing targets **dev** environment (`pddev.manticorum.com` / dev Discord bot).
## Prerequisites
- Dev bot running on `sba-bots`
- Dev API at `pddev.manticorum.com` (port 813)
- Team with seeded refractor data (team 31 from prior session)
- At least one game playable to trigger post-game hooks
---
## REF-10: `/refractor status` — Basic Display
| # | Test | Steps | Expected |
|---|---|---|---|
| 10 | No filters | `/refractor status` | Ephemeral embed with team branding, tier summary line, 10 cards sorted by tier DESC, pagination buttons if >10 cards |
| 11 | Card type filter | `/refractor status card_type:Batter` | Only batter cards shown, count matches |
| 12 | Tier filter | `/refractor status tier:T2—Refractor` | Only T2 cards, embed color changes to tier color |
| 13 | Progress filter | `/refractor status progress:Close to next tier` | Only cards >=80% to next threshold, fully evolved excluded |
| 14 | Combined filters | `/refractor status card_type:Batter tier:T1—Base Chrome` | Intersection of both filters |
| 15 | Empty result | `/refractor status tier:T4—Superfractor` (if none exist) | "No cards match your filters..." message with filter details |
## REF-20: `/refractor status` — Pagination
| # | Test | Steps | Expected |
|---|---|---|---|
| 20 | Page buttons appear | `/refractor status` with >10 cards | Prev/Next buttons visible |
| 21 | Next page | Click `Next >` | Page 2 shown, footer updates to "Page 2/N" |
| 22 | Prev page | From page 2, click `< Prev` | Back to page 1 |
| 23 | First page prev | On page 1, click `< Prev` | Nothing happens / stays on page 1 |
| 24 | Last page next | On last page, click `Next >` | Nothing happens / stays on last page |
| 25 | Button timeout | Wait 120s after command | Buttons become unresponsive |
| 26 | Wrong user clicks | Another user clicks buttons | Silently ignored |
## REF-30: Tier Badges in Card Embeds
| # | Test | Steps | Expected |
|---|---|---|---|
| 30 | T0 card display | View a T0 card via `/myteam` or `/roster` | No badge prefix, just player name |
| 31 | T1 badge | View a T1 card | Title shows `[BC] Player Name` |
| 32 | T2 badge | View a T2 card | Title shows `[R] Player Name` |
| 33 | T3 badge | View a T3 card | Title shows `[GR] Player Name` |
| 34 | T4 badge | View a T4 card (if exists) | Title shows `[SF] Player Name` |
| 35 | Badge in pack open | Open a pack with an evolved card | Badge appears in pack embed |
| 36 | API down gracefully | (hard to test) | Card displays normally with no badge, no error |
## REF-50: Post-Game Hook & Tier-Up Notifications
| # | Test | Steps | Expected |
|---|---|---|---|
| 50 | Game completes normally | Play a full game | No errors in bot logs; refractor evaluate-game fires after season-stats update |
| 51 | Tier-up notification | Play game where a card crosses a threshold | Embed in game channel: "Refractor Tier Up!", player name, tier name, correct color |
| 52 | No tier-up | Play game where no thresholds crossed | No refractor embed posted, game completes normally |
| 53 | Multiple tier-ups | Game where 2+ players tier up | One embed per tier-up, all posted |
| 54 | Auto-init new card | Play game with a card that has no RefractorCardState | State created automatically, player evaluated, no error |
| 55 | Superfractor notification | (may need forced data) | "SUPERFRACTOR!" title, teal color |
## REF-60: Card Art with Tiers (API-level)
| # | Test | Steps | Expected |
|---|---|---|---|
| 60 | T0 card image | `GET /api/v2/players/{id}/card-image?card_type=batting` | Base card, no tier styling |
| 61 | Tier override | `GET ...?card_type=batting&tier=2` | Refractor styling visible (border, diamond indicator) |
| 62 | Each tier visual | `?tier=1` through `?tier=4` | Correct border colors, diamond fill, header gradients per tier |
| 63 | Pitcher card | `?card_type=pitching&tier=2` | Tier styling applies correctly to pitcher layout |
## REF-70: Known Issues to Verify
| # | Issue | Check | Status |
|---|---|---|---|
| 70 | Superfractor embed says "Rating boosts coming in a future update!" | Verify — boosts ARE implemented now, text is stale | **Fix needed** |
| 71 | `on_timeout` doesn't edit message | Buttons stay visually active after 120s | **Known, low priority** |
| 72 | Card embed perf (1 API call per card) | Note latency on roster views with 10+ cards | **Monitor** |
| 73 | Season-stats failure kills refractor eval | Both in same try/except | **Known risk, verify logging** |
---
## API Endpoints Under Test
| Method | Endpoint | Used By |
|---|---|---|
| GET | `/api/v2/refractor/tracks` | Track listing |
| GET | `/api/v2/refractor/cards?team_id=X` | `/refractor status` command |
| GET | `/api/v2/refractor/cards/{card_id}` | Tier badge in card embeds |
| POST | `/api/v2/refractor/cards/{card_id}/evaluate` | Force re-evaluation |
| POST | `/api/v2/refractor/evaluate-game/{game_id}` | Post-game hook |
| GET | `/api/v2/teams/{team_id}/refractors` | Teams alias endpoint |
| GET | `/api/v2/players/{id}/card-image?tier=N` | Card art tier preview |
## Notification Embed Colors
| Tier | Name | Color |
|---|---|---|
| T1 | Base Chrome | Green (0x2ECC71) |
| T2 | Refractor | Gold (0xF1C40F) |
| T3 | Gold Refractor | Purple (0x9B59B6) |
| T4 | Superfractor | Teal (0x1ABC9C) |
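For quick scripting against these values, the table above as a constant (a sketch; the bot's actual constant name is an assumption):

```python
# Tier number → Discord embed color, matching the notification table above.
TIER_COLORS = {
    1: 0x2ECC71,  # T1 Base Chrome (green)
    2: 0xF1C40F,  # T2 Refractor (gold)
    3: 0x9B59B6,  # T3 Gold Refractor (purple)
    4: 0x1ABC9C,  # T4 Superfractor (teal)
}
```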


@ -0,0 +1,62 @@
---
title: "Codex-to-Claude Agent Converter & Plugin Marketplace"
description: "Pipeline that converts VoltAgent/awesome-codex-subagents TOML definitions to Claude Code plugin marketplace format, hosted at cal/codex-agents on Gitea."
type: reference
domain: productivity
tags: [claude-code, automation, plugins, agents, gitea]
---
# Codex Agents Marketplace
## Overview
136+ specialized agent definitions converted from [VoltAgent/awesome-codex-subagents](https://github.com/VoltAgent/awesome-codex-subagents) (OpenAI Codex format) to Claude Code plugin marketplace format.
- **Repo**: `cal/codex-agents` on Gitea (`git@git.manticorum.com:cal/codex-agents.git`)
- **Local path**: `/mnt/NV2/Development/codex-agents/`
- **Upstream**: Cloned to `upstream/` (gitignored), pulled on each sync
## Sync Pipeline
```bash
cd /mnt/NV2/Development/codex-agents
./sync.sh # pull upstream + convert changed agents
./sync.sh --force # re-convert all regardless of hash
./sync.sh --dry-run # preview only
./sync.sh --verbose # per-agent status
```
- `convert.py` handles TOML → Markdown+YAML frontmatter conversion
- SHA-256 per-file hashes in `codex-manifest.json` skip unchanged agents
- Deleted upstream agents are auto-removed locally
- `.claude-plugin/marketplace.json` is regenerated on each sync
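The hash-skip step can be sketched as (the manifest layout is an assumption; `convert.py` may key entries differently):

```python
import hashlib
from pathlib import Path

def needs_convert(agent_path: Path, manifest: dict) -> bool:
    """True when the upstream file's SHA-256 differs from the manifest entry."""
    digest = hashlib.sha256(agent_path.read_bytes()).hexdigest()
    return manifest.get(agent_path.name) != digest

def update_manifest(agent_path: Path, manifest: dict) -> None:
    """Record the current hash so the next sync skips an unchanged agent."""
    manifest[agent_path.name] = hashlib.sha256(agent_path.read_bytes()).hexdigest()
```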
## Format Mapping
| Codex | Claude Code |
|-------|------------|
| `gpt-5.4` + `high` | `model: opus` |
| `gpt-5.3-codex-spark` + `medium` | `model: sonnet` |
| `sandbox_mode: read-only` | `disallowedTools: Edit, Write` |
| `sandbox_mode: workspace-write` | full tool access |
| `developer_instructions` | markdown body |
| `"parent agent"` | replaced with `"orchestrating agent"` |
## Installing Agents
Add marketplace to `~/.claude/settings.json`:
```json
"extraKnownMarketplaces": {
"codex-agents": { "source": { "source": "git", "url": "https://git.manticorum.com/cal/codex-agents.git" } }
}
```
Then:
```bash
claude plugin update codex-agents
claude plugin install docker-expert@codex-agents --scope user
```
## Agent Categories
10 categories: Core Development (12), Language Specialists (27), Infrastructure (16), Quality & Security (16), Data & AI (12), Developer Experience (13), Specialized Domains (12), Business & Product (11), Meta & Orchestration (10), Research & Analysis (7).


@ -158,6 +158,23 @@ ls -t ~/.local/share/claude-scheduled/logs/backlog-triage/ | head -1
~/.config/claude-scheduled/runner.sh backlog-triage
```
## Session Resumption
Tasks can opt into session persistence for multi-step workflows:
```json
{
"session_resumable": true,
"resume_last_session": true
}
```
When `session_resumable` is `true`, runner.sh saves the `session_id` to `$LOG_DIR/last_session_id` after each run. When `resume_last_session` is also `true`, the next run resumes that session with `--resume`.
Issue-poller and PR-reviewer capture `session_id` in logs and result JSON for manual follow-up.
See also: [Agent SDK Evaluation](agent-sdk-evaluation.md) for CLI vs SDK comparison.
## Cost Safety
- Per-task `max_budget_usd` cap — runner.sh detects `error_max_budget_usd` and warns


@ -0,0 +1,175 @@
---
title: "Agent SDK Evaluation — CLI vs Python/TypeScript SDK"
description: "Comparison of Claude Code CLI invocation (claude -p) vs the native Agent SDK for programmatic use in the headless-claude and claude-scheduled systems."
type: context
domain: scheduled-tasks
tags: [claude-code, sdk, agent-sdk, python, typescript, headless, automation, evaluation]
---
# Agent SDK Evaluation: CLI vs Python/TypeScript SDK
**Date:** 2026-04-03
**Status:** Evaluation complete — recommendation below
**Related:** Issue #3 (headless-claude: Additional Agent SDK improvements)
## 1. Current Approach — CLI via `claude -p`
All headless Claude invocations use the CLI subprocess pattern:
```bash
claude -p "<prompt>" \
--model sonnet \
--output-format json \
--allowedTools "Read,Grep,Glob" \
--append-system-prompt "..." \
--max-budget-usd 2.00
```
**Pros:**
- Simple to invoke from any language (bash, n8n SSH nodes, systemd units)
- Uses Claude Max OAuth — no API key needed, no per-token billing
- Mature and battle-tested in our scheduled-tasks framework
- CLAUDE.md and settings.json are loaded automatically
- No runtime dependencies beyond the CLI binary
**Cons:**
- Structured output requires parsing JSON from stdout
- Error handling is exit-code-based with stderr parsing
- No mid-stream observability (streaming requires JSONL parsing)
- Tool approval is allowlist-only — no dynamic per-call decisions
- Session resumption requires manual `--resume` flag plumbing
## 2. Python Agent SDK
**Package:** `claude-agent-sdk` (renamed from `claude-code`)
**Install:** `pip install claude-agent-sdk`
**Requires:** Python 3.10+, `ANTHROPIC_API_KEY` env var
```python
import asyncio

from claude_agent_sdk import query, ClaudeAgentOptions

async def main():
    async for message in query(
        prompt="Diagnose server health",
        options=ClaudeAgentOptions(
            allowed_tools=["Read", "Grep", "Bash(python3 *)"],
            output_format={"type": "json_schema", "schema": {...}},
            max_budget_usd=2.00,
        ),
    ):
        if hasattr(message, "result"):
            print(message.result)

asyncio.run(main())
```
**Key features:**
- Async generator with typed `SDKMessage` objects (User, Assistant, Result, System)
- `ClaudeSDKClient` for stateful multi-turn conversations
- `can_use_tool` callback for dynamic per-call tool approval
- In-process hooks (`PreToolUse`, `PostToolUse`, `Stop`, etc.)
- `rewindFiles()` to restore filesystem to any prior message point
- Typed exception hierarchy (`CLINotFoundError`, `ProcessError`, etc.)
**Limitation:** Shells out to the Claude Code CLI binary — it is NOT a pure HTTP client. The binary must be installed.
## 3. TypeScript Agent SDK
**Package:** `@anthropic-ai/claude-agent-sdk` (renamed from `@anthropic-ai/claude-code`)
**Install:** `npm install @anthropic-ai/claude-agent-sdk`
**Requires:** Node 18+, `ANTHROPIC_API_KEY` env var
```typescript
import { query } from "@anthropic-ai/claude-agent-sdk";

for await (const message of query({
  prompt: "Diagnose server health",
  options: {
    allowedTools: ["Read", "Grep", "Bash(python3 *)"],
    maxBudgetUsd: 2.00,
  },
})) {
  if ("result" in message) console.log(message.result);
}
}
```
**Key features (superset of Python):**
- Same async generator pattern
- `"auto"` permission mode (model classifier per tool call) — TS-only
- `spawnClaudeCodeProcess` hook for remote/containerized execution
- `setMcpServers()` for dynamic MCP server swapping mid-session
- V2 preview: `send()` / `stream()` patterns for simpler multi-turn
- Bundles the Claude Code binary — no separate install needed
## 4. Comparison Matrix
| Capability | `claude -p` CLI | Python SDK | TypeScript SDK |
|---|---|---|---|
| **Auth** | OAuth (Claude Max) | API key only | API key only |
| **Invocation** | Shell subprocess | Async generator | Async generator |
| **Structured output** | `--json-schema` flag | Schema in options | Schema in options |
| **Streaming** | JSONL parsing | Typed messages | Typed messages |
| **Tool approval** | `--allowedTools` only | `can_use_tool` callback | `canUseTool` callback + auto mode |
| **Session resume** | `--resume` flag | `resume: sessionId` | `resume: sessionId` |
| **Cost tracking** | Parse result JSON | `ResultMessage.total_cost_usd` | Same + per-model breakdown |
| **Error handling** | Exit codes + stderr | Typed exceptions | Typed exceptions |
| **Hooks** | External shell scripts | In-process callbacks | In-process callbacks |
| **Custom tools** | Not available | `tool()` decorator | `tool()` + Zod schemas |
| **Subagents** | Not programmatic | `agents` option | `agents` option |
| **File rewind** | Not available | `rewindFiles()` | `rewindFiles()` |
| **MCP servers** | `--mcp-config` file | Inline config object | Inline + dynamic swap |
| **CLAUDE.md loading** | Automatic | Must opt-in (`settingSources`) | Must opt-in |
| **Dependencies** | CLI binary | CLI binary + Python | Node 18+ (bundles CLI) |
## 5. Integration Paths
### A. n8n Code Nodes
The n8n Code node supports JavaScript (not TypeScript directly, but the SDK's JS output works). This would replace the current SSH → CLI pattern:
```
Schedule Trigger → Code Node (JS, uses SDK) → IF → Discord
```
**Trade-off:** Eliminates the SSH hop to CT 300, but requires `ANTHROPIC_API_KEY` and n8n to have the npm package installed. Current n8n runs in a Docker container on CT 210 — would need the SDK and CLI binary in the image.
### B. Standalone Python Scripts
Replace `claude -p` subprocess calls in custom dispatchers with the Python SDK:
```python
# Instead of: subprocess.run(["claude", "-p", prompt, ...])
async for msg in query(prompt=prompt, options=opts):
    ...
```
**Trade-off:** Richer error handling and streaming, but our dispatchers are bash scripts, not Python. Would require rewriting `runner.sh` and dispatchers in Python.
### C. Systemd-triggered Tasks (Current Architecture)
Keep systemd timers → bash scripts, but optionally invoke a thin Python wrapper that uses the SDK instead of `claude -p` directly.
**Trade-off:** Adds Python as a dependency for scheduled tasks that currently only need bash + the CLI binary. Marginal benefit unless we need hooks or dynamic tool approval.
## 6. Recommendation
**Stay with CLI invocation for now. Revisit the Python SDK when we need dynamic tool approval or in-process hooks.**
### Rationale
1. **Auth is the blocker.** The SDK requires `ANTHROPIC_API_KEY` (API billing). Our entire scheduled-tasks framework runs on Claude Max OAuth at zero marginal cost. Switching to the SDK means paying per-token for every scheduled task, issue-worker, and PR-reviewer invocation. This alone makes the SDK non-viable for our current architecture.
2. **The CLI covers our needs.** With `--append-system-prompt` (done), `--resume` (this PR), `--json-schema`, and `--allowedTools`, the CLI provides everything we currently need. Session resumption was the last missing piece.
3. **Bash scripts are the right abstraction.** Our runners are launched by systemd timers. Bash + CLI is the natural fit — no runtime dependencies, no async event loops, no package management.
### When to Revisit
- If Anthropic adds OAuth support to the SDK (eliminating the billing difference)
- If we need dynamic tool approval (e.g., "allow this Bash command but deny that one" at runtime)
- If we build a long-running Python service that orchestrates multiple Claude sessions (the `ClaudeSDKClient` stateful pattern would be valuable there)
- If we move to n8n custom nodes written in TypeScript (the TS SDK bundles the CLI binary)
### Migration Path (If Needed Later)
1. Start with the Python SDK in a single task (e.g., `backlog-triage`) as a proof of concept
2. Create a thin `sdk-runner.py` wrapper that reads the same `settings.json` and `prompt.md` files
3. Swap the systemd unit's `ExecStart` from `runner.sh` to `sdk-runner.py`
4. Expand to other tasks if the POC proves valuable


@ -245,11 +245,25 @@ hosts:
- sqlite-major-domo
- temp-postgres
# Docker Home Servers VM (Proxmox) - decommission candidate
# VM 116: Only Jellyfin remains after 2026-04-03 cleanup (watchstate removed — duplicate of manticore's canonical instance)
# Jellyfin on manticore already covers this service. VM 116 + VM 110 are candidates to reclaim 8 vCPUs + 16 GB RAM.
# See issue #31 for cleanup details.
docker-home-servers:
type: docker
ip: 10.10.0.124
vmid: 116
user: cal
description: "Legacy home servers VM — Jellyfin only, decommission candidate"
config_paths:
docker-compose: /home/cal/container-data
services:
- jellyfin # only remaining service; duplicate of ubuntu-manticore jellyfin
decommission_candidate: true
notes: "watchstate removed 2026-04-03 (duplicate of manticore); 3.36 GB images pruned; see issue #31"
# Decommissioned hosts (kept for reference)
# decommissioned:
# tdarr-old:
# ip: 10.10.0.43
# note: "Replaced by ubuntu-manticore tdarr"
# docker-home:
# ip: 10.10.0.124
# note: "Decommissioned"


@ -0,0 +1,246 @@
---
title: "Proxmox Monthly Maintenance Reboot"
description: "Runbook for the first-Sunday-of-the-month Proxmox host reboot — dependency-aware shutdown/startup order, validation checklist, and Ansible automation."
type: runbook
domain: server-configs
tags: [proxmox, maintenance, reboot, ansible, operations, systemd]
---
# Proxmox Monthly Maintenance Reboot
## Overview
| Detail | Value |
|--------|-------|
| **Schedule** | 1st Sunday of every month, 3:00 AM ET (08:00 UTC) |
| **Expected downtime** | ~15 minutes (host reboot + VM/LXC startup) |
| **Orchestration** | Ansible on LXC 304 — shutdown playbook → host reboot → post-reboot startup playbook |
| **Calendar** | Google Calendar recurring event: "Proxmox Monthly Maintenance Reboot" |
| **HA DNS** | ubuntu-manticore (10.10.0.226) provides Pi-hole 2 during Proxmox downtime |
## Why
- Kernel updates accumulate without reboot and never take effect
- Long uptimes allow memory leaks and process state drift (e.g., avahi busy-loops)
- Validates that all VMs/LXCs auto-start cleanly with `onboot: 1`
## Architecture
The reboot is split into two playbooks because LXC 304 (the Ansible controller) is itself a guest on the Proxmox host being rebooted:
1. **`monthly-reboot.yml`** — Snapshots all guests, shuts them down in dependency order, issues a fire-and-forget `reboot` to the Proxmox host, then exits. LXC 304 is killed when the host reboots.
2. **`post-reboot-startup.yml`** — After the host reboots, LXC 304 auto-starts via `onboot: 1`. A systemd service (`ansible-post-reboot.service`) waits 120 seconds for the Proxmox API to stabilize, then starts all guests in dependency order with staggered delays.
The `onboot: 1` flag on all production guests acts as a safety net — even if the post-reboot playbook fails, Proxmox will start everything (though without controlled ordering).
## Prerequisites (Before Maintenance)
- [ ] Verify no active Tdarr transcodes on ubuntu-manticore
- [ ] Verify no running database backups
- [ ] Ensure workstation has Pi-hole 2 (10.10.0.226) as a fallback DNS server so it fails over automatically during downtime
- [ ] Confirm ubuntu-manticore Pi-hole 2 is healthy: `ssh manticore "docker exec pihole pihole status"`
## `onboot` Audit
All production VMs and LXCs must have `onboot: 1` so they restart automatically as a safety net.
**Check VMs:**
```bash
ssh proxmox "for id in \$(qm list | awk 'NR>1{print \$1}'); do \
name=\$(qm config \$id | grep '^name:' | awk '{print \$2}'); \
onboot=\$(qm config \$id | grep '^onboot:'); \
echo \"VM \$id (\$name): \${onboot:-onboot NOT SET}\"; \
done"
```
**Check LXCs:**
```bash
ssh proxmox "for id in \$(pct list | awk 'NR>1{print \$1}'); do \
name=\$(pct config \$id | grep '^hostname:' | awk '{print \$2}'); \
onboot=\$(pct config \$id | grep '^onboot:'); \
echo \"LXC \$id (\$name): \${onboot:-onboot NOT SET}\"; \
done"
```
**Audit results (2026-04-03):**
| ID | Name | Type | `onboot` | Status |
|----|------|------|----------|--------|
| 106 | docker-home | VM | 1 | OK |
| 109 | homeassistant | VM | 1 | OK (fixed 2026-04-03) |
| 110 | discord-bots | VM | 1 | OK |
| 112 | databases-bots | VM | 1 | OK |
| 115 | docker-sba | VM | 1 | OK |
| 116 | docker-home-servers | VM | 1 | OK |
| 210 | docker-n8n-lxc | LXC | 1 | OK |
| 221 | arr-stack | LXC | 1 | OK (fixed 2026-04-03) |
| 222 | memos | LXC | 1 | OK |
| 223 | foundry-lxc | LXC | 1 | OK (fixed 2026-04-03) |
| 225 | gitea | LXC | 1 | OK |
| 227 | uptime-kuma | LXC | 1 | OK |
| 301 | claude-discord-coordinator | LXC | 1 | OK |
| 302 | claude-runner | LXC | 1 | OK |
| 303 | mcp-gateway | LXC | 0 | Intentional (on-demand) |
| 304 | ansible-controller | LXC | 1 | OK |
**If any production guest is missing `onboot: 1`:**
```bash
ssh proxmox "qm set <VMID> --onboot 1" # for VMs
ssh proxmox "pct set <CTID> --onboot 1" # for LXCs
```
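A bulk variant of the fix can be sketched as follows, assuming LXC 303 is the only guest meant to stay at `onboot: 0` (per the audit table above):

```shell
# Sketch: apply onboot=1 to every guest missing it, skipping LXC 303
# (intentionally on-demand). Assumption: no other guest should stay off.
ssh proxmox '
  for id in $(qm list | awk "NR>1{print \$1}"); do
    qm config "$id" | grep -q "^onboot: 1" || qm set "$id" --onboot 1
  done
  for id in $(pct list | awk "NR>1{print \$1}"); do
    [ "$id" = 303 ] && continue
    pct config "$id" | grep -q "^onboot: 1" || pct set "$id" --onboot 1
  done
'
```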
## Shutdown Order (Dependency-Aware)
Reverse of the validated startup sequence. Stop consumers before their dependencies. Each tier polls per-guest status rather than using fixed waits.
```
Tier 4 — Media & Others (no downstream dependents)
VM 109 homeassistant
LXC 221 arr-stack
LXC 222 memos
LXC 223 foundry-lxc
LXC 302 claude-runner
Tier 3 — Applications (depend on databases + infra)
VM 115 docker-sba (Paper Dynasty, Major Domo)
VM 110 discord-bots
LXC 301 claude-discord-coordinator
Tier 2 — Infrastructure + DNS (depend on databases)
VM 106 docker-home (Pi-hole 1, NPM)
LXC 225 gitea
LXC 210 docker-n8n-lxc
LXC 227 uptime-kuma
VM 116 docker-home-servers
Tier 1 — Databases (no dependencies, shut down last)
VM 112 databases-bots (force-stop after 90s if ACPI ignored)
  → LXC 304 issues a fire-and-forget reboot to the Proxmox host, then is itself taken down by the host reboot
```
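The per-guest polling each tier relies on can be sketched as a small helper; the 5 s interval and the force-stop fallback are assumptions modeled on the VM 112 quirk, not taken from the playbook itself:

```shell
# Poll a VM until it reports "stopped"; force-stop once the timeout elapses,
# instead of sleeping a fixed interval per tier.
wait_stopped() {
  vmid=$1; timeout=$2; elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    status=$(ssh proxmox "qm status $vmid" | awk '{print $2}')
    [ "$status" = "stopped" ] && return 0
    sleep 5; elapsed=$((elapsed + 5))
  done
  echo "VM $vmid ignored ACPI shutdown; forcing stop"
  ssh proxmox "qm stop $vmid"
}

ssh proxmox "qm shutdown 112" &   # issue ACPI shutdown without blocking
wait_stopped 112 90
```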
**Known quirks:**
- VM 112 (databases-bots) may ignore ACPI shutdown — playbook force-stops after 90s
- VM 109 (homeassistant) is self-managed via HA Supervisor, excluded from Ansible inventory
- LXC 303 (mcp-gateway) has `onboot: 0` and is operator-managed — not included in shutdown/startup. If it was running before maintenance, bring it up manually afterward
## Startup Order (Staggered)
After the Proxmox host reboots, LXC 304 auto-starts and the `ansible-post-reboot.service` waits 120s before running the controlled startup:
```
Tier 1 — Databases first
VM 112 databases-bots
→ wait 30s for DB to accept connections
Tier 2 — Infrastructure + DNS
VM 106 docker-home (Pi-hole 1, NPM)
LXC 225 gitea
LXC 210 docker-n8n-lxc
LXC 227 uptime-kuma
VM 116 docker-home-servers
→ wait 30s
Tier 3 — Applications
VM 115 docker-sba
VM 110 discord-bots
LXC 301 claude-discord-coordinator
→ wait 30s
Pi-hole fix — restart container via SSH to clear UDP DNS bug
ssh docker-home "docker restart pihole"
→ wait 10s
Tier 4 — Media & Others
VM 109 homeassistant
LXC 221 arr-stack
LXC 222 memos
LXC 223 foundry-lxc
LXC 302 claude-runner
```
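The fixed 30 s DB wait in Tier 1 could also be an active probe. A bash sketch, assuming the databases listen on 10.10.0.112:5432 (host and port are guesses, not from the runbook):

```shell
#!/bin/bash
# Probe a TCP port until it accepts connections (uses bash's /dev/tcp),
# instead of sleeping a fixed interval.
wait_for_port() {
  local host=$1 port=$2 tries=${3:-30} i=0
  while [ "$i" -lt "$tries" ]; do
    if (exec 3<>"/dev/tcp/$host/$port") 2>/dev/null; then
      return 0   # subshell closes fd 3 on exit
    fi
    sleep 1; i=$((i + 1))
  done
  return 1
}

ssh proxmox "qm start 112"
wait_for_port 10.10.0.112 5432 60 && echo "database port open; starting Tier 2"
```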
## Post-Reboot Validation
- [ ] Pi-hole 1 DNS resolving: `ssh docker-home "docker exec pihole dig google.com @127.0.0.1"`
- [ ] Gitea accessible: `curl -sf https://git.manticorum.com/api/v1/version`
- [ ] n8n healthy: `ssh docker-n8n-lxc "docker ps --filter name=n8n --format '{{.Status}}'"`
- [ ] Discord bots responding (check Discord)
- [ ] Uptime Kuma dashboard green: `curl -sf http://10.10.0.227:3001/api/status-page/homelab`
- [ ] Home Assistant running: `curl -sf http://10.10.0.109:8123/api/ -H 'Authorization: Bearer <token>'`
- [ ] Maintenance snapshots cleaned up (auto, 7-day retention)
## Automation
### Ansible Playbooks
Both playbooks live at `/opt/ansible/playbooks/` on LXC 304.
```bash
# Dry run — shutdown only
ssh ansible "ansible-playbook /opt/ansible/playbooks/monthly-reboot.yml --check"
# Manual full execution — shutdown + reboot
ssh ansible "ansible-playbook /opt/ansible/playbooks/monthly-reboot.yml"
# Manual post-reboot startup (if automatic startup failed)
ssh ansible "ansible-playbook /opt/ansible/playbooks/post-reboot-startup.yml"
# Shutdown only — skip the host reboot
ssh ansible "ansible-playbook /opt/ansible/playbooks/monthly-reboot.yml --tags shutdown"
```
### Systemd Units (on LXC 304)
| Unit | Purpose | Schedule |
|------|---------|----------|
| `ansible-monthly-reboot.timer` | Triggers shutdown + reboot playbook | 1st Sunday of month, 08:00 UTC |
| `ansible-monthly-reboot.service` | Runs `monthly-reboot.yml` | Activated by timer |
| `ansible-post-reboot.service` | Runs `post-reboot-startup.yml` | On boot (multi-user.target), only if uptime < 10 min |
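A sketch of what the post-reboot unit likely looks like. The `ExecCondition` uptime guard is an assumption about how the "only if uptime < 10 min" gate is implemented (requires systemd ≥ 243); the 120 s delay mirrors the wait described above:

```ini
[Unit]
Description=Controlled guest startup after host reboot
After=network-online.target
Wants=network-online.target

[Service]
Type=oneshot
# Skip unless the controller booted within the last 10 minutes
ExecCondition=/bin/sh -c 'test "$(cut -d. -f1 /proc/uptime)" -lt 600'
ExecStartPre=/bin/sleep 120
ExecStart=/usr/bin/ansible-playbook /opt/ansible/playbooks/post-reboot-startup.yml

[Install]
WantedBy=multi-user.target
```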
```bash
# Check timer status
ssh ansible "systemctl status ansible-monthly-reboot.timer"
# Next scheduled run
ssh ansible "systemctl list-timers ansible-monthly-reboot.timer"
# Check post-reboot service status
ssh ansible "systemctl status ansible-post-reboot.service"
# Disable for a month (e.g., during an incident)
ssh ansible "systemctl stop ansible-monthly-reboot.timer"
```
### Deployment (one-time setup on LXC 304)
```bash
# Copy playbooks
scp ansible/playbooks/monthly-reboot.yml ansible:/opt/ansible/playbooks/
scp ansible/playbooks/post-reboot-startup.yml ansible:/opt/ansible/playbooks/
# Copy and enable systemd units
scp ansible/systemd/ansible-monthly-reboot.timer ansible:/etc/systemd/system/
scp ansible/systemd/ansible-monthly-reboot.service ansible:/etc/systemd/system/
scp ansible/systemd/ansible-post-reboot.service ansible:/etc/systemd/system/
ssh ansible "sudo systemctl daemon-reload && \
sudo systemctl enable --now ansible-monthly-reboot.timer && \
sudo systemctl enable ansible-post-reboot.service"
# Verify SSH key access from LXC 304 to docker-home (needed for Pi-hole restart)
ssh ansible "ssh -o BatchMode=yes docker-home 'echo ok'"
```
## Rollback
If a guest fails to start after reboot:
1. Check Proxmox web UI or `pvesh get /nodes/proxmox/qemu/<VMID>/status/current`
2. Review guest logs: `ssh proxmox "journalctl -u pve-guests -n 50"`
3. Manual start: `ssh proxmox "pvesh create /nodes/proxmox/qemu/<VMID>/status/start"`
4. If guest is corrupted, restore from the pre-reboot Proxmox snapshot
5. If post-reboot startup failed entirely, run manually: `ssh ansible "ansible-playbook /opt/ansible/playbooks/post-reboot-startup.yml"`
## Related Documentation
- [Ansible Controller Setup](../../vm-management/ansible-controller-setup.md) — LXC 304 details and inventory
- [Proxmox 7→9 Upgrade Plan](../../vm-management/proxmox-upgrades/proxmox-7-to-9-upgrade-plan.md) — original startup order and Phase 1 lessons
- [VM Decommission Runbook](../../vm-management/vm-decommission-runbook.md) — removing VMs from the rotation


@ -1,15 +0,0 @@
agent: 1
boot: order=scsi0;net0
cores: 8
memory: 16384
meta: creation-qemu=6.1.0,ctime=1646688596
name: docker-vpn
net0: virtio=76:36:85:A7:6A:A3,bridge=vmbr0,firewall=1
numa: 0
onboot: 1
ostype: l26
scsi0: local-lvm:vm-105-disk-0,size=256G
scsihw: virtio-scsi-pci
smbios1: uuid=55061264-b9b1-4ce4-8d44-9c187affcb1d
sockets: 1
vmgenid: 30878bdf-66f9-41bf-be34-c31b400340f9


@ -1,7 +1,7 @@
agent: 1
boot: order=scsi0;net0
cores: 4
memory: 16384
memory: 6144
meta: creation-qemu=6.1.0,ctime=1646083628
name: docker-home
net0: virtio=BA:65:DF:88:85:4C,bridge=vmbr0,firewall=1
@ -11,5 +11,5 @@ ostype: l26
scsi0: local-lvm:vm-106-disk-0,size=256G
scsihw: virtio-scsi-pci
smbios1: uuid=54ef12fc-edcc-4744-a109-dd2de9a6dc03
sockets: 2
sockets: 1
vmgenid: a13c92a2-a955-485e-a80e-391e99b19fbd


@ -12,5 +12,5 @@ ostype: l26
scsi0: local-lvm:vm-115-disk-0,size=256G
scsihw: virtio-scsi-pci
smbios1: uuid=19be98ee-f60d-473d-acd2-9164717fcd11
sockets: 2
sockets: 1
vmgenid: 682dfeab-8c63-4f0b-8ed2-8828c2f808ef


@ -0,0 +1,141 @@
---
title: "VM 106 (docker-home) Right-Sizing Runbook"
description: "Runbook for right-sizing VM 106 from 16 GB/8 vCPU to 6 GB/4 vCPU — pre-checks, resize commands, and post-resize validation."
type: runbook
domain: server-configs
tags: [proxmox, infra-audit, right-sizing, docker-home]
---
# VM 106 (docker-home) Right-Sizing Runbook
## Context
Infrastructure audit (2026-04-02) found VM 106 severely overprovisioned:
| Resource | Allocated | Actual Usage | Target |
|----------|-----------|--------------|--------|
| RAM | 16 GB | 1.1–1.5 GB | 6 GB (4× headroom) |
| vCPUs | 8 (2 sockets × 4 cores) | load 0.12/core | 4 (1 socket × 4 cores) |
**Services**: Pi-hole, Nginx Proxy Manager, Portainer
## Pre-Check Results (2026-04-03)
Automated checks were run before resizing. **All clear.**
### Container memory limits
```bash
docker inspect pihole nginx-proxy-manager_app_1 portainer \
| python3 -c "import json,sys; c=json.load(sys.stdin); \
[print(x['Name'], 'MemoryLimit:', x['HostConfig']['Memory']) for x in c]"
```
Result:
```
/pihole MemoryLimit: 0
/nginx-proxy-manager_app_1 MemoryLimit: 0
/portainer MemoryLimit: 0
```
`0` = no limit: no container carries a hard memory cap, so none will hit a container-level OOM after the drop to 6 GB.
### Docker Compose memory reservations
```bash
grep -rn 'memory\|mem_limit\|memswap' /home/cal/container-data/*/docker-compose.yml
```
Result: **no matches** — no compose-level memory reservations.
### Live memory usage at audit time
```
total: 15 GiB used: 1.1 GiB free: 6.8 GiB buff/cache: 7.7 GiB
Pi-hole: 463 MiB
NPM: 367 MiB
Portainer: 12 MiB
Total containers: ~842 MiB
```
## Resize Procedure
Brief downtime: Pi-hole and NPM will be unavailable during shutdown.
Manticore runs Pi-hole 2 (10.10.0.226) for HA DNS — clients fail over automatically.
### Step 1 — Shut down the VM
```bash
ssh proxmox "qm shutdown 106 --timeout 60"
# Wait for shutdown
ssh proxmox "qm status 106" # Should show: status: stopped
```
### Step 2 — Apply new hardware config
```bash
# Reduce RAM: 16384 MB → 6144 MB
ssh proxmox "qm set 106 --memory 6144"
# Reduce vCPUs: 2 sockets × 4 cores → 1 socket × 4 cores (8 → 4 vCPUs)
ssh proxmox "qm set 106 --sockets 1 --cores 4"
# Verify
ssh proxmox "qm config 106 | grep -E 'memory|cores|sockets'"
```
Expected output:
```
cores: 4
memory: 6144
sockets: 1
```
### Step 3 — Start the VM
```bash
ssh proxmox "qm start 106"
```
Wait ~30 seconds for Docker to come up.
### Step 4 — Verify services
```bash
# Pi-hole DNS resolution
ssh pihole "docker exec pihole dig google.com @127.0.0.1 | grep -E 'SERVER|ANSWER'"
# NPM — check it's running
ssh pihole "docker ps --filter name=nginx-proxy-manager --format '{{.Status}}'"
# Portainer
ssh pihole "docker ps --filter name=portainer --format '{{.Status}}'"
# Memory usage post-resize
ssh pihole "free -h"
```
### Step 5 — Monitor for 24h
Check memory doesn't approach the 6 GB limit:
```bash
ssh pihole "free -h && docker stats --no-stream --format 'table {{.Name}}\t{{.MemUsage}}'"
```
Alert threshold: if `used` exceeds 4.5 GB (75% of 6 GB), consider increasing to 8 GB.
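That threshold check can be scripted for the 24 h watch; a sketch to run on the VM itself (the `free -m` parsing and the exact cutoff are assumptions):

```shell
# Run on VM 106 (e.g. via `ssh pihole`). Warn when used memory crosses
# 75% of the new 6144 MB allocation.
THRESHOLD_MB=4608   # 75% of 6144

used_mb=$(free -m | awk '/^Mem:/{print $3}')
if [ "$used_mb" -gt "$THRESHOLD_MB" ]; then
  echo "WARN: ${used_mb} MB used; consider raising VM 106 to 8 GB"
else
  echo "OK: ${used_mb} MB used (threshold ${THRESHOLD_MB} MB)"
fi
```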
## Rollback
If services fail to come up after resizing:
```bash
# Restore original allocation
ssh proxmox "qm set 106 --memory 16384 --sockets 2 --cores 4"
ssh proxmox "qm start 106"
```
## Related
- [Maintenance Reboot Runbook](maintenance-reboot.md) — VM 106 is Tier 2 (shut down after apps, before databases)
- Issue: cal/claude-home#19


@ -3,6 +3,7 @@ services:
tdarr:
image: ghcr.io/haveagitgat/tdarr:latest
container_name: tdarr-server
init: true
restart: unless-stopped
ports:
- "8265:8265" # Web UI
@ -23,6 +24,7 @@ services:
tdarr-node:
image: ghcr.io/haveagitgat/tdarr_node:latest
container_name: tdarr-node
init: true
restart: unless-stopped
environment:
- PUID=1000
@ -37,6 +39,8 @@ services:
- /mnt/NV2/tdarr-cache:/temp
deploy:
resources:
limits:
memory: 28g
reservations:
devices:
- driver: nvidia


@ -28,8 +28,8 @@ tags: [proxmox, upgrade, pve, backup, rollback, infrastructure]
**Production Services** (7 LXC + 7 VMs) — cleaned up 2026-02-19:
- **Critical**: Paper Dynasty/Major Domo (VM 115), Discord bots (VM 110), Gitea (LXC 225), n8n (LXC 210), Home Assistant (VM 109), Databases (VM 112), docker-home/Pi-hole 1 (VM 106)
- **Important**: Claude Discord Coordinator (LXC 301), arr-stack (LXC 221), Uptime Kuma (LXC 227), Foundry VTT (LXC 223), Memos (LXC 222)
- **Stopped/Investigate**: docker-vpn (VM 105, decommissioning), docker-home-servers (VM 116, needs investigation)
- **Removed (2026-02-19)**: 108 (ansible), 224 (openclaw), 300 (openclaw-migrated), 101/102/104/111/211 (game servers), 107 (plex), 113 (tdarr - moved to .226), 114 (duplicate arr-stack), 117 (unused), 100/103 (old templates)
- **Decommission Candidate**: docker-home-servers (VM 116) — Jellyfin-only after 2026-04-03 cleanup; watchstate removed (duplicate of manticore); see issue #31
- **Removed (2026-02-19)**: 108 (ansible), 224 (openclaw), 300 (openclaw-migrated), 101/102/104/111/211 (game servers), 107 (plex), 113 (tdarr - moved to .226), 114 (duplicate arr-stack), 117 (unused), 100/103 (old templates), 105 (docker-vpn - decommissioned 2026-04)
**Key Constraints**:
- Home Assistant VM 109 requires dual network (vmbr1 for Matter support)


@ -67,10 +67,15 @@ runcmd:
# Add cal user to docker group (will take effect after next login)
- usermod -aG docker cal
# Test Docker installation
- docker run --rm hello-world
# Mask avahi-daemon — not needed in a static-IP homelab with Pi-hole DNS,
# and has a known kernel busy-loop bug that wastes CPU
- systemctl stop avahi-daemon || true
- systemctl mask avahi-daemon
# Write configuration files
write_files:
# SSH hardening configuration


@ -0,0 +1,163 @@
---
title: "VM Decommission Runbook"
description: "Step-by-step procedure for safely decommissioning a Proxmox VM — dependency checks, destruction, and repo cleanup."
type: runbook
domain: vm-management
tags: [proxmox, decommission, infrastructure, cleanup]
---
# VM Decommission Runbook
Procedure for safely removing a stopped Proxmox VM and reclaiming its disk space. Derived from the VM 105 (docker-vpn) decommission (2026-04-02, issue #20).
## Prerequisites
- VM must already be **stopped** on Proxmox
- Services previously running on the VM must be confirmed migrated or no longer needed
- SSH access to Proxmox host (`ssh proxmox`)
## Phase 1 — Dependency Verification
Run all five checks before destroying anything; a clean result on each means it is safe to proceed.
### 1.1 Pi-hole DNS
Check both primary and secondary Pi-hole for DNS records pointing to the VM's IP:
```bash
ssh pihole "grep '<VM_IP>' /etc/pihole/custom.list || echo 'No DNS entries'"
ssh pihole "pihole -q <VM_HOSTNAME>"
```
### 1.2 Nginx Proxy Manager (NPM)
Check NPM for any proxy hosts with the VM's IP as an upstream:
- NPM UI: https://npm.manticorum.com → Proxy Hosts → search for VM IP
- Or via API: `ssh npm-pihole "curl -s http://localhost:81/api/nginx/proxy-hosts" | grep <VM_IP>`
### 1.3 Proxmox Firewall Rules
```bash
ssh proxmox "cat /etc/pve/firewall/<VMID>.fw 2>/dev/null || echo 'No firewall rules'"
```
### 1.4 Backup Existence
```bash
ssh proxmox "ls -la /var/lib/vz/dump/ | grep <VMID>"
```
### 1.5 VPN / Tunnel References
Check if any WireGuard or VPN configs on other hosts reference this VM:
```bash
ssh proxmox "grep -r '<VM_IP>' /etc/wireguard/ 2>/dev/null || echo 'No WireGuard refs'"
```
Also check SSH config and any automation scripts in the claude-home repo:
```bash
grep -r '<VM_IP>\|<VM_HOSTNAME>' ~/Development/claude-home/
```
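For repeated decommissions, the five checks can run in one pass. A hypothetical wrapper (`run_check`, the host aliases, and the placeholder values are all assumptions, not part of the runbook):

```shell
# Run each dependency check; a match (command success) means the VM is
# still referenced somewhere and destruction should wait.
VM_IP="10.10.0.105"; VM_HOSTNAME="docker-vpn"; VMID="105"   # placeholders
fail=0

run_check() {
  label=$1; shift
  if "$@" >/dev/null 2>&1; then
    echo "FOUND: $label (investigate before destroying)"; fail=1
  else
    echo "clean: $label"
  fi
}

run_check "Pi-hole DNS"    ssh pihole "grep '$VM_IP' /etc/pihole/custom.list"
run_check "NPM upstream"   sh -c "ssh npm-pihole 'curl -s http://localhost:81/api/nginx/proxy-hosts' | grep -q '$VM_IP'"
run_check "Firewall rules" ssh proxmox "test -s /etc/pve/firewall/$VMID.fw"
run_check "WireGuard refs" ssh proxmox "grep -r '$VM_IP' /etc/wireguard/"
run_check "Repo refs"      grep -rq "$VM_IP" ~/Development/claude-home/
exit "$fail"
```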
## Phase 2 — Safety Measures
### 2.1 Disable Auto-Start
Prevent the VM from starting on Proxmox reboot while you work:
```bash
ssh proxmox "qm set <VMID> --onboot 0"
```
### 2.2 Record Disk Space (Before)
```bash
ssh proxmox "lvs | grep pve"
```
Save this output for comparison after destruction.
### 2.3 Optional: Take a Final Backup
If the VM might contain anything worth preserving:
```bash
ssh proxmox "vzdump <VMID> --mode snapshot --storage home-truenas --compress zstd"
```
Skip if the VM has been stopped for a long time and all services are confirmed migrated.
## Phase 3 — Destroy
```bash
ssh proxmox "qm destroy <VMID> --purge"
```
The `--purge` flag removes the disk along with the VM config. Verify:
```bash
ssh proxmox "qm list | grep <VMID>" # Should return nothing
ssh proxmox "lvs | grep vm-<VMID>-disk" # Should return nothing
ssh proxmox "lvs | grep pve" # Compare with Phase 2.2
```
## Phase 4 — Repo Cleanup
Update these files in the `claude-home` repo:
| File | Action |
|------|--------|
| `~/.ssh/config` | Comment out Host block, add `# DECOMMISSIONED: <name> (<IP>) - <reason>` |
| `server-configs/proxmox/qemu/<VMID>.conf` | Delete the file |
| Migration results (if applicable) | Check off decommission tasks |
| `vm-management/proxmox-upgrades/proxmox-7-to-9-upgrade-plan.md` | Move from Stopped/Investigate to Decommissioned |
| `networking/examples/ssh-homelab-setup.md` | Comment out or remove entry |
| `networking/examples/server_inventory.yaml` | Comment out or remove entry |
Leave migration plans, wave results, and other planning docs as-is; they serve as historical records.
## Phase 5 — Commit and PR
Branch naming: `chore/<ISSUE_NUMBER>-decommission-<vm-name>`
Commit message format:
```
chore: decommission VM <VMID> (<name>) — reclaim <SIZE> disk (#<ISSUE>)

Closes #<ISSUE>
```
This is typically a docs-only PR (all `.md` and config files) which gets auto-approved by the `auto-merge-docs` workflow.
## Checklist Template
Copy this for each decommission:
```markdown
### VM <VMID> (<name>) Decommission
**Pre-deletion verification:**
- [ ] Pi-hole DNS — no records
- [ ] NPM upstreams — no proxy hosts
- [ ] Proxmox firewall — no rules
- [ ] Backup status — verified
- [ ] VPN/tunnel references — none
**Execution:**
- [ ] Disabled onboot
- [ ] Recorded disk space before
- [ ] Took backup (or confirmed skip)
- [ ] Destroyed VM with --purge
- [ ] Verified disk space reclaimed
**Cleanup:**
- [ ] SSH config updated
- [ ] VM config file deleted from repo
- [ ] Migration docs updated
- [ ] Upgrade plan updated
- [ ] Example files updated
- [ ] Committed, pushed, PR created
```


@ -262,7 +262,7 @@ When connecting Jellyseerr to arr apps, be careful with tag configurations - inv
- [x] Test movie/show requests through Jellyseerr
### After 48 Hours
- [ ] Decommission VM 121 (docker-vpn)
- [x] Decommission VM 121 (docker-vpn)
- [ ] Clean up local migration temp files (`/tmp/arr-config-migration/`)
---


@ -0,0 +1,33 @@
---
title: "Workstation Troubleshooting"
description: "Troubleshooting notes for Nobara/KDE Wayland workstation issues."
type: troubleshooting
domain: workstation
tags: [troubleshooting, wayland, kde]
---
# Workstation Troubleshooting
## Discord screen sharing shows no windows on KDE Wayland (2026-04-03)
**Severity:** Medium — cannot share screen via Discord desktop app
**Problem:** Clicking "Share Your Screen" in Discord desktop app (v0.0.131, Electron 37) opens the Discord picker but shows zero windows/screens. Same behavior in both the desktop app and the web app when using Discord's own picker. Affects both native Wayland and XWayland modes.
**Root Cause:** Discord's built-in screen picker uses Electron's `desktopCapturer.getSources()` which relies on X11 window enumeration. On KDE Wayland:
- In native Wayland mode: no X11 windows exist, so the picker is empty
- In forced X11/XWayland mode (`ELECTRON_OZONE_PLATFORM_HINT=x11`): Discord can only see other XWayland windows (itself, Android emulator), not native Wayland apps
- Discord ignores `--use-fake-ui-for-media-stream` and other Chromium flags that should force portal usage
- The `discord-flags.conf` file is **not read** by the Nobara/RPM Discord package — flags must go in the `.desktop` file `Exec=` line
**Fix:** Use **Discord web app in Firefox** for screen sharing. Firefox natively delegates to the XDG Desktop Portal via PipeWire, which shows the KDE screen picker with all windows. The desktop app's own picker remains broken on Wayland as of v0.0.131.
Configuration applied (for general Discord Wayland support):
- `~/.local/share/applications/discord.desktop` — overrides system `.desktop` with Wayland flags
- `~/.config/discord-flags.conf` — created but not read by this Discord build
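A sketch of the user-level `.desktop` override; the `Exec=` path and flag set are assumptions about a typical Wayland-enabled Electron launch, not confirmed against the Nobara package:

```ini
# ~/.local/share/applications/discord.desktop (shadows the system entry)
[Desktop Entry]
Name=Discord
Type=Application
Categories=Network;InstantMessaging;
Exec=/usr/bin/Discord --enable-features=WaylandWindowDecorations,WebRTCPipeWireCapturer --ozone-platform-hint=auto %U
```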
**Lesson:**
- Discord desktop on Linux Wayland cannot do screen sharing through its own picker — always use the web app in Firefox for this
- Electron's `desktopCapturer` API is fundamentally X11-only; the PipeWire/portal path requires the app to use `getDisplayMedia()` instead, which Discord's desktop app does not do
- `discord-flags.conf` is unreliable across distros — always verify flags landed in `/proc/<pid>/cmdline`
- Vesktop (community client) is an alternative that properly implements portal-based screen sharing, if the web app is insufficient
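Verifying that flags actually reached the running process can be sketched as (the `pgrep` pattern is a guess at the process name):

```shell
# /proc/<pid>/cmdline is NUL-separated; translate NULs to spaces to read it.
pid=$(pgrep -n -f Discord)   # newest matching process
tr '\0' ' ' < "/proc/$pid/cmdline"; echo
```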