Compare commits

15 Commits

Author SHA1 Message Date
Cal Corum
29a20fbe06 feat: add monthly Proxmox maintenance reboot automation (#26)
All checks were successful
Reindex Knowledge Base / reindex (push) Successful in 2s
Establishes a first-Sunday-of-the-month maintenance window orchestrated
by Ansible on LXC 304. Split into two playbooks to handle the self-reboot
paradox (the controller is a guest on the host being rebooted):

- monthly-reboot.yml: snapshots, tiered shutdown with per-guest polling,
  fire-and-forget host reboot
- post-reboot-startup.yml: controlled tiered startup with staggered delays,
  Pi-hole UDP DNS fix, validation, and snapshot cleanup

Also fixes onboot:1 on VM 109, LXC 221, LXC 223 and creates a recurring
Google Calendar event for the maintenance window.

Closes #26

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 23:33:59 -05:00
cal
fdc44acb28 Merge pull request 'chore: add --hosts test coverage and right-size VM 115 socket config' (#46) from chore/26-proxmox-monthly-maintenance-reboot into main
2026-04-04 00:35:31 +00:00
Cal Corum
48a804dda2 feat: right-size VM 115 config and add --hosts flag to audit script
All checks were successful
Auto-merge docs-only PRs / auto-merge-docs (pull_request) Successful in 2s
Reduce VM 115 (docker-sba) from 16 vCPUs (2×8) to 8 vCPUs (1×8) to
match actual workload (0.06 load/core). Add --hosts flag to
homelab-audit.sh for targeted post-change audits.

Closes #18

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 17:33:01 -05:00
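The "0.06 load/core" figure in the commit above is a load average divided by the number of vCPUs. The sketch below works through that arithmetic. It assumes the figure was measured on the original 16-vCPU configuration, which the commit message does not state, and `per_core_load` is a hypothetical helper name, not something from the repository.

```python
def per_core_load(load_avg: float, cores: int) -> float:
    """Return the load average normalised by core count."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return load_avg / cores


# Assumption: 0.06 load/core was observed with 16 vCPUs (2 sockets x 8 cores).
observed_load = 0.06 * 16

# The same workload on 8 vCPUs (1 socket x 8 cores) doubles the per-core
# figure but still leaves it far below one runnable task per core.
before = per_core_load(observed_load, 16)
after = per_core_load(observed_load, 8)
print(f"before: {before:.2f} per core, after: {after:.2f} per core")
```

Normalising by core count is what makes a threshold comparable across guests of different sizes, which is why halving the vCPUs here is low risk under the stated assumption.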
Cal Corum
64f299aa1a docs: sync KB — maintenance-reboot.md
All checks were successful
Reindex Knowledge Base / reindex (push) Successful in 2s
2026-04-03 16:00:22 -05:00
cal
a9a778f53c Merge pull request 'feat: dynamic summary, --hosts filter, and --json output (#24)' (#38) from issue/24-homelab-audit-sh-dynamic-summary-and-hosts-filter into main
2026-04-03 20:22:24 +00:00
Cal Corum
1a3785f01a feat: dynamic summary, --hosts filter, and --json output (#24)
All checks were successful
Auto-merge docs-only PRs / auto-merge-docs (pull_request) Successful in 2s
Closes #24

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 20:08:07 +00:00
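The `--hosts` filter added by this commit takes a comma-separated `label:ip` list, per the usage line in homelab-audit.sh. The script's own parsing is not shown in this diff, so the following is only a sketch of one way such a value could be validated. `parse_hosts_filter` is a hypothetical name, and the two example hosts are addresses that appear elsewhere in this changeset.

```python
def parse_hosts_filter(value: str) -> list[tuple[str, str]]:
    """Split 'label:ip,label:ip' into (label, ip) pairs.

    An empty string yields an empty list, which the caller can treat
    as "audit all hosts".
    """
    hosts = []
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue  # tolerate stray commas and surrounding whitespace
        label, sep, ip = entry.partition(":")
        if not sep or not label or not ip:
            raise ValueError(f"expected label:ip, got {entry!r}")
        hosts.append((label, ip))
    return hosts


print(parse_hosts_filter("paper-dynasty:10.10.0.88,ansible-controller:10.10.0.232"))
```

Rejecting malformed entries up front avoids silently auditing nothing when the flag value is mistyped.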
cal
938240e1f9 Merge pull request 'fix: clean up VM 116 watchstate duplicate and document decommission candidacy (#31)' (#41) from issue/31-vm-116-resolve-watchstate-duplicate-and-clean-up-r into main
All checks were successful
Reindex Knowledge Base / reindex (push) Successful in 1s
Reviewed-on: #41
2026-04-03 20:01:27 +00:00
Cal Corum
66143f6090 fix: clean up VM 116 watchstate duplicate and document decommission candidacy (#31)
All checks were successful
Auto-merge docs-only PRs / auto-merge-docs (pull_request) Successful in 2s
- Removed stopped watchstate container from VM 116 (duplicate of manticore's canonical instance)
- Pruned 5 orphan images (watchstate, freetube, pihole, hello-world): 3.36 GB reclaimed
- Confirmed manticore watchstate is healthy and syncing Jellyfin state
- VM 116 now runs only Jellyfin (also runs on manticore)
- Added VM 116 (docker-home-servers) to hosts.yml as decommission candidate
- Updated proxmox-7-to-9-upgrade-plan.md status from Stopped/Investigate to Decommission Candidate

Closes #31

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 20:01:13 +00:00
cal
13483157a9 Merge pull request 'feat: session resumption + Agent SDK evaluation' (#43) from feature/3-agent-sdk-improvements into main
All checks were successful
Reindex Knowledge Base / reindex (push) Successful in 3s
Reviewed-on: #43
2026-04-03 20:00:12 +00:00
Cal Corum
e321e7bd47 feat: add session resumption and Agent SDK evaluation
All checks were successful
Auto-merge docs-only PRs / auto-merge-docs (pull_request) Successful in 2s
- runner.sh: opt-in session persistence via session_resumable and
  resume_last_session settings; fix read_setting to normalize booleans
- issue-poller.sh: capture and log session_id from worker invocations,
  include in result JSON
- pr-reviewer-dispatcher.sh: capture and log session_id from reviews
- n8n workflow: add --append-system-prompt to initial SSH node, add
  Follow Up Diagnostics node using --resume for deeper investigation,
  update Discord Alert with remediation details
- Add Agent SDK evaluation doc (CLI vs Python/TS SDK comparison)
- Update CONTEXT.md with session resumption documentation

Closes #3

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-03 19:59:44 +00:00
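Session resumption depends on capturing `session_id` from one CLI invocation and passing it back with `--resume`. The sketch below mirrors what the n8n "Parse Follow-up Response" node in this changeset does, assuming the CLI's JSON output carries the top-level fields that node reads (`session_id`, `total_cost_usd`, `structured_output`, `result`). The function names are hypothetical, and the flags are the ones used in the workflow's SSH command.

```python
import json


def parse_cli_output(stdout: str) -> dict:
    """Extract the session id, cost and structured payload from CLI JSON output."""
    try:
        response = json.loads(stdout)
        # Prefer structured_output; fall back to a JSON string in "result".
        data = response.get("structured_output") or json.loads(response.get("result") or "{}")
    except (json.JSONDecodeError, AttributeError, TypeError) as exc:
        return {"ok": False, "error": str(exc), "session_id": None, "cost_usd": 0.0, "data": {}}
    return {
        "ok": True,
        "session_id": response.get("session_id"),
        "cost_usd": response.get("total_cost_usd") or 0.0,
        "data": data,
    }


def resume_command(session_id: str, prompt: str, binary: str = "claude") -> list[str]:
    """Build an argv list for a follow-up run that resumes an earlier session."""
    return [binary, "-p", prompt, "--resume", session_id, "--output-format", "json"]


first = parse_cli_output('{"session_id": "abc", "total_cost_usd": 0.01, '
                         '"structured_output": {"status": "issues_found"}}')
if first["ok"] and first["data"].get("status") == "issues_found":
    print(resume_command(first["session_id"], "Investigate deeper"))
```

Returning an explicit `ok` flag instead of raising keeps a parse failure from aborting the alert path, which is the same choice the workflow makes in its catch branch.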
cal
4e33e1cae3 Merge pull request 'fix: document per-core load threshold policy for health monitoring (#22)' (#42) from issue/22-tune-n8n-alert-thresholds-to-per-core-load-metrics into main
All checks were successful
Reindex Knowledge Base / reindex (push) Successful in 2s
2026-04-03 18:36:14 +00:00
Cal Corum
193ae68f96 docs: document per-core load threshold policy for server health monitoring (#22)
All checks were successful
Auto-merge docs-only PRs / auto-merge-docs (pull_request) Successful in 5s
Closes #22

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-03 13:35:23 -05:00
Cal Corum
7c9c96eb52 docs: sync KB — troubleshooting.md
All checks were successful
Reindex Knowledge Base / reindex (push) Successful in 3s
2026-04-03 12:00:22 -05:00
cal
a8c85a8d91 Merge pull request 'chore: decommission VM 105 (docker-vpn) — repo cleanup' (#40) from chore/20-decommission-vm-105-docker-vpn into main
Some checks failed
Reindex Knowledge Base / reindex (push) Failing after 17s
2026-04-03 12:56:43 +00:00
Cal Corum
9e8346a8ab chore: decommission VM 105 (docker-vpn) — repo cleanup (#20)
All checks were successful
Auto-merge docs-only PRs / auto-merge-docs (pull_request) Successful in 2s
VM 105 was already destroyed on Proxmox. This removes stale references:
- Delete server-configs/proxmox/qemu/105.conf
- Comment out docker-vpn entries in example SSH config and server inventory
- Move VM 105 from Stopped/Investigate to Removed in upgrade plan
- Check off decommission task in wave2 migration results

Closes #20

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 23:57:55 -05:00
20 changed files with 1279 additions and 48 deletions

@@ -0,0 +1,265 @@
---
# Monthly Proxmox Maintenance Reboot — Shutdown & Reboot
#
# Orchestrates a graceful shutdown of all guests in dependency order,
# then issues a fire-and-forget reboot to the Proxmox host.
#
# After the host reboots, LXC 304 auto-starts via onboot:1 and the
# post-reboot-startup.yml playbook runs automatically via the
# ansible-post-reboot.service systemd unit (enabled under multi-user.target,
# so it starts on every boot of LXC 304).
#
# Schedule: 1st Sunday of each month, 08:00 UTC (3 AM EST / 4 AM EDT)
# Controller: LXC 304 (ansible-controller) at 10.10.0.232
#
# Usage:
#   # Dry run
#   ansible-playbook /opt/ansible/playbooks/monthly-reboot.yml --check
#
#   # Full execution
#   ansible-playbook /opt/ansible/playbooks/monthly-reboot.yml
#
#   # Shutdown only (skip the host reboot)
#   ansible-playbook /opt/ansible/playbooks/monthly-reboot.yml --tags shutdown
#
# Note: VM 109 (homeassistant) is excluded from Ansible inventory
# (self-managed via HA Supervisor) but is included in pvesh start/stop.
- name: Pre-reboot health check and snapshots
  hosts: pve-node
  gather_facts: false
  tags: [pre-reboot, shutdown]
  tasks:
    - name: Check Proxmox cluster health
      ansible.builtin.command: pvesh get /cluster/status --output-format json
      register: cluster_status
      changed_when: false
      check_mode: false  # read-only query; run it under --check so the dry run has data
    - name: Get list of running QEMU VMs
      ansible.builtin.shell: >
        pvesh get /nodes/proxmox/qemu --output-format json |
        python3 -c "import sys,json; [print(vm['vmid']) for vm in json.load(sys.stdin) if vm.get('status')=='running']"
      register: running_vms
      changed_when: false
      check_mode: false
    - name: Get list of running LXC containers
      ansible.builtin.shell: >
        pvesh get /nodes/proxmox/lxc --output-format json |
        python3 -c "import sys,json; [print(ct['vmid']) for ct in json.load(sys.stdin) if ct.get('status')=='running']"
      register: running_lxcs
      changed_when: false
      check_mode: false
    - name: Display running guests
      ansible.builtin.debug:
        msg: "Running VMs: {{ running_vms.stdout_lines }} | Running LXCs: {{ running_lxcs.stdout_lines }}"
    - name: Snapshot running VMs
      ansible.builtin.command: >
        pvesh create /nodes/proxmox/qemu/{{ item }}/snapshot
        --snapname pre-maintenance-{{ lookup('pipe', 'date +%Y-%m-%d') }}
        --description "Auto snapshot before monthly maintenance reboot"
      loop: "{{ running_vms.stdout_lines }}"
      when: running_vms.stdout_lines | length > 0
      ignore_errors: true
    - name: Snapshot running LXCs
      ansible.builtin.command: >
        pvesh create /nodes/proxmox/lxc/{{ item }}/snapshot
        --snapname pre-maintenance-{{ lookup('pipe', 'date +%Y-%m-%d') }}
        --description "Auto snapshot before monthly maintenance reboot"
      loop: "{{ running_lxcs.stdout_lines }}"
      when: running_lxcs.stdout_lines | length > 0
      ignore_errors: true
- name: "Shutdown Tier 4 — Media & Others"
  hosts: pve-node
  gather_facts: false
  tags: [shutdown]
  vars:
    tier4_vms: [109]
    # LXC 303 (mcp-gateway) is onboot=0 and operator-managed — not included here
    tier4_lxcs: [221, 222, 223, 302]
  tasks:
    - name: Shutdown Tier 4 VMs
      ansible.builtin.command: pvesh create /nodes/proxmox/qemu/{{ item }}/status/shutdown
      loop: "{{ tier4_vms }}"
      ignore_errors: true
    - name: Shutdown Tier 4 LXCs
      ansible.builtin.command: pvesh create /nodes/proxmox/lxc/{{ item }}/status/shutdown
      loop: "{{ tier4_lxcs }}"
      ignore_errors: true
    - name: Wait for Tier 4 VMs to stop
      ansible.builtin.shell: >
        pvesh get /nodes/proxmox/qemu/{{ item }}/status/current --output-format json |
        python3 -c "import sys,json; print(json.load(sys.stdin).get('status','unknown'))"
      register: t4_vm_status
      until: t4_vm_status.stdout.strip() == "stopped"
      retries: 12
      delay: 5
      loop: "{{ tier4_vms }}"
      ignore_errors: true
    - name: Wait for Tier 4 LXCs to stop
      ansible.builtin.shell: >
        pvesh get /nodes/proxmox/lxc/{{ item }}/status/current --output-format json |
        python3 -c "import sys,json; print(json.load(sys.stdin).get('status','unknown'))"
      register: t4_lxc_status
      until: t4_lxc_status.stdout.strip() == "stopped"
      retries: 12
      delay: 5
      loop: "{{ tier4_lxcs }}"
      ignore_errors: true
- name: "Shutdown Tier 3 — Applications"
  hosts: pve-node
  gather_facts: false
  tags: [shutdown]
  vars:
    tier3_vms: [115, 110]
    tier3_lxcs: [301]
  tasks:
    - name: Shutdown Tier 3 VMs
      ansible.builtin.command: pvesh create /nodes/proxmox/qemu/{{ item }}/status/shutdown
      loop: "{{ tier3_vms }}"
      ignore_errors: true
    - name: Shutdown Tier 3 LXCs
      ansible.builtin.command: pvesh create /nodes/proxmox/lxc/{{ item }}/status/shutdown
      loop: "{{ tier3_lxcs }}"
      ignore_errors: true
    - name: Wait for Tier 3 VMs to stop
      ansible.builtin.shell: >
        pvesh get /nodes/proxmox/qemu/{{ item }}/status/current --output-format json |
        python3 -c "import sys,json; print(json.load(sys.stdin).get('status','unknown'))"
      register: t3_vm_status
      until: t3_vm_status.stdout.strip() == "stopped"
      retries: 12
      delay: 5
      loop: "{{ tier3_vms }}"
      ignore_errors: true
    - name: Wait for Tier 3 LXCs to stop
      ansible.builtin.shell: >
        pvesh get /nodes/proxmox/lxc/{{ item }}/status/current --output-format json |
        python3 -c "import sys,json; print(json.load(sys.stdin).get('status','unknown'))"
      register: t3_lxc_status
      until: t3_lxc_status.stdout.strip() == "stopped"
      retries: 12
      delay: 5
      loop: "{{ tier3_lxcs }}"
      ignore_errors: true
- name: "Shutdown Tier 2 — Infrastructure"
  hosts: pve-node
  gather_facts: false
  tags: [shutdown]
  vars:
    tier2_vms: [106, 116]
    tier2_lxcs: [225, 210, 227]
  tasks:
    - name: Shutdown Tier 2 VMs
      ansible.builtin.command: pvesh create /nodes/proxmox/qemu/{{ item }}/status/shutdown
      loop: "{{ tier2_vms }}"
      ignore_errors: true
    - name: Shutdown Tier 2 LXCs
      ansible.builtin.command: pvesh create /nodes/proxmox/lxc/{{ item }}/status/shutdown
      loop: "{{ tier2_lxcs }}"
      ignore_errors: true
    - name: Wait for Tier 2 VMs to stop
      ansible.builtin.shell: >
        pvesh get /nodes/proxmox/qemu/{{ item }}/status/current --output-format json |
        python3 -c "import sys,json; print(json.load(sys.stdin).get('status','unknown'))"
      register: t2_vm_status
      until: t2_vm_status.stdout.strip() == "stopped"
      retries: 12
      delay: 5
      loop: "{{ tier2_vms }}"
      ignore_errors: true
    - name: Wait for Tier 2 LXCs to stop
      ansible.builtin.shell: >
        pvesh get /nodes/proxmox/lxc/{{ item }}/status/current --output-format json |
        python3 -c "import sys,json; print(json.load(sys.stdin).get('status','unknown'))"
      register: t2_lxc_status
      until: t2_lxc_status.stdout.strip() == "stopped"
      retries: 12
      delay: 5
      loop: "{{ tier2_lxcs }}"
      ignore_errors: true
- name: "Shutdown Tier 1 — Databases"
  hosts: pve-node
  gather_facts: false
  tags: [shutdown]
  vars:
    tier1_vms: [112]
  tasks:
    - name: Shutdown database VMs
      ansible.builtin.command: pvesh create /nodes/proxmox/qemu/{{ item }}/status/shutdown
      loop: "{{ tier1_vms }}"
      ignore_errors: true
    - name: Wait for database VMs to stop (up to 90s)
      ansible.builtin.shell: >
        pvesh get /nodes/proxmox/qemu/{{ item }}/status/current --output-format json |
        python3 -c "import sys,json; print(json.load(sys.stdin).get('status','unknown'))"
      register: t1_vm_status
      until: t1_vm_status.stdout.strip() == "stopped"
      retries: 18
      delay: 5
      loop: "{{ tier1_vms }}"
      ignore_errors: true
    - name: Force stop database VMs if still running
      ansible.builtin.shell: |
        status=$(pvesh get /nodes/proxmox/qemu/{{ item }}/status/current --output-format json |
          python3 -c "import sys,json; print(json.load(sys.stdin).get('status','unknown'))")
        if [ "$status" = "running" ]; then
          pvesh create /nodes/proxmox/qemu/{{ item }}/status/stop
          echo "Force stopped VM {{ item }}"
        else
          echo "VM {{ item }} already stopped"
        fi
      loop: "{{ tier1_vms }}"
      register: force_stop_result
      # Inside a loop, changed_when sees the current item's result, not the aggregated .results list
      changed_when: "'Force stopped' in force_stop_result.stdout"
- name: "Verify and reboot Proxmox host"
  hosts: pve-node
  gather_facts: false
  tags: [reboot]
  tasks:
    - name: Verify all guests are stopped (excluding LXC 304)
      ansible.builtin.shell: |
        running_vms=$(pvesh get /nodes/proxmox/qemu --output-format json |
          python3 -c "import sys,json; vms=[v for v in json.load(sys.stdin) if v.get('status')=='running']; print(len(vms))")
        running_lxcs=$(pvesh get /nodes/proxmox/lxc --output-format json |
          python3 -c "import sys,json; cts=[c for c in json.load(sys.stdin) if c.get('status')=='running' and int(c['vmid']) != 304]; print(len(cts))")
        echo "Running VMs: $running_vms, Running LXCs: $running_lxcs"
        if [ "$running_vms" != "0" ] || [ "$running_lxcs" != "0" ]; then exit 1; fi
      register: verify_stopped
      changed_when: false
    - name: Issue fire-and-forget reboot (controller will be killed)
      # The shell module runs /bin/sh, so use the POSIX redirect form instead of bash's &>
      ansible.builtin.shell: >
        nohup bash -c 'sleep 10 && reboot' >/dev/null 2>&1 &
        echo "Reboot scheduled in 10 seconds"
      register: reboot_issued
      when: not ansible_check_mode
    - name: Log reboot issued
      ansible.builtin.debug:
        msg: "{{ reboot_issued.stdout | default('Reboot not issued (check mode)') }}. When the reboot fires, the Ansible process terminates with the host. Post-reboot startup is handled by ansible-post-reboot.service on LXC 304."
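The playbook above stops guests from Tier 4 down to Tier 1, and the startup playbook brings them back in the opposite direction. The sketch below shows that ordering as data. The tier membership and the `proxmox` node name are copied from the playbook vars; the function names are hypothetical and the output is only the `pvesh` command strings, nothing is executed.

```python
# Tier membership copied from the playbook vars; LXC 304 (the controller)
# and LXC 303 (operator-managed) are deliberately absent.
TIERS = {
    1: {"qemu": [112], "lxc": []},
    2: {"qemu": [106, 116], "lxc": [225, 210, 227]},
    3: {"qemu": [115, 110], "lxc": [301]},
    4: {"qemu": [109], "lxc": [221, 222, 223, 302]},
}


def shutdown_order(tiers):
    """Highest tier first, so dependents stop before what they depend on."""
    return sorted(tiers, reverse=True)


def startup_order(tiers):
    """Lowest tier first: databases, then infrastructure, applications, media."""
    return sorted(tiers)


def pvesh_commands(tiers, action, node="proxmox"):
    """Yield one pvesh command string per guest, tier by tier."""
    order = shutdown_order(tiers) if action == "shutdown" else startup_order(tiers)
    for tier in order:
        for kind in ("qemu", "lxc"):
            for vmid in tiers[tier][kind]:
                yield f"pvesh create /nodes/{node}/{kind}/{vmid}/status/{action}"


for command in pvesh_commands(TIERS, "shutdown"):
    print(command)
```

Keeping the tiers in one structure would make the shutdown and startup orders mirror each other by construction, whereas the playbooks repeat the lists per play.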

@@ -0,0 +1,214 @@
---
# Post-Reboot Startup — Controlled Guest Startup After Proxmox Reboot
#
# Starts all guests in dependency order with staggered delays to avoid
# I/O storms. Runs automatically via ansible-post-reboot.service on
# LXC 304 after the Proxmox host reboots.
#
# Can also be run manually:
#   ansible-playbook /opt/ansible/playbooks/post-reboot-startup.yml
#
# Note: VM 109 (homeassistant) is excluded from Ansible inventory
# (self-managed via HA Supervisor) but is included in pvesh start/stop.
- name: Wait for Proxmox API to be ready
  hosts: pve-node
  gather_facts: false
  tags: [startup]
  tasks:
    - name: Wait for Proxmox API
      ansible.builtin.command: pvesh get /version --output-format json
      register: pve_version
      until: pve_version.rc == 0
      retries: 30
      delay: 10
      changed_when: false
    - name: Display Proxmox version
      ansible.builtin.debug:
        # Plain attribute access avoids json_query, which needs community.general and jmespath
        msg: "Proxmox API ready: {{ (pve_version.stdout | from_json).version | default('unknown') }}"
- name: "Startup Tier 1 — Databases"
  hosts: pve-node
  gather_facts: false
  tags: [startup]
  tasks:
    - name: Start database VM (112)
      ansible.builtin.command: pvesh create /nodes/proxmox/qemu/112/status/start
      ignore_errors: true
    - name: Wait for VM 112 to be running
      ansible.builtin.shell: >
        pvesh get /nodes/proxmox/qemu/112/status/current --output-format json |
        python3 -c "import sys,json; print(json.load(sys.stdin).get('status','unknown'))"
      register: db_status
      until: db_status.stdout.strip() == "running"
      retries: 12
      delay: 5
      changed_when: false
    - name: Wait for database services to initialize
      ansible.builtin.pause:
        seconds: 30
- name: "Startup Tier 2 — Infrastructure"
  hosts: pve-node
  gather_facts: false
  tags: [startup]
  vars:
    tier2_vms: [106, 116]
    tier2_lxcs: [225, 210, 227]
  tasks:
    - name: Start Tier 2 VMs
      ansible.builtin.command: pvesh create /nodes/proxmox/qemu/{{ item }}/status/start
      loop: "{{ tier2_vms }}"
      ignore_errors: true
    - name: Start Tier 2 LXCs
      ansible.builtin.command: pvesh create /nodes/proxmox/lxc/{{ item }}/status/start
      loop: "{{ tier2_lxcs }}"
      ignore_errors: true
    - name: Wait for infrastructure to come up
      ansible.builtin.pause:
        seconds: 30
- name: "Startup Tier 3 — Applications"
  hosts: pve-node
  gather_facts: false
  tags: [startup]
  vars:
    tier3_vms: [115, 110]
    tier3_lxcs: [301]
  tasks:
    - name: Start Tier 3 VMs
      ansible.builtin.command: pvesh create /nodes/proxmox/qemu/{{ item }}/status/start
      loop: "{{ tier3_vms }}"
      ignore_errors: true
    - name: Start Tier 3 LXCs
      ansible.builtin.command: pvesh create /nodes/proxmox/lxc/{{ item }}/status/start
      loop: "{{ tier3_lxcs }}"
      ignore_errors: true
    - name: Wait for applications to start
      ansible.builtin.pause:
        seconds: 30
    - name: Restart Pi-hole container via SSH (UDP DNS fix)
      ansible.builtin.command: ssh docker-home "docker restart pihole"
      ignore_errors: true
    - name: Wait for Pi-hole to stabilize
      ansible.builtin.pause:
        seconds: 10
- name: "Startup Tier 4 — Media & Others"
  hosts: pve-node
  gather_facts: false
  tags: [startup]
  vars:
    tier4_vms: [109]
    tier4_lxcs: [221, 222, 223, 302]
  tasks:
    - name: Start Tier 4 VMs
      ansible.builtin.command: pvesh create /nodes/proxmox/qemu/{{ item }}/status/start
      loop: "{{ tier4_vms }}"
      ignore_errors: true
    - name: Start Tier 4 LXCs
      ansible.builtin.command: pvesh create /nodes/proxmox/lxc/{{ item }}/status/start
      loop: "{{ tier4_lxcs }}"
      ignore_errors: true
- name: Post-reboot validation
  hosts: pve-node
  gather_facts: false
  tags: [startup, validate]
  tasks:
    - name: Wait for all services to initialize
      ansible.builtin.pause:
        seconds: 60
    - name: Check all expected VMs are running
      # Literal block (|) keeps the newlines that the multi-line python3 -c script needs;
      # a folded block (>) would join equally indented lines into one.
      ansible.builtin.shell: |
        pvesh get /nodes/proxmox/qemu --output-format json |
        python3 -c "
        import sys, json
        vms = json.load(sys.stdin)
        expected = {106, 109, 110, 112, 115, 116}
        running = {int(v['vmid']) for v in vms if v.get('status') == 'running'}
        missing = expected - running
        if missing:
            print(f'WARN: VMs not running: {missing}')
            sys.exit(1)
        print(f'All expected VMs running: {running & expected}')
        "
      register: vm_check
      changed_when: false
      ignore_errors: true
    - name: Check all expected LXCs are running
      ansible.builtin.shell: |
        pvesh get /nodes/proxmox/lxc --output-format json |
        python3 -c "
        import sys, json
        cts = json.load(sys.stdin)
        # LXC 303 (mcp-gateway) intentionally excluded — onboot=0, operator-managed
        expected = {210, 221, 222, 223, 225, 227, 301, 302, 304}
        running = {int(c['vmid']) for c in cts if c.get('status') == 'running'}
        missing = expected - running
        if missing:
            print(f'WARN: LXCs not running: {missing}')
            sys.exit(1)
        print(f'All expected LXCs running: {running & expected}')
        "
      register: lxc_check
      changed_when: false
      ignore_errors: true
    - name: Clean up old maintenance snapshots (older than 7 days)
      ansible.builtin.shell: |
        cutoff=$(date -d '7 days ago' +%s);
        for vmid in $(pvesh get /nodes/proxmox/qemu --output-format json |
            python3 -c "import sys,json; [print(v['vmid']) for v in json.load(sys.stdin)]"); do
          for snap in $(pvesh get /nodes/proxmox/qemu/$vmid/snapshot --output-format json |
              python3 -c "import sys,json; [print(s['name']) for s in json.load(sys.stdin) if s['name'].startswith('pre-maintenance-')]" 2>/dev/null); do
            snap_date=$(echo $snap | sed 's/pre-maintenance-//');
            snap_epoch=$(date -d "$snap_date" +%s 2>/dev/null);
            if [ -z "$snap_epoch" ]; then
              echo "WARN: could not parse date for snapshot $snap on VM $vmid";
            elif [ "$snap_epoch" -lt "$cutoff" ]; then
              pvesh delete /nodes/proxmox/qemu/$vmid/snapshot/$snap && echo "Deleted $snap from VM $vmid";
            fi
          done
        done;
        for ctid in $(pvesh get /nodes/proxmox/lxc --output-format json |
            python3 -c "import sys,json; [print(c['vmid']) for c in json.load(sys.stdin)]"); do
          for snap in $(pvesh get /nodes/proxmox/lxc/$ctid/snapshot --output-format json |
              python3 -c "import sys,json; [print(s['name']) for s in json.load(sys.stdin) if s['name'].startswith('pre-maintenance-')]" 2>/dev/null); do
            snap_date=$(echo $snap | sed 's/pre-maintenance-//');
            snap_epoch=$(date -d "$snap_date" +%s 2>/dev/null);
            if [ -z "$snap_epoch" ]; then
              echo "WARN: could not parse date for snapshot $snap on LXC $ctid";
            elif [ "$snap_epoch" -lt "$cutoff" ]; then
              pvesh delete /nodes/proxmox/lxc/$ctid/snapshot/$snap && echo "Deleted $snap from LXC $ctid";
            fi
          done
        done;
        echo "Snapshot cleanup complete"
      ignore_errors: true
    - name: Display validation results
      ansible.builtin.debug:
        msg:
          - "VM status: {{ vm_check.stdout }}"
          - "LXC status: {{ lxc_check.stdout }}"
          - "Maintenance reboot complete — post-reboot startup finished"
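The cleanup task above decides whether to delete a snapshot by parsing the date out of its `pre-maintenance-YYYY-MM-DD` name and comparing it with a cutoff seven days in the past. A minimal sketch of that decision, separated from the `pvesh` calls, might look like the following. `is_expired` is a hypothetical helper, and the comparison treats the snapshot date as midnight, which is also what `date -d` yields for a date-only string.

```python
from datetime import date, datetime, time, timedelta

PREFIX = "pre-maintenance-"


def is_expired(snapshot_name: str, now: datetime, max_age_days: int = 7) -> bool:
    """Return True only for maintenance snapshots older than the cutoff.

    Names without the prefix, or with an unparseable date, are never
    treated as expired, so unrelated snapshots are left alone.
    """
    if not snapshot_name.startswith(PREFIX):
        return False
    try:
        snap_day = date.fromisoformat(snapshot_name[len(PREFIX):])
    except ValueError:
        return False
    cutoff = now - timedelta(days=max_age_days)
    return datetime.combine(snap_day, time.min) < cutoff


now = datetime(2026, 5, 3, 8, 0)
for name in ("pre-maintenance-2026-04-05", "pre-maintenance-2026-05-03", "before-upgrade"):
    print(name, is_expired(name, now))
```

Because the playbook runs cleanup right after the reboot, the snapshot taken that morning survives and the one from the previous month is removed.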

@@ -0,0 +1,15 @@
[Unit]
Description=Monthly Proxmox maintenance reboot (Ansible)
After=network-online.target
Wants=network-online.target
[Service]
Type=oneshot
User=cal
WorkingDirectory=/opt/ansible
ExecStart=/usr/bin/ansible-playbook /opt/ansible/playbooks/monthly-reboot.yml
StandardOutput=append:/opt/ansible/logs/monthly-reboot.log
StandardError=append:/opt/ansible/logs/monthly-reboot.log
TimeoutStartSec=900
# No [Install] section — this service is activated exclusively by ansible-monthly-reboot.timer

@@ -0,0 +1,13 @@
[Unit]
Description=Monthly Proxmox maintenance reboot timer
Documentation=https://git.manticorum.com/cal/claude-home/src/branch/main/server-configs/proxmox/maintenance-reboot.md
[Timer]
# First Sunday of the month at 08:00 UTC (03:00 EST / 04:00 EDT)
# Day range 01-07 ensures it's always the first occurrence of that weekday
# The explicit UTC suffix keeps the schedule independent of the container's local timezone
OnCalendar=Sun *-*-01..07 08:00:00 UTC
Persistent=true
RandomizedDelaySec=600
[Install]
WantedBy=timers.target
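The timer's `Sun *-*-01..07` expression relies on the fact that any seven consecutive days contain exactly one Sunday, so the Sunday that falls on days 1 to 7 is always the first of the month. The sketch below checks that reasoning and shows what 08:00 UTC corresponds to at the two fixed offsets US Eastern time uses. It uses fixed UTC offsets rather than a timezone database, and `first_sunday` and `local_hour` are hypothetical helper names.

```python
from datetime import date, datetime, timedelta, timezone


def first_sunday(year: int, month: int) -> date:
    """Return the only Sunday that falls on days 1 through 7 of the month."""
    for day in range(1, 8):
        candidate = date(year, month, day)
        if candidate.weekday() == 6:  # Monday is 0, Sunday is 6
            return candidate
    raise AssertionError("unreachable: seven consecutive days always include a Sunday")


def local_hour(utc_dt: datetime, offset_hours: int) -> int:
    """Hour of day for a UTC timestamp at a fixed UTC offset."""
    return utc_dt.astimezone(timezone(timedelta(hours=offset_hours))).hour


window = datetime.combine(first_sunday(2026, 4), datetime.min.time(), tzinfo=timezone.utc)
window = window.replace(hour=8)
print(window.date(), "EDT hour:", local_hour(window, -4), "EST hour:", local_hour(window, -5))
```

Note that `RandomizedDelaySec=600` in the unit can push the actual start up to ten minutes past the computed time.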

@@ -0,0 +1,21 @@
[Unit]
Description=Post-reboot controlled guest startup (Ansible)
After=network-online.target
Wants=network-online.target
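[Service]
# Only run shortly after a fresh boot (container uptime under 600s), not on a later manual start.
# systemd has no uptime-based Condition= directive, so the check is done with ExecCondition.
ExecCondition=/bin/sh -c 'test "$$(cut -d. -f1 /proc/uptime)" -lt 600'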
Type=oneshot
User=cal
WorkingDirectory=/opt/ansible
# Delay 120s to let Proxmox API stabilize and onboot guests settle
ExecStartPre=/bin/sleep 120
ExecStart=/usr/bin/ansible-playbook /opt/ansible/playbooks/post-reboot-startup.yml
StandardOutput=append:/opt/ansible/logs/post-reboot-startup.log
StandardError=append:/opt/ansible/logs/post-reboot-startup.log
TimeoutStartSec=1800
[Install]
# Runs automatically on every boot of LXC 304
WantedBy=multi-user.target

@@ -21,7 +21,7 @@
{
"parameters": {
"operation": "executeCommand",
"command": "/root/.local/bin/claude -p \"Run python3 ~/.claude/skills/server-diagnostics/client.py health paper-dynasty and analyze the results. If any containers are not running or there are critical issues, summarize them. Otherwise just say 'All systems healthy'.\" --output-format json --json-schema '{\"type\":\"object\",\"properties\":{\"status\":{\"type\":\"string\",\"enum\":[\"healthy\",\"issues_found\"]},\"summary\":{\"type\":\"string\"},\"root_cause\":{\"type\":\"string\"},\"severity\":{\"type\":\"string\",\"enum\":[\"low\",\"medium\",\"high\",\"critical\"]},\"affected_services\":{\"type\":\"array\",\"items\":{\"type\":\"string\"}},\"actions_taken\":{\"type\":\"array\",\"items\":{\"type\":\"string\"}}},\"required\":[\"status\",\"severity\",\"summary\"]}' --allowedTools \"Read,Grep,Glob,Bash(python3 ~/.claude/skills/server-diagnostics/client.py *)\"",
"command": "/root/.local/bin/claude -p \"Run python3 ~/.claude/skills/server-diagnostics/client.py health paper-dynasty and analyze the results. If any containers are not running or there are critical issues, summarize them. Otherwise just say 'All systems healthy'.\" --output-format json --json-schema '{\"type\":\"object\",\"properties\":{\"status\":{\"type\":\"string\",\"enum\":[\"healthy\",\"issues_found\"]},\"summary\":{\"type\":\"string\"},\"root_cause\":{\"type\":\"string\"},\"severity\":{\"type\":\"string\",\"enum\":[\"low\",\"medium\",\"high\",\"critical\"]},\"affected_services\":{\"type\":\"array\",\"items\":{\"type\":\"string\"}},\"actions_taken\":{\"type\":\"array\",\"items\":{\"type\":\"string\"}}},\"required\":[\"status\",\"severity\",\"summary\"]}' --allowedTools \"Read,Grep,Glob,Bash(python3 ~/.claude/skills/server-diagnostics/client.py *)\" --append-system-prompt \"You are a server diagnostics agent. Use the server-diagnostics skill client.py for all operations. Never run destructive commands.\"",
"options": {}
},
"id": "ssh-claude-code",
@@ -75,20 +75,48 @@
"typeVersion": 2,
"position": [660, 0]
},
{
"parameters": {
"operation": "executeCommand",
"command": "=/root/.local/bin/claude -p \"The previous health check found issues. Investigate deeper: check container logs, resource usage, and recent events. Provide a detailed root cause analysis and recommended remediation steps.\" --resume \"{{ $json.session_id }}\" --output-format json --json-schema '{\"type\":\"object\",\"properties\":{\"root_cause_detail\":{\"type\":\"string\"},\"container_logs\":{\"type\":\"string\"},\"resource_status\":{\"type\":\"string\"},\"remediation_steps\":{\"type\":\"array\",\"items\":{\"type\":\"string\"}},\"requires_human\":{\"type\":\"boolean\"}},\"required\":[\"root_cause_detail\",\"remediation_steps\",\"requires_human\"]}' --allowedTools \"Read,Grep,Glob,Bash(python3 ~/.claude/skills/server-diagnostics/client.py *)\" --max-turns 15 --append-system-prompt \"You are a server diagnostics agent performing a follow-up investigation. The initial health check found issues. Dig deeper into logs and metrics. Never run destructive commands.\"",
"options": {}
},
"id": "ssh-followup",
"name": "Follow Up Diagnostics",
"type": "n8n-nodes-base.ssh",
"typeVersion": 1,
"position": [880, -200],
"credentials": {
"sshPassword": {
"id": "REPLACE_WITH_CREDENTIAL_ID",
"name": "Claude Code LXC"
}
}
},
{
"parameters": {
"jsCode": "// Parse follow-up diagnostics response\nconst stdout = $input.first().json.stdout || '';\nconst initial = $('Parse Claude Response').first().json;\n\ntry {\n const response = JSON.parse(stdout);\n const data = response.structured_output || JSON.parse(response.result || '{}');\n \n return [{\n json: {\n ...initial,\n followup: {\n root_cause_detail: data.root_cause_detail || 'No detail available',\n container_logs: data.container_logs || '',\n resource_status: data.resource_status || '',\n remediation_steps: data.remediation_steps || [],\n requires_human: data.requires_human || false,\n cost_usd: response.total_cost_usd,\n session_id: response.session_id\n },\n total_cost_usd: (initial.cost_usd || 0) + (response.total_cost_usd || 0)\n }\n }];\n} catch (e) {\n return [{\n json: {\n ...initial,\n followup: {\n error: e.message,\n root_cause_detail: 'Follow-up parse failed',\n remediation_steps: [],\n requires_human: true\n },\n total_cost_usd: initial.cost_usd || 0\n }\n }];\n}"
},
"id": "parse-followup",
"name": "Parse Follow-up Response",
"type": "n8n-nodes-base.code",
"typeVersion": 2,
"position": [1100, -200]
},
{
"parameters": {
"method": "POST",
"url": "https://discord.com/api/webhooks/1451783909409816763/O9PMDiNt6ZIWRf8HKocIZ_E4vMGV_lEwq50aAiZ9HVFR2UGwO6J1N9_wOm82p0MetIqT",
"sendBody": true,
"specifyBody": "json",
"jsonBody": "={\n \"embeds\": [{\n \"title\": \"{{ $json.severity === 'critical' ? '🔴' : $json.severity === 'high' ? '🟠' : '🟡' }} Server Alert\",\n \"description\": {{ JSON.stringify($json.summary) }},\n \"color\": {{ $json.severity === 'critical' ? 15158332 : $json.severity === 'high' ? 15105570 : 16776960 }},\n \"fields\": [\n {\n \"name\": \"Severity\",\n \"value\": \"{{ $json.severity.toUpperCase() }}\",\n \"inline\": true\n },\n {\n \"name\": \"Server\",\n \"value\": \"paper-dynasty (10.10.0.88)\",\n \"inline\": true\n },\n {\n \"name\": \"Cost\",\n \"value\": \"${{ $json.cost_usd ? $json.cost_usd.toFixed(4) : '0.0000' }}\",\n \"inline\": true\n },\n {\n \"name\": \"Root Cause\",\n \"value\": \"{{ $json.root_cause || 'N/A' }}\",\n \"inline\": false\n },\n {\n \"name\": \"Affected Services\",\n \"value\": \"{{ $json.affected_services.length ? $json.affected_services.join(', ') : 'None' }}\",\n \"inline\": false\n },\n {\n \"name\": \"Actions Taken\",\n \"value\": \"{{ $json.actions_taken.length ? $json.actions_taken.join('\\n') : 'None' }}\",\n \"inline\": false\n }\n ],\n \"timestamp\": \"{{ new Date().toISOString() }}\"\n }]\n}",
"jsonBody": "={\n \"embeds\": [{\n \"title\": \"{{ $json.severity === 'critical' ? '🔴' : $json.severity === 'high' ? '🟠' : '🟡' }} Server Alert\",\n \"description\": {{ JSON.stringify($json.summary) }},\n \"color\": {{ $json.severity === 'critical' ? 15158332 : $json.severity === 'high' ? 15105570 : 16776960 }},\n \"fields\": [\n {\n \"name\": \"Severity\",\n \"value\": \"{{ $json.severity.toUpperCase() }}\",\n \"inline\": true\n },\n {\n \"name\": \"Server\",\n \"value\": \"paper-dynasty (10.10.0.88)\",\n \"inline\": true\n },\n {\n \"name\": \"Cost\",\n \"value\": \"${{ $json.total_cost_usd ? $json.total_cost_usd.toFixed(4) : '0.0000' }}\",\n \"inline\": true\n },\n {\n \"name\": \"Root Cause\",\n \"value\": {{ JSON.stringify(($json.followup && $json.followup.root_cause_detail) || $json.root_cause || 'N/A') }},\n \"inline\": false\n },\n {\n \"name\": \"Affected Services\",\n \"value\": \"{{ $json.affected_services.length ? $json.affected_services.join(', ') : 'None' }}\",\n \"inline\": false\n },\n {\n \"name\": \"Remediation Steps\",\n \"value\": {{ JSON.stringify(($json.followup && $json.followup.remediation_steps.length) ? $json.followup.remediation_steps.map((s, i) => (i+1) + '. ' + s).join('\\n') : ($json.actions_taken.length ? $json.actions_taken.join('\\n') : 'None')) }},\n \"inline\": false\n },\n {\n \"name\": \"Requires Human?\",\n \"value\": \"{{ ($json.followup && $json.followup.requires_human) ? '⚠️ Yes' : '✅ No' }}\",\n \"inline\": true\n }\n ],\n \"timestamp\": \"{{ new Date().toISOString() }}\"\n }]\n}",
"options": {}
},
"id": "discord-alert",
"name": "Discord Alert",
"type": "n8n-nodes-base.httpRequest",
"typeVersion": 4.2,
"position": [880, -100]
"position": [1320, -200]
},
{
"parameters": {
@@ -145,7 +173,7 @@
"main": [
[
{
"node": "Discord Alert",
"node": "Follow Up Diagnostics",
"type": "main",
"index": 0
}
@@ -158,6 +186,28 @@
}
]
]
},
"Follow Up Diagnostics": {
"main": [
[
{
"node": "Parse Follow-up Response",
"type": "main",
"index": 0
}
]
]
},
"Parse Follow-up Response": {
"main": [
[
{
"node": "Discord Alert",
"type": "main",
"index": 0
}
]
]
}
},
"settings": {

@@ -5,7 +5,7 @@
# to collect system metrics, then generates a summary report.
#
# Usage:
# homelab-audit.sh [--output-dir DIR]
# homelab-audit.sh [--output-dir DIR] [--hosts label:ip,label:ip,...]
#
# Environment overrides:
# STUCK_PROC_CPU_WARN CPU% at which a D-state process is flagged (default: 10)
@@ -29,6 +29,8 @@ LOAD_WARN=2.0
MEM_WARN=85
ZOMBIE_WARN=1
SWAP_WARN=512
HOSTS_FILTER="" # comma-separated host list from --hosts; empty = audit all
JSON_OUTPUT=0 # set to 1 by --json
while [[ $# -gt 0 ]]; do
case "$1" in
@@ -40,6 +42,18 @@ while [[ $# -gt 0 ]]; do
    REPORT_DIR="$2"
    shift 2
    ;;
  --hosts)
    if [[ $# -lt 2 ]]; then
      echo "Error: --hosts requires an argument" >&2
      exit 1
    fi
    HOSTS_FILTER="$2"
    shift 2
    ;;
  --json)
    JSON_OUTPUT=1
    shift
    ;;
  *)
    echo "Unknown option: $1" >&2
    exit 1
@@ -50,6 +64,7 @@ done
mkdir -p "$REPORT_DIR"
SSH_FAILURES_LOG="$REPORT_DIR/ssh-failures.log"
FINDINGS_FILE="$REPORT_DIR/findings.txt"
AUDITED_HOSTS=() # populated in main; used by generate_summary for per-host counts
# ---------------------------------------------------------------------------
# Remote collector script
@ -281,6 +296,18 @@ generate_summary() {
printf " Critical : %d\n" "$crit_count"
echo "=============================="
if [[ ${#AUDITED_HOSTS[@]} -gt 0 ]] && ((warn_count + crit_count > 0)); then
echo ""
printf " %-30s %8s %8s\n" "Host" "Warnings" "Critical"
printf " %-30s %8s %8s\n" "----" "--------" "--------"
for host in "${AUDITED_HOSTS[@]}"; do
local hw hc
hw=$(grep -c "^WARN ${host}:" "$FINDINGS_FILE" 2>/dev/null || true)
hc=$(grep -c "^CRIT ${host}:" "$FINDINGS_FILE" 2>/dev/null || true)
((hw + hc > 0)) && printf " %-30s %8d %8d\n" "$host" "$hw" "$hc"
done
fi
if ((warn_count + crit_count > 0)); then
echo ""
echo "Findings:"
@ -293,6 +320,9 @@ generate_summary() {
grep '^SSH_FAILURE' "$SSH_FAILURES_LOG" | awk '{print " " $2 " (" $3 ")"}'
fi
echo ""
printf "Total: %d warning(s), %d critical across %d host(s)\n" \
"$warn_count" "$crit_count" "$host_count"
echo ""
echo "Reports: $REPORT_DIR"
}
@ -383,6 +413,69 @@ check_cert_expiry() {
done
}
# ---------------------------------------------------------------------------
# JSON report — writes findings.json to $REPORT_DIR when --json is used
# ---------------------------------------------------------------------------
write_json_report() {
local host_count="$1"
local json_file="$REPORT_DIR/findings.json"
local ssh_failure_count=0
local warn_count=0
local crit_count=0
[[ -f "$SSH_FAILURES_LOG" ]] &&
ssh_failure_count=$(grep -c '^SSH_FAILURE' "$SSH_FAILURES_LOG" 2>/dev/null || true)
[[ -f "$FINDINGS_FILE" ]] &&
warn_count=$(grep -c '^WARN' "$FINDINGS_FILE" 2>/dev/null || true)
[[ -f "$FINDINGS_FILE" ]] &&
crit_count=$(grep -c '^CRIT' "$FINDINGS_FILE" 2>/dev/null || true)
python3 - "$json_file" "$host_count" "$ssh_failure_count" \
"$warn_count" "$crit_count" "$FINDINGS_FILE" <<'PYEOF'
import sys, json, datetime
json_file = sys.argv[1]
host_count = int(sys.argv[2])
ssh_failure_count = int(sys.argv[3])
warn_count = int(sys.argv[4])
crit_count = int(sys.argv[5])
findings_file = sys.argv[6]
findings = []
try:
    with open(findings_file) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            # Each finding line has the form "<SEVERITY> <host>: <message>"
            parts = line.split(None, 2)
            if len(parts) < 3:
                continue
            severity, host_colon, message = parts[0], parts[1], parts[2]
            findings.append({
                "severity": severity,
                "host": host_colon.rstrip(":"),
                "message": message,
            })
except FileNotFoundError:
    pass
output = {
    # Timezone-aware replacement for datetime.utcnow(), which is deprecated since Python 3.12
    "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat().replace("+00:00", "Z"),
    "hosts_audited": host_count,
    "warnings": warn_count,
    "critical": crit_count,
    "ssh_failures": ssh_failure_count,
    "total_findings": warn_count + crit_count,
    "findings": findings,
}
with open(json_file, "w") as f:
    json.dump(output, f, indent=2)
print(f"JSON report: {json_file}")
PYEOF
}
# ---------------------------------------------------------------------------
# Main
# ---------------------------------------------------------------------------
@ -390,22 +483,46 @@ main() {
echo "Starting homelab audit — $(date)"
echo "Report dir: $REPORT_DIR"
echo "STUCK_PROC_CPU_WARN threshold: ${STUCK_PROC_CPU_WARN}%"
[[ -n "$HOSTS_FILTER" ]] && echo "Host filter: $HOSTS_FILTER"
echo ""
>"$FINDINGS_FILE"
echo " Checking Proxmox backup recency..."
check_backup_recency
local host_count=0
while read -r label addr; do
echo " Auditing $label ($addr)..."
parse_and_report "$label" "$addr"
check_cert_expiry "$label" "$addr"
((host_count++)) || true
done < <(collect_inventory)
if [[ -n "$HOSTS_FILTER" ]]; then
# --hosts mode: audit specified hosts directly, skip Proxmox inventory
local check_proxmox=0
IFS=',' read -ra filter_hosts <<<"$HOSTS_FILTER"
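for host in "${filter_hosts[@]}"; do
# Entries may be "label:ip" or a bare host; compare on the label part
[[ "${host%%:*}" == "proxmox" ]] && check_proxmox=1
done
if ((check_proxmox)); then
echo " Checking Proxmox backup recency..."
check_backup_recency
fi
for host in "${filter_hosts[@]}"; do
# Split "label:ip"; an entry without a colon is used as both label and address
local label="${host%%:*}" addr="${host#*:}"
echo " Auditing $label ($addr)..."
parse_and_report "$label" "$addr"
check_cert_expiry "$label" "$addr"
AUDITED_HOSTS+=("$label")
((host_count++)) || true
done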
else
echo " Checking Proxmox backup recency..."
check_backup_recency
while read -r label addr; do
echo " Auditing $label ($addr)..."
parse_and_report "$label" "$addr"
check_cert_expiry "$label" "$addr"
AUDITED_HOSTS+=("$label")
((host_count++)) || true
done < <(collect_inventory)
fi
generate_summary "$host_count"
[[ "$JSON_OUTPUT" -eq 1 ]] && write_json_report "$host_count"
}
main "$@"


@ -93,6 +93,34 @@ else
fail "disk_usage" "expected 'N /path', got: '$result'"
fi
# --- --hosts flag parsing ---
echo ""
echo "=== --hosts argument parsing tests ==="
# Single host
input="vm-115:10.10.0.88"
IFS=',' read -ra entries <<<"$input"
label="${entries[0]%%:*}"
addr="${entries[0]#*:}"
if [[ "$label" == "vm-115" && "$addr" == "10.10.0.88" ]]; then
pass "--hosts single entry parsed: $label $addr"
else
fail "--hosts single" "expected 'vm-115 10.10.0.88', got: '$label $addr'"
fi
# Multiple hosts
input="vm-115:10.10.0.88,lxc-225:10.10.0.225"
IFS=',' read -ra entries <<<"$input"
label1="${entries[0]%%:*}"
addr1="${entries[0]#*:}"
label2="${entries[1]%%:*}"
addr2="${entries[1]#*:}"
if [[ "$label1" == "vm-115" && "$addr1" == "10.10.0.88" && "$label2" == "lxc-225" && "$addr2" == "10.10.0.225" ]]; then
pass "--hosts multi entry parsed: $label1 $addr1, $label2 $addr2"
else
fail "--hosts multi" "unexpected parse result"
fi
echo ""
echo "=== Results: $PASS passed, $FAIL failed ==="
((FAIL == 0))


@ -92,6 +92,42 @@ CT 302 does **not** have an SSH key registered with Gitea, so SSH git remotes wo
3. Commit to Gitea, pull on CT 302
4. Add Uptime Kuma monitors if desired
## Health Check Thresholds
Thresholds are evaluated in `health_check.py`. All load thresholds use **per-core** metrics
to avoid false positives from LXC containers (which see the Proxmox host's aggregate load).
### Load Average
| Metric | Value | Rationale |
|--------|-------|-----------|
| `LOAD_WARN_PER_CORE` | `0.7` | Elevated — investigate if sustained |
| `LOAD_CRIT_PER_CORE` | `1.0` | Saturated — CPU is a bottleneck |
| Sample window | 5-minute | Filters transient spikes (not 1-minute) |
**Formula**: `load_per_core = load_5m / nproc`
**Why per-core?** Proxmox LXC containers share the host kernel, so they report the host's
aggregate load average. A 32-core Proxmox host at load 9 is at 0.28/core (healthy), but the
same load of 9 seen from a 4-core LXC would trip a naive absolute threshold of 2×. Dividing
`load_5m` by `nproc` (the host's visible core count) gives the correct ratio.
**Validation examples**:
- Proxmox host: load 9 / 32 cores = 0.28/core → no alert ✓
- VM 116 at 0.75/core → warning ✓ (above 0.7 threshold)
- VM at 1.1/core → critical ✓
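As a rough illustration of the per-core rule, the classification could be sketched as follows. This is not the code in `health_check.py`: the function name `classify_load` and the inclusive (`>=`) comparisons are assumptions, and only the two threshold values come from the table above.

```python
# Illustrative sketch only; not the actual health_check.py implementation.
LOAD_WARN_PER_CORE = 0.7  # value from the Load Average table
LOAD_CRIT_PER_CORE = 1.0  # value from the Load Average table


def classify_load(load_5m: float, nproc: int) -> str:
    """Return 'ok', 'warning', or 'critical' for a 5-minute load average."""
    if nproc < 1:
        raise ValueError("nproc must be at least 1")
    load_per_core = load_5m / nproc
    # Assumption: thresholds are inclusive (>=); the real check may use >.
    if load_per_core >= LOAD_CRIT_PER_CORE:
        return "critical"
    if load_per_core >= LOAD_WARN_PER_CORE:
        return "warning"
    return "ok"
```

On a live Linux host the inputs could be taken from `os.getloadavg()[1]` (the 5-minute value) and `len(os.sched_getaffinity(0))`, which roughly corresponds to what `nproc` reports.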
### Other Thresholds
| Check | Threshold | Notes |
|-------|-----------|-------|
| Zombie processes | 5 | Single zombies are transient noise; alert only if ≥ 5 |
| Swap usage | 30% of total swap | Percentage-based to handle varied swap sizes across hosts |
| Disk warning | 85% | |
| Disk critical | 95% | |
| Memory | 90% | |
| Uptime alert | Non-urgent Discord post | Not a page-level alert |
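A minimal sketch of the zombie and swap checks, assuming inclusive comparisons for swap (the table does not say whether the 30% boundary itself alerts). The helper names `swap_used_pct` and `other_findings` are hypothetical and are not taken from `health_check.py`.

```python
# Illustrative sketch; threshold values from the table, helper names assumed.
ZOMBIE_WARN = 5        # alert only if >= 5 zombies
SWAP_WARN_PCT = 30.0   # percent of total swap


def swap_used_pct(swap_total_kb: int, swap_free_kb: int) -> float:
    """Percentage of swap in use; 0.0 when the host has no swap configured."""
    if swap_total_kb <= 0:
        return 0.0
    return 100.0 * (swap_total_kb - swap_free_kb) / swap_total_kb


def other_findings(zombies: int, swap_total_kb: int, swap_free_kb: int) -> list[str]:
    """Collect human-readable findings for the zombie and swap thresholds."""
    findings = []
    if zombies >= ZOMBIE_WARN:
        findings.append(f"zombies: {zombies}")
    pct = swap_used_pct(swap_total_kb, swap_free_kb)
    if pct >= SWAP_WARN_PCT:  # assumption: inclusive boundary
        findings.append(f"swap: {pct:.0f}% used")
    return findings
```

Using a percentage means a host without swap never alerts, which is why the zero-total case returns `0.0` instead of dividing by zero.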
## Related
- [monitoring/CONTEXT.md](../CONTEXT.md) — Overall monitoring architecture


@ -47,12 +47,13 @@ home_network:
services: ["media", "transcoding"]
description: "Tdarr media transcoding"
vpn_docker:
hostname: "10.10.0.121"
port: 22
user: "cal"
services: ["vpn", "docker"]
description: "VPN and Docker services"
# DECOMMISSIONED: vpn_docker (10.10.0.121) - VM 105 destroyed 2026-04
# vpn_docker:
# hostname: "10.10.0.121"
# port: 22
# user: "cal"
# services: ["vpn", "docker"]
# description: "VPN and Docker services"
remote_servers:
akamai_nano:


@ -23,7 +23,7 @@ servers:
pihole: 10.10.0.16 # Pi-hole DNS and ad blocking
sba_pd_bots: 10.10.0.88 # SBa and PD bot services
tdarr: 10.10.0.43 # Media transcoding
vpn_docker: 10.10.0.121 # VPN and Docker services
# vpn_docker: 10.10.0.121 # DECOMMISSIONED — VM 105 destroyed, migrated to arr-stack LXC 221
```
### Cloud Servers
@ -175,11 +175,12 @@ Host tdarr media
Port 22
IdentityFile ~/.ssh/homelab_rsa
Host docker-vpn
HostName 10.10.0.121
User cal
Port 22
IdentityFile ~/.ssh/homelab_rsa
# DECOMMISSIONED: docker-vpn (10.10.0.121) - VM 105 destroyed, migrated to arr-stack LXC 221
# Host docker-vpn
# HostName 10.10.0.121
# User cal
# Port 22
# IdentityFile ~/.ssh/homelab_rsa
# Remote Cloud Servers
Host akamai-nano akamai


@ -158,6 +158,23 @@ ls -t ~/.local/share/claude-scheduled/logs/backlog-triage/ | head -1
~/.config/claude-scheduled/runner.sh backlog-triage
```
## Session Resumption
Tasks can opt into session persistence for multi-step workflows:
```json
{
"session_resumable": true,
"resume_last_session": true
}
```
When `session_resumable` is `true`, runner.sh saves the `session_id` to `$LOG_DIR/last_session_id` after each run. When `resume_last_session` is also `true`, the next run resumes that session with `--resume`.
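The decision logic described above can be sketched in Python. `runner.sh` itself is a bash script, so this is only an illustration of the behaviour; the function names `build_claude_argv` and `save_session_id` are hypothetical, while the two settings keys, the `last_session_id` file, and the `--resume` flag come from this section.

```python
from pathlib import Path


def build_claude_argv(prompt: str, settings: dict, log_dir: Path) -> list[str]:
    """Assemble a `claude -p` command line, adding --resume when configured."""
    argv = ["claude", "-p", prompt, "--output-format", "json"]
    session_file = log_dir / "last_session_id"
    if (settings.get("session_resumable")
            and settings.get("resume_last_session")
            and session_file.is_file()):
        session_id = session_file.read_text().strip()
        if session_id:  # ignore an empty file instead of passing a blank id
            argv += ["--resume", session_id]
    return argv


def save_session_id(settings: dict, log_dir: Path, session_id: str) -> None:
    """Persist the session id after a run when the task opted in."""
    if settings.get("session_resumable") and session_id:
        (log_dir / "last_session_id").write_text(session_id + "\n")
```

The first run of a resumable task has no saved id, so the sketch falls back to a fresh session in that case.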
Issue-poller and PR-reviewer capture `session_id` in logs and result JSON for manual follow-up.
See also: [Agent SDK Evaluation](agent-sdk-evaluation.md) for CLI vs SDK comparison.
## Cost Safety
- Per-task `max_budget_usd` cap — runner.sh detects `error_max_budget_usd` and warns


@ -0,0 +1,175 @@
---
title: "Agent SDK Evaluation — CLI vs Python/TypeScript SDK"
description: "Comparison of Claude Code CLI invocation (claude -p) vs the native Agent SDK for programmatic use in the headless-claude and claude-scheduled systems."
type: context
domain: scheduled-tasks
tags: [claude-code, sdk, agent-sdk, python, typescript, headless, automation, evaluation]
---
# Agent SDK Evaluation: CLI vs Python/TypeScript SDK
**Date:** 2026-04-03
**Status:** Evaluation complete — recommendation below
**Related:** Issue #3 (headless-claude: Additional Agent SDK improvements)
## 1. Current Approach — CLI via `claude -p`
All headless Claude invocations use the CLI subprocess pattern:
```bash
claude -p "<prompt>" \
--model sonnet \
--output-format json \
--allowedTools "Read,Grep,Glob" \
--append-system-prompt "..." \
--max-budget-usd 2.00
```
**Pros:**
- Simple to invoke from any language (bash, n8n SSH nodes, systemd units)
- Uses Claude Max OAuth — no API key needed, no per-token billing
- Mature and battle-tested in our scheduled-tasks framework
- CLAUDE.md and settings.json are loaded automatically
- No runtime dependencies beyond the CLI binary
**Cons:**
- Structured output requires parsing JSON from stdout
- Error handling is exit-code-based with stderr parsing
- No mid-stream observability (streaming requires JSONL parsing)
- Tool approval is allowlist-only — no dynamic per-call decisions
- Session resumption requires manual `--resume` flag plumbing
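To make the "parse JSON from stdout" cost concrete, here is a minimal sketch of what a caller has to do with the CLI's output. The field names `result`, `session_id`, and `total_cost_usd` are assumed to match the result JSON referenced elsewhere in these docs; `ClaudeCliError` and `parse_cli_result` are hypothetical helpers, not part of any library.

```python
import json


class ClaudeCliError(RuntimeError):
    """Raised when the CLI output cannot be interpreted (hypothetical helper)."""


def parse_cli_result(stdout: str) -> dict:
    """Extract the fields the scheduled tasks care about from CLI JSON output."""
    try:
        payload = json.loads(stdout)
    except json.JSONDecodeError as exc:
        raise ClaudeCliError(f"stdout was not valid JSON: {exc}") from exc
    if not isinstance(payload, dict):
        raise ClaudeCliError("expected a JSON object")
    return {
        "result": payload.get("result"),
        "session_id": payload.get("session_id"),
        # Treat a missing or null cost as zero rather than failing the run.
        "total_cost_usd": float(payload.get("total_cost_usd") or 0.0),
    }
```

A caller would pass it the captured stdout of a `subprocess.run([...], capture_output=True, text=True)` invocation and handle the exit code separately.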
## 2. Python Agent SDK
**Package:** `claude-agent-sdk` (renamed from `claude-code`)
**Install:** `pip install claude-agent-sdk`
**Requires:** Python 3.10+, `ANTHROPIC_API_KEY` env var
```python
from claude_agent_sdk import query, ClaudeAgentOptions
async for message in query(
    prompt="Diagnose server health",
    options=ClaudeAgentOptions(
        allowed_tools=["Read", "Grep", "Bash(python3 *)"],
        output_format={"type": "json_schema", "schema": {...}},
        max_budget_usd=2.00,
    ),
):
    if hasattr(message, "result"):
        print(message.result)
```
**Key features:**
- Async generator with typed `SDKMessage` objects (User, Assistant, Result, System)
- `ClaudeSDKClient` for stateful multi-turn conversations
- `can_use_tool` callback for dynamic per-call tool approval
- In-process hooks (`PreToolUse`, `PostToolUse`, `Stop`, etc.)
- `rewindFiles()` to restore filesystem to any prior message point
- Typed exception hierarchy (`CLINotFoundError`, `ProcessError`, etc.)
**Limitation:** Shells out to the Claude Code CLI binary — it is NOT a pure HTTP client. The binary must be installed.
## 3. TypeScript Agent SDK
**Package:** `@anthropic-ai/claude-agent-sdk` (renamed from `@anthropic-ai/claude-code`)
**Install:** `npm install @anthropic-ai/claude-agent-sdk`
**Requires:** Node 18+, `ANTHROPIC_API_KEY` env var
```typescript
import { query } from "@anthropic-ai/claude-agent-sdk";
for await (const message of query({
  prompt: "Diagnose server health",
  options: {
    allowedTools: ["Read", "Grep", "Bash(python3 *)"],
    maxBudgetUsd: 2.00,
  }
})) {
  if ("result" in message) console.log(message.result);
}
```
**Key features (superset of Python):**
- Same async generator pattern
- `"auto"` permission mode (model classifier per tool call) — TS-only
- `spawnClaudeCodeProcess` hook for remote/containerized execution
- `setMcpServers()` for dynamic MCP server swapping mid-session
- V2 preview: `send()` / `stream()` patterns for simpler multi-turn
- Bundles the Claude Code binary — no separate install needed
## 4. Comparison Matrix
| Capability | `claude -p` CLI | Python SDK | TypeScript SDK |
|---|---|---|---|
| **Auth** | OAuth (Claude Max) | API key only | API key only |
| **Invocation** | Shell subprocess | Async generator | Async generator |
| **Structured output** | `--json-schema` flag | Schema in options | Schema in options |
| **Streaming** | JSONL parsing | Typed messages | Typed messages |
| **Tool approval** | `--allowedTools` only | `can_use_tool` callback | `canUseTool` callback + auto mode |
| **Session resume** | `--resume` flag | `resume: sessionId` | `resume: sessionId` |
| **Cost tracking** | Parse result JSON | `ResultMessage.total_cost_usd` | Same + per-model breakdown |
| **Error handling** | Exit codes + stderr | Typed exceptions | Typed exceptions |
| **Hooks** | External shell scripts | In-process callbacks | In-process callbacks |
| **Custom tools** | Not available | `tool()` decorator | `tool()` + Zod schemas |
| **Subagents** | Not programmatic | `agents` option | `agents` option |
| **File rewind** | Not available | `rewindFiles()` | `rewindFiles()` |
| **MCP servers** | `--mcp-config` file | Inline config object | Inline + dynamic swap |
| **CLAUDE.md loading** | Automatic | Must opt-in (`settingSources`) | Must opt-in |
| **Dependencies** | CLI binary | CLI binary + Python | Node 18+ (bundles CLI) |
## 5. Integration Paths
### A. n8n Code Nodes
The n8n Code node supports JavaScript (not TypeScript directly, but the SDK's JS output works). This would replace the current SSH → CLI pattern:
```
Schedule Trigger → Code Node (JS, uses SDK) → IF → Discord
```
**Trade-off:** Eliminates the SSH hop to CT 300, but requires `ANTHROPIC_API_KEY` and n8n to have the npm package installed. Current n8n runs in a Docker container on CT 210 — would need the SDK and CLI binary in the image.
### B. Standalone Python Scripts
Replace `claude -p` subprocess calls in custom dispatchers with the Python SDK:
```python
# Instead of: subprocess.run(["claude", "-p", prompt, ...])
async for msg in query(prompt=prompt, options=opts):
    ...
```
**Trade-off:** Richer error handling and streaming, but our dispatchers are bash scripts, not Python. Would require rewriting `runner.sh` and dispatchers in Python.
### C. Systemd-triggered Tasks (Current Architecture)
Keep systemd timers → bash scripts, but optionally invoke a thin Python wrapper that uses the SDK instead of `claude -p` directly.
**Trade-off:** Adds Python as a dependency for scheduled tasks that currently only need bash + the CLI binary. Marginal benefit unless we need hooks or dynamic tool approval.
## 6. Recommendation
**Stay with CLI invocation for now. Revisit the Python SDK when we need dynamic tool approval or in-process hooks.**
### Rationale
1. **Auth is the blocker.** The SDK requires `ANTHROPIC_API_KEY` (API billing). Our entire scheduled-tasks framework runs on Claude Max OAuth at zero marginal cost. Switching to the SDK means paying per-token for every scheduled task, issue-worker, and PR-reviewer invocation. This alone makes the SDK non-viable for our current architecture.
2. **The CLI covers our needs.** With `--append-system-prompt` (done), `--resume` (this PR), `--json-schema`, and `--allowedTools`, the CLI provides everything we currently need. Session resumption was the last missing piece.
3. **Bash scripts are the right abstraction.** Our runners are launched by systemd timers. Bash + CLI is the natural fit — no runtime dependencies, no async event loops, no package management.
### When to Revisit
- If Anthropic adds OAuth support to the SDK (eliminating the billing difference)
- If we need dynamic tool approval (e.g., "allow this Bash command but deny that one" at runtime)
- If we build a long-running Python service that orchestrates multiple Claude sessions (the `ClaudeSDKClient` stateful pattern would be valuable there)
- If we move to n8n custom nodes written in TypeScript (the TS SDK bundles the CLI binary)
### Migration Path (If Needed Later)
1. Start with the Python SDK in a single task (e.g., `backlog-triage`) as a proof of concept
2. Create a thin `sdk-runner.py` wrapper that reads the same `settings.json` and `prompt.md` files
3. Swap the systemd unit's `ExecStart` from `runner.sh` to `sdk-runner.py`
4. Expand to other tasks if the POC proves valuable


@ -245,11 +245,25 @@ hosts:
- sqlite-major-domo
- temp-postgres
# Docker Home Servers VM (Proxmox) - decommission candidate
# VM 116: Only Jellyfin remains after 2026-04-03 cleanup (watchstate removed — duplicate of manticore's canonical instance)
# Jellyfin on manticore already covers this service. VM 116 + VM 110 are candidates to reclaim 8 vCPUs + 16 GB RAM.
# See issue #31 for cleanup details.
docker-home-servers:
type: docker
ip: 10.10.0.124
vmid: 116
user: cal
description: "Legacy home servers VM — Jellyfin only, decommission candidate"
config_paths:
docker-compose: /home/cal/container-data
services:
- jellyfin # only remaining service; duplicate of ubuntu-manticore jellyfin
decommission_candidate: true
notes: "watchstate removed 2026-04-03 (duplicate of manticore); 3.36 GB images pruned; see issue #31"
# Decommissioned hosts (kept for reference)
# decommissioned:
# tdarr-old:
# ip: 10.10.0.43
# note: "Replaced by ubuntu-manticore tdarr"
# docker-home:
# ip: 10.10.0.124
# note: "Decommissioned"


@ -0,0 +1,246 @@
---
title: "Proxmox Monthly Maintenance Reboot"
description: "Runbook for the first-Sunday-of-the-month Proxmox host reboot — dependency-aware shutdown/startup order, validation checklist, and Ansible automation."
type: runbook
domain: server-configs
tags: [proxmox, maintenance, reboot, ansible, operations, systemd]
---
# Proxmox Monthly Maintenance Reboot
## Overview
| Detail | Value |
|--------|-------|
| **Schedule** | 1st Sunday of every month, 08:00 UTC (3:00 AM EST / 4:00 AM EDT) |
| **Expected downtime** | ~15 minutes (host reboot + VM/LXC startup) |
| **Orchestration** | Ansible on LXC 304 — shutdown playbook → host reboot → post-reboot startup playbook |
| **Calendar** | Google Calendar recurring event: "Proxmox Monthly Maintenance Reboot" |
| **HA DNS** | ubuntu-manticore (10.10.0.226) provides Pi-hole 2 during Proxmox downtime |
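For scripts that need to know whether a given date falls on the maintenance window (for example, to mute alerts), the "first Sunday" rule can be computed directly. This is a small sketch; `first_sunday` and `is_maintenance_day` are hypothetical helper names and are not part of the playbooks.

```python
import datetime


def first_sunday(year: int, month: int) -> datetime.date:
    """Return the date of the first Sunday in the given month."""
    first = datetime.date(year, month, 1)
    # date.weekday(): Monday == 0 ... Sunday == 6
    days_until_sunday = (6 - first.weekday()) % 7
    return first + datetime.timedelta(days=days_until_sunday)


def is_maintenance_day(day: datetime.date) -> bool:
    """True when `day` is the first Sunday of its month."""
    return day == first_sunday(day.year, day.month)
```

The result is always between the 1st and the 7th of the month.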
## Why
- Kernel updates accumulate but do not take effect until the host reboots
- Long uptimes allow memory leaks and process state drift (e.g., avahi busy-loops)
- Validates that all VMs/LXCs auto-start cleanly with `onboot: 1`
## Architecture
The reboot is split into two playbooks because LXC 304 (the Ansible controller) is itself a guest on the Proxmox host being rebooted:
1. **`monthly-reboot.yml`** — Snapshots all guests, shuts them down in dependency order, issues a fire-and-forget `reboot` to the Proxmox host, then exits. LXC 304 is killed when the host reboots.
2. **`post-reboot-startup.yml`** — After the host reboots, LXC 304 auto-starts via `onboot: 1`. A systemd service (`ansible-post-reboot.service`) waits 120 seconds for the Proxmox API to stabilize, then starts all guests in dependency order with staggered delays.
The `onboot: 1` flag on all production guests acts as a safety net — even if the post-reboot playbook fails, Proxmox will start everything (though without controlled ordering).
## Prerequisites (Before Maintenance)
- [ ] Verify no active Tdarr transcodes on ubuntu-manticore
- [ ] Verify no running database backups
- [ ] Ensure workstation has Pi-hole 2 (10.10.0.226) as a fallback DNS server so it fails over automatically during downtime
- [ ] Confirm ubuntu-manticore Pi-hole 2 is healthy: `ssh manticore "docker exec pihole pihole status"`
## `onboot` Audit
All production VMs and LXCs must have `onboot: 1` so they restart automatically as a safety net.
**Check VMs:**
```bash
ssh proxmox "for id in \$(qm list | awk 'NR>1{print \$1}'); do \
name=\$(qm config \$id | grep '^name:' | awk '{print \$2}'); \
onboot=\$(qm config \$id | grep '^onboot:'); \
echo \"VM \$id (\$name): \${onboot:-onboot NOT SET}\"; \
done"
```
**Check LXCs:**
```bash
ssh proxmox "for id in \$(pct list | awk 'NR>1{print \$1}'); do \
name=\$(pct config \$id | grep '^hostname:' | awk '{print \$2}'); \
onboot=\$(pct config \$id | grep '^onboot:'); \
echo \"LXC \$id (\$name): \${onboot:-onboot NOT SET}\"; \
done"
```
**Audit results (2026-04-03):**
| ID | Name | Type | `onboot` | Status |
|----|------|------|----------|--------|
| 106 | docker-home | VM | 1 | OK |
| 109 | homeassistant | VM | 1 | OK (fixed 2026-04-03) |
| 110 | discord-bots | VM | 1 | OK |
| 112 | databases-bots | VM | 1 | OK |
| 115 | docker-sba | VM | 1 | OK |
| 116 | docker-home-servers | VM | 1 | OK |
| 210 | docker-n8n-lxc | LXC | 1 | OK |
| 221 | arr-stack | LXC | 1 | OK (fixed 2026-04-03) |
| 222 | memos | LXC | 1 | OK |
| 223 | foundry-lxc | LXC | 1 | OK (fixed 2026-04-03) |
| 225 | gitea | LXC | 1 | OK |
| 227 | uptime-kuma | LXC | 1 | OK |
| 301 | claude-discord-coordinator | LXC | 1 | OK |
| 302 | claude-runner | LXC | 1 | OK |
| 303 | mcp-gateway | LXC | 0 | Intentional (on-demand) |
| 304 | ansible-controller | LXC | 1 | OK |
**If any production guest is missing `onboot: 1`:**
```bash
ssh proxmox "qm set <VMID> --onboot 1" # for VMs
ssh proxmox "pct set <CTID> --onboot 1" # for LXCs
```
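If the audit output is collected by a script rather than read by eye, the `onboot` flag can be extracted from a config dump with a few lines of Python. This is a sketch under the assumption that `qm config` and `pct config` print one `key: value` pair per line, as in the VM config files in this repository; `onboot_enabled` is a hypothetical helper name.

```python
def onboot_enabled(config_text: str) -> bool:
    """True when a `qm config` / `pct config` dump contains `onboot: 1`."""
    for line in config_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "onboot":
            return value.strip() == "1"
    # Key absent: treated the same as "onboot NOT SET" in the shell audit above.
    return False
```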
## Shutdown Order (Dependency-Aware)
Reverse of the validated startup sequence. Stop consumers before their dependencies. Each tier polls per-guest status rather than using fixed waits.
```
Tier 4 — Media & Others (no downstream dependents)
VM 109 homeassistant
LXC 221 arr-stack
LXC 222 memos
LXC 223 foundry-lxc
LXC 302 claude-runner
Tier 3 — Applications (depend on databases + infra)
VM 115 docker-sba (Paper Dynasty, Major Domo)
VM 110 discord-bots
LXC 301 claude-discord-coordinator
Tier 2 — Infrastructure + DNS (depend on databases)
VM 106 docker-home (Pi-hole 1, NPM)
LXC 225 gitea
LXC 210 docker-n8n-lxc
LXC 227 uptime-kuma
VM 116 docker-home-servers
Tier 1 — Databases (no dependencies, shut down last)
VM 112 databases-bots (force-stop after 90s if ACPI ignored)
→ LXC 304 issues fire-and-forget reboot to Proxmox host, then is killed
```
**Known quirks:**
- VM 112 (databases-bots) may ignore ACPI shutdown — playbook force-stops after 90s
- VM 109 (homeassistant) is self-managed via HA Supervisor, excluded from Ansible inventory
- LXC 303 (mcp-gateway) has `onboot: 0` and is operator-managed — not included in shutdown/startup. If it was running before maintenance, bring it up manually afterward
## Startup Order (Staggered)
After the Proxmox host reboots, LXC 304 auto-starts and the `ansible-post-reboot.service` waits 120s before running the controlled startup:
```
Tier 1 — Databases first
VM 112 databases-bots
→ wait 30s for DB to accept connections
Tier 2 — Infrastructure + DNS
VM 106 docker-home (Pi-hole 1, NPM)
LXC 225 gitea
LXC 210 docker-n8n-lxc
LXC 227 uptime-kuma
VM 116 docker-home-servers
→ wait 30s
Tier 3 — Applications
VM 115 docker-sba
VM 110 discord-bots
LXC 301 claude-discord-coordinator
→ wait 30s
Pi-hole fix — restart container via SSH to clear UDP DNS bug
ssh docker-home "docker restart pihole"
→ wait 10s
Tier 4 — Media & Others
VM 109 homeassistant
LXC 221 arr-stack
LXC 222 memos
LXC 223 foundry-lxc
LXC 302 claude-runner
```
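The relationship between the two orderings can be expressed as data: the shutdown sequence visits the same tiers in reverse. The sketch below copies the guest IDs from the startup order above; the names `STARTUP_TIERS`, `shutdown_tiers`, and `flatten` are illustrative and do not come from the Ansible playbooks.

```python
# Guest IDs per tier, copied from the startup order above.
STARTUP_TIERS = [
    ("Tier 1: Databases", [112]),
    ("Tier 2: Infrastructure + DNS", [106, 225, 210, 227, 116]),
    ("Tier 3: Applications", [115, 110, 301]),
    ("Tier 4: Media & Others", [109, 221, 222, 223, 302]),
]


def shutdown_tiers(startup):
    """Shutdown walks the tiers in reverse so consumers stop before dependencies."""
    return list(reversed(startup))


def flatten(tiers):
    """Flatten tiers into a single ordered list of guest IDs."""
    return [guest for _, guests in tiers for guest in guests]
```

Only the tier order is reversed; guests inside a tier keep the same order, matching the two listings in this runbook.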
## Post-Reboot Validation
- [ ] Pi-hole 1 DNS resolving: `ssh docker-home "docker exec pihole dig google.com @127.0.0.1"`
- [ ] Gitea accessible: `curl -sf https://git.manticorum.com/api/v1/version`
- [ ] n8n healthy: `ssh docker-n8n-lxc "docker ps --filter name=n8n --format '{{.Status}}'"`
- [ ] Discord bots responding (check Discord)
- [ ] Uptime Kuma dashboard green: `curl -sf http://10.10.0.227:3001/api/status-page/homelab`
- [ ] Home Assistant running: `curl -sf http://10.10.0.109:8123/api/ -H 'Authorization: Bearer <token>'`
- [ ] Maintenance snapshots cleaned up (auto, 7-day retention)
## Automation
### Ansible Playbooks
Both located at `/opt/ansible/playbooks/` on LXC 304.
```bash
# Dry run — shutdown only
ssh ansible "ansible-playbook /opt/ansible/playbooks/monthly-reboot.yml --check"
# Manual full execution — shutdown + reboot
ssh ansible "ansible-playbook /opt/ansible/playbooks/monthly-reboot.yml"
# Manual post-reboot startup (if automatic startup failed)
ssh ansible "ansible-playbook /opt/ansible/playbooks/post-reboot-startup.yml"
# Shutdown only — skip the host reboot
ssh ansible "ansible-playbook /opt/ansible/playbooks/monthly-reboot.yml --tags shutdown"
```
### Systemd Units (on LXC 304)
| Unit | Purpose | Schedule |
|------|---------|----------|
| `ansible-monthly-reboot.timer` | Triggers shutdown + reboot playbook | 1st Sunday of month, 08:00 UTC |
| `ansible-monthly-reboot.service` | Runs `monthly-reboot.yml` | Activated by timer |
| `ansible-post-reboot.service` | Runs `post-reboot-startup.yml` | On boot (multi-user.target), only if uptime < 10 min |
```bash
# Check timer status
ssh ansible "systemctl status ansible-monthly-reboot.timer"
# Next scheduled run
ssh ansible "systemctl list-timers ansible-monthly-reboot.timer"
# Check post-reboot service status
ssh ansible "systemctl status ansible-post-reboot.service"
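# Pause the timer (e.g., during an incident). `stop` lasts only until LXC 304 next boots;
# use `disable --now` to keep it off, and re-enable it afterward.
ssh ansible "systemctl stop ansible-monthly-reboot.timer"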
```
### Deployment (one-time setup on LXC 304)
```bash
# Copy playbooks
scp ansible/playbooks/monthly-reboot.yml ansible:/opt/ansible/playbooks/
scp ansible/playbooks/post-reboot-startup.yml ansible:/opt/ansible/playbooks/
# Copy and enable systemd units
scp ansible/systemd/ansible-monthly-reboot.timer ansible:/etc/systemd/system/
scp ansible/systemd/ansible-monthly-reboot.service ansible:/etc/systemd/system/
scp ansible/systemd/ansible-post-reboot.service ansible:/etc/systemd/system/
ssh ansible "sudo systemctl daemon-reload && \
sudo systemctl enable --now ansible-monthly-reboot.timer && \
sudo systemctl enable ansible-post-reboot.service"
# Verify SSH key access from LXC 304 to docker-home (needed for Pi-hole restart)
ssh ansible "ssh -o BatchMode=yes docker-home 'echo ok'"
```
## Rollback
If a guest fails to start after reboot:
1. Check Proxmox web UI or `pvesh get /nodes/proxmox/qemu/<VMID>/status/current`
2. Review guest logs: `ssh proxmox "journalctl -u pve-guests -n 50"`
3. Manual start: `ssh proxmox "pvesh create /nodes/proxmox/qemu/<VMID>/status/start"`
4. If guest is corrupted, restore from the pre-reboot Proxmox snapshot
5. If post-reboot startup failed entirely, run manually: `ssh ansible "ansible-playbook /opt/ansible/playbooks/post-reboot-startup.yml"`
## Related Documentation
- [Ansible Controller Setup](../../vm-management/ansible-controller-setup.md) — LXC 304 details and inventory
- [Proxmox 7→9 Upgrade Plan](../../vm-management/proxmox-upgrades/proxmox-7-to-9-upgrade-plan.md) — original startup order and Phase 1 lessons
- [VM Decommission Runbook](../../vm-management/vm-decommission-runbook.md) — removing VMs from the rotation


@ -1,15 +0,0 @@
agent: 1
boot: order=scsi0;net0
cores: 8
memory: 16384
meta: creation-qemu=6.1.0,ctime=1646688596
name: docker-vpn
net0: virtio=76:36:85:A7:6A:A3,bridge=vmbr0,firewall=1
numa: 0
onboot: 1
ostype: l26
scsi0: local-lvm:vm-105-disk-0,size=256G
scsihw: virtio-scsi-pci
smbios1: uuid=55061264-b9b1-4ce4-8d44-9c187affcb1d
sockets: 1
vmgenid: 30878bdf-66f9-41bf-be34-c31b400340f9


@ -12,5 +12,5 @@ ostype: l26
scsi0: local-lvm:vm-115-disk-0,size=256G
scsihw: virtio-scsi-pci
smbios1: uuid=19be98ee-f60d-473d-acd2-9164717fcd11
sockets: 2
sockets: 1
vmgenid: 682dfeab-8c63-4f0b-8ed2-8828c2f808ef


@ -28,8 +28,8 @@ tags: [proxmox, upgrade, pve, backup, rollback, infrastructure]
**Production Services** (7 LXC + 7 VMs) — cleaned up 2026-02-19:
- **Critical**: Paper Dynasty/Major Domo (VM 115), Discord bots (VM 110), Gitea (LXC 225), n8n (LXC 210), Home Assistant (VM 109), Databases (VM 112), docker-home/Pi-hole 1 (VM 106)
- **Important**: Claude Discord Coordinator (LXC 301), arr-stack (LXC 221), Uptime Kuma (LXC 227), Foundry VTT (LXC 223), Memos (LXC 222)
- **Stopped/Investigate**: docker-vpn (VM 105, decommissioning), docker-home-servers (VM 116, needs investigation)
- **Removed (2026-02-19)**: 108 (ansible), 224 (openclaw), 300 (openclaw-migrated), 101/102/104/111/211 (game servers), 107 (plex), 113 (tdarr - moved to .226), 114 (duplicate arr-stack), 117 (unused), 100/103 (old templates)
- **Decommission Candidate**: docker-home-servers (VM 116) — Jellyfin-only after 2026-04-03 cleanup; watchstate removed (duplicate of manticore); see issue #31
- **Removed (2026-02-19)**: 108 (ansible), 224 (openclaw), 300 (openclaw-migrated), 101/102/104/111/211 (game servers), 107 (plex), 113 (tdarr - moved to .226), 114 (duplicate arr-stack), 117 (unused), 100/103 (old templates), 105 (docker-vpn - decommissioned 2026-04)
**Key Constraints**:
- Home Assistant VM 109 requires dual network (vmbr1 for Matter support)


@ -262,7 +262,7 @@ When connecting Jellyseerr to arr apps, be careful with tag configurations - inv
- [x] Test movie/show requests through Jellyseerr
### After 48 Hours
- [ ] Decommission VM 121 (docker-vpn)
- [x] Decommission VM 105 (docker-vpn, 10.10.0.121)
- [ ] Clean up local migration temp files (`/tmp/arr-config-migration/`)
---


@ -0,0 +1,33 @@
---
title: "Workstation Troubleshooting"
description: "Troubleshooting notes for Nobara/KDE Wayland workstation issues."
type: troubleshooting
domain: workstation
tags: [troubleshooting, wayland, kde]
---
# Workstation Troubleshooting
## Discord screen sharing shows no windows on KDE Wayland (2026-04-03)
**Severity:** Medium — cannot share screen via Discord desktop app
**Problem:** Clicking "Share Your Screen" in Discord desktop app (v0.0.131, Electron 37) opens the Discord picker but shows zero windows/screens. Same behavior in both the desktop app and the web app when using Discord's own picker. Affects both native Wayland and XWayland modes.
**Root Cause:** Discord's built-in screen picker uses Electron's `desktopCapturer.getSources()` which relies on X11 window enumeration. On KDE Wayland:
- In native Wayland mode: no X11 windows exist, so the picker is empty
- In forced X11/XWayland mode (`ELECTRON_OZONE_PLATFORM_HINT=x11`): Discord can only see other XWayland windows (itself, Android emulator), not native Wayland apps
- Discord ignores `--use-fake-ui-for-media-stream` and other Chromium flags that should force portal usage
- The `discord-flags.conf` file is **not read** by the Nobara/RPM Discord package — flags must go in the `.desktop` file `Exec=` line
**Fix:** Use **Discord web app in Firefox** for screen sharing. Firefox natively delegates to the XDG Desktop Portal via PipeWire, which shows the KDE screen picker with all windows. The desktop app's own picker remains broken on Wayland as of v0.0.131.
Configuration applied (for general Discord Wayland support):
- `~/.local/share/applications/discord.desktop` — overrides system `.desktop` with Wayland flags
- `~/.config/discord-flags.conf` — created but not read by this Discord build
**Lesson:**
- Discord desktop on Linux Wayland cannot do screen sharing through its own picker — always use the web app in Firefox for this
- Electron's `desktopCapturer` API is fundamentally X11-only; the PipeWire/portal path requires the app to use `getDisplayMedia()` instead, which Discord's desktop app does not do
- `discord-flags.conf` is unreliable across distros — always verify flags landed in `/proc/<pid>/cmdline`
- Vesktop (community client) is an alternative that properly implements portal-based screen sharing, if the web app is insufficient
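Checking `/proc/<pid>/cmdline` by hand is awkward because the arguments are NUL-separated. A small Linux-only sketch of that verification step follows; `split_cmdline` and `process_has_flag` are hypothetical helper names.

```python
from pathlib import Path


def split_cmdline(raw: bytes) -> list[str]:
    """Split the NUL-separated contents of /proc/<pid>/cmdline into arguments."""
    return [arg.decode(errors="replace") for arg in raw.split(b"\0") if arg]


def process_has_flag(pid: int, flag: str) -> bool:
    """Linux only: check whether a running process was started with `flag`."""
    raw = Path(f"/proc/{pid}/cmdline").read_bytes()
    # Match both the bare flag and the `flag=value` form.
    return any(arg == flag or arg.startswith(flag + "=")
               for arg in split_cmdline(raw))
```

For example, after editing the `.desktop` file one could look up Discord's PID with `pgrep` and call `process_has_flag(pid, "--ozone-platform-hint")`; the flag name here is only an example of what might be checked.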