diff --git a/development/ace-step-local-network.md b/development/ace-step-local-network.md new file mode 100644 index 0000000..efc30a5 --- /dev/null +++ b/development/ace-step-local-network.md @@ -0,0 +1,95 @@ +--- +title: "ACE-Step 1.5 — Local Network Setup Guide" +description: "How to run ACE-Step AI music generator on the local network via Gradio UI or REST API, including .env configuration and startup notes." +type: guide +domain: development +tags: [ace-step, ai, music-generation, gradio, gpu, cuda] +--- + +# ACE-Step 1.5 — Local Network Setup + +ACE-Step is an open-source AI music generation model. This guide covers running it on the workstation and serving the Gradio web UI to the local network. + +## Location + +``` +/mnt/NV2/Development/ACE-Step-1.5/ +``` + +Cloned from GitHub. Uses `uv` for dependency management — the `.venv` is created automatically on first run. + +## Quick Start (Gradio UI) + +```bash +cd /mnt/NV2/Development/ACE-Step-1.5 +./start_gradio_ui.sh +``` + +Accessible from any device on the network at **http://10.10.0.41:7860** (or whatever the workstation IP is). + +## .env Configuration + +The `.env` file in the project root persists settings across git updates. Current config: + +```env +SERVER_NAME=0.0.0.0 +PORT=7860 +LANGUAGE=en +``` + +### Key Settings + +| Variable | Default | Description | +|----------|---------|-------------| +| `SERVER_NAME` | `127.0.0.1` | Set to `0.0.0.0` for LAN access | +| `PORT` | `7860` | Gradio UI port | +| `LANGUAGE` | `en` | UI language (`en`, `zh`, `he`, `ja`). 
**Must be set** — empty value causes `unbound variable` error with the launcher's `set -u` | +| `ACESTEP_CONFIG_PATH` | `acestep-v15-turbo` | DiT model variant | +| `ACESTEP_LM_MODEL_PATH` | `acestep-5Hz-lm-0.6B` | Language model for lyrics/prompts | +| `ACESTEP_INIT_LLM` | `auto` | `auto` / `true` / `false` — auto detects based on VRAM | +| `CHECK_UPDATE` | `true` | Set to `false` to skip interactive update prompt (useful for background/automated starts) | + +See `.env.example` for the full list. + +## REST API Server (Alternative) + +For programmatic access instead of the web UI: + +```bash +cd /mnt/NV2/Development/ACE-Step-1.5 +./start_api_server.sh +``` + +Default: `http://127.0.0.1:8001`. To serve on LAN, edit `start_api_server.sh` line 12: + +```bash +HOST="0.0.0.0" +``` + +API docs available at `http://:8001/docs`. + +## Hardware Profile (Workstation) + +- **GPU**: NVIDIA RTX 4080 SUPER (16 GB VRAM) +- **Tier**: 16GB class — auto-enables CPU offload, INT8 quantization, LLM +- **Max batch (with LM)**: 4 +- **Max batch (without LM)**: 8 +- **Max duration (with LM)**: 480s (8 min) +- **Max duration (without LM)**: 600s (10 min) + +## Startup Behavior + +1. Loads `.env` configuration +2. Checks for git updates (interactive prompt — set `CHECK_UPDATE=false` to skip) +3. Creates `.venv` via `uv sync` if missing (slow on first run) +4. Runs legacy NVIDIA torch compatibility check +5. Loads DiT model → quantizes to INT8 → loads LM → allocates KV cache +6. Launches Gradio with queue for multi-user support + +Full startup takes ~30-40 seconds after first run. + +## Gotchas + +- **LANGUAGE must be set in `.env`**: The system `$LANGUAGE` locale variable can be empty, causing the launcher to crash with `unbound variable` due to `set -u`. Always include `LANGUAGE=en` in `.env`. +- **Update prompt blocks background execution**: If running headlessly or from a script, set `CHECK_UPDATE=false` to avoid the interactive Y/N prompt. 
+- **Model downloads**: First run downloads ~4-5 GB of model weights from HuggingFace. Subsequent runs use cached checkpoints in `./checkpoints/`. diff --git a/development/subagent-write-permission-blocked.md b/development/subagent-write-permission-blocked.md new file mode 100644 index 0000000..63e594e --- /dev/null +++ b/development/subagent-write-permission-blocked.md @@ -0,0 +1,80 @@ +--- +title: "Fix: Subagent Write/Edit tools blocked by permission mode mismatch" +description: "Claude Code subagents cannot use Write or Edit tools unless spawned with mode: acceptEdits — other permission modes (dontAsk, auto, bypassPermissions) do not grant file-write capability." +type: troubleshooting +domain: development +tags: [troubleshooting, claude-code, permissions, agents, subagents] +--- + +# Fix: Subagent Write/Edit tools blocked by permission mode mismatch + +**Date:** 2026-03-28 +**Severity:** Medium — blocks all agent-driven code generation workflows until identified + +## Problem + +When orchestrating multi-agent code generation (spawning engineer agents to write code in parallel), all subagents could Read/Glob/Grep files but Write and Edit tool calls were silently denied. Agents would complete their analysis, prepare the full file content, then report "blocked on Write/Edit permission." + +This happened across **every** permission mode tried: +- `mode: bypassPermissions` — denied (with worktree isolation) +- `mode: auto` — denied (with and without worktree isolation) +- `mode: dontAsk` — denied (with and without worktree isolation) + +## Root Cause + +Claude Code's Agent tool has multiple permission modes that control different things: + +| Mode | What it controls | Grants Write/Edit? 
| +|------|-----------------|-------------------| +| `default` | User prompted for each tool call | No — user must approve each | +| `dontAsk` | Suppresses user prompts | **No** — suppresses prompts but doesn't grant capability | +| `auto` | Auto-approves based on context | **No** — same issue | +| `bypassPermissions` | Skips permission-manager hooks | **No** — only bypasses plugin hooks, not tool-level gates | +| `acceptEdits` | Grants file modification capability | **Yes** — this is the correct mode | + +The key distinction: `dontAsk`/`auto`/`bypassPermissions` control the **user-facing permission prompt** (whether the user gets asked to approve). But Write/Edit tools have an **internal capability gate** that checks whether the agent was explicitly authorized to modify files. Only `acceptEdits` provides that authorization. + +## Additional Complication: permission-manager plugin + +The `permission-manager@agent-toolkit` plugin (`cmd-gate` PreToolUse hook) adds a second layer that blocks Bash-based file writes (output redirection `>`, `tee`, etc.). When agents fell back to Bash after Write/Edit failed, the plugin caught those too. + +- `bypassPermissions` mode is documented to skip cmd-gate entirely, but this didn't work reliably in worktree isolation +- Disabling the plugin (`/plugin` → toggle off `permission-manager@agent-toolkit`, then `/reload-plugins`) removed the Bash-level blocks but did NOT fix Write/Edit + +## Fix + +**Use `mode: acceptEdits`** when spawning any agent that needs to create or modify files: + +``` +Agent( + subagent_type="engineer", + mode="acceptEdits", # <-- This is the critical setting + prompt="..." 
+) +``` + +**Additional recommendations:** +- Worktree isolation (`isolation: "worktree"`) may compound permission issues — avoid it unless the agents genuinely need isolation (e.g., conflicting file edits) +- For agents that only read (reviewers, validators), any mode works +- If the permission-manager plugin is also blocking Bash fallbacks, disable it temporarily or add classifiers for the specific commands needed + +## Reproduction + +1. Spawn an engineer agent with `mode: dontAsk` and a prompt to create a new file +2. Agent will Read reference files successfully, prepare content, then report Write tool denied +3. Change to `mode: acceptEdits` — same prompt succeeds immediately + +## Environment + +- Claude Code CLI on Linux (Nobara/Fedora) +- Plugins: permission-manager@agent-toolkit (St0nefish/agent-toolkit) +- Agent types tested: engineer, general-purpose +- Models tested: sonnet subagents + +## Lessons + +- **Always use `acceptEdits` for code-writing agents.** The mode name is the clue — it's not just "accepting" edits from the user, it's granting the agent the capability to make edits. +- **`dontAsk` ≠ "can do anything."** It means "don't prompt the user" — but the capability to write files is a separate authorization layer. +- **Test agent permissions early.** When building a multi-agent orchestration workflow, verify the first agent can actually write before launching a full wave. A quick single-file test agent saves time. +- **Worktree isolation adds complexity.** Only use it when agents would genuinely conflict on the same files. For non-overlapping file changes, skip isolation. +- **The permission-manager plugin is a separate concern.** It blocks Bash file-write commands (>, tee, cat heredoc). Disabling it fixes Bash fallbacks but not Write/Edit tool calls. Both layers must be addressed independently. 
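The mode table in this write-up can be turned into a small pre-flight check that runs before a wave of agents is launched. The following is a hedged Python sketch: the mode names and their Write/Edit behavior are copied from the observations above rather than from official documentation, and `SpawnPlan`, `GRANTS_WRITE`, and `misconfigured_agents` are hypothetical names, not part of any Claude Code API.

```python
from dataclasses import dataclass
from typing import List

# Write/Edit behavior per mode, as observed in this write-up (not an official list).
GRANTS_WRITE = {
    "default": False,            # user must approve each call
    "dontAsk": False,            # suppresses prompts only
    "auto": False,
    "bypassPermissions": False,  # only skips plugin hooks
    "acceptEdits": True,
}


@dataclass
class SpawnPlan:
    """Hypothetical description of one agent about to be spawned."""
    name: str
    mode: str
    needs_write: bool


def misconfigured_agents(plans: List[SpawnPlan]) -> List[str]:
    """Return names of agents that must write files but whose mode won't allow it."""
    return [
        plan.name
        for plan in plans
        if plan.needs_write and not GRANTS_WRITE.get(plan.mode, False)
    ]


plans = [
    SpawnPlan("engineer-1", "dontAsk", needs_write=True),
    SpawnPlan("engineer-2", "acceptEdits", needs_write=True),
    SpawnPlan("reviewer", "auto", needs_write=False),
]
print(misconfigured_agents(plans))
```

Unknown modes are treated as not granting write access, so a typo in the mode name surfaces in the check instead of at the first denied Write call.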
diff --git a/gaming/release-2026.4.02.md b/gaming/release-2026.4.02.md new file mode 100644 index 0000000..b3299f9 --- /dev/null +++ b/gaming/release-2026.4.02.md @@ -0,0 +1,50 @@ +--- +title: "MLB The Show Grind — 2026.4.02" +description: "Pack opening command, full cycle orchestrator, keyboard dismiss fix, package split." +type: reference +domain: gaming +tags: [release-notes, deployment, mlb-the-show, automation] +--- + +# MLB The Show Grind — 2026.4.02 + +**Date:** 2026-04-02 +**Project:** mlb-the-show (`/mnt/NV2/Development/mlb-the-show`) + +## Release Summary + +Added pack opening automation and a full buy→exchange→open cycle command. Fixed a critical bug where KEYCODE_BACK was closing the buy order modal instead of dismissing the keyboard, preventing all order placement. Split the 1600-line single-file script into a proper Python package. + +## Changes + +### New Features +- **`open-packs` command** — navigates to My Packs, finds the target pack by name (default: Exchange - Live Series Gold), rapid-taps Open Next at ~0.3s/pack with periodic verification +- **`cycle` command** — full orchestrated flow: buy silvers for specified OVR tiers → exchange all dupes into gold packs → open all gold packs +- **`DEFAULT_PACK_NAME` constant** — `"Exchange - Live Series Gold"` extracted from inline strings + +### Bug Fixes +- **Keyboard dismiss fix** — `KEYCODE_BACK` was closing the entire buy order modal instead of just dismissing the numeric keyboard. Replaced with `tap(540, 900)` to tap a neutral area. This was the root cause of all buy orders silently failing (0 orders placed despite cards having room). 
+- **`full_cycle` passed no args to `open_packs()`** — now passes `packs_exchanged` count to bound the open loop +- **`isinstance(result, dict)` dead code** removed from `full_cycle` — `grind_exchange` always returns `int` +- **`_find_nearest_open_button`** — added x-column constraint (200px) and zero-width element filtering to prevent matching ghost buttons from collapsed packs + +### Refactoring +- **Package split** — `scripts/grind.py` (1611 lines) → `scripts/grind/` package: + - `constants.py` (104 lines) — coordinates, price gates, UI element maps + - `adb_utils.py` (125 lines) — ADB shell, tap, swipe, dump_ui, element finders + - `navigation.py` (107 lines) — screen navigation (nav_to, nav_tab, FAB) + - `exchange.py` (283 lines) — gold exchange logic + - `market.py` (469 lines) — market scanning and buy order placement + - `packs.py` (131 lines) — pack opening + - `__main__.py` (390 lines) — CLI entry point and orchestrators (grind_loop, full_cycle) +- `scripts/grind.py` retained as a thin wrapper for `uv run` backward compatibility +- Invocation changed from `uv run scripts/grind.py` to `PYTHONPATH=scripts python3 -m grind` +- Raw `adb("input swipe ...")` calls replaced with `swipe()` helper + +## Session Stats + +- **Buy orders placed:** 532 orders across two runs (474 + 58) +- **Stubs spent:** ~63,655 +- **Gold packs exchanged:** 155 (94 + 61) +- **Gold packs opened:** 275 +- **OVR tiers worked:** 77 (primary), 78 (all above max price) diff --git a/gaming/troubleshooting.md b/gaming/troubleshooting.md index d1add29..86459df 100644 --- a/gaming/troubleshooting.md +++ b/gaming/troubleshooting.md @@ -214,6 +214,58 @@ For full HDR setup (vk-hdr-layer, KDE config, per-API env vars), see the **steam **Diagnostic tip**: Look for rapid retry patterns in Pi-hole logs (same domain queried every 1-3s from the Xbox IP) — this signals a blocked domain causing timeout loops. 
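The retry-pattern check from the diagnostic tip can be approximated with a short Python sketch. It assumes the query records have already been parsed into `(timestamp, client_ip, domain)` tuples (parsing the Pi-hole log itself is out of scope here). `find_retry_loops`, `max_gap`, and `min_hits` are hypothetical names and thresholds chosen for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Query = Tuple[float, str, str]  # (unix timestamp, client IP, domain)


def find_retry_loops(
    queries: Iterable[Query],
    client_ip: str,
    max_gap: float = 3.0,
    min_hits: int = 5,
) -> Dict[str, int]:
    """Map each suspicious domain to its longest run of rapid repeat queries.

    A run is a sequence of queries from `client_ip` for the same domain in
    which consecutive queries are at most `max_gap` seconds apart.
    """
    times: Dict[str, List[float]] = defaultdict(list)
    for ts, client, domain in queries:
        if client == client_ip:
            times[domain].append(ts)

    suspects: Dict[str, int] = {}
    for domain, stamps in times.items():
        stamps.sort()
        run = best = 1
        for prev, cur in zip(stamps, stamps[1:]):
            # Extend the run while the gap stays small, otherwise start over.
            run = run + 1 if cur - prev <= max_gap else 1
            best = max(best, run)
        if best >= min_hits:
            suspects[domain] = best
    return suspects


xbox = "10.0.0.50"  # placeholder address, not taken from this document
sample = [(t, xbox, "blocked.example") for t in (0, 2, 4, 6, 8, 10)]
sample += [(t, xbox, "normal.example") for t in (0, 60, 120)]
sample += [(t, "10.0.0.99", "other.example") for t in range(10)]
print(find_retry_loops(sample, xbox))
```

A domain that shows up here is a candidate for allow-listing, since the client keeps retrying it on a short timer instead of giving up.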
+## Gray Zone Warfare — EAC Failures on Proton (2026-03-31) [RESOLVED] + +**Severity:** High — game unplayable online +**Status:** RESOLVED — corrupted prebuild world cache file + +**Problem:** EAC errors when connecting to servers on Linux/Proton. Three error codes observed across attempts: +- `0x0002000A` — "The client failed an anti-cheat client runtime check" (the actual root cause) +- `0x0002000F` — "The client failed to register in time" (downstream timeout) +- `0x00020011` — "The client failed to start the session" (downstream session failure) + +Game launches fine, EAC bootstrapper reports success, but fails when joining a server at "Synchronizing Live Data". + +**Root Cause:** A corrupted/stale prebuild world cache file that EAC flagged during runtime checks: +``` +LogEOSAntiCheat: [AntiCheatClient] [PollStatusInternal] Client Violation with Type: 5 +Message: Unknown file version (GZW/Content/SKALLA/PrebuildWorldData/World/cache/0xb9af63cee2e43b6c_0x3cb3b3354fb31606.dat) +``` +EAC scanned this file, found an unrecognized version, and flagged a client violation. The other errors (`0x0002000F`, `0x00020011`) were downstream consequences — EAC couldn't complete session registration after the violation. + +Compounding factors that made diagnosis harder: +- Epic EOS scheduled maintenance (Fortnite v40.10, Apr 1 08:00-09:30 UTC) returned 503s from `api.epicgames.dev/auth/v1/oauth/token`, masking the real issue +- `steam_api64.dll` EOS SDK errors at startup are **benign noise** under Proton — red herring +- Nuking the compatdata prefix and upgrading Proton happened concurrently, adding confusion + +**Fix:** +1. Delete the specific cache file: `rm "GZW/Content/SKALLA/PrebuildWorldData/World/cache/0xb9af63cee2e43b6c_0x3cb3b3354fb31606.dat"` +2. Verify game files in Steam — Steam redownloads a fresh copy with different hash +3. 
Launch game — clean logs, no EAC errors + +Key detail: the file was the same size (60.7MB) before and after, but different md5 hash — Steam's verify replaced it with a corrected version. + +**Log locations:** +- EAC bootstrapper: `compatdata/2479810/pfx/drive_c/users/steamuser/AppData/Roaming/EasyAntiCheat/.../anticheatlauncher.log` +- Game log: `compatdata/2479810/pfx/drive_c/users/steamuser/AppData/Local/GZW/Saved/Logs/GZW.log` +- STL launch log: `~/.config/steamtinkerlaunch/logs/gamelaunch/id/2479810.log` + +**What did NOT fix it (for reference):** +1. Installing Proton EasyAntiCheat Runtime (AppID 1826330) — good to have but not the issue +2. Deleting the entire cache directory without re-verifying — Steam verify re-downloaded the same bad file the first time (20 files fixed); needed a second targeted delete + verify +3. Nuking compatdata prefix for clean rebuild +4. Switching Proton versions (GE-Proton9-25 ↔ GE-Proton10-25) + +**Lessons:** +- When EAC logs show "Unknown file version" for a specific `.dat` file, delete that file and verify — don't nuke the whole cache or prefix +- `steam_api64.dll` EOS errors are benign under Proton and not related to EAC failures +- Check Epic's status page for scheduled maintenance before deep-diving Proton issues +- Multiple verify-and-fix cycles may be needed — the first verify can redownload a stale cached version from Steam's CDN + +**Game version:** 0.4.0.0-231948-H (EA Pre-Alpha) +**Working Proton:** GE-Proton10-25 +**STL config:** `~/.config/steamtinkerlaunch/gamecfgs/id/2479810.conf` + ## Useful Commands ### Check Running Game Process diff --git a/major-domo/database-release-2026.4.1.md b/major-domo/database-release-2026.4.1.md new file mode 100644 index 0000000..1d2db66 --- /dev/null +++ b/major-domo/database-release-2026.4.1.md @@ -0,0 +1,34 @@ +--- +title: "Database API Release — 2026.4.1" +description: "Query limit caps to prevent worker timeouts, plus hotfix to exempt /players endpoint." 
+type: reference +domain: major-domo +tags: [release-notes, deployment, database, hotfix] +--- + +# Database API Release — 2026.4.1 + +**Date:** 2026-04-01 +**Tag:** `2026.3.7` + 3 post-tag commits (CI auto-generates CalVer on merge) +**Image:** `manticorum67/major-domo-database` +**Server:** akamai (`~/container-data/sba-database`) +**Deploy method:** `docker compose pull && docker compose down && docker compose up -d` + +## Release Summary + +Added bounded pagination (`MAX_LIMIT=500`, `DEFAULT_LIMIT=200`) to all list endpoints to prevent Gunicorn worker timeouts caused by unbounded queries. Two follow-up fixes corrected response `count` fields in fieldingstats that were computed after the limit was applied. A hotfix (PR #103) then removed the caps from the `/players` endpoint specifically, since the bot and website depend on fetching full player lists. + +## Changes + +### Bug Fixes +- **PR #99** — Fix unbounded API queries causing Gunicorn worker timeouts. Added `MAX_LIMIT=500` and `DEFAULT_LIMIT=200` constants in `dependencies.py`, enforced `le=MAX_LIMIT` on all list endpoints. Added middleware to strip empty query params preventing validation bypass. +- **PR #100** — Fix fieldingstats `get_fieldingstats` count: captured `total_count` before `.limit()` so the response reflects total rows, not page size. +- **PR #101** — Fix fieldingstats `get_totalstats`: removed line that overwrote `count` with `len(page)` after it was correctly set from `total_count`. + +### Hotfix +- **PR #103** — Remove output caps from `GET /api/v3/players`. Reverted `limit` param to `Optional[int] = Query(default=None, ge=1)` (no ceiling). The `/players` table is a bounded dataset (~1500 rows/season) and consumers depend on uncapped results. All other endpoints retain their caps. 
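The pagination rules from this release can be sketched in plain Python, without FastAPI. `MAX_LIMIT`, `DEFAULT_LIMIT`, and the `count` field come from the notes above. `resolve_limit`, `paginate`, the `capped` flag, and the `rows` key are hypothetical names used only for illustration. The real endpoints enforce the bounds through `Query(..., ge=1, le=MAX_LIMIT)` rather than a helper like this one.

```python
from typing import Optional, Sequence

MAX_LIMIT = 500      # ceiling introduced in PR #99
DEFAULT_LIMIT = 200  # default introduced in PR #99


def resolve_limit(requested: Optional[int], capped: bool = True) -> Optional[int]:
    """Return the effective row limit, or raise ValueError for a bad request.

    capped=False models the /players hotfix: no ceiling and no default,
    so None means "return everything".
    """
    if requested is None:
        return DEFAULT_LIMIT if capped else None
    if requested < 1:
        raise ValueError("limit must be >= 1")
    if capped and requested > MAX_LIMIT:
        raise ValueError(f"limit must be <= {MAX_LIMIT}")
    return requested


def paginate(rows: Sequence[dict], limit: Optional[int], offset: int = 0) -> dict:
    """Build a response whose count reflects all matching rows, not the page."""
    total_count = len(rows)  # captured before slicing, as in the PR #100 fix
    end = None if limit is None else offset + limit
    page = list(rows[offset:end])
    return {"count": total_count, "rows": page}


all_rows = [{"id": i} for i in range(1200)]
capped_page = paginate(all_rows, resolve_limit(None))
players_page = paginate(all_rows, resolve_limit(None, capped=False))
print(capped_page["count"], len(capped_page["rows"]), len(players_page["rows"]))
```

The bug fixed in PR #101 corresponds to overwriting `count` with `len(page)` after `total_count` was already set. Keeping `total_count` as the only source for `count` avoids that.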
+ +## Deployment Notes +- No migrations required +- No config changes +- Rollback: `docker compose pull manticorum67/major-domo-database: && docker compose down && docker compose up -d` diff --git a/major-domo/release-2026.3.31-2.md b/major-domo/release-2026.3.31-2.md new file mode 100644 index 0000000..54d38f6 --- /dev/null +++ b/major-domo/release-2026.3.31-2.md @@ -0,0 +1,38 @@ +--- +title: "Discord Bot Release — 2026.3.13" +description: "Enforce free agency lock deadline — block /dropadd FA pickups after week 14, plus performance batch from backlog issues." +type: reference +domain: major-domo +tags: [release-notes, deployment, discord, major-domo] +--- + +# Discord Bot Release — 2026.3.13 + +**Date:** 2026-03-31 +**Tag:** `2026.3.13` +**Image:** `manticorum67/major-domo-discordapp:2026.3.13` / `:production` +**Server:** akamai (`~/container-data/major-domo`) +**Deploy method:** `.scripts/deploy.sh -y` (docker compose pull + up) + +## Release Summary + +Enforces the previously unused `fa_lock_week` config (week 14) in the transaction builder. After the deadline, `/dropadd` blocks adding players FROM Free Agency while still allowing drops TO FA. Also includes a batch of performance PRs from the backlog that were merged between 2026.3.12 and this tag. + +## Changes + +### New Features +- **Free agency lock enforcement** — `TransactionBuilder.add_move()` now checks `current_week >= fa_lock_week` and rejects FA pickups after the deadline. Dropping to FA remains allowed. Config already existed at `fa_lock_week = 14` but was never enforced. 
(PR #122) + +### Performance +- Eliminate redundant API calls in trade views (PR #116, issue #94) +- Eliminate redundant GET after create/update and parallelize stats (PR #112, issue #95) +- Parallelize N+1 player/creator lookups with `asyncio.gather()` (PR #118, issue #89) +- Consolidate duplicate `league_service.get_current_state()` calls in `add_move()` into a single shared fetch (PR #122) + +### Bug Fixes +- Fix race condition: use per-user dict for `_checked_teams` in trade views (PR #116) + +## Deployment Notes +- No migrations required +- No config changes needed — `fa_lock_week = 14` already existed in config +- Rollback: `ssh akamai "cd ~/container-data/major-domo && docker pull manticorum67/major-domo-discordapp@sha256:94d59135f127d5863b142136aeeec9d63b06ee63e214ef59f803cedbd92b473e && docker tag manticorum67/major-domo-discordapp@sha256:94d59135f127d5863b142136aeeec9d63b06ee63e214ef59f803cedbd92b473e manticorum67/major-domo-discordapp:production && docker compose up -d discord-app"` diff --git a/major-domo/release-2026.3.31.md b/major-domo/release-2026.3.31.md new file mode 100644 index 0000000..4238b6e --- /dev/null +++ b/major-domo/release-2026.3.31.md @@ -0,0 +1,86 @@ +--- +title: "Discord Bot Release — 2026.3.12" +description: "Major catch-up release: trade deadline enforcement, performance parallelization, security fixes, CI/CD migration to CalVer, and 148 commits of accumulated improvements." +type: reference +domain: major-domo +tags: [release-notes, deployment, discord, major-domo] +--- + +# Discord Bot Release — 2026.3.12 + +**Date:** 2026-03-31 +**Tag:** `2026.3.12` +**Image:** `manticorum67/major-domo-discordapp:2026.3.12` / `:production` +**Server:** akamai (`~/container-data/major-domo`) +**Deploy method:** `.scripts/deploy.sh -y` (docker compose pull + up) +**Previous tag:** `v2.29.4` (148 commits behind) + +## Release Summary + +Large catch-up release covering months of accumulated work since the last tag. 
The headline feature is trade deadline enforcement — `/trade` commands are now blocked after the configured deadline week, with fail-closed behavior when API data is unavailable. Also includes significant performance improvements (parallelized API calls, cached signatures, Redis SCAN), security hardening, dependency pinning, and a full CI/CD migration from version-file bumps to CalVer tag-triggered builds. + +## Changes + +### New Features +- **Trade deadline enforcement** — `is_past_trade_deadline` property on Current model; guards on `/trade initiate`, submit button, and `_finalize_trade`. Fail-closed when API returns no data. 4 new tests. (PR #121) +- `is_admin()` helper in `utils/permissions.py` (#55) +- Team ownership verification on `/injury set-new` and `/injury clear` (#18) +- Current week number included in weekly-info channel posts +- Local deploy script for production deploys + +### Performance +- Parallelize independent API calls with `asyncio.gather()` (#90) +- Cache `inspect.signature()` at decoration time (#97) +- Replace `json.dumps` serialization test with `isinstance` fast path (#96) +- Use `channel.purge()` instead of per-message delete loops (#93) +- Parallelize schedule_service week fetches (#88) +- Replace Redis `KEYS` with `SCAN` in `clear_prefix` (#98) +- Reuse persistent `aiohttp.ClientSession` in GiphyService (#26) +- Cache user team lookup in player_autocomplete, reduce limit to 25 + +### Bug Fixes +- Fix chart_service path from `data/` to `storage/` +- Make ScorecardTracker methods async to match await callers +- Prevent partial DB writes and show detailed errors on scorecard submission failure +- Add trailing slashes to API URLs to prevent 307 redirects dropping POST bodies +- Trade validation: check against next week's projected roster, include pending trades and org affiliate transactions +- Prefix trade validation errors with team abbreviation +- Auto-detect player roster type in trade commands instead of assuming ML +- Fix key plays 
score text ("tied at X" instead of "Team up X-X") (#48) +- Fix scorebug stale data, win probability parsing, and read-failure tolerance (#39, #40) +- Batch quick-wins: 4 issues resolved (#37, #27, #25, #38) +- Fix ContextualLogger crash when callers pass `exc_info=True` +- Fix thaw report posting to use channel ID instead of channel names +- Use explicit America/Chicago timezone for freeze/thaw scheduling +- Replace broken `@self.tree.interaction_check` with MaintenanceAwareTree subclass +- Implement actual maintenance mode flag in `/admin-maintenance` (#28) +- Validate and sanitize pitching decision data from Google Sheets +- Fix `/player` autocomplete timeout by using current season only +- Split read-only data volume to allow state file writes (#85) +- Update roster labels to use Minor League and Injured List (#59) + +### Security +- Address 7 security issues across the codebase +- Remove 226 unused imports (#33) +- Pin all Python dependency versions in `requirements.txt` (#76) + +### Refactoring & Cleanup +- Extract duplicate command hash logic into `_compute_command_hash` (#31) +- Move 42 unnecessary lazy imports to top-level +- Remove dead maintenance mode artifacts in bot.py (#104) +- Remove unused `weeks_ahead` parameter from `get_upcoming_games` +- Invalidate roster cache after submission instead of force-refreshing + +## Infrastructure Changes +- **CI/CD migration**: Switched from version-file bumps to CalVer tag-triggered Docker builds +- Added `.scripts/release.sh` for creating CalVer tags +- Updated `.scripts/deploy.sh` for tag-triggered releases +- Docker build cache switched from `type=gha` to `type=registry` +- Used `docker-tags` composite action for multi-channel release support +- Fixed act_runner auth with short-form local actions + full GitHub URLs +- Use Gitea API for tag creation to avoid branch protection failures + +## Deployment Notes +- No migrations required +- No config changes needed +- Rollback: `ssh akamai "cd 
~/container-data/major-domo && docker pull manticorum67/major-domo-discordapp@ && docker tag manticorum67/major-domo-discordapp:production && docker compose up -d discord-app"` diff --git a/major-domo/troubleshooting-gunicorn-worker-timeouts.md b/major-domo/troubleshooting-gunicorn-worker-timeouts.md new file mode 100644 index 0000000..b474b6a --- /dev/null +++ b/major-domo/troubleshooting-gunicorn-worker-timeouts.md @@ -0,0 +1,59 @@ +--- +title: "Fix: Gunicorn Worker Timeouts from Unbounded API Queries" +description: "External clients sent limit=99999 and empty filter params through the reverse proxy, causing API workers to timeout and get killed." +type: troubleshooting +domain: major-domo +tags: [troubleshooting, major-domo, database, deployment, docker] +--- + +# Fix: Gunicorn Worker Timeouts from Unbounded API Queries + +**Date:** 2026-04-01 +**PR:** cal/major-domo-database#99 +**Issues:** #98 (main), #100 (fieldingstats count bug), #101 (totalstats count overwrite, pre-existing) +**Severity:** Critical — active production instability during Season 12, 12 worker timeouts in 2 days and accelerating + +## Problem + +The monitoring app kept flagging the SBA API container (`sba_db_api`) as unhealthy and restarting it. Container logs showed repeated `CRITICAL WORKER TIMEOUT` and `WARNING Worker was sent SIGABRT` messages from Gunicorn. The container itself wasn't restarting (0 Docker restarts, up 2 weeks), but individual workers were being killed and respawned, causing brief API unavailability windows. + +## Root Cause + +External clients (via nginx-proxy-manager at `172.25.0.3`) were sending requests with `limit=99999` and empty filter parameters (e.g., `?game_id=&pitcher_id=`). The API had no defenses: + +- **No max limit cap** on any endpoint except `/players/search` (which had `le=50`). Clients could request 99,999 rows. 
+- **Empty string params passed validation** — FastAPI parsed `game_id=` as `['']`, which passed `if param is not None` checks but generated wasteful full-table-scan queries. +- **`/transactions` had no limit parameter at all** — always returned every matching row with recursive serialization (`model_to_dict(recurse=True)`). +- **Recursive serialization amplified cost** — each row triggered additional DB lookups for FK relations (player, team, etc.). + +Combined, these caused queries to exceed the 120-second Gunicorn timeout, killing the worker. + +### IP Attribution Gotcha + +Initial assumption was the Discord bot was the source (IP `172.25.0.3` was assumed to be the bot container). Docker IP mapping revealed `172.25.0.3` was actually **nginx-proxy-manager** — the queries came from external clients through the reverse proxy. The Discord bot is at `172.18.0.2` on a completely separate Docker network and generates none of these queries. + +```bash +# Command to map container IPs +docker inspect --format='{{.Name}} {{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' $(docker ps -q) +``` + +## Fix + +PR #99 merged into main with the following changes (27 files, 503 insertions): + +1. **`MAX_LIMIT=500` and `DEFAULT_LIMIT=200` constants** in `app/dependencies.py`, enforced with `le=MAX_LIMIT` across all list endpoints +2. **`strip_empty_query_params` middleware** in `app/main.py` — strips empty string values from query params before FastAPI parses them, so `?game_id=` is treated as absent +3. **`limit`/`offset` added to `/transactions`** — previously returned all rows; now defaults to 200, max 500, with `total_count` computed before pagination +4. **11 existing limit params capped** with `le=MAX_LIMIT` +5. **13 endpoints with no limit** received `limit`/`offset` params +6. **Manual `if limit < 1` guards removed** — now handled by FastAPI's `ge=1` validation +7. 
**5 unit tests** covering limit validation (422 on exceeding max, zero, negative), transaction response shape, and empty string stripping +8. **fieldingstats count bug fixed** — `.count()` was being called after `.limit()`, capping the reported count at the page size instead of total matching rows (#100) + +## Lessons + +- **Always verify container IP attribution** before investigating the wrong service. `docker inspect` with format string is the canonical way to map IPs to container names. Don't assume based on Docker network proximity. +- **APIs should never trust client-provided limits** — enforce `le=MAX_LIMIT` on every list endpoint. The only safe endpoint was `/players/search` which had been properly capped at `le=50`. +- **Empty string params are a silent danger** — FastAPI parses `?param=` as `['']`, not `None`. A global middleware is the right fix since it protects all endpoints including future ones. +- **Recursive serialization (`model_to_dict(recurse=True)`) is O(n * related_objects)** — dangerous on unbounded queries. Consider forcing `short_output=True` for large result sets. +- **Heavy reformatting mixed with functional changes obscures bugs** — the fieldingstats count bug was missed in review because the file had 262 lines of diff from quote/formatting changes. Separate cosmetic and functional changes into different commits. diff --git a/media-servers/troubleshooting.md b/media-servers/troubleshooting.md index bb89425..dedfc94 100644 --- a/media-servers/troubleshooting.md +++ b/media-servers/troubleshooting.md @@ -548,6 +548,20 @@ tar -czf ~/jellyfin-config-backup-$(date +%Y%m%d).tar.gz ~/docker/jellyfin/confi - Test on non-production instance if possible - Document current working configuration +## PGS Subtitle Default Flags Causing Roku Playback Hang (2026-04-01) + +**Severity:** Medium — affects all Roku/Apple TV clients attempting to play remuxes with PGS subtitles + +**Problem:** Playback on Roku hangs at "Loading" and stops at 0 ms. 
Jellyfin logs show ffmpeg extracting all subtitle streams (including PGS) from the full-length movie before playback can begin. User Staci reported Jurassic Park (1993) taking forever to start on the living room Roku. + +**Root Cause:** PGS (hdmv_pgs_subtitle) tracks flagged as `default` in MKV files cause the Roku client to auto-select them. Roku can't decode PGS natively, so Jellyfin must burn them in — triggering a full subtitle extraction pass and video transcode before any data reaches the client. 178 out of ~400 movies in the library had this flag set, mostly remuxes that predate the Tdarr `clrSubDef` flow plugin. + +**Fix:** +1. **Batch fix (existing library):** Wrote `fix-pgs-defaults.sh` — scans all MKVs with `mkvmerge -J`, finds PGS tracks with `default_track: true`, clears via `mkvpropedit --edit track:N --set flag-default=0`. Key gotcha: mkvpropedit uses 1-indexed track numbers (`track_id + 1`), NOT `track:=ID` (which matches by UID). Script is on manticore at `/tmp/fix-pgs-defaults.sh`. Fixed 178 files, no re-encoding needed. +2. **Going forward (Tdarr):** The flow already has a "Clear Subtitle Default Flags" custom function plugin (`clrSubDef`) that clears default disposition on non-forced subtitle tracks during transcoding. New files processed by Tdarr are handled automatically. + +**Lesson:** Remux files from automated downloaders almost always have PGS defaults set. Any bulk import of remuxes should be followed by a PGS default flag sweep. The CIFS media mount on manticore is read-only inside the Jellyfin container — mkvpropedit must run from the host against `/mnt/truenas/media/Movies`. 
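
The track-selection logic of the batch fix can be sketched in Python. This is not the original `fix-pgs-defaults.sh`. It is a minimal illustration that assumes the `mkvmerge -J` output has a `tracks` list whose entries carry `id`, `type`, and a `properties` object with `codec_id` and `default_track`, and that PGS tracks report the codec ID `S_HDMV/PGS`. `identify_file` and `pgs_default_edits` are hypothetical helper names.

```python
import json
import subprocess
from typing import List


def identify_file(path: str) -> dict:
    """Run `mkvmerge -J` on one file and return the parsed JSON (needs MKVToolNix)."""
    result = subprocess.run(
        ["mkvmerge", "-J", path], capture_output=True, text=True, check=True
    )
    return json.loads(result.stdout)


def pgs_default_edits(identify: dict) -> List[str]:
    """Return mkvpropedit arguments that clear the default flag on PGS tracks."""
    args: List[str] = []
    for track in identify.get("tracks", []):
        props = track.get("properties", {})
        is_pgs = (
            track.get("type") == "subtitles"
            and props.get("codec_id") == "S_HDMV/PGS"
        )
        if is_pgs and props.get("default_track"):
            # mkvpropedit's track:N selector is 1-indexed; mkvmerge ids start at 0.
            args += ["--edit", f"track:{track['id'] + 1}", "--set", "flag-default=0"]
    return args


sample = {
    "tracks": [
        {"id": 0, "type": "video", "properties": {"codec_id": "V_MPEG4/ISO/AVC", "default_track": True}},
        {"id": 1, "type": "audio", "properties": {"codec_id": "A_DTS", "default_track": True}},
        {"id": 2, "type": "subtitles", "properties": {"codec_id": "S_HDMV/PGS", "default_track": True}},
        {"id": 3, "type": "subtitles", "properties": {"codec_id": "S_TEXT/UTF8", "default_track": True}},
        {"id": 4, "type": "subtitles", "properties": {"codec_id": "S_HDMV/PGS", "default_track": False}},
    ]
}
print(["mkvpropedit", "movie.mkv", *pgs_default_edits(sample)])
```

Only track `id` 2 qualifies in the sample, so the command targets `track:3`. Text subtitles and PGS tracks that are not flagged as default are left untouched, and a file with nothing to fix produces an empty argument list that the caller can skip.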
+ ## Related Documentation - **Setup Guide**: `/media-servers/jellyfin-ubuntu-manticore.md` - **NVIDIA Driver Management**: See jellyfin-ubuntu-manticore.md diff --git a/mlb-the-show/release-2026.3.28.md b/mlb-the-show/release-2026.3.28.md new file mode 100644 index 0000000..206b410 --- /dev/null +++ b/mlb-the-show/release-2026.3.28.md @@ -0,0 +1,37 @@ +--- +title: "MLB The Show Market Tracker — 0.1.0" +description: "Initial release of the CLI market scanner with flip scanning and exchange program support." +type: reference +domain: gaming +tags: [release-notes, deployment, mlb-the-show, rust] +--- + +# MLB The Show Market Tracker — 0.1.0 + +**Date:** 2026-03-28 +**Version:** `0.1.0` +**Repo:** `cal/mlb-the-show-market-tracker` on Gitea +**Deploy method:** Local CLI tool — `cargo build --release` on workstation + +## Release Summary + +Initial release of `showflip`, a Rust CLI tool for scanning the MLB The Show 26 Community Market. Supports finding profitable card flips and identifying silver cards at target buy-order prices for the gold pack exchange program. + +## Changes + +### New Features + +- **`scan` command** — Concurrent market scanner that finds profitable flip opportunities. Supports filters for rarity, team, position, budget, and sorting by profit/margin. Includes watch mode for repeated scans and optional Discord webhook alerts. +- **`exchange` command** — Scans for silver cards (OVR 77-79) priced within configurable buy-order gates for the gold pack exchange program. Tiers: 79 OVR (target 170/max 175), 78 OVR (target 140/max 145), 77 OVR (target 117/max 122). Groups results by OVR with color-coded target/OK status. +- **`detail` command** — Shows price history and recent sales for a specific card by name or UUID. +- **`meta` command** — Lists available series, brands, and sets for use as filter values. 
+- OVR-based price floor calculation for live and non-live series cards +- 10% Community Market tax built into all profit calculations +- Handles API price format inconsistencies (integers vs comma-formatted strings) +- HTTP client with 429 retry handling + +## Deployment Notes + +- No server deployment — runs locally via `cargo run -- ` +- API is public at `https://mlb26.theshow.com/apis/` — no auth required +- No tests or CI configured yet diff --git a/mlb-the-show/release-2026.3.31.md b/mlb-the-show/release-2026.3.31.md new file mode 100644 index 0000000..6b604cd --- /dev/null +++ b/mlb-the-show/release-2026.3.31.md @@ -0,0 +1,45 @@ +--- +title: "MLB The Show Companion Automation — 2026.3.31" +description: "Fix gold exchange navigation, add grind harness for automated buy→exchange loops, CLI cleanup." +type: reference +domain: gaming +tags: [release-notes, deployment, mlb-the-show, python, automation] +--- + +# MLB The Show Companion Automation — 2026.3.31 + +**Date:** 2026-03-31 +**Repo:** `cal/mlb-the-show-market-tracker` on Gitea +**Branch:** `main` (merge commit `ea66e2c`) +**Deploy method:** Local script — `uv run scripts/grind.py` + +## Release Summary + +Major fixes to the companion app automation (`grind.py`). The gold exchange navigation was broken — the script thought it had entered the card grid when it was still on the exchange selection list. Added a new `grind` command that orchestrates the full buy→exchange loop with multi-tier OVR rotation. 
+ +## Changes + +### Bug Fixes +- Fixed `_is_on_exchange_grid()` to require `Exchange Value` card labels, distinguishing the card grid from the Exchange Players list page (`d4c038b`) +- Added retry loop (3 attempts, 2s apart) in `ensure_on_exchange_grid()` for variable load times +- Added `time.sleep(2)` after tapping into the Gold Exchange grid +- Removed low-OVR bail logic — the grid is sorted ascending, so bail fired on first screen before scrolling to profitable cards +- Fixed buy-orders market scroll — retry loop attempts up to 10 scrolls before giving up (was 1) (`6912a7e`). Note: scroll method itself was still broken (KEYCODE_PAGE_DOWN); fixed in 2026.4.01 release. +- Restored `_has_low_ovr_cards` fix lost during PR #2 merge (`c29af78`) + +### New Features +- **`grind` command** — automated buy→exchange loop with OVR tier rotation (`6912a7e`) + - Rotates through OVR tiers in descending order (default: 79, 78, 77) + - Buys 2 tiers per round, then exchanges all available dupes + - Flags: `--ovrs`, `--rounds`, `--max-players`, `--max-price`, `--budget`, `--max-packs` + - Per-round and cumulative summary output + - Clean Ctrl+C handling with final totals + +### CLI Changes +- Renamed `grind` → `exchange` (bulk exchange command) +- Removed redundant single-exchange command (use `exchange 1` instead) +- `grind` now refers to the full buy→exchange orchestration loop + +## Known Issues +- Default price gates (`MAX_BUY_PRICES`) may be too low during market inflation periods. Current gates: 79→170, 78→140, 77→125. Use `--max-price` to override. 
+- No order fulfillment polling — the grind loop relies on natural timing (2 buy rounds ≈ 2-5 min gives orders time to fill) diff --git a/mlb-the-show/release-2026.4.01.md b/mlb-the-show/release-2026.4.01.md new file mode 100644 index 0000000..3bae3b5 --- /dev/null +++ b/mlb-the-show/release-2026.4.01.md @@ -0,0 +1,26 @@ +--- +title: "MLB The Show Companion Automation — 2026.4.01" +description: "Fix buy-orders scroll to use touch swipes, optimize exchange card selection." +type: reference +domain: gaming +tags: [release-notes, deployment, mlb-the-show, python, automation] +--- + +# MLB The Show Companion Automation — 2026.4.01 + +**Date:** 2026-04-01 +**Repo:** `cal/mlb-the-show-market-tracker` on Gitea +**Branch:** `main` (latest `f15e98a`) +**Deploy method:** Local script — `uv run scripts/grind.py` + +## Release Summary + +Two fixes to the companion app automation. The buy-orders command couldn't scroll through the market list because it used keyboard events instead of touch swipes. The exchange command now stops selecting cards once it has enough points for a pack. + +## Changes + +### Bug Fixes +- **Fixed buy-orders market scrolling** — replaced `KEYCODE_PAGE_DOWN` (keyboard event ignored by WebView) with `scroll_load_jiggle()` which uses touch swipes + a reverse micro-swipe to trigger lazy loading. This matches the working exchange scroll strategy. (`49fe7b6`) + +### Optimizations +- **Early break in exchange card selection** — the selection loop now stops as soon as accumulated points meet the exchange threshold, avoiding unnecessary taps on additional card types the app won't consume. 
(`f15e98a`) diff --git a/monitoring/scripts/homelab-audit.sh b/monitoring/scripts/homelab-audit.sh new file mode 100755 index 0000000..f674383 --- /dev/null +++ b/monitoring/scripts/homelab-audit.sh @@ -0,0 +1,483 @@ +#!/usr/bin/env bash +# homelab-audit.sh — Comprehensive homelab infrastructure audit +# Collects resource allocation, utilization, stuck processes, and inefficiencies +# across all Proxmox VMs/CTs and physical hosts. +# +# Usage: ./homelab-audit.sh [--json] [--output FILE] +# Requires: SSH access to all hosts (via ~/.ssh/config aliases) + +# -e omitted intentionally — unreachable hosts should not abort the full audit +set -uo pipefail + +# ── Configuration ───────────────────────────────────────────────────────────── +PROXMOX_HOST="proxmox" +PHYSICAL_HOSTS=("manticore") +SSH_TIMEOUT=10 +OUTPUT_FORMAT="text" +OUTPUT_FILE="" +REPORT_DIR="/tmp/homelab-audit-$(date +%Y%m%d-%H%M%S)" + +# Thresholds for flagging issues +LOAD_PER_CORE_WARN=0.7 +LOAD_PER_CORE_CRIT=1.0 +MEM_USED_PCT_WARN=80 +MEM_USED_PCT_CRIT=95 +DISK_USED_PCT_WARN=80 +DISK_USED_PCT_CRIT=90 +SWAP_USED_MB_WARN=500 +UPTIME_DAYS_WARN=30 +ZOMBIE_WARN=1 +STUCK_PROC_CPU_WARN=10 # % CPU for a single process running >24h + +# ── Argument parsing ───────────────────────────────────────────────────────── +while [[ $# -gt 0 ]]; do + case $1 in + --json) + OUTPUT_FORMAT="json" + shift + ;; + --output) + OUTPUT_FILE="$2" + shift 2 + ;; + *) + echo "Unknown option: $1" + exit 1 + ;; + esac +done + +mkdir -p "$REPORT_DIR" + +# ── Helpers ────────────────────────────────────────────────────────────────── + +ssh_cmd() { + local host="$1" + shift + ssh -n -o ConnectTimeout=$SSH_TIMEOUT -o BatchMode=yes -o StrictHostKeyChecking=accept-new "$host" "$@" 2>>"$REPORT_DIR/ssh-failures.log" +} + +# ssh_stdin — like ssh_cmd but allows stdin (for heredocs/pipe input) +ssh_stdin() { + local host="$1" + shift + ssh -o ConnectTimeout=$SSH_TIMEOUT -o BatchMode=yes -o StrictHostKeyChecking=accept-new "$host" "$@" 
2>>"$REPORT_DIR/ssh-failures.log" +} + +log_section() { echo -e "\n━━━ $1 ━━━"; } +log_subsection() { echo -e "\n ── $1 ──"; } +log_ok() { echo " ✓ $1"; } +log_warn() { echo " ⚠ $1"; } +log_crit() { echo " ✖ $1"; } +log_info() { echo " $1"; } + +# ── Per-host collection script (runs remotely) ────────────────────────────── +# This heredoc is sent to each host via SSH. It outputs structured key=value data. +COLLECTOR_SCRIPT=' +#!/bin/bash +echo "AUDIT_START" +echo "hostname=$(hostname)" +echo "uptime_seconds=$(cat /proc/uptime | cut -d" " -f1 | cut -d"." -f1)" +echo "uptime_days=$(( $(cat /proc/uptime | cut -d" " -f1 | cut -d"." -f1) / 86400 ))" + +# CPU +echo "cpu_cores=$(nproc 2>/dev/null || grep -c ^processor /proc/cpuinfo)" +read load1 load5 load15 rest < /proc/loadavg +echo "load_1m=$load1" +echo "load_5m=$load5" +echo "load_15m=$load15" + +# Memory (in MB) +mem_total=$(awk "/MemTotal/ {printf \"%.0f\", \$2/1024}" /proc/meminfo) +mem_avail=$(awk "/MemAvailable/ {printf \"%.0f\", \$2/1024}" /proc/meminfo) +mem_used=$((mem_total - mem_avail)) +swap_total=$(awk "/SwapTotal/ {printf \"%.0f\", \$2/1024}" /proc/meminfo) +swap_used=$(awk "/SwapFree/ {printf \"%.0f\", ($swap_total*1024 - \$2)/1024}" /proc/meminfo 2>/dev/null || echo 0) +echo "mem_total_mb=$mem_total" +echo "mem_used_mb=$mem_used" +echo "mem_avail_mb=$mem_avail" +echo "mem_used_pct=$((mem_used * 100 / (mem_total > 0 ? 
mem_total : 1)))" +echo "swap_total_mb=$swap_total" +echo "swap_used_mb=$swap_used" + +# Disk usage (non-tmpfs, non-overlay mounts) +echo "DISK_START" +df -h --output=target,size,used,avail,pcent -x tmpfs -x devtmpfs -x overlay -x squashfs 2>/dev/null | tail -n +2 | while read mount size used avail pct; do + echo "disk|$mount|$size|$used|$avail|$pct" +done +echo "DISK_END" + +# Top 15 processes by CPU usage (no minimum-CPU filter is applied) +echo "PROCS_START" +ps aux --sort=-%cpu --no-headers 2>/dev/null | head -15 | while read user pid cpu mem vsz rss tty stat start time cmd; do + echo "proc|$user|$pid|$cpu|$mem|$start|$time|$cmd" +done +echo "PROCS_END" + +# Zombie processes +zombies=$(ps aux 2>/dev/null | awk "\$8 ~ /Z/" | wc -l) +echo "zombie_count=$zombies" + +# Failed systemd units +echo "FAILED_UNITS_START" +systemctl --failed --no-legend --no-pager 2>/dev/null | while read unit load active sub desc; do + echo "failed_unit|$unit" +done +echo "FAILED_UNITS_END" + +# Docker containers (if docker is available) +if command -v docker &>/dev/null; then + echo "DOCKER_START" + docker stats --no-stream --format "docker|{{.Name}}|{{.CPUPerc}}|{{.MemUsage}}|{{.NetIO}}|{{.PIDs}}" 2>/dev/null || true + echo "DOCKER_END" + echo "DOCKER_CONTAINERS_START" + docker ps -a --format "container|{{.Names}}|{{.Status}}|{{.Image}}|{{.Ports}}" 2>/dev/null || true + echo "DOCKER_CONTAINERS_END" +fi + +# Listening ports (services inventory) +echo "LISTENERS_START" +ss -tlnp 2>/dev/null | tail -n +2 | awk "{print \"listen|\" \$4 \"|\" \$6}" | head -30 +echo "LISTENERS_END" + +# Long-running high-CPU processes (potential stuck processes) +echo "STUCK_PROCS_START" +now=$(date +%s) +ps -eo pid,etimes,%cpu,comm --sort=-%cpu --no-headers 2>/dev/null | while read pid etime cpu comm; do + # etimes = elapsed time in seconds; flag if >24h and >STUCK threshold + cpu_int=${cpu%.*} + if [[ "$etime" -gt 86400 && "${cpu_int:-0}" -gt ${STUCK_PROC_CPU_WARN:-10} ]]; then + days=$((etime / 86400)) + echo "stuck|$pid|$cpu|${days}d|$comm" 
+ fi +done +echo "STUCK_PROCS_END" + +echo "AUDIT_END" +' + +# ── Proxmox-specific collection ───────────────────────────────────────────── +collect_proxmox_inventory() { + log_section "PROXMOX HOST INVENTORY ($PROXMOX_HOST)" + + # Get host-level info + local pve_data + pve_data=$(ssh_stdin "$PROXMOX_HOST" "STUCK_PROC_CPU_WARN=$STUCK_PROC_CPU_WARN bash -s" <<<"$COLLECTOR_SCRIPT") + echo "$pve_data" >"$REPORT_DIR/proxmox-host.txt" + parse_and_report "proxmox" "$pve_data" + + # VM inventory with resource allocations + log_subsection "Virtual Machines" + printf " %-6s %-25s %-8s %6s %8s %10s\n" "VMID" "NAME" "STATUS" "vCPUs" "RAM(GB)" "DISK(GB)" + printf " %-6s %-25s %-8s %6s %8s %10s\n" "------" "-------------------------" "--------" "------" "--------" "----------" + + local total_vcpus=0 total_vm_ram=0 + + while IFS= read -r line; do + local vmid=$(echo "$line" | awk '{print $1}') + local name=$(echo "$line" | awk '{print $2}') + local status=$(echo "$line" | awk '{print $3}') + local mem_mb=$(echo "$line" | awk '{print $4}') + local disk_gb=$(echo "$line" | awk '{print $5}') + + # Get vCPU count from config + local vcpus + vcpus=$(ssh_cmd "$PROXMOX_HOST" "qm config $vmid 2>/dev/null | grep -E '^(cores|sockets)'" | + awk -F: '/cores/{c=$2} /sockets/{s=$2} END{printf "%.0f", (c+0)*(s>0?s:1)}') + [[ -z "$vcpus" ]] && vcpus=0 + + local mem_gb=$(echo "scale=1; $mem_mb / 1024" | bc 2>/dev/null || echo "?") + + if [[ "$status" == "running" ]]; then + total_vcpus=$((total_vcpus + vcpus)) + total_vm_ram=$((total_vm_ram + mem_mb)) + fi + + local flag="" + [[ "$status" == "stopped" ]] && flag=" (wasting disk)" + [[ "$status" == "running" && "$vcpus" -gt 8 ]] && flag=" (heavy)" + + printf " %-6s %-25s %-8s %6s %8s %10s%s\n" "$vmid" "$name" "$status" "$vcpus" "$mem_gb" "$disk_gb" "$flag" + done < <(ssh_cmd "$PROXMOX_HOST" "qm list 2>/dev/null" | tail -n +2 | awk 'NF{printf "%s %s %s %s %s\n", $1, $2, $3, $4, $5}') + + # CT inventory + log_subsection "LXC Containers" + printf " 
%-6s %-30s %-8s %6s %8s\n" "CTID" "NAME" "STATUS" "vCPUs" "RAM(MB)" + printf " %-6s %-30s %-8s %6s %8s\n" "------" "------------------------------" "--------" "------" "--------" + + while IFS= read -r line; do + local ctid=$(echo "$line" | awk '{print $1}') + local status=$(echo "$line" | awk '{print $2}') + local name=$(echo "$line" | awk '{print $NF}') + + # Get CT resource config + local ct_cores ct_mem + ct_cores=$(ssh_cmd "$PROXMOX_HOST" "pct config $ctid 2>/dev/null | grep ^cores" | awk -F: '{print $2}' | tr -d ' ') + ct_mem=$(ssh_cmd "$PROXMOX_HOST" "pct config $ctid 2>/dev/null | grep ^memory" | awk -F: '{print $2}' | tr -d ' ') + [[ -z "$ct_cores" ]] && ct_cores="(host)" + [[ -z "$ct_mem" ]] && ct_mem="?" + + if [[ "$status" == "running" && "$ct_cores" =~ ^[0-9]+$ ]]; then + total_vcpus=$((total_vcpus + ct_cores)) + fi + + printf " %-6s %-30s %-8s %6s %8s\n" "$ctid" "$name" "$status" "$ct_cores" "$ct_mem" + done < <(ssh_cmd "$PROXMOX_HOST" "pct list 2>/dev/null" | tail -n +2) + + # Physical cores + local phys_cores + phys_cores=$(ssh_cmd "$PROXMOX_HOST" "nproc") + + log_subsection "Resource Summary" + log_info "Physical cores: $phys_cores" + log_info "Total allocated vCPUs: $total_vcpus" + local ratio=$(echo "scale=2; $total_vcpus / $phys_cores" | bc 2>/dev/null || echo "?") + log_info "Overcommit ratio: ${ratio}:1" + local total_vm_ram_gb=$(echo "scale=1; $total_vm_ram / 1024" | bc 2>/dev/null || echo "?") + local phys_ram + phys_ram=$(ssh_cmd "$PROXMOX_HOST" "awk '/MemTotal/{printf \"%.1f\", \$2/1024/1024}' /proc/meminfo") + log_info "Physical RAM: ${phys_ram} GB" + log_info "Total allocated VM RAM: ${total_vm_ram_gb} GB" + + if (($(echo "$ratio > 1.5" | bc -l 2>/dev/null || echo 0))); then + log_warn "vCPU overcommit ratio ${ratio}:1 is high — may cause contention" + fi +} + +# ── Parse collector output and report findings ─────────────────────────────── +parse_and_report() { + local label="$1" + local data="$2" + + # Extract key values + local 
hostname=$(echo "$data" | grep "^hostname=" | cut -d= -f2) + local uptime_days=$(echo "$data" | grep "^uptime_days=" | cut -d= -f2) + local cpu_cores=$(echo "$data" | grep "^cpu_cores=" | cut -d= -f2) + local load_1m=$(echo "$data" | grep "^load_1m=" | cut -d= -f2) + local load_5m=$(echo "$data" | grep "^load_5m=" | cut -d= -f2) + local mem_total=$(echo "$data" | grep "^mem_total_mb=" | cut -d= -f2) + local mem_used=$(echo "$data" | grep "^mem_used_mb=" | cut -d= -f2) + local mem_used_pct=$(echo "$data" | grep "^mem_used_pct=" | cut -d= -f2) + local swap_used=$(echo "$data" | grep "^swap_used_mb=" | cut -d= -f2) + local zombie_count=$(echo "$data" | grep "^zombie_count=" | cut -d= -f2) + + log_subsection "System: $hostname" + + # Uptime + if [[ "${uptime_days:-0}" -gt "$UPTIME_DAYS_WARN" ]]; then + log_warn "Uptime: ${uptime_days} days — consider scheduling maintenance reboot" + else + log_ok "Uptime: ${uptime_days} days" + fi + + # Load per core + local load_per_core + load_per_core=$(echo "scale=2; ${load_5m:-0} / ${cpu_cores:-1}" | bc 2>/dev/null || echo "0") + if (($(echo "$load_per_core > $LOAD_PER_CORE_CRIT" | bc -l 2>/dev/null || echo 0))); then + log_crit "Load: ${load_1m}/${load_5m} on ${cpu_cores} cores (${load_per_core}/core) — OVERLOADED" + elif (($(echo "$load_per_core > $LOAD_PER_CORE_WARN" | bc -l 2>/dev/null || echo 0))); then + log_warn "Load: ${load_1m}/${load_5m} on ${cpu_cores} cores (${load_per_core}/core) — elevated" + else + log_ok "Load: ${load_1m}/${load_5m} on ${cpu_cores} cores (${load_per_core}/core)" + fi + + # Memory + if [[ "${mem_used_pct:-0}" -gt "$MEM_USED_PCT_CRIT" ]]; then + log_crit "Memory: ${mem_used}/${mem_total} MB (${mem_used_pct}%) — CRITICAL" + elif [[ "${mem_used_pct:-0}" -gt "$MEM_USED_PCT_WARN" ]]; then + log_warn "Memory: ${mem_used}/${mem_total} MB (${mem_used_pct}%)" + else + log_ok "Memory: ${mem_used}/${mem_total} MB (${mem_used_pct}%)" + fi + + # Swap + if [[ "${swap_used:-0}" -gt "$SWAP_USED_MB_WARN" ]]; then + 
log_warn "Swap: ${swap_used} MB in use" + fi + + # Disk + echo "$data" | sed -n '/DISK_START/,/DISK_END/p' | grep "^disk|" | while IFS='|' read _ mount size used avail pct; do + pct_num=${pct%%%} + if [[ "${pct_num:-0}" -gt "$DISK_USED_PCT_CRIT" ]]; then + log_crit "Disk $mount: ${used}/${size} (${pct}) — CRITICAL" + elif [[ "${pct_num:-0}" -gt "$DISK_USED_PCT_WARN" ]]; then + log_warn "Disk $mount: ${used}/${size} (${pct})" + else + log_ok "Disk $mount: ${used}/${size} (${pct})" + fi + done + + # Zombies + if [[ "${zombie_count:-0}" -gt 0 ]]; then + log_warn "Zombie processes: $zombie_count" + fi + + # Stuck processes + local stuck_procs + stuck_procs=$(echo "$data" | sed -n '/STUCK_PROCS_START/,/STUCK_PROCS_END/p' | grep "^stuck|") + if [[ -n "$stuck_procs" ]]; then + log_warn "Stuck/runaway processes detected:" + echo "$stuck_procs" | while IFS='|' read _ pid cpu age comm; do + log_info "PID $pid: $comm at ${cpu}% CPU for $age" + done + fi + + # Failed systemd units + local failed + failed=$(echo "$data" | sed -n '/FAILED_UNITS_START/,/FAILED_UNITS_END/p' | grep "^failed_unit|") + if [[ -n "$failed" ]]; then + log_warn "Failed systemd units:" + echo "$failed" | while IFS='|' read _ unit; do + log_info "$unit" + done + fi + + # Docker containers + local docker_data + docker_data=$(echo "$data" | sed -n '/DOCKER_START/,/DOCKER_END/p' | grep "^docker|") + if [[ -n "$docker_data" ]]; then + echo "" + log_info "Docker containers:" + printf " %-30s %8s %s\n" "CONTAINER" "CPU%" "MEMORY" + echo "$docker_data" | while IFS='|' read _ name cpu mem net pids; do + printf " %-30s %8s %s\n" "$name" "$cpu" "$mem" + done + fi + + # Stopped docker containers + local stopped_containers + stopped_containers=$(echo "$data" | sed -n '/DOCKER_CONTAINERS_START/,/DOCKER_CONTAINERS_END/p' | grep "^container|" | grep -i "exited") + if [[ -n "$stopped_containers" ]]; then + log_warn "Stopped Docker containers (wasting disk):" + echo "$stopped_containers" | while IFS='|' read _ name status 
image ports; do + log_info "$name ($image) — $status" + done + fi +} + +# ── Collect from individual VM/CT guests ───────────────────────────────────── +collect_guest() { + local label="$1" + local ssh_target="$2" + + local data + data=$(ssh_stdin "$ssh_target" "STUCK_PROC_CPU_WARN=$STUCK_PROC_CPU_WARN bash -s" <<<"$COLLECTOR_SCRIPT") || { + log_warn "Could not connect to $label ($ssh_target)" + return + } + echo "$data" >"$REPORT_DIR/${label}.txt" + parse_and_report "$label" "$data" +} + +# ── Build SSH target map from Proxmox ──────────────────────────────────────── +build_guest_map() { + # Map of VMID/CTID -> SSH target (IP) + # We get IPs from the guest agent or known SSH config + local -n map_ref=$1 + + # Get VM IPs via guest agent + while IFS= read -r line; do + local vmid=$(echo "$line" | awk '{print $1}') + local name=$(echo "$line" | awk '{print $2}') + local status=$(echo "$line" | awk '{print $3}') + [[ "$status" != "running" ]] && continue + + local ip + ip=$(ssh_cmd "$PROXMOX_HOST" "qm guest cmd $vmid network-get-interfaces 2>/dev/null" | + python3 -c " +import sys, json +data = json.load(sys.stdin) +for iface in data: + if iface.get('name') in ('lo',): continue + for addr in iface.get('ip-addresses', []): + if addr['ip-address-type'] == 'ipv4' and not addr['ip-address'].startswith('127.'): + print(addr['ip-address']) + sys.exit() +" 2>/dev/null) || true + + if [[ -n "$ip" ]]; then + map_ref["vm-${vmid}-${name}"]="$ip" + fi + done < <(ssh_cmd "$PROXMOX_HOST" "qm list 2>/dev/null" | tail -n +2 | awk 'NF{print $1, $2, $3}') + + # Get CT IPs + while IFS= read -r line; do + local ctid=$(echo "$line" | awk '{print $1}') + local status=$(echo "$line" | awk '{print $2}') + local name=$(echo "$line" | awk '{print $NF}') + [[ "$status" != "running" ]] && continue + + local ip + ip=$(ssh_cmd "$PROXMOX_HOST" "lxc-info -n $ctid -iH 2>/dev/null | head -1") || true + if [[ -z "$ip" ]]; then + ip=$(ssh_cmd "$PROXMOX_HOST" "pct config $ctid 2>/dev/null | grep -oP 
'ip=\K[0-9.]+'") || true + fi + + if [[ -n "$ip" ]]; then + map_ref["ct-${ctid}-${name}"]="$ip" + fi + done < <(ssh_cmd "$PROXMOX_HOST" "pct list 2>/dev/null" | tail -n +2) +} + +# ── Summary and recommendations ────────────────────────────────────────────── +generate_summary() { + log_section "AUDIT SUMMARY & RECOMMENDATIONS" + + echo "" + echo " Raw data saved to: $REPORT_DIR/" + echo "" + + if [[ -s "$REPORT_DIR/ssh-failures.log" ]]; then + local ssh_fail_count + ssh_fail_count=$(wc -l <"$REPORT_DIR/ssh-failures.log") + log_warn "SSH failures: $ssh_fail_count error(s) logged to $REPORT_DIR/ssh-failures.log" + echo "" + fi + + echo " Review the flags above (⚠ warnings, ✖ critical) and consider:" + echo " 1. Kill stuck/runaway processes immediately" + echo " 2. Restart or remove failed systemd units" + echo " 3. Clean up stopped Docker containers and unused images" + echo " 4. Right-size VM/CT resource allocations based on actual usage" + echo " 5. Shut down or decommission VMs marked as stopped/unused" + echo " 6. Schedule maintenance reboots for long-uptime hosts" + echo " 7. Adjust monitoring thresholds to use per-core load metrics" + echo "" +} + +# ── Main ───────────────────────────────────────────────────────────────────── +main() { + echo "╔══════════════════════════════════════════════════════════════╗" + echo "║ HOMELAB INFRASTRUCTURE AUDIT ║" + echo "║ $(date '+%Y-%m-%d %H:%M:%S') ║" + echo "╚══════════════════════════════════════════════════════════════╝" + + # 1. Proxmox host + inventory + collect_proxmox_inventory + + # 2. Discover and audit all guests + log_section "GUEST AUDITS" + declare -A guest_map + build_guest_map guest_map + + for label in $(echo "${!guest_map[@]}" | tr ' ' '\n' | sort); do + local target="${guest_map[$label]}" + collect_guest "$label" "$target" + done + + # 3. Physical hosts + log_section "PHYSICAL HOSTS" + for host in "${PHYSICAL_HOSTS[@]}"; do + collect_guest "$host" "$host" + done + + # 4. 
Summary + generate_summary +} + +# Redirect output if --output was specified +if [[ -n "$OUTPUT_FILE" ]]; then + main 2>&1 | tee "$OUTPUT_FILE" +else + main 2>&1 +fi diff --git a/paper-dynasty/2026-03-30.md b/paper-dynasty/2026-03-30.md new file mode 100644 index 0000000..dc93ff1 --- /dev/null +++ b/paper-dynasty/2026-03-30.md @@ -0,0 +1,63 @@ +--- +title: "Refractor Phase 2: Integration — boost wiring, tests, and review" +description: "Implemented apply_tier_boost orchestration, dry_run evaluator, evaluate-game wiring with kill switch, and 51 new tests across paper-dynasty-database. PRs #176 and #177 merged." +type: context +domain: paper-dynasty +tags: [paper-dynasty-database, refractor, phase-2, testing] +--- + +# Refractor Phase 2: Integration — boost wiring, tests, and review + +**Date:** 2026-03-30 +**Branch:** `feature/refractor-phase2-integration` (merged to `main`) +**Repo:** paper-dynasty-database + +## What Was Done + +Full implementation of Refractor Phase 2 Integration — wiring the Phase 2 Foundation boost functions (PR #176) into the live evaluate-game endpoint so that tier-ups actually create boosted variant cards with modified ratings. + +1. **PR #176 merged (Foundation)** — Review findings fixed (renamed `evolution_tier` to `refractor_tier`, removed redundant parens), then merged via pd-ops +2. **`evaluate_card(dry_run=True)`** — Added dry_run parameter to separate tier detection from tier write. `apply_tier_boost()` becomes the sole writer of `current_tier`, ensuring atomicity with variant creation. Added `computed_tier` and `computed_fully_evolved` to return dict. +3. **`apply_tier_boost()` orchestration** — Full flow: source card lookup, boost application per vs_hand split, variant card + ratings creation with idempotency guards, audit record with idempotency guard, atomic state mutations via `db.atomic()`. Display stat helpers compute fresh avg/obp/slg. +4. 
**`evaluate_game()` wiring** — Calls evaluate_card with dry_run=True, loops through intermediate tiers on tier-up, handles partial multi-tier failures (reports last successful tier), `REFRACTOR_BOOST_ENABLED` env var kill switch, suppresses false notifications when boost is disabled or card_type is missing. +5. **79-sum documentation fix** — Clarified all references to "79-sum" across code, tests, and docs to note the 108-total card invariant (79 variable + 29 x-check for pitchers). +6. **51 new tests** — Display stat unit tests (12), integration tests for orchestration (27), HTTP endpoint tests (7), dry_run evaluator tests (6). Total suite: 223 passed. +7. **Five rounds of swarm reviews** — Each change reviewed individually by swarm-reviewer agents. All findings addressed: false notification on null card_type, wrong tier in log message, partial multi-tier failure reporting, atomicity test accuracy, audit idempotency gap, import os placement. +8. **PR #177 merged** — Review found two issues (import os inside function, audit idempotency gap on PostgreSQL UNIQUE constraint). Both fixed, pushed, approved by Claude, merged via pd-ops. + +## Decisions + +### Display stats computed fresh, not set to None +The original PO review note suggested setting avg/obp/slg to None on variant cards and deferring recalculation. Cal decided to compute them fresh using the exact Pydantic validator formulas instead — strictly better than stale or missing values. Design doc updated to reflect this. + +### Card/ratings creation outside db.atomic() +The design doc specified all writes inside `db.atomic()`. Implementation splits card/ratings creation outside (idempotent, retry-safe via get_or_none guards) with only state mutations (audit, tier write, Card.variant propagation) inside the atomic block. This is pragmatically correct — on retry, existing card/ratings are reused. Design doc updated. 
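
The split described above (idempotent creation outside the transaction, state mutations inside it) can be illustrated with a minimal sketch. It deliberately uses the standard library's `sqlite3` instead of the project's ORM, and the table names, columns, and the `apply_boost` / `get_or_create_variant` helpers are hypothetical stand-ins, not the real `apply_tier_boost()` code.

```python
import sqlite3

# Hypothetical, simplified schema for illustration only.
SCHEMA = """
CREATE TABLE variant_card (id INTEGER PRIMARY KEY, source_card_id INTEGER, tier INTEGER);
CREATE TABLE audit (card_state_id INTEGER, tier INTEGER, UNIQUE (card_state_id, tier));
CREATE TABLE card_state (id INTEGER PRIMARY KEY, current_tier INTEGER, variant_id INTEGER);
"""


def get_or_create_variant(conn, source_card_id, tier):
    """Idempotent step, run outside the transaction: reuse the row on retry."""
    row = conn.execute(
        "SELECT id FROM variant_card WHERE source_card_id = ? AND tier = ?",
        (source_card_id, tier),
    ).fetchone()
    if row is not None:
        return row[0]
    cur = conn.execute(
        "INSERT INTO variant_card (source_card_id, tier) VALUES (?, ?)",
        (source_card_id, tier),
    )
    conn.commit()
    return cur.lastrowid


def apply_boost(conn, state_id, source_card_id, tier, fail=False):
    """Create the variant first, then mutate state atomically."""
    variant_id = get_or_create_variant(conn, source_card_id, tier)
    with conn:  # commits on success, rolls back if the body raises
        seen = conn.execute(
            "SELECT 1 FROM audit WHERE card_state_id = ? AND tier = ?",
            (state_id, tier),
        ).fetchone()
        if seen is None:  # guard mirrors the check-before-create idea
            conn.execute("INSERT INTO audit VALUES (?, ?)", (state_id, tier))
        if fail:
            raise RuntimeError("simulated mid-transaction failure")
        conn.execute(
            "UPDATE card_state SET current_tier = ?, variant_id = ? WHERE id = ?",
            (tier, variant_id, state_id),
        )
    return variant_id
```

In this sketch a failure inside the `with conn:` block leaves the variant row in place but rolls back the audit row and tier write, so a retry reuses the variant and completes the state change exactly once.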
+ +### Kill switch suppresses notifications entirely +When `REFRACTOR_BOOST_ENABLED=false`, the router skips both the boost AND the tier_up notification (via `continue`). This prevents false notifications to the Discord bot during maintenance windows. Initially the code fell through and emitted a notification without a variant — caught during coverage gap analysis and fixed. + +### Audit idempotency guard added +PR review identified that `RefractorBoostAudit` has a `UNIQUE(card_state_id, tier)` constraint in PostgreSQL (from the migration) that the SQLite test DB doesn't enforce. Added `get_or_none` before `create` to prevent IntegrityError on retry. + +## Follow-Up + +- Phase 3: Documentation updates in `card-creation` repo (docs only, no code) +- Phase 4a: Validation test cases in `database` repo +- Phase 4b: Discord bot tier-up notification fix (must ship alongside or after Phase 2 deploy) +- Deploy Phase 2 to dev: run migration `2026-03-28_refractor_phase2_boost.sql` on dev DB +- Stale branches to clean up in database repo: `feat/evolution-refractor-schema-migration`, `test/refractor-tier3` + +## Files Changed + +**paper-dynasty-database:** +- `app/services/refractor_boost.py` — apply_tier_boost orchestration, display stat helpers, card_type validation, audit idempotency guard +- `app/services/refractor_evaluator.py` — dry_run parameter, computed_tier/computed_fully_evolved in return dict +- `app/routers_v2/refractor.py` — evaluate_game wiring, kill switch, partial multi-tier failure, isoformat crash fix +- `tests/test_refractor_boost.py` — 12 new display stat tests, 79-sum comment fixes +- `tests/test_refractor_boost_integration.py` — 27 new integration tests (new file) +- `tests/test_postgame_refractor.py` — 7 new HTTP endpoint tests +- `tests/test_refractor_evaluator.py` — 6 new dry_run unit tests + +**paper-dynasty (parent repo):** +- `docs/refractor-phase2/01-phase1-foundation.md` — 79-sum clarifications +- `docs/refractor-phase2/02-phase2-integration.md` 
— atomicity boundary, display stats updates diff --git a/paper-dynasty/open-packs-checkin-crash.md b/paper-dynasty/open-packs-checkin-crash.md new file mode 100644 index 0000000..7d5196a --- /dev/null +++ b/paper-dynasty/open-packs-checkin-crash.md @@ -0,0 +1,48 @@ +--- +title: "Fix: /open-packs crash from orphaned Check-In Player packs" +description: "Check-In Player packs with hyphenated name caused empty Discord select menu (400 Bad Request) and KeyError in callback." +type: troubleshooting +domain: paper-dynasty +tags: [troubleshooting, discord, paper-dynasty, packs, hotfix] +--- + +# Fix: /open-packs crash from orphaned Check-In Player packs + +**Date:** 2026-03-26 +**PR:** #134 (hotfix branch based on prod tag 2026.3.4, merged to main) +**Tag:** 2026.3.8 +**Severity:** High --- any user with an orphaned Check-In Player pack could not open any packs at all + +## Problem + +Running `/open-packs` returned: `HTTPException: 400 Bad Request (error code: 50035): Invalid Form Body --- In data.components.0.components.0.options: This field is required` + +Discord rejected the message because the select menu had zero options. + +## Root Cause + +Two cascading bugs triggered by the "Check-In Player" pack type name containing a hyphen: + +1. **Empty select menu:** The `pretty_name` logic used `'-' not in key` to identify bare pack type names. "Check-In Player" contains a hyphen, so it fell into the `elif 'Team' in key` / `elif 'Cardset' in key` chain --- matching neither. `pretty_name` stayed `None`, no `SelectOption` was created, and Discord rejected the empty options list. + +2. **KeyError in callback (secondary):** Even if displayed, selecting "Check-In Player" would call `self.values[0].split('-')` producing `['Check', 'In Player']`, which matched none of the pack type tokens in the `if/elif` chain, raising `KeyError`. + +Check-In Player packs are normally auto-opened during the daily check-in (`/comeonmanineedthis`). 
An orphaned pack existed because `roll_for_cards` had previously failed mid-flow, leaving an unopened pack in inventory. + +## Fix + +Three-layer fix applied to both `cogs/economy.py` (production) and `cogs/economy_new/packs.py` (main): + +1. **Filter at source:** Added `AUTO_OPEN_TYPES = {"Check-In Player"}` set. Packs with these types are skipped during grouping with `continue`, so they never reach the select menu. + +2. **Fallback for hyphenated names:** Added `else: pretty_name = key` after the `Team`/`Cardset` checks, so any future hyphenated pack type names still get a display label. + +3. **Graceful error in callback:** Replaced `raise KeyError` with a user-facing ephemeral message ("This pack type cannot be opened manually. Please contact Cal.") and `return`. + +Also changed all "contact an admin" strings to "contact Cal" in `discord_ui/selectors.py`. + +## Lessons + +- **Production loads `cogs/economy.py`, not `cogs/economy_new/packs.py`.** The initial fix was applied to the wrong file. Always check which cogs are actually loaded by inspecting the bot startup logs (`Loaded cog: ...`) before assuming which file handles a command. +- **Hotfix branches based on old tags may have stale CI workflows.** The `docker-build.yml` at the tagged commit had an older trigger config (branch push, not tag push), so the CalVer tag silently failed to trigger CI. Cherry-pick the current workflow into hotfix branches. +- **Pack type names are used as dict keys and split on hyphens** throughout the open-packs flow. Any new pack type with a hyphen in its name will hit similar issues unless the grouping/parsing logic is refactored to stop using hyphen-delimited strings as composite keys. 
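
The first two layers of the fix can be sketched roughly as follows. This is a simplified, hypothetical reconstruction rather than the code in `cogs/economy.py`: the `option_labels` helper, the example key shapes, and the label formatting for `Team`/`Cardset` keys are assumptions; only `AUTO_OPEN_TYPES`, the `'-' not in key` test, and the `else` fallback come from the description above.

```python
AUTO_OPEN_TYPES = {"Check-In Player"}


def option_labels(group_keys):
    """Map each grouping key to a select-menu label, skipping auto-open types."""
    labels = {}
    for key in group_keys:
        if key in AUTO_OPEN_TYPES:
            continue  # layer 1: never offer auto-opened pack types
        if "-" not in key:
            pretty_name = key  # bare pack type name
        elif "Team" in key:
            pretty_name = key.replace("-", " ")  # hypothetical formatting
        elif "Cardset" in key:
            pretty_name = key.replace("-", " ")  # hypothetical formatting
        else:
            pretty_name = key  # layer 2: fallback for other hyphenated names
        labels[key] = pretty_name
    return labels
```

With the fallback in place every surviving key gets a non-empty label, so the select menu is only empty when the user genuinely has no manually openable packs.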
diff --git a/productivity/codex-agents-marketplace.md b/productivity/codex-agents-marketplace.md new file mode 100644 index 0000000..2152611 --- /dev/null +++ b/productivity/codex-agents-marketplace.md @@ -0,0 +1,62 @@ +--- +title: "Codex-to-Claude Agent Converter & Plugin Marketplace" +description: "Pipeline that converts VoltAgent/awesome-codex-subagents TOML definitions to Claude Code plugin marketplace format, hosted at cal/codex-agents on Gitea." +type: reference +domain: productivity +tags: [claude-code, automation, plugins, agents, gitea] +--- + +# Codex Agents Marketplace + +## Overview + +136+ specialized agent definitions converted from [VoltAgent/awesome-codex-subagents](https://github.com/VoltAgent/awesome-codex-subagents) (OpenAI Codex format) to Claude Code plugin marketplace format. + +- **Repo**: `cal/codex-agents` on Gitea (`git@git.manticorum.com:cal/codex-agents.git`) +- **Local path**: `/mnt/NV2/Development/codex-agents/` +- **Upstream**: Cloned to `upstream/` (gitignored), pulled on each sync + +## Sync Pipeline + +```bash +cd /mnt/NV2/Development/codex-agents +./sync.sh # pull upstream + convert changed agents +./sync.sh --force # re-convert all regardless of hash +./sync.sh --dry-run # preview only +./sync.sh --verbose # per-agent status +``` + +- `convert.py` handles TOML → Markdown+YAML frontmatter conversion +- SHA-256 per-file hashes in `codex-manifest.json` skip unchanged agents +- Deleted upstream agents are auto-removed locally +- `.claude-plugin/marketplace.json` is regenerated on each sync + +## Format Mapping + +| Codex | Claude Code | +|-------|------------| +| `gpt-5.4` + `high` | `model: opus` | +| `gpt-5.3-codex-spark` + `medium` | `model: sonnet` | +| `sandbox_mode: read-only` | `disallowedTools: Edit, Write` | +| `sandbox_mode: workspace-write` | full tool access | +| `developer_instructions` | markdown body | +| `"parent agent"` | replaced with `"orchestrating agent"` | + +## Installing Agents + +Add marketplace to 
`~/.claude/settings.json`: +```json +"extraKnownMarketplaces": { + "codex-agents": { "source": { "source": "git", "url": "https://git.manticorum.com/cal/codex-agents.git" } } +} +``` + +Then: +```bash +claude plugin update codex-agents +claude plugin install docker-expert@codex-agents --scope user +``` + +## Agent Categories + +10 categories: Core Development (12), Language Specialists (27), Infrastructure (16), Quality & Security (16), Data & AI (12), Developer Experience (13), Specialized Domains (12), Business & Product (11), Meta & Orchestration (10), Research & Analysis (7).
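The hash-based skip described under Sync Pipeline can be sketched as follows. This is an illustrative reading rather than the actual `convert.py`: it assumes `codex-manifest.json` is a flat mapping of relative path to SHA-256 hex digest, and the function names and the `force` parameter (mirroring `--force`) are hypothetical.

```python
import hashlib
import json
from pathlib import Path


def file_sha256(path):
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def changed_agents(upstream_dir, manifest_path, force=False):
    """List TOML files whose hash differs from the manifest.

    Assumes the manifest is {relative_path: sha256_hex}; the real
    manifest layout may differ.
    """
    manifest_file = Path(manifest_path)
    manifest = json.loads(manifest_file.read_text()) if manifest_file.exists() else {}
    changed = []
    for toml_file in sorted(Path(upstream_dir).rglob("*.toml")):
        key = toml_file.relative_to(upstream_dir).as_posix()
        digest = file_sha256(toml_file)
        if force or manifest.get(key) != digest:
            changed.append(key)
            manifest[key] = digest  # record the new hash for the next sync
    return changed, manifest
```

A caller would convert only the returned paths and then write the updated manifest back to disk, so an unchanged upstream file is skipped on the next run.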