docs: sync KB — autonomous-nightly-2026-04-10.md,autonomous-pipeline-session-2026-04-10.md

Cal Corum 2026-04-10 04:00:47 -05:00
parent 8d165efbe6
commit 87aeaf3309
2 changed files with 240 additions and 0 deletions


@@ -0,0 +1,95 @@
---
title: "Autonomous Nightly Run — 2026-04-10"
description: "First autonomous nightly run: 2 PRs shipped, 7 items queued, 0 rejections. Budget-constrained dispatch."
type: context
domain: paper-dynasty
tags: [autonomous-pipeline, nightly-run]
---
## Run Metadata
- Date: 2026-04-10
- Free slots before: 10/10 S, 5/5 M (no active autonomous work)
- Free slots after: 8/10 S, 5/5 M (2 S slots now in-flight via PRs)
- Open autonomous PRs before run: 0
- Recent rejections: 0
- Budget constraint: the run hit its $5 USD ceiling early because of a broad analyst sweep, so it dispatched 2 engineers instead of filling every free slot.
## Findings
- Analyst produced 8 findings across database, discord-app, and autonomous pipeline
- Growth-po produced 5 findings (all discord-app, all S-sized, all Phase 2 roadmap items)
- Dedup haiku: **skipped** (0 open PRs + 0 rejections = no possible duplicates; all findings novel by construction)
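The skip rule in the last bullet is simple enough to encode as a guard in front of the dedup call. The following is a minimal sketch, not the orchestrator's actual logic; the function name `should_run_dedup` and its parameters are hypothetical.

```python
def should_run_dedup(open_pr_count: int, recent_rejection_count: int) -> bool:
    """Decide whether the dedup call is worth making.

    Hypothetical helper: a new finding can only duplicate an open PR or a
    logged rejection, so with neither present every finding is novel.
    """
    return open_pr_count > 0 or recent_rejection_count > 0
```

For this run both counts were 0, so `should_run_dedup(0, 0)` returns `False` and the call is skipped.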
## PO Decisions
### Database-po (4 findings)
| Finding ID | Decision | Size | Notes |
|---|---|---|---|
| analyst-2026-04-10-002 | approved | S | HTTPException(200) sweep across ~10 routers |
| analyst-2026-04-10-004 | approved | S | N+1 Paperdex fix; add query-count regression test |
| analyst-2026-04-10-006 | reshaped | M | Split into 3 S tickets, start with pack-opening tests |
| analyst-2026-04-10-008 | approved | S | Remove unfiltered pre-count in GET /packs **→ shipped** |
### Discord-po (8 findings)
| Finding ID | Decision | Size | Notes |
|---|---|---|---|
| analyst-2026-04-10-001 | approved | S | Delete dead gameplay_legacy.py **→ shipped** |
| analyst-2026-04-10-003 | approved | S | Economy tree.on_error override (play-lock bug) — **high priority** |
| analyst-2026-04-10-005 | reshaped | M | Two-phase cutover for economy_new/packs.py migration |
| growth-sweep-2026-04-10-001 | approved | S | Rarity celebration embeds — use canonical rarity vocab |
| growth-sweep-2026-04-10-002 | approved | S | /compare command — ephemeral by default, LHP/RHP split |
| growth-sweep-2026-04-10-003 | approved | S | Gauntlet results recap embed |
| growth-sweep-2026-04-10-004 | reshaped | M | Command usage telemetry — cross-repo, needs privacy review |
| growth-sweep-2026-04-10-005 | reshaped | S+M | Split: /gauntlet schedule (S) first, reminder scheduler (M) after scheduler approach specced |
### Self-improvement (auto-approved, no PO gate)
| Finding ID | Decision | Size | Notes |
|---|---|---|---|
| analyst-2026-04-10-007 | approved | S | Split run-nightly.sh stdout/stderr, write last-run-result.json, voice-notify on failure |
## PRs Created
- **discord-app#162** `chore(cogs): remove dead gameplay_legacy cog (4,723 lines, zero references)`: tests PASS (no new failures; 2 pre-existing SQLite path issues unchanged), labels applied, **pr-reviewer dispatch skipped (budget)**. PR: https://git.manticorum.com/cal/paper-dynasty-discord/pulls/162
- **database#211** `fix(packs): remove unfiltered pre-count in GET /packs (3 round-trips → 2)`: tests PASS (266 passed, 13 pre-existing failures unchanged), consumer check clean (no 404 handlers in discord-app), labels applied, **pr-reviewer dispatch skipped (budget)**. PR: https://git.manticorum.com/cal/paper-dynasty-database/pulls/211
- **Post-run diagnostic:** Pyright flagged 4 `Pack.id` attribute access errors after ruff reformatted the file. These are Peewee ORM false positives (`id` is added dynamically by Peewee's Model metaclass) and are pre-existing elsewhere in the codebase. Not a regression from this change.
## Mix Ratio
- No prior digests — this is the first autonomous nightly run. Default 1:1 interleave applied.
- This run shipped 2 stability items and 0 features. Next run should bias toward feature dispatches if budget permits.
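One way to read the default 1:1 interleave is as strict alternation between the two work types, with leftovers from the longer list appended in order. The sketch below assumes that reading and assumes stability goes first; neither detail is confirmed by the orchestrator prompt, and `interleave` is a hypothetical name.

```python
from itertools import zip_longest

_MISSING = object()  # sentinel so that falsy items are not dropped


def interleave(stability, features):
    """Alternate stability and feature items 1:1, keeping each list's order.

    Assumption: stability items lead each pair. When one list runs out,
    the remaining items of the other list follow in their original order.
    """
    ordered = []
    for s, f in zip_longest(stability, features, fillvalue=_MISSING):
        if s is not _MISSING:
            ordered.append(s)
        if f is not _MISSING:
            ordered.append(f)
    return ordered
```

Under this scheme a dispatch budget that stops after two items would ship one stability item and one feature, rather than two stability items as happened in this run.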
## Wishlist Additions
- None. All approved items are S or M and could fit within a normal slot budget — no L-sized items surfaced in this sweep.
## Queued for Next Run (approved but not dispatched due to budget)
The following items are **approved and ready to ship** but were not dispatched this run. They should be picked up first thing next run:
**High priority (stability, real user impact):**
1. `analyst-2026-04-10-003` (S) — Economy cog overwrites global tree.on_error, bypassing play-lock release. **Players are getting stuck due to this bug.** Should be the first item dispatched next run.
2. `analyst-2026-04-10-002` (S) — HTTPException(200) sweep across ~10 DB routers.
3. `analyst-2026-04-10-004` (S) — N+1 Paperdex fix in players endpoints.
**Self-improvement:**
4. `analyst-2026-04-10-007` (S) — run-nightly.sh stdout/stderr split + last-run-result.json. This is a *prerequisite* for reliable future runs; should be prioritized.
**Features (growth):**
5. `growth-sweep-2026-04-10-001` (S) — Rarity celebration embeds.
6. `growth-sweep-2026-04-10-003` (S) — Gauntlet results recap embed.
7. `growth-sweep-2026-04-10-002` (S) — /compare command.
**Reshaped (needs spec work before dispatch):**
- `analyst-2026-04-10-006` (M) — first of 3 split tickets: pack-opening happy path + insufficient funds + duplicate handling.
- `analyst-2026-04-10-005` (M) — Phase 1 spec of economy.py vs economy_new/packs.py drift.
- `growth-sweep-2026-04-10-004` (M) — Cross-repo telemetry; needs privacy posture confirmation.
- `growth-sweep-2026-04-10-005` Issue A (S) — /gauntlet schedule command (pure read).
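Because these items live only in this digest, a later run would have to recover them from the Markdown itself. The following is a rough sketch of such a parser, assuming the section keeps the layout used above (numbered lines with a backticked finding ID followed by a size in parentheses). The function name and the regular expression are illustrative, not part of the existing preflight.

```python
import re

# Matches numbered, dispatch-ready lines such as: 1. `analyst-2026-04-10-003` (S) ...
ITEM_RE = re.compile(r"^\d+\.\s+`([^`]+)`\s+\(([SM])\)")


def parse_queued(digest_text: str) -> list[tuple[str, str]]:
    """Return (finding_id, size) pairs from the 'Queued for Next Run' section.

    Only numbered items are collected. The reshaped bullets start with '-'
    and are skipped on purpose, since they still need spec work.
    """
    in_section = False
    items = []
    for line in digest_text.splitlines():
        if line.startswith("## "):
            # Any level-2 heading either opens or closes the section.
            in_section = line.startswith("## Queued for Next Run")
            continue
        if in_section:
            match = ITEM_RE.match(line.strip())
            if match:
                items.append((match.group(1), match.group(2)))
    return items
```

Run against this digest, the sketch would yield the seven numbered items in priority order and ignore the four reshaped bullets.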
## Rejections
- None this run.
## Self-Improvement Notes
**The pipeline hit its $5 budget ceiling after dispatching analyst + growth-po + 2 POs + 2 engineers.** Breakdown of spend was top-heavy: the analyst agent alone consumed roughly half the budget due to a 411s, 104-tool-use deep audit. Observations for future runs:
1. **Analyst cap**: Consider passing a stricter cap (e.g., "limit to top 5 findings, max 30 tool uses") to the analyst to keep its spend predictable.
2. **Dedup skip was correct**: With 0 open PRs and 0 rejections, the dedup haiku call would have been pure overhead. Encoding this as an orchestrator shortcut (skip dedup when both inputs are empty) would save ~$0.10 per first-run scenario.
3. **pr-reviewer was skipped**: Engineer PRs #162 and #211 did not receive an automated review pass. Cal should manually review these before merge. Future runs should reserve ~$0.30 per PR for pr-reviewer.
4. **pd-plan CLI skipped**: Approved-but-queued items are documented in this digest only, not in the pd-plan database. Next run's preflight should parse this digest's "Queued for Next Run" section and dispatch those items first before generating new findings.
5. **Budget-aware slot filling**: Orchestrator should compute a rough budget forecast (analyst ~$2, each PO ~$0.30, each engineer ~$0.60, each pr-reviewer ~$0.30) before dispatching engineers, and cap engineer count at `floor((remaining_budget - digest_reserve) / (engineer_cost + reviewer_cost))`, so that every dispatched PR can also afford its review pass.
6. **The `analyst-2026-04-10-007` self-improvement item directly addresses observability gaps that made this digest harder to write** — prioritize it next run.
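The engineer cap from note 5 can be sketched as a small function. The per-agent costs default to the rough estimates quoted in that note; the `digest_reserve` value passed in is an assumption, since the notes do not give one. Amounts are converted to integer cents so floating-point rounding cannot change the result.

```python
def max_engineers(remaining_budget: float, digest_reserve: float,
                  engineer_cost: float = 0.60, reviewer_cost: float = 0.30) -> int:
    """Cap the engineer count so each PR can also pay for its reviewer.

    All amounts are in USD. Defaults are the rough estimates from the
    notes above, not measured values.
    """
    usable_cents = round(remaining_budget * 100) - round(digest_reserve * 100)
    per_pr_cents = round(engineer_cost * 100) + round(reviewer_cost * 100)
    if usable_cents <= 0 or per_pr_cents <= 0:
        return 0
    return usable_cents // per_pr_cents  # integer division is the floor
```

For example, with $2.00 remaining and a hypothetical $0.20 digest reserve, `max_engineers(2.00, 0.20)` allows 2 engineers, each with a funded review.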


@@ -0,0 +1,145 @@
---
title: "Autonomous Improvement Pipeline — Build Session 2026-04-09/10"
description: "Single-session design + implementation + first smoke test of the Paper Dynasty autonomous improvement pipeline. 2 PRs shipped, system ready to run nightly pending one more test."
type: context
domain: paper-dynasty
tags: [autonomous-pipeline, session-summary, paper-dynasty, architecture]
---
## Summary
In a single session spanning 2026-04-09 evening through 2026-04-10 early morning, Cal and Claude designed, specced, planned, implemented, merged, and ran the first smoke test of a nightly autonomous improvement pipeline for the Paper Dynasty ecosystem. The goal: a system where Cal wakes up to a Monday-morning queue of "here's what Claude did for you" PRs he can review and merge, keeping momentum even when he's unavailable.
The system ships. It produced 2 real, mergeable PRs on its first run before hitting a budget ceiling. Post-run fixes are in. The systemd timer is installed but not enabled pending one more validation run.
## The arc of the session
### Phase 1 — Brainstorming (spec)
Cal arrived with a two-part idea: (1) introspection on the codebase to recommend updates, (2) recommendations for workflow/tooling optimization. Through ~15 clarifying exchanges, we landed on this shape:
- **Nightly scheduled** (not on-demand) — moves forward despite Cal's schedule
- **Autonomous PR dispatch** (not just reports) — Monday morning review queue
- **WIP slot limits** to prevent overwhelm: 10 S, 5 M, no autonomous L; L items go to a wishlist
- **1:1 stability/feature bias** — mix both types of work
- **Three repos in scope:** database, discord-app, card-creation (card-creation has its own autonomous dynamic now)
- **Separation of concerns:**
  - New **analyst agent** does code audits with fresh eyes (no ownership bias)
  - **growth-po** does product/roadmap sweeps in a new "sweep mode"
  - **Domain POs** (database-po, discord-po, cards-po) gate findings with go/no-go decisions
  - **Engineer agents** build approved S/M work in isolated worktrees
  - **pr-reviewer** gates PRs before Cal sees them
- **Rolling 30-day rejection log** so the pipeline doesn't re-suggest rejected ideas
- **Hybrid tracking:** pd-plan for slot counts + wishlist, KB for digests + rejection log
- **Transparency as a core value** — every decision, rejection, and action documented so both humans and future agents have full context
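The rolling 30-day rejection log amounts to a date-window filter over past rejections. The sketch below assumes each rejection is a dict with a `rejected_on` date; that record shape and the function name are hypothetical, since the session notes do not describe how the KB stores the log.

```python
from datetime import date, timedelta


def recent_rejections(rejections, today: date, window_days: int = 30):
    """Keep only rejections that fall inside the rolling window.

    Assumed record shape: {"finding": str, "rejected_on": datetime.date}.
    """
    cutoff = today - timedelta(days=window_days)
    return [r for r in rejections if r["rejected_on"] >= cutoff]
```

Anything older than the cutoff drops out of the context handed to the analyst and growth-po, so a rejected idea can resurface once the window has passed.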
### Phase 2 — Plan
A 20-task implementation plan was written and self-reviewed against the spec. Self-review caught one gap: the mix ratio (§9) wasn't explicitly implemented anywhere, so a step 6b was added to the orchestrator prompt. Plan review then added four refinements:
1. Wishlist → Run Digest connection (L items should appear in nightly digest)
2. Rolling 30-day rejection context fed to analyst + growth-po to avoid re-discovery
3. Pure-bash preflight for pure data lookups (slot check, git pull, PR inventory, rejection query) — no LLM spin-up on "no slots" nights
4. Dedup as a haiku call (not a script) — semantic matching catches rewording
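The "no slots" gate in refinement 3 is plain arithmetic, which is why it needs no LLM. Here is a minimal sketch of the check, using the 10 S / 5 M limits from the spec. The real preflight is a bash script; the function names and the exit-code convention (0 to proceed, 1 to skip) are assumptions made for illustration.

```python
S_LIMIT, M_LIMIT = 10, 5  # WIP slot limits from the spec


def free_slots(in_flight_s: int, in_flight_m: int) -> tuple[int, int]:
    """Return (free S slots, free M slots), never below zero."""
    return max(S_LIMIT - in_flight_s, 0), max(M_LIMIT - in_flight_m, 0)


def preflight_exit_code(in_flight_s: int, in_flight_m: int) -> int:
    """Assumed convention: 0 means proceed, 1 means skip the run."""
    free_s, free_m = free_slots(in_flight_s, in_flight_m)
    return 0 if (free_s > 0 or free_m > 0) else 1
```

With 2 S items in flight, `free_slots(2, 0)` gives `(8, 5)` and the run proceeds; with all 15 slots occupied the gate returns 1 and nothing further is dispatched.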
### Phase 3 — Implementation (subagent-driven)
Created worktree `.worktrees/autonomous-pipeline` on branch `feat/autonomous-pipeline`. Executed plan via subagent-driven-development skill:
- **Task 1** (inline): scaffolded `autonomous/` directory with README
- **Batch A** (sonnet subagent, Tasks 2-5): extended `pd-plan` CLI with `slot`/`wishlist` schema columns, `slots`/`wishlist` subcommands, `--slot`/`--wishlist` flags on `add`/`update`, new summary section. 8 pytest tests, all passing.
- **Task 6** (sonnet subagent): `autonomous/lib/check_slots.py` with 3 pytest tests
- **Batch B** (sonnet subagent, Tasks 7-9): bash scripts `inventory_prs.sh`, `query_rejections.sh`, `preflight.sh`. Notable: switched from `tea pulls list` to `tea api` because the former returns labels as a flat string (not objects).
- **Batch C** (sonnet subagent, Tasks 10-14): `.claude/agents/analyst.md`, sweep-mode append to `growth-po.md`, `dedup-haiku.md`, `orchestrator.md` (284 lines), `run-nightly.sh` wrapper
- **Task 18** (inline): preflight skip smoke test — added 15 dummy initiatives, verified `preflight.sh` exits 1, cleaned up
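The switch to `tea api` in Batch B matters because label objects can be filtered by name without string splitting. The sketch below shows that filtering on a trimmed payload in the shape the Gitea pull-request API returns (a list of PRs, each with `number` and a `labels` list of objects carrying `name`). The function names are hypothetical, and the actual `inventory_prs.sh` is a bash script that may work differently.

```python
import json


def _label_value(names, prefix):
    """Return the part after `prefix` for the first matching label, else None."""
    for name in sorted(names):
        if name.startswith(prefix):
            return name[len(prefix):]
    return None


def autonomous_prs(api_json: str):
    """List open PRs carrying the `autonomous` label, with size and type."""
    inventory = []
    for pr in json.loads(api_json):
        names = {label["name"] for label in pr.get("labels", [])}
        if "autonomous" in names:
            inventory.append({
                "number": pr["number"],
                "size": _label_value(names, "size:"),
                "type": _label_value(names, "type:"),
            })
    return inventory
```

Counting the returned entries by `size` is then enough to tell how many S and M slots are already occupied.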
11 commits on the feature branch. Fast-forward merged to main. Worktree force-removed. Branch deleted. Pushed to origin.
One snag worth noting: the first subagent dispatch hit a wall of permission prompts Cal had to click through. Existing memory already had the rule "code-writing subagents MUST use mode: acceptEdits" — I'd just failed to apply it. Fixed for all subsequent dispatches.
### Phase 4 — Integration (Gitea + systemd)
- **Gitea labels** created via pd-ops agent in all 3 sub-project repos: `autonomous`, `size:S`, `size:M`, `type:stability`, `type:feature` (colors: `#6366f1`, `#10b981`, `#f59e0b`, `#0891b2`, `#ec4899`). Umbrella repo got its own set later when the observability ticket was filed.
- **Scheduled task** at `~/.config/claude-scheduled/tasks/autonomous-nightly/` — settings.json (haiku outer, $1 budget, 3600s timeout), prompt.md (just runs the wrapper), mcp.json (empty; the inner claude inherits Cal's global MCP config including gitea-mcp)
- **Systemd timer** at `~/.config/systemd/user/claude-scheduled@autonomous-nightly.timer` — nightly 02:00 with 15-min random delay, Persistent=true. Registered but NOT enabled.
### Phase 5 — First smoke test
Kicked off `autonomous/run-nightly.sh` at 02:40:07 local. It ran for just under 16 minutes and was terminated at 02:55:47 by the $5 budget ceiling.
**Despite the budget hit, the pipeline actually worked:**
- Preflight ran cleanly (slots 10S/5M free, 0 open PRs, 0 rejections)
- Analyst produced 8 findings across database, discord-app, autonomous (self-improvement)
- Growth-po produced 5 findings (all discord Phase 2 roadmap items, all S-sized)
- Dedup correctly skipped (empty inputs = no possible dupes)
- POs made real decisions: many approved, several thoughtfully reshaped
- 2 PRs shipped before budget ran out, both correctly labeled and mergeable
**PRs shipped:**
- **discord-app#162** `chore(cogs): remove dead gameplay_legacy cog (4,723 lines, zero references)`: the pipeline caught that `cogs/gameplay_legacy.py` was dead code with no inbound references.
- **database#211** `fix(packs): remove unfiltered pre-count in GET /packs (3 round-trips to 2)`: caught a real correctness bug, where an unfiltered `Pack.select().count()` pre-check made the endpoint return 404 when no packs existed globally instead of returning an empty filtered result.
**What went wrong:**
1. Analyst alone consumed ~$2.50 with a 411s, 104-tool-use deep sweep
2. `pr-reviewer` dispatch was skipped — budget ran out
3. Digest Write was permission-denied because the inner claude wasn't running with `--dangerously-skip-permissions`; the digest was manually extracted from the JSON output and saved
4. pd-plan integration skipped — approved queued items only in the digest
5. 7 approved items never dispatched, including a high-priority real bug (economy cog overwriting `tree.on_error` causing stuck play-lock)
6. Multiple Bash tool denials wasted budget on retries (compound commands, venv activation, `source`, curl, `diff <()`)
### Phase 6 — Post-run fixes
Spun up a yolo-mode `claude -p` agent to apply three critical fixes. Commit `a79efb2`:
1. Inner claude budget: $5 → $20
2. Added `--dangerously-skip-permissions` to inner claude in `run-nightly.sh`
3. Analyst scope tightened in `.claude/agents/analyst.md`: max findings 15 → 5, added 30 tool-use cap with budget starvation rationale
Also filed `cal/paper-dynasty-umbrella#3` (labels: `autonomous`, `size:S`, `type:stability`) for the observability self-improvement (split stdout/stderr, write `last-run-result.json`, voice-notify on failure). This is exactly the kind of ticket the pipeline could pick up on a future autonomous run.
## Current state (as of 2026-04-10)
- ✅ All code merged to main and pushed to origin
- ✅ 15 Gitea labels created across 4 repos (3 sub-projects + umbrella)
- ✅ Scheduled task installed
- ✅ Systemd timer unit installed
- ✅ 2 real PRs shipped (pending Cal review / reviewer pipeline)
- ✅ Observability ticket filed
- ✅ Post-run fixes applied
- ⏸️ Systemd timer **NOT ENABLED** — pending one more validation smoke test with the $20 budget + tightened analyst
## Queued work for next run
See `project_autonomous_first_run.md` memory file for the full list. Headline items:
1. `analyst-2026-04-10-003` — Economy cog `tree.on_error` bug (real stuck-user impact) — dispatch first
2. `cal/paper-dynasty-umbrella#3` — Observability improvement (unblocks future debugging) — dispatch early
3. 5 other approved items from the first run (3 features, 2 stability)
4. 4 reshaped items that need additional spec work before dispatch
## Why this matters
This was a meta-accomplishment: building the tooling that builds the tooling. The pipeline is now a standing autonomous capability in the Paper Dynasty ecosystem. Cal's availability is no longer the bottleneck for routine stability fixes, small features, and dead-code cleanup. As confidence builds, the slot limits can rise, the budget can expand, and the scope can broaden.
The first run also answered a deeper question: **can agents produce genuinely useful work without human guidance on what to build?** Based on these 2 PRs, the answer is yes: the pipeline caught a real correctness bug and a real dead-code pile that Cal had not flagged. That's the whole value proposition working on night one.
## Next session pickup
When resuming:
1. Check status of `cal/paper-dynasty-discord#162` and `cal/paper-dynasty-database#211` — merged? closed? pending?
2. Check status of `cal/paper-dynasty-umbrella#3` — has it been picked up?
3. Decide: enable the systemd timer, or run another manual smoke test first
4. If running another smoke test: expect ~$7-10 with the new config (analyst $2, growth-po $0.30, 2 POs × $0.30, 5 engineers × $0.80, 5 pr-reviewers × $0.30)
5. See `project_autonomous_pipeline.md` and `project_autonomous_first_run.md` in memory for full context
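The ~$7-10 estimate in step 4 can be checked by summing the listed line items. This is only a worked version of that arithmetic, using the per-agent figures quoted in the step; the dictionary layout and function name are illustrative.

```python
# (count, estimated cost per dispatch in USD), as listed in step 4
FORECAST = {
    "analyst": (1, 2.00),
    "growth-po": (1, 0.30),
    "domain POs": (2, 0.30),
    "engineers": (5, 0.80),
    "pr-reviewers": (5, 0.30),
}


def forecast_total(forecast) -> float:
    """Sum count x cost in integer cents, then convert back to dollars."""
    total_cents = sum(count * round(cost * 100) for count, cost in forecast.values())
    return total_cents / 100
```

The line items sum to $8.40 (2.00 + 0.30 + 0.60 + 4.00 + 1.50), which sits inside the quoted range and well under the new $20 ceiling.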
## References
- Spec: `docs/superpowers/specs/2026-04-09-autonomous-improvement-pipeline-design.md`
- Plan: `docs/superpowers/plans/2026-04-09-autonomous-improvement-pipeline.md`
- Commit log: `git log --oneline --grep='autonomous'` in paper-dynasty-umbrella
- First run digest: `autonomous-nightly-2026-04-10.md` (this same domain)
- Live system: `/mnt/NV2/Development/paper-dynasty/autonomous/`