From eaf4bdbd6cd712ebe2149056b01240eca1c95972 Mon Sep 17 00:00:00 2001
From: cal
Date: Tue, 24 Mar 2026 21:09:06 +0000
Subject: [PATCH] docs: refractor Phase 2 design validation spec (#51)

---
 docs/REFRACTOR_PHASE2_VALIDATION_SPEC.md | 468 +++++++++++++++++++++++
 1 file changed, 468 insertions(+)
 create mode 100644 docs/REFRACTOR_PHASE2_VALIDATION_SPEC.md

diff --git a/docs/REFRACTOR_PHASE2_VALIDATION_SPEC.md b/docs/REFRACTOR_PHASE2_VALIDATION_SPEC.md
new file mode 100644
index 0000000..bd03257
--- /dev/null
+++ b/docs/REFRACTOR_PHASE2_VALIDATION_SPEC.md
@@ -0,0 +1,468 @@
+# Refractor Phase 2 — Design Validation Spec
+
+## Purpose
+
+This document captures the design validation test cases that must be verified before and during
+Phase 2 (rating boosts) of the Refractor card progression system. Phase 1 — tracking,
+milestone evaluation, and tier state persistence — is implemented. Phase 2 adds the rating boost
+application logic (`apply_evolution_boosts`), rarity upgrade at T4, and variant hash creation.
+
+**When to reference this document:**
+
+- Before beginning Phase 2 implementation: review all cases to understand the design constraints
+  and edge cases the implementation must handle.
+- During implementation: use each test case as an acceptance gate before the corresponding
+  feature is considered complete.
+- During code review: each case documents the "risk if failed" so reviewers can assess whether
+  a proposed implementation correctly handles that scenario.
+- After Phase 2 ships: run the cases as a regression checklist before any future change to the
+  boost logic, rarity assignment, or milestone evaluator.
+
+## Background: Rating Model
+
+Batter cards have 22 outcome columns summing to exactly 108 chances (derived from the D20
+probability system: 2d6 x 3 columns x 6 rows). Each Refractor tier (T1 through T4) awards a
+1.0-chance budget — a flat shift from out columns to positive-outcome columns.
+The total accumulated budget across all four tiers is 4.0 chances, equal to approximately 3.7%
+of the 108-chance total (4 / 108 ≈ 0.037).
+
+**Rarity naming cross-reference:** The PRD chapters (`prd-evolution/`) use the player-facing
+display names. The codebase and this spec use the internal names from `rarity_thresholds.py`.
+They map as follows:
+
+| PRD / Display Name | Codebase Name | ID |
+|---|---|---|
+| Replacement | Common | 5 |
+| Reserve | Bronze | 4 |
+| Starter | Silver | 3 |
+| All-Star | Gold | 2 |
+| MVP | Diamond | 1 |
+| Hall of Fame | HoF | 99 |
+
+All rarity references in this spec use the codebase names.
+
+Rarity IDs in the codebase (from `rarity_thresholds.py`):
+
+| Rarity Name | ID |
+|---|---|
+| Common | 5 |
+| Bronze | 4 |
+| Silver | 3 |
+| Gold | 2 |
+| Diamond | 1 |
+| Hall of Fame | 99 |
+
+The special value `99` for Hall of Fame means a naive `rarity_id + 1` increment is incorrect;
+the upgrade logic must use an ordered rarity ladder, not arithmetic.
+
+---
+
+## Test Cases
+
+---
+
+### T4-1: 108-sum preservation under profile-based boosts
+
+**Status:** Pending — Phase 2
+
+**Scenario:**
+
+`apply_evolution_boosts(card_ratings, boost_tier, player_profile)` redistributes 1.0 chance per
+tier across outcome columns according to the player's detected profile (power hitter, contact
+hitter, patient hitter, starting pitcher, relief pitcher). Every combination of profile and tier
+must leave the 22-column sum exactly equal to 108 after the boost is applied. This must hold for
+all four tier applications, cumulative as well as individual.
+
+The edge case: a batter card where `flyout_a = 0`. The power and contact hitter profiles draw
+reductions from out columns including `flyout_a`. If the preferred reduction column is at zero,
+the implementation must not produce a negative value and must not silently drop the remainder of
+the budget.
+The 0-floor cap is enforced per column (see `05-rating-boosts.md` section 5.1:
+"Truncated points are lost, not redistributed").
+
+Verify:
+- After each of T1, T2, T3, T4 boost applications, `sum(all outcome columns) == 108` exactly
+  when no truncation occurs.
+- A card with `flyout_a = 0` does not raise an error and does not produce a column below 0.
+- When truncation occurs (column already at 0), the lost budget is discarded, not moved
+  elsewhere. The post-boost sum falls below 108 only in the case of truncation, and it must
+  never exceed 108.
+
+**Expected Outcome:**
+
+Sum remains 108 after every boost under non-truncation conditions. Under truncation conditions
+(a column hits 0), the final column sum must equal exactly `108 - truncated_amount` — where
+`truncated_amount` is the portion of the 1.0-chance budget that was dropped due to the 0-floor
+cap. This is a single combined assertion: `sum(columns) == 108 - truncated_amount`. Checking
+"sum <= 108" and "truncated amount was discarded" as two independent conditions is insufficient
+— a test can pass both checks while the sum is wrong for an unrelated reason (e.g., a positive
+column also lost value due to a bug). No column value falls below 0.
+
+**Risk If Failed:**
+
+A broken 108-sum produces invalid game probabilities. The D20 engine derives per-outcome
+probabilities from `column / 108`. If the sum drifts above or below 108, every outcome
+probability on that card is subtly wrong for every future game that uses it. This error silently
+corrupts game results without any visible failure.
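
The combined assertion for T4-1 could be packaged as a small test helper. The following is a minimal sketch, not the project's test code: the helper name `check_boost_invariants`, the `truncated_amount` argument (assumed to be reported by the boost function), and the `other_columns` filler key in the usage note are all hypothetical.

```python
from math import isclose

TOTAL_CHANCES = 108.0  # per the spec: outcome columns sum to 108 chances


def check_boost_invariants(ratings_after, truncated_amount=0.0, tol=1e-9):
    """Return a list of violation messages; an empty list means the card passes.

    ratings_after: dict mapping column name -> chances after a boost.
    truncated_amount: hypothetical bookkeeping value, the part of the budget
    dropped by the 0-floor cap (0.0 when no truncation occurred).
    """
    problems = []

    # Invariant 1: no column may fall below zero.
    for column, value in ratings_after.items():
        if value < 0:
            problems.append(f"{column} fell below 0: {value}")

    # Invariant 2: the single combined assertion from the spec,
    # sum(columns) == 108 - truncated_amount.
    expected_sum = TOTAL_CHANCES - truncated_amount
    actual_sum = sum(ratings_after.values())
    if not isclose(actual_sum, expected_sum, abs_tol=tol):
        problems.append(f"sum is {actual_sum}, expected {expected_sum}")

    return problems
```

Called as `check_boost_invariants({"homerun": 2.0, "flyout_a": 0.0, "other_columns": 106.0})`, the helper returns an empty list. Returning messages rather than raising lets a test report every violated invariant for a card in one pass.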
+
+**Files Involved:**
+
+- `docs/prd-evolution/05-rating-boosts.md` — boost budget, profile definitions, cap behavior
+- Phase 2: `pd_cards/evo/boost_profiles.py` (to be created) — `apply_evolution_boosts`
+- `batters/creation.py` — `battingcardratings` model column set (22 columns)
+- `pitchers/creation.py` — `pitchingcardratings` model column set (18 columns + 9 x-checks)
+
+---
+
+### T4-2: D20 probability shift at T4
+
+**Status:** Pending — Phase 2
+
+**Scenario:**
+
+Take a representative Bronze-rarity batter (e.g., a player with total OPS near 0.730,
+`homerun` ≈ 1.2, `single_one` ≈ 4.0, `walk` ≈ 3.0 in the base ratings). Apply all four
+tier boosts cumulatively, distributing the total 4.0-chance budget across positive-outcome
+columns (HR, singles, walk) with equal reductions from out columns. Calculate the resulting
+absolute and relative probability change per D20 roll outcome.
+
+Design target: the full T4 evolution shifts approximately 3.7% of all outcomes from outs to
+positive results (4.0 / 108 = 0.037). The shift should be perceptible to a player reviewing
+their card stats but should not fundamentally alter the card's tier or role. A Bronze batter
+does not become a Gold batter through evolution — they become an evolved Bronze batter.
+
+Worked example for validation reference:
+- Pre-evolution: `homerun = 1.2` → probability per D20 = 1.2 / 108 ≈ 1.11%
+- Post T4 with +0.5 to homerun per tier (4 tiers × 0.5 = +2.0 total): `homerun = 3.2`
+  → probability per D20 = 3.2 / 108 ≈ 2.96% — an increase of ~1.85 percentage points
+- Across all positive outcomes: total shift = 4.0 / 108 ≈ 3.7%
+
+**Expected Outcome:**
+
+The cumulative 4.0-chance shift produces a ~3.7% total movement from negative to positive
+outcomes. No single outcome column increases by more than 2.5 chances across the full T4
+journey under any profile.
+The card remains recognizably Bronze — it does not cross the Gold OPS threshold
+(0.900 for 2024/2025 thresholds; confirmed in `rarity_thresholds.py`
+`BATTER_THRESHOLDS_2024.gold` and `BATTER_THRESHOLDS_2025.gold`) unless it was already near
+the boundary. Note: 0.700 is the Bronze floor (`bronze` field), not the Gold threshold.
+
+**Risk If Failed:**
+
+If the shift is too large, evolution becomes a rarity bypass — players grind low-rarity cards
+to simulate an upgrade they cannot earn through pack pulls. If the shift is too small, the
+system feels unrewarding and players lose motivation to complete tiers. Either miscalibration
+undermines the core design intent.
+
+**Files Involved:**
+
+- `docs/prd-evolution/05-rating-boosts.md` — section 5.2 (boost budgets), section 5.3 (profiles)
+- `rarity_thresholds.py` — OPS boundary values used to assess whether evolution crosses a rarity
+  threshold as a side effect (it should not for mid-range cards)
+- Phase 2: `pd_cards/evo/boost_profiles.py` — boost distribution logic
+
+---
+
+### T4-3: T4 rarity upgrade — pipeline collision risk
+
+**Status:** Pending — Phase 2
+
+**Scenario:**
+
+The Refractor T4 rarity upgrade (`player.rarity_id` incremented by one ladder step) and the
+live-series `post_player_updates()` rarity assignment (OPS-threshold-based, in
+`batters/creation.py`) both write to the same `rarity_id` field on the player record. A
+collision occurs when both run against the same player:
+
+1. Player completes Refractor T4. Evolution system upgrades rarity: Bronze (4) → Silver (3).
+   `evolution_card_state.final_rarity_id = 3` is written as an audit record.
+2. Live-series update runs two weeks later. `post_player_updates()` recalculates OPS → maps to
+   Bronze (4) → writes `rarity_id = 4` to the player record.
+3. The T4 rarity upgrade is silently overwritten. The player's card reverts to Bronze. The
+   `evolution_card_state` record still shows `final_rarity_id = 3` but the live card is Bronze.
+
+This is a conflict between two independent systems both writing to the same field without
+awareness of each other. The current live-series pipeline has no concept of evolution state.
+
+Proposed resolution strategies (document and evaluate; do not implement during Phase 2 spec):
+- **Guard clause in `post_player_updates()`:** Before writing `rarity_id`, check
+  `evolution_card_state.final_rarity_id` for the player. If an evolution upgrade is on record,
+  apply `max(ops_rarity, final_rarity_id_ladder_position)`, compared by ladder position, so the
+  write never downgrades past the T4 result.
+- **Separate evolution rarity field:** Add `evolution_rarity_bump` (int, default 0) to the
+  card model. The game engine resolves effective rarity as `base_rarity + bump`, where the
+  addition means `bump` steps up the ordered rarity ladder, not arithmetic on the raw IDs.
+  Live-series updates only touch `base_rarity`; the bump is immutable once T4 is reached.
+- **Deferred rarity upgrade:** T4 does not write `rarity_id` immediately. Instead, it sets a
+  flag on `evolution_card_state`. `post_player_updates()` checks the flag and applies the bump
+  after its own rarity calculation, ensuring the evolution upgrade layers on top of the current
+  OPS-derived rarity rather than competing with it.
+
+**Expected Outcome:**
+
+Phase 2 must implement one of these strategies (or an alternative that provides equivalent
+protection). The collision scenario must be explicitly tested: evolve a Bronze card to T4,
+run a live-series update that maps the same player to Bronze, confirm the displayed rarity is
+Silver or higher — not Bronze.
+
+**Risk If Failed:**
+
+Live-series updates silently revert T4 rarity upgrades. Players invest significant game time
+reaching T4, receive the visual rarity upgrade, then lose it after the next live-series run
+with no explanation. This is one of the most serious trust violations the system can produce:
+a reward that disappears invisibly.
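
To make the guard-clause strategy for T4-3 concrete, here is a minimal sketch. It is an illustration under stated assumptions, not the project's code: `RARITY_LADDER`, `ladder_position`, and `resolve_rarity` are hypothetical names, and only the rarity IDs and their ordering come from this spec.

```python
# Rarity IDs ordered from lowest to highest rarity, as listed in this spec.
# The list itself is a hypothetical helper; rarity_thresholds.py is not shown here.
RARITY_LADDER = [5, 4, 3, 2, 1, 99]


def ladder_position(rarity_id):
    """Index of a rarity ID on the ordered ladder (higher index = rarer)."""
    return RARITY_LADDER.index(rarity_id)


def resolve_rarity(ops_rarity_id, final_rarity_id=None):
    """Choose the rarity to write during a live-series update.

    ops_rarity_id: rarity derived from the OPS thresholds.
    final_rarity_id: the T4 result recorded on evolution_card_state, or None
    when the card has no evolution upgrade on record.
    """
    if final_rarity_id is None:
        return ops_rarity_id
    # Compare by ladder position, never by the raw IDs, because the IDs
    # descend toward Diamond (1) and then jump to 99 for Hall of Fame.
    return max(ops_rarity_id, final_rarity_id, key=ladder_position)
```

For the collision scenario, `resolve_rarity(4, 3)` returns `3`: the OPS-derived Bronze does not overwrite the recorded Silver. If the live-series update maps the player to Gold, `resolve_rarity(2, 3)` returns `2`, so an OPS-driven upgrade still applies.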
+
+**Files Involved:**
+
+- `batters/creation.py` — `post_player_updates()` (lines ~304–480)
+- `pitchers/creation.py` — equivalent `post_player_updates()` for pitchers
+- `docs/prd-evolution/05-rating-boosts.md` — section 5.4 (rarity upgrade at T4), note on live
+  series interaction
+- Phase 2: `pd_cards/evo/tier_completion.py` (to be created) — T4 completion handler
+- Database: `evolution_card_state` table, `final_rarity_id` column
+
+---
+
+### T4-4: T4 rarity cap for HoF cards
+
+**Status:** Pending — Phase 2
+
+**Scenario:**
+
+A player card currently at Hall of Fame rarity (`rarity_id = 99`) completes Refractor T4. The
+design specifies: HoF cards receive the T4 rating boost deltas (1.0 chance shift) but do not
+receive a rarity upgrade. The rarity stays at 99.
+
+The implementation must handle this without producing an invalid rarity value. The rarity ID
+sequence in `rarity_thresholds.py` is non-contiguous — the IDs are:
+
+```
+5 (Common) → 4 (Bronze) → 3 (Silver) → 2 (Gold) → 1 (Diamond) → 99 (Hall of Fame)
+```
+
+A naive `rarity_id + 1` would produce `100`, which is not a valid rarity. A lookup-table
+approach on the ordered ladder must be used instead. At `99` (HoF), the ladder returns `99`
+(no-op). Additionally, Diamond (1) cards that complete T4 should upgrade to HoF (99), not to
+`rarity_id = 0` or any other invalid value.
+
+**Expected Outcome:**
+
+- `rarity_id = 99` (HoF): T4 boost applied, rarity unchanged at 99.
+- `rarity_id = 1` (Diamond): T4 boost applied, rarity upgrades to 99 (HoF).
+- `rarity_id = 2` (Gold): T4 boost applied, rarity upgrades to 1 (Diamond).
+- `rarity_id = 3` (Silver): T4 boost applied, rarity upgrades to 2 (Gold).
+- `rarity_id = 4` (Bronze): T4 boost applied, rarity upgrades to 3 (Silver).
+- `rarity_id = 5` (Common): T4 boost applied, rarity upgrades to 4 (Bronze).
+- No card ever receives `rarity_id` outside the set {1, 2, 3, 4, 5, 99}.
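
The expected outcomes above amount to a one-step move on an ordered ladder with a cap at the top. A minimal sketch follows, assuming a hypothetical `next_rarity` helper; the real lookup in `pd_cards/evo/tier_completion.py` does not exist yet and may differ.

```python
# Ordered rarity ladder using the IDs from this spec:
# Common, Bronze, Silver, Gold, Diamond, Hall of Fame.
RARITY_LADDER = [5, 4, 3, 2, 1, 99]


def next_rarity(rarity_id):
    """Return the rarity one ladder step up, capped at Hall of Fame (99).

    Raises ValueError for IDs that are not on the ladder, so an invalid
    value is rejected instead of being propagated.
    """
    try:
        position = RARITY_LADDER.index(rarity_id)
    except ValueError:
        raise ValueError(f"unknown rarity_id: {rarity_id!r}") from None
    # min() keeps the top rung in place: HoF maps to HoF (no-op).
    capped = min(position + 1, len(RARITY_LADDER) - 1)
    return RARITY_LADDER[capped]
```

Because the result is always read from the list, the function cannot return a value outside {1, 2, 3, 4, 5, 99}, which is the last bullet in the expected outcome.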
+
+**Risk If Failed:**
+
+An invalid rarity ID (e.g., 0, 100, or None) propagates into the game engine and Discord bot
+display layer. Cards with invalid rarities may render incorrectly, break sort/filter operations
+in pack-opening UX, or cause exceptions in code paths that switch on rarity values.
+
+**Files Involved:**
+
+- `rarity_thresholds.py` — authoritative rarity ID definitions
+- `docs/prd-evolution/05-rating-boosts.md` — section 5.4 (HoF cap behavior)
+- Phase 2: `pd_cards/evo/tier_completion.py` — rarity ladder lookup, T4 completion handler
+- Database: `evolution_card_state.final_rarity_id`
+
+---
+
+### T4-5: RP T1 achievability in realistic timeframe
+
+**Status:** Pending — Phase 2
+
+**Scenario:**
+
+The Relief Pitcher track formula is `IP + K` with a T1 threshold of 3. The design intent is
+"almost any active reliever hits this" in approximately 2 appearances (from `04-milestones.md`
+section 4.2). The scenario to validate: a reliever who throws 1.1 IP (4 outs) with 1 K in an
+appearance scores `1.33 + 1 = 2.33` — below T1. This reliever needs another appearance before
+reaching T1.
+
+The validation question is whether this is a blocking problem. If typical active RP usage
+(5+ team game appearances) reliably produces T1 within a few sessions of play, the design is
+sound. If a reliever can appear 4–5 times and still not reach T1 due to short, low-strikeout
+outings (e.g., a pure groundball closer who throws 1.0 IP / 0 K per outing), the threshold
+may be too high for the RP role to feel rewarding.
+
+Reference calibration data from Season 10 (via `evo_milestone_simulator.py`): ~94% of all
+relievers reached T1 under the IP+K formula with the threshold of 3. However, this is based on
+a full or near-full season of data. The question is whether early-season RP usage (first 3–5
+team games) produces T1 reliably.
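
The `IP + K` scoring described in this scenario can be sketched as follows. This is a minimal illustration, not `evo_milestone_simulator.py`: the function names are hypothetical, and the sketch assumes innings are counted as outs divided by three, which is how the 4-out example above arrives at 1.33.

```python
from fractions import Fraction

T1_THRESHOLD = 3  # RP T1 threshold from this spec


def rp_score(outs, strikeouts):
    """IP + K for one appearance, with IP taken as outs / 3 (an assumption)."""
    return Fraction(outs, 3) + strikeouts


def appearances_to_t1(appearances, threshold=T1_THRESHOLD):
    """Return the 1-based appearance at which the running total reaches the
    threshold, or None if it never does.

    appearances: list of (outs, strikeouts) pairs in chronological order.
    """
    total = Fraction(0)
    for number, (outs, strikeouts) in enumerate(appearances, start=1):
        total += rp_score(outs, strikeouts)
        if total >= threshold:
            return number
    return None
```

With these definitions, `rp_score(4, 1)` is `Fraction(7, 3)`, about 2.33, and `appearances_to_t1([(3, 0)] * 5)` returns `3`, matching the groundball-closer case of 1.0 IP and 0 K per outing. Exact fractions avoid float rounding at the threshold boundary, where three 1-inning outings must total exactly 3.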
+
+Worked example for a pure-groundball closer:
+- 5 appearances × (1.0 IP + 0 K) = 5.0 — reaches T1 (threshold 3) after appearance 3
+- 5 appearances × (0.2 IP + 0 K) = 1.0 — does not reach T1 after 5 appearances
+
+The second case (mop-up reliever with minimal usage) is expected to not reach T1 quickly, and
+the design accepts this. What is NOT acceptable: a dedicated closer or setup man with 2+ IP per
+session failing to reach T1 after 5+ appearances.
+
+**Expected Outcome:**
+
+A reliever averaging 1.0+ IP per appearance reaches T1 after 3 appearances. A reliever
+averaging 0.5+ IP per appearance reaches T1 after 5–6 appearances. A reliever with fewer than
+3 total appearances in a season is not expected to reach T1 — this is acceptable. The ~94%
+Season 10 T1 rate confirms the threshold is calibrated correctly for active relievers.
+
+**Risk If Failed:**
+
+If active relievers (regular bullpen roles) cannot reach T1 within 5–10 team games, the
+Refractor system is effectively dead for RP cards from launch. Players who pick up RP cards
+expecting progression will see no reward for multiple play sessions, creating a negative first
+impression of the entire system.
+
+**Files Involved:**
+
+- `docs/prd-evolution/04-milestones.md` — section 4.2 (RP track thresholds and design intent),
+  section 4.3 (Season 10 calibration data)
+- `scripts/evo_milestone_simulator.py` — `formula_rp_ip_k`, `simulate_tiers` — re-run against
+  current season data to validate T1 achievability in early-season usage windows
+- Database: `evolution_track` table — threshold values (admin-tunable, no code change required
+  if recalibration is needed)
+
+---
+
+### T4-6: SP/RP T4 parity with batters
+
+**Status:** Pending — Phase 2
+
+**Scenario:**
+
+The T4 thresholds are:
+
+| Position | T4 Threshold | Formula |
+|---|---|---|
+| Batter | 896 | PA + (TB x 2) |
+| Starting Pitcher | 240 | IP + K |
+| Relief Pitcher | 70 | IP + K |
+
+These were calibrated against Season 10 production data using `evo_milestone_simulator.py`.
+The calibration target was approximately 3% of active players reaching T4 over a full season
+across all position types. The validation here is that this parity holds: one position type
+does not trivially farm Superfractors while another cannot reach T2 without extraordinary
+performance.
+
+The specific risk: SP T4 requires 240 IP+K across the full season. Top Season 10 SPs (Harang:
+163, deGrom: 143) were on pace for T4 at the time of measurement but had not crossed 240 yet.
+If the final-season data shows a spike (e.g., 10–15% of SPs reaching T4 vs. 3% of batters),
+the SP threshold needs adjustment. Conversely, if no reliever reaches T4 in a full season
+where 94% reach T1, the RP T4 threshold of 70 may be achievable only by top closers in
+extreme usage scenarios.
+
+Validation requires re-running `evo_milestone_simulator.py --season ` with the final
+season data for all three position types and comparing T4 reach percentages. Accepted tolerance:
+T4 reach rate within 2x across position types (e.g., if batters are at 3%, SP and RP should be
+between 1.5% and 6%).
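
The 2x tolerance stated above can be expressed as a small check over the simulator's T4 reach rates. This is a minimal sketch: the function name, the dictionary keys, and the rates in the usage note are hypothetical illustrations, not Season 10 results or simulator output.

```python
def parity_violations(t4_rates, reference="batter", factor=2.0):
    """Return the position types whose T4 reach rate is outside the accepted band.

    t4_rates: dict mapping position type -> fraction of active players at T4.
    The band is [reference_rate / factor, reference_rate * factor], which
    mirrors the spec's example: batters at 3% imply 1.5% to 6% for SP and RP.
    """
    base = t4_rates[reference]
    low, high = base / factor, base * factor
    return {
        position: rate
        for position, rate in t4_rates.items()
        if not (low <= rate <= high)
    }
```

For example, `parity_violations({"batter": 0.03, "sp": 0.12, "rp": 0.02})` returns `{"sp": 0.12}`, flagging a starting-pitcher spike of the kind the scenario describes. An empty result means all position types fall within the tolerance.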
+
+**Expected Outcome:**
+
+All three position types produce T4 rates between 1% and 6% over a full season of active play.
+No position type produces T4 rates above 10% (trivially farmable) or below 0.5% (effectively
+unachievable). SP and RP T4 rates should be comparable because their thresholds were designed
+together with the same 3% target in mind.
+
+**Risk If Failed:**
+
+If SP is easy (T4 in half a season) while RP is hard (T4 only for elite closers), then SP card
+owners extract disproportionate value from the system. The Refractor system's balance premise
+— "same tier, same reward, regardless of position" — breaks down, undermining player confidence
+in the fairness of the progression.
+
+**Files Involved:**
+
+- `docs/prd-evolution/04-milestones.md` — section 4.3 (Season 10 calibration table)
+- `scripts/evo_milestone_simulator.py` — primary validation tool; run with
+  `--all-formulas --pitchers-only` and `--batters-only` flags against final season data
+- Database: `evolution_track` table — thresholds are admin-tunable; recalibration does not
+  require a code deployment
+
+---
+
+### T4-7: Cross-season stat accumulation — design confirmation
+
+**Status:** Pending — Phase 2
+
+**Scenario:**
+
+The milestone evaluator (Phase 1, already implemented) queries `BattingSeasonStats` and
+`PitchingSeasonStats` and SUMs the formula metric across all rows for a given
+`(player_id, team_id)` pair, regardless of season number. This means a player's Refractor
+progress is cumulative across seasons: if a player reaches 400 batter points in Season 10 and
+another 400 in Season 11, their total is 800 — within range of T4 (threshold: 896).
+
+This design must be confirmed as intentional before Phase 2 is implemented, because it has
+significant downstream implications:
+
+1. **Progress does not reset between seasons.** A player who earns a card across multiple
+   seasons continues progressing the same Refractor state. Season boundaries are invisible to
+   the evaluator.
+2. **New teams start from zero.** If a player trades away a card and acquires a new copy of the
+   same player, the new card's `evolution_card_state` row starts at T0. The stat accumulation
+   query is scoped to `(player_id, team_id)`, so historical stats from the previous owner are
+   not inherited.
+3. **Live-series stat updates do not retroactively change progress.** The evaluator reads
+   finalized season stat rows. If a player's Season 10 stats are adjusted via a data correction,
+   the evaluator will pick up the change on the next evaluation run — progress could shift
+   backward if a data correction removes a game's stats.
+4. **The "full season" targets in the design docs (e.g., "T4 requires ~120 games") assume
+   cumulative multi-season play, not a single season.** At ~7.5 batter points per game, T4 of
+   896 requires approximately 120 in-game appearances. A player who plays 40 games per season
+   across three seasons reaches T4 in their third season.
+
+`04-milestones.md` confirms only the within-season behavior: "Cumulative within a season —
+progress never resets mid-season." The document does not explicitly state "cumulative across
+seasons," but the evaluator implementation (SUM across all rows, no season filter) makes this
+behavior implicit. This test case exists to surface that ambiguity and require an explicit
+design decision before Phase 2 ships.
+
+**Expected Outcome:**
+
+Before Phase 2 implementation begins, the design intent must be explicitly confirmed in writing
+(update `04-milestones.md` section 4.1 with a cross-season statement) or the evaluator query
+must be updated to add a season boundary. The options are:
+
+- **Option A (current behavior — accumulate across seasons):** Document explicitly. The
+  Refractor journey can span multiple seasons. Long-term card holders are rewarded for loyalty.
+- **Option B (reset per season):** Add a season filter to the evaluator query. Refractor
+  progress resets at season start.
+  T4 is achievable within a single full season. Cards earned
+  mid-season have a natural catch-up disadvantage.
+
+This spec takes no position on which option is correct. It records that the choice exists,
+that the current implementation defaults to Option A, and that Phase 2 must not be built on
+an unexamined assumption about which option is in effect.
+
+**Risk If Failed:**
+
+If Option A is unintentional and players discover their Refractor progress carries over across
+seasons before it is documented as a feature, they will optimize around it in ways the design
+did not anticipate (e.g., holding cards across seasons purely to farm Refractor tiers). If
+Option B is unintentional and progress resets each season without warning, players who invested
+heavily in T3 at season end will be angry when their progress disappears.
+
+**Files Involved:**
+
+- `docs/prd-evolution/04-milestones.md` — section 4.1 (design principles) — **requires update
+  to state the cross-season policy explicitly**
+- Phase 1 (implemented): `pd_cards/evo/evaluator.py` — stat accumulation query; inspect the
+  WHERE clause for any season filter
+- Database: `BattingSeasonStats`, `PitchingSeasonStats` — confirm schema includes `season`
+  column and whether the evaluator query filters on it
+- Database: `evolution_card_state` — confirm there is no season-reset logic in the state
+  management layer
+
+---
+
+## Summary Status
+
+| ID | Title | Status |
+|---|---|---|
+| T4-1 | 108-sum preservation under profile-based boosts | Pending — Phase 2 |
+| T4-2 | D20 probability shift at T4 | Pending — Phase 2 |
+| T4-3 | T4 rarity upgrade — pipeline collision risk | Pending — Phase 2 |
+| T4-4 | T4 rarity cap for HoF cards | Pending — Phase 2 |
+| T4-5 | RP T1 achievability in realistic timeframe | Pending — Phase 2 |
+| T4-6 | SP/RP T4 parity with batters | Pending — Phase 2 |
+| T4-7 | Cross-season stat accumulation — design confirmation | Pending — Phase 2 |
+
+All cases are unblocked pending Phase 2 implementation. T4-7 requires a design decision before
+any Phase 2 code is written. T4-3 requires a resolution strategy to be selected before the T4
+completion handler is implemented.
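
For the T4-7 design decision, the difference between Option A and Option B reduces to whether the accumulation applies a season filter. The sketch below illustrates that with in-memory rows. It is not the query in `pd_cards/evo/evaluator.py`: `StatRow`, its `points` field, and `accumulated_points` are hypothetical stand-ins for the real season-stat schema, which this spec does not show.

```python
from collections import namedtuple

# Hypothetical row shape standing in for a season-stat row; `points` is the
# already-computed formula metric for that season.
StatRow = namedtuple("StatRow", "player_id team_id season points")


def accumulated_points(rows, player_id, team_id, season=None):
    """Sum formula points for one (player_id, team_id) pair.

    season=None mirrors Option A (no season filter, progress spans seasons).
    Passing a season number mirrors Option B (progress scoped to one season).
    """
    return sum(
        row.points
        for row in rows
        if row.player_id == player_id
        and row.team_id == team_id
        and (season is None or row.season == season)
    )
```

Using the spec's own numbers, rows of 400 points in Season 10 and 400 in Season 11 for the same pair give 800 under Option A and 400 for Season 11 under Option B. Rows for a different `team_id` are excluded in both cases, which matches the "new teams start from zero" implication.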