docs: refractor Phase 2 design validation spec (#51)

2026-03-24 21:09:06 +00:00
commit eaf4bdbd6c
parent 8c00bacf59
1 changed files with 468 additions and 0 deletions
--- a/docs/REFRACTOR_PHASE2_VALIDATION_SPEC.md
+++ b/docs/REFRACTOR_PHASE2_VALIDATION_SPEC.md
@@ -0,0 +1,468 @@
+# Refractor Phase 2 — Design Validation Spec
+
+## Purpose
+
+This document captures the design validation test cases that must be verified before and during
+Phase 2 (rating boosts) of the Refractor card progression system. Phase 1 — tracking,
+milestone evaluation, and tier state persistence — is implemented. Phase 2 adds the rating boost
+application logic (`apply_evolution_boosts`), rarity upgrade at T4, and variant hash creation.
+
+**When to reference this document:**
+
+- Before beginning Phase 2 implementation: review all cases to understand the design constraints
+  and edge cases the implementation must handle.
+- During implementation: use each test case as an acceptance gate before the corresponding
+  feature is considered complete.
+- During code review: each case documents the "risk if failed" so reviewers can assess whether
+  a proposed implementation correctly handles that scenario.
+- After Phase 2 ships: run the cases as a regression checklist before any future change to the
+  boost logic, rarity assignment, or milestone evaluator.
+
+## Background: Rating Model
+
+Batter cards have 22 outcome columns summing to exactly 108 chances (derived from the D20
+probability system: 2d6 x 3 columns x 6 rows). Each Refractor tier (T1 through T4) awards a
+1.0-chance budget — a flat shift from out columns to positive-outcome columns. The total
+accumulated budget across all four tiers is 4.0 chances, equal to approximately 3.7% of the
+108-chance total (4 / 108 ≈ 0.037).
+
+**Rarity naming cross-reference:** The PRD chapters (`prd-evolution/`) use the player-facing
+display names. The codebase and this spec use the internal names from `rarity_thresholds.py`.
+They map as follows:
+
+| PRD / Display Name | Codebase Name | ID  |
+|---|---|---|
+| Replacement        | Common        | 5   |
+| Reserve            | Bronze        | 4   |
+| Starter            | Silver        | 3   |
+| All-Star           | Gold          | 2   |
+| MVP                | Diamond       | 1   |
+| Hall of Fame       | HoF           | 99  |
+
+All rarity references in this spec use the codebase names.
+
+Rarity IDs in the codebase (from `rarity_thresholds.py`):
+
+| Rarity Name  | ID  |
+|---|---|
+| Common       | 5   |
+| Bronze       | 4   |
+| Silver       | 3   |
+| Gold         | 2   |
+| Diamond      | 1   |
+| Hall of Fame | 99  |
+
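+Rarity IDs decrease as rarity improves (5 down to 1), and Hall of Fame breaks the sequence with
+the special value `99`. No arithmetic step (`rarity_id + 1` or `rarity_id - 1`) is correct across
+the whole ladder; the upgrade logic must use an ordered rarity ladder instead.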
+
+---
+
+## Test Cases
+
+---
+
+### T4-1: 108-sum preservation under profile-based boosts
+
+**Status:** Pending — Phase 2
+
+**Scenario:**
+
+`apply_evolution_boosts(card_ratings, boost_tier, player_profile)` redistributes 1.0 chance per
+tier across outcome columns according to the player's detected profile (power hitter, contact
+hitter, patient hitter, starting pitcher, relief pitcher). Every combination of profile and tier
+must leave the card's outcome-column sum (22 columns on a batter card) exactly equal to 108
+after the boost is applied. This must hold for each of the four tier applications individually
+and for all four applied cumulatively.
+
+The edge case: a batter card where `flyout_a = 0`. The power and contact hitter profiles draw
+reductions from out columns including `flyout_a`. If the preferred reduction column is at zero,
+the implementation must not produce a negative value and must not silently drop the remainder of
+the budget. The 0-floor cap is enforced per column (see `05-rating-boosts.md` section 5.1:
+"Truncated points are lost, not redistributed").
+
+Verify:
+- After each of T1, T2, T3, T4 boost applications with no truncation,
+  `sum(all outcome columns) == 108` exactly.
+- A card with `flyout_a = 0` does not raise an error and does not produce a column below 0.
+- When truncation occurs (the reduction column is already at 0), the lost budget is discarded,
+  not moved elsewhere. The post-boost sum then equals `108 - truncated_amount` and must never
+  exceed 108.
+
+**Expected Outcome:**
+
+Sum remains 108 after every boost under non-truncation conditions. Under truncation conditions
+(a column hits 0), the final column sum must equal exactly `108 - truncated_amount` — where
+`truncated_amount` is the portion of the 1.0-chance budget that was dropped due to the 0-floor
+cap. This is a single combined assertion: `sum(columns) == 108 - truncated_amount`. Checking
+"sum <= 108" and "truncated amount was discarded" as two independent conditions is insufficient
+— a test can pass both checks while the sum is wrong for an unrelated reason (e.g., a positive
+column also lost value due to a bug). No column value falls below 0.
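
The combined assertion described above could be expressed as a small test helper. This is a minimal sketch, not the project's test code: `check_boost_invariants` and its parameters are hypothetical names, and it assumes the boost implementation reports how much budget it truncated.

```python
import math

TOTAL_CHANCES = 108.0  # card total stated in this spec


def check_boost_invariants(columns, truncated_amount=0.0, tol=1e-9):
    """Return a list of violated invariants; an empty list means the card passes.

    `columns` maps outcome-column name -> chances after the boost.
    `truncated_amount` is the budget the (hypothetical) boost function reports
    as dropped by the 0-floor cap.
    """
    problems = []

    # Invariant 1: no column may fall below zero.
    negatives = sorted(name for name, value in columns.items() if value < 0)
    if negatives:
        problems.append(f"negative columns: {negatives}")

    # Invariant 2: one combined check, sum == 108 - truncated_amount.
    expected = TOTAL_CHANCES - truncated_amount
    actual = math.fsum(columns.values())
    if not math.isclose(actual, expected, rel_tol=0.0, abs_tol=tol):
        problems.append(f"sum {actual} != expected {expected}")

    return problems


# Illustrative values only, collapsed to four columns for brevity.
clean = {"homerun": 2.0, "walk": 6.0, "flyout_a": 0.0, "other": 100.0}
# Sum is 107.5 with 0.25 truncated: "sum <= 108" passes, the combined check fails.
buggy = {"homerun": 1.5, "walk": 6.0, "flyout_a": 0.0, "other": 100.0}

print(check_boost_invariants(clean))
print(check_boost_invariants(buggy, truncated_amount=0.25))
```

The `buggy` card shows why the two loose checks are insufficient: its sum is below 108 and some budget was truncated, yet the sum does not match `108 - truncated_amount`.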
+
+**Risk If Failed:**
+
+A broken 108-sum produces invalid game probabilities. The D20 engine derives per-outcome
+probabilities from `column / 108`. If the sum drifts above or below 108, every outcome
+probability on that card is subtly wrong for every future game that uses it. This error silently
+corrupts game results without any visible failure.
+
+**Files Involved:**
+
+- `docs/prd-evolution/05-rating-boosts.md` — boost budget, profile definitions, cap behavior
+- Phase 2: `pd_cards/evo/boost_profiles.py` (to be created) — `apply_evolution_boosts`
+- `batters/creation.py` — `battingcardratings` model column set (22 columns)
+- `pitchers/creation.py` — `pitchingcardratings` model column set (18 columns + 9 x-checks)
+
+---
+
+### T4-2: D20 probability shift at T4
+
+**Status:** Pending — Phase 2
+
+**Scenario:**
+
+Take a representative Bronze-rarity batter (e.g., a player with total OPS near 0.730,
+`homerun` ≈ 1.2, `single_one` ≈ 4.0, `walk` ≈ 3.0 in the base ratings). Apply all four
+tier boosts cumulatively, distributing the total 4.0-chance budget across positive-outcome
+columns (HR, singles, walk) with equal reductions from out columns. Calculate the resulting
+absolute and relative probability change per D20 roll outcome.
+
+Design target: the full T4 evolution shifts approximately 3.7% of all outcomes from outs to
+positive results (4.0 / 108 = 0.037). The shift should be perceptible to a player reviewing
+their card stats but should not fundamentally alter the card's tier or role. A Bronze batter
+does not become a Gold batter through evolution — they become an evolved Bronze batter.
+
+Worked example for validation reference:
+- Pre-evolution: `homerun = 1.2` → probability per D20 = 1.2 / 108 ≈ 1.11%
+- Post T4 with +0.5 to homerun per tier (4 tiers × 0.5 = +2.0 total): `homerun = 3.2`
+  → probability per D20 = 3.2 / 108 ≈ 2.96% — an increase of ~1.85 percentage points
+- Across all positive outcomes: total shift = 4.0 / 108 ≈ 3.7%
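
The arithmetic in the worked example can be reproduced with a few lines. The helper name `pct` is an illustrative assumption; the input values are the ones given in the example above.

```python
TOTAL_CHANCES = 108


def pct(chances):
    """Convert a chance count on the 108-chance card into a percentage."""
    return chances / TOTAL_CHANCES * 100


pre_hr = 1.2                 # base `homerun` value from the worked example
post_hr = pre_hr + 4 * 0.5   # +0.5 per tier across four tiers

print(f"pre:   {pct(pre_hr):.2f}%")
print(f"post:  {pct(post_hr):.2f}%")
print(f"delta: {pct(post_hr) - pct(pre_hr):.2f} percentage points")
print(f"total shift: {pct(4.0):.1f}%")
```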
+
+**Expected Outcome:**
+
+The cumulative 4.0-chance shift produces a ~3.7% total movement from negative to positive
+outcomes. No single outcome column increases by more than 2.5 chances across the full T4
+journey under any profile. The card remains recognizably Bronze — it does not cross the Gold
+OPS threshold (0.900 for 2024/2025 thresholds; confirmed in `rarity_thresholds.py`
+`BATTER_THRESHOLDS_2024.gold` and `BATTER_THRESHOLDS_2025.gold`) unless it was already near
+the boundary. Note: 0.700 is the Bronze floor (`bronze` field), not the Gold threshold.
+
+**Risk If Failed:**
+
+If the shift is too large, evolution becomes a rarity bypass — players grind low-rarity cards
+to simulate an upgrade they cannot earn through pack pulls. If the shift is too small, the
+system feels unrewarding and players lose motivation to complete tiers. Either miscalibration
+undermines the core design intent.
+
+**Files Involved:**
+
+- `docs/prd-evolution/05-rating-boosts.md` — section 5.2 (boost budgets), section 5.3 (profiles)
+- `rarity_thresholds.py` — OPS boundary values used to assess whether evolution crosses a rarity
+  threshold as a side effect (it should not for mid-range cards)
+- Phase 2: `pd_cards/evo/boost_profiles.py` — boost distribution logic
+
+---
+
+### T4-3: T4 rarity upgrade — pipeline collision risk
+
+**Status:** Pending — Phase 2
+
+**Scenario:**
+
+The Refractor T4 rarity upgrade (`player.rarity_id` advanced by one ladder step) and the
+live-series `post_player_updates()` rarity assignment (OPS-threshold-based, in
+`batters/creation.py`) both write to the same `rarity_id` field on the player record. A
+collision occurs when both run against the same player:
+
+1. Player completes Refractor T4. Evolution system upgrades rarity: Bronze (4) → Silver (3).
+   `evolution_card_state.final_rarity_id = 3` is written as an audit record.
+2. Live-series update runs two weeks later. `post_player_updates()` recalculates OPS → maps to
+   Bronze (4) → writes `rarity_id = 4` to the player record.
+3. The T4 rarity upgrade is silently overwritten. The player's card reverts to Bronze. The
+   `evolution_card_state` record still shows `final_rarity_id = 3` but the live card is Bronze.
+
+This is a conflict between two independent systems both writing to the same field without
+awareness of each other. The current live-series pipeline has no concept of evolution state.
+
+Proposed resolution strategies (document and evaluate; do not implement during Phase 2 spec):
+- **Guard clause in `post_player_updates()`:** Before writing `rarity_id`, check
+  `evolution_card_state.final_rarity_id` for the player. If an evolution upgrade is on record,
+  compare the OPS-derived rarity and `final_rarity_id` by ladder position and write whichever
+  is higher, so the card is never downgraded below the T4 result.
+- **Separate evolution rarity field:** Add `evolution_rarity_bump` (int, default 0) to the
+  card model. The game engine resolves effective rarity by moving `bump` steps up the rarity
+  ladder from `base_rarity` (not by ID arithmetic, because the IDs are non-contiguous).
+  Live-series updates only touch `base_rarity`; the bump is immutable once T4 is reached.
+- **Deferred rarity upgrade:** T4 does not write `rarity_id` immediately. Instead, it sets a
+  flag on `evolution_card_state`. `post_player_updates()` checks the flag and applies the bump
+  after its own rarity calculation, ensuring the evolution upgrade layers on top of the current
+  OPS-derived rarity rather than competing with it.
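
As one illustration, the guard-clause strategy might look like the sketch below. It is not the pipeline's code: `resolve_rarity` and `ladder_position` are hypothetical helpers, and `final_rarity_id` stands in for the value read from `evolution_card_state`.

```python
# Ordered worst -> best, using the rarity IDs listed in this spec.
RARITY_LADDER = (5, 4, 3, 2, 1, 99)


def ladder_position(rarity_id):
    """Index of a rarity ID on the ladder (higher index = better rarity)."""
    return RARITY_LADDER.index(rarity_id)


def resolve_rarity(ops_rarity_id, final_rarity_id=None):
    """Pick the rarity to write: never below a recorded T4 upgrade.

    `final_rarity_id` stands in for evolution_card_state.final_rarity_id and is
    None when the card has no evolution upgrade on record.
    """
    if final_rarity_id is None:
        return ops_rarity_id
    # Compare by ladder position, not by raw ID (IDs descend and HoF is 99).
    return max(ops_rarity_id, final_rarity_id, key=ladder_position)


# Collision scenario from this test case: OPS maps to Bronze (4), T4 recorded Silver (3).
print(resolve_rarity(4, final_rarity_id=3))
```

Comparing by ladder position matters here: a raw `max()` on the IDs would pick Bronze (4) over Silver (3).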
+
+**Expected Outcome:**
+
+Phase 2 must implement one of these strategies (or an alternative that provides equivalent
+protection). The collision scenario must be explicitly tested: evolve a Bronze card to T4,
+run a live-series update that maps the same player to Bronze, confirm the displayed rarity is
+Silver or higher — not Bronze.
+
+**Risk If Failed:**
+
+Live-series updates silently revert T4 rarity upgrades. Players invest significant game time
+reaching T4, receive the visual rarity upgrade, then lose it after the next live-series run
+with no explanation. This is one of the highest-trust violations the system can produce — a
+reward that disappears invisibly.
+
+**Files Involved:**
+
+- `batters/creation.py` — `post_player_updates()` (lines ~304–480)
+- `pitchers/creation.py` — equivalent `post_player_updates()` for pitchers
+- `docs/prd-evolution/05-rating-boosts.md` — section 5.4 (rarity upgrade at T4), note on live
+  series interaction
+- Phase 2: `pd_cards/evo/tier_completion.py` (to be created) — T4 completion handler
+- Database: `evolution_card_state` table, `final_rarity_id` column
+
+---
+
+### T4-4: T4 rarity cap for HoF cards
+
+**Status:** Pending — Phase 2
+
+**Scenario:**
+
+A player card currently at Hall of Fame rarity (`rarity_id = 99`) completes Refractor T4. The
+design specifies: HoF cards receive the T4 rating boost deltas (1.0 chance shift) but do not
+receive a rarity upgrade. The rarity stays at 99.
+
+The implementation must handle this without producing an invalid rarity value. The rarity ID
+sequence in `rarity_thresholds.py` is non-contiguous — the IDs are:
+
+```
+5 (Common) → 4 (Bronze) → 3 (Silver) → 2 (Gold) → 1 (Diamond) → 99 (Hall of Fame)
+```
+
+A naive `rarity_id + 1` would produce `100`, which is not a valid rarity. A lookup-table
+approach on the ordered ladder must be used instead. At `99` (HoF), the ladder returns `99`
+(no-op). Additionally, Diamond (1) cards that complete T4 should upgrade to HoF (99), not to
+`rarity_id = 0` or any other invalid value.
+
+**Expected Outcome:**
+
+- `rarity_id = 99` (HoF): T4 boost applied, rarity unchanged at 99.
+- `rarity_id = 1` (Diamond): T4 boost applied, rarity upgrades to 99 (HoF).
+- `rarity_id = 2` (Gold): T4 boost applied, rarity upgrades to 1 (Diamond).
+- `rarity_id = 3` (Silver): T4 boost applied, rarity upgrades to 2 (Gold).
+- `rarity_id = 4` (Bronze): T4 boost applied, rarity upgrades to 3 (Silver).
+- `rarity_id = 5` (Common): T4 boost applied, rarity upgrades to 4 (Bronze).
+- No card ever receives `rarity_id` outside the set {1, 2, 3, 4, 5, 99}.
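
The expected outcomes above amount to a lookup on the ordered ladder. A minimal sketch follows, with `upgrade_rarity` as a hypothetical name rather than the Phase 2 function:

```python
RARITY_LADDER = (5, 4, 3, 2, 1, 99)  # Common, Bronze, Silver, Gold, Diamond, HoF


def upgrade_rarity(rarity_id):
    """Return the rarity one ladder step up; HoF (99) stays at 99."""
    if rarity_id not in RARITY_LADDER:
        raise ValueError(f"unknown rarity_id: {rarity_id!r}")
    index = RARITY_LADDER.index(rarity_id)
    # Clamp at the top of the ladder so HoF is a no-op.
    next_index = min(index + 1, len(RARITY_LADDER) - 1)
    return RARITY_LADDER[next_index]


for rarity_id in RARITY_LADDER:
    print(rarity_id, "->", upgrade_rarity(rarity_id))
```

Rejecting unknown inputs with `ValueError` is a design choice of this sketch; it keeps values such as 0, 100, or `None` from passing through silently.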
+
+**Risk If Failed:**
+
+An invalid rarity ID (e.g., 0, 100, or None) propagates into the game engine and Discord bot
+display layer. Cards with invalid rarities may render incorrectly, break sort/filter operations
+in pack-opening UX, or cause exceptions in code paths that switch on rarity values.
+
+**Files Involved:**
+
+- `rarity_thresholds.py` — authoritative rarity ID definitions
+- `docs/prd-evolution/05-rating-boosts.md` — section 5.4 (HoF cap behavior)
+- Phase 2: `pd_cards/evo/tier_completion.py` — rarity ladder lookup, T4 completion handler
+- Database: `evolution_card_state.final_rarity_id`
+
+---
+
+### T4-5: RP T1 achievability in realistic timeframe
+
+**Status:** Pending — Phase 2
+
+**Scenario:**
+
+The Relief Pitcher track formula is `IP + K` with a T1 threshold of 3. The design intent is
+"almost any active reliever hits this" in approximately 2 appearances (from `04-milestones.md`
+section 4.2). The scenario to validate: a reliever who records 4 outs (1⅓ innings, written
+1.1 IP in box-score notation) with 1 K in an appearance scores `1.33 + 1 = 2.33`, which is
+below T1. This reliever needs another appearance before reaching T1.
+
+The validation question is whether this is a blocking problem. If typical active RP usage
+(5+ team game appearances) reliably produces T1 within a few sessions of play, the design is
+sound. If a reliever can appear 4–5 times and still not reach T1 due to short, low-strikeout
+outings (e.g., a pure groundball closer who throws 1.0 IP / 0 K per outing), the threshold
+may be too high for the RP role to feel rewarding.
+
+Reference calibration data from Season 10 (via `evo_milestone_simulator.py`): ~94% of all
+relievers reached T1 under the IP+K formula with the threshold of 3. However, this is based on
+a full or near-full season of data. The question is whether early-season RP usage (first 3–5
+team games) produces T1 reliably.
+
+Worked example for a pure-groundball closer:
+- 5 appearances × (1.0 IP + 0 K) = 5.0 — reaches T1 (threshold 3) after appearance 3
+- 5 appearances × (0.2 IP + 0 K) = 1.0 — does not reach T1 after 5 appearances
+
+The second case (mop-up reliever with minimal usage) is expected to not reach T1 quickly, and
+the design accepts this. What is NOT acceptable: a dedicated closer or setup man with 2+ IP per
+session failing to reach T1 after 5+ appearances.
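
A quick way to explore usage patterns like these is to count appearances until the `IP + K` score reaches the threshold. The sketch below is illustrative only: `appearances_to_t1` is a hypothetical helper, it assumes every appearance is identical, and it takes outs rather than IP to sidestep box-score notation.

```python
from fractions import Fraction

T1_THRESHOLD = 3  # RP T1 threshold stated in this spec


def appearances_to_t1(outs_per_appearance, k_per_appearance, max_appearances=10):
    """Count identical appearances needed to reach T1, or None if not reached.

    IP is tracked as outs / 3 with exact fractions, so 3 appearances of
    1.0 IP sum to exactly 3 rather than a rounded float.
    """
    per_appearance = Fraction(outs_per_appearance, 3) + k_per_appearance
    score = Fraction(0)
    for appearance in range(1, max_appearances + 1):
        score += per_appearance
        if score >= T1_THRESHOLD:
            return appearance
    return None


print(appearances_to_t1(3, 0))  # 1.0 IP, 0 K per outing
print(appearances_to_t1(4, 1))  # 4 outs, 1 K per outing
print(appearances_to_t1(0, 0))  # no recorded outs or strikeouts
```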
+
+**Expected Outcome:**
+
+A reliever averaging 1.0+ IP per appearance reaches T1 after 3 appearances. A reliever
+averaging 0.5+ IP per appearance reaches T1 after 5–6 appearances. A reliever with fewer than
+3 total appearances in a season is not expected to reach T1 — this is acceptable. The ~94%
+Season 10 T1 rate confirms the threshold is calibrated correctly for active relievers.
+
+**Risk If Failed:**
+
+If active relievers (regular bullpen roles) cannot reach T1 within 5–10 team games, the
+Refractor system is effectively dead for RP cards from launch. Players who pick up RP cards
+expecting progression will see no reward for multiple play sessions, creating a negative first
+impression of the entire system.
+
+**Files Involved:**
+
+- `docs/prd-evolution/04-milestones.md` — section 4.2 (RP track thresholds and design intent),
+  section 4.3 (Season 10 calibration data)
+- `scripts/evo_milestone_simulator.py` — `formula_rp_ip_k`, `simulate_tiers` — re-run against
+  current season data to validate T1 achievability in early-season usage windows
+- Database: `evolution_track` table — threshold values (admin-tunable, no code change required
+  if recalibration is needed)
+
+---
+
+### T4-6: SP/RP T4 parity with batters
+
+**Status:** Pending — Phase 2
+
+**Scenario:**
+
+The T4 thresholds are:
+
+| Position | T4 Threshold | Formula |
+|---|---|---|
+| Batter | 896 | PA + (TB x 2) |
+| Starting Pitcher | 240 | IP + K |
+| Relief Pitcher | 70 | IP + K |
+
+These were calibrated against Season 10 production data using `evo_milestone_simulator.py`.
+The calibration target was approximately 3% of active players reaching T4 over a full season
+across all position types. The validation here is that this parity holds: one position type
+does not trivially farm Superfractors while another cannot reach T2 without extraordinary
+performance.
+
+The specific risk: SP T4 requires 240 IP+K across the full season. Top Season 10 SPs (Harang:
+163, deGrom: 143) were on pace for T4 at the time of measurement but had not crossed 240 yet.
+If the final-season data shows a spike (e.g., 10–15% of SPs reaching T4 vs. 3% of batters),
+the SP threshold needs adjustment. Conversely, if no reliever reaches T4 in a full season
+where 94% reach T1, the RP T4 threshold of 70 may be achievable only by top closers in
+extreme usage scenarios.
+
+Validation requires re-running `evo_milestone_simulator.py --season <current>` with the final
+season data for all three position types and comparing T4 reach percentages. Accepted tolerance:
+T4 reach rate within 2x across position types (e.g., if batters are at 3%, SP and RP should be
+between 1.5% and 6%).
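
The 2x tolerance rule can be checked mechanically once the simulator has produced T4 reach rates. This sketch assumes the rates are already available as percentages; `parity_violations` and the sample numbers are hypothetical and are not simulator output.

```python
def parity_violations(rates_pct, reference="batter", factor=2.0):
    """Return position types whose T4 rate is outside reference/factor .. reference*factor.

    `rates_pct` maps a position type to its T4 reach rate in percent.
    """
    ref = rates_pct[reference]
    low, high = ref / factor, ref * factor
    return sorted(
        position
        for position, rate in rates_pct.items()
        if position != reference and not (low <= rate <= high)
    )


# Illustrative rates only, not simulator output.
print(parity_violations({"batter": 3.0, "sp": 5.0, "rp": 1.5}))
print(parity_violations({"batter": 3.0, "sp": 12.0, "rp": 0.4}))
```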
+
+**Expected Outcome:**
+
+All three position types produce T4 rates between 1% and 6% over a full season of active play.
+No position type produces T4 rates above 10% (trivially farmable) or below 0.5% (effectively
+unachievable). SP and RP T4 rates should be comparable because their thresholds were designed
+together with the same 3% target in mind.
+
+**Risk If Failed:**
+
+If SP is easy (T4 in half a season) while RP is hard (T4 only for elite closers), then SP card
+owners extract disproportionate value from the system. The Refractor system's balance premise
+— "same tier, same reward, regardless of position" — breaks down, undermining player confidence
+in the fairness of the progression.
+
+**Files Involved:**
+
+- `docs/prd-evolution/04-milestones.md` — section 4.3 (Season 10 calibration table)
+- `scripts/evo_milestone_simulator.py` — primary validation tool; run with `--all-formulas
+  --pitchers-only` and `--batters-only` flags against final season data
+- Database: `evolution_track` table — thresholds are admin-tunable; recalibration does not
+  require a code deployment
+
+---
+
+### T4-7: Cross-season stat accumulation — design confirmation
+
+**Status:** Pending — Phase 2
+
+**Scenario:**
+
+The milestone evaluator (Phase 1, already implemented) queries `BattingSeasonStats` and
+`PitchingSeasonStats` and SUMs the formula metric across all rows for a given
+`(player_id, team_id)` pair, regardless of season number. This means a player's Refractor
+progress is cumulative across seasons: if a player reaches 400 batter points in Season 10 and
+another 400 in Season 11, their total is 800 — within range of T4 (threshold: 896).
+
+This design must be confirmed as intentional before Phase 2 is implemented, because it has
+significant downstream implications:
+
+1. **Progress does not reset between seasons.** A player who earns a card across multiple
+   seasons continues progressing the same Refractor state. Season boundaries are invisible to
+   the evaluator.
+2. **New teams start from zero.** If a player trades away a card and acquires a new copy of the
+   same player, the new card's `evolution_card_state` row starts at T0. The stat accumulation
+   query is scoped to `(player_id, team_id)`, so historical stats from the previous owner are
+   not inherited.
+3. **Live-series updates do not retroactively change progress, but data corrections can.** The
+   evaluator reads finalized season stat rows. If a player's Season 10 stats are adjusted via a
+   data correction, the evaluator will pick up the change on the next evaluation run, and
+   progress could shift backward if the correction removes a game's stats.
+4. **The "full season" targets in the design docs (e.g., "T4 requires ~120 games") assume
+   cumulative multi-season play, not a single season.** At ~7.5 batter points per game, T4 of
+   896 requires approximately 120 in-game appearances. A player who plays 40 games per season
+   across three seasons reaches T4 in their third season.
+
+`04-milestones.md` says only that progress is "Cumulative within a season" and that "progress
+never resets mid-season." It does not explicitly state "cumulative across seasons," but the
+evaluator implementation (SUM across all rows, no season filter) makes this behavior implicit.
+This test case exists to surface that ambiguity and require an explicit design decision before
+Phase 2 ships.
+
+**Expected Outcome:**
+
+Before Phase 2 implementation begins, the design intent must be explicitly confirmed in writing
+(update `04-milestones.md` section 4.1 with a cross-season statement) or the evaluator query
+must be updated to add a season boundary. The options are:
+
+- **Option A (current behavior — accumulate across seasons):** Document explicitly. The
+  Refractor journey can span multiple seasons. Long-term card holders are rewarded for loyalty.
+- **Option B (reset per season):** Add a season filter to the evaluator query. Refractor
+  progress resets at season start. T4 is achievable within a single full season. Cards earned
+  mid-season have a natural catch-up disadvantage.
+
+This spec takes no position on which option is correct. It records that the choice exists,
+that the current implementation defaults to Option A, and that Phase 2 must not be built on
+an unexamined assumption about which option is in effect.
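
In code terms, the two options differ only in whether a season filter is applied to the accumulation. The sketch below uses in-memory rows as a stand-in for `BattingSeasonStats`; the field names and the `batter_points` helper are assumptions, and only the `PA + (TB x 2)` formula comes from this spec.

```python
def batter_points(rows, player_id, team_id, season=None):
    """Sum PA + (TB x 2) over stat rows for one (player_id, team_id) pair.

    season=None accumulates across all seasons (Option A);
    passing a season number restricts the sum to that season (Option B).
    """
    total = 0
    for row in rows:
        if row["player_id"] != player_id or row["team_id"] != team_id:
            continue
        if season is not None and row["season"] != season:
            continue
        total += row["pa"] + 2 * row["tb"]
    return total


# Hypothetical rows standing in for BattingSeasonStats; field names are assumptions.
rows = [
    {"player_id": 1, "team_id": 7, "season": 10, "pa": 200, "tb": 100},
    {"player_id": 1, "team_id": 7, "season": 11, "pa": 200, "tb": 100},
    {"player_id": 1, "team_id": 9, "season": 11, "pa": 50, "tb": 20},
]

print(batter_points(rows, player_id=1, team_id=7))             # Option A
print(batter_points(rows, player_id=1, team_id=7, season=11))  # Option B
```

The third row belongs to a different team and is excluded in both calls, which mirrors the `(player_id, team_id)` scoping described in the scenario.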
+
+**Risk If Failed:**
+
+If Option A is unintentional and players discover their Refractor progress carries over across
+seasons before it is documented as a feature, they will optimize around it in ways the design
+did not anticipate (e.g., holding cards across seasons purely to farm Refractor tiers). If
+Option B is unintentional and progress resets each season without warning, players who invested
+heavily in T3 at season end will be angry when their progress disappears.
+
+**Files Involved:**
+
+- `docs/prd-evolution/04-milestones.md` — section 4.1 (design principles) — **requires update
+  to state the cross-season policy explicitly**
+- Phase 1 (implemented): `pd_cards/evo/evaluator.py` — stat accumulation query; inspect the
+  WHERE clause for any season filter
+- Database: `BattingSeasonStats`, `PitchingSeasonStats` — confirm schema includes `season`
+  column and whether the evaluator query filters on it
+- Database: `evolution_card_state` — confirm there is no season-reset logic in the state
+  management layer
+
+---
+
+## Summary Status
+
+| ID | Title | Status |
+|---|---|---|
+| T4-1 | 108-sum preservation under profile-based boosts | Pending — Phase 2 |
+| T4-2 | D20 probability shift at T4 | Pending — Phase 2 |
+| T4-3 | T4 rarity upgrade — pipeline collision risk | Pending — Phase 2 |
+| T4-4 | T4 rarity cap for HoF cards | Pending — Phase 2 |
+| T4-5 | RP T1 achievability in realistic timeframe | Pending — Phase 2 |
+| T4-6 | SP/RP T4 parity with batters | Pending — Phase 2 |
+| T4-7 | Cross-season stat accumulation — design confirmation | Pending — Phase 2 |
+
+All cases are pending Phase 2 implementation. Two have prerequisites: T4-7 requires a design
+decision before any Phase 2 code is written, and T4-3 requires a resolution strategy to be
+selected before the T4 completion handler is implemented.