docs: add Refractor Phase 2 design validation spec

Seven pre-implementation test cases covering: 108-sum invariant
preservation under profile-based boosts, D20 probability shift
magnitude at T4, pipeline collision risk between T4 rarity upgrade
and live-series post_player_updates, HoF rarity cap (non-contiguous
ID ladder), RP T1 achievability, SP/RP/batter T4 parity, and the
cross-season stat accumulation design decision that must be confirmed
before Phase 2 code is written.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Cal Corum 2026-03-24 09:07:09 -05:00
parent 8c00bacf59
commit f2c09d09e6


@ -0,0 +1,447 @@
# Refractor Phase 2 — Design Validation Spec
## Purpose
This document captures the design validation test cases that must be verified before and during
Phase 2 (rating boosts) of the Refractor card progression system. Phase 1 — tracking,
milestone evaluation, and tier state persistence — is implemented. Phase 2 adds the rating boost
application logic (`apply_evolution_boosts`), rarity upgrade at T4, and variant hash creation.
**When to reference this document:**
- Before beginning Phase 2 implementation: review all cases to understand the design constraints
and edge cases the implementation must handle.
- During implementation: use each test case as an acceptance gate before the corresponding
feature is considered complete.
- During code review: each case documents the "risk if failed" so reviewers can assess whether
a proposed implementation correctly handles that scenario.
- After Phase 2 ships: run the cases as a regression checklist before any future change to the
boost logic, rarity assignment, or milestone evaluator.
## Background: Rating Model
Batter cards have 22 outcome columns summing to exactly 108 chances (derived from the D20
probability system: 2d6 x 3 columns x 6 rows). Each Refractor tier (T1 through T4) awards a
1.0-chance budget — a flat shift from out columns to positive-outcome columns. The total
accumulated budget across all four tiers is 4.0 chances, equal to approximately 3.7% of the
108-chance total (4 / 108 ≈ 0.037).
Rarity IDs in the codebase (from `rarity_thresholds.py`):
| Rarity Name | ID |
|---|---|
| Common | 5 |
| Bronze | 4 |
| Silver | 3 |
| Gold | 2 |
| Diamond | 1 |
| Hall of Fame | 99 |
The special value `99` for Hall of Fame, together with IDs that decrease as rarity rises
(5 → 1), means naive arithmetic on `rarity_id` (`+ 1` or `- 1`) is incorrect; the upgrade
logic must use an ordered rarity ladder, not arithmetic.
---
## Test Cases
---
### T4-1: 108-sum preservation under profile-based boosts
**Status:** Pending — Phase 2
**Scenario:**
`apply_evolution_boosts(card_ratings, boost_tier, player_profile)` redistributes 1.0 chance per
tier across outcome columns according to the player's detected profile (power hitter, contact
hitter, patient hitter, starting pitcher, relief pitcher). Every combination of profile and tier
must leave the 22-column sum exactly equal to 108 after the boost is applied. This must hold for
all four tier applications, cumulative as well as individual.
The edge case: a batter card where `flyout_a = 0`. The power and contact hitter profiles draw
reductions from out columns including `flyout_a`. If the preferred reduction column is at zero,
the implementation must not produce a negative value, and it must handle the unapplied remainder
of the budget deliberately rather than by accident. The 0-floor cap is enforced per column, and
truncated points are discarded (see `05-rating-boosts.md` section 5.1: "Truncated points are
lost, not redistributed").
Verify:
- After each of T1, T2, T3, T4 boost applications, `sum(all outcome columns) == 108` exactly
whenever no truncation occurs.
- A card with `flyout_a = 0` does not raise an error and does not produce a column below 0.
- When truncation occurs (column already at 0), the lost budget is discarded, not moved
elsewhere. The post-boost sum may then fall below 108 by the truncated amount, but it must
never exceed 108.
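The verification bullets above could be expressed as a small checker. This is a minimal sketch, not the Phase 2 implementation: `check_boost_invariants`, its `truncated` flag, and the three-column toy cards are hypothetical, and a real batter card would carry all 22 outcome columns.

```python
import math

TOTAL_CHANCES = 108.0  # per-card chance total stated in this spec


def check_boost_invariants(ratings, truncated=False, tol=1e-9):
    """Return a list of invariant violations for a post-boost card.

    ratings   -- hypothetical mapping of outcome column name -> chances
    truncated -- True if the boost hit the 0-floor on some column
    """
    problems = []
    # No column may drop below zero.
    for column, value in ratings.items():
        if value < -tol:
            problems.append(f"{column} is below 0: {value}")
    total = math.fsum(ratings.values())
    if total > TOTAL_CHANCES + tol:
        # The sum may never exceed 108, truncation or not.
        problems.append(f"sum {total} exceeds {TOTAL_CHANCES}")
    elif not truncated and not math.isclose(total, TOTAL_CHANCES, abs_tol=tol):
        # Without truncation the sum must be exactly 108.
        problems.append(f"sum {total} != {TOTAL_CHANCES} without truncation")
    return problems


# Toy three-column cards (illustrative only, not real card data).
ok_card = {"homerun": 2.2, "walk": 3.0, "flyout_a": 102.8}
bad_card = {"homerun": 2.2, "walk": 3.0, "flyout_a": -0.5}
```

An empty list means the card passes; a test suite could run this after each of the four tier applications.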
**Expected Outcome:**
Sum remains 108 after every boost under non-truncation conditions. Under truncation conditions
(a column hits 0), the sum is reduced by the truncated amount — the implementation discards the
excess rather than redistributing it. No column value falls below 0.
**Risk If Failed:**
A broken 108-sum produces invalid game probabilities. The D20 engine derives per-outcome
probabilities from `column / 108`. If the sum drifts above or below 108, every outcome
probability on that card is subtly wrong for every future game that uses it. This error silently
corrupts game results without any visible failure.
**Files Involved:**
- `docs/prd-evolution/05-rating-boosts.md` — boost budget, profile definitions, cap behavior
- Phase 2: `pd_cards/evo/boost_profiles.py` (to be created) — `apply_evolution_boosts`
- `batters/creation.py` — `battingcardratings` model column set (22 columns)
- `pitchers/creation.py` — `pitchingcardratings` model column set (18 columns + 9 x-checks)
---
### T4-2: D20 probability shift at T4
**Status:** Pending — Phase 2
**Scenario:**
Take a representative Bronze-rarity batter (e.g., a player with total OPS near 0.730,
`homerun` ≈ 1.2, `single_one` ≈ 4.0, `walk` ≈ 3.0 in the base ratings). Apply all four
tier boosts cumulatively, distributing the total 4.0-chance budget across positive-outcome
columns (HR, singles, walk) with equal reductions from out columns. Calculate the resulting
absolute and relative probability change per D20 roll outcome.
Design target: the full T4 evolution shifts approximately 3.7% of all outcomes from outs to
positive results (4.0 / 108 = 0.037). The shift should be perceptible to a player reviewing
their card stats but should not fundamentally alter the card's tier or role. A Bronze batter
does not become a Gold batter through evolution — they become an evolved Bronze batter.
Worked example for validation reference:
- Pre-evolution: `homerun = 1.2` → probability per D20 = 1.2 / 108 ≈ 1.11%
- Post T4 with +0.5 to homerun per tier (4 tiers × 0.5 = +2.0 total): `homerun = 3.2`
→ probability per D20 = 3.2 / 108 ≈ 2.96% — an increase of ~1.85 percentage points
- Across all positive outcomes: total shift = 4.0 / 108 ≈ 3.7%
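The worked example's arithmetic can be reproduced in a few lines. This sketch only restates the numbers above; `chance_to_probability` is a hypothetical helper name, not a function in the codebase.

```python
TOTAL_CHANCES = 108  # per-card chance total stated in this spec


def chance_to_probability(chances, total=TOTAL_CHANCES):
    """Convert a column's chance value to a probability per roll."""
    return chances / total


pre_hr = chance_to_probability(1.2)
post_hr = chance_to_probability(1.2 + 4 * 0.5)  # +0.5 per tier over four tiers
total_shift = chance_to_probability(4 * 1.0)    # 1.0-chance budget per tier

print(f"homerun before: {pre_hr:.2%}")                                   # 1.11%
print(f"homerun after:  {post_hr:.2%}")                                  # 2.96%
print(f"homerun gain:   {(post_hr - pre_hr) * 100:.2f} percentage points")  # 1.85
print(f"total shift:    {total_shift:.1%}")                              # 3.7%
```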
**Expected Outcome:**
The cumulative 4.0-chance shift produces a ~3.7% total movement from negative to positive
outcomes. No single outcome column increases by more than 2.5 chances across the full T4
journey under any profile. The card remains recognizably Bronze — it does not cross the Gold
OPS threshold (0.700 for 2024/2025 thresholds) unless it was already near the boundary.
**Risk If Failed:**
If the shift is too large, evolution becomes a rarity bypass — players grind low-rarity cards
to simulate an upgrade they cannot earn through pack pulls. If the shift is too small, the
system feels unrewarding and players lose motivation to complete tiers. Either miscalibration
undermines the core design intent.
**Files Involved:**
- `docs/prd-evolution/05-rating-boosts.md` — section 5.2 (boost budgets), section 5.3 (profiles)
- `rarity_thresholds.py` — OPS boundary values used to assess whether evolution crosses a rarity
threshold as a side effect (it should not for mid-range cards)
- Phase 2: `pd_cards/evo/boost_profiles.py` — boost distribution logic
---
### T4-3: T4 rarity upgrade — pipeline collision risk
**Status:** Pending — Phase 2
**Scenario:**
The Refractor T4 rarity upgrade (`player.rarity_id` advanced one step up the rarity ladder) and the
live-series `post_player_updates()` rarity assignment (OPS-threshold-based, in
`batters/creation.py`) both write to the same `rarity_id` field on the player record. A
collision occurs when both run against the same player:
1. Player completes Refractor T4. Evolution system upgrades rarity: Bronze (4) → Silver (3).
`evolution_card_state.final_rarity_id = 3` is written as an audit record.
2. Live-series update runs two weeks later. `post_player_updates()` recalculates OPS → maps to
Bronze (4) → writes `rarity_id = 4` to the player record.
3. The T4 rarity upgrade is silently overwritten. The player's card reverts to Bronze. The
`evolution_card_state` record still shows `final_rarity_id = 3` but the live card is Bronze.
This is a conflict between two independent systems both writing to the same field without
awareness of each other. The current live-series pipeline has no concept of evolution state.
Proposed resolution strategies (document and evaluate; do not implement during Phase 2 spec):
- **Guard clause in `post_player_updates()`:** Before writing `rarity_id`, check
`evolution_card_state.final_rarity_id` for the player. If an evolution upgrade is on record,
apply `max(ops_rarity, final_rarity_id_ladder_position)` — never downgrade past the T4 result.
- **Separate evolution rarity field:** Add `evolution_rarity_bump` (int, default 0) to the
card model. The game engine resolves effective rarity by advancing `base_rarity` up the rarity
ladder by `bump` steps (a ladder lookup, not arithmetic on the ID). Live-series updates only
touch `base_rarity`; the bump is immutable once T4 is reached.
- **Deferred rarity upgrade:** T4 does not write `rarity_id` immediately. Instead, it sets a
flag on `evolution_card_state`. `post_player_updates()` checks the flag and applies the bump
after its own rarity calculation, ensuring the evolution upgrade layers on top of the current
OPS-derived rarity rather than competing with it.
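As one illustration, the guard-clause strategy might compare ladder positions rather than raw IDs. This is a sketch under stated assumptions: `guarded_rarity` is a hypothetical helper, `final_rarity_id` stands in for a read of `evolution_card_state.final_rarity_id`, and the real `post_player_updates()` is not shown or modified here.

```python
# Rarity IDs ordered from lowest to highest rarity, per the table in this spec.
RARITY_LADDER = [5, 4, 3, 2, 1, 99]


def guarded_rarity(ops_rarity_id, final_rarity_id=None):
    """Pick the rarity to write, never downgrading below a recorded T4 result.

    ops_rarity_id   -- rarity derived from OPS thresholds by the live-series run
    final_rarity_id -- hypothetical read of evolution_card_state.final_rarity_id,
                       or None when the card has no T4 upgrade on record
    """
    if final_rarity_id is None:
        return ops_rarity_id
    # Compare ladder positions, not raw IDs: a lower ID is a higher rarity,
    # and 99 (Hall of Fame) sits at the top of the ladder.
    return max(ops_rarity_id, final_rarity_id, key=RARITY_LADDER.index)
```

For the collision scenario above, `guarded_rarity(4, 3)` returns `3`, so the Bronze recalculation no longer overwrites the Silver upgrade.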
**Expected Outcome:**
Phase 2 must implement one of these strategies (or an alternative that provides equivalent
protection). The collision scenario must be explicitly tested: evolve a Bronze card to T4,
run a live-series update that maps the same player to Bronze, confirm the displayed rarity is
Silver or higher — not Bronze.
**Risk If Failed:**
Live-series updates silently revert T4 rarity upgrades. Players invest significant game time
reaching T4, receive the visual rarity upgrade, then lose it after the next live-series run
with no explanation. This is one of the highest-trust violations the system can produce — a
reward that disappears invisibly.
**Files Involved:**
- `batters/creation.py` — `post_player_updates()` (lines ~304–480)
- `pitchers/creation.py` — equivalent `post_player_updates()` for pitchers
- `docs/prd-evolution/05-rating-boosts.md` — section 5.4 (rarity upgrade at T4), note on live
series interaction
- Phase 2: `pd_cards/evo/tier_completion.py` (to be created) — T4 completion handler
- Database: `evolution_card_state` table, `final_rarity_id` column
---
### T4-4: T4 rarity cap for HoF cards
**Status:** Pending — Phase 2
**Scenario:**
A player card currently at Hall of Fame rarity (`rarity_id = 99`) completes Refractor T4. The
design specifies: HoF cards receive the T4 rating boost deltas (1.0 chance shift) but do not
receive a rarity upgrade. The rarity stays at 99.
The implementation must handle this without producing an invalid rarity value. The rarity ID
sequence in `rarity_thresholds.py` is non-contiguous — the IDs are:
```
5 (Common) → 4 (Bronze) → 3 (Silver) → 2 (Gold) → 1 (Diamond) → 99 (Hall of Fame)
```
A naive `rarity_id + 1` would produce `100`, which is not a valid rarity. A lookup-table
approach on the ordered ladder must be used instead. At `99` (HoF), the ladder returns `99`
(no-op). Additionally, Diamond (1) cards that complete T4 should upgrade to HoF (99), not to
`rarity_id = 0` or any other invalid value.
**Expected Outcome:**
- `rarity_id = 99` (HoF): T4 boost applied, rarity unchanged at 99.
- `rarity_id = 1` (Diamond): T4 boost applied, rarity upgrades to 99 (HoF).
- `rarity_id = 2` (Gold): T4 boost applied, rarity upgrades to 1 (Diamond).
- `rarity_id = 3` (Silver): T4 boost applied, rarity upgrades to 2 (Gold).
- `rarity_id = 4` (Bronze): T4 boost applied, rarity upgrades to 3 (Silver).
- `rarity_id = 5` (Common): T4 boost applied, rarity upgrades to 4 (Bronze).
- No card ever receives `rarity_id` outside the set {1, 2, 3, 4, 5, 99}.
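The expected outcomes above map directly onto a lookup table. The following is a minimal sketch of that approach; `upgrade_rarity` and `NEXT_RARITY` are hypothetical names, and the ladder order is taken from this spec, not read from `rarity_thresholds.py`.

```python
# Ordered ladder from this spec: Common, Bronze, Silver, Gold, Diamond, Hall of Fame.
RARITY_LADDER = [5, 4, 3, 2, 1, 99]

# Map each rarity to the next step up; Hall of Fame maps to itself (no-op cap).
NEXT_RARITY = dict(zip(RARITY_LADDER, RARITY_LADDER[1:]))
NEXT_RARITY[99] = 99


def upgrade_rarity(rarity_id):
    """Return the rarity ID one ladder step up, capped at Hall of Fame (99)."""
    try:
        return NEXT_RARITY[rarity_id]
    except KeyError:
        # Reject anything outside {1, 2, 3, 4, 5, 99} instead of guessing.
        raise ValueError(f"unknown rarity_id: {rarity_id!r}") from None
```

Raising on unknown IDs is a design choice for this sketch: it keeps an invalid value such as `0`, `100`, or `None` from propagating silently.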
**Risk If Failed:**
An invalid rarity ID (e.g., 0, 100, or None) propagates into the game engine and Discord bot
display layer. Cards with invalid rarities may render incorrectly, break sort/filter operations
in pack-opening UX, or cause exceptions in code paths that switch on rarity values.
**Files Involved:**
- `rarity_thresholds.py` — authoritative rarity ID definitions
- `docs/prd-evolution/05-rating-boosts.md` — section 5.4 (HoF cap behavior)
- Phase 2: `pd_cards/evo/tier_completion.py` — rarity ladder lookup, T4 completion handler
- Database: `evolution_card_state.final_rarity_id`
---
### T4-5: RP T1 achievability in realistic timeframe
**Status:** Pending — Phase 2
**Scenario:**
The Relief Pitcher track formula is `IP + K` with a T1 threshold of 3. The design intent is
"almost any active reliever hits this" in approximately 2 appearances (from `04-milestones.md`
section 4.2). The scenario to validate: a reliever who records 4 outs (1 1/3 innings, written
1.1 IP in box-score notation) with 1 K in an appearance scores `1.33 + 1 = 2.33`, which is
below T1. This reliever needs another appearance before reaching T1.
The validation question is whether this is a blocking problem. If typical active RP usage
(5+ team game appearances) reliably produces T1 within a few sessions of play, the design is
sound. If a reliever can appear 4–5 times and still not reach T1 due to short, low-strikeout
outings (e.g., a pure groundball closer who throws 1.0 IP / 0 K per outing), the threshold
may be too high for the RP role to feel rewarding.
Reference calibration data from Season 10 (via `evo_milestone_simulator.py`): ~94% of all
relievers reached T1 under the IP+K formula with the threshold of 3. However, this is based on
a full or near-full season of data. The question is whether early-season RP usage (first 3–5
team games) produces T1 reliably.
Worked example for a pure-groundball closer:
- 5 appearances × (1.0 IP + 0 K) = 5.0 — reaches T1 (threshold 3) after appearance 3
- 5 appearances × (0.2 IP + 0 K) = 1.0 — does not reach T1 after 5 appearances
The second case (mop-up reliever with minimal usage) is expected to not reach T1 quickly, and
the design accepts this. What is NOT acceptable: a dedicated closer or setup man with 2+ IP per
session failing to reach T1 after 5+ appearances.
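The worked examples can be checked with a short simulation. This sketch assumes innings are passed as decimal innings (outs divided by 3), not box-score notation; `rp_points` and `appearances_to_reach` are hypothetical helpers, not functions from `evo_milestone_simulator.py`.

```python
def rp_points(innings, strikeouts):
    """RP track formula from this spec: IP + K, with IP as decimal innings."""
    return innings + strikeouts


def appearances_to_reach(threshold, innings, strikeouts, max_appearances):
    """Return the appearance number at which the running total first reaches
    the threshold, or None if that does not happen within max_appearances.

    Assumes every appearance has the same innings and strikeout line.
    """
    total = 0.0
    for appearance in range(1, max_appearances + 1):
        total += rp_points(innings, strikeouts)
        if total >= threshold:
            return appearance
    return None


T1_THRESHOLD = 3
closer = appearances_to_reach(T1_THRESHOLD, 1.0, 0, 5)  # pure-groundball closer
mop_up = appearances_to_reach(T1_THRESHOLD, 0.2, 0, 5)  # minimal-usage reliever
```

Here `closer` is `3` and `mop_up` is `None`, matching the two cases above.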
**Expected Outcome:**
A reliever averaging 1.0+ IP per appearance reaches T1 after 3 appearances. A reliever
averaging 0.5+ IP per appearance reaches T1 after 5–6 appearances. A reliever with fewer than
3 total appearances in a season is not expected to reach T1 — this is acceptable. The ~94%
Season 10 T1 rate confirms the threshold is calibrated correctly for active relievers.
**Risk If Failed:**
If active relievers (regular bullpen roles) cannot reach T1 within 5–10 team games, the
Refractor system is effectively dead for RP cards from launch. Players who pick up RP cards
expecting progression will see no reward for multiple play sessions, creating a negative first
impression of the entire system.
**Files Involved:**
- `docs/prd-evolution/04-milestones.md` — section 4.2 (RP track thresholds and design intent),
section 4.3 (Season 10 calibration data)
- `scripts/evo_milestone_simulator.py` — `formula_rp_ip_k`, `simulate_tiers` — re-run against
current season data to validate T1 achievability in early-season usage windows
- Database: `evolution_track` table — threshold values (admin-tunable, no code change required
if recalibration is needed)
---
### T4-6: SP/RP T4 parity with batters
**Status:** Pending — Phase 2
**Scenario:**
The T4 thresholds are:
| Position | T4 Threshold | Formula |
|---|---|---|
| Batter | 896 | PA + (TB x 2) |
| Starting Pitcher | 240 | IP + K |
| Relief Pitcher | 70 | IP + K |
These were calibrated against Season 10 production data using `evo_milestone_simulator.py`.
The calibration target was approximately 3% of active players reaching T4 over a full season
across all position types. The validation here is that this parity holds: one position type
does not trivially farm Superfractors while another cannot reach T2 without extraordinary
performance.
The specific risk: SP T4 requires 240 IP+K across the full season. Top Season 10 SPs (Harang:
163, deGrom: 143) were on pace for T4 at the time of measurement but had not crossed 240 yet.
If the final-season data shows a spike (e.g., 10–15% of SPs reaching T4 vs. 3% of batters),
the SP threshold needs adjustment. Conversely, if no reliever reaches T4 in a full season
where 94% reach T1, the RP T4 threshold of 70 may be achievable only by top closers in
extreme usage scenarios.
Validation requires re-running `evo_milestone_simulator.py --season <current>` with the final
season data for all three position types and comparing T4 reach percentages. Accepted tolerance:
T4 reach rate within 2x across position types (e.g., if batters are at 3%, SP and RP should be
between 1.5% and 6%).
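The 2x tolerance can be written as a simple check against a reference rate. This is a sketch only: `within_parity` and `parity_report` are hypothetical helpers, the batter rate is assumed to be the reference as in the example above, and any rates passed in would come from simulator output rather than from this code.

```python
def within_parity(reference_rate, other_rate, factor=2.0):
    """True if other_rate lies within [reference/factor, reference*factor]."""
    if reference_rate <= 0 or factor < 1:
        raise ValueError("reference_rate must be > 0 and factor must be >= 1")
    return reference_rate / factor <= other_rate <= reference_rate * factor


def parity_report(rates, reference="batter", factor=2.0):
    """Map each position type to whether its T4 rate is within tolerance."""
    ref = rates[reference]
    return {name: within_parity(ref, rate, factor) for name, rate in rates.items()}
```

With batters at 3%, `within_parity(3.0, 1.5)` and `within_parity(3.0, 6.0)` both hold, while a 12.5% SP rate would fail the check.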
**Expected Outcome:**
All three position types produce T4 rates between 1% and 6% over a full season of active play.
No position type produces T4 rates above 10% (trivially farmable) or below 0.5% (effectively
unachievable). SP and RP T4 rates should be comparable because their thresholds were designed
together with the same 3% target in mind.
**Risk If Failed:**
If SP is easy (T4 in half a season) while RP is hard (T4 only for elite closers), then SP card
owners extract disproportionate value from the system. The Refractor system's balance premise
— "same tier, same reward, regardless of position" — breaks down, undermining player confidence
in the fairness of the progression.
**Files Involved:**
- `docs/prd-evolution/04-milestones.md` — section 4.3 (Season 10 calibration table)
- `scripts/evo_milestone_simulator.py` — primary validation tool; run with `--all-formulas
--pitchers-only` and `--batters-only` flags against final season data
- Database: `evolution_track` table — thresholds are admin-tunable; recalibration does not
require a code deployment
---
### T4-7: Cross-season stat accumulation — design confirmation
**Status:** Pending — Phase 2
**Scenario:**
The milestone evaluator (Phase 1, already implemented) queries `BattingSeasonStats` and
`PitchingSeasonStats` and SUMs the formula metric across all rows for a given
`(player_id, team_id)` pair, regardless of season number. This means a player's Refractor
progress is cumulative across seasons: if a player reaches 400 batter points in Season 10 and
another 400 in Season 11, their total is 800 — within range of T4 (threshold: 896).
This design must be confirmed as intentional before Phase 2 is implemented, because it has
significant downstream implications:
1. **Progress does not reset between seasons.** A player who earns a card across multiple
seasons continues progressing the same Refractor state. Season boundaries are invisible to
the evaluator.
2. **New teams start from zero.** If a player trades away a card and acquires a new copy of the
same player, the new card's `evolution_card_state` row starts at T0. The stat accumulation
query is scoped to `(player_id, team_id)`, so historical stats from the previous owner are
not inherited.
3. **Live-series stat updates do not retroactively change progress, but data corrections can.**
The evaluator reads finalized season stat rows. If a player's Season 10 stats are adjusted via
a data correction, the evaluator will pick up the change on the next evaluation run, so
progress could shift backward if the correction removes a game's stats.
4. **The "full season" targets in the design docs (e.g., "T4 requires ~120 games") assume
cumulative multi-season play, not a single season.** At ~7.5 batter points per game, T4 of
896 requires approximately 120 in-game appearances. A player who plays 40 games per season
across three seasons reaches T4 in their third season.
`04-milestones.md` describes progress as "Cumulative within a season — progress never resets
mid-season." The document does not explicitly state "cumulative across seasons," but the
evaluator implementation (SUM across all rows, no season filter) makes this behavior implicit.
Cross-season accumulation is therefore the current behavior, not yet a confirmed design
decision. This test case exists to surface that ambiguity and require an explicit design
decision before Phase 2 ships.
**Expected Outcome:**
Before Phase 2 implementation begins, the design intent must be explicitly confirmed in writing
(update `04-milestones.md` section 4.1 with a cross-season statement) or the evaluator query
must be updated to add a season boundary. The options are:
- **Option A (current behavior — accumulate across seasons):** Document explicitly. The
Refractor journey can span multiple seasons. Long-term card holders are rewarded for loyalty.
- **Option B (reset per season):** Add a season filter to the evaluator query. Refractor
progress resets at season start. T4 is achievable within a single full season. Cards earned
mid-season have a natural catch-up disadvantage.
This spec takes no position on which option is correct. It records that the choice exists,
that the current implementation defaults to Option A, and that Phase 2 must not be built on
an unexamined assumption about which option is in effect.
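The difference between the two options comes down to one optional filter. The sketch below models stat rows as in-memory dictionaries with a precomputed `points` field; that row shape and the `accumulated_points` helper are hypothetical, and the real evaluator queries `BattingSeasonStats` and `PitchingSeasonStats` instead.

```python
def accumulated_points(rows, player_id, team_id, season=None):
    """Sum formula points for one (player_id, team_id) pair.

    season=None  -- Option A: accumulate across all seasons
    season=<int> -- Option B: count only the given season
    """
    return sum(
        row["points"]
        for row in rows
        if row["player_id"] == player_id
        and row["team_id"] == team_id
        and (season is None or row["season"] == season)
    )


# Illustrative rows only, mirroring the 400 + 400 example in this section.
rows = [
    {"player_id": 1, "team_id": 7, "season": 10, "points": 400},
    {"player_id": 1, "team_id": 7, "season": 11, "points": 400},
    {"player_id": 1, "team_id": 8, "season": 11, "points": 150},
]

option_a_total = accumulated_points(rows, player_id=1, team_id=7)             # 800
option_b_total = accumulated_points(rows, player_id=1, team_id=7, season=11)  # 400
```

The third row shows the team scoping: stats recorded under a different `team_id` are not counted toward team 7's total under either option.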
**Risk If Failed:**
If Option A is unintentional and players discover their Refractor progress carries over across
seasons before it is documented as a feature, they will optimize around it in ways the design
did not anticipate (e.g., holding cards across seasons purely to farm Refractor tiers). If
Option B is unintentional and progress resets each season without warning, players who invested
heavily in T3 at season end will be angry when their progress disappears.
**Files Involved:**
- `docs/prd-evolution/04-milestones.md` — section 4.1 (design principles) — **requires update
to state the cross-season policy explicitly**
- Phase 1 (implemented): `pd_cards/evo/evaluator.py` — stat accumulation query; inspect the
WHERE clause for any season filter
- Database: `BattingSeasonStats`, `PitchingSeasonStats` — confirm schema includes `season`
column and whether the evaluator query filters on it
- Database: `evolution_card_state` — confirm there is no season-reset logic in the state
management layer
---
## Summary Status
| ID | Title | Status |
|---|---|---|
| T4-1 | 108-sum preservation under profile-based boosts | Pending — Phase 2 |
| T4-2 | D20 probability shift at T4 | Pending — Phase 2 |
| T4-3 | T4 rarity upgrade — pipeline collision risk | Pending — Phase 2 |
| T4-4 | T4 rarity cap for HoF cards | Pending — Phase 2 |
| T4-5 | RP T1 achievability in realistic timeframe | Pending — Phase 2 |
| T4-6 | SP/RP T4 parity with batters | Pending — Phase 2 |
| T4-7 | Cross-season stat accumulation — design confirmation | Pending — Phase 2 |
All cases are pending Phase 2 implementation. T4-7 requires a design decision before
any Phase 2 code is written. T4-3 requires a resolution strategy to be selected before the T4
completion handler is implemented.