test: refractor system Tier 3 test coverage #171

cal · 2026-03-24T21:02:38Z

cal commented

2026-03-24 21:02:38 +00:00

Summary

7 new test cases covering medium-priority robustness gaps in the Refractor system
T3-1: Invalid card_type query parameter on /tracks endpoint (empty string, unknown type)
T3-6: last_evaluated_at is null for never-evaluated cards in API response
T3-7: fully_evolved/tier mismatch detection and auto-correction (2 tests)
T3-8: Multi-team same-season player stat isolation (2 tests)

Test plan

All 7 new tests pass (22 passed, 15 skipped — skips are existing PostgreSQL-only tests)
No ruff lint violations
Follows existing test patterns (SQLite in-memory, shared-cache for API tests)

Notable findings

T3-7: The evaluator always recomputes fully_evolved from tier — a corrupted flag is silently repaired on next evaluation
T3-8: Uses values where cross-contamination would cross a tier boundary, making isolation bugs detectable

🤖 Generated with Claude Code

## Summary - 7 new test cases covering medium-priority robustness gaps in the Refractor system - **T3-1**: Invalid card_type query parameter on /tracks endpoint (empty string, unknown type) - **T3-6**: last_evaluated_at is null for never-evaluated cards in API response - **T3-7**: fully_evolved/tier mismatch detection and auto-correction (2 tests) - **T3-8**: Multi-team same-season player stat isolation (2 tests) ## Test plan - [x] All 7 new tests pass (`22 passed, 15 skipped` — skips are existing PostgreSQL-only tests) - [x] No ruff lint violations - [x] Follows existing test patterns (SQLite in-memory, shared-cache for API tests) ## Notable findings - T3-7: The evaluator always recomputes `fully_evolved` from tier — a corrupted flag is silently repaired on next evaluation - T3-8: Uses values where cross-contamination would cross a tier boundary, making isolation bugs detectable 🤖 Generated with [Claude Code](https://claude.com/claude-code)

cal added 1 commit 2026-03-24 21:02:39 +00:00

test: add Tier 3 refractor test cases (T3-1, T3-6, T3-7, T3-8) fa96e79428

Adds four Tier 3 (medium-priority) test cases to the existing refractor test
suite.  All tests use SQLite in-memory databases and run without a PostgreSQL
connection.

T3-1 (test_refractor_track_api.py): Two tests verifying that
  GET /api/v2/refractor/tracks?card_type= returns 200 with count=0 for both
  an unrecognised card_type value ('foo') and an empty string, rather than
  a 4xx/5xx.  A full SQLite-backed TestClient is added to the track API test
  module for these cases.

T3-6 (test_refractor_state_api.py): Verifies that
  GET /api/v2/refractor/cards/{card_id} returns last_evaluated_at: null (not
  a crash or missing key) when the RefractorCardState was initialised but
  never evaluated.  Adds the SQLite test infrastructure (models, fixtures,
  helper factories, TestClient) to the state API test module.

T3-7 (test_refractor_evaluator.py): Two tests covering fully_evolved/tier
  mismatch correction.  When the database has fully_evolved=True but
  current_tier=3 (corruption), evaluate_card must re-derive fully_evolved
  from the freshly-computed tier (False for tier 3, True for tier 4).

T3-8 (test_refractor_evaluator.py): Two tests confirming per-team stat
  isolation.  A player with BattingSeasonStats on two different teams must
  have each team's RefractorCardState reflect only that team's stats — not
  a combined total.  Covers both same-season and multi-season scenarios.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

cal reviewed 2026-03-24 21:04:42 +00:00

cal left a comment

AI Code Review

Files Reviewed

tests/test_refractor_evaluator.py (modified — T3-7, T3-8 added)
tests/test_refractor_state_api.py (modified — T3-6 added)
tests/test_refractor_track_api.py (modified — T3-1 added)

Findings

Correctness

All 7 tests are correctly mapped to real implementation behaviour. Verified each against the source.

T3-7 (TestFullyEvolvedFlagCorrection): The evaluator at line 176 of refractor_evaluator.py does card_state.fully_evolved = card_state.current_tier >= 4 — it always recomputes from the final tier, never trusts the stored flag. Both tests correctly exploit this:

test_fully_evolved_flag_corrected_when_tier_below_4: injects fully_evolved=True, tier=3, supplies pa=500 (value=500, T3 range), verifies the return dict AND the persisted row (via CardStateStub.get_by_id(state.id)) are both corrected to False. The DB reload is a meaningful persistence check beyond just the return dict.
test_fully_evolved_flag_preserved_when_tier_reaches_4: same corruption setup but pa=900 (value=900 >= T4=896) legitimately reaches T4, so fully_evolved=True is the correct output. The pair cleanly separates "flag corrected" from "flag legitimately set."

T3-8 (TestMultiTeamStatIsolation): The evaluator WHERE clause (evaluator.py line 111) filters player_id AND team_id. Test values are well-chosen: the same-season test uses pa=80 and pa=120 where combined=200 crosses T2=149, making contamination detectable by both value and tier. The multi-season test uses per-team totals of 160/180 vs combined 340 — both teams land T2 either way, but the value assertions (160.0 vs 180.0) are what catch isolation bugs, and the docstring correctly calls this out.

T3-1 (invalid card_type): The list_tracks router (refractor.py lines 69-70) does query.where(RefractorTrack.card_type == card_type) when card_type is not None. An empty string is not None, so it passes through as a literal WHERE card_type = '' — no rows match. The fixture correctly seeds two real tracks (batter, sp) before the test runs, so the result distinguishes "filter excluded them" from "table was empty."

T3-6 (null last_evaluated_at): The _build_card_state_response helper (refractor.py lines 49-51) uses state.last_evaluated_at.isoformat() if state.last_evaluated_at else None. The test directly exercises the else None branch and asserts the key is present with a null value — the 200-vs-500-crash scenario is the real risk being guarded here.

No math errors in threshold boundary calculations. All comparisons use >= consistently with the stub and production _tier_from_value implementations.

Security

No concerns. All changes are test-only. No credentials, user input surfaces, or new dependencies introduced.

Style & Conventions

Fully consistent with the established project patterns:

SQLite shared-cache databases (file:name?mode=memory&cache=shared) for API tests, matching the existing pattern.
Dependency injection via the four _*_model / _*_fn overrides for evaluator tests — identical to all existing evaluator tests.
_make_state / _make_stats helpers reused from existing module-level helpers.
"What / Why" docstring structure followed throughout.
os.environ.setdefault("API_TOKEN", "test-token") correctly placed before the app.db_engine import in both new test sections; # noqa: E402 carried where needed.
autouse=False on the new SQLite fixtures is correct — prevents them from conflicting with the autouse setup_test_db in conftest.py.

Suggestions (non-blocking)

test_fully_evolved_flag_preserved_when_tier_reaches_4 does not reload the DB row to confirm persistence of fully_evolved=True, unlike its sibling which does. If card_state.save() were omitted in the evaluator, the first test would catch it, but a get_by_id recheck would make the pair symmetric.
test_get_card_state_last_evaluated_at_null creates the RefractorCardState inline rather than using the locally-defined _sa_make_state helper. Both approaches are correct; the inline version is more explicit about the last_evaluated_at=None field being the meaningful distinction.

Verdict: APPROVED

All 7 tests are correctly specified, correctly implement their stated scenarios, and directly exercise real implementation paths. Threshold math is accurate. The SQLite-backed API test infrastructure follows established project patterns. No correctness, security, or convention issues found. This is safe to merge.

Automated review by Claude PR Reviewer

## AI Code Review ### Files Reviewed - `tests/test_refractor_evaluator.py` (modified — T3-7, T3-8 added) - `tests/test_refractor_state_api.py` (modified — T3-6 added) - `tests/test_refractor_track_api.py` (modified — T3-1 added) ### Findings #### Correctness All 7 tests are correctly mapped to real implementation behaviour. Verified each against the source. **T3-7 (TestFullyEvolvedFlagCorrection):** The evaluator at line 176 of `refractor_evaluator.py` does `card_state.fully_evolved = card_state.current_tier >= 4` — it always recomputes from the final tier, never trusts the stored flag. Both tests correctly exploit this: - `test_fully_evolved_flag_corrected_when_tier_below_4`: injects `fully_evolved=True, tier=3`, supplies `pa=500` (value=500, T3 range), verifies the return dict AND the persisted row (via `CardStateStub.get_by_id(state.id)`) are both corrected to `False`. The DB reload is a meaningful persistence check beyond just the return dict. - `test_fully_evolved_flag_preserved_when_tier_reaches_4`: same corruption setup but `pa=900` (value=900 >= T4=896) legitimately reaches T4, so `fully_evolved=True` is the correct output. The pair cleanly separates "flag corrected" from "flag legitimately set." **T3-8 (TestMultiTeamStatIsolation):** The evaluator WHERE clause (evaluator.py line 111) filters `player_id AND team_id`. Test values are well-chosen: the same-season test uses pa=80 and pa=120 where combined=200 crosses T2=149, making contamination detectable by both value and tier. The multi-season test uses per-team totals of 160/180 vs combined 340 — both teams land T2 either way, but the value assertions (160.0 vs 180.0) are what catch isolation bugs, and the docstring correctly calls this out. **T3-1 (invalid card_type):** The `list_tracks` router (refractor.py lines 69-70) does `query.where(RefractorTrack.card_type == card_type)` when `card_type is not None`. An empty string is not None, so it passes through as a literal `WHERE card_type = ''` — no rows match. The fixture correctly seeds two real tracks (batter, sp) before the test runs, so the result distinguishes "filter excluded them" from "table was empty." **T3-6 (null last_evaluated_at):** The `_build_card_state_response` helper (refractor.py lines 49-51) uses `state.last_evaluated_at.isoformat() if state.last_evaluated_at else None`. The test directly exercises the `else None` branch and asserts the key is present with a null value — the 200-vs-500-crash scenario is the real risk being guarded here. No math errors in threshold boundary calculations. All comparisons use `>=` consistently with the stub and production `_tier_from_value` implementations. #### Security No concerns. All changes are test-only. No credentials, user input surfaces, or new dependencies introduced. #### Style & Conventions Fully consistent with the established project patterns: - SQLite shared-cache databases (`file:name?mode=memory&cache=shared`) for API tests, matching the existing pattern. - Dependency injection via the four `_*_model` / `_*_fn` overrides for evaluator tests — identical to all existing evaluator tests. - `_make_state` / `_make_stats` helpers reused from existing module-level helpers. - "What / Why" docstring structure followed throughout. - `os.environ.setdefault("API_TOKEN", "test-token")` correctly placed before the `app.db_engine` import in both new test sections; `# noqa: E402` carried where needed. - `autouse=False` on the new SQLite fixtures is correct — prevents them from conflicting with the autouse `setup_test_db` in `conftest.py`. #### Suggestions (non-blocking) - `test_fully_evolved_flag_preserved_when_tier_reaches_4` does not reload the DB row to confirm persistence of `fully_evolved=True`, unlike its sibling which does. If `card_state.save()` were omitted in the evaluator, the first test would catch it, but a `get_by_id` recheck would make the pair symmetric. - `test_get_card_state_last_evaluated_at_null` creates the `RefractorCardState` inline rather than using the locally-defined `_sa_make_state` helper. Both approaches are correct; the inline version is more explicit about the `last_evaluated_at=None` field being the meaningful distinction. ### Verdict: APPROVED All 7 tests are correctly specified, correctly implement their stated scenarios, and directly exercise real implementation paths. Threshold math is accurate. The SQLite-backed API test infrastructure follows established project patterns. No correctness, security, or convention issues found. This is safe to merge. --- *Automated review by Claude PR Reviewer*

Claude added the

ai-reviewing

label 2026-03-24 21:15:46 +00:00

Claude approved these changes 2026-03-24 21:18:25 +00:00

Dismissed

Claude left a comment

Approved — 7 tests verified by pr-reviewer agent. Safe to merge.

Claude approved these changes 2026-03-24 21:19:15 +00:00

Dismissed

Claude left a comment

AI Code Review

Files Reviewed

tests/test_refractor_evaluator.py (modified — +212 lines, 2 new test classes)
tests/test_refractor_state_api.py (modified — SQLite boilerplate + T3-6 test)
tests/test_refractor_track_api.py (modified — SQLite boilerplate + 2 T3-1 tests)

Findings

Correctness

All value calculations verified manually against the stub formula and track thresholds:

T3-7a: pa=500 → value=500 → T3 (448≤500<896) → fully_evolved=(3≥4)=False ✓
T3-7b: pa=900 → value=900 → T4 (900≥896) → fully_evolved=(4≥4)=True ✓
T3-8a: team1 pa=80→T1, team2 pa=120→T1; combined 200→T2 — isolation confirmed ✓
T3-8b: team1 90+70=160→T2, team2 100+80=180→T2; exact value assertions catch bleed ✓
T3-1: two known tracks seeded before filter tests — confirms filter excludes them rather than the table being empty ✓
T3-6: verifies last_evaluated_at: null in API response and that the None branch in _build_card_state_response doesn't crash ✓

The state_reloaded = CardStateStub.get_by_id(state.id) assertion in T3-7a is a good detail — it confirms the corrected flag was persisted, not just returned in the dict.

No logical inconsistency between T2-1 (fully_evolved persists when stats drop from T4) and T3-7 (corrupted fully_evolved=True at T3 is corrected) — both follow correctly from the evaluator always deriving fully_evolved = (post-regression tier ≥ 4).

Security

No concerns — test code only, no production paths changed.

Style & Conventions

Follows the SQLite shared-cache pattern established in PR #170 (same SqliteDatabase("file:...?mode=memory&cache=shared", uri=True), same db middleware, same fixture structure). Consistent with project conventions.

Suggestions

_sa_make_state is defined in the T3-6 boilerplate but the actual test (test_get_card_state_last_evaluated_at_null) calls RefractorCardState.create(...) directly instead of using the helper. Minor DRY inconsistency — non-blocking.
No _state_api_db.close() / _track_api_db.close() in fixture teardown (same pattern as the T2 tests on main). Fine in practice with SQLite, consistent.
# noqa: E402 present on the app.db_engine import in test_refractor_track_api.py but not in test_refractor_state_api.py. Ruff config likely handles it globally; PR reports no violations.

Verdict: APPROVED

All 7 test cases are correct, mathematically verified, and well-documented with clear "What / Why" docstrings. Tests are isolated, use the established SQLite in-memory pattern, and cover real behavioral properties (flag corruption repair, multi-team stat isolation, null timestamp serialization). No blockers.

Automated review by Claude PR Reviewer

## AI Code Review ### Files Reviewed - `tests/test_refractor_evaluator.py` (modified — +212 lines, 2 new test classes) - `tests/test_refractor_state_api.py` (modified — SQLite boilerplate + T3-6 test) - `tests/test_refractor_track_api.py` (modified — SQLite boilerplate + 2 T3-1 tests) ### Findings #### Correctness All value calculations verified manually against the stub formula and track thresholds: - T3-7a: `pa=500` → `value=500` → T3 (448≤500<896) → `fully_evolved=(3≥4)=False` ✓ - T3-7b: `pa=900` → `value=900` → T4 (900≥896) → `fully_evolved=(4≥4)=True` ✓ - T3-8a: team1 `pa=80`→T1, team2 `pa=120`→T1; combined 200→T2 — isolation confirmed ✓ - T3-8b: team1 `90+70=160`→T2, team2 `100+80=180`→T2; exact value assertions catch bleed ✓ - T3-1: two known tracks seeded before filter tests — confirms filter excludes them rather than the table being empty ✓ - T3-6: verifies `last_evaluated_at: null` in API response and that the `None` branch in `_build_card_state_response` doesn't crash ✓ The `state_reloaded = CardStateStub.get_by_id(state.id)` assertion in T3-7a is a good detail — it confirms the corrected flag was persisted, not just returned in the dict. No logical inconsistency between T2-1 (`fully_evolved` persists when stats drop from T4) and T3-7 (corrupted `fully_evolved=True` at T3 is corrected) — both follow correctly from the evaluator always deriving `fully_evolved = (post-regression tier ≥ 4)`. #### Security No concerns — test code only, no production paths changed. #### Style & Conventions Follows the SQLite shared-cache pattern established in PR #170 (same `SqliteDatabase("file:...?mode=memory&cache=shared", uri=True)`, same db middleware, same fixture structure). Consistent with project conventions. #### Suggestions - `_sa_make_state` is defined in the T3-6 boilerplate but the actual test (`test_get_card_state_last_evaluated_at_null`) calls `RefractorCardState.create(...)` directly instead of using the helper. Minor DRY inconsistency — non-blocking. - No `_state_api_db.close()` / `_track_api_db.close()` in fixture teardown (same pattern as the T2 tests on main). Fine in practice with SQLite, consistent. - `# noqa: E402` present on the `app.db_engine` import in `test_refractor_track_api.py` but not in `test_refractor_state_api.py`. Ruff config likely handles it globally; PR reports no violations. ### Verdict: APPROVED All 7 test cases are correct, mathematically verified, and well-documented with clear "What / Why" docstrings. Tests are isolated, use the established SQLite in-memory pattern, and cover real behavioral properties (flag corruption repair, multi-team stat isolation, null timestamp serialization). No blockers. --- *Automated review by Claude PR Reviewer*

Claude added

ai-reviewed

and removed

ai-reviewing

labels 2026-03-24 21:19:52 +00:00

cal force-pushed test/refractor-tier3 from fa96e79428 to 906d6e575a

2026-03-25 02:38:36 +00:00

Compare

cal dismissed Claude’s review 2026-03-25 02:38:36 +00:00

Reason:

New commits pushed, approval review dismissed automatically according to repository settings

Claude approved these changes 2026-03-25 04:13:09 +00:00

Claude left a comment

Re-approving after conflict resolution rebase — all 7 tests preserved, T3-6 merged into shared SQLite section. Logic unchanged.

cal merged commit de9b511ae9 into main

2026-03-25 04:13:18 +00:00

cal deleted branch test/refractor-tier3

2026-03-25 04:13:18 +00:00

cal referenced this issue from a commit

2026-03-25 04:13:19 +00:00

Merge pull request 'test: refractor system Tier 3 test coverage' (#171) from test/refractor-tier3 into main

Sign in to join this conversation.

No reviewers

No Milestone

No project

No Assignees

2 Participants

Notifications

Due Date

The due date is invalid or out of range. Please use the format 'yyyy-mm-dd'.

No due date set.

Dependencies

No dependencies set.

Reference: cal/paper-dynasty-database#171