Root cause: post_positions() was upserting cardpositions, leaving stale DH entries from the previous buggy run where outfielders had no defensive positions.

Solution: Modified post_positions() to DELETE all existing cardpositions for the cardset before posting new ones. This ensures:
- Stale DH positions are removed when players gain defensive positions
- Cards show only current, accurate positions
- No phantom positions persist across script runs

Example: Ichiro previously had both "RF" and "DH" cardpositions. With this fix, only "RF" remains after re-running the script.

Updated CLAUDE.md with explanation of the cleanup logic.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
Project Overview
This is a baseball card creation system for Paper Dynasty, a sports card simulation game. The system pulls real baseball statistics from FanGraphs and Baseball Reference, processes them through calculation algorithms, and generates statistical cards for players. All generated data is POSTed directly to the Paper Dynasty API, and cards are dynamically generated when accessed via card URLs (cached by nginx gateway).
Key Architecture Components
Core Modules
- batters/: Batting card creation with rating calculations (calcs_batter.py) and card generation (creation.py)
- pitchers/: Pitching card creation with ERA/WHIP calculations (calcs_pitcher.py) and card generation (creation.py)
- defenders/: Defensive rating calculations and fielding card generation (calcs_defense.py, creation.py)
- db_calls.py: Paper Dynasty API interface with authentication and CRUD operations
- creation_helpers.py: Shared utilities including D20 probability tables, stat normalization, and data sanitization
Data Flow
- Input: CSV files from FanGraphs/Baseball Reference placed in data-input/[Year] [Type] Cardset/
- Processing: Statistics are normalized using league averages and converted to D20-based game mechanics
- Output: Generated card data is POSTed directly to Paper Dynasty API; cards rendered on-demand when URLs accessed
Entry Points
- live_series_update.py: Main script for live season card updates (in-season cards)
- retrosheet_data.py: Main script for historical replay cardsets
- refresh_cards.py: Updates existing player card images and metadata
- check_cards.py: Validates card data and generates test outputs
- check_cards_and_upload.py: Fetches card images from API and uploads to AWS S3 with cache-busting URLs
- scouting_batters.py / scouting_pitchers.py: Generate scouting reports and ratings comparisons
Common Commands
Testing
pytest # Run all tests
pytest tests/test_*.py # Run specific test file
Card Generation
python live_series_update.py # Generate live series cards
python retrosheet_data.py # Generate historical replay cards
python refresh_cards.py # Update existing card images
python check_cards.py # Validate card data
Scouting Reports
python scouting_batters.py # Generate batting scouting data
python scouting_pitchers.py # Generate pitching scouting data
AWS S3 Card Upload
python check_cards_and_upload.py # Fetch cards from API and upload to S3
Analysis and Reporting
python analyze_cardset_rarity.py # Analyze players by franchise and rarity (batters/pitchers/combined)
python rank_pitching_staffs.py # Rank teams 1-30 by pitching staff quality
Position Validation
# Verify position assignments after card generation (recommended after every run)
./scripts/check_positions.sh <cardset_id> [api_url]
# Examples:
./scripts/check_positions.sh 27 # Check production
./scripts/check_positions.sh 27 https://pddev.manticorum.com/api # Check dev
# The script flags:
# - Anomalous DH counts (should be <5 for full-season cards)
# - Missing outfield positions (indicates defensive calculation failures)
# - Mismatches between player positions and cardpositions table
Data Input Requirements
FanGraphs Data (place in data-input/[YEAR] [TYPE] Cardset/)
- vlhp-basic.csv / vlhp-rate.csv: vs Left-handed Pitching stats
- vrhp-basic.csv / vrhp-rate.csv: vs Right-handed Pitching stats
- vlhh-basic.csv / vlhh-rate.csv: vs Left-handed Hitting stats
- vrhh-basic.csv / vrhh-rate.csv: vs Right-handed Hitting stats
Baseball Reference Data
- running.csv: Baserunning statistics
- pitching.csv: Standard pitching statistics
- defense_*.csv: Defensive statistics for each position (c, 1b, 2b, 3b, ss, lf, cf, rf, of, p)
Retrosheet Play-by-Play Data
- retrosheet_transformer.py: Preprocesses new Retrosheet CSV format to legacy format with smart caching
- Place source files in data-input/retrosheet/ directory
- Transformer automatically checks timestamps and only re-processes if source is newer than cache
- Normalized cache files saved as *_normalized.csv for fast subsequent runs
- Performance: ~5 seconds for initial transformation, <1 second for cached loads
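The timestamp check described above can be sketched as follows. This is a minimal illustration, not the code in retrosheet_transformer.py: the function names `needs_refresh` and `cache_path_for` are hypothetical, and the sketch assumes the cache file sits next to the source file.

```python
from pathlib import Path


def cache_path_for(source: Path) -> Path:
    """Hypothetical naming helper: events.csv -> events_normalized.csv."""
    return source.with_name(f"{source.stem}_normalized.csv")


def needs_refresh(source: Path, cache: Path) -> bool:
    """Return True when the cache is missing or older than the source file."""
    if not cache.exists():
        return True
    # A source modified after the cache was written invalidates the cache.
    return source.stat().st_mtime > cache.stat().st_mtime
```

A caller would re-run the transformation only when `needs_refresh(...)` is true and otherwise load the `*_normalized.csv` file directly.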
Defense CSV Requirements
All defense files must use underscore naming (defense_c.csv, not defense-c.csv) and include these standardized column names:
- key_bbref: Player identifier (required as index key)
- Inn_def: Innings played at position
- chances: Total fielding chances
- E_def: Errors
- DP_def: Double plays
- fielding_perc: Fielding percentage
- tz_runs_total: Total Zone runs saved
- tz_runs_field: Zone runs (fielding only)
- tz_runs_infield: Zone runs (infield only)
- range_factor_per_nine: Range factor per 9 innings
- range_factor_per_game: Range factor per game
- Catchers only: caught_stealing_perc, pickoffs (not PO)
- Position players: PO for putouts (not pickoffs)
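A quick header check before a run can catch a misnamed column early. The sketch below is a hypothetical helper, not part of the repository; it takes the column names from the list above and assumes every non-catcher file is treated as a position-player file that needs PO.

```python
import csv

# Column names copied from the requirements list; the helper is illustrative.
REQUIRED_COLUMNS = {
    "key_bbref", "Inn_def", "chances", "E_def", "DP_def", "fielding_perc",
    "tz_runs_total", "tz_runs_field", "tz_runs_infield",
    "range_factor_per_nine", "range_factor_per_game",
}
CATCHER_COLUMNS = {"caught_stealing_perc", "pickoffs"}


def missing_defense_columns(csv_file, position: str) -> set:
    """Return the required column names that are absent from the CSV header."""
    header = next(csv.reader(csv_file), [])
    required = set(REQUIRED_COLUMNS)
    # Assumption: "c" marks the catcher file; all other files need PO.
    required |= CATCHER_COLUMNS if position == "c" else {"PO"}
    return required - set(header)
```

Passing an open file handle for defense_c.csv with position "c" returns an empty set when the header is complete.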
Minimum Playing Time Thresholds
- Live Series: 20 PA vs L / 40 PA vs R (batters), 20 TBF vs L / 40 TBF vs R (pitchers)
- Season Cards: 50 PA vs L / 75 PA vs R (batters), 50 TBF vs L / 75 TBF vs R (pitchers)
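As a rough illustration, the thresholds can be expressed as a lookup plus a check. The names below are hypothetical, and the sketch assumes a player must clear both splits to qualify; the `ignore_limits` flag mirrors the IGNORE_LIMITS setting in spirit only.

```python
# (minimum vs L, minimum vs R); the same numbers apply to PA and TBF.
MIN_PLAYING_TIME = {
    "live": (20, 40),
    "season": (50, 75),
}


def meets_minimum(vs_left: int, vs_right: int, card_type: str,
                  ignore_limits: bool = False) -> bool:
    """Check PA (batters) or TBF (pitchers) against the card type's minimums."""
    if ignore_limits:
        return True
    min_left, min_right = MIN_PLAYING_TIME[card_type]
    # Assumption: both splits must reach their minimum.
    return vs_left >= min_left and vs_right >= min_right
```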
Configuration
Database Settings (db_calls.py)
- Production: https://pd.manticorum.com/api
- Development: https://pddev.manticorum.com/api
- Change alt_database variable to switch environments
Live Series Settings (live_series_update.py)
- SEASON: Current year for live updates
- CARDSET_NAME: Target cardset (e.g., "2025 Live")
- GAMES_PLAYED: Season progress for live series calculations
- IGNORE_LIMITS: Override minimum playing time requirements
Retrosheet Data Settings (retrosheet_data.py)
Before running retrosheet_data.py, verify these configuration settings:
- PLAYER_DESCRIPTION: 'Live' for season cards, or ' PotM' for promotional cards
- CARDSET_ID: Correct cardset ID (e.g., 27 for 2005 Live, 28 for 2005 Promos)
- START_DATE/END_DATE: Date range in YYYYMMDD format matching your Retrosheet data
- SEASON_PCT: Percentage of season completed (162/162 for full season)
- MIN_PA_VL/MIN_PA_VR: Minimum plate appearances (50/75 for full season, 1/1 for promos)
- DATA_INPUT_FILE_PATH: Path to data directory (usually data-input/[Year] [Type] Cardset/)
- EVENTS_FILENAME: Retrosheet CSV filename (e.g., retrosheets_events_2005.csv)
Configuration Checklist Before Running:
- Database environment (alt_database in db_calls.py)
- Cardset ID matches intended target
- Date range matches Retrosheet data year
- Defense CSV files present and properly named
- Running/pitching CSV files present
AWS S3 Upload Settings (check_cards_and_upload.py)
- CARDSET_NAME: Target cardset name to fetch players from (e.g., "2005 Live")
- START_ID: Optional player_id to start from (useful for resuming uploads)
- TEST_COUNT: Limit number of cards to process (set to None for all cards)
- HTML_CARDS: Set to True to fetch HTML preview cards instead of PNG
- UPLOAD_TO_S3: Enable/disable S3 upload (True for production)
- UPDATE_PLAYER_URLS: Enable/disable updating player records with S3 URLs (careful - modifies database)
- AWS_BUCKET_NAME: S3 bucket name (default: 'paper-dynasty')
- AWS_REGION: AWS region (default: 'us-east-1')
S3 URL Structure: cards/cardset-{cardset_id:03d}/player-{player_id}/{batting|pitching}card.png?d={release_date}
- Uses zero-padded 3-digit cardset ID for consistent sorting
- Includes cache-busting query parameter with date (YYYY-M-D format)
- Uses persistent aiohttp session for efficient connection reuse
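The URL structure above maps onto a small formatting helper. This is a sketch only: the function name `card_url_path` is hypothetical, and it assumes that "YYYY-M-D" means the month and day are not zero-padded.

```python
from datetime import date


def card_url_path(cardset_id: int, player_id: int, card_type: str,
                  release_date: date) -> str:
    """Build the card path with its cache-busting ?d= query parameter."""
    if card_type not in ("batting", "pitching"):
        raise ValueError(f"unknown card type: {card_type!r}")
    # Assumption: YYYY-M-D, so March 7 becomes 3-7 rather than 03-07.
    stamp = f"{release_date.year}-{release_date.month}-{release_date.day}"
    # :03d zero-pads the cardset ID to three digits for consistent sorting.
    return (f"cards/cardset-{cardset_id:03d}/player-{player_id}/"
            f"{card_type}card.png?d={stamp}")
```

For example, `card_url_path(27, 1234, "batting", date(2025, 3, 7))` yields `cards/cardset-027/player-1234/battingcard.png?d=2025-3-7`.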
AWS Credentials: Requires AWS CLI configured with credentials (~/.aws/credentials) and appropriate IAM permissions:
- s3:PutObject, s3:GetObject, s3:ListBucket on the target bucket
Important Notes
- The system uses D20-based probability mechanics where statistics are converted to chances out of 20
- Cards are generated with both basic stats and advanced metrics (OPS, WHIP, etc.)
- Defensive ratings use zone-based fielding statistics from Baseball Reference
- All player data flows through Paper Dynasty's API with bearer token authentication
- Cards are dynamically rendered when accessed via URL, with nginx caching for performance
Rarity Assignment System
- rarity_thresholds.py: Contains season-aware rarity thresholds (2024 vs 2025+)
- Rarity is calculated from total_OPS (batters) or OPS-against (pitchers) in the ratings dataframe
- post_player_updates() uses LEFT JOIN to preserve players without ratings (assigns Common/5 rarity + default OPS)
- Players missing ratings will log warnings showing player_id and card_id for troubleshooting
- Default OPS values: 0.612 (batters/Common), 0.702 (pitchers/Common reliever)
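The left-join behaviour can be pictured in plain Python, without pandas. This is not the body of post_player_updates(); `attach_ratings` and the dictionary shapes are hypothetical, and the sketch assumes "Common/5" means rarity ID 5.

```python
import logging

logger = logging.getLogger(__name__)

COMMON_RARITY_ID = 5  # assumption: "Common/5" refers to rarity ID 5
DEFAULT_OPS = {"batter": 0.612, "pitcher": 0.702}


def attach_ratings(players, ratings_by_id, kind):
    """Left-join style merge: every player is kept, rated or not."""
    merged = []
    for player in players:
        rating = ratings_by_id.get(player["player_id"])
        if rating is None:
            # Mirror the documented warning so missing ratings are traceable.
            logger.warning("No ratings for player_id=%s card_id=%s; using defaults",
                           player["player_id"], player.get("card_id"))
            rating = {"total_OPS": DEFAULT_OPS[kind],
                      "rarity_id": COMMON_RARITY_ID}
        merged.append({**player, **rating})
    return merged
```

An inner join would silently drop the unrated player; keeping the row with defaults makes the gap visible in both the logs and the API.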
Position Assignment Rules
- Batters: Positions assigned from defensive stats, sorted by innings played (most innings = pos_1)
- DH Rule: "DH" only appears when a player has NO defensive positions at all
- Pitchers: Assigned based on starter_rating (≥4 = SP, <4 = RP) and closer_rating (if present, add CP)
- Position Updates: Script updates ALL 8 position slots when patching existing players to clear old data
- Player cards can be viewed as HTML by adding html=true to the card URL: https://pddev.manticorum.com/api/v2/players/{id}/battingcard?d={date}&html=true
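The batter and pitcher rules above can be sketched as two small functions. The names and input shapes are hypothetical, and the sketch assumes that "has a defensive position" means more than zero innings recorded there.

```python
def batter_positions(innings_by_position: dict) -> list:
    """Order positions by innings played; fall back to DH when there are none."""
    played = [pos for pos, inn in innings_by_position.items() if inn > 0]
    if not played:
        # DH Rule: only used when the player has no defensive positions at all.
        return ["DH"]
    return sorted(played, key=lambda pos: innings_by_position[pos], reverse=True)


def pitcher_positions(starter_rating: int, closer_rating=None) -> list:
    """SP when starter_rating >= 4, otherwise RP; add CP if a closer rating exists."""
    positions = ["SP" if starter_rating >= 4 else "RP"]
    if closer_rating is not None:
        positions.append("CP")
    return positions
```

The first element of the returned list corresponds to pos_1, so an outfielder with 900 innings in RF and 100 in LF gets RF as pos_1 and never receives DH.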
Common Issues and Solutions
Multi-Team Players (Traded During Season)
Problem: Players traded during season appear multiple times in Baseball Reference data (one row per team + combined total marked as "2TM", "3TM", etc.)
Solution: Script automatically filters to keep only combined season totals:
- Detects duplicate key_bbref values after merging peripheral/running stats
- Keeps rows where Tm column contains "TM" (2TM, 3TM, etc.)
- Removes individual team rows to prevent duplicate player entries
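The filtering steps above can be illustrated with plain dictionaries. The real script works on pandas DataFrames, so this is a sketch of the logic rather than the actual implementation, and `keep_combined_totals` is a hypothetical name.

```python
from collections import Counter


def keep_combined_totals(rows: list) -> list:
    """Drop per-team rows for traded players, keeping the 2TM/3TM totals."""
    counts = Counter(row["key_bbref"] for row in rows)
    kept = []
    for row in rows:
        is_duplicate = counts[row["key_bbref"]] > 1
        # Only duplicated players are filtered; single-team rows pass through.
        if is_duplicate and "TM" not in row["Tm"]:
            continue
        kept.append(row)
    return kept
```

Players who stayed with one team all season appear once and are left untouched, so the check on Tm only affects traded players.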
Dictionary Column Corruption in Ratings
Problem: When merging full card DataFrames with ratings DataFrames, pandas corrupts ratings_vL and ratings_vR dictionary columns, converting them to floats/NaN.
Solution: Only merge specific columns needed (key_bbref, player_id, battingcard_id/pitchingcard_id) instead of entire DataFrame.
No Players Found After Successful Run
Symptoms: Script completes successfully but API query returns 0 players
Common Causes:
- Wrong Cardset: Check logs for actual cardset_id used vs. cardset queried in API
- Wrong Database: Verify alt_database setting in db_calls.py (dev vs production)
- Date Mismatch: START_DATE/END_DATE don't match Retrosheet data year
- Empty PROMO_INCLUSION_RETRO_IDS: When PLAYER_DESCRIPTION is a promo name, this list must contain player IDs
Debugging Steps:
- Check logs for actual POST operations and player_id values
- Verify cardset_id in logs matches API query
- Check database URL in logs matches intended environment
- Query API with cardset_id from logs to find players
String Type Issues with Retrosheet Data
Problem: Pandas .str accessor fails on hit_val, hit_location, batted_ball_type columns
Solution: retrosheet_transformer.py explicitly converts these to string dtype and maintains type when loading from cache using dtype parameter in pd.read_csv()
Pitcher OPS Calculation Errors
Problem: min() function fails with "truth value is ambiguous" error when calculating OB values
Solution: Explicitly convert pandas values to Python floats before using min():
ob_vl = float(108 * (df_data['BB_vL'] + df_data['HBP_vL']) / df_data['TBF_vL'])
result = min(ob_vl, 0.8) # Now works correctly
Outfielders Assigned as DH (Defense Column Mismatch)
Problem: All outfielders show pos_1 = "DH" instead of LF/CF/RF; cardpositions table has 0 outfield positions
Root Cause: Code checks for bis_runs_outfield or tz_runs_outfield columns in defense CSV files, but Baseball Reference only provides tz_runs_total
Symptoms:
- 50+ players with DH as pos_1 (should be <5 for full season)
- No LF/CF/RF positions in player records
- Log errors: "Outfield position failed: 'tz_runs_outfield'"
Solution (retrosheet_data.py lines 889, 926, 947):
# Wrong - checks batter stats row instead of defense dataframe columns
if 'tz_runs_total' in row: # ❌
# Correct - checks defense dataframe for actual column
if 'bis_runs_total' in pos_df.columns: # ✅
# Wrong - column doesn't exist in CSV
of_run_rating = 'bis_runs_outfield' if 'bis_runs_outfield' in pos_df else 'tz_runs_outfield' # ❌
# Correct - fallback to column that exists
of_run_rating = 'bis_runs_outfield' if 'bis_runs_outfield' in pos_df.columns else 'tz_runs_total' # ✅
Verification: Run ./scripts/check_positions.sh <cardset_id> after card generation to catch this issue
Additional Fix: Modified post_positions() to DELETE all existing cardpositions for the cardset before posting new ones. This prevents stale DH positions from remaining in the database when players gain defensive positions after bug fixes.
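The difference between the old upsert behaviour and the delete-then-post fix can be seen with an in-memory stand-in for the cardpositions table. Everything here is hypothetical (the function names, the tuple key, the player ID); the real post_positions() talks to the Paper Dynasty API.

```python
def upsert_positions(table: dict, cardset_id: int, new_rows: list) -> None:
    """Old behaviour: rows are inserted or updated, never removed."""
    for row in new_rows:
        table[(cardset_id, row["player_id"], row["position"])] = row


def replace_positions(table: dict, cardset_id: int, new_rows: list) -> None:
    """New behaviour: clear the cardset's rows, then post the fresh set."""
    for key in [k for k in table if k[0] == cardset_id]:
        del table[key]
    upsert_positions(table, cardset_id, new_rows)


# A stale DH row left behind by the earlier buggy run (player ID is made up).
stale = {(27, 51, "DH"): {"player_id": 51, "position": "DH"}}
fresh = [{"player_id": 51, "position": "RF"}]

old_table = dict(stale)
upsert_positions(old_table, 27, fresh)   # DH survives alongside RF

new_table = dict(stale)
replace_positions(new_table, 27, fresh)  # only RF remains
```

The delete is scoped to one cardset, so positions belonging to other cardsets are not touched.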