This commit adds support for the new Retrosheet CSV format and resolves multiple data processing issues in retrosheet_data.py.

New Features:
- Created retrosheet_transformer.py with smart caching system
  - Transforms new Retrosheet CSV format to legacy format
  - Checks file timestamps to avoid redundant transformations
  - Caches normalized data for instant subsequent loads (~5s → <1s)
  - Handles column mapping: gid→game_id, bathand→batter_hand, etc.
  - Derives event_type from multiple boolean columns
  - Converts handedness values R/L → r/l
  - Explicitly sets string dtypes for hit_val, hit_location, batted_ball_type

Configuration Updates:
- Updated retrosheet_data.py for 2005 season data
  - START_DATE: 19980301 → 20050403 (2005 Opening Day)
  - END_DATE: 19980430 → 20051002 (2005 Regular Season End)
  - SEASON_PCT: 28/162 → 162/162 (full season)
  - MIN_PA_VL/VR: 20/40 → 50/75 (full season minimums)
  - CARDSET_ID: Updated for 2005 cardsets
  - EVENTS_FILENAME: Updated to use retrosheets_events_2005.csv

Bug Fixes:
1. Multi-team player duplicates
   - Players traded during season had duplicate rows (one per team + combined)
   - Added filtering to keep only combined totals (2TM, 3TM, etc.)
   - Prevents duplicate key_bbref values in ratings dataframes
2. Column name conflicts
   - Fixed Tm column conflict when merging periph_stats and defense_p
   - Drop duplicate Tm from defense data before merge
3. Pitcher rating calculations (pitchers/calcs_pitcher.py)
   - Fixed "truth value is ambiguous" error in min() comparisons
   - Explicitly convert pandas values to float before min() operations
4. Dictionary column corruption in ratings
   - Fixed ratings_vL and ratings_vR corruption during DataFrame merges
   - Only merge specific columns (key_bbref, player_id, card_id) instead of full DataFrame
   - Removed unnecessary .set_index() calls from post_batting_cards() and post_pitching_cards()

Documentation:
- Updated CLAUDE.md with comprehensive troubleshooting section
- Added Retrosheet transformation documentation
- Documented defense CSV requirements and column naming
- Added configuration checklist for retrosheet_data.py
- Documented common issues: multi-team players, dictionary corruption, string types

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

186 lines
9.6 KiB
Markdown
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

This is a baseball card creation system for Paper Dynasty, a sports card simulation game. The system pulls real baseball statistics from FanGraphs and Baseball Reference, processes them through calculation algorithms, and generates statistical cards for players. All generated data is POSTed directly to the Paper Dynasty API, and cards are dynamically generated when accessed via card URLs (cached by nginx gateway).

## Key Architecture Components

### Core Modules
- **batters/**: Batting card creation with rating calculations (calcs_batter.py) and card generation (creation.py)
- **pitchers/**: Pitching card creation with ERA/WHIP calculations (calcs_pitcher.py) and card generation (creation.py)
- **defenders/**: Defensive rating calculations and fielding card generation (calcs_defense.py, creation.py)
- **db_calls.py**: Paper Dynasty API interface with authentication and CRUD operations
- **creation_helpers.py**: Shared utilities including D20 probability tables, stat normalization, and data sanitization

### Data Flow
1. **Input**: CSV files from FanGraphs/Baseball Reference placed in `data-input/[Year] [Type] Cardset/`
2. **Processing**: Statistics are normalized using league averages and converted to D20-based game mechanics
3. **Output**: Generated card data is POSTed directly to the Paper Dynasty API; cards are rendered on demand when their URLs are accessed

### Entry Points
- **live_series_update.py**: Main script for live season card updates (in-season cards)
- **retrosheet_data.py**: Main script for historical replay cardsets
- **refresh_cards.py**: Updates existing player card images and metadata
- **check_cards.py**: Validates card data and generates test outputs
- **scouting_batters.py** / **scouting_pitchers.py**: Generate scouting reports and ratings comparisons

## Common Commands

### Testing
```bash
pytest # Run all tests
pytest tests/test_*.py # Run test files matching the pattern
```

### Card Generation
```bash
python live_series_update.py # Generate live series cards
python retrosheet_data.py # Generate historical replay cards
python refresh_cards.py # Update existing card images
python check_cards.py # Validate card data
```

### Scouting Reports
```bash
python scouting_batters.py # Generate batting scouting data
python scouting_pitchers.py # Generate pitching scouting data
```

## Data Input Requirements

### FanGraphs Data (place in data-input/[YEAR] [TYPE] Cardset/)
- **vlhp-basic.csv** / **vlhp-rate.csv**: Stats vs left-handed pitching
- **vrhp-basic.csv** / **vrhp-rate.csv**: Stats vs right-handed pitching
- **vlhh-basic.csv** / **vlhh-rate.csv**: Stats vs left-handed hitters
- **vrhh-basic.csv** / **vrhh-rate.csv**: Stats vs right-handed hitters

### Baseball Reference Data
- **running.csv**: Baserunning statistics
- **pitching.csv**: Standard pitching statistics
- **defense_*.csv**: Defensive statistics for each position (c, 1b, 2b, 3b, ss, lf, cf, rf, of, p)

### Retrosheet Play-by-Play Data
- **retrosheet_transformer.py**: Converts the new Retrosheet CSV format to the legacy format, with timestamp-based caching
- Place source files in the `data-input/retrosheet/` directory
- The transformer compares file timestamps and re-processes only when the source is newer than the cache
- Normalized cache files are saved as `*_normalized.csv` for fast subsequent runs
- Performance: ~5 seconds for the initial transformation, <1 second for cached loads

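The timestamp check described above can be sketched as follows. This is a minimal illustration, not the actual code in retrosheet_transformer.py; the function name `needs_transform` is hypothetical.

```python
from pathlib import Path


def needs_transform(source: Path, cache: Path) -> bool:
    """Return True when the cache is missing or older than the source file.

    Hypothetical helper: the real transformer may structure this differently.
    """
    if not cache.exists():
        return True
    # Re-process only when the source was modified after the cache was written.
    return source.stat().st_mtime > cache.stat().st_mtime
```

A caller would run the full transformation when this returns `True` and otherwise load the `*_normalized.csv` file directly.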
### Defense CSV Requirements
All defense files must use underscore naming (`defense_c.csv`, not `defense-c.csv`) and include these standardized column names:
- `key_bbref`: Player identifier (required as index key)
- `Inn_def`: Innings played at position
- `chances`: Total fielding chances
- `E_def`: Errors
- `DP_def`: Double plays
- `fielding_perc`: Fielding percentage
- `tz_runs_total`: Total Zone runs saved
- `tz_runs_field`: Zone runs (fielding only)
- `tz_runs_infield`: Zone runs (infield only)
- `range_factor_per_nine`: Range factor per 9 innings
- `range_factor_per_game`: Range factor per game
- Catchers only: `caught_stealing_perc`, `pickoffs` (not PO)
- Position players: `PO` for putouts (not pickoffs)

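A quick pre-flight check of a defense file's header against the column list above might look like the sketch below. The function and constant names are hypothetical and not part of the repository. The sketch checks only the shared columns plus the catcher-only extras, and leaves the `PO` rule out.

```python
from typing import Iterable, List

# Column names taken from the list above.
COMMON_COLUMNS = [
    "key_bbref", "Inn_def", "chances", "E_def", "DP_def", "fielding_perc",
    "tz_runs_total", "tz_runs_field", "tz_runs_infield",
    "range_factor_per_nine", "range_factor_per_game",
]
CATCHER_COLUMNS = ["caught_stealing_perc", "pickoffs"]


def missing_defense_columns(header: Iterable[str], position: str) -> List[str]:
    """Return the required column names that are absent from a CSV header."""
    required = COMMON_COLUMNS + (CATCHER_COLUMNS if position == "c" else [])
    present = set(header)
    return [name for name in required if name not in present]
```

An empty list means the header has every required column for that position.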
### Minimum Playing Time Thresholds
- **Live Series**: 20 PA vs L / 40 PA vs R (batters), 20 TBF vs L / 40 TBF vs R (pitchers)
- **Season Cards**: 50 PA vs L / 75 PA vs R (batters), 50 TBF vs L / 75 TBF vs R (pitchers)

## Configuration

### Database Settings (db_calls.py)
- Production: `https://pd.manticorum.com/api`
- Development: `https://pddev.manticorum.com/api`
- Change `alt_database` variable to switch environments

### Live Series Settings (live_series_update.py)
- `SEASON`: Current year for live updates
- `CARDSET_NAME`: Target cardset (e.g., "2025 Live")
- `GAMES_PLAYED`: Season progress for live series calculations
- `IGNORE_LIMITS`: Override minimum playing time requirements

### Retrosheet Data Settings (retrosheet_data.py)
Before running retrosheet_data.py, verify these configuration settings:
- `PLAYER_DESCRIPTION`: 'Live' for season cards, or '<Month> PotM' for promotional cards
- `CARDSET_ID`: Correct cardset ID (e.g., 27 for 2005 Live, 28 for 2005 Promos)
- `START_DATE` / `END_DATE`: Date range in YYYYMMDD format matching your Retrosheet data
- `SEASON_PCT`: Fraction of the season covered (162/162 for a full season)
- `MIN_PA_VL` / `MIN_PA_VR`: Minimum plate appearances (50/75 for full season, 1/1 for promos)
- `DATA_INPUT_FILE_PATH`: Path to data directory (usually `data-input/[Year] [Type] Cardset/`)
- `EVENTS_FILENAME`: Retrosheet CSV filename (e.g., `retrosheets_events_2005.csv`)

**Configuration Checklist Before Running:**
1. Database environment (`alt_database` in db_calls.py)
2. Cardset ID matches intended target
3. Date range matches Retrosheet data year
4. Defense CSV files present and properly named
5. Running/pitching CSV files present

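The date-range item in the checklist can be verified mechanically. The sketch below assumes `START_DATE` and `END_DATE` hold YYYYMMDD values as integers or strings; `validate_date_range` is a hypothetical helper, not a function in retrosheet_data.py.

```python
from datetime import datetime
from typing import Union


def validate_date_range(start_date: Union[int, str], end_date: Union[int, str]) -> int:
    """Return the season year if both dates are valid YYYYMMDD values in one year."""
    # strptime raises ValueError for malformed or impossible dates.
    start = datetime.strptime(str(start_date), "%Y%m%d")
    end = datetime.strptime(str(end_date), "%Y%m%d")
    if start > end:
        raise ValueError("START_DATE is after END_DATE")
    if start.year != end.year:
        raise ValueError("START_DATE and END_DATE fall in different years")
    return start.year
```

For the 2005 configuration, `validate_date_range(20050403, 20051002)` returns `2005`, which can then be compared with the year in `EVENTS_FILENAME`.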
## Important Notes

- The system uses D20-based probability mechanics where statistics are converted to chances out of 20
- Cards are generated with both basic stats and advanced metrics (OPS, WHIP, etc.)
- Defensive ratings use zone-based fielding statistics from Baseball Reference
- All player data flows through Paper Dynasty's API with bearer token authentication
- Cards are dynamically rendered when accessed via URL, with nginx caching for performance

### Rarity Assignment System
- **rarity_thresholds.py**: Contains season-aware rarity thresholds (2024 vs 2025+)
- Rarity is calculated from `total_OPS` (batters) or OPS-against (pitchers) in the ratings dataframe
- `post_player_updates()` uses LEFT JOIN to preserve players without ratings (assigns Common/5 rarity + default OPS)
- Players missing ratings will log warnings showing player_id and card_id for troubleshooting
- Default OPS values: 0.612 (batters/Common), 0.702 (pitchers/Common reliever)

### Position Assignment Rules
- **Batters**: Positions assigned from defensive stats, sorted by innings played (most innings = pos_1)
- **DH Rule**: "DH" only appears when a player has NO defensive positions at all
- **Pitchers**: Assigned based on starter_rating (≥4 = SP, <4 = RP) and closer_rating (if present, add CP)
- **Position Updates**: Script updates ALL 8 position slots when patching existing players to clear old data
- Player cards can be viewed as HTML by adding `html=true` to the card URL: `https://pddev.manticorum.com/api/v2/players/{id}/battingcard?d={date}&html=true`

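The pitcher rule above reduces to a small function. This is an illustrative sketch: `pitcher_positions` is a hypothetical name, and it assumes a missing closer rating is represented as `None`.

```python
from typing import List, Optional


def pitcher_positions(starter_rating: int, closer_rating: Optional[int] = None) -> List[str]:
    """Apply the SP/RP threshold, then append CP when a closer rating exists."""
    positions = ["SP" if starter_rating >= 4 else "RP"]
    if closer_rating is not None:
        positions.append("CP")
    return positions
```

For example, a pitcher with `starter_rating=1` and a closer rating present would be assigned `["RP", "CP"]`.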
## Common Issues and Solutions

### Multi-Team Players (Traded During Season)
**Problem**: Players traded during season appear multiple times in Baseball Reference data (one row per team + combined total marked as "2TM", "3TM", etc.)

**Solution**: Script automatically filters to keep only combined season totals:
- Detects duplicate `key_bbref` values after merging peripheral/running stats
- Keeps rows where `Tm` column contains "TM" (2TM, 3TM, etc.)
- Removes individual team rows to prevent duplicate player entries

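One way to express the filter described above in pandas is sketched here. It is an approximation of the approach, not the script's actual code. `keep_combined_rows` is a hypothetical name, and the sample rows are invented.

```python
import pandas as pd


def keep_combined_rows(stats: pd.DataFrame) -> pd.DataFrame:
    """For players with several rows, keep only the combined-total row."""
    is_duplicated = stats["key_bbref"].duplicated(keep=False)
    is_combined = stats["Tm"].astype(str).str.contains("TM", regex=False)
    # Keep every single-team player, plus the 2TM/3TM row of traded players.
    return stats[~is_duplicated | is_combined].reset_index(drop=True)


# Invented sample data: one traded player (three rows) and one single-team player.
stats = pd.DataFrame(
    {
        "key_bbref": ["playera01", "playera01", "playera01", "playerb01"],
        "Tm": ["2TM", "BOS", "NYY", "SEA"],
        "PA": [600, 350, 250, 480],
    }
)
filtered = keep_combined_rows(stats)
```

After filtering, `key_bbref` is unique: the traded player keeps only the `2TM` row and the single-team player is untouched.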
### Dictionary Column Corruption in Ratings
**Problem**: When merging full card DataFrames with ratings DataFrames, pandas corrupts `ratings_vL` and `ratings_vR` dictionary columns, converting them to floats/NaN.

**Solution**: Only merge specific columns needed (`key_bbref`, `player_id`, `battingcard_id`/`pitchingcard_id`) instead of entire DataFrame.

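The narrow merge can be sketched as below. The frames, IDs, and rating keys are invented placeholders. The sketch shows only the shape of the fix and makes no claim about the exact cause of the corruption in the original code.

```python
import pandas as pd

# Invented sample data; the IDs and rating keys are placeholders.
cards = pd.DataFrame(
    {
        "key_bbref": ["playera01", "playerb01"],
        "player_id": [101, 102],
        "battingcard_id": [9001, 9002],
        "name": ["Player A", "Player B"],
    }
)
ratings = pd.DataFrame(
    {
        "key_bbref": ["playera01", "playerb01"],
        "ratings_vL": [{"homerun": 1.5}, {"homerun": 0.5}],
        "ratings_vR": [{"homerun": 2.0}, {"homerun": 1.0}],
    }
)

# Bring over only the identifier columns, leaving the dictionary columns alone.
ID_COLUMNS = ["key_bbref", "player_id", "battingcard_id"]
merged = ratings.merge(cards[ID_COLUMNS], on="key_bbref", how="left")
```

Because only the identifier columns cross the merge, `ratings_vL` and `ratings_vR` stay as the same dictionary objects they were before.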
### No Players Found After Successful Run
**Symptoms**: Script completes successfully but API query returns 0 players

**Common Causes**:
1. **Wrong Cardset**: Check logs for actual cardset_id used vs. cardset queried in API
2. **Wrong Database**: Verify `alt_database` setting in db_calls.py (dev vs production)
3. **Date Mismatch**: START_DATE/END_DATE don't match Retrosheet data year
4. **Empty PROMO_INCLUSION_RETRO_IDS**: When PLAYER_DESCRIPTION is a promo name, this list must contain player IDs

**Debugging Steps**:
1. Check logs for actual POST operations and player_id values
2. Verify cardset_id in logs matches API query
3. Check database URL in logs matches intended environment
4. Query API with cardset_id from logs to find players

### String Type Issues with Retrosheet Data
**Problem**: The pandas `.str` accessor fails on the `hit_val`, `hit_location`, and `batted_ball_type` columns

**Solution**: retrosheet_transformer.py explicitly converts these columns to a string dtype and preserves that type when loading from cache by passing the `dtype` parameter to `pd.read_csv()`

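A minimal sketch of loading with explicit string dtypes follows. The CSV content is invented for illustration, and the transformer's actual dtype choice may differ (for example `str` rather than pandas' `"string"` dtype).

```python
import io

import pandas as pd

STRING_COLUMNS = ["hit_val", "hit_location", "batted_ball_type"]

# Invented sample rows; without the dtype argument, hit_val and hit_location
# would be inferred as numeric and the .str accessor would raise.
csv_text = "game_id,hit_val,hit_location,batted_ball_type\nG1,1,8,G\nG2,,,\n"

events = pd.read_csv(
    io.StringIO(csv_text),
    dtype={column: "string" for column in STRING_COLUMNS},
)
# The .str accessor now works; missing values stay missing instead of raising.
starts_with_one = events["hit_val"].str.startswith("1")
```

The same `dtype` mapping would be passed when reading the `*_normalized.csv` cache so the types survive a round trip through disk.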
### Pitcher OPS Calculation Errors
**Problem**: `min()` function fails with "truth value is ambiguous" error when calculating OB values

**Solution**: Explicitly convert pandas values to Python floats before using `min()`:
```python
# Assumes df_data is a single pitcher's row, so each lookup yields one value
ob_vl = float(108 * (df_data['BB_vL'] + df_data['HBP_vL']) / df_data['TBF_vL'])
result = min(ob_vl, 0.8) # Now works correctly
```