Fixed critical bug where all outfielders were incorrectly assigned as DH
due to defense CSV column mismatch in retrosheet_data.py:
- Lines 889, 926: Changed column check from 'in row' to 'in pos_df.columns'
to correctly detect bis_runs_total availability
- Line 947: Changed fallback column from the non-existent 'tz_runs_outfield'
to 'tz_runs_total', which does exist in Baseball Reference CSVs
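A minimal sketch of the column selection described above, not the actual code in retrosheet_data.py. The helper name pick_runs_column is hypothetical; only the column names bis_runs_total and tz_runs_total come from this commit. It checks membership against the DataFrame's column labels (for example pos_df.columns) rather than against a row object.

```python
def pick_runs_column(columns, preferred="bis_runs_total", fallback="tz_runs_total"):
    """Return the first defensive-runs column present in `columns`, else None.

    `columns` is any container of column labels, e.g. `pos_df.columns`.
    (Hypothetical helper, for illustration only.)
    """
    for name in (preferred, fallback):
        if name in columns:
            return name
    return None


# Typical call site (assumes a pandas DataFrame named pos_df):
#   runs_col = pick_runs_column(pos_df.columns)
```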
Impact:
- Before: 57 DH players, 0 outfielders
- After: 3 DH players, 62 outfielders (23 RF, 20 CF, 19 LF)
Added scripts/check_positions.sh:
- Validates position distribution after card generation
- Flags anomalous DH counts (>5 or >10%)
- Verifies outfield positions exist in cardpositions table
- Provides quick smoke test for defensive calculations
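The checks listed above could look roughly like this in Python. This is a sketch of the logic only: the real check_positions.sh is a shell script that reads the cardpositions table, and the function name, the input shape (a flat list of position codes), and the reading of ">10%" as a share of all listed positions are assumptions.

```python
from collections import Counter

OUTFIELD = ("LF", "CF", "RF")


def check_positions(positions, max_dh=5, max_dh_share=0.10):
    """Return a list of warnings for a suspicious position distribution.

    `positions` is an iterable of position codes, one per card position.
    (Hypothetical helper; thresholds mirror the ones named in this commit.)
    """
    counts = Counter(positions)
    total = sum(counts.values())
    warnings = []

    dh = counts["DH"]
    # Flag DH when either the absolute or the relative threshold is exceeded.
    if dh > max_dh or (total and dh / total > max_dh_share):
        warnings.append(f"anomalous DH count: {dh} of {total}")

    # Every outfield position should appear at least once.
    missing = [pos for pos in OUTFIELD if counts[pos] == 0]
    if missing:
        warnings.append("no players at: " + ", ".join(missing))

    return warnings
```

With the post-fix distribution from this commit (3 DH, 23 RF, 20 CF, 19 LF) the function returns no warnings; a set with 57 DH and no outfielders triggers both.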
Updated CLAUDE.md:
- Added Position Validation section with check_positions.sh usage
- Documented outfield position bug in Common Issues & Solutions
- Included code examples and verification steps
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
This commit adds support for the new Retrosheet CSV format and resolves
multiple data processing issues in retrosheet_data.py.
New Features:
- Created retrosheet_transformer.py with smart caching system
- Transforms new Retrosheet CSV format to legacy format
- Checks file timestamps to avoid redundant transformations
- Caches normalized data for instant subsequent loads (~5s → <1s)
- Handles column mapping: gid→game_id, bathand→batter_hand, etc.
- Derives event_type from multiple boolean columns
- Converts handedness values R/L → r/l
- Explicitly sets string dtypes for hit_val, hit_location, batted_ball_type
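A rough sketch of the timestamp-based caching and column normalization described above, using only the standard library. It is not the code in retrosheet_transformer.py: the function names are hypothetical, COLUMN_MAP contains only the two renames named in this commit, and the real transformer also derives event_type and sets dtypes, which is omitted here.

```python
import csv
import os
from pathlib import Path

# Only the two renames named in this commit; the real mapping is longer.
COLUMN_MAP = {"gid": "game_id", "bathand": "batter_hand"}


def cache_is_fresh(source: Path, cache: Path) -> bool:
    """True when the cache exists and is at least as new as the source file."""
    return cache.exists() and cache.stat().st_mtime >= source.stat().st_mtime


def normalize_row(row: dict) -> dict:
    """Rename columns to the legacy names and lowercase batter handedness."""
    out = {COLUMN_MAP.get(key, key): value for key, value in row.items()}
    if "batter_hand" in out:
        out["batter_hand"] = out["batter_hand"].lower()  # R/L -> r/l
    return out


def transform_if_stale(source: Path, cache: Path) -> bool:
    """Rewrite `cache` from `source` when needed; return True if it was rewritten."""
    if cache_is_fresh(source, cache):
        return False
    with source.open(newline="") as src:
        rows = [normalize_row(row) for row in csv.DictReader(src)]
    with cache.open("w", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=list(rows[0]) if rows else [])
        writer.writeheader()
        writer.writerows(rows)
    return True
```

Comparing modification times keeps the check cheap: the source file is parsed only when it is newer than the cached copy.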
Configuration Updates:
- Updated retrosheet_data.py for 2005 season data
- START_DATE: 19980301 → 20050403 (2005 Opening Day)
- END_DATE: 19980430 → 20051002 (2005 Regular Season End)
- SEASON_PCT: 28/162 → 162/162 (full season)
- MIN_PA_VL/VR: 20/40 → 50/75 (full season minimums)
- CARDSET_ID: Updated for 2005 cardsets
- EVENTS_FILENAME: Updated to use retrosheets_events_2005.csv
Bug Fixes:
1. Multi-team player duplicates
- Players traded mid-season had multiple rows (one per team plus a combined-totals row)
- Added filtering to keep only combined totals (2TM, 3TM, etc.)
- Prevents duplicate key_bbref values in ratings dataframes
2. Column name conflicts
- Fixed Tm column conflict when merging periph_stats and defense_p
- Drop duplicate Tm from defense data before merge
3. Pitcher rating calculations (pitchers/calcs_pitcher.py)
- Fixed "truth value is ambiguous" ValueError in min() comparisons
- Explicitly convert pandas values to float before min() operations
4. Dictionary column corruption in ratings
- Fixed ratings_vL and ratings_vR corruption during DataFrame merges
- Only merge specific columns (key_bbref, player_id, card_id) instead of full DataFrame
- Removed unnecessary .set_index() calls from post_batting_cards() and post_pitching_cards()
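Fixes 1, 2 and 4 above can be sketched with pandas as follows. These are illustrative helpers, not the code in retrosheet_data.py: the function names, the choice of key_bbref as the join key, the use of left joins, and which frame holds the dictionary columns are all assumptions. Only the column names Tm, key_bbref, player_id, and card_id and the frame names periph_stats and defense_p come from this commit.

```python
import pandas as pd


def keep_combined_rows(stats: pd.DataFrame) -> pd.DataFrame:
    """Fix 1: for traded players keep only the combined row (2TM, 3TM, ...)."""
    is_combined = stats["Tm"].astype(str).str.match(r"^\d+TM$")
    traded_ids = set(stats.loc[is_combined, "key_bbref"])
    # Keep combined rows, plus every row of players who have no combined row.
    keep = is_combined | ~stats["key_bbref"].isin(traded_ids)
    return stats[keep].reset_index(drop=True)


def merge_without_tm_clash(periph_stats: pd.DataFrame,
                           defense_p: pd.DataFrame) -> pd.DataFrame:
    """Fix 2: drop the duplicate Tm column from the defense data before merging."""
    return periph_stats.merge(defense_p.drop(columns=["Tm"]),
                              on="key_bbref", how="left")


def attach_card_ids(ratings: pd.DataFrame, cards: pd.DataFrame) -> pd.DataFrame:
    """Fix 4: merge only the ID columns, not the full DataFrame."""
    id_cols = ["key_bbref", "player_id", "card_id"]
    return ratings.merge(cards[id_cols], on="key_bbref", how="left")
```

Selecting the ID columns before the merge means no other column from the second frame can collide with, or be suffixed onto, the dictionary-valued ratings columns.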
Documentation:
- Updated CLAUDE.md with comprehensive troubleshooting section
- Added Retrosheet transformation documentation
- Documented defense CSV requirements and column naming
- Added configuration checklist for retrosheet_data.py
- Documented common issues: multi-team players, dictionary corruption, string types
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>