8.7 KiB
Retrosheet 2005 Data Migration - Session Summary
Date: November 8, 2025 Objective: Migrate retrosheet_data.py to work with new Retrosheet CSV format for 2005 season data
Overview
Successfully migrated the card creation system from the old Retrosheet CSV format to a new format, implementing a smart caching transformer that processes ~194k rows in 5 seconds on first run and <1 second on subsequent runs.
Major Changes Implemented
1. Retrosheet CSV Format Transformation (retrosheet_transformer.py)
Problem: New Retrosheet data source uses completely different column structure
- Old format:
event_type,hit_val,batted_ball_typeas single columns - New format: Boolean columns for each event type (
single,double,triple,hr,walk,k, etc.)
Solution: Created retrosheet_transformer.py with:
- Automatic format detection and conversion
- Smart caching with timestamp checking
- Explicit string dtype enforcement for
.straccessor compatibility - Case conversion (R/L → r/l) for handedness columns
Key Transformations:
# Event type derivation (priority order)
if hr=1 → 'home run'
elif triple=1 → 'triple'
elif double=1 → 'double'
elif single=1 → 'single'
elif walk=1 or iw=1 → 'walk'
elif k=1 → 'strikeout'
elif hbp=1 → 'hit by pitch'
else → 'generic out'
# Batted ball type
if fly=1 → 'f'
elif ground=1 → 'G'
elif line=1 → 'l'
# Hit value
if hr=1 → '4'
elif triple=1 → '3'
elif double=1 → '2'
elif single=1 → '1'
Performance:
- Initial transformation: ~5 seconds for 194k rows
- Cached loads: <1 second
- Cache file:
retrosheets_events_YYYY_normalized.csv
2. Defense CSV Column Standardization
Files Affected: All 10 defense CSV files (c, 1b, 2b, 3b, ss, lf, cf, rf, of, p)
Required Renames:
RF/9 → range_factor_per_nine
RF/G → range_factor_per_game
DP → DP_def
E → E_def
Ch → chances
Inn → Inn_def
CS% → caught_stealing_perc
Name-additional → key_bbref
Special Cases:
- Pitchers:
PO → pickoffs - Position players: Keep
POas putouts
File Naming: Changed from hyphens to underscores (defense-c.csv → defense_c.csv)
3. Configuration Updates for 2005 Data
Date Range (retrosheet_data.py):
START_DATE = 20050403 # 2005 Opening Day
END_DATE = 20051002 # 2005 Regular Season End
SEASON_PCT = 162 / 162 # Full season
Playing Time Thresholds (full season):
MIN_PA_VL = 50 # was 20 for live series
MIN_PA_VR = 75 # was 40 for live series
Cardset Configuration:
CARDSET_ID = 27 if 'live' in PLAYER_DESCRIPTION.lower() else 28
# Changed from: 22/23 (1998 cardsets)
4. Multi-Team Player Handling
Problem: Players traded during season create duplicate entries
- Example: Ryan Drese appears 3x (2TM, TEX, WSN)
- Causes
TypeError: cannot convert the series to <class 'int'>when accessing ratings
Solution: Filter after peripheral/running stats merge
duplicated_mask = df['key_bbref'].duplicated(keep=False)
if duplicated_mask.any():
multi_team_mask = df['Tm'].str.contains('TM', na=False)
df = df[~duplicated_mask | multi_team_mask]
Applied to: Both run_batters() and run_pitchers() functions
5. Dictionary Column Corruption Fix
Problem: Merging full card DataFrames corrupted ratings_vL and ratings_vR dictionary columns
- Error:
TypeError: 'float' object does not support item assignment
Solution: Only merge required columns
# Before (corrupts dict columns)
br = pd.merge(left=br, right=bc, ...)
# After (preserves dict columns)
br = pd.merge(left=br, right=bc[['key_bbref', 'player_id', 'battingcard_id']], ...)
Also fixed: Removed unnecessary .set_index('key_bbref') from card functions
6. Position Assignment Improvements
Problem: Existing players kept old positions (e.g., DH + 3B when should only show 3B)
Root Cause: Script only updated cost/rarity/image for existing players, not positions
Solution: Update ALL position slots when patching existing players
# Batters: Update all 8 position slots from defense stats
all_pos = get_player_record_pos(def_rat_df, row)
for x in enumerate(all_pos):
patch_params.append((f'pos_{x[0] + 1}', x[1]))
# Pitchers: Set position based on rating, clear unused slots
if starter_rating >= 4:
patch_params.append(('pos_1', 'SP'))
for i in range(2, 9):
patch_params.append((f'pos_{i}', None))
DH Rule: Now correctly only appears when player has NO defensive positions
7. Pandas Type Handling Fixes
String Type Enforcement (retrosheet_transformer.py):
# Ensure .str accessor works
transformed['hit_val'] = df.apply(transform_hit_val, axis=1).astype(str)
transformed['hit_location'] = df['loc'].astype(str)
transformed['batted_ball_type'] = df.apply(transform_batted_ball_type, axis=1).astype(str)
# Maintain types when loading from cache
dtype_dict = {
'game_id': 'str',
'hit_val': 'str',
'hit_location': 'str',
'batted_ball_type': 'str'
}
pd.read_csv(cache_path, dtype=dtype_dict, low_memory=False)
Pitcher OPS Calculation (pitchers/calcs_pitcher.py):
# Convert to float before min() to avoid "ambiguous truth value" error
ob_vl = float(108 * (df_data['BB_vL'] + df_data['HBP_vL']) / df_data['TBF_vL'])
all_other_ob = sanitize_chance_output(min(ob_vl, 0.8))
Column Name Conflicts (retrosheet_data.py):
# Drop duplicate 'Tm' from defense_p to prevent Tm_x/Tm_y creation
if 'Tm' in df_p.columns:
df_p = df_p.drop(columns=['Tm'])
Files Created
- retrosheet_transformer.py: Main transformation module with caching
- rename_defense_columns.py: Script to standardize defense CSV columns
- rename_additional_defense_columns.py: Second pass for additional column renames
- undo_po_rename.py: Reverted PO→pickoffs for position players
- test_retrosheet_integration.py: Integration test script
- test_nan_handling.py: (created during testing)
- retrosheet_transformer.py: Retrosheet data normalization script
Files Modified
-
retrosheet_data.py:
- Added transformer import
- Updated date/cardset configuration for 2005
- Added multi-team player filtering
- Fixed position assignment for existing players
- Added column conflict resolution
-
pitchers/calcs_pitcher.py:
- Fixed OPS calculation type handling
-
CLAUDE.md:
- Added Retrosheet section
- Added Defense CSV requirements
- Added Common Issues section
- Added Position Assignment Rules
- Added Configuration Checklist
Results
Successfully Generated:
- ✅ 335 qualified batters (2005 season)
- ✅ 129 qualified pitchers (2005 season)
- ✅ All batting cards, ratings, and defensive positions
- ✅ All pitching cards and ratings
Posted to:
- Database: pddev.manticorum.com (development)
- Cardset: 22 (initial run), configured for 27 going forward
Key Learnings
-
Always verify configuration before running:
- Check cardset_id matches target
- Verify database environment (dev vs prod)
- Confirm date ranges match data year
-
Pandas merge gotchas:
- Merging full DataFrames can corrupt object/dict columns
- Column name conflicts create
_xand_ysuffixes - Always specify only needed columns in merge operations
-
Type preservation matters:
- Explicitly set dtypes for string columns
- Persist dtypes when loading from cache
- Convert pandas values to Python types before operations like
min()
-
Position assignment logic:
- Always update ALL position slots to clear old data
- DH should only appear when no defensive positions exist
- Sort positions by innings played (primary position = most innings)
-
Multi-team players require special handling:
- Baseball Reference provides per-team AND combined stats
- Always use combined totals (2TM, 3TM, etc.)
- Filter duplicates AFTER all data merges complete
-
Caching strategy:
- Timestamp-based cache invalidation works well
- 5 seconds preprocessing vs <1 second cached is significant
- Store normalized data for consistency
Maintenance Notes
When adding new seasons:
- Update START_DATE/END_DATE for season year
- Verify CARDSET_ID for target cardset
- Ensure defense CSV files have correct column names
- Run transformer to generate normalized cache
- Check logs for cardset_id confirmation
When data format changes:
- Update retrosheet_transformer.py transformation logic
- Regenerate cache (delete
*_normalized.csvfiles) - Test with small date range first
- Verify all column dependencies still work
Performance Monitoring:
- Initial transformation: Should complete in <10 seconds for full season
- Cached loads: Should complete in <2 seconds
- Full pipeline: Batters + Pitchers ~1-2 minutes total