# Retrosheet 2005 Data Migration - Session Summary **Date**: November 8, 2025 **Objective**: Migrate retrosheet_data.py to work with new Retrosheet CSV format for 2005 season data ## Overview Successfully migrated the card creation system from the old Retrosheet CSV format to a new format, implementing a smart caching transformer that processes ~194k rows in 5 seconds on first run and <1 second on subsequent runs. ## Major Changes Implemented ### 1. Retrosheet CSV Format Transformation (`retrosheet_transformer.py`) **Problem**: New Retrosheet data source uses completely different column structure - Old format: `event_type`, `hit_val`, `batted_ball_type` as single columns - New format: Boolean columns for each event type (`single`, `double`, `triple`, `hr`, `walk`, `k`, etc.) **Solution**: Created `retrosheet_transformer.py` with: - Automatic format detection and conversion - Smart caching with timestamp checking - Explicit string dtype enforcement for `.str` accessor compatibility - Case conversion (R/L → r/l) for handedness columns **Key Transformations**: ```python # Event type derivation (priority order) if hr=1 → 'home run' elif triple=1 → 'triple' elif double=1 → 'double' elif single=1 → 'single' elif walk=1 or iw=1 → 'walk' elif k=1 → 'strikeout' elif hbp=1 → 'hit by pitch' else → 'generic out' # Batted ball type if fly=1 → 'f' elif ground=1 → 'G' elif line=1 → 'l' # Hit value if hr=1 → '4' elif triple=1 → '3' elif double=1 → '2' elif single=1 → '1' ``` **Performance**: - Initial transformation: ~5 seconds for 194k rows - Cached loads: <1 second - Cache file: `retrosheets_events_YYYY_normalized.csv` ### 2. Defense CSV Column Standardization **Files Affected**: All 10 defense CSV files (c, 1b, 2b, 3b, ss, lf, cf, rf, of, p) **Required Renames**: ``` RF/9 → range_factor_per_nine RF/G → range_factor_per_game DP → DP_def E → E_def Ch → chances Inn → Inn_def CS% → caught_stealing_perc Name-additional → key_bbref ``` **Special Cases**: - Pitchers: `PO → pickoffs` - Position players: Keep `PO` as putouts **File Naming**: Changed from hyphens to underscores (`defense-c.csv` → `defense_c.csv`) ### 3. Configuration Updates for 2005 Data **Date Range** (retrosheet_data.py): ```python START_DATE = 20050403 # 2005 Opening Day END_DATE = 20051002 # 2005 Regular Season End SEASON_PCT = 162 / 162 # Full season ``` **Playing Time Thresholds** (full season): ```python MIN_PA_VL = 50 # was 20 for live series MIN_PA_VR = 75 # was 40 for live series ``` **Cardset Configuration**: ```python CARDSET_ID = 27 if 'live' in PLAYER_DESCRIPTION.lower() else 28 # Changed from: 22/23 (1998 cardsets) ``` ### 4. Multi-Team Player Handling **Problem**: Players traded during season create duplicate entries - Example: Ryan Drese appears 3x (2TM, TEX, WSN) - Causes `TypeError: cannot convert the series to ` when accessing ratings **Solution**: Filter after peripheral/running stats merge ```python duplicated_mask = df['key_bbref'].duplicated(keep=False) if duplicated_mask.any(): multi_team_mask = df['Tm'].str.contains('TM', na=False) df = df[~duplicated_mask | multi_team_mask] ``` **Applied to**: Both `run_batters()` and `run_pitchers()` functions ### 5. Dictionary Column Corruption Fix **Problem**: Merging full card DataFrames corrupted `ratings_vL` and `ratings_vR` dictionary columns - Error: `TypeError: 'float' object does not support item assignment` **Solution**: Only merge required columns ```python # Before (corrupts dict columns) br = pd.merge(left=br, right=bc, ...) # After (preserves dict columns) br = pd.merge(left=br, right=bc[['key_bbref', 'player_id', 'battingcard_id']], ...) ``` **Also fixed**: Removed unnecessary `.set_index('key_bbref')` from card functions ### 6. Position Assignment Improvements **Problem**: Existing players kept old positions (e.g., DH + 3B when should only show 3B) **Root Cause**: Script only updated cost/rarity/image for existing players, not positions **Solution**: Update ALL position slots when patching existing players ```python # Batters: Update all 8 position slots from defense stats all_pos = get_player_record_pos(def_rat_df, row) for x in enumerate(all_pos): patch_params.append((f'pos_{x[0] + 1}', x[1])) # Pitchers: Set position based on rating, clear unused slots if starter_rating >= 4: patch_params.append(('pos_1', 'SP')) for i in range(2, 9): patch_params.append((f'pos_{i}', None)) ``` **DH Rule**: Now correctly only appears when player has NO defensive positions ### 7. Pandas Type Handling Fixes **String Type Enforcement** (retrosheet_transformer.py): ```python # Ensure .str accessor works transformed['hit_val'] = df.apply(transform_hit_val, axis=1).astype(str) transformed['hit_location'] = df['loc'].astype(str) transformed['batted_ball_type'] = df.apply(transform_batted_ball_type, axis=1).astype(str) # Maintain types when loading from cache dtype_dict = { 'game_id': 'str', 'hit_val': 'str', 'hit_location': 'str', 'batted_ball_type': 'str' } pd.read_csv(cache_path, dtype=dtype_dict, low_memory=False) ``` **Pitcher OPS Calculation** (pitchers/calcs_pitcher.py): ```python # Convert to float before min() to avoid "ambiguous truth value" error ob_vl = float(108 * (df_data['BB_vL'] + df_data['HBP_vL']) / df_data['TBF_vL']) all_other_ob = sanitize_chance_output(min(ob_vl, 0.8)) ``` **Column Name Conflicts** (retrosheet_data.py): ```python # Drop duplicate 'Tm' from defense_p to prevent Tm_x/Tm_y creation if 'Tm' in df_p.columns: df_p = df_p.drop(columns=['Tm']) ``` ## Files Created 1. **retrosheet_transformer.py**: Main transformation module with caching 2. **rename_defense_columns.py**: Script to standardize defense CSV columns 3. **rename_additional_defense_columns.py**: Second pass for additional column renames 4. **undo_po_rename.py**: Reverted PO→pickoffs for position players 5. **test_retrosheet_integration.py**: Integration test script 6. **test_nan_handling.py**: (created during testing) 7. **retrosheet_transformer.py**: Retrosheet data normalization script ## Files Modified 1. **retrosheet_data.py**: - Added transformer import - Updated date/cardset configuration for 2005 - Added multi-team player filtering - Fixed position assignment for existing players - Added column conflict resolution 2. **pitchers/calcs_pitcher.py**: - Fixed OPS calculation type handling 3. **CLAUDE.md**: - Added Retrosheet section - Added Defense CSV requirements - Added Common Issues section - Added Position Assignment Rules - Added Configuration Checklist ## Results **Successfully Generated**: - ✅ 335 qualified batters (2005 season) - ✅ 129 qualified pitchers (2005 season) - ✅ All batting cards, ratings, and defensive positions - ✅ All pitching cards and ratings **Posted to**: - Database: pddev.manticorum.com (development) - Cardset: 22 (initial run), configured for 27 going forward ## Key Learnings 1. **Always verify configuration before running**: - Check cardset_id matches target - Verify database environment (dev vs prod) - Confirm date ranges match data year 2. **Pandas merge gotchas**: - Merging full DataFrames can corrupt object/dict columns - Column name conflicts create `_x` and `_y` suffixes - Always specify only needed columns in merge operations 3. **Type preservation matters**: - Explicitly set dtypes for string columns - Persist dtypes when loading from cache - Convert pandas values to Python types before operations like `min()` 4. **Position assignment logic**: - Always update ALL position slots to clear old data - DH should only appear when no defensive positions exist - Sort positions by innings played (primary position = most innings) 5. **Multi-team players require special handling**: - Baseball Reference provides per-team AND combined stats - Always use combined totals (2TM, 3TM, etc.) - Filter duplicates AFTER all data merges complete 6. **Caching strategy**: - Timestamp-based cache invalidation works well - 5 seconds preprocessing vs <1 second cached is significant - Store normalized data for consistency ## Maintenance Notes **When adding new seasons**: 1. Update START_DATE/END_DATE for season year 2. Verify CARDSET_ID for target cardset 3. Ensure defense CSV files have correct column names 4. Run transformer to generate normalized cache 5. Check logs for cardset_id confirmation **When data format changes**: 1. Update retrosheet_transformer.py transformation logic 2. Regenerate cache (delete `*_normalized.csv` files) 3. Test with small date range first 4. Verify all column dependencies still work **Performance Monitoring**: - Initial transformation: Should complete in <10 seconds for full season - Cached loads: Should complete in <2 seconds - Full pipeline: Batters + Pitchers ~1-2 minutes total