Retrosheet 2005 Data Migration - Session Summary

Date: November 8, 2025

Objective: Migrate retrosheet_data.py to work with the new Retrosheet CSV format for 2005 season data

Overview

The card creation system was migrated from the old Retrosheet CSV format to the new one. A caching transformer processes ~194k rows in about 5 seconds on the first run and loads the cached result in under 1 second on subsequent runs.

Major Changes Implemented

1. Retrosheet CSV Format Transformation (retrosheet_transformer.py)

Problem: The new Retrosheet data source uses a completely different column structure

  • Old format: event_type, hit_val, batted_ball_type as single columns
  • New format: Boolean columns for each event type (single, double, triple, hr, walk, k, etc.)

Solution: Created retrosheet_transformer.py with:

  • Automatic format detection and conversion
  • Smart caching with timestamp checking
  • Explicit string dtype enforcement for .str accessor compatibility
  • Case conversion (R/L → r/l) for handedness columns

Key Transformations:

# Event type derivation (priority order)
if hr=1 → 'home run'
elif triple=1 → 'triple'
elif double=1 → 'double'
elif single=1 → 'single'
elif walk=1 or iw=1 → 'walk'
elif k=1 → 'strikeout'
elif hbp=1 → 'hit by pitch'
else → 'generic out'

# Batted ball type
if fly=1 → 'f'
elif ground=1 → 'G'
elif line=1 → 'l'

# Hit value
if hr=1 → '4'
elif triple=1 → '3'
elif double=1 → '2'
elif single=1 → '1'
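The priority rules above can be sketched as a vectorized pandas function. This is a minimal illustration, not the code in retrosheet_transformer.py: the function name `derive_event_columns` is hypothetical, and the empty-string default for rows with no batted ball type or hit value is an assumption, since the summary does not say what those rows receive.

```python
import numpy as np
import pandas as pd


def derive_event_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Derive event_type, batted_ball_type and hit_val from 0/1 flag columns.

    Hypothetical helper; the empty-string defaults are an assumption.
    """
    def flag(col: str) -> np.ndarray:
        # Boolean mask for rows where the flag column equals 1
        return (df[col] == 1).to_numpy()

    out = pd.DataFrame(index=df.index)

    # np.select picks the choice for the FIRST matching condition,
    # which reproduces the if/elif priority order.
    out["event_type"] = np.select(
        [flag("hr"), flag("triple"), flag("double"), flag("single"),
         flag("walk") | flag("iw"), flag("k"), flag("hbp")],
        ["home run", "triple", "double", "single",
         "walk", "strikeout", "hit by pitch"],
        default="generic out",
    )
    out["batted_ball_type"] = np.select(
        [flag("fly"), flag("ground"), flag("line")],
        ["f", "G", "l"],
        default="",  # assumption: no batted ball recorded
    )
    out["hit_val"] = np.select(
        [flag("hr"), flag("triple"), flag("double"), flag("single")],
        ["4", "3", "2", "1"],
        default="",  # assumption: not a hit
    )
    return out
```

Compared with a row-wise `df.apply(..., axis=1)`, building the masks once and selecting in bulk avoids a Python-level call per row, which may matter at ~194k rows.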

Performance:

  • Initial transformation: ~5 seconds for 194k rows
  • Cached loads: <1 second
  • Cache file: retrosheets_events_YYYY_normalized.csv
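The timestamp-checked caching described in this section could look roughly like the sketch below. It assumes the cache is considered fresh when its modification time is not older than the source file's. The function name `load_with_cache` and its parameters are hypothetical and are not taken from retrosheet_transformer.py.

```python
from pathlib import Path
from typing import Callable, Iterable

import pandas as pd


def load_with_cache(
    source_path: str,
    cache_path: str,
    transform: Callable[[pd.DataFrame], pd.DataFrame],
    str_columns: Iterable[str],
) -> pd.DataFrame:
    """Return transformed data, reusing the cache when it is up to date.

    Hypothetical helper: freshness is judged by file modification time.
    """
    source, cache = Path(source_path), Path(cache_path)

    # Cache hit: the normalized file exists and is at least as new as the source
    if cache.exists() and cache.stat().st_mtime >= source.stat().st_mtime:
        dtypes = {col: "str" for col in str_columns}
        return pd.read_csv(cache, dtype=dtypes, low_memory=False)

    # Cache miss: transform the raw data and store the result for next time
    transformed = transform(pd.read_csv(source, low_memory=False))
    transformed.to_csv(cache, index=False)
    return transformed
```

Deleting the cache file, or updating the source file, forces the transformation to run again on the next call.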

2. Defense CSV Column Standardization

Files Affected: All 10 defense CSV files (c, 1b, 2b, 3b, ss, lf, cf, rf, of, p)

Required Renames:

RF/9 → range_factor_per_nine
RF/G → range_factor_per_game
DP → DP_def
E → E_def
Ch → chances
Inn → Inn_def
CS% → caught_stealing_perc
Name-additional → key_bbref

Special Cases:

  • Pitchers: PO → pickoffs
  • Position players: Keep PO as putouts

File Naming: Changed from hyphens to underscores (defense-c.csv → defense_c.csv)
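The renames above, including the pitcher-only PO special case, can be expressed as a single mapping. The sketch below is illustrative only: `COMMON_RENAMES` and `standardize_defense_columns` are hypothetical names, not the contents of rename_defense_columns.py.

```python
import pandas as pd

# Renames that apply to every defense file (taken from the list above)
COMMON_RENAMES = {
    "RF/9": "range_factor_per_nine",
    "RF/G": "range_factor_per_game",
    "DP": "DP_def",
    "E": "E_def",
    "Ch": "chances",
    "Inn": "Inn_def",
    "CS%": "caught_stealing_perc",
    "Name-additional": "key_bbref",
}


def standardize_defense_columns(df: pd.DataFrame, position: str) -> pd.DataFrame:
    """Rename defense columns; PO becomes pickoffs for pitchers only."""
    renames = dict(COMMON_RENAMES)
    if position == "p":
        renames["PO"] = "pickoffs"  # for position players PO stays as putouts
    # DataFrame.rename silently skips keys that are not present,
    # so files lacking a column (e.g. CS% outside catchers) are unaffected.
    return df.rename(columns=renames)
```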

3. Configuration Updates for 2005 Data

Date Range (retrosheet_data.py):

START_DATE = 20050403   # 2005 Opening Day
END_DATE = 20051002     # 2005 Regular Season End
SEASON_PCT = 162 / 162  # Full season

Playing Time Thresholds (full season):

MIN_PA_VL = 50   # was 20 for live series
MIN_PA_VR = 75   # was 40 for live series

Cardset Configuration:

CARDSET_ID = 27 if 'live' in PLAYER_DESCRIPTION.lower() else 28
# Changed from: 22/23 (1998 cardsets)

4. Multi-Team Player Handling

Problem: Players traded during season create duplicate entries

  • Example: Ryan Drese appears 3x (2TM, TEX, WSN)
  • Causes TypeError: cannot convert the series to <class 'int'> when accessing ratings

Solution: Filter after peripheral/running stats merge

duplicated_mask = df['key_bbref'].duplicated(keep=False)
if duplicated_mask.any():
    multi_team_mask = df['Tm'].str.contains('TM', na=False)
    df = df[~duplicated_mask | multi_team_mask]

Applied to: Both run_batters() and run_pitchers() functions
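To make the effect of the filter concrete, here is a small self-contained run on toy data. The `key_bbref` values and the second player are made up for illustration, and `keep_combined_rows` is a hypothetical wrapper around the same two masks.

```python
import pandas as pd


def keep_combined_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Keep single-team rows, plus the combined (2TM, 3TM, ...) row of traded players."""
    duplicated_mask = df["key_bbref"].duplicated(keep=False)  # every row of a repeated key
    multi_team_mask = df["Tm"].str.contains("TM", na=False)   # combined-total rows
    return df[~duplicated_mask | multi_team_mask]


# Toy data: one traded player (three rows) and one single-team player
stats = pd.DataFrame(
    {
        "key_bbref": ["drese01", "drese01", "drese01", "other01"],  # made-up keys
        "Tm": ["2TM", "TEX", "WSN", "BOS"],
    }
)
filtered = keep_combined_rows(stats)
# filtered now holds one row per player: the 2TM row and the BOS row
```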

5. Dictionary Column Corruption Fix

Problem: Merging full card DataFrames corrupted ratings_vL and ratings_vR dictionary columns

  • Error: TypeError: 'float' object does not support item assignment

Solution: Only merge required columns

# Before (corrupts dict columns)
br = pd.merge(left=br, right=bc, ...)

# After (preserves dict columns)
br = pd.merge(left=br, right=bc[['key_bbref', 'player_id', 'battingcard_id']], ...)

Also fixed: Removed unnecessary .set_index('key_bbref') from card functions
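The summary does not state the exact mechanism of the corruption. One plausible way it can happen, shown here as an assumption, is that both frames carry a column with the same name, so pandas renames them with `_x`/`_y` suffixes and the original dict column is no longer where later code expects it. The toy frames, keys, and the `on`/`how` arguments below are illustrative.

```python
import pandas as pd

# Left frame: ratings stored as dicts (toy values)
br = pd.DataFrame(
    {
        "key_bbref": ["a01", "b01"],
        "ratings_vL": [{"single": 5.0}, {"single": 3.5}],
    }
)
# Right frame: card records that also happen to carry a ratings_vL column
bc = pd.DataFrame(
    {
        "key_bbref": ["a01", "b01"],
        "player_id": [101, 102],
        "battingcard_id": [9001, 9002],
        "ratings_vL": [None, None],
    }
)

# Full merge: the overlapping column is split into ratings_vL_x / ratings_vL_y
full = pd.merge(left=br, right=bc, on="key_bbref", how="left")

# Narrow merge: only the needed columns come across, dict column untouched
narrow = pd.merge(
    left=br,
    right=bc[["key_bbref", "player_id", "battingcard_id"]],
    on="key_bbref",
    how="left",
)

# Item assignment on the dict still works after the narrow merge
narrow.loc[0, "ratings_vL"]["double"] = 1.2
```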

6. Position Assignment Improvements

Problem: Existing players kept old positions (e.g., DH + 3B when only 3B should be shown)

Root Cause: Script only updated cost/rarity/image for existing players, not positions

Solution: Update ALL position slots when patching existing players

# Batters: Update all 8 position slots from defense stats
all_pos = get_player_record_pos(def_rat_df, row)
for slot, pos in enumerate(all_pos, start=1):
    patch_params.append((f'pos_{slot}', pos))

# Pitchers: Set position based on rating, clear unused slots
if starter_rating >= 4:
    patch_params.append(('pos_1', 'SP'))
    for i in range(2, 9):
        patch_params.append((f'pos_{i}', None))

DH Rule: DH now appears only when a player has NO defensive positions
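The "update every slot" idea and the DH rule can be combined in one small helper. This is a sketch under assumptions: `build_position_params` is a hypothetical function, the input is assumed to be a list of position strings already sorted by innings played, and the real `get_player_record_pos` may return something different.

```python
from typing import List, Optional, Tuple


def build_position_params(
    positions: List[str], total_slots: int = 8
) -> List[Tuple[str, Optional[str]]]:
    """Build (field, value) pairs for every position slot.

    Unused slots are set to None so stale values from an earlier run are cleared.
    A player with no defensive positions gets DH in the first slot.
    """
    ordered = list(positions[:total_slots]) or ["DH"]
    padded = ordered + [None] * (total_slots - len(ordered))
    return [(f"pos_{slot}", pos) for slot, pos in enumerate(padded, start=1)]


# A player who previously showed DH + 3B is reduced to 3B, with slots 2-8 cleared
params = build_position_params(["3B"])
```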

7. Pandas Type Handling Fixes

String Type Enforcement (retrosheet_transformer.py):

# Ensure .str accessor works
transformed['hit_val'] = df.apply(transform_hit_val, axis=1).astype(str)
transformed['hit_location'] = df['loc'].astype(str)
transformed['batted_ball_type'] = df.apply(transform_batted_ball_type, axis=1).astype(str)

# Maintain types when loading from cache
dtype_dict = {
    'game_id': 'str',
    'hit_val': 'str',
    'hit_location': 'str',
    'batted_ball_type': 'str'
}
pd.read_csv(cache_path, dtype=dtype_dict, low_memory=False)
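The reason the dtype mapping matters on cache loads is that a CSV stores no type information, so digit-only columns are inferred as integers when read back. A short demonstration, using made-up game IDs and an in-memory CSV:

```python
import io

import pandas as pd

# Toy cache contents; the game_id values are made up
csv_text = (
    "game_id,hit_val,hit_location\n"
    "GAME0001,1,7\n"
    "GAME0002,2,8\n"
)

# Without dtype hints, hit_val is inferred as an integer column
inferred = pd.read_csv(io.StringIO(csv_text))

# With dtype hints, the columns stay strings and .str methods work
typed = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"game_id": "str", "hit_val": "str", "hit_location": "str"},
)
is_single = typed["hit_val"].str.contains("1")

try:
    inferred["hit_val"].str.contains("1")
    str_accessor_failed = False
except AttributeError:
    # pandas only exposes .str on string-like columns
    str_accessor_failed = True
```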

Pitcher OPS Calculation (pitchers/calcs_pitcher.py):

# Convert to float before min() to avoid "ambiguous truth value" error
ob_vl = float(108 * (df_data['BB_vL'] + df_data['HBP_vL']) / df_data['TBF_vL'])
all_other_ob = sanitize_chance_output(min(ob_vl, 0.8))
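The "ambiguous truth value" error appears when Python's built-in `min()` compares a pandas Series with a scalar. The sketch below reproduces it on a one-row toy frame and shows one way to extract a plain float first. The numbers are made up, `sanitize_chance_output` is omitted, and `.iloc[0]` is used here as an assumption about how the single value would be pulled out.

```python
import pandas as pd

# One-row toy frame; column names follow the snippet above, values are made up
df_data = pd.DataFrame({"BB_vL": [30], "HBP_vL": [4], "TBF_vL": [400]})

# Arithmetic on DataFrame columns yields a Series, even with a single row
ratio = (df_data["BB_vL"] + df_data["HBP_vL"]) / df_data["TBF_vL"]

try:
    min(ratio, 0.8)  # min() needs a single True/False from the comparison
    raised = False
except ValueError:
    raised = True

# Extract the scalar, convert to a Python float, then cap it
value = float(ratio.iloc[0])
capped = min(value, 0.8)
```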

Column Name Conflicts (retrosheet_data.py):

# Drop duplicate 'Tm' from defense_p to prevent Tm_x/Tm_y creation
if 'Tm' in df_p.columns:
    df_p = df_p.drop(columns=['Tm'])
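The suffix behavior is easy to see on toy frames. The column values and the `pitching` frame below are made up, and the `on`/`how` arguments are assumptions about how the real merge is called.

```python
import pandas as pd

pitching = pd.DataFrame({"key_bbref": ["p01"], "Tm": ["TEX"], "IP": [120.0]})
df_p = pd.DataFrame({"key_bbref": ["p01"], "Tm": ["TEX"], "pickoffs": [2]})

# Both frames have 'Tm', so pandas disambiguates with _x / _y suffixes
conflicted = pitching.merge(df_p, on="key_bbref", how="left")

# Dropping the redundant column first keeps a single 'Tm'
if "Tm" in df_p.columns:
    df_p = df_p.drop(columns=["Tm"])
clean = pitching.merge(df_p, on="key_bbref", how="left")
```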

Files Created

  1. retrosheet_transformer.py: Main transformation module (Retrosheet data normalization with caching)
  2. rename_defense_columns.py: Script to standardize defense CSV columns
  3. rename_additional_defense_columns.py: Second pass for additional column renames
  4. undo_po_rename.py: Reverts the PO → pickoffs rename for position players
  5. test_retrosheet_integration.py: Integration test script
  6. test_nan_handling.py: (created during testing)

Files Modified

  1. retrosheet_data.py:

    • Added transformer import
    • Updated date/cardset configuration for 2005
    • Added multi-team player filtering
    • Fixed position assignment for existing players
    • Added column conflict resolution
  2. pitchers/calcs_pitcher.py:

    • Fixed OPS calculation type handling
  3. CLAUDE.md:

    • Added Retrosheet section
    • Added Defense CSV requirements
    • Added Common Issues section
    • Added Position Assignment Rules
    • Added Configuration Checklist

Results

Successfully Generated:

  • 335 qualified batters (2005 season)
  • 129 qualified pitchers (2005 season)
  • All batting cards, ratings, and defensive positions
  • All pitching cards and ratings

Posted to:

  • Database: pddev.manticorum.com (development)
  • Cardset: 22 (initial run), configured for 27 going forward

Key Learnings

  1. Always verify configuration before running:

    • Check cardset_id matches target
    • Verify database environment (dev vs prod)
    • Confirm date ranges match data year
  2. Pandas merge gotchas:

    • Merging full DataFrames can corrupt object/dict columns
    • Column name conflicts create _x and _y suffixes
    • Always specify only needed columns in merge operations
  3. Type preservation matters:

    • Explicitly set dtypes for string columns
    • Persist dtypes when loading from cache
    • Convert pandas values to Python types before operations like min()
  4. Position assignment logic:

    • Always update ALL position slots to clear old data
    • DH should only appear when no defensive positions exist
    • Sort positions by innings played (primary position = most innings)
  5. Multi-team players require special handling:

    • Baseball Reference provides per-team AND combined stats
    • Always use combined totals (2TM, 3TM, etc.)
    • Filter duplicates AFTER all data merges complete
  6. Caching strategy:

    • Timestamp-based cache invalidation works well
    • Roughly 5 seconds of preprocessing versus <1 second from cache is a significant saving
    • Store normalized data for consistency

Maintenance Notes

When adding new seasons:

  1. Update START_DATE/END_DATE for season year
  2. Verify CARDSET_ID for target cardset
  3. Ensure defense CSV files have correct column names
  4. Run transformer to generate normalized cache
  5. Check logs for cardset_id confirmation

When data format changes:

  1. Update retrosheet_transformer.py transformation logic
  2. Regenerate cache (delete *_normalized.csv files)
  3. Test with small date range first
  4. Verify all column dependencies still work

Performance Monitoring:

  • Initial transformation: Should complete in <10 seconds for full season
  • Cached loads: Should complete in <2 seconds
  • Full pipeline: Batters + Pitchers ~1-2 minutes total