paper-dynasty-card-creation/RETROSHEET_2005_MIGRATION.md

# Retrosheet 2005 Data Migration - Session Summary

**Date**: November 8, 2025
**Objective**: Migrate retrosheet_data.py to work with new Retrosheet CSV format for 2005 season data

## Overview

Successfully migrated the card creation system from the old Retrosheet CSV format to a new format, implementing a smart caching transformer that processes ~194k rows in 5 seconds on first run and <1 second on subsequent runs.

## Major Changes Implemented

### 1. Retrosheet CSV Format Transformation (`retrosheet_transformer.py`)

**Problem**: New Retrosheet data source uses completely different column structure
- Old format: `event_type`, `hit_val`, `batted_ball_type` as single columns
- New format: Boolean columns for each event type (`single`, `double`, `triple`, `hr`, `walk`, `k`, etc.)

**Solution**: Created `retrosheet_transformer.py` with:
- Automatic format detection and conversion
- Smart caching with timestamp checking
- Explicit string dtype enforcement for `.str` accessor compatibility
- Case conversion (R/L → r/l) for handedness columns

**Key Transformations**:
```python
# Event type derivation (priority order)
if hr=1 → 'home run'
elif triple=1 → 'triple'
elif double=1 → 'double'
elif single=1 → 'single'
elif walk=1 or iw=1 → 'walk'
elif k=1 → 'strikeout'
elif hbp=1 → 'hit by pitch'
else → 'generic out'

# Batted ball type
if fly=1 → 'f'
elif ground=1 → 'G'
elif line=1 → 'l'

# Hit value
if hr=1 → '4'
elif triple=1 → '3'
elif double=1 → '2'
elif single=1 → '1'
```

**Performance**:
- Initial transformation: ~5 seconds for 194k rows
- Cached loads: <1 second
- Cache file: `retrosheets_events_YYYY_normalized.csv`

### 2. Defense CSV Column Standardization

**Files Affected**: All 10 defense CSV files (c, 1b, 2b, 3b, ss, lf, cf, rf, of, p)

**Required Renames**:
```
RF/9 → range_factor_per_nine
RF/G → range_factor_per_game
DP → DP_def
E → E_def
Ch → chances
Inn → Inn_def
CS% → caught_stealing_perc
Name-additional → key_bbref
```

**Special Cases**:
- Pitchers: `PO → pickoffs`
- Position players: Keep `PO` as putouts

**File Naming**: Changed from hyphens to underscores (`defense-c.csv` → `defense_c.csv`)

### 3. Configuration Updates for 2005 Data

**Date Range** (retrosheet_data.py):
```python
START_DATE = 20050403   # 2005 Opening Day
END_DATE = 20051002     # 2005 Regular Season End
SEASON_PCT = 162 / 162  # Full season
```

**Playing Time Thresholds** (full season):
```python
MIN_PA_VL = 50   # was 20 for live series
MIN_PA_VR = 75   # was 40 for live series
```

**Cardset Configuration**:
```python
CARDSET_ID = 27 if 'live' in PLAYER_DESCRIPTION.lower() else 28
# Changed from: 22/23 (1998 cardsets)
```

### 4. Multi-Team Player Handling

**Problem**: Players traded during season create duplicate entries
- Example: Ryan Drese appears 3x (2TM, TEX, WSN)
- Causes `TypeError: cannot convert the series to <class 'int'>` when accessing ratings

**Solution**: Filter after peripheral/running stats merge
```python
duplicated_mask = df['key_bbref'].duplicated(keep=False)
if duplicated_mask.any():
    multi_team_mask = df['Tm'].str.contains('TM', na=False)
    df = df[~duplicated_mask | multi_team_mask]
```

**Applied to**: Both `run_batters()` and `run_pitchers()` functions

### 5. Dictionary Column Corruption Fix

**Problem**: Merging full card DataFrames corrupted `ratings_vL` and `ratings_vR` dictionary columns
- Error: `TypeError: 'float' object does not support item assignment`

**Solution**: Only merge required columns
```python
# Before (corrupts dict columns)
br = pd.merge(left=br, right=bc, ...)

# After (preserves dict columns)
br = pd.merge(left=br, right=bc[['key_bbref', 'player_id', 'battingcard_id']], ...)
```

**Also fixed**: Removed unnecessary `.set_index('key_bbref')` from card functions

### 6. Position Assignment Improvements

**Problem**: Existing players kept old positions (e.g., DH + 3B when should only show 3B)

**Root Cause**: Script only updated cost/rarity/image for existing players, not positions

**Solution**: Update ALL position slots when patching existing players
```python
# Batters: Update all 8 position slots from defense stats
all_pos = get_player_record_pos(def_rat_df, row)
for x in enumerate(all_pos):
    patch_params.append((f'pos_{x[0] + 1}', x[1]))

# Pitchers: Set position based on rating, clear unused slots
if starter_rating >= 4:
    patch_params.append(('pos_1', 'SP'))
    for i in range(2, 9):
        patch_params.append((f'pos_{i}', None))
```

**DH Rule**: Now correctly only appears when player has NO defensive positions

### 7. Pandas Type Handling Fixes

**String Type Enforcement** (retrosheet_transformer.py):
```python
# Ensure .str accessor works
transformed['hit_val'] = df.apply(transform_hit_val, axis=1).astype(str)
transformed['hit_location'] = df['loc'].astype(str)
transformed['batted_ball_type'] = df.apply(transform_batted_ball_type, axis=1).astype(str)

# Maintain types when loading from cache
dtype_dict = {
    'game_id': 'str',
    'hit_val': 'str',
    'hit_location': 'str',
    'batted_ball_type': 'str'
}
pd.read_csv(cache_path, dtype=dtype_dict, low_memory=False)
```

**Pitcher OPS Calculation** (pitchers/calcs_pitcher.py):
```python
# Convert to float before min() to avoid "ambiguous truth value" error
ob_vl = float(108 * (df_data['BB_vL'] + df_data['HBP_vL']) / df_data['TBF_vL'])
all_other_ob = sanitize_chance_output(min(ob_vl, 0.8))
```

**Column Name Conflicts** (retrosheet_data.py):
```python
# Drop duplicate 'Tm' from defense_p to prevent Tm_x/Tm_y creation
if 'Tm' in df_p.columns:
    df_p = df_p.drop(columns=['Tm'])
```

## Files Created

1. **retrosheet_transformer.py**: Main transformation module with caching
2. **rename_defense_columns.py**: Script to standardize defense CSV columns
3. **rename_additional_defense_columns.py**: Second pass for additional column renames
4. **undo_po_rename.py**: Reverted PO→pickoffs for position players
5. **test_retrosheet_integration.py**: Integration test script
6. **test_nan_handling.py**: (created during testing)
7. **retrosheet_transformer.py**: Retrosheet data normalization script

## Files Modified

1. **retrosheet_data.py**:
   - Added transformer import
   - Updated date/cardset configuration for 2005
   - Added multi-team player filtering
   - Fixed position assignment for existing players
   - Added column conflict resolution

2. **pitchers/calcs_pitcher.py**:
   - Fixed OPS calculation type handling

3. **CLAUDE.md**:
   - Added Retrosheet section
   - Added Defense CSV requirements
   - Added Common Issues section
   - Added Position Assignment Rules
   - Added Configuration Checklist

## Results

**Successfully Generated**:
- ✅ 335 qualified batters (2005 season)
- ✅ 129 qualified pitchers (2005 season)
- ✅ All batting cards, ratings, and defensive positions
- ✅ All pitching cards and ratings

**Posted to**:
- Database: pddev.manticorum.com (development)
- Cardset: 22 (initial run), configured for 27 going forward

## Key Learnings

1. **Always verify configuration before running**:
   - Check cardset_id matches target
   - Verify database environment (dev vs prod)
   - Confirm date ranges match data year

2. **Pandas merge gotchas**:
   - Merging full DataFrames can corrupt object/dict columns
   - Column name conflicts create `_x` and `_y` suffixes
   - Always specify only needed columns in merge operations

3. **Type preservation matters**:
   - Explicitly set dtypes for string columns
   - Persist dtypes when loading from cache
   - Convert pandas values to Python types before operations like `min()`

4. **Position assignment logic**:
   - Always update ALL position slots to clear old data
   - DH should only appear when no defensive positions exist
   - Sort positions by innings played (primary position = most innings)

5. **Multi-team players require special handling**:
   - Baseball Reference provides per-team AND combined stats
   - Always use combined totals (2TM, 3TM, etc.)
   - Filter duplicates AFTER all data merges complete

6. **Caching strategy**:
   - Timestamp-based cache invalidation works well
   - 5 seconds preprocessing vs <1 second cached is significant
   - Store normalized data for consistency

## Maintenance Notes

**When adding new seasons**:
1. Update START_DATE/END_DATE for season year
2. Verify CARDSET_ID for target cardset
3. Ensure defense CSV files have correct column names
4. Run transformer to generate normalized cache
5. Check logs for cardset_id confirmation

**When data format changes**:
1. Update retrosheet_transformer.py transformation logic
2. Regenerate cache (delete `*_normalized.csv` files)
3. Test with small date range first
4. Verify all column dependencies still work

**Performance Monitoring**:
- Initial transformation: Should complete in <10 seconds for full season
- Cached loads: Should complete in <2 seconds
- Full pipeline: Batters + Pitchers ~1-2 minutes total