paper-dynasty-card-creation/RETROSHEET_2005_MIGRATION.md
2025-11-08 16:57:35 -06:00

275 lines
8.7 KiB
Markdown

# Retrosheet 2005 Data Migration - Session Summary
**Date**: November 8, 2025
**Objective**: Migrate retrosheet_data.py to work with new Retrosheet CSV format for 2005 season data
## Overview
Successfully migrated the card creation system from the old Retrosheet CSV format to a new format, implementing a smart caching transformer that processes ~194k rows in 5 seconds on first run and <1 second on subsequent runs.
## Major Changes Implemented
### 1. Retrosheet CSV Format Transformation (`retrosheet_transformer.py`)
**Problem**: New Retrosheet data source uses completely different column structure
- Old format: `event_type`, `hit_val`, `batted_ball_type` as single columns
- New format: Boolean columns for each event type (`single`, `double`, `triple`, `hr`, `walk`, `k`, etc.)
**Solution**: Created `retrosheet_transformer.py` with:
- Automatic format detection and conversion
- Smart caching with timestamp checking
- Explicit string dtype enforcement for `.str` accessor compatibility
- Case conversion (R/L r/l) for handedness columns
**Key Transformations**:
```python
# Event type derivation (priority order)
if hr=1 'home run'
elif triple=1 'triple'
elif double=1 'double'
elif single=1 'single'
elif walk=1 or iw=1 'walk'
elif k=1 'strikeout'
elif hbp=1 'hit by pitch'
else 'generic out'
# Batted ball type
if fly=1 'f'
elif ground=1 'G'
elif line=1 'l'
# Hit value
if hr=1 '4'
elif triple=1 '3'
elif double=1 '2'
elif single=1 '1'
```
**Performance**:
- Initial transformation: ~5 seconds for 194k rows
- Cached loads: <1 second
- Cache file: `retrosheets_events_YYYY_normalized.csv`
### 2. Defense CSV Column Standardization
**Files Affected**: All 10 defense CSV files (c, 1b, 2b, 3b, ss, lf, cf, rf, of, p)
**Required Renames**:
```
RF/9 → range_factor_per_nine
RF/G → range_factor_per_game
DP → DP_def
E → E_def
Ch → chances
Inn → Inn_def
CS% → caught_stealing_perc
Name-additional → key_bbref
```
**Special Cases**:
- Pitchers: `PO → pickoffs`
- Position players: Keep `PO` as putouts
**File Naming**: Changed from hyphens to underscores (`defense-c.csv` `defense_c.csv`)
### 3. Configuration Updates for 2005 Data
**Date Range** (retrosheet_data.py):
```python
START_DATE = 20050403 # 2005 Opening Day
END_DATE = 20051002 # 2005 Regular Season End
SEASON_PCT = 162 / 162 # Full season
```
**Playing Time Thresholds** (full season):
```python
MIN_PA_VL = 50 # was 20 for live series
MIN_PA_VR = 75 # was 40 for live series
```
**Cardset Configuration**:
```python
CARDSET_ID = 27 if 'live' in PLAYER_DESCRIPTION.lower() else 28
# Changed from: 22/23 (1998 cardsets)
```
### 4. Multi-Team Player Handling
**Problem**: Players traded during season create duplicate entries
- Example: Ryan Drese appears 3x (2TM, TEX, WSN)
- Causes `TypeError: cannot convert the series to <class 'int'>` when accessing ratings
**Solution**: Filter after peripheral/running stats merge
```python
duplicated_mask = df['key_bbref'].duplicated(keep=False)
if duplicated_mask.any():
multi_team_mask = df['Tm'].str.contains('TM', na=False)
df = df[~duplicated_mask | multi_team_mask]
```
**Applied to**: Both `run_batters()` and `run_pitchers()` functions
### 5. Dictionary Column Corruption Fix
**Problem**: Merging full card DataFrames corrupted `ratings_vL` and `ratings_vR` dictionary columns
- Error: `TypeError: 'float' object does not support item assignment`
**Solution**: Only merge required columns
```python
# Before (corrupts dict columns)
br = pd.merge(left=br, right=bc, ...)
# After (preserves dict columns)
br = pd.merge(left=br, right=bc[['key_bbref', 'player_id', 'battingcard_id']], ...)
```
**Also fixed**: Removed unnecessary `.set_index('key_bbref')` from card functions
### 6. Position Assignment Improvements
**Problem**: Existing players kept old positions (e.g., DH + 3B when should only show 3B)
**Root Cause**: Script only updated cost/rarity/image for existing players, not positions
**Solution**: Update ALL position slots when patching existing players
```python
# Batters: Update all 8 position slots from defense stats
all_pos = get_player_record_pos(def_rat_df, row)
for x in enumerate(all_pos):
patch_params.append((f'pos_{x[0] + 1}', x[1]))
# Pitchers: Set position based on rating, clear unused slots
if starter_rating >= 4:
patch_params.append(('pos_1', 'SP'))
for i in range(2, 9):
patch_params.append((f'pos_{i}', None))
```
**DH Rule**: Now correctly only appears when player has NO defensive positions
### 7. Pandas Type Handling Fixes
**String Type Enforcement** (retrosheet_transformer.py):
```python
# Ensure .str accessor works
transformed['hit_val'] = df.apply(transform_hit_val, axis=1).astype(str)
transformed['hit_location'] = df['loc'].astype(str)
transformed['batted_ball_type'] = df.apply(transform_batted_ball_type, axis=1).astype(str)
# Maintain types when loading from cache
dtype_dict = {
'game_id': 'str',
'hit_val': 'str',
'hit_location': 'str',
'batted_ball_type': 'str'
}
pd.read_csv(cache_path, dtype=dtype_dict, low_memory=False)
```
**Pitcher OPS Calculation** (pitchers/calcs_pitcher.py):
```python
# Convert to float before min() to avoid "ambiguous truth value" error
ob_vl = float(108 * (df_data['BB_vL'] + df_data['HBP_vL']) / df_data['TBF_vL'])
all_other_ob = sanitize_chance_output(min(ob_vl, 0.8))
```
**Column Name Conflicts** (retrosheet_data.py):
```python
# Drop duplicate 'Tm' from defense_p to prevent Tm_x/Tm_y creation
if 'Tm' in df_p.columns:
df_p = df_p.drop(columns=['Tm'])
```
## Files Created
1. **retrosheet_transformer.py**: Main transformation module with caching
2. **rename_defense_columns.py**: Script to standardize defense CSV columns
3. **rename_additional_defense_columns.py**: Second pass for additional column renames
4. **undo_po_rename.py**: Reverted POpickoffs for position players
5. **test_retrosheet_integration.py**: Integration test script
6. **test_nan_handling.py**: (created during testing)
7. **retrosheet_transformer.py**: Retrosheet data normalization script
## Files Modified
1. **retrosheet_data.py**:
- Added transformer import
- Updated date/cardset configuration for 2005
- Added multi-team player filtering
- Fixed position assignment for existing players
- Added column conflict resolution
2. **pitchers/calcs_pitcher.py**:
- Fixed OPS calculation type handling
3. **CLAUDE.md**:
- Added Retrosheet section
- Added Defense CSV requirements
- Added Common Issues section
- Added Position Assignment Rules
- Added Configuration Checklist
## Results
**Successfully Generated**:
- 335 qualified batters (2005 season)
- 129 qualified pitchers (2005 season)
- All batting cards, ratings, and defensive positions
- All pitching cards and ratings
**Posted to**:
- Database: pddev.manticorum.com (development)
- Cardset: 22 (initial run), configured for 27 going forward
## Key Learnings
1. **Always verify configuration before running**:
- Check cardset_id matches target
- Verify database environment (dev vs prod)
- Confirm date ranges match data year
2. **Pandas merge gotchas**:
- Merging full DataFrames can corrupt object/dict columns
- Column name conflicts create `_x` and `_y` suffixes
- Always specify only needed columns in merge operations
3. **Type preservation matters**:
- Explicitly set dtypes for string columns
- Persist dtypes when loading from cache
- Convert pandas values to Python types before operations like `min()`
4. **Position assignment logic**:
- Always update ALL position slots to clear old data
- DH should only appear when no defensive positions exist
- Sort positions by innings played (primary position = most innings)
5. **Multi-team players require special handling**:
- Baseball Reference provides per-team AND combined stats
- Always use combined totals (2TM, 3TM, etc.)
- Filter duplicates AFTER all data merges complete
6. **Caching strategy**:
- Timestamp-based cache invalidation works well
- 5 seconds preprocessing vs <1 second cached is significant
- Store normalized data for consistency
## Maintenance Notes
**When adding new seasons**:
1. Update START_DATE/END_DATE for season year
2. Verify CARDSET_ID for target cardset
3. Ensure defense CSV files have correct column names
4. Run transformer to generate normalized cache
5. Check logs for cardset_id confirmation
**When data format changes**:
1. Update retrosheet_transformer.py transformation logic
2. Regenerate cache (delete `*_normalized.csv` files)
3. Test with small date range first
4. Verify all column dependencies still work
**Performance Monitoring**:
- Initial transformation: Should complete in <10 seconds for full season
- Cached loads: Should complete in <2 seconds
- Full pipeline: Batters + Pitchers ~1-2 minutes total