275 lines
8.7 KiB
Markdown
275 lines
8.7 KiB
Markdown
# Retrosheet 2005 Data Migration - Session Summary
|
|
|
|
**Date**: November 8, 2025
|
|
**Objective**: Migrate retrosheet_data.py to work with new Retrosheet CSV format for 2005 season data
|
|
|
|
## Overview
|
|
|
|
Successfully migrated the card creation system from the old Retrosheet CSV format to a new format, implementing a smart caching transformer that processes ~194k rows in 5 seconds on first run and <1 second on subsequent runs.
|
|
|
|
## Major Changes Implemented
|
|
|
|
### 1. Retrosheet CSV Format Transformation (`retrosheet_transformer.py`)
|
|
|
|
**Problem**: New Retrosheet data source uses completely different column structure
|
|
- Old format: `event_type`, `hit_val`, `batted_ball_type` as single columns
|
|
- New format: Boolean columns for each event type (`single`, `double`, `triple`, `hr`, `walk`, `k`, etc.)
|
|
|
|
**Solution**: Created `retrosheet_transformer.py` with:
|
|
- Automatic format detection and conversion
|
|
- Smart caching with timestamp checking
|
|
- Explicit string dtype enforcement for `.str` accessor compatibility
|
|
- Case conversion (R/L → r/l) for handedness columns
|
|
|
|
**Key Transformations**:
|
|
```python
|
|
# Event type derivation (priority order)
|
|
if hr=1 → 'home run'
|
|
elif triple=1 → 'triple'
|
|
elif double=1 → 'double'
|
|
elif single=1 → 'single'
|
|
elif walk=1 or iw=1 → 'walk'
|
|
elif k=1 → 'strikeout'
|
|
elif hbp=1 → 'hit by pitch'
|
|
else → 'generic out'
|
|
|
|
# Batted ball type
|
|
if fly=1 → 'f'
|
|
elif ground=1 → 'G'
|
|
elif line=1 → 'l'
|
|
|
|
# Hit value
|
|
if hr=1 → '4'
|
|
elif triple=1 → '3'
|
|
elif double=1 → '2'
|
|
elif single=1 → '1'
|
|
```
|
|
|
|
**Performance**:
|
|
- Initial transformation: ~5 seconds for 194k rows
|
|
- Cached loads: <1 second
|
|
- Cache file: `retrosheets_events_YYYY_normalized.csv`
|
|
|
|
### 2. Defense CSV Column Standardization
|
|
|
|
**Files Affected**: All 10 defense CSV files (c, 1b, 2b, 3b, ss, lf, cf, rf, of, p)
|
|
|
|
**Required Renames**:
|
|
```
|
|
RF/9 → range_factor_per_nine
|
|
RF/G → range_factor_per_game
|
|
DP → DP_def
|
|
E → E_def
|
|
Ch → chances
|
|
Inn → Inn_def
|
|
CS% → caught_stealing_perc
|
|
Name-additional → key_bbref
|
|
```
|
|
|
|
**Special Cases**:
|
|
- Pitchers: `PO → pickoffs`
|
|
- Position players: Keep `PO` as putouts
|
|
|
|
**File Naming**: Changed from hyphens to underscores (`defense-c.csv` → `defense_c.csv`)
|
|
|
|
### 3. Configuration Updates for 2005 Data
|
|
|
|
**Date Range** (retrosheet_data.py):
|
|
```python
|
|
START_DATE = 20050403 # 2005 Opening Day
|
|
END_DATE = 20051002 # 2005 Regular Season End
|
|
SEASON_PCT = 162 / 162 # Full season
|
|
```
|
|
|
|
**Playing Time Thresholds** (full season):
|
|
```python
|
|
MIN_PA_VL = 50 # was 20 for live series
|
|
MIN_PA_VR = 75 # was 40 for live series
|
|
```
|
|
|
|
**Cardset Configuration**:
|
|
```python
|
|
CARDSET_ID = 27 if 'live' in PLAYER_DESCRIPTION.lower() else 28
|
|
# Changed from: 22/23 (1998 cardsets)
|
|
```
|
|
|
|
### 4. Multi-Team Player Handling
|
|
|
|
**Problem**: Players traded during season create duplicate entries
|
|
- Example: Ryan Drese appears 3x (2TM, TEX, WSN)
|
|
- Causes `TypeError: cannot convert the series to <class 'int'>` when accessing ratings
|
|
|
|
**Solution**: Filter after peripheral/running stats merge
|
|
```python
|
|
duplicated_mask = df['key_bbref'].duplicated(keep=False)
|
|
if duplicated_mask.any():
|
|
multi_team_mask = df['Tm'].str.contains('TM', na=False)
|
|
df = df[~duplicated_mask | multi_team_mask]
|
|
```
|
|
|
|
**Applied to**: Both `run_batters()` and `run_pitchers()` functions
|
|
|
|
### 5. Dictionary Column Corruption Fix
|
|
|
|
**Problem**: Merging full card DataFrames corrupted `ratings_vL` and `ratings_vR` dictionary columns
|
|
- Error: `TypeError: 'float' object does not support item assignment`
|
|
|
|
**Solution**: Only merge required columns
|
|
```python
|
|
# Before (corrupts dict columns)
|
|
br = pd.merge(left=br, right=bc, ...)
|
|
|
|
# After (preserves dict columns)
|
|
br = pd.merge(left=br, right=bc[['key_bbref', 'player_id', 'battingcard_id']], ...)
|
|
```
|
|
|
|
**Also fixed**: Removed unnecessary `.set_index('key_bbref')` from card functions
|
|
|
|
### 6. Position Assignment Improvements
|
|
|
|
**Problem**: Existing players kept old positions (e.g., DH + 3B when should only show 3B)
|
|
|
|
**Root Cause**: Script only updated cost/rarity/image for existing players, not positions
|
|
|
|
**Solution**: Update ALL position slots when patching existing players
|
|
```python
|
|
# Batters: Update all 8 position slots from defense stats
|
|
all_pos = get_player_record_pos(def_rat_df, row)
|
|
for x in enumerate(all_pos):
|
|
patch_params.append((f'pos_{x[0] + 1}', x[1]))
|
|
|
|
# Pitchers: Set position based on rating, clear unused slots
|
|
if starter_rating >= 4:
|
|
patch_params.append(('pos_1', 'SP'))
|
|
for i in range(2, 9):
|
|
patch_params.append((f'pos_{i}', None))
|
|
```
|
|
|
|
**DH Rule**: Now correctly only appears when player has NO defensive positions
|
|
|
|
### 7. Pandas Type Handling Fixes
|
|
|
|
**String Type Enforcement** (retrosheet_transformer.py):
|
|
```python
|
|
# Ensure .str accessor works
|
|
transformed['hit_val'] = df.apply(transform_hit_val, axis=1).astype(str)
|
|
transformed['hit_location'] = df['loc'].astype(str)
|
|
transformed['batted_ball_type'] = df.apply(transform_batted_ball_type, axis=1).astype(str)
|
|
|
|
# Maintain types when loading from cache
|
|
dtype_dict = {
|
|
'game_id': 'str',
|
|
'hit_val': 'str',
|
|
'hit_location': 'str',
|
|
'batted_ball_type': 'str'
|
|
}
|
|
pd.read_csv(cache_path, dtype=dtype_dict, low_memory=False)
|
|
```
|
|
|
|
**Pitcher OPS Calculation** (pitchers/calcs_pitcher.py):
|
|
```python
|
|
# Convert to float before min() to avoid "ambiguous truth value" error
|
|
ob_vl = float(108 * (df_data['BB_vL'] + df_data['HBP_vL']) / df_data['TBF_vL'])
|
|
all_other_ob = sanitize_chance_output(min(ob_vl, 0.8))
|
|
```
|
|
|
|
**Column Name Conflicts** (retrosheet_data.py):
|
|
```python
|
|
# Drop duplicate 'Tm' from defense_p to prevent Tm_x/Tm_y creation
|
|
if 'Tm' in df_p.columns:
|
|
df_p = df_p.drop(columns=['Tm'])
|
|
```
|
|
|
|
## Files Created
|
|
|
|
1. **retrosheet_transformer.py**: Main transformation module with caching
|
|
2. **rename_defense_columns.py**: Script to standardize defense CSV columns
|
|
3. **rename_additional_defense_columns.py**: Second pass for additional column renames
|
|
4. **undo_po_rename.py**: Reverted PO→pickoffs for position players
|
|
5. **test_retrosheet_integration.py**: Integration test script
|
|
6. **test_nan_handling.py**: (created during testing)
|
|
7. **retrosheet_transformer.py**: Retrosheet data normalization script
|
|
|
|
## Files Modified
|
|
|
|
1. **retrosheet_data.py**:
|
|
- Added transformer import
|
|
- Updated date/cardset configuration for 2005
|
|
- Added multi-team player filtering
|
|
- Fixed position assignment for existing players
|
|
- Added column conflict resolution
|
|
|
|
2. **pitchers/calcs_pitcher.py**:
|
|
- Fixed OPS calculation type handling
|
|
|
|
3. **CLAUDE.md**:
|
|
- Added Retrosheet section
|
|
- Added Defense CSV requirements
|
|
- Added Common Issues section
|
|
- Added Position Assignment Rules
|
|
- Added Configuration Checklist
|
|
|
|
## Results
|
|
|
|
**Successfully Generated**:
|
|
- ✅ 335 qualified batters (2005 season)
|
|
- ✅ 129 qualified pitchers (2005 season)
|
|
- ✅ All batting cards, ratings, and defensive positions
|
|
- ✅ All pitching cards and ratings
|
|
|
|
**Posted to**:
|
|
- Database: pddev.manticorum.com (development)
|
|
- Cardset: 22 (initial run), configured for 27 going forward
|
|
|
|
## Key Learnings
|
|
|
|
1. **Always verify configuration before running**:
|
|
- Check cardset_id matches target
|
|
- Verify database environment (dev vs prod)
|
|
- Confirm date ranges match data year
|
|
|
|
2. **Pandas merge gotchas**:
|
|
- Merging full DataFrames can corrupt object/dict columns
|
|
- Column name conflicts create `_x` and `_y` suffixes
|
|
- Always specify only needed columns in merge operations
|
|
|
|
3. **Type preservation matters**:
|
|
- Explicitly set dtypes for string columns
|
|
- Persist dtypes when loading from cache
|
|
- Convert pandas values to Python types before operations like `min()`
|
|
|
|
4. **Position assignment logic**:
|
|
- Always update ALL position slots to clear old data
|
|
- DH should only appear when no defensive positions exist
|
|
- Sort positions by innings played (primary position = most innings)
|
|
|
|
5. **Multi-team players require special handling**:
|
|
- Baseball Reference provides per-team AND combined stats
|
|
- Always use combined totals (2TM, 3TM, etc.)
|
|
- Filter duplicates AFTER all data merges complete
|
|
|
|
6. **Caching strategy**:
|
|
- Timestamp-based cache invalidation works well
|
|
- 5 seconds preprocessing vs <1 second cached is significant
|
|
- Store normalized data for consistency
|
|
|
|
## Maintenance Notes
|
|
|
|
**When adding new seasons**:
|
|
1. Update START_DATE/END_DATE for season year
|
|
2. Verify CARDSET_ID for target cardset
|
|
3. Ensure defense CSV files have correct column names
|
|
4. Run transformer to generate normalized cache
|
|
5. Check logs for cardset_id confirmation
|
|
|
|
**When data format changes**:
|
|
1. Update retrosheet_transformer.py transformation logic
|
|
2. Regenerate cache (delete `*_normalized.csv` files)
|
|
3. Test with small date range first
|
|
4. Verify all column dependencies still work
|
|
|
|
**Performance Monitoring**:
|
|
- Initial transformation: Should complete in <10 seconds for full season
|
|
- Cached loads: Should complete in <2 seconds
|
|
- Full pipeline: Batters + Pitchers ~1-2 minutes total
|