diff --git a/CLAUDE.md b/CLAUDE.md index a2f490b..718c7f9 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -60,6 +60,30 @@ python scouting_pitchers.py # Generate pitching scouting data ### Baseball Reference Data - **running.csv**: Baserunning statistics - **pitching.csv**: Standard pitching statistics +- **defense_*.csv**: Defensive statistics for each position (c, 1b, 2b, 3b, ss, lf, cf, rf, of, p) + +### Retrosheet Play-by-Play Data +- **retrosheet_transformer.py**: Preprocesses new Retrosheet CSV format to legacy format with smart caching +- Place source files in `data-input/retrosheet/` directory +- Transformer automatically checks timestamps and only re-processes if source is newer than cache +- Normalized cache files saved as `*_normalized.csv` for fast subsequent runs +- Performance: ~5 seconds for initial transformation, <1 second for cached loads + +### Defense CSV Requirements +All defense files must use underscore naming (`defense_c.csv`, not `defense-c.csv`) and include these standardized column names: +- `key_bbref`: Player identifier (required as index key) +- `Inn_def`: Innings played at position +- `chances`: Total fielding chances +- `E_def`: Errors +- `DP_def`: Double plays +- `fielding_perc`: Fielding percentage +- `tz_runs_total`: Total Zone runs saved +- `tz_runs_field`: Zone runs (fielding only) +- `tz_runs_infield`: Zone runs (infield only) +- `range_factor_per_nine`: Range factor per 9 innings +- `range_factor_per_game`: Range factor per game +- Catchers only: `caught_stealing_perc`, `pickoffs` (not PO) +- Position players: `PO` for putouts (not pickoffs) ### Minimum Playing Time Thresholds - **Live Series**: 20 PA vs L / 40 PA vs R (batters), 20 TBF vs L / 40 TBF vs R (pitchers) @@ -78,10 +102,85 @@ python scouting_pitchers.py # Generate pitching scouting data - `GAMES_PLAYED`: Season progress for live series calculations - `IGNORE_LIMITS`: Override minimum playing time requirements +### Retrosheet Data Settings (retrosheet_data.py) +Before running retrosheet_data.py, verify these configuration settings: +- `PLAYER_DESCRIPTION`: 'Live' for season cards, or ' PotM' for promotional cards +- `CARDSET_ID`: Correct cardset ID (e.g., 27 for 2005 Live, 28 for 2005 Promos) +- `START_DATE` / `END_DATE`: Date range in YYYYMMDD format matching your Retrosheet data +- `SEASON_PCT`: Percentage of season completed (162/162 for full season) +- `MIN_PA_VL` / `MIN_PA_VR`: Minimum plate appearances (50/75 for full season, 1/1 for promos) +- `DATA_INPUT_FILE_PATH`: Path to data directory (usually `data-input/[Year] [Type] Cardset/`) +- `EVENTS_FILENAME`: Retrosheet CSV filename (e.g., `retrosheets_events_2005.csv`) + +**Configuration Checklist Before Running:** +1. Database environment (`alt_database` in db_calls.py) +2. Cardset ID matches intended target +3. Date range matches Retrosheet data year +4. Defense CSV files present and properly named +5. Running/pitching CSV files present + ## Important Notes - The system uses D20-based probability mechanics where statistics are converted to chances out of 20 - Cards are generated with both basic stats and advanced metrics (OPS, WHIP, etc.) 
- Defensive ratings use zone-based fielding statistics from Baseball Reference - All player data flows through Paper Dynasty's API with bearer token authentication -- Cards are dynamically rendered when accessed via URL, with nginx caching for performance \ No newline at end of file +- Cards are dynamically rendered when accessed via URL, with nginx caching for performance + +### Rarity Assignment System +- **rarity_thresholds.py**: Contains season-aware rarity thresholds (2024 vs 2025+) +- Rarity is calculated from `total_OPS` (batters) or OPS-against (pitchers) in the ratings dataframe +- `post_player_updates()` uses LEFT JOIN to preserve players without ratings (assigns Common/5 rarity + default OPS) +- Players missing ratings will log warnings showing player_id and card_id for troubleshooting +- Default OPS values: 0.612 (batters/Common), 0.702 (pitchers/Common reliever) + +### Position Assignment Rules +- **Batters**: Positions assigned from defensive stats, sorted by innings played (most innings = pos_1) +- **DH Rule**: "DH" only appears when a player has NO defensive positions at all +- **Pitchers**: Assigned based on starter_rating (≥4 = SP, <4 = RP) and closer_rating (if present, add CP) +- **Position Updates**: Script updates ALL 8 position slots when patching existing players to clear old data +- Player cards can be viewed as HTML by adding `html=true` to the card URL: `https://pddev.manticorum.com/api/v2/players/{id}/battingcard?d={date}&html=true` + +## Common Issues and Solutions + +### Multi-Team Players (Traded During Season) +**Problem**: Players traded during season appear multiple times in Baseball Reference data (one row per team + combined total marked as "2TM", "3TM", etc.) + +**Solution**: Script automatically filters to keep only combined season totals: +- Detects duplicate `key_bbref` values after merging peripheral/running stats +- Keeps rows where `Tm` column contains "TM" (2TM, 3TM, etc.) +- Removes individual team rows to prevent duplicate player entries + +### Dictionary Column Corruption in Ratings +**Problem**: When merging full card DataFrames with ratings DataFrames, pandas corrupts `ratings_vL` and `ratings_vR` dictionary columns, converting them to floats/NaN. + +**Solution**: Only merge specific columns needed (`key_bbref`, `player_id`, `battingcard_id`/`pitchingcard_id`) instead of entire DataFrame. + +### No Players Found After Successful Run +**Symptoms**: Script completes successfully but API query returns 0 players + +**Common Causes**: +1. **Wrong Cardset**: Check logs for actual cardset_id used vs. cardset queried in API +2. **Wrong Database**: Verify `alt_database` setting in db_calls.py (dev vs production) +3. **Date Mismatch**: START_DATE/END_DATE don't match Retrosheet data year +4. **Empty PROMO_INCLUSION_RETRO_IDS**: When PLAYER_DESCRIPTION is a promo name, this list must contain player IDs + +**Debugging Steps**: +1. Check logs for actual POST operations and player_id values +2. Verify cardset_id in logs matches API query +3. Check database URL in logs matches intended environment +4. 
Query API with cardset_id from logs to find players
+
+### String Type Issues with Retrosheet Data
+**Problem**: The pandas `.str` accessor fails on the `hit_val`, `hit_location`, `batted_ball_type` columns
+
+**Solution**: retrosheet_transformer.py explicitly converts these columns to string dtype, and preserves those dtypes on cache loads by passing a `dtype` mapping to `pd.read_csv()`
+
+### Pitcher OPS Calculation Errors
+**Problem**: `min()` fails with a "truth value is ambiguous" error when calculating OB values
+
+**Solution**: Explicitly convert pandas values to Python floats before using `min()`:
+```python
+ob_vl = float(108 * (df_data['BB_vL'] + df_data['HBP_vL']) / df_data['TBF_vL'])
+result = min(ob_vl, 0.8)  # Now works correctly
+```
\ No newline at end of file
diff --git a/pitchers/calcs_pitcher.py b/pitchers/calcs_pitcher.py
index 316a45b..4950b5a 100644
--- a/pitchers/calcs_pitcher.py
+++ b/pitchers/calcs_pitcher.py
@@ -300,12 +300,16 @@ class PitchingCardRatingsModel(pydantic.BaseModel):
 
 
 def get_pitcher_ratings(df_data) -> List[dict]:
+    # Calculate OB values with min cap (ensure scalar values for comparison)
+    ob_vl = float(108 * (df_data['BB_vL'] + df_data['HBP_vL']) / df_data['TBF_vL'])
+    ob_vr = float(108 * (df_data['BB_vR'] + df_data['HBP_vR']) / df_data['TBF_vR'])
+
     vl = PitchingCardRatingsModel(
         pitchingcard_id=df_data.pitchingcard_id,
         pit_hand=df_data.pitch_hand,
         vs_hand='L',
         all_hits=sanitize_chance_output((df_data['AVG_vL'] - 0.05) * 108),  # Subtracting chances from BP results
-        all_other_ob=sanitize_chance_output(min((108 * (df_data['BB_vL'] + df_data['HBP_vL']) / df_data['TBF_vL']), 0.8)),
+        all_other_ob=sanitize_chance_output(min(ob_vl, 0.8)),
         hard_rate=df_data['Hard%_vL'],
         med_rate=df_data['Med%_vL'],
         soft_rate=df_data['Soft%_vL']
@@ -315,7 +319,7 @@ def get_pitcher_ratings(df_data) -> List[dict]:
         pit_hand=df_data.pitch_hand,
         vs_hand='R',
         all_hits=sanitize_chance_output((df_data['AVG_vR'] - 0.05) * 108),  # Subtracting chances from BP results
-        all_other_ob=sanitize_chance_output(min((108 * (df_data['BB_vR'] + df_data['HBP_vR']) / df_data['TBF_vR']), 0.8)),
+        all_other_ob=sanitize_chance_output(min(ob_vr, 0.8)),
         hard_rate=df_data['Hard%_vR'],
         med_rate=df_data['Med%_vR'],
         soft_rate=df_data['Soft%_vR']
diff --git a/retrosheet_data.py b/retrosheet_data.py
index 1ea6f8d..80f12b8 100644
--- a/retrosheet_data.py
+++ b/retrosheet_data.py
@@ -16,6 +16,7 @@ from creation_helpers import get_args, CLUB_LIST, FRANCHISE_LIST, sanitize_name
 from batters.stat_prep import DataMismatchError
 from db_calls import DB_URL, db_get, db_patch, db_post, db_put
 from exceptions import log_exception, logger
+from retrosheet_transformer import load_retrosheet_csv
 import batters.calcs_batter as cba
 import defenders.calcs_defense as cde
 import pitchers.calcs_pitcher as cpi
@@ -31,35 +32,35 @@ cache.enable()
 
 RETRO_FILE_PATH = 'data-input/retrosheet/'
-EVENTS_FILENAME = 'retrosheets_events_1998_short.csv'  # Removed last few columns which were throwing dtype errors
+EVENTS_FILENAME = 'retrosheets_events_2005.csv'  # Now using transformer for new format compatibility
 PERSONNEL_FILENAME = 'retrosheets_personnel.csv'
-DATA_INPUT_FILE_PATH = 'data-input/1998 Season Cardset/'
+DATA_INPUT_FILE_PATH = 'data-input/2005 Live Cardset/'
 CARD_BASE_URL = f'{DB_URL}/v2/players/'
 start_time = datetime.datetime.now()
 RELEASE_DIRECTORY = f'{start_time.year}-{start_time.month}-{start_time.day}'
-# PLAYER_DESCRIPTION = 'Live'  # Live for Live Series
-PLAYER_DESCRIPTION = 'September PotM'  # PotM for promos
+PLAYER_DESCRIPTION = 'Live'  # Live for Live Series
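+# For a promo cardset, set PLAYER_DESCRIPTION to the promo name and list the
+# qualifying Retrosheet player ids in PROMO_INCLUSION_RETRO_IDS below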
+# PLAYER_DESCRIPTION = 'September PotM'  # PotM for promos
 PROMO_INCLUSION_RETRO_IDS = [
-    'marte001',
-    'willg001',
-    'sampb003',
-    'ruscg001',
-    'larkb001',
-    'sosas001',
-    'smolj001',
-    'acevj001'
+    # 'marte001',
+    # 'willg001',
+    # 'sampb003',
+    # 'ruscg001',
+    # 'larkb001',
+    # 'sosas001',
+    # 'smolj001',
+    # 'acevj001'
 ]
 MIN_PA_VL = 20 if 'live' in PLAYER_DESCRIPTION.lower() else 1  # 1 for PotM
 MIN_PA_VR = 40 if 'live' in PLAYER_DESCRIPTION.lower() else 1  # 1 for PotM
 MIN_TBF_VL = MIN_PA_VL
 MIN_TBF_VR = MIN_PA_VR
-CARDSET_ID = 20 if 'live' in PLAYER_DESCRIPTION.lower() else 21  # 20: 1998 Live, 21: 1998 Promos
+CARDSET_ID = 27 if 'live' in PLAYER_DESCRIPTION.lower() else 28  # 27: 2005 Live, 28: 2005 Promos
 
 # Per-Update Parameters
-SEASON_PCT = 162 / 162
-START_DATE = 19980901  # YYYYMMDD format
-END_DATE = 19980928  # YYYYMMDD format
+SEASON_PCT = 162 / 162  # Full season
+START_DATE = 20050301  # YYYYMMDD format - early bound, before the 2005 opener
+END_DATE = 20050430  # YYYYMMDD format - covers games through April 30, 2005
 POST_DATA = True
 LAST_WEEK_RATIO = 0.0 if PLAYER_DESCRIPTION == 'Live' else 0.0
 LAST_TWOWEEKS_RATIO = 0.0
@@ -247,7 +248,7 @@ def get_player_ids(plays: pd.DataFrame, which: Literal['batters', 'pitchers']) -
 
 
 def get_base_batting_df(file_path: str, start_date: int, end_date: int) -> list[pd.DataFrame, pd.DataFrame]:
-    all_plays = pd.read_csv(f'{file_path}', dtype={'game_id': 'str'})
+    all_plays = load_retrosheet_csv(file_path)
     all_plays['date'] = all_plays['game_id'].str[3:-1].astype(int)
     date_plays = all_plays[(all_plays.date >= start_date) & (all_plays.date <= end_date)]
 
@@ -310,7 +311,7 @@ def get_base_batting_df(file_path: str, start_date: int, end_date: int) -> list[
 
 
 def get_base_pitching_df(file_path: str, start_date: int, end_date: int) -> list[pd.DataFrame, pd.DataFrame]:
-    all_plays = pd.read_csv(f'{file_path}', dtype={'game_id': 'str'})
+    all_plays = load_retrosheet_csv(file_path)
     all_plays['date'] = all_plays['game_id'].str[3:-1].astype(int)
     date_plays = all_plays[(all_plays.date >= start_date) & (all_plays.date <= end_date)]
 
@@ -393,6 +394,7 @@ def get_batting_stats_by_date(retro_file_path, start_date: int, end_date: int) -
     start = datetime.datetime.now()
 
     all_player_ids = batting_stats['key_retro']
+    logging.info(f'all_player_ids: {all_player_ids}')
     all_plays = all_plays[all_plays['batter_id'].isin(all_player_ids)]
     print(f'Shrink all_plays: {(datetime.datetime.now() - start).total_seconds():.2f}s')
 
@@ -1139,9 +1141,19 @@ async def get_or_post_players(bstat_df: pd.DataFrame = None, bat_rat_df: pd.Data
             else:
                 player_id = p_search['player_id']
 
-                new_player = await db_patch('players', object_id=player_id, params=[
-                    ('cost', f'{bat_rat_df.loc[row['key_bbref']]["cost"]}'), ('rarity_id', int(bat_rat_df.loc[row['key_bbref']]['rarity_id'])), ('image', f'{CARD_BASE_URL}{player_id}/battingcard{urllib.parse.quote("?d=")}{RELEASE_DIRECTORY}')
-                ])
+                # Update positions for existing players too
+                all_pos = get_player_record_pos(def_rat_df, row)
+                patch_params = [
+                    ('cost', f'{bat_rat_df.loc[row['key_bbref']]["cost"]}'),
+                    ('rarity_id', int(bat_rat_df.loc[row['key_bbref']]['rarity_id'])),
+                    ('image', f'{CARD_BASE_URL}{player_id}/battingcard{urllib.parse.quote("?d=")}{RELEASE_DIRECTORY}')
+                ]
+                # Add position updates - set all 8 slots to clear any old positions
+                for slot, pos in enumerate(all_pos, start=1):
+                    patch_params.append((f'pos_{slot}', pos))
+
+                new_player = await db_patch('players', object_id=player_id, params=patch_params)
+                new_player['bbref_id'] = row['key_bbref']
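+                # new_player now carries its Baseball Reference key for downstream key_bbref joins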
all_players.append(new_player) player_deltas.append([ new_player['player_id'], new_player['p_name'], p_search['cost'], new_player['cost'], p_search['rarity']['name'], new_player['rarity']['name'] @@ -1165,9 +1177,10 @@ async def get_or_post_players(bstat_df: pd.DataFrame = None, bat_rat_df: pd.Data new_player = await db_patch('players', object_id=player_id, params=[('image', f'{CARD_BASE_URL}{player_id}/battingcard{urllib.parse.quote("?d=")}{RELEASE_DIRECTORY}')]) if 'paperdex' in new_player: del new_player['paperdex'] - + # all_bbref_ids.append(row['key_bbref']) # all_player_ids.append(player_id) + new_player['bbref_id'] = row['key_bbref'] all_players.append(new_player) new_players.append([new_player['player_id'], new_player['p_name'], new_player['cost'], new_player['rarity']['name'], new_player['pos_1']]) @@ -1187,9 +1200,37 @@ async def get_or_post_players(bstat_df: pd.DataFrame = None, bat_rat_df: pd.Data else: player_id = p_search['player_id'] - new_player = await db_patch('players', object_id=player_id, params=[ - ('cost', f'{pit_rat_df.loc[row['key_bbref']]["cost"]}'), ('rarity_id', int(pit_rat_df.loc[row['key_bbref']]['rarity_id'])), ('image', f'{CARD_BASE_URL}{player_id}/pitchingcard{urllib.parse.quote("?d=")}{RELEASE_DIRECTORY}') - ]) + # Determine pitcher positions based on ratings + patch_params = [ + ('cost', f'{pit_rat_df.loc[row['key_bbref']]["cost"]}'), + ('rarity_id', int(pit_rat_df.loc[row['key_bbref']]['rarity_id'])), + ('image', f'{CARD_BASE_URL}{player_id}/pitchingcard{urllib.parse.quote("?d=")}{RELEASE_DIRECTORY}') + ] + + player_index = pstat_df.index[pstat_df['key_bbref'] == row['key_bbref']].tolist() + stat_row = pstat_df.iloc[player_index] + starter_rating = stat_row.iat[0, starter_index] + + if starter_rating >= 4: + patch_params.append(('pos_1', 'SP')) + # Clear other position slots + for i in range(2, 9): + patch_params.append((f'pos_{i}', None)) + else: + patch_params.append(('pos_1', 'RP')) + closer_rating = stat_row.iat[0, closer_index] + if not pd.isna(closer_rating): + patch_params.append(('pos_2', 'CP')) + # Clear remaining position slots + for i in range(3, 9): + patch_params.append((f'pos_{i}', None)) + else: + # Clear remaining position slots + for i in range(2, 9): + patch_params.append((f'pos_{i}', None)) + + new_player = await db_patch('players', object_id=player_id, params=patch_params) + new_player['bbref_id'] = row['key_bbref'] all_players.append(new_player) player_deltas.append([ new_player['player_id'], new_player['p_name'], p_search['cost'], new_player['cost'], p_search['rarity']['name'], new_player['rarity']['name'] @@ -1220,7 +1261,8 @@ async def get_or_post_players(bstat_df: pd.DataFrame = None, bat_rat_df: pd.Data new_player = await db_patch('players', object_id=player_id, params=[('image', f'{CARD_BASE_URL}{player_id}/pitchingcard{urllib.parse.quote("?d=")}{RELEASE_DIRECTORY}')]) if 'paperdex' in new_player: del new_player['paperdex'] - + + new_player['bbref_id'] = row['key_bbref'] all_players.append(new_player) new_players.append([new_player['player_id'], new_player['p_name'], new_player['cost'], new_player['rarity']['name'], new_player['pos_1']]) @@ -1264,7 +1306,7 @@ async def post_batting_cards(cards_df: pd.DataFrame): line['key_bbref'] = line['player']['bbref_id'] line['battingcard_id'] = line['id'] - return pd.DataFrame(bc_data).set_index('key_bbref') + return pd.DataFrame(bc_data) else: log_exception(ValueError, 'Unable to pull newly posted batting cards') @@ -1308,7 +1350,7 @@ async def post_pitching_cards(cards_df: pd.DataFrame): 
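+            # Key each pulled card row by the player's bbref id, with the pitching-card id alongside it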
line['key_bbref'] = line['player']['bbref_id'] line['pitchingcard_id'] = line['id'] - return pd.DataFrame(pc_data).set_index('key_bbref') + return pd.DataFrame(pc_data) else: log_exception(ValueError, 'Unable to pull newly posted pitcher cards') @@ -1390,9 +1432,10 @@ async def post_batter_data(bs: pd.DataFrame, bc: pd.DataFrame, br: pd.DataFrame, bc = await post_batting_cards(bc) # Post Batting Ratings + # Only merge the columns we need to avoid corrupting dict columns in br br = pd.merge( left=br, - right=bc, + right=bc[['key_bbref', 'player_id', 'battingcard_id']], how='left', left_on='key_bbref', right_on='key_bbref' @@ -1426,9 +1469,10 @@ async def post_pitcher_data(ps: pd.DataFrame, pc: pd.DataFrame, pr: pd.DataFrame pc = await post_pitching_cards(ps) # Post Pitching Ratings + # Only merge the columns we need to avoid corrupting dict columns in pr pr = pd.merge( - left=pc, - right=pr, + left=pr, + right=pc[['key_bbref', 'player_id', 'pitchingcard_id']], how='left', left_on='key_bbref', right_on='key_bbref' @@ -1470,6 +1514,18 @@ async def run_batters(data_input_path: str, start_date: int, end_date: int, post left_on='key_bbref', right_on='key_bbref' ) + + # Handle players who played for multiple teams - keep only combined totals + # Players traded during season have multiple rows: one per team + one combined (2TM, 3TM, etc.) + duplicated_mask = batting_stats['key_bbref'].duplicated(keep=False) + if duplicated_mask.any(): + # For duplicates, keep rows where Tm contains 'TM' (combined totals: 2TM, 3TM, etc.) + # For non-duplicates, keep all rows + multi_team_mask = batting_stats['Tm'].str.contains('TM', na=False) + batting_stats = batting_stats[~duplicated_mask | multi_team_mask] + logger.info(f"Removed {duplicated_mask.sum() - multi_team_mask.sum()} team-specific rows for traded batters") + bs_len = len(batting_stats) # Update length after removing duplicates + end_calc = datetime.datetime.now() print(f'Running stats: {(end_calc - running_start).total_seconds():.2f}s') @@ -1533,12 +1589,25 @@ async def run_pitchers(data_input_path: str, start_date: int, end_date: int, pos left_on='key_bbref', right_on='key_bbref' ) + + # Handle players who played for multiple teams - keep only combined totals + # Players traded during season have multiple rows: one per team + one combined (2TM, 3TM, etc.) + duplicated_mask = pitching_stats['key_bbref'].duplicated(keep=False) + if duplicated_mask.any(): + # For duplicates, keep rows where Tm contains 'TM' (combined totals: 2TM, 3TM, etc.) 
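+        # e.g. a pitcher traded once shows a row for each club plus one combined '2TM' row; only the combined row survives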
+ # For non-duplicates, keep all rows + multi_team_mask = pitching_stats['Tm'].str.contains('TM', na=False) + pitching_stats = pitching_stats[~duplicated_mask | multi_team_mask] + logger.info(f"Removed {duplicated_mask.sum() - multi_team_mask.sum()} team-specific rows for traded players") end_time = datetime.datetime.now() print(f'Peripheral stats: {(end_time - start_time).total_seconds():.2f}s') # Calculate defense ratings start_time = datetime.datetime.now() df_p = pd.read_csv(f'{DATA_INPUT_FILE_PATH}defense_p.csv').set_index('key_bbref') + # Drop 'Tm' from defense data to avoid column name conflicts (we already have it from periph_stats) + if 'Tm' in df_p.columns: + df_p = df_p.drop(columns=['Tm']) pitching_stats = pd.merge( left=pitching_stats, right=df_p, diff --git a/retrosheet_transformer.py b/retrosheet_transformer.py new file mode 100644 index 0000000..e0c3dab --- /dev/null +++ b/retrosheet_transformer.py @@ -0,0 +1,299 @@ +""" +Retrosheet CSV Format Transformer + +This module transforms newer Retrosheet CSV formats into the legacy format +expected by retrosheet_data.py. Includes smart caching to avoid redundant +transformations. + +Author: Claude Code +""" + +import os +import logging +from pathlib import Path +import pandas as pd +import numpy as np + +# Set up logging +logger = logging.getLogger(f'{__name__}') + + +def get_normalized_csv_path(source_path: str) -> str: + """ + Generate the cached/normalized CSV path from source path. + + Args: + source_path: Path to the source CSV file + + Returns: + Path to the normalized cache file + """ + source = Path(source_path) + cache_name = f"{source.stem}_normalized{source.suffix}" + return str(source.parent / cache_name) + + +def needs_transformation(source_path: str, cache_path: str) -> bool: + """ + Check if transformation is needed based on file modification times. + + Args: + source_path: Path to source CSV + cache_path: Path to cached normalized CSV + + Returns: + True if transformation needed, False if cache is valid + """ + if not os.path.exists(cache_path): + logger.info(f"Cache file not found: {cache_path}") + return True + + source_mtime = os.path.getmtime(source_path) + cache_mtime = os.path.getmtime(cache_path) + + if source_mtime > cache_mtime: + logger.info(f"Source file is newer than cache, transformation needed") + return True + + logger.info(f"Using cached normalized file: {cache_path}") + return False + + +def transform_event_type(row: pd.Series) -> str: + """ + Derive event_type from boolean columns in new format. + + Priority order matches baseball scoring conventions. + """ + if row['hr'] == 1: + return 'home run' + elif row['triple'] == 1: + return 'triple' + elif row['double'] == 1: + return 'double' + elif row['single'] == 1: + return 'single' + elif row['walk'] == 1 or row['iw'] == 1: + return 'walk' + elif row['k'] == 1: + return 'strikeout' + elif row['hbp'] == 1: + return 'hit by pitch' + else: + return 'generic out' + + +def transform_batted_ball_type(row: pd.Series) -> str: + """ + Derive batted_ball_type from boolean columns. + + Returns 'f' (fly), 'G' (ground), 'l' (line), or empty string. + """ + if row['fly'] == 1: + return 'f' + elif row['ground'] == 1: + return 'G' + elif row['line'] == 1: + return 'l' + else: + return '' + + +def transform_hit_val(row: pd.Series) -> str: + """ + Derive hit_val from hit type columns. + + Returns '1', '2', '3', '4' for singles through home runs. 
+ """ + if row['hr'] == 1: + return '4' + elif row['triple'] == 1: + return '3' + elif row['double'] == 1: + return '2' + elif row['single'] == 1: + return '1' + else: + return '' + + +def bool_to_tf(val) -> str: + """Convert 1/0 or True/False to 't'/'f' strings.""" + if pd.isna(val): + return 'f' + return 't' if val == 1 or val is True else 'f' + + +def transform_retrosheet_csv(source_path: str) -> pd.DataFrame: + """ + Transform new Retrosheet CSV format to legacy format. + + Args: + source_path: Path to source CSV file + + Returns: + Transformed DataFrame in legacy format + """ + logger.info(f"Reading source CSV: {source_path}") + df = pd.read_csv(source_path, low_memory=False) + + logger.info(f"Transforming {len(df)} rows to legacy format") + + # Create new dataframe with legacy column names + transformed = pd.DataFrame() + + # Simple renames (with case conversion for handedness) + transformed['game_id'] = df['gid'] + transformed['batter_id'] = df['batter'] + transformed['pitcher_id'] = df['pitcher'] + transformed['batter_hand'] = df['bathand'].str.lower() # Convert R/L to r/l + transformed['pitcher_hand'] = df['pithand'].str.lower() # Convert R/L to r/l + transformed['hit_location'] = df['loc'].astype(str) # Ensure string type for .str operations + + # Derive event_type from multiple columns + logger.info("Deriving event_type from hit/walk/strikeout columns") + transformed['event_type'] = df.apply(transform_event_type, axis=1) + + # Derive batted_ball_type + logger.info("Deriving batted_ball_type from fly/ground/line columns") + transformed['batted_ball_type'] = df.apply(transform_batted_ball_type, axis=1).astype(str) + + # Derive hit_val + logger.info("Deriving hit_val from hit type columns") + transformed['hit_val'] = df.apply(transform_hit_val, axis=1).astype(str) + + # Boolean conversions to 't'/'f' format + logger.info("Converting boolean columns to 't'/'f' format") + transformed['batter_event'] = df['pa'].apply(bool_to_tf) + transformed['ab'] = df['ab'].apply(bool_to_tf) + transformed['bunt'] = df['bunt'].apply(bool_to_tf) + transformed['tp'] = df['tp'].apply(bool_to_tf) + + # Combine gdp + othdp for double play indicator + transformed['dp'] = (df['gdp'].fillna(0) + df['othdp'].fillna(0)).apply(lambda x: 't' if x > 0 else 'f') + + # Use batter_hand as result_batter_hand (assumption: most batters don't switch mid-AB) + # This may need refinement if we have switch hitter data + transformed['result_batter_hand'] = df['bathand'].str.lower() # Convert R/L to r/l + + # Add placeholder columns that may be referenced but aren't critical for stats + # These can be populated if needed in the future + transformed['event_id'] = range(1, len(df) + 1) + transformed['batting_team'] = '' + transformed['inning'] = df['inning'] if 'inning' in df.columns else '' + transformed['outs'] = '' + transformed['balls'] = '' + transformed['strikes'] = '' + transformed['pitch_seq'] = '' + transformed['vis_score'] = '' + transformed['home_score'] = '' + transformed['result_batter_id'] = df['batter'] + transformed['result_pitcher_id'] = df['pitcher'] + transformed['result_pitcher_hand'] = df['pithand'] + transformed['def_c'] = '' + transformed['def_1b'] = '' + transformed['def_2b'] = '' + transformed['def_3b'] = '' + transformed['def_ss'] = '' + transformed['def_lf'] = '' + transformed['def_cf'] = '' + transformed['def_rf'] = '' + transformed['run_1b'] = '' + transformed['run_2b'] = '' + transformed['run_3b'] = '' + transformed['event_scoring'] = '' + transformed['leadoff'] = '' + transformed['pinch_hit'] = 
'' + transformed['batt_def_pos'] = '' + transformed['batt_lineup_pos'] = '' + transformed['sac_hit'] = df['sh'].apply(bool_to_tf) if 'sh' in df.columns else 'f' + transformed['sac_fly'] = df['sf'].apply(bool_to_tf) if 'sf' in df.columns else 'f' + transformed['event_outs'] = '' + transformed['rbi'] = '' + transformed['wild_pitch'] = df['wp'].apply(bool_to_tf) if 'wp' in df.columns else 'f' + transformed['passed_ball'] = df['pb'].apply(bool_to_tf) if 'pb' in df.columns else 'f' + transformed['fielded_by'] = '' + transformed['foul_ground'] = '' + + logger.info(f"Transformation complete: {len(transformed)} rows") + return transformed + + +def load_retrosheet_csv(source_path: str, force_transform: bool = False) -> pd.DataFrame: + """ + Load Retrosheet CSV, using cached normalized version if available. + + This is the main entry point for loading Retrosheet data. It handles: + - Checking for cached normalized data + - Transforming if needed + - Saving transformed data for future use + + Args: + source_path: Path to source Retrosheet CSV + force_transform: If True, ignore cache and force transformation + + Returns: + DataFrame in legacy format ready for retrosheet_data.py + """ + logger.info(f"Loading Retrosheet CSV: {source_path}") + + if not os.path.exists(source_path): + raise FileNotFoundError(f"Source file not found: {source_path}") + + cache_path = get_normalized_csv_path(source_path) + + # Check if we need to transform + if force_transform or needs_transformation(source_path, cache_path): + # Transform the data + df = transform_retrosheet_csv(source_path) + + # Save to cache + logger.info(f"Saving normalized data to cache: {cache_path}") + df.to_csv(cache_path, index=False) + logger.info(f"Cache saved successfully") + + return df + else: + # Load from cache + logger.info(f"Loading from cache: {cache_path}") + # Explicitly set dtypes for string columns to ensure .str accessor works + dtype_dict = { + 'game_id': 'str', + 'hit_val': 'str', + 'hit_location': 'str', + 'batted_ball_type': 'str' + } + return pd.read_csv(cache_path, dtype=dtype_dict, low_memory=False) + + +if __name__ == '__main__': + # Test the transformer + import sys + + logging.basicConfig( + level=logging.INFO, + format='%(asctime)s - %(name)s - %(levelname)s - %(message)s' + ) + + if len(sys.argv) > 1: + test_file = sys.argv[1] + else: + test_file = 'data-input/retrosheet/retrosheets_events_2005.csv' + + print(f"\n{'='*60}") + print(f"Testing Retrosheet Transformer") + print(f"{'='*60}\n") + + df = load_retrosheet_csv(test_file) + + print(f"\nTransformed DataFrame Info:") + print(f"Shape: {df.shape}") + print(f"\nColumns: {list(df.columns)}") + print(f"\nSample rows:") + print(df.head(3)) + + print(f"\nEvent type distribution:") + print(df['event_type'].value_counts()) + + print(f"\nBatted ball type distribution:") + print(df['batted_ball_type'].value_counts())
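+
+    # The derived hit_val column should mirror the hit rows of event_type ('1'-'4', blank otherwise)
+    print(f"\nHit value distribution:")
+    print(df['hit_val'].value_counts())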