Retrosheet Outfield Arm Rating - Implementation Summary
What I've Created
I've analyzed your Retrosheet play-by-play data and created a comprehensive system to calculate outfield arm ratings from historical game events. This gives you an alternative to Baseball Reference's bis_runs_outfield statistic that works for all historical seasons.
Files Created
1. Proposal Document
Location: docs/of_arm_rating_improvement_proposal.md
Comprehensive proposal explaining:
- Current limitations of the bis_runs_outfield approach
- Available data in Retrosheet events
- Statistical analysis of 2005 season
- Proposed multi-metric composite formula
- Comparison of advantages/disadvantages
- Integration recommendations
2. Implementation Module
Location: defenders/retrosheet_arm_calculator.py
Production-ready Python module with:
- calculate_of_arms_from_retrosheet() - Main entry point for batch calculation
- calculate_player_arm_rating() - Calculate rating for individual player
- calculate_position_baselines() - Position-adjusted normalization
- Detailed logging and documentation
3. Validation Script
Location: test_retrosheet_arms.py
Testing/validation tool that:
- Calculates ratings for all 2005 outfielders
- Shows distribution of ratings
- Identifies top 20 and bottom 10 arms
- Validates against known strong arms (Ichiro, Guerrero, etc.)
- Generates detailed statistical reports
Key Findings from 2005 Analysis
Available Metrics from Retrosheet
From the play-by-play data, we can extract:
- Total Assists - OF threw out a runner
- Home Throws - Threw out runner at home (strongest arm indicator)
- Batter Extra-Base Outs - Threw out batter trying to stretch (prevents doubles)
- Assist Rate - Assists per balls fielded (opportunity-adjusted)
- Throwout Rate - Success when attempting throw
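As a minimal sketch of the opportunity-adjusted metric, the assist rate can be computed from per-player counts. This assumes "balls fielded" means putouts plus assists, as in the sample size requirement later in this document; the function name and the sample counts are hypothetical and are not taken from retrosheet_arm_calculator.py.

```python
def assist_rate(assists: int, putouts: int) -> float:
    """Assists per ball fielded, as a fraction (0.03 == 3%).

    Assumes balls fielded = putouts + assists.
    """
    balls_fielded = putouts + assists
    if balls_fielded == 0:
        return 0.0  # no opportunities, so no meaningful rate
    return assists / balls_fielded


# Hypothetical season line: 9 assists and 291 putouts at one position
rate = assist_rate(assists=9, putouts=291)
print(f"{rate:.2%}")
```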
2005 League Statistics
| Position | Avg Assist Rate | Avg Throwout Rate | Total Assists |
|---|---|---|---|
| LF | 3.01% | 86.71% | 294 |
| CF | 2.04% | 81.23% | 247 |
| RF | 2.77% | 79.52% | 288 |
Key Insight: Assist rates and success rates vary by position, so we use position-adjusted z-scores.
How the Formula Works
Composite Score (Simplified Rate-Dominant Formula)
raw_score = (
    (assist_rate * 300) +          # PRIMARY: Assist rate (dominant factor)
    (home_throws * 1.0) +          # Quality: home plate throws
    (batter_extra_outs * 1.0) +    # Quality: preventing extra bases
    (total_assists * 0.1)          # Minimal volume bonus
)
Philosophy: Assist rate is the dominant driver. Assists are already outs by definition, so no separate "throwout rate" is needed. Quality indicators (home throws, batter extra outs) provide minimal context about the types of plays made.
With assist_rate expressed as a fraction, an elite rate (8%+) contributes 24+ points, while an average rate (3%) contributes about 9 points.
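To make the weighting concrete, here is a small sketch of the composite formula applied to two stat lines. The function name and both stat lines are hypothetical illustrations, and assist_rate is assumed to be a fraction (0.08 for 8%).

```python
def raw_arm_score(assist_rate, home_throws, batter_extra_outs, total_assists):
    """Composite score using the weights from the formula above.

    assist_rate is assumed to be a fraction, e.g. 0.03 for 3%.
    """
    return (
        assist_rate * 300           # dominant factor
        + home_throws * 1.0         # quality: throws to home
        + batter_extra_outs * 1.0   # quality: batter thrown out stretching
        + total_assists * 0.1       # minimal volume bonus
    )


# Hypothetical stat lines, for illustration only
elite = raw_arm_score(assist_rate=0.08, home_throws=3,
                      batter_extra_outs=2, total_assists=16)
average = raw_arm_score(assist_rate=0.03, home_throws=1,
                        batter_extra_outs=1, total_assists=6)

print(round(elite, 1), round(average, 1))
```

In both cases the rate term (24 and 9 points respectively) accounts for most of the total, which is the intent of the rate-dominant design.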
Position-Adjusted Rating
- Calculate league average and standard deviation for LF/CF/RF
- Convert player's raw score to z-score: (score - avg) / stddev
- Map z-score to -6 to +5 rating scale (normal distribution)
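The baseline and z-score steps can be sketched as follows. This is an illustration only, not the code in calculate_position_baselines(): the helper names and raw scores are hypothetical, and the use of the population standard deviation (pstdev) is an assumption, since the source does not say which variant is used.

```python
from statistics import mean, pstdev


def position_baselines(scores_by_position):
    """Return {position: (average, stddev)} for raw scores grouped by position.

    Assumption: population standard deviation.
    """
    return {pos: (mean(vals), pstdev(vals))
            for pos, vals in scores_by_position.items()}


def z_score(raw_score, baseline):
    """(score - avg) / stddev, guarding against a zero spread."""
    avg, stddev = baseline
    if stddev == 0:
        return 0.0
    return (raw_score - avg) / stddev


# Hypothetical raw scores per position
scores = {
    "LF": [8.0, 10.0, 12.0],
    "CF": [5.0, 7.0, 9.0],
    "RF": [9.0, 11.0, 16.0],
}
baselines = position_baselines(scores)
z = z_score(12.0, baselines["LF"])
print(round(z, 3))
```

Because each position has its own baseline, the same raw score can yield a different z-score in LF than in CF.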
Rating Scale (Calibrated Distribution)
| Z-Score | Rating | Description | Approx % |
|---|---|---|---|
| > 2.5 | -6 | Elite cannon | ~1% |
| 2.0 to 2.5 | -5 | Outstanding | ~2% |
| 1.5 to 2.0 | -4 | Excellent | ~3% |
| 1.0 to 1.5 | -3 | Very Good | ~5% |
| 0.5 to 1.0 | -2 | Above Average | ~15% |
| 0.0 to 0.5 | -1 | Slightly Above | ~30% |
| -0.5 to 0.0 | 0 | Average | ~40% |
| -0.8 to -0.5 | 1 | Slightly Below | ~20% |
| -1.2 to -0.8 | 2 | Below Average | ~10% |
| -1.5 to -1.2 | 3 | Poor | ~5% |
| -1.8 to -1.5 | 4 | Very Poor | ~2% |
| < -1.8 | 5 | Very Weak | ~1% |
Note: Thresholds were adjusted after the 300x assist_rate weight compressed the z-score spread.
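A minimal sketch of the threshold lookup, using the cut points from the table above. The function name is hypothetical, and the boundary handling is an assumption: the table does not say which bucket a z-score exactly on a cut point belongs to, so here each lower bound is treated as inclusive, except that -6 requires strictly more than 2.5.

```python
# (lower bound, rating) pairs from the table, strongest arm first
LOWER_BOUNDS = [
    (2.0, -5), (1.5, -4), (1.0, -3), (0.5, -2), (0.0, -1),
    (-0.5, 0), (-0.8, 1), (-1.2, 2), (-1.5, 3), (-1.8, 4),
]


def z_to_rating(z: float) -> int:
    """Map a position-adjusted z-score to the -6..+5 arm rating scale."""
    if z > 2.5:
        return -6
    for lower, rating in LOWER_BOUNDS:
        if z >= lower:  # assumption: lower bounds are inclusive
            return rating
    return 5  # anything below -1.8


for sample in (3.0, 1.2, 0.25, -0.25, -2.0):
    print(sample, z_to_rating(sample))
```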
Testing the Implementation
Step 1: Run the validation script
python test_retrosheet_arms.py
This will:
- Calculate ratings for all 2005 outfielders
- Show distribution (how many players at each rating)
- Identify elite vs weak arms
- Validate against known strong arms
Step 2: Review the output
Look for:
- Elite arms (rating ≤ -3): Should be players known for strong arms
- Distribution: Should be bell curve centered around 0
- Position differences: CF may have more volume but RF/LF may have stronger arms
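One quick way to eyeball the distribution is a text histogram of the ratings. The list below is hypothetical placeholder data, not output from test_retrosheet_arms.py.

```python
from collections import Counter

# Hypothetical ratings for a handful of qualifying outfielders
ratings = [-3, -1, -1, 0, 0, 0, 1, 1, 2]

dist = Counter(ratings)
for rating in range(-6, 6):  # the scale runs from -6 through +5
    count = dist.get(rating, 0)
    print(f"{rating:+d} {'#' * count} ({count})")

# Elite arms under the "rating <= -3" criterion from Step 2
elite = [r for r in ratings if r <= -3]
print("elite arms:", len(elite))
```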
Step 3: Compare to current method
For players with both bis_runs_outfield (from Baseball Reference) and Retrosheet data:
- Do the ratings correlate?
- Where do they differ and why?
- Which seems more accurate to your domain knowledge?
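The correlation question can be checked with a few lines of standard-library Python. All player ids and values below are hypothetical. The sketch also assumes that bis_runs_outfield is higher-is-better, while the rating scale in this document is lower-is-better, so agreement between the two methods would show up as a negative correlation.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical values keyed by player id
retrosheet_rating = {"playerA": -4, "playerB": -1, "playerC": 0, "playerD": 2}
bis_runs = {"playerA": 6, "playerB": 2, "playerC": 0, "playerD": -3, "playerE": 1}

# Compare only players present in both sources
shared = sorted(retrosheet_rating.keys() & bis_runs.keys())
r = pearson([retrosheet_rating[p] for p in shared],
            [bis_runs[p] for p in shared])
print(f"n={len(shared)} r={r:.3f}")
```

Players far from the overall trend are the ones worth reviewing by hand.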
Integration Options
Option 1: Hybrid (Recommended for Development)
Use Baseball Reference when available, Retrosheet as fallback:
# In defenders/calcs_defense.py, around line 71-84
if 'bis_runs_outfield' in pos_df.columns:
    # Current method - use BIS runs
    of_arms.append(int(pos_data[0].at[df_data["key_bbref"], 'bis_runs_outfield']))
else:
    # Fallback - use Retrosheet calculation
    if not hasattr(self, 'retrosheet_arms'):
        from defenders.retrosheet_arm_calculator import calculate_of_arms_from_retrosheet
        self.retrosheet_arms = calculate_of_arms_from_retrosheet(df_events, season_pct)
    # Get arm rating from Retrosheet
    from defenders.retrosheet_arm_calculator import get_arm_for_player
    arm_rating = get_arm_for_player(self.retrosheet_arms, df_data['key_bbref'])
    return arm_rating  # Skip the arm_outfield() call
Option 2: Full Replacement
Always use Retrosheet for consistency:
# In retrosheet_data.py, after loading events
from defenders.retrosheet_arm_calculator import calculate_of_arms_from_retrosheet
df_events = pd.read_csv(EVENTS_FILENAME)
retrosheet_arm_ratings = calculate_of_arms_from_retrosheet(df_events, SEASON_PCT)
# Then in create_positions() call:
from defenders.retrosheet_arm_calculator import get_arm_for_player
arm_rating = get_arm_for_player(retrosheet_arm_ratings, df_data['key_bbref'])
Sample Size Requirements
- Minimum: 50 balls fielded (putouts + assists) per position
- Full season: Most regulars will qualify (200+ balls)
- Partial season: Adjust with season_pct parameter
- Platoon players: May not qualify; get default rating of 0 (average)
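The qualification rule can be sketched as below. The source only says the minimum is adjusted with season_pct; scaling the 50-ball threshold linearly by season_pct is an assumption made here for illustration, and the helper names are hypothetical.

```python
MIN_BALLS_FIELDED = 50   # full-season minimum per position
DEFAULT_RATING = 0       # average arm, used when a player does not qualify


def qualifies(putouts: int, assists: int, season_pct: float = 1.0) -> bool:
    """True if the player fielded enough balls at a position to be rated.

    Assumption: the minimum scales linearly with season_pct.
    """
    required = MIN_BALLS_FIELDED * season_pct
    return (putouts + assists) >= required


def rating_or_default(rating: int, putouts: int, assists: int,
                      season_pct: float = 1.0) -> int:
    """Return the calculated rating, or the default for small samples."""
    if qualifies(putouts, assists, season_pct):
        return rating
    return DEFAULT_RATING


print(rating_or_default(-3, putouts=210, assists=9))   # regular player
print(rating_or_default(-3, putouts=30, assists=2))    # platoon player
```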
Why This Approach is Better
Advantages
- Historical Coverage - Works for any season with Retrosheet data (1921+)
- Multi-Dimensional - Considers quality and quantity of throws
- Position-Adjusted - Accounts for different expectations by position
- Transparent - Formula is clear and can be tuned
- Context-Aware - Weights high-value plays (home throws) more heavily
Disadvantages
- Processing Overhead - Must parse large play-by-play files
- Sample Size - Platoon players may not qualify
- Indirect - Measures outcomes, not raw arm strength
- One-Time Work - Need to calculate baselines for each season
Next Steps
- Run validation script to see 2005 results
- Review elite arms - Do they match your expectations?
- Choose integration approach (hybrid vs full replacement)
- Test on a small cardset before full deployment
- Tune weights if needed based on validation results
Questions to Consider
- Do the elite arms (rating -4 to -6) match players you know had strong arms?
- Are there players with unexpectedly high/low ratings? Why?
- How does this compare to the bis_runs_outfield method for 2005?
- Should home throws be weighted even more heavily?
- Should we adjust thresholds to get more granular ratings?
Support
The implementation includes extensive logging. Set logging level to DEBUG to see:
- Individual player calculations
- Raw scores and z-scores
- Position baselines
- Sample size warnings
import logging
logging.getLogger('exceptions').setLevel(logging.DEBUG)
Created: 2025-11-15
Status: Ready for Testing
Recommendation: Run test_retrosheet_arms.py to validate before integration