Baseball Reference Pitching Stats Scraper
This script scrapes the Player Standard Pitching table from Baseball Reference and saves it as a CSV file in the specified cardset directory.
Features
- Scrapes the `players_standard_pitching` table from Baseball Reference
- Maps data to the expected CSV format matching existing pitcher-stats.csv files
- Handles missing data gracefully
- Includes comprehensive error handling and logging
- Saves output to the correct cardset directory structure
Installation
pip install -r requirements.txt
Usage
# Basic usage
python pull_pitching_stats.py --year 2025 --cardset-name "2025 Live Cardset"
# With verbose logging
python pull_pitching_stats.py --year 2025 --cardset-name "2025 Live Cardset" --verbose
Parameters
- `--year`: The year to scrape (e.g., 2025). This constructs the URL: `https://www.baseball-reference.com/leagues/majors/YYYY-standard-pitching.shtml`
- `--cardset-name`: Name of the cardset directory where the CSV will be saved
- `--verbose`: Enable verbose logging (optional)
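As a rough illustration of how these flags could be parsed and turned into the URL, here is a minimal sketch. The function names `build_parser` and `build_url` are hypothetical, and the choices of `type=int` and `required=True` are assumptions; the actual script may define its arguments differently.

```python
import argparse

# URL pattern quoted in the parameter description, with YYYY as a placeholder.
URL_TEMPLATE = "https://www.baseball-reference.com/leagues/majors/{year}-standard-pitching.shtml"


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the three documented flags."""
    parser = argparse.ArgumentParser(description="Scrape Baseball Reference pitching stats")
    # Assumption: year and cardset name are mandatory, verbose is a plain switch.
    parser.add_argument("--year", type=int, required=True, help="Season to scrape, e.g. 2025")
    parser.add_argument("--cardset-name", required=True, help="Cardset directory name")
    parser.add_argument("--verbose", action="store_true", help="Enable verbose logging")
    return parser


def build_url(year: int) -> str:
    """Substitute the year into the documented URL pattern."""
    return URL_TEMPLATE.format(year=year)


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["--year", "2025", "--cardset-name", "2025 Live Cardset"])
print(build_url(args.year))
# prints https://www.baseball-reference.com/leagues/majors/2025-standard-pitching.shtml
```

Note that argparse exposes `--cardset-name` as `args.cardset_name`, since hyphens in flag names become underscores.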
Output
The script creates:
- Directory: `data-input/{cardset-name}/`
- File: `pitching.csv` with 41 columns matching the expected format
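The output layout above could be produced along these lines. This is a sketch using only the standard library; `save_rows` and its `base_dir` parameter are hypothetical names, and the real script may write the file with a different tool (for example a DataFrame library).

```python
import csv
from pathlib import Path


def save_rows(rows, fieldnames, cardset_name, base_dir="data-input"):
    """Write rows to <base_dir>/<cardset_name>/pitching.csv and return the path."""
    out_dir = Path(base_dir) / cardset_name
    # Create the cardset directory (and parents) if it does not exist yet.
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / "pitching.csv"
    # newline="" lets the csv module control line endings itself.
    with out_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return out_path
```

Calling `save_rows(rows, columns, "2025 Live Cardset")` would then place the file under `data-input/2025 Live Cardset/`.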
Column Mapping
The script maps Baseball Reference columns to the expected format:
| BR Column | Output Column | Notes |
|---|---|---|
| Name | Name | Player name |
| Age | Age | Player age |
| Lev | Lev | League level |
| Tm | Tm | Team |
| G | G | Games |
| GS | GS | Games started |
| W | W | Wins |
| L | L | Losses |
| SV | SV | Saves |
| IP | IP | Innings pitched |
| H | H | Hits allowed |
| R | R | Runs allowed |
| ER | ER | Earned runs |
| BB | BB | Walks |
| SO | SO | Strikeouts |
| HR | HR | Home runs allowed |
| HBP | HBP | Hit by pitch |
| ERA | ERA | Earned run average |
| AB | AB | At bats against |
| 2B | 2B | Doubles allowed |
| 3B | 3B | Triples allowed |
| IBB | IBB | Intentional walks |
| GDP | GDP | Grounded into double plays |
| SF | SF | Sacrifice flies |
| SB | SB | Stolen bases allowed |
| CS | CS | Caught stealing |
| PO | PO | Pickoffs |
| BF | BF | Batters faced |
| Pit | Pit | Pitches thrown |
| Str | Str | Strikes |
| StL | StL | Strikes looking |
| StS | StS | Strikes swinging |
| GB/FB | GB/FB | Ground ball to fly ball ratio |
| LD | LD | Line drives |
| PU | PU | Pop ups |
| WHIP | WHIP | Walks + hits per inning |
| BAbip | BAbip | Batting average on balls in play |
| SO9 | SO9 | Strikeouts per 9 innings |
| SO/W | SO/W | Strikeout to walk ratio |
| (calculated) | #days | Left empty for business logic |
| (extracted) | mlbID | Left empty (could extract from links) |
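Because every Baseball Reference column keeps its name, the mapping in the table amounts to selecting columns in a fixed order and filling gaps. The sketch below shows one way this might look; `OUTPUT_COLUMNS`, `ALWAYS_EMPTY`, and `map_row` are hypothetical names, and the sample row is made-up placeholder data, not real statistics.

```python
# The 41 output columns, in the order listed in the mapping table.
OUTPUT_COLUMNS = [
    "Name", "Age", "Lev", "Tm", "G", "GS", "W", "L", "SV", "IP",
    "H", "R", "ER", "BB", "SO", "HR", "HBP", "ERA", "AB", "2B",
    "3B", "IBB", "GDP", "SF", "SB", "CS", "PO", "BF", "Pit", "Str",
    "StL", "StS", "GB/FB", "LD", "PU", "WHIP", "BAbip", "SO9", "SO/W",
    "#days", "mlbID",
]

# Columns the table says are left empty rather than scraped.
ALWAYS_EMPTY = {"#days", "mlbID"}


def map_row(scraped: dict) -> dict:
    """Return one output row; missing or always-empty columns become ""."""
    mapped = {}
    for column in OUTPUT_COLUMNS:
        if column in ALWAYS_EMPTY:
            mapped[column] = ""
        else:
            # Assumption: a column absent from the scrape is written as an empty string.
            mapped[column] = scraped.get(column, "")
    return mapped


# Placeholder input with only a few columns present.
row = map_row({"Name": "Example Pitcher", "W": "3", "L": "1"})
print(len(row), row["W"], repr(row["ERA"]))
# prints 41 3 ''
```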
Error Handling
The script handles the following failure cases:
- Network timeouts and connection errors
- Missing table elements
- Missing columns in scraped data
- Invalid HTML structure
- File system errors
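Two of these cases, network failures and a missing table, might be handled roughly as follows. This sketch uses the standard library's `urllib` so it stays self-contained; the actual script may rely on a different HTTP client or HTML parser. `ScrapeError`, `fetch_html`, `require_table`, and the injectable `opener` parameter are all hypothetical, and the substring check is a deliberately crude stand-in for real HTML parsing.

```python
import urllib.request

TABLE_ID = "players_standard_pitching"


class ScrapeError(Exception):
    """Hypothetical single error type for any scraping failure."""


def fetch_html(url, timeout=30.0, opener=urllib.request.urlopen):
    """Download a page, converting network failures into ScrapeError."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except OSError as exc:
        # URLError, HTTPError, and socket timeouts are all OSError subclasses.
        raise ScrapeError(f"Could not fetch {url}: {exc}") from exc


def require_table(html, table_id=TABLE_ID):
    """Fail early if the expected table id does not appear in the page."""
    if f'id="{table_id}"' not in html:
        raise ScrapeError(f"Table {table_id!r} not found in page")
    return html
```

Funnelling every failure into one exception type lets the caller log a single message and exit with a non-zero status, instead of handling each library's exceptions separately.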
Logging
The script provides detailed logging of:
- URL being scraped
- Number of columns and rows found
- Missing columns and data mapping
- File save location and success status
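A logging setup that responds to `--verbose` could look like the sketch below. The logger name, message wording, and format string are assumptions for illustration and do not reproduce the script's actual log output; `configure_logging` is a hypothetical helper.

```python
import logging


def configure_logging(verbose: bool) -> logging.Logger:
    """Return a logger at DEBUG level when verbose, INFO otherwise."""
    level = logging.DEBUG if verbose else logging.INFO
    # force=True (Python 3.8+) replaces any handlers configured earlier.
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(levelname)s %(message)s",
        force=True,
    )
    logger = logging.getLogger("pull_pitching_stats")  # hypothetical logger name
    logger.setLevel(level)
    return logger


logger = configure_logging(verbose=True)
# Illustrative messages covering the items listed above.
logger.info("Scraping %s", "https://www.baseball-reference.com/leagues/majors/2025-standard-pitching.shtml")
logger.debug("Found %d columns and %d rows", 0, 0)  # placeholder counts
logger.info("Saved output to %s", "data-input/2025 Live Cardset/pitching.csv")
```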