Cal Corum a2f4d02b18 Add scouting upload CLI command

- Add `pd-cards scouting upload` command to upload scouting CSVs to database server via SCP
- Update CLAUDE.md with critical warning: scouting must always run for ALL cardsets
- Document full workflow: `pd-cards scouting all && pd-cards scouting upload`

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

2026-01-12 14:17:14 -06:00


CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

This is a baseball card creation system for Paper Dynasty, a sports card simulation game. The system pulls real baseball statistics from FanGraphs and Baseball Reference, processes them through calculation algorithms, and generates statistical cards for players. All generated data is POSTed directly to the Paper Dynasty API, and cards are dynamically generated when accessed via card URLs (cached by nginx gateway).

⚠️ Critical Lessons Learned

MUST READ: docs/LESSONS_LEARNED_ASTERISK_REGRESSION.md before working with player names or card generation.

Key Points:

API parameter for player name is name, NOT p_name
Card generation is cached - always use timestamp for cache-busting: ?d={year}-{month}-{day}-{timestamp}
S3 keys must NOT include query parameters
Player names are ONLY in players table (not in battingcards/pitchingcards)
NEVER append visual indicators (asterisks, hashes, etc.) to stored player names

Key Architecture Components

Core Modules

batters/: Batting card creation with rating calculations (calcs_batter.py) and card generation (creation.py)
pitchers/: Pitching card creation with ERA/WHIP calculations (calcs_pitcher.py) and card generation (creation.py)
defenders/: Defensive rating calculations and fielding card generation (calcs_defense.py, creation.py)
db_calls.py: Paper Dynasty API interface with authentication and CRUD operations
creation_helpers.py: Shared utilities including D20 probability tables, stat normalization, and data sanitization

Data Flow

Input: CSV files from FanGraphs/Baseball Reference placed in data-input/[Year] [Type] Cardset/
Processing: Statistics are normalized using league averages and converted to D20-based game mechanics
Output: Generated card data is POSTed directly to Paper Dynasty API; cards rendered on-demand when URLs accessed

Entry Points

pd_cards/: CLI package (pd-cards command) for all card creation operations
live_series_update.py: Legacy script for live season card updates (use pd-cards live-series instead)
retrosheet_data.py: Legacy script for historical replay cardsets (use pd-cards retrosheet instead)
refresh_cards.py: Updates existing player card images and metadata
check_cards.py: Validates card data and generates test outputs
check_cards_and_upload.py: Legacy S3 upload script (use pd-cards upload instead)
scouting_batters.py / scouting_pitchers.py: Legacy scouting scripts (use pd-cards scouting instead)

pd-cards CLI

The primary interface is the pd-cards CLI tool. Install it with:

uv pip install -e .  # Install in development mode

Custom Characters (YAML Profiles)

Custom fictional players are defined as YAML profiles in pd_cards/custom/profiles/.

# List all custom character profiles
pd-cards custom list

# Preview a character's calculated ratings
pd-cards custom preview kalin_young

# Submit a character to the database
pd-cards custom submit kalin_young --dry-run  # Preview first
pd-cards custom submit kalin_young            # Actually submit

# Create a new character profile template
pd-cards custom new --name "Player Name" --type batter --hand R --target-ops 0.800
pd-cards custom new --name "Pitcher Name" --type pitcher --hand L --target-ops 0.650

Live Series Updates

# Update live series cards from FanGraphs/Baseball Reference data
pd-cards live-series update --cardset "2025 Season" --games 81 --dry-run
pd-cards live-series update --cardset "2025 Season" --games 162

# Show cardset status
pd-cards live-series status

Retrosheet Processing

# Process historical Retrosheet data
pd-cards retrosheet process 2005 --cardset-id 27 --description Live --dry-run
pd-cards retrosheet process 2005 --cardset-id 27 --description Live

# Generate outfield arm ratings
pd-cards retrosheet arms 2005 --events data-input/retrosheet/retrosheets_events_2005.csv

# Validate positions for a cardset
pd-cards retrosheet validate 27

# Fetch defensive stats from Baseball Reference
pd-cards retrosheet defense 2005 --output "data-input/2005 Live Cardset/"

Scouting Reports

CRITICAL: Scouting reports must ALWAYS be generated for ALL cardsets (no --cardset-id filter). The scouting database is a unified view across all players, and filtering to a single cardset will overwrite the full reports with partial data.

# Generate all scouting reports (ALWAYS run without cardset filter)
pd-cards scouting all

# Upload scouting reports to database server
pd-cards scouting upload

# Full workflow after any card changes:
pd-cards scouting all && pd-cards scouting upload

S3 Upload

# Upload card images to S3
pd-cards upload s3 --cardset "2005 Live" --dry-run
pd-cards upload s3 --cardset "2005 Live" --limit 10

# Check cards without uploading
pd-cards upload check --cardset "2005 Live" --limit 10

# Refresh card images
pd-cards upload refresh --cardset "2005 Live"

Legacy Commands (Still Available)

Testing

pytest                    # Run all tests
pytest tests/test_*.py    # Run a specific test file (replace * with the file name)

Card Generation

python live_series_update.py     # Generate live series cards
python retrosheet_data.py        # Generate historical replay cards
python refresh_cards.py          # Update existing card images
python check_cards.py            # Validate card data

Scouting Reports

python scouting_batters.py       # Generate batting scouting data
python scouting_pitchers.py      # Generate pitching scouting data

AWS S3 Card Upload

python check_cards_and_upload.py # Fetch cards from API and upload to S3

Analysis and Reporting

python analyze_cardset_rarity.py # Analyze players by franchise and rarity (batters/pitchers/combined)
python rank_pitching_staffs.py   # Rank teams 1-30 by pitching staff quality

Position Validation

# Verify position assignments after card generation (recommended after every run)
./scripts/check_positions.sh <cardset_id> [api_url]

# Examples:
./scripts/check_positions.sh 27                                    # Check production
./scripts/check_positions.sh 27 https://pddev.manticorum.com/api  # Check dev

# The script flags:
# - Anomalous DH counts (should be <5 for full-season cards)
# - Missing outfield positions (indicates defensive calculation failures)
# - Mismatches between player positions and cardpositions table

Outfield Arm Ratings (Retrosheet)

# Generate arm ratings CSV from Retrosheet play-by-play data
python generate_arm_ratings_csv.py --year 2005 --events data-input/retrosheet/retrosheets_events_2005.csv

# Test/validate arm ratings
python test_retrosheet_arms.py

# Output: data-output/retrosheet_arm_ratings_YYYY.csv

Data Input Requirements

FanGraphs Data (place in data-input/[YEAR] [TYPE] Cardset/)

vlhp-basic.csv / vlhp-rate.csv: vs Left-handed Pitching stats
vrhp-basic.csv / vrhp-rate.csv: vs Right-handed Pitching stats
vlhh-basic.csv / vlhh-rate.csv: vs Left-handed Hitting stats
vrhh-basic.csv / vrhh-rate.csv: vs Right-handed Hitting stats

Baseball Reference Data

running.csv: Baserunning statistics
pitching.csv: Standard pitching statistics
defense_*.csv: Defensive statistics for each position (c, 1b, 2b, 3b, ss, lf, cf, rf, of, p)

Retrosheet Play-by-Play Data

retrosheet_transformer.py: Preprocesses new Retrosheet CSV format to legacy format with smart caching
Place source files in data-input/retrosheet/ directory
Transformer automatically checks timestamps and only re-processes if source is newer than cache
Normalized cache files saved as *_normalized.csv for fast subsequent runs
Performance: ~5 seconds for initial transformation, <1 second for cached loads
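
The timestamp check described above could look roughly like the following. This is a minimal sketch under stated assumptions, not the code in retrosheet_transformer.py: the helper names normalized_cache_path and cache_is_fresh are hypothetical, and placing the cache next to the source file is a guess based only on the *_normalized.csv naming.

```python
from pathlib import Path


def normalized_cache_path(source: Path) -> Path:
    """Hypothetical helper: events.csv -> events_normalized.csv in the same directory."""
    return source.with_name(f"{source.stem}_normalized.csv")


def cache_is_fresh(source: Path, cache: Path) -> bool:
    """Return True when the cache exists and is at least as new as the source file."""
    if not cache.exists():
        return False
    return cache.stat().st_mtime >= source.stat().st_mtime
```

A caller would re-run the transformation only when cache_is_fresh(...) returns False, which is what makes repeat runs fast.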

Defense CSV Requirements

All defense files must use underscore naming (defense_c.csv, not defense-c.csv) and include these standardized column names:

key_bbref: Player identifier (required as index key)
Inn_def: Innings played at position
chances: Total fielding chances
E_def: Errors
DP_def: Double plays
fielding_perc: Fielding percentage
tz_runs_total: Total Zone runs saved
tz_runs_field: Zone runs (fielding only)
tz_runs_infield: Zone runs (infield only)
range_factor_per_nine: Range factor per 9 innings
range_factor_per_game: Range factor per game
Catchers only: caught_stealing_perc, pickoffs (not PO)
Position players: PO for putouts (not pickoffs)
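
A quick pre-flight check of a defense file against these requirements might look like the sketch below. The function name defense_csv_problems is hypothetical and not part of the repository, and treating only defense_c.csv as the catcher file is an assumption drawn from the naming example above.

```python
import csv
from pathlib import Path

# Column names copied from the list above.
REQUIRED_COLUMNS = {
    "key_bbref", "Inn_def", "chances", "E_def", "DP_def", "fielding_perc",
    "tz_runs_total", "tz_runs_field", "tz_runs_infield",
    "range_factor_per_nine", "range_factor_per_game",
}
CATCHER_COLUMNS = {"caught_stealing_perc", "pickoffs"}


def defense_csv_problems(path: Path) -> list[str]:
    """Return human-readable problems; an empty list means the header looks usable."""
    problems = []
    name = path.name
    if not (name.startswith("defense_") and name.endswith(".csv")):
        problems.append(f"{name}: expected underscore naming like defense_c.csv")
    # Only the header row is read; data rows are not validated here.
    with path.open(newline="") as handle:
        header = set(next(csv.reader(handle), []))
    required = set(REQUIRED_COLUMNS)
    if name == "defense_c.csv":  # assumption: this is the only catcher file
        required |= CATCHER_COLUMNS
    for column in sorted(required - header):
        problems.append(f"{name}: missing column {column}")
    return problems
```

Running this over every defense_*.csv before card generation would surface naming and header mistakes early, before they show up as missing positions.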

Minimum Playing Time Thresholds

Live Series: 20 PA vs L / 40 PA vs R (batters), 20 TBF vs L / 40 TBF vs R (pitchers)
Season Cards: 50 PA vs L / 75 PA vs R (batters), 50 TBF vs L / 75 TBF vs R (pitchers)
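
The thresholds above can be expressed as a small lookup. This is an illustrative sketch only: the names MIN_PLAYING_TIME and meets_minimums are hypothetical, and it assumes both splits must be met and that a player exactly at the threshold qualifies.

```python
# Minimum PA (batters) or TBF (pitchers) as (vs_left, vs_right), from the list above.
MIN_PLAYING_TIME = {
    "live": (20, 40),
    "season": (50, 75),
}


def meets_minimums(vs_left: int, vs_right: int, card_type: str,
                   ignore_limits: bool = False) -> bool:
    """Check both platoon splits against the minimums for the given card type."""
    if ignore_limits:  # mirrors the IGNORE_LIMITS idea; the real flag's behavior may differ
        return True
    min_left, min_right = MIN_PLAYING_TIME[card_type]
    return vs_left >= min_left and vs_right >= min_right
```

For example, a batter with 25 PA vs L and 60 PA vs R would pass the "live" check but fail the "season" check.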

Configuration

Database Settings (db_calls.py)

Production: https://pd.manticorum.com/api
Development: https://pddev.manticorum.com/api
Change alt_database variable to switch environments

Live Series Settings (live_series_update.py)

SEASON: Current year for live updates
CARDSET_NAME: Target cardset (e.g., "2025 Live")
GAMES_PLAYED: Season progress for live series calculations
IGNORE_LIMITS: Override minimum playing time requirements

Retrosheet Data Settings (retrosheet_data.py)

Before running retrosheet_data.py, verify these configuration settings:

PLAYER_DESCRIPTION: 'Live' for season cards, or ' PotM' for promotional cards
CARDSET_ID: Correct cardset ID (e.g., 27 for 2005 Live, 28 for 2005 Promos)
START_DATE / END_DATE: Date range in YYYYMMDD format matching your Retrosheet data
SEASON_PCT: Fraction of the season completed (e.g., 162/162 = 1.0 for a full season)
MIN_PA_VL / MIN_PA_VR: Minimum plate appearances (50/75 for full season, 1/1 for promos)
DATA_INPUT_FILE_PATH: Path to data directory (usually data-input/[Year] [Type] Cardset/)
EVENTS_FILENAME: Retrosheet CSV filename (e.g., retrosheets_events_2005.csv)

Configuration Checklist Before Running:

Database environment (alt_database in db_calls.py)
Cardset ID matches intended target
Date range matches Retrosheet data year
Defense CSV files present and properly named
Running/pitching CSV files present

AWS S3 Upload Settings (check_cards_and_upload.py)

CARDSET_NAME: Target cardset name to fetch players from (e.g., "2005 Live")
START_ID: Optional player_id to start from (useful for resuming uploads)
TEST_COUNT: Limit number of cards to process (set to None for all cards)
HTML_CARDS: Set to True to fetch HTML preview cards instead of PNG
UPLOAD_TO_S3: Enable/disable S3 upload (True for production)
UPDATE_PLAYER_URLS: Enable/disable updating player records with S3 URLs (careful - modifies database)
AWS_BUCKET_NAME: S3 bucket name (default: 'paper-dynasty')
AWS_REGION: AWS region (default: 'us-east-1')

S3 URL Structure: cards/cardset-{cardset_id:03d}/player-{player_id}/{batting|pitching}card.png?d={release_date}

Uses zero-padded 3-digit cardset ID for consistent sorting
Includes a cache-busting query parameter with the date (YYYY-M-D format); this belongs to the URL only, since S3 keys must not include query parameters
Uses persistent aiohttp session for efficient connection reuse
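
To make the key/URL split concrete, here is a minimal sketch of how the structure above could be assembled. The function names s3_key and card_url and the base_url argument are hypothetical; the real upload script may build these strings differently.

```python
from datetime import date


def s3_key(cardset_id: int, player_id: int, card_type: str) -> str:
    """Build the object key (no query string) with a zero-padded 3-digit cardset ID."""
    if card_type not in ("batting", "pitching"):
        raise ValueError(f"unknown card type: {card_type}")
    return f"cards/cardset-{cardset_id:03d}/player-{player_id}/{card_type}card.png"


def card_url(base_url: str, key: str, release_date: date) -> str:
    """Append the cache-busting ?d= parameter in YYYY-M-D form (no zero padding)."""
    stamp = f"{release_date.year}-{release_date.month}-{release_date.day}"
    return f"{base_url.rstrip('/')}/{key}?d={stamp}"
```

Keeping the two functions separate makes it harder to accidentally upload an object whose key contains the query string.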

AWS Credentials: Requires AWS CLI configured with credentials (~/.aws/credentials) and appropriate IAM permissions:

s3:PutObject, s3:GetObject, s3:ListBucket on the target bucket

Important Notes

The system uses D20-based probability mechanics where statistics are converted to chances out of 20
Cards are generated with both basic stats and advanced metrics (OPS, WHIP, etc.)
Defensive ratings use zone-based fielding statistics from Baseball Reference
All player data flows through Paper Dynasty's API with bearer token authentication
Cards are dynamically rendered when accessed via URL, with nginx caching for performance

Rarity Assignment System

rarity_thresholds.py: Contains season-aware rarity thresholds (2024 vs 2025+)
Rarity is calculated from total_OPS (batters) or OPS-against (pitchers) in the ratings dataframe
post_player_updates() uses LEFT JOIN to preserve players without ratings (assigns Common/5 rarity + default OPS)
Players missing ratings will log warnings showing player_id and card_id for troubleshooting
Default OPS values: 0.612 (batters/Common), 0.702 (pitchers/Common reliever)
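
The LEFT JOIN behavior described above can be illustrated without pandas. This is a hedged sketch, not the body of post_player_updates(): the function attach_ratings and the dictionary shapes are hypothetical, and the rarity lookup for rated players is deliberately left out because it lives in rarity_thresholds.py.

```python
import logging

logger = logging.getLogger(__name__)

COMMON_RARITY = 5                                   # "Common/5" from the notes above
DEFAULT_OPS = {"batter": 0.612, "pitcher": 0.702}   # pitcher value is the Common reliever default


def attach_ratings(players: list[dict], ops_by_player: dict[int, float],
                   player_type: str) -> list[dict]:
    """Left-join style merge: every player is kept, whether or not ratings exist."""
    merged = []
    for player in players:
        row = dict(player)
        ops = ops_by_player.get(player["player_id"])
        if ops is None:
            logger.warning("No ratings for player_id=%s card_id=%s; using Common defaults",
                           player["player_id"], player.get("card_id"))
            row["total_OPS"] = DEFAULT_OPS[player_type]
            row["rarity"] = COMMON_RARITY
        else:
            # Rarity for rated players comes from the season-aware thresholds (not shown).
            row["total_OPS"] = ops
        merged.append(row)
    return merged
```

The point of the pattern is that an unrated player still appears in the output, with a warning in the logs, instead of silently disappearing as an inner join would cause.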

Position Assignment Rules

Batters: Positions assigned from defensive stats, sorted by innings played (most innings = pos_1)
DH Rule: "DH" only appears when a player has NO defensive positions at all
Pitchers: Assigned based on starter_rating (≥4 = SP, <4 = RP) and closer_rating (if present, add CP)
Position Updates: Script updates ALL 8 position slots when patching existing players to clear old data
Player cards can be viewed as HTML by adding html=true to the card URL: https://pddev.manticorum.com/api/v2/players/{id}/battingcard?d={date}&html=true
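
The position assignment rules above reduce to a few lines of logic. The sketch below is illustrative only: pitcher_positions and batter_positions are hypothetical names, and the input shapes (a rating pair, a dict of innings by position) are assumptions rather than the structures the scripts actually use.

```python
def pitcher_positions(starter_rating: int, closer_rating=None) -> list[str]:
    """SP when starter_rating is 4 or higher, otherwise RP; add CP if a closer rating exists."""
    positions = ["SP" if starter_rating >= 4 else "RP"]
    if closer_rating is not None:
        positions.append("CP")
    return positions


def batter_positions(innings_by_position: dict[str, float], slots: int = 8) -> list[str]:
    """Order positions by innings played (most first); DH only when there are none."""
    ranked = sorted(innings_by_position, key=innings_by_position.get, reverse=True)
    if not ranked:
        return ["DH"]
    return ranked[:slots]
```

Under these rules a large number of players with DH as pos_1 usually signals missing defensive data rather than real designated hitters.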

Common Issues and Solutions

Multi-Team Players (Traded During Season)

Problem: Players traded during season appear multiple times in Baseball Reference data (one row per team + combined total marked as "2TM", "3TM", etc.)

Solution: Script automatically filters to keep only combined season totals:

Detects duplicate key_bbref values after merging peripheral/running stats
Keeps rows where Tm column contains "TM" (2TM, 3TM, etc.)
Removes individual team rows to prevent duplicate player entries
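
The filtering steps above can be sketched in plain Python. The actual script works on pandas DataFrames; this version uses a list of dicts so it runs with the standard library alone, and the function name keep_combined_rows is hypothetical.

```python
from collections import Counter


def keep_combined_rows(rows: list[dict]) -> list[dict]:
    """For players listed more than once, keep only the combined row (Tm contains 'TM')."""
    counts = Counter(row["key_bbref"] for row in rows)
    kept = []
    for row in rows:
        is_duplicate = counts[row["key_bbref"]] > 1
        is_combined = "TM" in str(row["Tm"])
        # Single-team players pass through untouched; traded players keep the total row.
        if not is_duplicate or is_combined:
            kept.append(row)
    return kept
```

After this step each key_bbref should appear exactly once, which is what prevents duplicate player entries downstream.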

Dictionary Column Corruption in Ratings

Problem: When merging full card DataFrames with ratings DataFrames, pandas corrupts ratings_vL and ratings_vR dictionary columns, converting them to floats/NaN.

Solution: Only merge specific columns needed (key_bbref, player_id, battingcard_id/pitchingcard_id) instead of entire DataFrame.

No Players Found After Successful Run

Symptoms: Script completes successfully but API query returns 0 players

Common Causes:

Wrong Cardset: Check logs for actual cardset_id used vs. cardset queried in API
Wrong Database: Verify alt_database setting in db_calls.py (dev vs production)
Date Mismatch: START_DATE/END_DATE don't match Retrosheet data year
Empty PROMO_INCLUSION_RETRO_IDS: When PLAYER_DESCRIPTION is a promo name, this list must contain player IDs

Debugging Steps:

Check logs for actual POST operations and player_id values
Verify cardset_id in logs matches API query
Check database URL in logs matches intended environment
Query API with cardset_id from logs to find players

String Type Issues with Retrosheet Data

Problem: Pandas .str accessor fails on hit_val, hit_location, batted_ball_type columns

Solution: retrosheet_transformer.py explicitly converts these to string dtype and maintains type when loading from cache using dtype parameter in pd.read_csv()

Pitcher OPS Calculation Errors

Problem: min() function fails with "truth value is ambiguous" error when calculating OB values

Solution: Explicitly convert pandas values to Python floats before using min():

ob_vl = float(108 * (df_data['BB_vL'] + df_data['HBP_vL']) / df_data['TBF_vL'])
result = min(ob_vl, 0.8)  # Now works correctly

Outfielders Assigned as DH (Defense Column Mismatch)

Problem: All outfielders show pos_1 = "DH" instead of LF/CF/RF; cardpositions table has 0 outfield positions

Root Cause: Code checks for bis_runs_outfield or tz_runs_outfield columns in defense CSV files, but Baseball Reference only provides tz_runs_total

Symptoms:

50+ players with DH as pos_1 (should be <5 for full season)
No LF/CF/RF positions in player records
Log errors: "Outfield position failed: 'tz_runs_outfield'"

Solution (retrosheet_data.py lines 889, 926, 947):

# Wrong - checks batter stats row instead of defense dataframe columns
if 'tz_runs_total' in row:  # ❌

# Correct - checks defense dataframe for the column Baseball Reference actually provides
if 'tz_runs_total' in pos_df.columns:  # ✅

# Wrong - column doesn't exist in CSV
of_run_rating = 'bis_runs_outfield' if 'bis_runs_outfield' in pos_df else 'tz_runs_outfield'  # ❌

# Correct - fallback to column that exists
of_run_rating = 'bis_runs_outfield' if 'bis_runs_outfield' in pos_df.columns else 'tz_runs_total'  # ✅

Verification: Run ./scripts/check_positions.sh <cardset_id> after card generation to catch this issue

Additional Fix: Modified post_positions() to DELETE all existing cardpositions for the cardset before posting new ones. This prevents stale DH positions from remaining in the database when players gain defensive positions after bug fixes.

Outfield Arm Ratings from Retrosheet Data

Overview

For historical seasons where Baseball Reference's bis_runs_outfield is unavailable, we calculate OF arm ratings directly from Retrosheet play-by-play event data using assist rates and quality indicators.

System Architecture

Location: defenders/retrosheet_arm_calculator.py

Key Components:

Calculation Engine - Analyzes play-by-play events to measure arm strength
CSV Persistence - Saves calculated ratings for reuse
Load/Lookup Functions - Easy integration with card creation scripts

Formula (Rate-Dominant)

raw_score = (
    (assist_rate * 300) +        # PRIMARY: Assists per ball fielded
    (home_throws * 1.0) +        # Quality: Throwing runners out at home
    (batter_extra_outs * 1.0) +  # Quality: Preventing extra bases
    (total_assists * 0.1)        # Minimal volume bonus
)

Design Philosophy:

Assist rate is king - 300x weight (primary driver)
Quality indicators - Home throws and batter extra outs add context
No throwout rate - Assists already imply outs (redundant)
Minimal volume bonus - Raw count provides tiebreaker only
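
A small worked example shows why the formula is called rate-dominant. The function below wraps the formula above; the name arm_raw_score and the two sample fielders are made up for illustration, and assist_rate is taken as assists divided by balls fielded.

```python
def arm_raw_score(balls_fielded: int, total_assists: int,
                  home_throws: int, batter_extra_outs: int) -> float:
    """Apply the rate-dominant formula: assist rate x 300 plus small quality and volume terms."""
    if balls_fielded <= 0:
        raise ValueError("balls_fielded must be positive")
    assist_rate = total_assists / balls_fielded
    return (
        assist_rate * 300
        + home_throws * 1.0
        + batter_extra_outs * 1.0
        + total_assists * 0.1
    )


# Hypothetical fielders: fewer assists but a higher rate vs. more assists at a lower rate.
part_timer = arm_raw_score(balls_fielded=200, total_assists=8, home_throws=2, batter_extra_outs=1)
everyday = arm_raw_score(balls_fielded=400, total_assists=10, home_throws=2, batter_extra_outs=1)
```

Here the part-time fielder scores about 15.8 and the everyday fielder 11.5, even though the latter has more total assists, because the 300x rate term outweighs the 0.1x volume bonus.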

Rating Scale (-6 to +5)

Ratings follow a calibrated distribution (peak at 0 = ~45-50%):

Rating	Description	Z-Score	Approx %
-6	Elite cannon	> 2.5	~1%
-5	Outstanding	2.0 to 2.5	~2%
-4	Excellent	1.5 to 2.0	~3%
-3	Very Good	1.0 to 1.5	~5%
-2	Above Average	0.5 to 1.0	~10%
-1	Slightly Above	0.0 to 0.5	~15%
0	Average	-0.15 to 0.0	~45%
+1	Slightly Below	-0.5 to -0.15	~10%
+2	Below Average	-0.9 to -0.5	~5%
+3	Poor	-1.3 to -0.9	~3%
+4	Very Poor	-1.6 to -1.3	~2%
+5	Very Weak	< -1.6	~1%

Note: Thresholds calibrated to actual data distribution after 300x assist_rate weight compressed z-scores.
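
The table above maps directly onto a threshold lookup. This is a sketch, not the function in defenders/retrosheet_arm_calculator.py: the name z_to_arm_rating is hypothetical, and the table does not say which side each boundary value falls on, so treating every lower bound as exclusive is an assumption.

```python
# (lower bound on z-score, rating) pairs, checked from strongest arm to weakest.
Z_THRESHOLDS = [
    (2.5, -6), (2.0, -5), (1.5, -4), (1.0, -3), (0.5, -2), (0.0, -1),
    (-0.15, 0), (-0.5, 1), (-0.9, 2), (-1.3, 3), (-1.6, 4),
]


def z_to_arm_rating(z: float) -> int:
    """Return the arm rating for a z-score; lower (more negative) ratings are stronger arms."""
    for lower_bound, rating in Z_THRESHOLDS:
        if z > lower_bound:  # assumption: a z-score exactly on a boundary gets the weaker rating
            return rating
    return 5  # everything at or below -1.6
```

Note the asymmetry: the "average" band is only 0.15 wide on the z-score axis yet holds roughly 45% of players, which reflects how tightly the scores cluster just below the mean.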

Critical Bug Fix: Fielder vs Lineup Columns

Problem: Original implementation used wrong columns for fielder positions.

Wrong Columns (Lineup Order):

l7, l8, l9 = 7th, 8th, 9th batters in lineup (NOT field positions!)

Correct Columns (Actual Fielders):

f7, f8, f9 = Fielders at positions 7 (LF), 8 (CF), 9 (RF)

Impact:

Was measuring arm strength of whoever batted 7th/8th/9th
Known strong arms (Ichiro, Crawford, Edmonds) didn't show up
Rankings were based on batting order, not defensive positions

Fix: All references updated to use f7, f8, f9 fielder columns.

Data Requirements

Retrosheet Columns Used:

f7, f8, f9 - Fielder IDs at LF/CF/RF (CRITICAL: not l7/l8/l9!)
a7, a8, a9 - Assists by position
po7, po8, po9 - Putouts by position
brout1, brout2, brout3, brout_b - Which fielder got the out
Event descriptions for context

Minimum Sample Size: 50 balls fielded per position (adjustable with season_pct)

Generating Arm Ratings

Command:

python generate_arm_ratings_csv.py --year 2005 --events data-input/retrosheet/retrosheets_events_2005.csv

Output: data-output/retrosheet_arm_ratings_2005.csv

CSV Columns:

player_id - Baseball Reference ID (key_bbref)
position - LF/CF/RF
season - Year
balls_fielded - Sample size
total_assists - Assist count
home_throws - Throws to home that got outs
batter_extra_outs - Prevented extra bases
assist_rate - Assists / balls fielded
raw_score - Pre-normalization score
z_score - Position-adjusted z-score
arm_rating - Final rating (-6 to +5)
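
The z_score column is described as position-adjusted, meaning each raw score is compared only with others at the same position. One way to compute that is sketched below; the function name add_position_z_scores is hypothetical, and using the population standard deviation (pstdev) is an assumption, since the source does not say which estimator the calculator uses.

```python
from collections import defaultdict
from statistics import mean, pstdev


def add_position_z_scores(rows: list[dict]) -> list[dict]:
    """Return copies of rows with a z_score computed within each position group."""
    by_position = defaultdict(list)
    for row in rows:
        by_position[row["position"]].append(row["raw_score"])
    scored = []
    for row in rows:
        scores = by_position[row["position"]]
        spread = pstdev(scores)
        # A group with no spread (e.g., a single player) gets a neutral z-score of 0.
        z = 0.0 if spread == 0 else (row["raw_score"] - mean(scores)) / spread
        scored.append({**row, "z_score": z})
    return scored
```

Normalizing within LF, CF, and RF separately keeps a right fielder from being rated against left fielders, who typically face different throwing opportunities.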

Using in Card Creation Scripts

Load pre-calculated ratings:

from defenders.retrosheet_arm_calculator import load_arm_ratings_from_csv, get_arm_for_player

# At script start
arm_ratings = load_arm_ratings_from_csv(season_year=2005)

# When assigning positions
player_arm = get_arm_for_player(arm_ratings, 'suzui001', default=0)

Calculate on-the-fly:

import pandas as pd

from defenders.retrosheet_arm_calculator import calculate_of_arms_from_retrosheet

# Load the raw play-by-play events, then compute ratings for a full season
df_events = pd.read_csv('data-input/retrosheet/events.csv')
arm_ratings = calculate_of_arms_from_retrosheet(df_events, season_pct=1.0)

Integration in retrosheet_data.py:

# After loading events
from defenders.retrosheet_arm_calculator import load_arm_ratings_from_csv

try:
    retrosheet_arm_ratings = load_arm_ratings_from_csv(SEASON_YEAR)
except FileNotFoundError:
    retrosheet_arm_ratings = {}  # Use defaults if not found

# In create_positions(), replace arm_outfield() call:
from defenders.retrosheet_arm_calculator import get_arm_for_player
arm_rating = get_arm_for_player(retrosheet_arm_ratings, df_data['key_bbref'], default=0)

Documentation

Detailed guides:

docs/of_arm_rating_improvement_proposal.md - Full methodology and design
docs/HOW_TO_USE_ARM_RATINGS.md - Integration guide with examples
docs/formula_weight_comparison.md - Before/after comparison
docs/CRITICAL_BUG_FIX_fielder_columns.md - Fielder column bug fix details
docs/arm_rating_scale_reference.md - Quick reference for rating scale

Key Advantages

Historical Availability - Works for any season with Retrosheet data (1921+)
Rate-Based - Prioritizes assist rate over volume (no platoon penalty)
Position-Adjusted - Normalized within LF/CF/RF for fair comparison
Quality-Aware - Credits high-value throws (home, preventing extra bases)
Persistent - CSV output allows consistent ratings across runs
Transparent - Clear formula allows tuning and debugging

Validation

Test script: python test_retrosheet_arms.py

2005 Results:

300 qualified outfielders
Distribution: ~1% elite (-6), ~45% average (0), ~1% very weak (+5)
Known strong arms (Ichiro, Guerrero) properly identified after bug fix
Assist rate correctly dominates over volume
