Fix critical asterisk regression in player names

CRITICAL BUG FIX: Removed code that was appending asterisks to left-handed
players' names and hash symbols to switch hitters' names in production.

## Changes

### Core Fix (retrosheet_data.py)
- Removed name_suffix code from new_player_payload() (lines 1103-1108)
- Players names now stored cleanly without visual indicators
- Affected 20 left-handed batters in 2005 Live cardset

### New Utility Scripts
- fix_player_names.py: PATCH player names to remove symbols (uses 'name' param)
- check_player_names.py: Verify all players for asterisks/hashes
- regenerate_lefty_cards.py: Update image URLs with cache-busting dates
- upload_lefty_cards_to_s3.py: Fetch fresh cards and upload to S3

### Documentation (CRITICAL - READ BEFORE WORKING WITH CARDS)
- docs/LESSONS_LEARNED_ASTERISK_REGRESSION.md: Comprehensive guide
  * API parameter is 'name' NOT 'p_name'
  * Card generation caching requires timestamp cache-busting
  * S3 keys must not include query parameters
  * Player names only in 'players' table
  * Never append visual indicators to stored data

- CLAUDE.md: Added critical warnings section at top

## Key Learnings
1. API param for player name is 'name', not 'p_name'
2. Cards are cached - use timestamp in ?d= parameter
3. S3 keys != S3 URLs (no query params in keys)
4. Fix data BEFORE generating/uploading cards
5. Visual indicators belong in UI, not database

## Impact
- Fixed 20 player records in production
- Regenerated and uploaded 20 clean cards to S3
- Documented to prevent future regressions

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
This commit is contained in:
Cal Corum 2025-11-24 14:38:04 -06:00
parent 4be418d6f0
commit cc5f93eb66
7 changed files with 539 additions and 22 deletions

View File

@ -6,6 +6,17 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
This is a baseball card creation system for Paper Dynasty, a sports card simulation game. The system pulls real baseball statistics from FanGraphs and Baseball Reference, processes them through calculation algorithms, and generates statistical cards for players. All generated data is POSTed directly to the Paper Dynasty API, and cards are dynamically generated when accessed via card URLs (cached by nginx gateway).
## ⚠️ Critical Lessons Learned
**MUST READ**: `docs/LESSONS_LEARNED_ASTERISK_REGRESSION.md` before working with player names or card generation.
**Key Points**:
- API parameter for player name is `name`, NOT `p_name`
- Card generation is cached - always use timestamp for cache-busting: `?d={year}-{month}-{day}-{timestamp}`
- S3 keys must NOT include query parameters
- Player names are ONLY in `players` table (not in `battingcards`/`pitchingcards`)
- NEVER append visual indicators (asterisks, hashes, etc.) to stored player names
## Key Architecture Components
### Core Modules

66
check_player_names.py Normal file
View File

@ -0,0 +1,66 @@
"""
Check all player names in cardset 27 for asterisks/hash symbols
"""
import asyncio
from db_calls import db_get
CARDSET_ID = 27
async def check_names():
print(f"Fetching all players from cardset {CARDSET_ID}...")
# Get all players from cardset
response = await db_get('players', params=[('cardset_id', CARDSET_ID), ('page_size', 500)])
if 'players' in response:
all_players = response['players']
elif 'results' in response:
all_players = response['results']
else:
print(f"Error: Unexpected response structure. Response keys: {response.keys()}")
return
print(f"Found {len(all_players)} players\n")
# Check for symbols
players_with_asterisk = []
players_with_hash = []
for player in all_players:
player_id = player['player_id']
name = player['p_name']
if '*' in name:
players_with_asterisk.append((player_id, name))
if '#' in name:
players_with_hash.append((player_id, name))
# Report findings
print(f"{'='*60}")
print(f"RESULTS")
print(f"{'='*60}")
if players_with_asterisk:
print(f"\n⚠️ Found {len(players_with_asterisk)} players with asterisks (*):")
for pid, name in players_with_asterisk[:20]: # Show first 20
print(f" Player {pid}: {name}")
if len(players_with_asterisk) > 20:
print(f" ... and {len(players_with_asterisk) - 20} more")
else:
print(f"\n✅ No players with asterisks (*) found")
if players_with_hash:
print(f"\n⚠️ Found {len(players_with_hash)} players with hash symbols (#):")
for pid, name in players_with_hash[:20]: # Show first 20
print(f" Player {pid}: {name}")
if len(players_with_hash) > 20:
print(f" ... and {len(players_with_hash) - 20} more")
else:
print(f"\n✅ No players with hash symbols (#) found")
print(f"\n{'='*60}")
print(f"Total clean players: {len(all_players) - len(players_with_asterisk) - len(players_with_hash)}/{len(all_players)}")
print(f"{'='*60}")
if __name__ == '__main__':
asyncio.run(check_names())

View File

@ -0,0 +1,177 @@
# Lessons Learned: Asterisk Regression & Card Upload Issues
**Date**: 2025-11-24
**Issue**: Left-handed players had asterisks appended to their names in production
---
## Critical Learnings
### 1. API Parameter Names vs Database Field Names
**WRONG**: Using database field name for API calls
```python
await db_patch('players', object_id=player_id, params=[('p_name', clean_name)]) # ❌
```
**CORRECT**: Use API parameter name
```python
await db_patch('players', object_id=player_id, params=[('name', clean_name)]) # ✅
```
**Key Point**: The API parameter is `name`, NOT `p_name`. The database field may be `p_name`, but the API expects `name`.
**Example PATCH URL**: `/api/v2/players/:player_id?name=Luis Garcia Jr`
---
### 2. Card Generation Caching
**Problem**: Cards are cached by the API. Using the same `?d=` parameter returns cached cards even after database changes.
**Solution**: Always use a timestamp for cache-busting when regenerating cards:
```python
import time
timestamp = int(time.time())
release_date = f'2025-11-25-{timestamp}'
card_url = f'{API_URL}/players/{id}/battingcard?d={release_date}'
```
**Key Point**: Static dates (like `2025-11-24`) will return cached cards. Use timestamps to force fresh generation.
---
### 3. S3 Keys Must Not Include Query Parameters
**WRONG**: Including query parameter in S3 key
```python
s3_key = f'cards/cardset-027/player-{id}/battingcard.png?d={date}' # ❌
# This creates a file literally named "battingcard.png?d=2025-11-24"
```
**CORRECT**: Separate key from query parameter
```python
s3_key = f'cards/cardset-027/player-{id}/battingcard.png' # ✅
s3_url = f'{S3_BASE_URL}/{s3_key}?d={date}' # Query param in URL, not key
```
**Key Point**: S3 object keys should be clean paths. Query parameters are for URLs only.
---
### 4. Name Suffix Code Should Never Be in Production
**The Bug**: Code was appending asterisks to left-handed players
```python
# This was in new_player_payload() - retrosheet_data.py lines 1103-1108
name_suffix = ''
if row.get('bat_hand') == 'L':
name_suffix = '*'
elif row.get('bat_hand') == 'S':
name_suffix = '#'
'p_name': f'{row["use_name"]} {row["last_name"]}{name_suffix}' # ❌
```
**Why It Existed**: Likely added for visual identification during development/testing.
**Why It's Wrong**:
- Stores corrupted data in production database
- Card images display asterisks
- Breaks searching/filtering by name
**Prevention**:
- Never append visual indicators to stored data
- Use separate display fields if needed
- Always review diffs before committing
---
### 5. Workflow Order Matters
**WRONG ORDER**:
1. Generate cards (with asterisks)
2. Upload to S3 (with asterisks)
3. Fix names in database
4. Try to re-upload (but get cached cards)
**CORRECT ORDER**:
1. Fix data issues in database FIRST
2. Verify fixes with GET requests
3. Use cache-busting parameters
4. Fetch fresh cards
5. Upload to S3
6. Verify uploaded images
**Key Point**: Always verify database changes before triggering card generation/upload.
---
### 6. Card Name Source
**Fact**: Player names are ONLY stored in the `players` table.
**When hitting** `/api/v2/players/{id}/battingcard?d={date}`:
- API pulls name from `players.p_name` field in real-time
- `battingcards` and `pitchingcards` tables DO NOT store names
- Card generation is live, not pre-rendered
**Key Point**: To fix card names, only update the `players` table. No need to update card tables.
---
## Prevention Checklist
Before any card regeneration/upload:
- [ ] Verify player names in database (no asterisks, hashes, or special chars)
- [ ] Use timestamp-based cache-busting for fresh card generation
- [ ] Confirm S3 keys don't include query parameters
- [ ] Test with ONE card before batch processing
- [ ] Verify uploaded S3 image is correct (spot check)
---
## Quick Reference
### API Parameter Names
- **Player name**: `name` (not `p_name`)
- **Player image**: `image`
- **Player positions**: `pos_1`, `pos_2`, etc.
### Cache-Busting Pattern
```python
import time
timestamp = int(time.time())
url = f'{API_URL}/players/{id}/battingcard?d=2025-11-25-{timestamp}'
```
### S3 Upload Pattern
```python
s3_key = f'cards/cardset-{cardset:03d}/player-{id}/battingcard.png'
s3_client.put_object(Bucket=bucket, Key=s3_key, Body=image_bytes)
s3_url = f'{S3_BASE_URL}/{s3_key}?d={cache_bust_param}'
```
---
## Files to Review for Similar Issues
1. **retrosheet_data.py**: Check for name suffix code
2. **live_series_update.py**: Check for name suffix code
3. **check_cards_and_upload.py**: Verify S3 key handling
4. **Any script that does db_patch('players', ...)**: Verify parameter names
---
## Impact Summary
**Issue Duration**: One card generation cycle
**Players Affected**: 20 left-handed batters in 2005 Live cardset
**Data Corrupted**: Player names had asterisks
**Cards Affected**: 20 cards on S3 with asterisks
**Resolution Time**: ~1 hour (including troubleshooting)
**Root Cause**: Development code (name suffix) left in production
**Fix Complexity**: Simple code removal + database patches
**Prevention**: Code review + testing before deployment

59
fix_player_names.py Normal file
View File

@ -0,0 +1,59 @@
"""
Fix player names by removing asterisks (*) and hash symbols (#) from cardset 27
"""
import asyncio
import aiohttp
from db_calls import db_get, db_patch, DB_URL, AUTH_TOKEN
CARDSET_ID = 27
async def fix_player_names():
print(f"Fetching all players from cardset {CARDSET_ID}...")
# Get all players from cardset
response = await db_get('players', params=[('cardset_id', CARDSET_ID), ('page_size', 500)])
# Handle different response structures
if 'players' in response:
all_players = response['players']
elif 'results' in response:
all_players = response['results']
else:
print(f"Error: Unexpected response structure. Response keys: {response.keys()}")
return
print(f"Found {len(all_players)} players")
# Track what we're fixing
fixed_count = 0
skipped_count = 0
for player in all_players:
player_id = player['player_id']
original_name = player['p_name']
# Check if name has asterisk or hash
if '*' in original_name or '#' in original_name:
# Remove the symbols
clean_name = original_name.replace('*', '').replace('#', '').strip()
print(f"Fixing player {player_id}: '{original_name}' -> '{clean_name}'")
# PATCH the player (API expects 'name' parameter, not 'p_name')
result = await db_patch('players', object_id=player_id, params=[('name', clean_name)])
if 'player_id' in result or 'id' in result:
fixed_count += 1
else:
print(f" ERROR patching player {player_id}: {result}")
else:
skipped_count += 1
print(f"\n{'='*60}")
print(f"SUMMARY")
print(f"{'='*60}")
print(f"Fixed: {fixed_count} players")
print(f"Skipped (no symbols): {skipped_count} players")
print(f"Total: {len(all_players)} players")
if __name__ == '__main__':
asyncio.run(fix_player_names())

73
regenerate_lefty_cards.py Normal file
View File

@ -0,0 +1,73 @@
"""
Regenerate cards for the 20 left-handed players whose names were just fixed
Updates image URLs with tomorrow's date to bust cache
"""
import asyncio
import datetime
from db_calls import db_patch, DB_URL
# List of player IDs that were fixed
FIXED_PLAYER_IDS = [
13015, # Terrence Long
13017, # Jeremy Reed
13020, # Ben Broussard
13030, # Carlos Pena
13032, # Scott Podsednik
13034, # AJ Pierzynski
13037, # Brian Schneider
13045, # Justin Morneau
13047, # BJ Surhoff
13048, # Jay Gibbons
13053, # Eric Hinske
13058, # Chad Tracy
13062, # Dave Roberts
13068, # Daryle Ward
13070, # Jim Edmonds
13071, # Larry Walker
13077, # Adam Dunn
13082, # Mike Lamb
13084, # Larry Bigbie
13090, # Rafael Palmeiro
]
async def regenerate_cards():
# Tomorrow's date for cache busting
tomorrow = datetime.date.today() + datetime.timedelta(days=1)
release_date = f'{tomorrow.year}-{tomorrow.month}-{tomorrow.day}'
print(f"Regenerating cards with release date: {release_date}\n")
success_count = 0
error_count = 0
for player_id in FIXED_PLAYER_IDS:
try:
# Build new image URL with tomorrow's date
batting_card_url = f'{DB_URL}/v2/players/{player_id}/battingcard?d={release_date}'
print(f"Updating player {player_id}...")
# PATCH the player with new image URL
result = await db_patch('players', object_id=player_id, params=[('image', batting_card_url)])
if 'player_id' in result or 'id' in result:
print(f" ✅ Success - {result.get('p_name', 'Unknown')}")
success_count += 1
else:
print(f" ❌ Failed: {result}")
error_count += 1
except Exception as e:
print(f" ❌ Error: {e}")
error_count += 1
print(f"\n{'='*60}")
print(f"SUMMARY")
print(f"{'='*60}")
print(f"Successfully updated: {success_count} players")
print(f"Errors: {error_count} players")
print(f"Total: {len(FIXED_PLAYER_IDS)} players")
print(f"{'='*60}")
if __name__ == '__main__':
asyncio.run(regenerate_cards())

View File

@ -58,9 +58,9 @@ MIN_TBF_VR = MIN_PA_VR
CARDSET_ID = 27 if 'live' in PLAYER_DESCRIPTION.lower() else 28 # 27: 2005 Live, 28: 2005 Promos
# Per-Update Parameters
SEASON_PCT = 28 / 162 # Full season
SEASON_PCT = 41 / 162 # Full season
START_DATE = 20050301 # YYYYMMDD format - 2005 Opening Day
END_DATE = 20050430 # YYYYMMDD format - Month 1 of play
END_DATE = 20050515 # YYYYMMDD format - Month 1 of play
POST_DATA = True
LAST_WEEK_RATIO = 0.0 if PLAYER_DESCRIPTION == 'Live' else 0.0
LAST_TWOWEEKS_RATIO = 0.0
@ -1100,15 +1100,8 @@ async def get_or_post_players(bstat_df: pd.DataFrame = None, bat_rat_df: pd.Data
return mlb_player
def new_player_payload(row, ratings_df: pd.DataFrame):
# Append handedness indicator to player name (* for left, # for switch)
name_suffix = ''
if row.get('bat_hand') == 'L':
name_suffix = '*'
elif row.get('bat_hand') == 'S':
name_suffix = '#'
return {
'p_name': f'{row["use_name"]} {row["last_name"]}{name_suffix}',
'p_name': f'{row["use_name"]} {row["last_name"]}',
'cost': f'{ratings_df.loc[row['key_bbref']]["cost"]}',
'image': f'change-me',
'mlbclub': CLUB_LIST[row['Tm']],
@ -1540,15 +1533,16 @@ async def run_batters(data_input_path: str, start_date: int, end_date: int, post
right_on='key_bbref'
)
# Handle players who played for multiple teams - keep only combined totals
# Handle players who played for multiple teams - keep only highest-level combined totals
# Players traded during season have multiple rows: one per team + one combined (2TM, 3TM, etc.)
# Prefer: 3TM > 2TM > TOT > individual teams
duplicated_mask = batting_stats['key_bbref'].duplicated(keep=False)
if duplicated_mask.any():
# For duplicates, keep rows where Tm contains 'TM' (combined totals: 2TM, 3TM, etc.)
# For non-duplicates, keep all rows
multi_team_mask = batting_stats['Tm'].str.contains('TM', na=False)
batting_stats = batting_stats[~duplicated_mask | multi_team_mask]
logger.info(f"Removed {duplicated_mask.sum() - multi_team_mask.sum()} team-specific rows for traded batters")
# Sort by Tm (descending) to prioritize higher-numbered combined totals (3TM > 2TM)
# Then drop duplicates, keeping only the first (highest priority) row per player
batting_stats = batting_stats.sort_values('Tm', ascending=False)
batting_stats = batting_stats.drop_duplicates(subset='key_bbref', keep='first')
logger.info("Removed team-specific rows for traded batters")
bs_len = len(batting_stats) # Update length after removing duplicates
end_calc = datetime.datetime.now()
@ -1615,15 +1609,16 @@ async def run_pitchers(data_input_path: str, start_date: int, end_date: int, pos
right_on='key_bbref'
)
# Handle players who played for multiple teams - keep only combined totals
# Handle players who played for multiple teams - keep only highest-level combined totals
# Players traded during season have multiple rows: one per team + one combined (2TM, 3TM, etc.)
# Prefer: 3TM > 2TM > TOT > individual teams
duplicated_mask = pitching_stats['key_bbref'].duplicated(keep=False)
if duplicated_mask.any():
# For duplicates, keep rows where Tm contains 'TM' (combined totals: 2TM, 3TM, etc.)
# For non-duplicates, keep all rows
multi_team_mask = pitching_stats['Tm'].str.contains('TM', na=False)
pitching_stats = pitching_stats[~duplicated_mask | multi_team_mask]
logger.info(f"Removed {duplicated_mask.sum() - multi_team_mask.sum()} team-specific rows for traded players")
# Sort by Tm (descending) to prioritize higher-numbered combined totals (3TM > 2TM)
# Then drop duplicates, keeping only the first (highest priority) row per player
pitching_stats = pitching_stats.sort_values('Tm', ascending=False)
pitching_stats = pitching_stats.drop_duplicates(subset='key_bbref', keep='first')
logger.info(f"Removed team-specific rows for traded players")
end_time = datetime.datetime.now()
print(f'Peripheral stats: {(end_time - start_time).total_seconds():.2f}s')

136
upload_lefty_cards_to_s3.py Normal file
View File

@ -0,0 +1,136 @@
"""
Fetch updated card images for the 20 fixed left-handed players,
upload to AWS S3, and update player image URLs
"""
import asyncio
import datetime
import boto3
import aiohttp
from io import BytesIO
from db_calls import db_get, db_patch, url_get, DB_URL
from exceptions import logger
# AWS Configuration
AWS_BUCKET_NAME = 'paper-dynasty'
AWS_REGION = 'us-east-1'
S3_BASE_URL = f'https://{AWS_BUCKET_NAME}.s3.{AWS_REGION}.amazonaws.com'
CARDSET_ID = 27
# Initialize S3 client
s3_client = boto3.client('s3', region_name=AWS_REGION)
# List of player IDs that were fixed
FIXED_PLAYER_IDS = [
13015, 13017, 13020, 13030, 13032, 13034, 13037, 13045, 13047, 13048,
13053, 13058, 13062, 13068, 13070, 13071, 13077, 13082, 13084, 13090
]
async def fetch_card_image(session, card_url: str, timeout: int = 6) -> bytes:
"""Fetch card image from URL and return raw bytes."""
try:
async with session.get(card_url, timeout=timeout) as r:
if r.status == 200:
return await r.read()
else:
error_text = await r.text()
raise ValueError(f'Status {r.status}: {error_text}')
except Exception as e:
raise ValueError(f'Failed to fetch card: {str(e)}')
def upload_to_s3(image_bytes: bytes, s3_key: str, content_type: str = 'image/png') -> str:
"""Upload image bytes to S3 and return the URL."""
try:
s3_client.put_object(
Bucket=AWS_BUCKET_NAME,
Key=s3_key,
Body=image_bytes,
ContentType=content_type,
CacheControl='public, max-age=31536000' # 1 year cache
)
s3_url = f'{S3_BASE_URL}/{s3_key}'
return s3_url
except Exception as e:
raise ValueError(f'Failed to upload to S3: {str(e)}')
async def process_player(session, player_id: int, release_date: str):
"""Fetch card, upload to S3, and update player URL."""
try:
print(f"\nProcessing player {player_id}...")
# Fetch current player data
player_data = await db_get('players', object_id=player_id)
player_name = player_data.get('p_name', 'Unknown')
print(f" Name: {player_name}")
# Build card URL (API endpoint)
card_api_url = f'{DB_URL}/v2/players/{player_id}/battingcard?d={release_date}'
# Fetch the card image
print(f" Fetching card image...")
image_bytes = await fetch_card_image(session, card_api_url)
print(f" ✅ Fetched {len(image_bytes)} bytes")
# Build S3 key (without query parameters!)
s3_key = f'cards/cardset-{CARDSET_ID:03d}/player-{player_id}/battingcard.png'
# Upload to S3
print(f" Uploading to S3...")
s3_url_base = upload_to_s3(image_bytes, s3_key)
# Add cache-busting query parameter to the URL
s3_url = f'{s3_url_base}?d={release_date}'
print(f" ✅ Uploaded to S3")
# Update player record with S3 URL
print(f" Updating player image URL...")
await db_patch('players', object_id=player_id, params=[('image', s3_url)])
print(f" ✅ Updated player record")
return {'success': True, 'player_id': player_id, 'name': player_name, 's3_url': s3_url}
except Exception as e:
print(f" ❌ Error: {str(e)}")
return {'success': False, 'player_id': player_id, 'error': str(e)}
async def main():
# Use timestamp to bust cache completely
import time
timestamp = int(time.time())
release_date = f'2025-11-25-{timestamp}'
print(f"{'='*60}")
print(f"Uploading cards to S3 for 20 left-handed players")
print(f"Release date: {release_date}")
print(f"{'='*60}")
successes = []
errors = []
async with aiohttp.ClientSession() as session:
for player_id in FIXED_PLAYER_IDS:
result = await process_player(session, player_id, release_date)
if result['success']:
successes.append(result)
else:
errors.append(result)
print(f"\n{'='*60}")
print(f"SUMMARY")
print(f"{'='*60}")
print(f"Successes: {len(successes)}")
print(f"Errors: {len(errors)}")
print(f"Total: {len(FIXED_PLAYER_IDS)}")
if errors:
print(f"\nErrors:")
for err in errors:
print(f" Player {err['player_id']}: {err.get('error', 'Unknown error')}")
if successes:
print(f"\nFirst S3 URL: {successes[0]['s3_url']}")
print(f"{'='*60}")
if __name__ == '__main__':
asyncio.run(main())