Pandas to Polars Migration Plan

Project Overview

The Paper Dynasty card-creation project is a baseball card generation system that processes real baseball statistics from FanGraphs and Baseball Reference. The system uses pandas extensively for data manipulation across 19 Python files, handling complex workflows including CSV ingestion, multi-source data merging, API response processing, and statistical calculations for D20-based game mechanics.

Current Pandas Usage

  • 19 files with pandas operations
  • Large CSV processing from FanGraphs and Baseball Reference
  • Complex multi-table joins (vs L/R handed pitching/hitting data)
  • API response transformation to DataFrames
  • Statistical calculations for card generation
  • Web scraping data processing for defensive stats

Migration Analysis

Pandas Usage Patterns Identified

1. Core Data Operations

  • CSV Reading: pd.read_csv() with filtering (query() method)
  • DataFrame Creation: From API responses and scraped data
  • Multi-table Joins: Complex merges with suffixes ('_vL', '_vR')
  • Index Operations: Heavy use of set_index() and index-based lookups (see the sketch after this list)
  • Apply Functions: Row-wise calculations using apply(axis=1)
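
Illustrative sketch of the index-based lookup pattern and its column-based Polars counterpart (the file name, column names, and ID value are assumptions, not the project's actual code):

import pandas as pd
import polars as pl

# Current pandas pattern: set an index, then look players up by ID
players_pd = pd.read_csv("players.csv").set_index("playerId")
row_pd = players_pd.loc[10155]

# Polars has no index; the same lookup becomes a column-based filter
players_pl = pl.read_csv("players.csv")
row_pl = players_pl.filter(pl.col("playerId") == 10155)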

2. Performance-Critical Areas

  • File I/O: Reading multiple large CSV files from FanGraphs
  • Join Operations: Player ID reconciliation across data sources
  • Statistical Calculations: Complex baseball metric computations
  • Memory Usage: Large DataFrame operations and copies

3. Complex Operations

  • Player Data Matching: Reconciling IDs across FanGraphs, Baseball Reference, and Paper Dynasty API
  • Split Data Merging: Combining left-handed vs right-handed statistics
  • Web Scraping Integration: Converting scraped HTML tables to DataFrames
  • Business Logic Apply Functions: Card rating calculations and D20 probability assignments

Migration Strategy: Phased Approach

Phase 1: Foundation & Low-Risk Operations (1-2 weeks)

Target Files:

  • creation_helpers.py - Player DataFrame creation functions
  • Basic CSV reading operations in core modules
  • Simple filtering and column operations

Operations to Migrate:

# Current pandas
pd.read_csv(file_path)
df.query('PA >= 20')
pd.DataFrame(api_response['data'])

# Polars equivalent
pl.read_csv(file_path)
df.filter(pl.col('PA') >= 20)
pl.DataFrame(api_response['data'])
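
For the larger FanGraphs exports, the lazy API can push the filter into the CSV scan itself; a minimal sketch (the file path is an assumption):

import polars as pl

# scan_csv builds a lazy query; the PA >= 20 predicate is pushed down into the read
qualified = (
    pl.scan_csv("data/fangraphs_batting.csv")  # hypothetical path
    .filter(pl.col("PA") >= 20)
    .collect()
)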

Benefits:

  • Immediate performance gains on file I/O (up to 30-50% faster CSV reading)
  • Minimal risk due to straightforward operations
  • Foundation for more complex migrations

Implementation Tasks:

  1. Add polars to requirements.txt
  2. Create a polars_helpers utility package
  3. Migrate basic CSV reading functions (see the sketch after this list)
  4. Update simple filtering operations
  5. Add comprehensive tests for migrated functions
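
A possible shape for the CSV utility in the new package; a sketch only, with the function name and PA-based filter as assumptions:

# polars_helpers/csv_operations.py (sketch)
import polars as pl

def read_stats_csv(path: str, min_pa: int | None = None) -> pl.DataFrame:
    """Read a FanGraphs/Baseball Reference CSV, optionally filtering on plate appearances."""
    df = pl.read_csv(path)
    if min_pa is not None:
        df = df.filter(pl.col("PA") >= min_pa)
    return df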

Phase 2: Data Merging & Joins (2-3 weeks)

Target Files:

  • batters/creation.py - Multi-way merges (lines 70, 33, 93)
  • pitchers/creation.py - Similar join patterns
  • Player ID reconciliation functions

Operations to Migrate:

# Current pandas
pd.merge(vl_basic, vr_basic, on="playerId", suffixes=('_vL', '_vR'))
pd.merge(player_data, batting_stats, left_on='key_fangraphs', right_on='playerId')

# Polars equivalent (note: the suffix argument applies only to right-hand columns)
vl_basic.join(vr_basic, on="playerId", suffix="_vR")
player_data.join(batting_stats, left_on='key_fangraphs', right_on='playerId')

Complexity Considerations:

  • Polars join syntax differences require careful testing
  • Suffix handling differs: Polars suffixes only the right-hand frame's columns, so the pandas '_vL'/'_vR' pattern needs explicit renames (see the sketch after this list)
  • Index-based operations need conversion to column-based approaches
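
One way to reproduce pandas' suffixes=('_vL', '_vR') behavior, which only renames the overlapping columns, is to rename both sides before joining; a sketch (the function name and key column are assumptions):

import polars as pl

def join_splits(vl: pl.DataFrame, vr: pl.DataFrame, key: str = "playerId") -> pl.DataFrame:
    """Join vs-LHP and vs-RHP splits, mirroring pandas' suffixes=('_vL', '_vR')."""
    # pandas only suffixes columns present in both frames, so mirror that here
    overlap = (set(vl.columns) & set(vr.columns)) - {key}
    vl = vl.rename({c: f"{c}_vL" for c in overlap})
    vr = vr.rename({c: f"{c}_vR" for c in overlap})
    return vl.join(vr, on=key)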

Implementation Tasks:

  1. Create join utility functions for common patterns
  2. Migrate split data merging (vs L/R handed stats)
  3. Update player ID reconciliation logic
  4. Comprehensive testing of join results
  5. Performance benchmarking vs pandas

Phase 3: Complex Transformations & Apply Functions (3-4 weeks)

Target Files:

  • batters/creation.py:203 - create_batting_card apply function
  • pitchers/creation.py - Similar card creation functions
  • Statistical calculation pipelines

Operations to Migrate:

# Current pandas
offense_stats.apply(create_batting_card, axis=1)

# Polars equivalent (business logic must be rewritten as expressions)
offense_stats.with_columns(
    pl.when(condition).then(value).otherwise(default).alias("new_col"),
    # ...one expression per card calculation previously handled inside apply()
)

Challenges:

  • Business logic embedded in apply functions needs complete rewrite
  • D20 probability calculations require vectorization
  • Complex conditional logic must be converted to Polars expressions (see the sketch after this list)
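
As an illustration of such a rewrite, a hypothetical contact rating and D20 hit-side assignment expressed as vectorized Polars expressions (column names, thresholds, and the rating scheme are invented for illustration, not the project's actual card logic):

import polars as pl

def add_card_ratings(offense_stats: pl.DataFrame) -> pl.DataFrame:
    return offense_stats.with_columns(
        # Tiered rating that replaces per-row if/elif logic inside apply()
        pl.when(pl.col("AVG") >= 0.300).then(pl.lit(18))
        .when(pl.col("AVG") >= 0.260).then(pl.lit(14))
        .otherwise(pl.lit(10))
        .alias("contact_rating"),
        # Hit probability mapped onto the 20 faces of a die, computed column-wise
        (pl.col("H") / pl.col("AB") * 20).round(0).cast(pl.Int8).alias("hit_sides"),
    )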

Implementation Tasks:

  1. Analyze and document all apply function logic
  2. Rewrite business logic using Polars expressions
  3. Create vectorized D20 probability functions
  4. Extensive testing of card generation accuracy
  5. Performance optimization of new expressions

Phase 4: Web Scraping & Advanced Operations (2-3 weeks)

Target Files:

  • defenders/calcs_defense.py:581 - Scraped data processing
  • Complex aggregations and groupby operations
  • Performance-critical loops and calculations

Operations to Migrate:

# Current pandas
pd.DataFrame(scraped_data, index=indices, columns=headers)
df.groupby('position').agg({'stat': 'mean'})

# Polars equivalent
# Polars has no row index; carry the former index values as an ordinary column
pl.DataFrame(scraped_data, schema=headers, orient="row").with_columns(pl.Series("index", indices))
df.group_by('position').agg(pl.col('stat').mean())
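
For the scraped defensive tables, the conversion might look like the following, assuming the scraper yields a list of row tuples plus a header row (all names are illustrative):

import polars as pl

def scraped_table_to_frame(rows: list[tuple], headers: list[str]) -> pl.DataFrame:
    """Convert scraped HTML table rows into a typed Polars DataFrame."""
    df = pl.DataFrame(rows, schema=headers, orient="row")
    # Scraped cells arrive as strings; cast the stat columns (assumed to follow a name column)
    return df.with_columns(pl.col(headers[1:]).cast(pl.Float64, strict=False))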

Implementation Tasks:

  1. Migrate web scraping data processing
  2. Convert complex groupby operations
  3. Optimize performance-critical calculations
  4. Update defensive rating algorithms
  5. Final integration testing

Implementation Timeline

Total Estimated Duration: 8-12 weeks

Week 1-2: Phase 1 - Foundation

  • Environment setup and basic migrations
  • CSV reading and simple operations
  • Initial performance benchmarking

Week 3-5: Phase 2 - Data Merging

  • Join operations and player matching
  • Complex merge patterns
  • Data integrity validation

Week 6-9: Phase 3 - Complex Transformations

  • Apply function rewrites
  • Business logic migration
  • Card generation accuracy testing

Week 10-12: Phase 4 - Advanced Operations

  • Web scraping integration
  • Performance optimization
  • Final testing and documentation

Technical Requirements

Dependencies

# Add to requirements.txt
polars>=0.20.0

Utility Module Structure

polars_helpers/
├── __init__.py
├── csv_operations.py      # CSV reading utilities
├── join_operations.py     # Merge and join helpers
├── transformations.py     # Apply function replacements
└── performance.py         # Benchmarking utilities (sketched below)
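
The benchmarking utility can stay simple: time the same pipeline under both engines; a sketch (assumes callables that produce equivalent results):

# polars_helpers/performance.py (sketch)
import time
from typing import Callable

def best_runtime(fn: Callable[[], object], repeats: int = 5) -> float:
    """Return the fastest wall-clock time over several runs of fn."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    return min(times)

def compare_runtimes(pandas_fn: Callable[[], object], polars_fn: Callable[[], object]) -> dict[str, float]:
    """Time a pandas implementation against its Polars replacement."""
    return {"pandas_s": best_runtime(pandas_fn), "polars_s": best_runtime(polars_fn)}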

Testing Strategy

  1. Unit Tests: Each migrated function
  2. Integration Tests: End-to-end card generation
  3. Performance Tests: Speed and memory comparisons
  4. Data Validation: Ensure pandas and Polars outputs match (see the sketch below)
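
Output validation can lean on pandas' own comparison utilities by converting the Polars result back; a sketch (the helper name is an assumption, and to_pandas() requires pyarrow):

import pandas.testing as pdt

def assert_equivalent(pandas_df, polars_df) -> None:
    """Fail if the migrated Polars pipeline diverges from the pandas baseline."""
    converted = polars_df.to_pandas()
    # Align column order and drop the pandas index so only the data is compared
    converted = converted[list(pandas_df.columns)]
    pdt.assert_frame_equal(
        pandas_df.reset_index(drop=True),
        converted.reset_index(drop=True),
        check_dtype=False,  # dtypes legitimately differ between engines
    )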

Expected Benefits

Performance Improvements

  • 30-50% faster CSV reading and processing
  • 20-40% memory reduction for large datasets
  • Faster joins and aggregations
  • Better multi-threading for parallel operations

Code Quality Improvements

  • Better type safety with explicit schemas
  • More readable expression syntax
  • Lazy evaluation for optimized query plans
  • Modern DataFrame API with better ergonomics

Maintenance Benefits

  • Future-proof technology with active development
  • Better error messages and debugging
  • Consistent API across operations
  • Enhanced ecosystem integration

Risk Mitigation

Backwards Compatibility

  • Keep pandas dependency during transition
  • Gradual migration with feature flags (see the sketch after this list)
  • Comprehensive testing at each phase
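
A lightweight flag keeps the pandas path available while the Polars code rolls out; a sketch using an environment variable (the variable name and loader function are assumptions):

import os

USE_POLARS = os.environ.get("PAPER_DYNASTY_USE_POLARS", "0") == "1"  # hypothetical flag

def load_batting_stats(path: str):
    """Dispatch to the Polars or pandas implementation based on the flag."""
    if USE_POLARS:
        import polars as pl
        return pl.read_csv(path)
    import pandas as pd
    return pd.read_csv(path)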

Data Integrity

  • Validate outputs match exactly between pandas/polars
  • Automated testing of card generation accuracy
  • Performance regression monitoring

Team Adoption

  • Documentation of Polars patterns and idioms
  • Training on expression API usage
  • Code review guidelines for Polars best practices

Success Metrics

  1. Functionality: 100% identical card outputs
  2. Performance: >20% improvement in processing time
  3. Memory: >15% reduction in peak memory usage
  4. Maintainability: Cleaner, more readable code
  5. Team Adoption: Successful transition with minimal disruption

This migration will position the Paper Dynasty card-creation system with modern, high-performance data processing capabilities while maintaining the robust functionality that drives the baseball card generation pipeline.