Pandas to Polars Migration Plan
Project Overview
The Paper Dynasty card-creation project is a baseball card generation system that processes real baseball statistics from FanGraphs and Baseball Reference. The system uses pandas extensively for data manipulation across 19 Python files, handling complex workflows including CSV ingestion, multi-source data merging, API response processing, and statistical calculations for D20-based game mechanics.
Current Pandas Usage
- 19 files with pandas operations
- Large CSV processing from FanGraphs and Baseball Reference
- Complex multi-table joins (vs L/R handed pitching/hitting data)
- API response transformation to DataFrames
- Statistical calculations for card generation
- Web scraping data processing for defensive stats
Migration Analysis
Pandas Usage Patterns Identified
1. Core Data Operations
- CSV Reading: pd.read_csv() with filtering (query() method)
- DataFrame Creation: From API responses and scraped data
- Multi-table Joins: Complex merges with suffixes ('_vL', '_vR')
- Index Operations: Heavy use of set_index() and index-based lookups
- Apply Functions: Row-wise calculations using apply(axis=1)
2. Performance-Critical Areas
- File I/O: Reading multiple large CSV files from FanGraphs
- Join Operations: Player ID reconciliation across data sources
- Statistical Calculations: Complex baseball metric computations
- Memory Usage: Large DataFrame operations and copies
3. Complex Operations
- Player Data Matching: Reconciling IDs across FanGraphs, Baseball Reference, and Paper Dynasty API
- Split Data Merging: Combining left-handed vs right-handed statistics
- Web Scraping Integration: Converting scraped HTML tables to DataFrames
- Business Logic Apply Functions: Card rating calculations and D20 probability assignments
Migration Strategy: Phased Approach
Phase 1: Foundation & Low-Risk Operations (1-2 weeks)
Target Files:
- creation_helpers.py - Player DataFrame creation functions
- Basic CSV reading operations in core modules
- Simple filtering and column operations
Operations to Migrate:
# Current pandas
pd.read_csv(file_path)
df.query('PA >= 20')
pd.DataFrame(api_response['data'])
# Polars equivalent
pl.read_csv(file_path)
df.filter(pl.col('PA') >= 20)
pl.DataFrame(api_response['data'])
Benefits:
- Immediate performance gains on file I/O (an estimated 30-50% faster CSV reading)
- Minimal risk due to straightforward operations
- Foundation for more complex migrations
Implementation Tasks:
- Add polars to requirements.txt
- Create the polars_helpers utility module (see the sketch after this list)
- Migrate basic CSV reading functions
- Update simple filtering operations
- Add comprehensive tests for migrated functions
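A minimal sketch of what the CSV-reading helper in the planned polars_helpers module could look like; the function name, the min_pa parameter, and the PA threshold are illustrative assumptions rather than existing project code:
import polars as pl
from typing import Optional

def read_filtered_csv(file_path: str, min_pa: Optional[int] = None) -> pl.DataFrame:
    # Read a FanGraphs / Baseball Reference export with Polars' native CSV reader.
    df = pl.read_csv(file_path)
    if min_pa is not None:
        # Replaces the pandas pattern pd.read_csv(...).query('PA >= 20')
        df = df.filter(pl.col("PA") >= min_pa)
    return df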
Phase 2: Data Merging & Joins (2-3 weeks)
Target Files:
- batters/creation.py - Multi-way merges (lines 70, 33, 93)
- pitchers/creation.py - Similar join patterns
- Player ID reconciliation functions
Operations to Migrate:
# Current pandas
pd.merge(vl_basic, vr_basic, on="playerId", suffixes=('_vL', '_vR'))
pd.merge(player_data, batting_stats, left_on='key_fangraphs', right_on='playerId')
# Polars equivalent (Polars only suffixes overlapping right-hand columns; see the suffix-handling sketch below)
vl_basic.join(vr_basic, on="playerId", suffix="_vR")
player_data.join(batting_stats, left_on='key_fangraphs', right_on='playerId')
Complexity Considerations:
- Polars join syntax differences require careful testing
- Suffix handling works differently in Polars (see the sketch after this list)
- Index-based operations need conversion to column-based approaches
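Because of that suffix difference, a small helper like the sketch below could reproduce the pandas suffixes=('_vL', '_vR') result. Note this version renames every non-key column (pandas only renames colliding ones), which is usually what the split-stats merge needs anyway; the helper itself is illustrative, not existing project code:
import polars as pl

def join_splits(vl: pl.DataFrame, vr: pl.DataFrame, key: str = "playerId") -> pl.DataFrame:
    # pandas applies suffixes to both sides of a collision; Polars only suffixes
    # the right frame, so rename both sides explicitly to get '_vL'/'_vR' columns.
    vl_renamed = vl.rename({c: f"{c}_vL" for c in vl.columns if c != key})
    vr_renamed = vr.rename({c: f"{c}_vR" for c in vr.columns if c != key})
    return vl_renamed.join(vr_renamed, on=key, how="inner")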
Implementation Tasks:
- Create join utility functions for common patterns
- Migrate split data merging (vs L/R handed stats)
- Update player ID reconciliation logic
- Comprehensive testing of join results
- Performance benchmarking vs pandas
Phase 3: Complex Transformations & Apply Functions (3-4 weeks)
Target Files:
- batters/creation.py:203 - create_batting_card apply function
- pitchers/creation.py - Similar card creation functions
- Statistical calculation pipelines
Operations to Migrate:
# Current pandas
offense_stats.apply(create_batting_card, axis=1)
# Polars equivalent (requires rewriting business logic)
offense_stats.with_columns(
    # One when/then/otherwise expression per derived card column replaces
    # the row-wise apply; condition, value, and default are placeholders.
    pl.when(condition).then(value).otherwise(default).alias("new_col"),
)
Challenges:
- Business logic embedded in apply functions needs complete rewrite
- D20 probability calculations require vectorization
- Complex conditional logic must be converted to Polars expressions (see the sketch after this list)
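As a hedged illustration of that rewrite, the sketch below expresses one apply-based calculation as a chained when/then/otherwise expression; the wOBA thresholds and the contact_rating column are invented for this example and are not the project's actual card logic:
import polars as pl

def add_contact_rating(offense_stats: pl.DataFrame) -> pl.DataFrame:
    # Chained when/then/otherwise expressions replace the per-row if/elif logic
    # that previously lived inside apply(create_batting_card, axis=1).
    return offense_stats.with_columns(
        pl.when(pl.col("wOBA") >= 0.400).then(pl.lit(18))
        .when(pl.col("wOBA") >= 0.340).then(pl.lit(14))
        .when(pl.col("wOBA") >= 0.300).then(pl.lit(10))
        .otherwise(pl.lit(6))
        .alias("contact_rating")  # D20 target number, computed column-wise
    )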
Implementation Tasks:
- Analyze and document all apply function logic
- Rewrite business logic using Polars expressions
- Create vectorized D20 probability functions
- Extensive testing of card generation accuracy
- Performance optimization of new expressions
Phase 4: Web Scraping & Advanced Operations (2-3 weeks)
Target Files:
- defenders/calcs_defense.py:581 - Scraped data processing
- Complex aggregations and groupby operations
- Performance-critical loops and calculations
Operations to Migrate:
# Current pandas
pd.DataFrame(scraped_data, index=indices, columns=headers)
df.groupby('position').agg({'stat': 'mean'})
# Polars equivalent (Polars has no row index; keep the former index values as a column)
pl.DataFrame(scraped_data, schema=headers, orient="row").with_columns(pl.Series("index", indices))
df.group_by('position').agg(pl.col('stat').mean())
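For the more involved aggregations, a sketch like the one below covers the multi-statistic group_by pattern; the position, innings, and drs columns are placeholders standing in for the real defensive-stat schema:
import polars as pl

def summarize_defense(df: pl.DataFrame) -> pl.DataFrame:
    # Each pl.col(...) expression replaces one entry of the pandas agg() dict,
    # and everything is computed in a single pass over the grouped data.
    return (
        df.group_by("position")
        .agg(
            pl.col("innings").sum().alias("total_innings"),
            pl.col("drs").mean().alias("avg_drs"),
        )
        .sort("avg_drs", descending=True)
    )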
Implementation Tasks:
- Migrate web scraping data processing
- Convert complex groupby operations
- Optimize performance-critical calculations
- Update defensive rating algorithms
- Final integration testing
Implementation Timeline
Total Estimated Duration: 8-12 weeks
Week 1-2: Phase 1 - Foundation
- Environment setup and basic migrations
- CSV reading and simple operations
- Initial performance benchmarking
Week 3-5: Phase 2 - Data Merging
- Join operations and player matching
- Complex merge patterns
- Data integrity validation
Week 6-9: Phase 3 - Complex Transformations
- Apply function rewrites
- Business logic migration
- Card generation accuracy testing
Week 10-12: Phase 4 - Advanced Operations
- Web scraping integration
- Performance optimization
- Final testing and documentation
Technical Requirements
Dependencies
# Add to requirements.txt
polars>=0.20.0
Utility Module Structure
polars_helpers/
├── csv_operations.py # CSV reading utilities
├── join_operations.py # Merge and join helpers
├── transformations.py # Apply function replacements
└── performance.py # Benchmarking utilities
Testing Strategy
- Unit Tests: Each migrated function
- Integration Tests: End-to-end card generation
- Performance Tests: Speed and memory comparisons
- Data Validation: Ensure identical outputs (see the sketch after this list)
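One way the data-validation step could be implemented while both code paths exist; the comparison goes through to_pandas() (which requires pyarrow) and the function is a placeholder, not an existing test helper:
import pandas as pd
import pandas.testing as pdt
import polars as pl

def assert_outputs_match(pandas_df: pd.DataFrame, polars_df: pl.DataFrame) -> None:
    # Convert the Polars result to pandas and compare values; check_dtype=False
    # tolerates benign dtype differences the two libraries infer differently.
    pdt.assert_frame_equal(
        pandas_df.reset_index(drop=True),
        polars_df.to_pandas(),
        check_dtype=False,
    )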
Expected Benefits
Performance Improvements
- 30-50% faster CSV reading and processing
- 20-40% memory reduction for large datasets
- Faster joins and aggregations
- Better multi-threading for parallel operations
Code Quality Improvements
- Better type safety with explicit schemas
- More readable expression syntax
- Lazy evaluation for optimized query plans (example after this list)
- Modern DataFrame API with better ergonomics
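To illustrate the lazy-evaluation point above, a scan-based pipeline lets Polars push the filter and column selection down into the CSV read itself; the file name and selected columns are placeholders:
import polars as pl

# scan_csv builds a lazy query plan instead of loading the whole file;
# the filter and projection are pushed into the read when collected.
lazy_stats = (
    pl.scan_csv("fangraphs_batting.csv")
    .filter(pl.col("PA") >= 20)
    .select(["playerId", "PA", "wOBA"])
)
df = lazy_stats.collect()  # execution happens only here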
Maintenance Benefits
- Future-proof technology with active development
- Better error messages and debugging
- Consistent API across operations
- Enhanced ecosystem integration
Risk Mitigation
Backwards Compatibility
- Keep pandas dependency during transition
- Gradual migration with feature flags (sketch after this list)
- Comprehensive testing at each phase
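A possible shape for the feature-flag toggle while both libraries remain installed; the environment variable name and loader function are illustrative, not existing project settings:
import os
import pandas as pd
import polars as pl

# Hypothetical flag: the pandas path stays the default until a phase is validated.
USE_POLARS = os.getenv("PAPER_DYNASTY_USE_POLARS", "0") == "1"

def load_stats(file_path: str):
    # Route reads through Polars only when the flag is enabled.
    if USE_POLARS:
        return pl.read_csv(file_path)
    return pd.read_csv(file_path)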
Data Integrity
- Validate outputs match exactly between pandas/polars
- Automated testing of card generation accuracy
- Performance regression monitoring
Team Adoption
- Documentation of Polars patterns and idioms
- Training on expression API usage
- Code review guidelines for Polars best practices
Success Metrics
- Functionality: 100% identical card outputs
- Performance: >20% improvement in processing time
- Memory: >15% reduction in peak memory usage
- Maintainability: Cleaner, more readable code
- Team Adoption: Successful transition with minimal disruption
This migration will position the Paper Dynasty card-creation system with modern, high-performance data processing capabilities while maintaining the robust functionality that drives the baseball card generation pipeline.