# Greenfield Polars Rebuild Plan

## Executive Summary

This plan outlines a complete greenfield rebuild of the Paper Dynasty card-creation system, prioritizing code quality, maintainability, and performance. The new system will be built from the ground up with modern Python practices, a Polars-first architecture, and comprehensive testing.

**Timeline: 18-20 weeks** | **Risk Level: High** | **Expected Performance Gain: 50-70%**

---

## Architecture Design

### **Core Principles**

1. **Separation of Concerns** - Clean boundaries between data, business logic, and presentation
2. **Type Safety** - Full type hints and validation throughout
3. **Lazy Evaluation** - Polars lazy API as the foundation
4. **Testability** - Dependency injection and pure functions
5. **Observability** - Structured logging and metrics from day one
6. **Configuration Management** - Environment-based settings

### **System Architecture**

```
card-creation-v2/
├── src/
│   ├── domain/               # Business logic (pure Python)
│   │   ├── models/           # Data classes and types
│   │   ├── calculations/     # Baseball statistics logic
│   │   └── rules/            # Card generation rules
│   ├── data/                 # Data layer (Polars-optimized)
│   │   ├── sources/          # External data ingestion
│   │   ├── processors/       # Data transformation pipelines
│   │   └── repositories/     # Data access abstractions
│   ├── services/             # Application services
│   │   ├── card_generation/  # Core card creation service
│   │   ├── api_client/       # Paper Dynasty API integration
│   │   └── web_scraper/      # Defensive stats scraping
│   ├── infrastructure/       # External concerns
│   │   ├── config/           # Configuration management
│   │   ├── logging/          # Structured logging
│   │   └── monitoring/       # Metrics and observability
│   └── cli/                  # Command-line interfaces
├── tests/                    # Comprehensive test suite
├── docs/                     # Technical documentation
└── scripts/                  # Deployment and maintenance
```

---

## Phase-by-Phase Implementation

### **Phase 1: Foundation & Architecture (4 weeks)**

#### **Week 1-2: Project Setup & Core Infrastructure**

**Deliverables:**
- Modern Python project structure with Poetry/pip-tools
- Pre-commit hooks (black, ruff, mypy, pytest)
- CI/CD pipeline (GitHub Actions)
- Logging and configuration framework
- Base domain models

**Core Infrastructure:**

```python
# src/domain/models/player.py
from dataclasses import dataclass

# PlayerId, Position, Hand, and Team are domain value types defined
# alongside these models.


@dataclass(frozen=True)
class Player:
    id: PlayerId
    name: str
    positions: list[Position]
    hand: Hand
    team: Team


# src/domain/models/statistics.py
from decimal import Decimal


@dataclass(frozen=True)
class BattingStats:
    plate_appearances: int
    at_bats: int  # needed for AVG; H/PA is not batting average
    hits: int
    doubles: int
    triples: int
    home_runs: int
    walks: int
    strikeouts: int

    @property
    def batting_average(self) -> Decimal:
        return Decimal(self.hits) / Decimal(self.at_bats)
```

#### **Week 3-4: Data Layer Foundation**

**Deliverables:**
- Polars-based data pipeline architecture
- CSV ingestion framework
- Schema validation system
- Basic data transformation utilities

**Data Pipeline Framework:**

```python
# src/data/pipeline.py
import polars as pl


class DataPipeline:
    def __init__(self, lazy_frame: pl.LazyFrame):
        self._lf = lazy_frame

    def filter_minimum_playing_time(self, min_pa: int) -> 'DataPipeline':
        return DataPipeline(
            self._lf.filter(pl.col('PA') >= min_pa)
        )

    def add_calculated_stats(self) -> 'DataPipeline':
        return DataPipeline(
            self._lf.with_columns([
                # AVG = H/AB; OBP here is the simplified (H + BB) / PA
                # (the full formula also needs HBP and SF).
                (pl.col('H') / pl.col('AB')).alias('AVG'),
                ((pl.col('BB') + pl.col('H')) / pl.col('PA')).alias('OBP'),
            ])
        )

    def collect(self) -> pl.DataFrame:
        return self._lf.collect()
```
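To make the intended call pattern concrete, a minimal usage sketch; the CSV path and the `Name` column are illustrative stand-ins, since real sources are only wired up in Phase 2:

```python
import polars as pl

from src.data.pipeline import DataPipeline

# Hypothetical local file; Phase 2 replaces this with the real sources.
stats = (
    DataPipeline(pl.scan_csv('data/hitting_2024.csv'))
    .filter_minimum_playing_time(min_pa=100)
    .add_calculated_stats()
    .collect()  # nothing executes until collect(), per the lazy-evaluation principle
)
print(stats.select(['Name', 'AVG', 'OBP']).head())
```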
---

### **Phase 2: Data Ingestion & Processing (4 weeks)**

#### **Week 5-6: External Data Sources**

**Deliverables:**
- FanGraphs CSV processor with validation
- Baseball Reference integration
- Player ID reconciliation system
- Web scraping framework for defensive stats

**Modern Data Ingestion:**

```python
# src/data/sources/fangraphs.py
import polars as pl

# Hand comes from the domain models; FanGraphsConfig from the
# configuration package.


class FanGraphsProcessor:
    def __init__(self, config: FanGraphsConfig):
        self.config = config

    async def load_hitting_stats(
        self,
        season: int,
        vs_hand: Hand
    ) -> pl.LazyFrame:
        file_path = self._build_file_path(season, 'hitting', vs_hand)
        return self._validate_schema(
            pl.scan_csv(file_path)
            .with_columns([
                pl.col('playerId').cast(pl.Utf8),
                pl.col('PA').cast(pl.Int32),
                # ... other type casting
            ])
            .filter(pl.col('PA') >= self.config.min_plate_appearances)
        )

    def _validate_schema(self, lf: pl.LazyFrame) -> pl.LazyFrame:
        required_columns = ['playerId', 'Name', 'Team', 'PA', 'AB', 'H', 'BB']
        missing = [c for c in required_columns if c not in lf.collect_schema().names()]
        if missing:
            raise ValueError(f'FanGraphs CSV is missing required columns: {missing}')
        return lf
```

#### **Week 7-8: Data Transformation Pipeline**

**Deliverables:**
- Split-hand data merging (vs L/R)
- Player matching across data sources
- Statistical calculation engine
- Data quality validation framework

**Clean Data Processing:**

```python
# src/data/processors/batting_processor.py
import polars as pl


class BattingStatsProcessor:
    def __init__(
        self,
        fangraphs: FanGraphsProcessor,
        bbref: BaseballReferenceProcessor
    ):
        self.fangraphs = fangraphs
        self.bbref = bbref

    async def process_season_stats(
        self,
        season: int,
        cardset_config: CardsetConfig
    ) -> BattingStatsResult:
        # Load data lazily
        vs_left = await self.fangraphs.load_hitting_stats(season, Hand.LEFT)
        vs_right = await self.fangraphs.load_hitting_stats(season, Hand.RIGHT)
        running = await self.bbref.load_baserunning_stats(season)

        # Join and process
        combined = self._join_split_hand_data(vs_left, vs_right)
        with_running = self._add_baserunning_stats(combined, running)
        validated = self._validate_data_quality(with_running)

        return BattingStatsResult(
            dataframe=validated.collect(),
            quality_report=self._generate_quality_report(validated)
        )
```

---

### **Phase 3: Business Logic & Calculations (4 weeks)**

#### **Week 9-10: Statistical Calculations**

**Deliverables:**
- D20 probability calculation engine
- Baseball statistics formulas
- Rating calculation algorithms
- Performance benchmarking framework

**Clean Business Logic:**

```python
# src/domain/calculations/ratings.py
from decimal import Decimal


class RatingCalculator:
    def __init__(self, d20_table: D20ProbabilityTable):
        self.d20_table = d20_table

    def calculate_contact_rating(
        self,
        batting_avg: Decimal,
        league_avg: Decimal
    ) -> D20Rating:
        relative_performance = batting_avg / league_avg
        probability = self._performance_to_probability(relative_performance)
        return self.d20_table.probability_to_rating(probability)

    def calculate_power_rating(
        self,
        isolated_power: Decimal,
        league_avg: Decimal
    ) -> D20Rating:
        # Clean, testable business logic: same shape as the contact
        # rating, normalized against league average.
        relative_performance = isolated_power / league_avg
        probability = self._performance_to_probability(relative_performance)
        return self.d20_table.probability_to_rating(probability)
```
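`D20Rating` and `D20ProbabilityTable` are referenced throughout but not specified in this plan. As a placeholder, here is a minimal sketch assuming a rating of N means "success on a d20 roll of N or less" (probability N/20), with ratings clamped to the 2-20 range that the Phase 5 property-based test asserts; the real table may well be tuned empirically instead:

```python
# src/domain/calculations/d20.py (hypothetical sketch)
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class D20Rating:
    value: int

    def __post_init__(self) -> None:
        # Matches the bounds asserted by test_rating_bounds in Phase 5.
        if not 2 <= self.value <= 20:
            raise ValueError(f'd20 rating out of bounds: {self.value}')


class D20ProbabilityTable:
    @classmethod
    def standard(cls) -> 'D20ProbabilityTable':
        return cls()

    def probability_to_rating(self, probability: Decimal) -> D20Rating:
        # Round to the nearest die face, then clamp into the legal range.
        face = int(round(probability * Decimal(20)))
        return D20Rating(min(20, max(2, face)))
```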
#### **Week 11-12: Card Generation Logic**

**Deliverables:**
- Card creation services
- Rarity assignment algorithms
- Position-specific logic
- Card validation framework

**Service Layer:**

```python
# src/services/card_generation/batting_card_service.py
import polars as pl


class BattingCardService:
    def __init__(
        self,
        rating_calculator: RatingCalculator,
        rarity_calculator: RarityCalculator,
        validator: CardValidator
    ):
        self.rating_calculator = rating_calculator
        self.rarity_calculator = rarity_calculator
        self.validator = validator

    async def generate_cards(
        self,
        batting_stats: pl.DataFrame,
        cardset_config: CardsetConfig
    ) -> list[BattingCard]:
        cards = []
        for player_stats in batting_stats.iter_rows(named=True):
            card = self._create_single_card(player_stats, cardset_config)
            validated_card = await self.validator.validate(card)
            cards.append(validated_card)
        return cards
```

---

### **Phase 4: Integration & APIs (3 weeks)**

#### **Week 13-14: External Service Integration**

**Deliverables:**
- Paper Dynasty API client
- Error handling and retry logic
- Rate limiting and throttling
- API response validation

**Modern API Integration:**

```python
# src/services/api_client/paper_dynasty_client.py
import aiohttp
from tenacity import retry, stop_after_attempt, wait_exponential


class PaperDynastyApiClient:
    def __init__(self, config: ApiConfig, session: aiohttp.ClientSession):
        self.config = config
        self.session = session

    @retry(stop=stop_after_attempt(3), wait=wait_exponential())
    async def create_batting_card(self, card: BattingCard) -> CardCreationResult:
        payload = self._serialize_card(card)

        async with self.session.post(
            f"{self.config.base_url}/battingcards",
            json=payload,
            headers=self._get_auth_headers()
        ) as response:
            response.raise_for_status()
            data = await response.json()
            return CardCreationResult.from_api_response(data)
```

#### **Week 15: Command Line Interfaces**

**Deliverables:**
- Modern CLI with Click/Typer (a minimal sketch follows)
- Configuration validation
- Progress indicators and logging
- Error reporting
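No CLI code is specified yet; as one possible shape, a minimal Typer sketch of the card-generation entry point (the command, option names, and module path are illustrative assumptions):

```python
# src/cli/generate_cards.py (hypothetical entry point)
import logging

import typer

app = typer.Typer(help='Paper Dynasty card-creation tools')


@app.command()
def generate(
    season: int = typer.Argument(..., help='Season to build cards for'),
    cardset: str = typer.Option('default', help='Cardset configuration name'),
    dry_run: bool = typer.Option(False, help='Validate inputs without calling the API'),
) -> None:
    """Generate batting cards for a single season."""
    logging.basicConfig(level=logging.INFO)
    typer.echo(f'Generating {cardset} cards for season {season} (dry_run={dry_run})')
    # Wiring to BattingCardService and PaperDynastyApiClient happens here.


if __name__ == '__main__':
    app()
```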
---

### **Phase 5: Testing & Validation (3 weeks)**

#### **Week 16-17: Comprehensive Testing**

**Testing Strategy:**
- **Unit Tests**: 90%+ coverage on business logic
- **Integration Tests**: End-to-end card generation
- **Performance Tests**: Benchmarking vs current system
- **Property-Based Testing**: Statistical calculation validation

**Test Architecture:**

```python
# tests/domain/test_rating_calculator.py
from decimal import Decimal

import pytest
from hypothesis import given, strategies as st

from src.domain.calculations.ratings import RatingCalculator
# D20Rating and D20ProbabilityTable are imported from their domain module.


class TestRatingCalculator:
    # Class-scoped so Hypothesis's function_scoped_fixture health check
    # is not triggered by the @given test below.
    @pytest.fixture(scope='class')
    def calculator(self):
        return RatingCalculator(D20ProbabilityTable.standard())

    @pytest.mark.parametrize("avg,expected", [
        (Decimal("0.300"), D20Rating(12)),
        (Decimal("0.250"), D20Rating(8)),
    ])
    def test_contact_rating_calculation(self, calculator, avg, expected):
        result = calculator.calculate_contact_rating(avg, Decimal("0.260"))
        assert result == expected

    @given(
        batting_avg=st.decimals(min_value=0, max_value=1, places=3),
        league_avg=st.decimals(min_value=0.200, max_value=0.300, places=3)
    )
    def test_rating_bounds(self, calculator, batting_avg, league_avg):
        rating = calculator.calculate_contact_rating(batting_avg, league_avg)
        assert 2 <= rating.value <= 20
```

#### **Week 18: Migration & Deployment**

**Deliverables:**
- Data migration scripts
- Deployment automation
- Monitoring and alerting setup
- Documentation and runbooks

---

## Risk Mitigation Strategy

### **Parallel Development Approach**

1. **Keep existing system running** - No disruption to card generation
2. **Side-by-side validation** - Compare outputs between systems
3. **Gradual rollout** - Start with test cardsets, then production
4. **Rollback capability** - Ability to revert if issues arise

### **Data Accuracy Validation**

```python
# scripts/validation/compare_outputs.py
class OutputValidator:
    async def validate_batting_cards(
        self,
        legacy_cards: list[dict],
        new_cards: list[BattingCard]
    ) -> ValidationReport:
        differences = []

        for legacy, new in zip(legacy_cards, new_cards):
            card_diff = self._compare_cards(legacy, new)
            if card_diff.has_significant_differences():
                differences.append(card_diff)

        return ValidationReport(
            total_cards=len(legacy_cards),
            differences=differences,
            accuracy_percentage=self._calculate_accuracy(differences)
        )
```

### **Performance Benchmarking**
- Automated performance tests in CI
- Memory usage monitoring
- Processing time comparisons
- Regression detection

---

## Quality Standards

### **Code Quality**
- **Type Coverage**: 100% (enforced by mypy --strict)
- **Test Coverage**: 90%+ (unit tests), 80%+ (integration)
- **Documentation**: Comprehensive docstrings and API docs
- **Code Review**: Required for all changes

### **Performance Targets**
- **CSV Processing**: 50% faster than current system
- **Memory Usage**: 30% reduction in peak memory
- **API Throughput**: 2x current card creation rate
- **Error Rate**: <0.1% for card generation

### **Operational Excellence**
- **Observability**: Structured logging with metrics
- **Error Handling**: Graceful degradation and recovery
- **Configuration**: Environment-based, validated settings
- **Deployment**: Automated with rollback capability

---

## Success Metrics

### **Technical Metrics**
- **Performance**: >50% improvement in processing speed
- **Reliability**: 99.9% success rate for card generation
- **Maintainability**: Reduced cyclomatic complexity
- **Test Coverage**: >90% across all modules

### **Business Metrics**
- **Data Accuracy**: 100% matching of statistical calculations
- **Feature Parity**: All existing functionality preserved
- **Operational Efficiency**: Reduced manual intervention
- **Developer Productivity**: Faster feature development

---

## Timeline Summary

| Phase | Duration | Focus | Risk Level |
|-------|----------|-------|------------|
| **Foundation** | 4 weeks | Architecture & Infrastructure | Low |
| **Data Layer** | 4 weeks | Ingestion & Processing | Medium |
| **Business Logic** | 4 weeks | Calculations & Rules | High |
| **Integration** | 3 weeks | APIs & Services | Medium |
| **Testing & Deployment** | 3 weeks | Validation & Go-Live | High |
| **Buffer** | 2 weeks | Contingency | - |

**Total: 18-20 weeks**

This greenfield approach will deliver a modern, maintainable, and high-performance system that positions Paper Dynasty for future growth while eliminating the technical debt of the legacy codebase.