# Greenfield Polars Rebuild Plan

## Executive Summary

This plan outlines a complete greenfield rebuild of the Paper Dynasty card-creation system, prioritizing code quality, maintainability, and performance. The new system will be built from the ground up with modern Python practices, a Polars-first architecture, and comprehensive testing.

**Timeline:** 18-20 weeks | **Risk Level:** High | **Expected Performance Gain:** 50-70%

## Architecture Design

### Core Principles
- **Separation of Concerns** - Clean boundaries between data, business logic, and presentation
- **Type Safety** - Full type hints and validation throughout
- **Lazy Evaluation** - The Polars lazy API as the foundation
- **Testability** - Dependency injection and pure functions
- **Observability** - Structured logging and metrics from day one
- **Configuration Management** - Environment-based settings (see the sketch below)
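
As a minimal sketch of what environment-based settings could look like, assuming pydantic-settings is the chosen library (the class, field names, and defaults here are illustrative, not part of any existing codebase):

```python
# src/infrastructure/config/settings.py (illustrative sketch)
from pydantic_settings import BaseSettings, SettingsConfigDict


class Settings(BaseSettings):
    """Settings loaded from the environment (PD_API_BASE_URL, PD_LOG_LEVEL, ...)."""

    model_config = SettingsConfigDict(env_prefix="PD_", env_file=".env")

    api_base_url: str = "https://api.example.com"  # placeholder, not the real endpoint
    min_plate_appearances: int = 50
    log_level: str = "INFO"


# Instantiating validates every field, so a malformed PD_MIN_PLATE_APPEARANCES
# fails loudly at startup rather than deep inside a pipeline run.
settings = Settings()
```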
### System Architecture

```
card-creation-v2/
├── src/
│   ├── domain/              # Business logic (pure Python)
│   │   ├── models/          # Data classes and types
│   │   ├── calculations/    # Baseball statistics logic
│   │   └── rules/           # Card generation rules
│   ├── data/                # Data layer (Polars-optimized)
│   │   ├── sources/         # External data ingestion
│   │   ├── processors/      # Data transformation pipelines
│   │   └── repositories/    # Data access abstractions
│   ├── services/            # Application services
│   │   ├── card_generation/ # Core card creation service
│   │   ├── api_client/      # Paper Dynasty API integration
│   │   └── web_scraper/     # Defensive stats scraping
│   ├── infrastructure/      # External concerns
│   │   ├── config/          # Configuration management
│   │   ├── logging/         # Structured logging
│   │   └── monitoring/      # Metrics and observability
│   └── cli/                 # Command-line interfaces
├── tests/                   # Comprehensive test suite
├── docs/                    # Technical documentation
└── scripts/                 # Deployment and maintenance
```
## Phase-by-Phase Implementation

### Phase 1: Foundation & Architecture (4 weeks)

#### Week 1-2: Project Setup & Core Infrastructure

**Deliverables:**

- Modern Python project structure with Poetry/pip-tools
- Pre-commit hooks (black, ruff, mypy, pytest)
- CI/CD pipeline (GitHub Actions)
- Logging and configuration framework (see the logging sketch below)
- Base domain models
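
The logging framework could be as small as a one-time structlog configuration; this sketch assumes structlog is the chosen library (the plan itself only commits to "structured logging"):

```python
# src/infrastructure/logging/setup.py (sketch; assumes structlog is chosen)
import logging

import structlog


def configure_logging(level: str = "INFO") -> None:
    """Emit JSON log lines so downstream tooling can parse them."""
    logging.basicConfig(level=level)
    structlog.configure(
        processors=[
            structlog.processors.add_log_level,
            structlog.processors.TimeStamper(fmt="iso"),
            structlog.processors.JSONRenderer(),
        ]
    )


# Usage: log = structlog.get_logger(); log.info("cards_generated", count=120)
```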
**Core Infrastructure:**

```python
# src/domain/models/player.py
from dataclasses import dataclass

# PlayerId, Position, Hand, and Team are value types defined alongside these models.


@dataclass(frozen=True)
class Player:
    id: PlayerId
    name: str
    positions: list[Position]
    hand: Hand
    team: Team


# src/domain/models/statistics.py
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class BattingStats:
    plate_appearances: int
    at_bats: int
    hits: int
    doubles: int
    triples: int
    home_runs: int
    walks: int
    strikeouts: int

    @property
    def batting_average(self) -> Decimal:
        # AVG = H / AB (not H / PA, which would fold walks into the denominator)
        return Decimal(self.hits) / Decimal(self.at_bats)
```
#### Week 3-4: Data Layer Foundation

**Deliverables:**

- Polars-based data pipeline architecture
- CSV ingestion framework
- Schema validation system
- Basic data transformation utilities
**Data Pipeline Framework:**

```python
# src/data/pipeline.py
import polars as pl


class DataPipeline:
    def __init__(self, lazy_frame: pl.LazyFrame):
        self._lf = lazy_frame

    def filter_minimum_playing_time(self, min_pa: int) -> 'DataPipeline':
        return DataPipeline(self._lf.filter(pl.col('PA') >= min_pa))

    def add_calculated_stats(self) -> 'DataPipeline':
        return DataPipeline(
            self._lf.with_columns([
                # AVG = H / AB; assumes the source CSV includes an AB column
                (pl.col('H') / pl.col('AB')).alias('AVG'),
                # Parenthesize before aliasing; simplified OBP (ignores HBP and SF)
                ((pl.col('BB') + pl.col('H')) / pl.col('PA')).alias('OBP'),
            ])
        )

    def collect(self) -> pl.DataFrame:
        return self._lf.collect()
```
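
A hypothetical end-to-end use of the pipeline, chaining lazily and collecting once at the end (the file name is illustrative):

```python
import polars as pl

from src.data.pipeline import DataPipeline

# Nothing is read or computed until .collect(); Polars fuses the steps into one query.
stats = (
    DataPipeline(pl.scan_csv("hitting_vs_left.csv"))  # illustrative file name
    .filter_minimum_playing_time(min_pa=50)
    .add_calculated_stats()
    .collect()
)
```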
### Phase 2: Data Ingestion & Processing (4 weeks)

#### Week 5-6: External Data Sources

**Deliverables:**

- FanGraphs CSV processor with validation
- Baseball Reference integration
- Player ID reconciliation system (sketched below)
- Web scraping framework for defensive stats
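
Reconciling FanGraphs and Baseball Reference player IDs could be a single lazy Polars join against an ID crosswalk; this sketch assumes a crosswalk file with `fangraphs_id`/`bbref_id` columns (the column names and path are assumptions, not an existing artifact):

```python
# Sketch: reconcile player IDs across sources via a crosswalk file.
# Column names (fangraphs_id, bbref_id) and the file path are assumptions.
import polars as pl


def reconcile_ids(fangraphs: pl.LazyFrame, crosswalk_path: str) -> pl.LazyFrame:
    crosswalk = pl.scan_csv(crosswalk_path).select(['fangraphs_id', 'bbref_id'])
    return fangraphs.join(
        crosswalk,
        left_on='playerId',
        right_on='fangraphs_id',
        how='left',  # keep players even when no BBRef mapping exists yet
    )
```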
**Modern Data Ingestion:**

```python
# src/data/sources/fangraphs.py
import polars as pl

from src.domain.models.player import Hand


class FanGraphsProcessor:
    def __init__(self, config: FanGraphsConfig):
        self.config = config

    async def load_hitting_stats(
        self,
        season: int,
        vs_hand: Hand
    ) -> pl.LazyFrame:
        file_path = self._build_file_path(season, 'hitting', vs_hand)
        return (
            self._validate_schema(pl.scan_csv(file_path))
            .with_columns([
                pl.col('playerId').cast(pl.Utf8),
                pl.col('PA').cast(pl.Int32),
                # ... other type casting
            ])
            .filter(pl.col('PA') >= self.config.min_plate_appearances)
        )

    def _validate_schema(self, lf: pl.LazyFrame) -> pl.LazyFrame:
        required_columns = {'playerId', 'Name', 'Team', 'PA', 'H', 'BB'}
        missing = required_columns - set(lf.collect_schema().names())
        if missing:
            raise ValueError(f"FanGraphs export is missing columns: {sorted(missing)}")
        return lf
```
#### Week 7-8: Data Transformation Pipeline

**Deliverables:**

- Split-hand data merging (vs L/R; see the join sketch below)
- Player matching across data sources
- Statistical calculation engine
- Data quality validation framework
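
The split-hand merge can be a single lazy join keyed on player ID, suffixing the vs-right columns; the `_vs_R` suffix convention here is an assumption, not an existing standard:

```python
# Sketch: merge vs-LHP and vs-RHP splits into one row per player.
import polars as pl


def join_split_hand_data(
    vs_left: pl.LazyFrame, vs_right: pl.LazyFrame
) -> pl.LazyFrame:
    return vs_left.join(
        vs_right,
        on='playerId',
        how='inner',     # only players with both splits produce a card
        suffix='_vs_R',  # disambiguates overlapping stat columns from the right frame
    )
```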
**Clean Data Processing:**

```python
# src/data/processors/batting_processor.py
# (module paths below follow the target layout shown earlier)
from src.data.sources.bbref import BaseballReferenceProcessor
from src.data.sources.fangraphs import FanGraphsProcessor
from src.domain.models.player import Hand


class BattingStatsProcessor:
    def __init__(
        self,
        fangraphs: FanGraphsProcessor,
        bbref: BaseballReferenceProcessor
    ):
        self.fangraphs = fangraphs
        self.bbref = bbref

    async def process_season_stats(
        self,
        season: int,
        cardset_config: CardsetConfig
    ) -> BattingStatsResult:
        # Load data lazily
        vs_left = await self.fangraphs.load_hitting_stats(season, Hand.LEFT)
        vs_right = await self.fangraphs.load_hitting_stats(season, Hand.RIGHT)
        running = await self.bbref.load_baserunning_stats(season)

        # Join and process
        combined = self._join_split_hand_data(vs_left, vs_right)
        with_running = self._add_baserunning_stats(combined, running)
        validated = self._validate_data_quality(with_running)

        return BattingStatsResult(
            dataframe=validated.collect(),
            quality_report=self._generate_quality_report(validated)
        )
```
### Phase 3: Business Logic & Calculations (4 weeks)

#### Week 9-10: Statistical Calculations

**Deliverables:**

- D20 probability calculation engine (see the table sketch below)
- Baseball statistics formulas
- Rating calculation algorithms
- Performance benchmarking framework
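
One possible shape for the D20 engine consumed by the `RatingCalculator` below; the linear probability-to-rating mapping is purely illustrative, and the real table would be derived from the game's rules:

```python
# Sketch of a D20 probability table; the mapping itself is illustrative.
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class D20Rating:
    value: int  # legal ratings span 2..20


class D20ProbabilityTable:
    @classmethod
    def standard(cls) -> 'D20ProbabilityTable':
        return cls()

    def probability_to_rating(self, probability: Decimal) -> D20Rating:
        # Each d20 face covers 1/20 of the probability mass; clamp into 2..20.
        raw = int(probability * 20)
        return D20Rating(value=max(2, min(20, raw)))
```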
**Clean Business Logic:**

```python
# src/domain/calculations/ratings.py
from decimal import Decimal


class RatingCalculator:
    def __init__(self, d20_table: D20ProbabilityTable):
        self.d20_table = d20_table

    def calculate_contact_rating(
        self,
        batting_avg: Decimal,
        league_avg: Decimal
    ) -> D20Rating:
        relative_performance = batting_avg / league_avg
        probability = self._performance_to_probability(relative_performance)
        return self.d20_table.probability_to_rating(probability)

    def calculate_power_rating(
        self,
        isolated_power: Decimal,
        league_avg: Decimal
    ) -> D20Rating:
        # Mirrors the contact calculation: normalize against the league,
        # then map relative performance onto the shared D20 table.
        relative_performance = isolated_power / league_avg
        probability = self._performance_to_probability(relative_performance)
        return self.d20_table.probability_to_rating(probability)
```
#### Week 11-12: Card Generation Logic

**Deliverables:**

- Card creation services
- Rarity assignment algorithms (sketched below)
- Position-specific logic
- Card validation framework
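
Rarity assignment could rank each card's composite rating against percentile thresholds; the tier names and cutoffs below are placeholders, not the game's actual distribution:

```python
# Sketch: percentile-based rarity assignment. Tiers and cutoffs are placeholders.
from decimal import Decimal


class RarityCalculator:
    # (percentile floor, tier) pairs, checked from rarest to most common
    TIERS = [
        (Decimal('0.99'), 'legendary'),
        (Decimal('0.90'), 'epic'),
        (Decimal('0.70'), 'rare'),
        (Decimal('0.00'), 'common'),
    ]

    def assign(self, percentile: Decimal) -> str:
        for floor, tier in self.TIERS:
            if percentile >= floor:
                return tier
        return 'common'
```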
**Service Layer:**

```python
# src/services/card_generation/batting_card_service.py
import polars as pl


class BattingCardService:
    def __init__(
        self,
        rating_calculator: RatingCalculator,
        rarity_calculator: RarityCalculator,
        validator: CardValidator
    ):
        self.rating_calculator = rating_calculator
        self.rarity_calculator = rarity_calculator
        self.validator = validator

    async def generate_cards(
        self,
        batting_stats: pl.DataFrame,
        cardset_config: CardsetConfig
    ) -> list[BattingCard]:
        cards = []
        for player_stats in batting_stats.iter_rows(named=True):
            card = self._create_single_card(player_stats, cardset_config)
            validated_card = await self.validator.validate(card)
            cards.append(validated_card)
        return cards
```
### Phase 4: Integration & APIs (3 weeks)

#### Week 13-14: External Service Integration

**Deliverables:**

- Paper Dynasty API client
- Error handling and retry logic
- Rate limiting and throttling (see the limiter sketch below)
- API response validation
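
Throttling can be as simple as a semaphore plus a minimum spacing between requests; the limits below are placeholders to be tuned against the API's real quotas:

```python
# Sketch: concurrency cap plus minimum spacing between requests.
# The limits (5 concurrent, 10 req/s) are placeholders, not known API quotas.
import asyncio
import time


class RateLimiter:
    def __init__(self, max_concurrent: int = 5, requests_per_second: float = 10.0):
        self._semaphore = asyncio.Semaphore(max_concurrent)
        self._min_interval = 1.0 / requests_per_second
        self._last_request = 0.0
        self._lock = asyncio.Lock()

    async def __aenter__(self):
        await self._semaphore.acquire()
        async with self._lock:
            wait = self._min_interval - (time.monotonic() - self._last_request)
            if wait > 0:
                await asyncio.sleep(wait)
            self._last_request = time.monotonic()

    async def __aexit__(self, *exc):
        self._semaphore.release()
```

A call site would wrap each POST in `async with limiter:` so retries and bursts stay within the quota.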
**Modern API Integration:**

```python
# src/services/api_client/paper_dynasty_client.py
import aiohttp
from tenacity import retry, stop_after_attempt, wait_exponential


class PaperDynastyApiClient:
    def __init__(self, config: ApiConfig, session: aiohttp.ClientSession):
        self.config = config
        self.session = session

    @retry(stop=stop_after_attempt(3), wait=wait_exponential())
    async def create_batting_card(self, card: BattingCard) -> CardCreationResult:
        payload = self._serialize_card(card)
        async with self.session.post(
            f"{self.config.base_url}/battingcards",
            json=payload,
            headers=self._get_auth_headers()
        ) as response:
            response.raise_for_status()
            data = await response.json()
            return CardCreationResult.from_api_response(data)
```
#### Week 15: Command Line Interfaces

**Deliverables:**

- Modern CLI with Click/Typer (see the sketch below)
- Configuration validation
- Progress indicators and logging
- Error reporting
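
Assuming Typer is the pick between the two candidates, a card-generation entry point could look like this (the command and option names are illustrative):

```python
# Sketch of a Typer CLI; command and option names are illustrative.
import typer

app = typer.Typer(help="Paper Dynasty card-creation tools")


@app.command()
def generate(
    season: int = typer.Argument(..., help="Season to generate cards for"),
    min_pa: int = typer.Option(50, help="Minimum plate appearances"),
    dry_run: bool = typer.Option(False, help="Build cards without calling the API"),
) -> None:
    """Generate batting cards for one season."""
    typer.echo(f"Generating {season} cards (min PA {min_pa}, dry_run={dry_run})")
    # ... wire up settings, pipeline, and services here


if __name__ == "__main__":
    app()
```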
### Phase 5: Testing & Validation (3 weeks)

#### Week 16-17: Comprehensive Testing

**Testing Strategy:**

- **Unit Tests:** 90%+ coverage on business logic
- **Integration Tests:** End-to-end card generation
- **Performance Tests:** Benchmarking vs the current system
- **Property-Based Testing:** Statistical calculation validation

**Test Architecture:**
```python
# tests/domain/test_rating_calculator.py
from decimal import Decimal

import pytest
from hypothesis import given
from hypothesis import strategies as st

from src.domain.calculations.ratings import (
    D20ProbabilityTable,
    D20Rating,
    RatingCalculator,
)


class TestRatingCalculator:
    @pytest.fixture
    def calculator(self):
        return RatingCalculator(D20ProbabilityTable.standard())

    @pytest.mark.parametrize("avg,expected", [
        (Decimal("0.300"), D20Rating(12)),
        (Decimal("0.250"), D20Rating(8)),
    ])
    def test_contact_rating_calculation(self, calculator, avg, expected):
        result = calculator.calculate_contact_rating(avg, Decimal("0.260"))
        assert result == expected

    @given(
        batting_avg=st.decimals(min_value=0, max_value=1, places=3),
        league_avg=st.decimals(min_value="0.200", max_value="0.300", places=3)
    )
    def test_rating_bounds(self, calculator, batting_avg, league_avg):
        rating = calculator.calculate_contact_rating(batting_avg, league_avg)
        assert 2 <= rating.value <= 20
```
#### Week 18: Migration & Deployment

**Deliverables:**

- Data migration scripts
- Deployment automation
- Monitoring and alerting setup
- Documentation and runbooks
## Risk Mitigation Strategy

### Parallel Development Approach

- **Keep the existing system running** - No disruption to card generation
- **Side-by-side validation** - Compare outputs between systems
- **Gradual rollout** - Start with test cardsets, then production
- **Rollback capability** - Ability to revert if issues arise

### Data Accuracy Validation
```python
# scripts/validation/compare_outputs.py
class OutputValidator:
    async def validate_batting_cards(
        self,
        legacy_cards: list[dict],
        new_cards: list[BattingCard]
    ) -> ValidationReport:
        differences = []
        # Assumes both lists are sorted identically (e.g., by player ID)
        # so that zip pairs each player's legacy and new cards.
        for legacy, new in zip(legacy_cards, new_cards):
            card_diff = self._compare_cards(legacy, new)
            if card_diff.has_significant_differences():
                differences.append(card_diff)
        return ValidationReport(
            total_cards=len(legacy_cards),
            differences=differences,
            accuracy_percentage=self._calculate_accuracy(differences)
        )
```
### Performance Benchmarking

- Automated performance tests in CI (see the benchmark sketch below)
- Memory usage monitoring
- Processing time comparisons
- Regression detection
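
One way to keep these checks automated is pytest-benchmark, comparing saved runs in CI via `--benchmark-autosave` and `--benchmark-compare-fail`; the fixture file path here is an assumption:

```python
# tests/performance/test_pipeline_benchmark.py (sketch; fixture path is an assumption)
import polars as pl

from src.data.pipeline import DataPipeline


def test_batting_pipeline_speed(benchmark):
    # pytest-benchmark calls the function repeatedly and records timing stats;
    # CI compares against saved baselines to catch regressions.
    def run() -> pl.DataFrame:
        return (
            DataPipeline(pl.scan_csv("tests/fixtures/hitting_sample.csv"))
            .filter_minimum_playing_time(min_pa=50)
            .add_calculated_stats()
            .collect()
        )

    result = benchmark(run)
    assert result.height > 0
```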
## Quality Standards

### Code Quality

- **Type Coverage:** 100% (enforced by mypy --strict)
- **Test Coverage:** 90%+ (unit tests), 80%+ (integration)
- **Documentation:** Comprehensive docstrings and API docs
- **Code Review:** Required for all changes

### Performance Targets

- **CSV Processing:** 50% faster than the current system
- **Memory Usage:** 30% reduction in peak memory
- **API Throughput:** 2x the current card creation rate
- **Error Rate:** <0.1% for card generation

### Operational Excellence

- **Observability:** Structured logging with metrics
- **Error Handling:** Graceful degradation and recovery
- **Configuration:** Environment-based, validated settings
- **Deployment:** Automated with rollback capability
## Success Metrics

### Technical Metrics

- **Performance:** >50% improvement in processing speed
- **Reliability:** 99.9% success rate for card generation
- **Maintainability:** Reduced cyclomatic complexity
- **Test Coverage:** >90% across all modules

### Business Metrics

- **Data Accuracy:** 100% match on statistical calculations
- **Feature Parity:** All existing functionality preserved
- **Operational Efficiency:** Reduced manual intervention
- **Developer Productivity:** Faster feature development
## Timeline Summary
| Phase | Duration | Focus | Risk Level |
|---|---|---|---|
| Foundation | 4 weeks | Architecture & Infrastructure | Low |
| Data Layer | 4 weeks | Ingestion & Processing | Medium |
| Business Logic | 4 weeks | Calculations & Rules | High |
| Integration | 3 weeks | APIs & Services | Medium |
| Testing & Deployment | 3 weeks | Validation & Go-Live | High |
| Buffer | 2 weeks | Contingency | - |
**Total: 18-20 weeks** (18 weeks of planned work, plus the 2-week contingency buffer)
This greenfield approach will deliver a modern, maintainable, and high-performance system that positions Paper Dynasty for future growth while eliminating technical debt from the legacy codebase.