# Greenfield Polars Rebuild Plan

## Executive Summary

This plan outlines a complete greenfield rebuild of the Paper Dynasty card-creation system, prioritizing code quality, maintainability, and performance. The new system will be built from the ground up with modern Python practices, a Polars-first architecture, and comprehensive testing.

**Timeline: 18-20 weeks** | **Risk Level: High** | **Expected Performance Gain: 50-70%**

---

## Architecture Design

### **Core Principles**

1. **Separation of Concerns** - Clean boundaries between data, business logic, and presentation
2. **Type Safety** - Full type hints and validation throughout
3. **Lazy Evaluation** - Polars lazy API as the foundation
4. **Testability** - Dependency injection and pure functions
5. **Observability** - Structured logging and metrics from day one
6. **Configuration Management** - Environment-based settings

### **System Architecture**

```
card-creation-v2/
├── src/
│   ├── domain/               # Business logic (pure Python)
│   │   ├── models/           # Data classes and types
│   │   ├── calculations/     # Baseball statistics logic
│   │   └── rules/            # Card generation rules
│   ├── data/                 # Data layer (Polars-optimized)
│   │   ├── sources/          # External data ingestion
│   │   ├── processors/       # Data transformation pipelines
│   │   └── repositories/     # Data access abstractions
│   ├── services/             # Application services
│   │   ├── card_generation/  # Core card creation service
│   │   ├── api_client/       # Paper Dynasty API integration
│   │   └── web_scraper/      # Defensive stats scraping
│   ├── infrastructure/       # External concerns
│   │   ├── config/           # Configuration management
│   │   ├── logging/          # Structured logging
│   │   └── monitoring/       # Metrics and observability
│   └── cli/                  # Command-line interfaces
├── tests/                    # Comprehensive test suite
├── docs/                     # Technical documentation
└── scripts/                  # Deployment and maintenance
```

---

## Phase-by-Phase Implementation

### **Phase 1: Foundation & Architecture (4 weeks)**

#### **Week 1-2: Project Setup & Core Infrastructure**

**Deliverables:**
- Modern Python project structure with Poetry/pip-tools
- Pre-commit hooks (black, ruff, mypy, pytest)
- CI/CD pipeline (GitHub Actions)
- Logging and configuration framework
- Base domain models

**Core Infrastructure:**

```python
# src/domain/models/player.py
from dataclasses import dataclass

# PlayerId, Position, Hand, and Team are domain value types defined
# alongside these models.


@dataclass(frozen=True)
class Player:
    id: PlayerId
    name: str
    positions: list[Position]
    hand: Hand
    team: Team


# src/domain/models/statistics.py
from decimal import Decimal


@dataclass(frozen=True)
class BattingStats:
    plate_appearances: int
    at_bats: int  # needed for AVG; H/PA is not batting average
    hits: int
    doubles: int
    triples: int
    home_runs: int
    walks: int
    strikeouts: int

    @property
    def batting_average(self) -> Decimal:
        return Decimal(self.hits) / Decimal(self.at_bats)
```

#### **Week 3-4: Data Layer Foundation**

**Deliverables:**
- Polars-based data pipeline architecture
- CSV ingestion framework
- Schema validation system
- Basic data transformation utilities

**Data Pipeline Framework:**

```python
# src/data/pipeline.py
import polars as pl


class DataPipeline:
    def __init__(self, lazy_frame: pl.LazyFrame):
        self._lf = lazy_frame

    def filter_minimum_playing_time(self, min_pa: int) -> 'DataPipeline':
        return DataPipeline(
            self._lf.filter(pl.col('PA') >= min_pa)
        )

    def add_calculated_stats(self) -> 'DataPipeline':
        return DataPipeline(
            self._lf.with_columns([
                # AVG = H/AB; OBP here is the simplified (H + BB) / PA
                # (the full formula also needs HBP and SF).
                (pl.col('H') / pl.col('AB')).alias('AVG'),
                ((pl.col('BB') + pl.col('H')) / pl.col('PA')).alias('OBP'),
            ])
        )

    def collect(self) -> pl.DataFrame:
        return self._lf.collect()
```
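To make the intended call pattern concrete, a minimal usage sketch; the CSV path and the `Name` column are illustrative stand-ins, since real sources are only wired up in Phase 2:

```python
import polars as pl

from src.data.pipeline import DataPipeline

# Hypothetical local file; Phase 2 replaces this with the real sources.
stats = (
    DataPipeline(pl.scan_csv('data/hitting_2024.csv'))
    .filter_minimum_playing_time(min_pa=100)
    .add_calculated_stats()
    .collect()  # nothing executes until collect(), per the lazy-evaluation principle
)
print(stats.select(['Name', 'AVG', 'OBP']).head())
```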
---

### **Phase 2: Data Ingestion & Processing (4 weeks)**

#### **Week 5-6: External Data Sources**

**Deliverables:**
- FanGraphs CSV processor with validation
- Baseball Reference integration
- Player ID reconciliation system
- Web scraping framework for defensive stats

**Modern Data Ingestion:**

```python
# src/data/sources/fangraphs.py
import polars as pl

# Hand comes from the domain models; FanGraphsConfig from the
# configuration package.


class FanGraphsProcessor:
    def __init__(self, config: FanGraphsConfig):
        self.config = config

    async def load_hitting_stats(
        self,
        season: int,
        vs_hand: Hand
    ) -> pl.LazyFrame:
        file_path = self._build_file_path(season, 'hitting', vs_hand)
        return self._validate_schema(
            pl.scan_csv(file_path)
            .with_columns([
                pl.col('playerId').cast(pl.Utf8),
                pl.col('PA').cast(pl.Int32),
                # ... other type casting
            ])
            .filter(pl.col('PA') >= self.config.min_plate_appearances)
        )

    def _validate_schema(self, lf: pl.LazyFrame) -> pl.LazyFrame:
        required_columns = ['playerId', 'Name', 'Team', 'PA', 'AB', 'H', 'BB']
        missing = [c for c in required_columns if c not in lf.collect_schema().names()]
        if missing:
            raise ValueError(f'FanGraphs CSV is missing required columns: {missing}')
        return lf
```

#### **Week 7-8: Data Transformation Pipeline**

**Deliverables:**
- Split-hand data merging (vs L/R)
- Player matching across data sources
- Statistical calculation engine
- Data quality validation framework

**Clean Data Processing:**

```python
# src/data/processors/batting_processor.py
import polars as pl


class BattingStatsProcessor:
    def __init__(
        self,
        fangraphs: FanGraphsProcessor,
        bbref: BaseballReferenceProcessor
    ):
        self.fangraphs = fangraphs
        self.bbref = bbref

    async def process_season_stats(
        self,
        season: int,
        cardset_config: CardsetConfig
    ) -> BattingStatsResult:
        # Load data lazily
        vs_left = await self.fangraphs.load_hitting_stats(season, Hand.LEFT)
        vs_right = await self.fangraphs.load_hitting_stats(season, Hand.RIGHT)
        running = await self.bbref.load_baserunning_stats(season)

        # Join and process
        combined = self._join_split_hand_data(vs_left, vs_right)
        with_running = self._add_baserunning_stats(combined, running)
        validated = self._validate_data_quality(with_running)

        return BattingStatsResult(
            dataframe=validated.collect(),
            quality_report=self._generate_quality_report(validated)
        )
```

---

### **Phase 3: Business Logic & Calculations (4 weeks)**

#### **Week 9-10: Statistical Calculations**

**Deliverables:**
- D20 probability calculation engine
- Baseball statistics formulas
- Rating calculation algorithms
- Performance benchmarking framework

**Clean Business Logic:**

```python
# src/domain/calculations/ratings.py
from decimal import Decimal


class RatingCalculator:
    def __init__(self, d20_table: D20ProbabilityTable):
        self.d20_table = d20_table

    def calculate_contact_rating(
        self,
        batting_avg: Decimal,
        league_avg: Decimal
    ) -> D20Rating:
        relative_performance = batting_avg / league_avg
        probability = self._performance_to_probability(relative_performance)
        return self.d20_table.probability_to_rating(probability)

    def calculate_power_rating(
        self,
        isolated_power: Decimal,
        league_avg: Decimal
    ) -> D20Rating:
        # Clean, testable business logic: same shape as the contact
        # rating, normalized against league average.
        relative_performance = isolated_power / league_avg
        probability = self._performance_to_probability(relative_performance)
        return self.d20_table.probability_to_rating(probability)
```
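`D20Rating` and `D20ProbabilityTable` are referenced throughout but not specified in this plan. As a placeholder, here is a minimal sketch assuming a rating of N means "success on a d20 roll of N or less" (probability N/20), with ratings clamped to the 2-20 range that the Phase 5 property-based test asserts; the real table may well be tuned empirically instead:

```python
# src/domain/calculations/d20.py (hypothetical sketch)
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class D20Rating:
    value: int

    def __post_init__(self) -> None:
        # Matches the bounds asserted by test_rating_bounds in Phase 5.
        if not 2 <= self.value <= 20:
            raise ValueError(f'd20 rating out of bounds: {self.value}')


class D20ProbabilityTable:
    @classmethod
    def standard(cls) -> 'D20ProbabilityTable':
        return cls()

    def probability_to_rating(self, probability: Decimal) -> D20Rating:
        # Round to the nearest die face, then clamp into the legal range.
        face = int(round(probability * Decimal(20)))
        return D20Rating(min(20, max(2, face)))
```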
#### **Week 11-12: Card Generation Logic**

**Deliverables:**
- Card creation services
- Rarity assignment algorithms
- Position-specific logic
- Card validation framework

**Service Layer:**

```python
# src/services/card_generation/batting_card_service.py
import polars as pl


class BattingCardService:
    def __init__(
        self,
        rating_calculator: RatingCalculator,
        rarity_calculator: RarityCalculator,
        validator: CardValidator
    ):
        self.rating_calculator = rating_calculator
        self.rarity_calculator = rarity_calculator
        self.validator = validator

    async def generate_cards(
        self,
        batting_stats: pl.DataFrame,
        cardset_config: CardsetConfig
    ) -> list[BattingCard]:
        cards = []
        for player_stats in batting_stats.iter_rows(named=True):
            card = self._create_single_card(player_stats, cardset_config)
            validated_card = await self.validator.validate(card)
            cards.append(validated_card)
        return cards
```

---

### **Phase 4: Integration & APIs (3 weeks)**

#### **Week 13-14: External Service Integration**

**Deliverables:**
- Paper Dynasty API client
- Error handling and retry logic
- Rate limiting and throttling
- API response validation

**Modern API Integration:**

```python
# src/services/api_client/paper_dynasty_client.py
import aiohttp
from tenacity import retry, stop_after_attempt, wait_exponential


class PaperDynastyApiClient:
    def __init__(self, config: ApiConfig, session: aiohttp.ClientSession):
        self.config = config
        self.session = session

    @retry(stop=stop_after_attempt(3), wait=wait_exponential())
    async def create_batting_card(self, card: BattingCard) -> CardCreationResult:
        payload = self._serialize_card(card)

        async with self.session.post(
            f"{self.config.base_url}/battingcards",
            json=payload,
            headers=self._get_auth_headers()
        ) as response:
            response.raise_for_status()
            data = await response.json()
            return CardCreationResult.from_api_response(data)
```

#### **Week 15: Command Line Interfaces**

**Deliverables:**
- Modern CLI with Click/Typer (a minimal sketch follows)
- Configuration validation
- Progress indicators and logging
- Error reporting
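No CLI code is specified yet; as one possible shape, a minimal Typer sketch of the card-generation entry point (the command, option names, and module path are illustrative assumptions):

```python
# src/cli/generate_cards.py (hypothetical entry point)
import logging

import typer

app = typer.Typer(help='Paper Dynasty card-creation tools')


@app.command()
def generate(
    season: int = typer.Argument(..., help='Season to build cards for'),
    cardset: str = typer.Option('default', help='Cardset configuration name'),
    dry_run: bool = typer.Option(False, help='Validate inputs without calling the API'),
) -> None:
    """Generate batting cards for a single season."""
    logging.basicConfig(level=logging.INFO)
    typer.echo(f'Generating {cardset} cards for season {season} (dry_run={dry_run})')
    # Wiring to BattingCardService and PaperDynastyApiClient happens here.


if __name__ == '__main__':
    app()
```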
---

### **Phase 5: Testing & Validation (3 weeks)**

#### **Week 16-17: Comprehensive Testing**

**Testing Strategy:**
- **Unit Tests**: 90%+ coverage on business logic
- **Integration Tests**: End-to-end card generation
- **Performance Tests**: Benchmarking vs current system
- **Property-Based Testing**: Statistical calculation validation

**Test Architecture:**

```python
# tests/domain/test_rating_calculator.py
from decimal import Decimal

import pytest
from hypothesis import given, strategies as st

from src.domain.calculations.ratings import RatingCalculator
# D20Rating and D20ProbabilityTable are imported from their domain module.


class TestRatingCalculator:
    # Class-scoped so Hypothesis's function_scoped_fixture health check
    # is not triggered by the @given test below.
    @pytest.fixture(scope='class')
    def calculator(self):
        return RatingCalculator(D20ProbabilityTable.standard())

    @pytest.mark.parametrize("avg,expected", [
        (Decimal("0.300"), D20Rating(12)),
        (Decimal("0.250"), D20Rating(8)),
    ])
    def test_contact_rating_calculation(self, calculator, avg, expected):
        result = calculator.calculate_contact_rating(avg, Decimal("0.260"))
        assert result == expected

    @given(
        batting_avg=st.decimals(min_value=0, max_value=1, places=3),
        league_avg=st.decimals(min_value=0.200, max_value=0.300, places=3)
    )
    def test_rating_bounds(self, calculator, batting_avg, league_avg):
        rating = calculator.calculate_contact_rating(batting_avg, league_avg)
        assert 2 <= rating.value <= 20
```

#### **Week 18: Migration & Deployment**

**Deliverables:**
- Data migration scripts
- Deployment automation
- Monitoring and alerting setup
- Documentation and runbooks

---

## Risk Mitigation Strategy

### **Parallel Development Approach**

1. **Keep existing system running** - No disruption to card generation
2. **Side-by-side validation** - Compare outputs between systems
3. **Gradual rollout** - Start with test cardsets, then production
4. **Rollback capability** - Ability to revert if issues arise

### **Data Accuracy Validation**

```python
# scripts/validation/compare_outputs.py
class OutputValidator:
    async def validate_batting_cards(
        self,
        legacy_cards: list[dict],
        new_cards: list[BattingCard]
    ) -> ValidationReport:
        differences = []

        for legacy, new in zip(legacy_cards, new_cards):
            card_diff = self._compare_cards(legacy, new)
            if card_diff.has_significant_differences():
                differences.append(card_diff)

        return ValidationReport(
            total_cards=len(legacy_cards),
            differences=differences,
            accuracy_percentage=self._calculate_accuracy(differences)
        )
```

### **Performance Benchmarking**
- Automated performance tests in CI
- Memory usage monitoring
- Processing time comparisons
- Regression detection

---

## Quality Standards

### **Code Quality**
- **Type Coverage**: 100% (enforced by mypy --strict)
- **Test Coverage**: 90%+ (unit tests), 80%+ (integration)
- **Documentation**: Comprehensive docstrings and API docs
- **Code Review**: Required for all changes

### **Performance Targets**
- **CSV Processing**: 50% faster than current system
- **Memory Usage**: 30% reduction in peak memory
- **API Throughput**: 2x current card creation rate
- **Error Rate**: <0.1% for card generation

### **Operational Excellence**
- **Observability**: Structured logging with metrics
- **Error Handling**: Graceful degradation and recovery
- **Configuration**: Environment-based, validated settings
- **Deployment**: Automated with rollback capability

---

## Success Metrics

### **Technical Metrics**
- **Performance**: >50% improvement in processing speed
- **Reliability**: 99.9% success rate for card generation
- **Maintainability**: Reduced cyclomatic complexity
- **Test Coverage**: >90% across all modules

### **Business Metrics**
- **Data Accuracy**: 100% matching of statistical calculations
- **Feature Parity**: All existing functionality preserved
- **Operational Efficiency**: Reduced manual intervention
- **Developer Productivity**: Faster feature development

---

## Timeline Summary

| Phase | Duration | Focus | Risk Level |
|-------|----------|-------|------------|
| **Foundation** | 4 weeks | Architecture & Infrastructure | Low |
| **Data Layer** | 4 weeks | Ingestion & Processing | Medium |
| **Business Logic** | 4 weeks | Calculations & Rules | High |
| **Integration** | 3 weeks | APIs & Services | Medium |
| **Testing & Deployment** | 3 weeks | Validation & Go-Live | High |
| **Buffer** | 2 weeks | Contingency | - |

**Total: 18-20 weeks**

This greenfield approach will deliver a modern, maintainable, and high-performance system that positions Paper Dynasty for future growth while eliminating the technical debt of the legacy codebase.