major-domo-database/POSTGRESQL_MIGRATION_DATA_INTEGRITY_ISSUE.md
2025-08-20 09:52:46 -05:00

# PostgreSQL Migration Data Integrity Issue - Critical Bug Report
## Issue Summary
**Critical data corruption discovered in PostgreSQL database migration**: Player IDs were not preserved during SQLite-to-PostgreSQL migration, causing systematic misalignment between player identities and their associated game statistics.
**Date Discovered**: August 19, 2025
**Severity**: Critical - All player-based statistics queries return incorrect results
**Status**: Identified, Root Cause Confirmed, Awaiting Fix
## Symptoms Observed
### Initial Problem
- API endpoint `http://localhost:801/api/v3/plays/batting?season=10&group_by=playerteam&limit=10&obc=111&sort=repri-desc` returned suspicious results
- Player ID 9916 appeared as "Trevor Williams (SP)" with high batting performance in bases-loaded situations
- This was anomalous because starting pitchers shouldn't be top batting performers
### Comparison with Source Data
**Correct SQLite API Response** (`https://sba.manticorum.com/api/v3/plays/batting?season=10&group_by=playerteam&limit=10&obc=111&sort=repri-desc`):
- Player ID 9916: **Marcell Ozuna (LF)** - 8.096 RE24
- Top performer: **Michael Harris (CF, ID 9958)** - 8.662 RE24
**Incorrect PostgreSQL API Response** (same endpoint on localhost:801):
- Player ID 9916: **Trevor Williams (SP)** - 8.096 RE24
- The correct top performer, Michael Harris, is missing from the results entirely
## Root Cause Analysis
### Database Investigation Results
#### Player ID Mapping Corruption
**SQLite Database (Correct)**:
```
ID 9916: Marcell Ozuna (LF)
ID 9958: Michael Harris (CF)
```
**PostgreSQL Database (Incorrect)**:
```
ID 9916: Trevor Williams (SP)
ID 9958: Xavier Edwards (2B)
```
#### Primary Key Assignment Issue
**SQLite Database Structure**:
- Player IDs: Range from ~1 to 12000+ with gaps (due to historical deletions)
- Example high IDs: 9346, 9347, 9348, 9349, 9350
- Preserves original IDs with gaps intact
**PostgreSQL Database Structure**:
- Player IDs: Sequential from 1 to 12232 with NO gaps
- Total players: 12,232
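The contrast between a gapped ID range and a perfectly sequential one can be checked mechanically. The following is a minimal sketch, not code from the project: `is_dense_sequence` and `find_gaps` are hypothetical helper names, and the sample IDs are illustrative rather than taken from either database.

```python
def is_dense_sequence(ids):
    """Return True if ids are exactly 1..N with no gaps or duplicates."""
    ids = sorted(ids)
    return bool(ids) and ids == list(range(1, len(ids) + 1))


def find_gaps(ids):
    """Return the integers missing between min(ids) and max(ids)."""
    present = set(ids)
    return [i for i in range(min(present), max(present) + 1) if i not in present]


# Illustrative data only: a source-like ID set with gaps, and a renumbered copy.
source_like = [1, 2, 5, 6, 9]
renumbered = list(range(1, len(source_like) + 1))

print(is_dense_sequence(source_like))  # False
print(is_dense_sequence(renumbered))   # True
print(find_gaps(source_like))          # [3, 4, 7, 8]
```

A dense 1..N range in the target, when the source is known to have gaps, is a strong hint that keys were re-assigned rather than copied.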
#### Migration Logic Flaw
The migration process failed to preserve original SQLite primary keys:
1. **SQLite**: Marcell Ozuna had ID 9916 (with gaps in sequence)
2. **Migration**: PostgreSQL auto-assigned new sequential IDs starting from 1
3. **Result**: Marcell Ozuna received new ID 9658, while Trevor Williams was assigned ID 9916
4. **Impact**: All `stratplay` records still reference original IDs, but those IDs now point to different players
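The four steps above can be reproduced in miniature. This sketch only simulates the suspected behavior with plain Python data; `reassign_sequential_ids` and the player names are hypothetical, and the real migration script may differ.

```python
def reassign_sequential_ids(source_rows):
    """Simulate inserting rows without their IDs: the target numbers them 1..N.

    source_rows is a list of (original_id, name) tuples in insertion order.
    Returns a dict mapping new_id -> name.
    """
    return {new_id: name for new_id, (_, name) in enumerate(source_rows, start=1)}


# Hypothetical miniature player table: original IDs 3 and 6 were deleted.
source_rows = [
    (1, "Player A"),
    (2, "Player B"),
    (4, "Player C"),
    (5, "Player D"),
    (7, "Player E"),
]
source_by_id = dict(source_rows)
target_by_id = reassign_sequential_ids(source_rows)

# A play record that still stores the original foreign key 4:
batter_id = 4
print(source_by_id[batter_id])  # Player C
print(target_by_id[batter_id])  # Player D
```

Every ID after the first gap resolves to a different player in the target, which matches the pattern observed for ID 9916.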
### Evidence of Systematic Corruption
#### Multiple Season Data
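The `player` table holds a separate row for each player in each season, so a name lookup returns one ID per season. In PostgreSQL, the Season 10 row for Marcell Ozuna carries ID 9658 instead of the original 9916: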
```sql
SELECT id, name, season FROM player WHERE name = 'Marcell Ozuna';
```
Results:
```
621 | Marcell Ozuna | Season 1
1627 | Marcell Ozuna | Season 2
2529 | Marcell Ozuna | Season 3
...
9658 | Marcell Ozuna | Season 10 <- Should be ID 9916
```
#### Verification Query
```sql
-- PostgreSQL shows wrong player for ID 9916
SELECT id, name, pos_1 FROM player WHERE id = 9916;
-- Result: 9916 | Trevor Williams | SP
-- SQLite API shows correct player for ID 9916 (run from a shell, not psql):
--   curl "https://sba.manticorum.com/api/v3/players/9916"
-- Result: {"id": 9916, "name": "Marcell Ozuna", "pos_1": "LF"}
```
## Technical Impact
### Affected Systems
- **All player-based statistics queries** return incorrect results
- **Batting statistics API** (`/api/v3/plays/batting`)
- **Pitching statistics API** (`/api/v3/plays/pitching`)
- **Fielding statistics API** (`/api/v3/plays/fielding`)
- **Player lookup endpoints** (`/api/v3/players/{id}`)
- **Any endpoint that joins `stratplay` with `player` tables**
### Data Integrity Scope
- **stratplay table**: Contains ~48,000 records with original SQLite player IDs
- **player table**: Contains remapped IDs that don't match stratplay references
- **Foreign key relationships**: Broken between stratplay.batter_id and player.id; the references still resolve, but to the wrong players
### Related Issues Fixed During Investigation
1. **PostgreSQL GROUP BY Error**: Fixed a SQL query that selected `game_id` without including it in the GROUP BY clause
2. **ORDER BY Conflicts**: Removed `StratPlay.id` ordering from grouped queries to prevent PostgreSQL GROUP BY violations
## Reproduction Steps
1. **Query PostgreSQL database**:
```bash
curl "http://localhost:801/api/v3/plays/batting?season=10&group_by=playerteam&limit=10&obc=111&sort=repri-desc"
```
2. **Query SQLite database** (correct source):
```bash
curl "https://sba.manticorum.com/api/v3/plays/batting?season=10&group_by=playerteam&limit=10&obc=111&sort=repri-desc"
```
3. **Compare results**: Player names and statistics will be misaligned
4. **Verify specific player**:
```bash
# PostgreSQL (wrong)
curl "http://localhost:801/api/v3/players/9916"
# SQLite (correct)
curl "https://sba.manticorum.com/api/v3/players/9916"
```
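The per-player check in step 4 can also be scripted. The sketch below assumes both APIs return a JSON object shaped like the `/players/9916` response quoted earlier (`id`, `name`, `pos_1`); the helper names are hypothetical, and the fetch functions perform live network requests when called.

```python
import json
from urllib.request import urlopen

# Base URLs taken from the reproduction steps above.
SQLITE_API = "https://sba.manticorum.com/api/v3"
POSTGRES_API = "http://localhost:801/api/v3"


def fetch_player(base_url, player_id):
    """Fetch one player record as a dict (performs a network request)."""
    with urlopen(f"{base_url}/players/{player_id}", timeout=10) as resp:
        return json.load(resp)


def diff_player(expected, actual, fields=("id", "name", "pos_1")):
    """Return {field: (expected, actual)} for every field that differs."""
    return {
        field: (expected.get(field), actual.get(field))
        for field in fields
        if expected.get(field) != actual.get(field)
    }


def compare_player(player_id):
    """Compare the same player ID across the SQLite and PostgreSQL APIs."""
    return diff_player(
        fetch_player(SQLITE_API, player_id),
        fetch_player(POSTGRES_API, player_id),
    )
```

Calling `compare_player(9916)` against the two environments described here would be expected to report differing `name` and `pos_1` values; an empty dict means the records agree on the compared fields.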
## Migration Script Issue
### Current Problematic Behavior
The migration script appears to:
1. Extract player data from SQLite
2. Insert into PostgreSQL without preserving original IDs
3. Allow PostgreSQL to auto-assign sequential primary keys
4. Migrate stratplay data with original foreign key references
### Required Fix
The migration script must:
1. **Preserve original SQLite primary keys** during player table migration
2. **Explicitly set ID values** during INSERT operations
3. **Adjust PostgreSQL sequence** to start after the highest migrated ID
4. **Validate foreign key integrity** post-migration
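For the validation step, one option is to compare the `player` table row by row between source and target. This is a sketch under assumptions: `find_id_mismatches` and `make_demo_db` are hypothetical names, only the `id`, `name`, and `season` columns are compared, and the demo uses two in-memory SQLite databases as stand-ins (Trevor Williams's season value is assumed to be 10). The comparison function only relies on the standard DB-API `cursor()`/`execute()`/`fetchall()` calls, so it should also accept a PostgreSQL connection.

```python
import sqlite3


def find_id_mismatches(source_conn, target_conn):
    """Compare id -> (name, season) between two DB-API connections.

    Returns a list of (id, source_row, target_row), sorted by id, for every
    id whose row differs. A row missing on one side is reported as None.
    """
    query = "SELECT id, name, season FROM player"

    def load(conn):
        cur = conn.cursor()
        cur.execute(query)
        return {row[0]: (row[1], row[2]) for row in cur.fetchall()}

    source, target = load(source_conn), load(target_conn)
    return [
        (pid, source.get(pid), target.get(pid))
        for pid in sorted(source.keys() | target.keys())
        if source.get(pid) != target.get(pid)
    ]


def make_demo_db(rows):
    """Build an in-memory SQLite database with a minimal player table (demo only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE player (id INTEGER PRIMARY KEY, name TEXT, season INTEGER)")
    conn.executemany("INSERT INTO player VALUES (?, ?, ?)", rows)
    return conn


# Stand-in data based on the IDs quoted in this report.
source = make_demo_db([(9916, "Marcell Ozuna", 10)])
target = make_demo_db([(9658, "Marcell Ozuna", 10), (9916, "Trevor Williams", 10)])

for pid, src_row, tgt_row in find_id_mismatches(source, target):
    print(pid, src_row, tgt_row)
```

An empty result is the pass condition. Comparing row contents matters here because a plain foreign key check can succeed even when an ID points at the wrong player.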
### Example Corrected Migration Logic
```python
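# Assumes `cursor` is an open DB-API cursor with %s placeholders (psycopg2-style)
# and `player` is one row read from the SQLite source.
# Instead of letting PostgreSQL assign the ID:
#   cursor.execute("INSERT INTO player (name, pos_1, season) VALUES (%s, %s, %s)",
#                  (player.name, player.pos_1, player.season))
# Insert each row with its original SQLite ID:
cursor.execute(
    "INSERT INTO player (id, name, pos_1, season) VALUES (%s, %s, %s, %s)",
    (player.id, player.name, player.pos_1, player.season),
)
# After all rows are inserted, set the sequence to the highest migrated ID
# so the next auto-assigned ID does not collide:
cursor.execute("SELECT setval('player_id_seq', (SELECT MAX(id) FROM player));")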
```
## Database Environment Details
### PostgreSQL Setup
- **Container**: sba_postgres
- **Database**: sba_master
- **User**: sba_admin
- **Port**: 5432
- **Version**: PostgreSQL 16-alpine
### SQLite Source
- **API Endpoint**: https://sba.manticorum.com/api/v3/
- **Database Files**: `sba_master.db`, `pd_master.db`
- **Status**: Confirmed working and accurate
## Immediate Recommendations
### Priority 1: Stop Using PostgreSQL Database
- **All production queries should use SQLite API** until this is fixed
- **PostgreSQL database results are completely unreliable** for player statistics
### Priority 2: Fix Migration Script
- **Identify migration script location** (likely `migrate_to_postgres.py`)
- **Modify to preserve primary keys** from SQLite source
- **Add validation checks** for foreign key integrity
### Priority 3: Re-run Complete Migration
- **Drop and recreate PostgreSQL database**
- **Run corrected migration script**
- **Validate sample queries** against SQLite source before declaring fixed
### Priority 4: Add Data Validation Tests
- **Create automated tests** comparing PostgreSQL vs SQLite query results
- **Add foreign key constraint validation**
- **Implement post-migration data integrity checks**
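As one possible starting point for the foreign key validation, the sketch below looks for `stratplay.batter_id` values with no matching `player` row. The helper names are hypothetical, and the demo tables are minimal in-memory SQLite stand-ins containing only the columns the query needs.

```python
import sqlite3

ORPHAN_QUERY = """
    SELECT DISTINCT sp.batter_id
    FROM stratplay AS sp
    LEFT JOIN player AS p ON p.id = sp.batter_id
    WHERE sp.batter_id IS NOT NULL AND p.id IS NULL
    ORDER BY sp.batter_id
"""


def find_orphan_batter_ids(conn):
    """Return batter_id values in stratplay that have no matching player row."""
    cur = conn.cursor()
    cur.execute(ORPHAN_QUERY)
    return [row[0] for row in cur.fetchall()]


def assert_no_orphan_batters(conn):
    """Raise AssertionError listing up to ten orphaned batter IDs."""
    orphans = find_orphan_batter_ids(conn)
    assert not orphans, f"stratplay.batter_id values missing from player: {orphans[:10]}"


# Demo with stand-in tables: batter_id 5 has no player row.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE player (id INTEGER PRIMARY KEY, name TEXT)")
demo.execute("CREATE TABLE stratplay (id INTEGER PRIMARY KEY, batter_id INTEGER)")
demo.executemany("INSERT INTO player VALUES (?, ?)", [(1, "Player A"), (2, "Player B")])
demo.executemany(
    "INSERT INTO stratplay (batter_id) VALUES (?)", [(1,), (2,), (5,), (None,)]
)

print(find_orphan_batter_ids(demo))  # [5]
```

An orphan check on its own may pass even when IDs were remapped, because a remapped ID can still exist in `player` while pointing at a different person. It is best paired with a comparison of row contents against the SQLite source.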
## Files Involved in Investigation
### Modified During Debugging
- `/mnt/NV2/Development/major-domo/database/app/routers_v3/stratplay.py`
  - Fixed GROUP BY and ORDER BY PostgreSQL compatibility issues
  - Lines 317, 529, 1062: Removed/modified problematic query components
### Configuration Files
- `/mnt/NV2/Development/major-domo/database/docker-compose.yml`
  - PostgreSQL connection details and credentials
### Migration Scripts (Suspected)
- `/mnt/NV2/Development/major-domo/database/migrate_to_postgres.py` (needs investigation)
- `/mnt/NV2/Development/major-domo/database/migrations.py`
## Test Queries for Validation
### Verify Player ID Mapping
```sql
-- Check specific problematic players
SELECT id, name, pos_1, season FROM player WHERE id IN (9916, 9958);
-- Verify Marcell Ozuna correct ID in season 10
SELECT id, name, season FROM player WHERE name = 'Marcell Ozuna' AND season = 10;
```
### Test Statistical Accuracy
```sql
-- Test bases-loaded batting performance (obc=111)
SELECT
    t1.batter_id,
    p.name,
    p.pos_1,
    SUM(t1.re24_primary) AS sum_repri
FROM stratplay AS t1
JOIN player AS p ON t1.batter_id = p.id
WHERE t1.game_id IN (SELECT t2.id FROM stratgame AS t2 WHERE t2.season = 10)
    AND t1.batter_id IS NOT NULL
    AND t1.on_base_code = '111'
GROUP BY t1.batter_id, p.name, p.pos_1
HAVING SUM(t1.pa) >= 1
ORDER BY sum_repri DESC
LIMIT 5;
```
## Contact Information
This issue was discovered during an API endpoint debugging session on August 19, 2025. The investigation revealed systematic data corruption affecting all player-based statistics in the migrated PostgreSQL database.
**Next Steps**: Locate and fix the migration script to preserve SQLite primary keys, then re-run the complete database migration process.