# PostgreSQL Migration Data Integrity Issue - Critical Bug Report ## Issue Summary **Critical data corruption discovered in PostgreSQL database migration**: Player IDs were not preserved during SQLite-to-PostgreSQL migration, causing systematic misalignment between player identities and their associated game statistics. **Date Discovered**: August 19, 2025 **Severity**: Critical - All player-based statistics queries return incorrect results **Status**: Identified, Root Cause Confirmed, Awaiting Fix ## Symptoms Observed ### Initial Problem - API endpoint `http://localhost:801/api/v3/plays/batting?season=10&group_by=playerteam&limit=10&obc=111&sort=repri-desc` returned suspicious results - Player ID 9916 appeared as "Trevor Williams (SP)" with high batting performance in bases-loaded situations - This was anomalous because starting pitchers shouldn't be top batting performers ### Comparison with Source Data **Correct SQLite API Response** (`https://sba.manticorum.com/api/v3/plays/batting?season=10&group_by=playerteam&limit=10&obc=111&sort=repri-desc`): - Player ID 9916: **Marcell Ozuna (LF)** - 8.096 RE24 - Top performer: **Michael Harris (CF, ID 9958)** - 8.662 RE24 **Incorrect PostgreSQL API Response** (same endpoint on localhost:801): - Player ID 9916: **Trevor Williams (SP)** - 8.096 RE24 - Missing correct top performer Michael Harris entirely ## Root Cause Analysis ### Database Investigation Results #### Player ID Mapping Corruption **SQLite Database (Correct)**: ``` ID 9916: Marcell Ozuna (LF) ID 9958: Michael Harris (CF) ``` **PostgreSQL Database (Incorrect)**: ``` ID 9916: Trevor Williams (SP) ID 9958: Xavier Edwards (2B) ``` #### Primary Key Assignment Issue **SQLite Database Structure**: - Player IDs: Range from ~1 to 12000+ with gaps (due to historical deletions) - Example high IDs: 9346, 9347, 9348, 9349, 9350 - Preserves original IDs with gaps intact **PostgreSQL Database Structure**: - Player IDs: Sequential 1 to 12232 with NO gaps - Total players: 12,232 - Range: 1-12232 (perfectly sequential) #### Migration Logic Flaw The migration process failed to preserve original SQLite primary keys: 1. **SQLite**: Marcell Ozuna had ID 9916 (with gaps in sequence) 2. **Migration**: PostgreSQL auto-assigned new sequential IDs starting from 1 3. **Result**: Marcell Ozuna received new ID 9658, while Trevor Williams was assigned ID 9916 4. **Impact**: All `stratplay` records still reference original IDs, but those IDs now point to different players ### Evidence of Systematic Corruption #### Multiple Season Data PostgreSQL contains duplicate players across seasons: ```sql SELECT id, name, season FROM player WHERE name = 'Marcell Ozuna'; ``` Results: ``` 621 | Marcell Ozuna | Season 1 1627 | Marcell Ozuna | Season 2 2529 | Marcell Ozuna | Season 3 ... 9658 | Marcell Ozuna | Season 10 <- Should be ID 9916 ``` #### Verification Query ```sql -- PostgreSQL shows wrong player for ID 9916 SELECT id, name, pos_1 FROM player WHERE id = 9916; -- Result: 9916 | Trevor Williams | SP -- SQLite API shows correct player for ID 9916 curl "https://sba.manticorum.com/api/v3/players/9916" -- Result: {"id": 9916, "name": "Marcell Ozuna", "pos_1": "LF"} ``` ## Technical Impact ### Affected Systems - **All player-based statistics queries** return incorrect results - **Batting statistics API** (`/api/v3/plays/batting`) - **Pitching statistics API** (`/api/v3/plays/pitching`) - **Fielding statistics API** (`/api/v3/plays/fielding`) - **Player lookup endpoints** (`/api/v3/players/{id}`) - **Any endpoint that joins `stratplay` with `player` tables** ### Data Integrity Scope - **stratplay table**: Contains ~48,000 records with original SQLite player IDs - **player table**: Contains remapped IDs that don't match stratplay references - **Foreign key relationships**: Completely broken between stratplay.batter_id and player.id ### Related Issues Fixed During Investigation 1. **PostgreSQL GROUP BY Error**: Fixed SQL query that was selecting `game_id` without including it in GROUP BY clause 2. **ORDER BY Conflicts**: Removed `StratPlay.id` ordering from grouped queries to prevent PostgreSQL GROUP BY violations ## Reproduction Steps 1. **Query PostgreSQL database**: ```bash curl "http://localhost:801/api/v3/plays/batting?season=10&group_by=playerteam&limit=10&obc=111&sort=repri-desc" ``` 2. **Query SQLite database** (correct source): ```bash curl "https://sba.manticorum.com/api/v3/plays/batting?season=10&group_by=playerteam&limit=10&obc=111&sort=repri-desc" ``` 3. **Compare results**: Player names and statistics will be misaligned 4. **Verify specific player**: ```bash # PostgreSQL (wrong) curl "http://localhost:801/api/v3/players/9916" # SQLite (correct) curl "https://sba.manticorum.com/api/v3/players/9916" ``` ## Migration Script Issue ### Current Problematic Behavior The migration script appears to: 1. Extract player data from SQLite 2. Insert into PostgreSQL without preserving original IDs 3. Allow PostgreSQL to auto-assign sequential primary keys 4. Migrate stratplay data with original foreign key references ### Required Fix The migration script must: 1. **Preserve original SQLite primary keys** during player table migration 2. **Explicitly set ID values** during INSERT operations 3. **Adjust PostgreSQL sequence** to start after the highest migrated ID 4. **Validate foreign key integrity** post-migration ### Example Corrected Migration Logic ```python # Instead of: cursor.execute("INSERT INTO player (name, pos_1, season) VALUES (%s, %s, %s)", (player.name, player.pos_1, player.season)) # Should be: cursor.execute("INSERT INTO player (id, name, pos_1, season) VALUES (%s, %s, %s, %s)", (player.id, player.name, player.pos_1, player.season)) # Then reset sequence: cursor.execute("SELECT setval('player_id_seq', (SELECT MAX(id) FROM player));") ``` ## Database Environment Details ### PostgreSQL Setup - **Container**: sba_postgres - **Database**: sba_master - **User**: sba_admin - **Port**: 5432 - **Version**: PostgreSQL 16-alpine ### SQLite Source - **API Endpoint**: https://sba.manticorum.com/api/v3/ - **Database Files**: `sba_master.db`, `pd_master.db` - **Status**: Confirmed working and accurate ## Immediate Recommendations ### Priority 1: Stop Using PostgreSQL Database - **All production queries should use SQLite API** until this is fixed - **PostgreSQL database results are completely unreliable** for player statistics ### Priority 2: Fix Migration Script - **Identify migration script location** (likely `migrate_to_postgres.py`) - **Modify to preserve primary keys** from SQLite source - **Add validation checks** for foreign key integrity ### Priority 3: Re-run Complete Migration - **Drop and recreate PostgreSQL database** - **Run corrected migration script** - **Validate sample queries** against SQLite source before declaring fixed ### Priority 4: Add Data Validation Tests - **Create automated tests** comparing PostgreSQL vs SQLite query results - **Add foreign key constraint validation** - **Implement post-migration data integrity checks** ## Files Involved in Investigation ### Modified During Debugging - `/mnt/NV2/Development/major-domo/database/app/routers_v3/stratplay.py` - Fixed GROUP BY and ORDER BY PostgreSQL compatibility issues - Lines 317, 529, 1062: Removed/modified problematic query components ### Configuration Files - `/mnt/NV2/Development/major-domo/database/docker-compose.yml` - PostgreSQL connection details and credentials ### Migration Scripts (Suspected) - `/mnt/NV2/Development/major-domo/database/migrate_to_postgres.py` (needs investigation) - `/mnt/NV2/Development/major-domo/database/migrations.py` ## Test Queries for Validation ### Verify Player ID Mapping ```sql -- Check specific problematic players SELECT id, name, pos_1, season FROM player WHERE id IN (9916, 9958); -- Verify Marcell Ozuna correct ID in season 10 SELECT id, name, season FROM player WHERE name = 'Marcell Ozuna' AND season = 10; ``` ### Test Statistical Accuracy ```sql -- Test bases-loaded batting performance (obc=111) SELECT t1.batter_id, p.name, p.pos_1, SUM(t1.re24_primary) AS sum_repri FROM stratplay AS t1 JOIN player p ON t1.batter_id = p.id WHERE t1.game_id IN (SELECT t2.id FROM stratgame AS t2 WHERE t2.season = 10) AND t1.batter_id IS NOT NULL AND t1.on_base_code = '111' GROUP BY t1.batter_id, p.name, p.pos_1 HAVING SUM(t1.pa) >= 1 ORDER BY sum_repri DESC LIMIT 5; ``` ## Contact Information This issue was discovered during API endpoint debugging session on August 19, 2025. The investigation revealed systematic data corruption affecting all player-based statistics in the PostgreSQL migration. **Next Steps**: Locate and fix the migration script to preserve SQLite primary keys, then re-run the complete database migration process.