8.7 KiB
PostgreSQL Migration Data Integrity Issue - Critical Bug Report
Issue Summary
Critical data corruption discovered in PostgreSQL database migration: Player IDs were not preserved during SQLite-to-PostgreSQL migration, causing systematic misalignment between player identities and their associated game statistics.
Date Discovered: August 19, 2025
Severity: Critical - All player-based statistics queries return incorrect results
Status: Identified, Root Cause Confirmed, Awaiting Fix
Symptoms Observed
Initial Problem
- API endpoint
http://localhost:801/api/v3/plays/batting?season=10&group_by=playerteam&limit=10&obc=111&sort=repri-descreturned suspicious results - Player ID 9916 appeared as "Trevor Williams (SP)" with high batting performance in bases-loaded situations
- This was anomalous because starting pitchers shouldn't be top batting performers
Comparison with Source Data
Correct SQLite API Response (https://sba.manticorum.com/api/v3/plays/batting?season=10&group_by=playerteam&limit=10&obc=111&sort=repri-desc):
- Player ID 9916: Marcell Ozuna (LF) - 8.096 RE24
- Top performer: Michael Harris (CF, ID 9958) - 8.662 RE24
Incorrect PostgreSQL API Response (same endpoint on localhost:801):
- Player ID 9916: Trevor Williams (SP) - 8.096 RE24
- Missing correct top performer Michael Harris entirely
Root Cause Analysis
Database Investigation Results
Player ID Mapping Corruption
SQLite Database (Correct):
ID 9916: Marcell Ozuna (LF)
ID 9958: Michael Harris (CF)
PostgreSQL Database (Incorrect):
ID 9916: Trevor Williams (SP)
ID 9958: Xavier Edwards (2B)
Primary Key Assignment Issue
SQLite Database Structure:
- Player IDs: Range from ~1 to 12000+ with gaps (due to historical deletions)
- Example high IDs: 9346, 9347, 9348, 9349, 9350
- Preserves original IDs with gaps intact
PostgreSQL Database Structure:
- Player IDs: Sequential 1 to 12232 with NO gaps
- Total players: 12,232
- Range: 1-12232 (perfectly sequential)
Migration Logic Flaw
The migration process failed to preserve original SQLite primary keys:
- SQLite: Marcell Ozuna had ID 9916 (with gaps in sequence)
- Migration: PostgreSQL auto-assigned new sequential IDs starting from 1
- Result: Marcell Ozuna received new ID 9658, while Trevor Williams was assigned ID 9916
- Impact: All
stratplayrecords still reference original IDs, but those IDs now point to different players
Evidence of Systematic Corruption
Multiple Season Data
PostgreSQL contains duplicate players across seasons:
SELECT id, name, season FROM player WHERE name = 'Marcell Ozuna';
Results:
621 | Marcell Ozuna | Season 1
1627 | Marcell Ozuna | Season 2
2529 | Marcell Ozuna | Season 3
...
9658 | Marcell Ozuna | Season 10 <- Should be ID 9916
Verification Query
-- PostgreSQL shows wrong player for ID 9916
SELECT id, name, pos_1 FROM player WHERE id = 9916;
-- Result: 9916 | Trevor Williams | SP
-- SQLite API shows correct player for ID 9916
curl "https://sba.manticorum.com/api/v3/players/9916"
-- Result: {"id": 9916, "name": "Marcell Ozuna", "pos_1": "LF"}
Technical Impact
Affected Systems
- All player-based statistics queries return incorrect results
- Batting statistics API (
/api/v3/plays/batting) - Pitching statistics API (
/api/v3/plays/pitching) - Fielding statistics API (
/api/v3/plays/fielding) - Player lookup endpoints (
/api/v3/players/{id}) - Any endpoint that joins
stratplaywithplayertables
Data Integrity Scope
- stratplay table: Contains ~48,000 records with original SQLite player IDs
- player table: Contains remapped IDs that don't match stratplay references
- Foreign key relationships: Completely broken between stratplay.batter_id and player.id
Related Issues Fixed During Investigation
- PostgreSQL GROUP BY Error: Fixed SQL query that was selecting
game_idwithout including it in GROUP BY clause - ORDER BY Conflicts: Removed
StratPlay.idordering from grouped queries to prevent PostgreSQL GROUP BY violations
Reproduction Steps
-
Query PostgreSQL database:
curl "http://localhost:801/api/v3/plays/batting?season=10&group_by=playerteam&limit=10&obc=111&sort=repri-desc" -
Query SQLite database (correct source):
curl "https://sba.manticorum.com/api/v3/plays/batting?season=10&group_by=playerteam&limit=10&obc=111&sort=repri-desc" -
Compare results: Player names and statistics will be misaligned
-
Verify specific player:
# PostgreSQL (wrong) curl "http://localhost:801/api/v3/players/9916" # SQLite (correct) curl "https://sba.manticorum.com/api/v3/players/9916"
Migration Script Issue
Current Problematic Behavior
The migration script appears to:
- Extract player data from SQLite
- Insert into PostgreSQL without preserving original IDs
- Allow PostgreSQL to auto-assign sequential primary keys
- Migrate stratplay data with original foreign key references
Required Fix
The migration script must:
- Preserve original SQLite primary keys during player table migration
- Explicitly set ID values during INSERT operations
- Adjust PostgreSQL sequence to start after the highest migrated ID
- Validate foreign key integrity post-migration
Example Corrected Migration Logic
# Instead of:
cursor.execute("INSERT INTO player (name, pos_1, season) VALUES (%s, %s, %s)",
(player.name, player.pos_1, player.season))
# Should be:
cursor.execute("INSERT INTO player (id, name, pos_1, season) VALUES (%s, %s, %s, %s)",
(player.id, player.name, player.pos_1, player.season))
# Then reset sequence:
cursor.execute("SELECT setval('player_id_seq', (SELECT MAX(id) FROM player));")
Database Environment Details
PostgreSQL Setup
- Container: sba_postgres
- Database: sba_master
- User: sba_admin
- Port: 5432
- Version: PostgreSQL 16-alpine
SQLite Source
- API Endpoint: https://sba.manticorum.com/api/v3/
- Database Files:
sba_master.db,pd_master.db - Status: Confirmed working and accurate
Immediate Recommendations
Priority 1: Stop Using PostgreSQL Database
- All production queries should use SQLite API until this is fixed
- PostgreSQL database results are completely unreliable for player statistics
Priority 2: Fix Migration Script
- Identify migration script location (likely
migrate_to_postgres.py) - Modify to preserve primary keys from SQLite source
- Add validation checks for foreign key integrity
Priority 3: Re-run Complete Migration
- Drop and recreate PostgreSQL database
- Run corrected migration script
- Validate sample queries against SQLite source before declaring fixed
Priority 4: Add Data Validation Tests
- Create automated tests comparing PostgreSQL vs SQLite query results
- Add foreign key constraint validation
- Implement post-migration data integrity checks
Files Involved in Investigation
Modified During Debugging
/mnt/NV2/Development/major-domo/database/app/routers_v3/stratplay.py- Fixed GROUP BY and ORDER BY PostgreSQL compatibility issues
- Lines 317, 529, 1062: Removed/modified problematic query components
Configuration Files
/mnt/NV2/Development/major-domo/database/docker-compose.yml- PostgreSQL connection details and credentials
Migration Scripts (Suspected)
/mnt/NV2/Development/major-domo/database/migrate_to_postgres.py(needs investigation)/mnt/NV2/Development/major-domo/database/migrations.py
Test Queries for Validation
Verify Player ID Mapping
-- Check specific problematic players
SELECT id, name, pos_1, season FROM player WHERE id IN (9916, 9958);
-- Verify Marcell Ozuna correct ID in season 10
SELECT id, name, season FROM player WHERE name = 'Marcell Ozuna' AND season = 10;
Test Statistical Accuracy
-- Test bases-loaded batting performance (obc=111)
SELECT
t1.batter_id,
p.name,
p.pos_1,
SUM(t1.re24_primary) AS sum_repri
FROM stratplay AS t1
JOIN player p ON t1.batter_id = p.id
WHERE t1.game_id IN (SELECT t2.id FROM stratgame AS t2 WHERE t2.season = 10)
AND t1.batter_id IS NOT NULL
AND t1.on_base_code = '111'
GROUP BY t1.batter_id, p.name, p.pos_1
HAVING SUM(t1.pa) >= 1
ORDER BY sum_repri DESC
LIMIT 5;
Contact Information
This issue was discovered during API endpoint debugging session on August 19, 2025. The investigation revealed systematic data corruption affecting all player-based statistics in the PostgreSQL migration.
Next Steps: Locate and fix the migration script to preserve SQLite primary keys, then re-run the complete database migration process.