major-domo-database/POSTGRESQL_MIGRATION_DATA_INTEGRITY_ISSUE.md
2025-08-20 09:52:46 -05:00

8.7 KiB

PostgreSQL Migration Data Integrity Issue - Critical Bug Report

Issue Summary

Critical data corruption discovered in PostgreSQL database migration: Player IDs were not preserved during SQLite-to-PostgreSQL migration, causing systematic misalignment between player identities and their associated game statistics.

Date Discovered: August 19, 2025
Severity: Critical - All player-based statistics queries return incorrect results
Status: Identified, Root Cause Confirmed, Awaiting Fix

Symptoms Observed

Initial Problem

  • API endpoint http://localhost:801/api/v3/plays/batting?season=10&group_by=playerteam&limit=10&obc=111&sort=repri-desc returned suspicious results
  • Player ID 9916 appeared as "Trevor Williams (SP)" with high batting performance in bases-loaded situations
  • This was anomalous because starting pitchers shouldn't be top batting performers

Comparison with Source Data

Correct SQLite API Response (https://sba.manticorum.com/api/v3/plays/batting?season=10&group_by=playerteam&limit=10&obc=111&sort=repri-desc):

  • Player ID 9916: Marcell Ozuna (LF) - 8.096 RE24
  • Top performer: Michael Harris (CF, ID 9958) - 8.662 RE24

Incorrect PostgreSQL API Response (same endpoint on localhost:801):

  • Player ID 9916: Trevor Williams (SP) - 8.096 RE24
  • Missing correct top performer Michael Harris entirely

Root Cause Analysis

Database Investigation Results

Player ID Mapping Corruption

SQLite Database (Correct):

ID 9916: Marcell Ozuna (LF)
ID 9958: Michael Harris (CF)

PostgreSQL Database (Incorrect):

ID 9916: Trevor Williams (SP)
ID 9958: Xavier Edwards (2B)

Primary Key Assignment Issue

SQLite Database Structure:

  • Player IDs: Range from ~1 to 12000+ with gaps (due to historical deletions)
  • Example high IDs: 9346, 9347, 9348, 9349, 9350
  • Preserves original IDs with gaps intact

PostgreSQL Database Structure:

  • Player IDs: Sequential 1 to 12232 with NO gaps
  • Total players: 12,232
  • Range: 1-12232 (perfectly sequential)

Migration Logic Flaw

The migration process failed to preserve original SQLite primary keys:

  1. SQLite: Marcell Ozuna had ID 9916 (with gaps in sequence)
  2. Migration: PostgreSQL auto-assigned new sequential IDs starting from 1
  3. Result: Marcell Ozuna received new ID 9658, while Trevor Williams was assigned ID 9916
  4. Impact: All stratplay records still reference original IDs, but those IDs now point to different players

Evidence of Systematic Corruption

Multiple Season Data

PostgreSQL contains duplicate players across seasons:

SELECT id, name, season FROM player WHERE name = 'Marcell Ozuna';

Results:

  621 | Marcell Ozuna | Season 1
 1627 | Marcell Ozuna | Season 2
 2529 | Marcell Ozuna | Season 3
 ...
 9658 | Marcell Ozuna | Season 10  <- Should be ID 9916

Verification Query

-- PostgreSQL shows wrong player for ID 9916
SELECT id, name, pos_1 FROM player WHERE id = 9916;
-- Result: 9916 | Trevor Williams | SP

-- SQLite API shows correct player for ID 9916
curl "https://sba.manticorum.com/api/v3/players/9916"
-- Result: {"id": 9916, "name": "Marcell Ozuna", "pos_1": "LF"}

Technical Impact

Affected Systems

  • All player-based statistics queries return incorrect results
  • Batting statistics API (/api/v3/plays/batting)
  • Pitching statistics API (/api/v3/plays/pitching)
  • Fielding statistics API (/api/v3/plays/fielding)
  • Player lookup endpoints (/api/v3/players/{id})
  • Any endpoint that joins stratplay with player tables

Data Integrity Scope

  • stratplay table: Contains ~48,000 records with original SQLite player IDs
  • player table: Contains remapped IDs that don't match stratplay references
  • Foreign key relationships: Completely broken between stratplay.batter_id and player.id
  1. PostgreSQL GROUP BY Error: Fixed SQL query that was selecting game_id without including it in GROUP BY clause
  2. ORDER BY Conflicts: Removed StratPlay.id ordering from grouped queries to prevent PostgreSQL GROUP BY violations

Reproduction Steps

  1. Query PostgreSQL database:

    curl "http://localhost:801/api/v3/plays/batting?season=10&group_by=playerteam&limit=10&obc=111&sort=repri-desc"
    
  2. Query SQLite database (correct source):

    curl "https://sba.manticorum.com/api/v3/plays/batting?season=10&group_by=playerteam&limit=10&obc=111&sort=repri-desc"
    
  3. Compare results: Player names and statistics will be misaligned

  4. Verify specific player:

    # PostgreSQL (wrong)
    curl "http://localhost:801/api/v3/players/9916"
    # SQLite (correct)  
    curl "https://sba.manticorum.com/api/v3/players/9916"
    

Migration Script Issue

Current Problematic Behavior

The migration script appears to:

  1. Extract player data from SQLite
  2. Insert into PostgreSQL without preserving original IDs
  3. Allow PostgreSQL to auto-assign sequential primary keys
  4. Migrate stratplay data with original foreign key references

Required Fix

The migration script must:

  1. Preserve original SQLite primary keys during player table migration
  2. Explicitly set ID values during INSERT operations
  3. Adjust PostgreSQL sequence to start after the highest migrated ID
  4. Validate foreign key integrity post-migration

Example Corrected Migration Logic

# Instead of:
cursor.execute("INSERT INTO player (name, pos_1, season) VALUES (%s, %s, %s)", 
               (player.name, player.pos_1, player.season))

# Should be:
cursor.execute("INSERT INTO player (id, name, pos_1, season) VALUES (%s, %s, %s, %s)", 
               (player.id, player.name, player.pos_1, player.season))

# Then reset sequence:
cursor.execute("SELECT setval('player_id_seq', (SELECT MAX(id) FROM player));")

Database Environment Details

PostgreSQL Setup

  • Container: sba_postgres
  • Database: sba_master
  • User: sba_admin
  • Port: 5432
  • Version: PostgreSQL 16-alpine

SQLite Source

Immediate Recommendations

Priority 1: Stop Using PostgreSQL Database

  • All production queries should use SQLite API until this is fixed
  • PostgreSQL database results are completely unreliable for player statistics

Priority 2: Fix Migration Script

  • Identify migration script location (likely migrate_to_postgres.py)
  • Modify to preserve primary keys from SQLite source
  • Add validation checks for foreign key integrity

Priority 3: Re-run Complete Migration

  • Drop and recreate PostgreSQL database
  • Run corrected migration script
  • Validate sample queries against SQLite source before declaring fixed

Priority 4: Add Data Validation Tests

  • Create automated tests comparing PostgreSQL vs SQLite query results
  • Add foreign key constraint validation
  • Implement post-migration data integrity checks

Files Involved in Investigation

Modified During Debugging

  • /mnt/NV2/Development/major-domo/database/app/routers_v3/stratplay.py
    • Fixed GROUP BY and ORDER BY PostgreSQL compatibility issues
    • Lines 317, 529, 1062: Removed/modified problematic query components

Configuration Files

  • /mnt/NV2/Development/major-domo/database/docker-compose.yml
    • PostgreSQL connection details and credentials

Migration Scripts (Suspected)

  • /mnt/NV2/Development/major-domo/database/migrate_to_postgres.py (needs investigation)
  • /mnt/NV2/Development/major-domo/database/migrations.py

Test Queries for Validation

Verify Player ID Mapping

-- Check specific problematic players
SELECT id, name, pos_1, season FROM player WHERE id IN (9916, 9958);

-- Verify Marcell Ozuna correct ID in season 10
SELECT id, name, season FROM player WHERE name = 'Marcell Ozuna' AND season = 10;

Test Statistical Accuracy

-- Test bases-loaded batting performance (obc=111)
SELECT 
    t1.batter_id,
    p.name,
    p.pos_1,
    SUM(t1.re24_primary) AS sum_repri
FROM stratplay AS t1
JOIN player p ON t1.batter_id = p.id
WHERE t1.game_id IN (SELECT t2.id FROM stratgame AS t2 WHERE t2.season = 10)
AND t1.batter_id IS NOT NULL
AND t1.on_base_code = '111'
GROUP BY t1.batter_id, p.name, p.pos_1
HAVING SUM(t1.pa) >= 1
ORDER BY sum_repri DESC
LIMIT 5;

Contact Information

This issue was discovered during API endpoint debugging session on August 19, 2025. The investigation revealed systematic data corruption affecting all player-based statistics in the PostgreSQL migration.

Next Steps: Locate and fix the migration script to preserve SQLite primary keys, then re-run the complete database migration process.