244 lines
8.7 KiB
Markdown
244 lines
8.7 KiB
Markdown
# PostgreSQL Migration Data Integrity Issue - Critical Bug Report
|
|
|
|
## Issue Summary
|
|
**Critical data corruption discovered in PostgreSQL database migration**: Player IDs were not preserved during SQLite-to-PostgreSQL migration, causing systematic misalignment between player identities and their associated game statistics.
|
|
|
|
**Date Discovered**: August 19, 2025
|
|
**Severity**: Critical - All player-based statistics queries return incorrect results
|
|
**Status**: Identified, Root Cause Confirmed, Awaiting Fix
|
|
|
|
## Symptoms Observed
|
|
|
|
### Initial Problem
|
|
- API endpoint `http://localhost:801/api/v3/plays/batting?season=10&group_by=playerteam&limit=10&obc=111&sort=repri-desc` returned suspicious results
|
|
- Player ID 9916 appeared as "Trevor Williams (SP)" with high batting performance in bases-loaded situations
|
|
- This was anomalous because starting pitchers shouldn't be top batting performers
|
|
|
|
### Comparison with Source Data
|
|
**Correct SQLite API Response** (`https://sba.manticorum.com/api/v3/plays/batting?season=10&group_by=playerteam&limit=10&obc=111&sort=repri-desc`):
|
|
- Player ID 9916: **Marcell Ozuna (LF)** - 8.096 RE24
|
|
- Top performer: **Michael Harris (CF, ID 9958)** - 8.662 RE24
|
|
|
|
**Incorrect PostgreSQL API Response** (same endpoint on localhost:801):
|
|
- Player ID 9916: **Trevor Williams (SP)** - 8.096 RE24
|
|
- Missing correct top performer Michael Harris entirely
|
|
|
|
## Root Cause Analysis
|
|
|
|
### Database Investigation Results
|
|
|
|
#### Player ID Mapping Corruption
|
|
**SQLite Database (Correct)**:
|
|
```
|
|
ID 9916: Marcell Ozuna (LF)
|
|
ID 9958: Michael Harris (CF)
|
|
```
|
|
|
|
**PostgreSQL Database (Incorrect)**:
|
|
```
|
|
ID 9916: Trevor Williams (SP)
|
|
ID 9958: Xavier Edwards (2B)
|
|
```
|
|
|
|
#### Primary Key Assignment Issue
|
|
**SQLite Database Structure**:
|
|
- Player IDs: Range from ~1 to 12000+ with gaps (due to historical deletions)
|
|
- Example high IDs: 9346, 9347, 9348, 9349, 9350
|
|
- Preserves original IDs with gaps intact
|
|
|
|
**PostgreSQL Database Structure**:
|
|
- Player IDs: Sequential 1 to 12232 with NO gaps
|
|
- Total players: 12,232
|
|
- Range: 1-12232 (perfectly sequential)
|
|
|
|
#### Migration Logic Flaw
|
|
The migration process failed to preserve original SQLite primary keys:
|
|
|
|
1. **SQLite**: Marcell Ozuna had ID 9916 (with gaps in sequence)
|
|
2. **Migration**: PostgreSQL auto-assigned new sequential IDs starting from 1
|
|
3. **Result**: Marcell Ozuna received new ID 9658, while Trevor Williams was assigned ID 9916
|
|
4. **Impact**: All `stratplay` records still reference original IDs, but those IDs now point to different players
|
|
|
|
### Evidence of Systematic Corruption
|
|
|
|
#### Multiple Season Data
|
|
PostgreSQL contains duplicate players across seasons:
|
|
```sql
|
|
SELECT id, name, season FROM player WHERE name = 'Marcell Ozuna';
|
|
```
|
|
Results:
|
|
```
|
|
621 | Marcell Ozuna | Season 1
|
|
1627 | Marcell Ozuna | Season 2
|
|
2529 | Marcell Ozuna | Season 3
|
|
...
|
|
9658 | Marcell Ozuna | Season 10 <- Should be ID 9916
|
|
```
|
|
|
|
#### Verification Query
|
|
```sql
|
|
-- PostgreSQL shows wrong player for ID 9916
|
|
SELECT id, name, pos_1 FROM player WHERE id = 9916;
|
|
-- Result: 9916 | Trevor Williams | SP
|
|
|
|
-- SQLite API shows correct player for ID 9916
|
|
curl "https://sba.manticorum.com/api/v3/players/9916"
|
|
-- Result: {"id": 9916, "name": "Marcell Ozuna", "pos_1": "LF"}
|
|
```
|
|
|
|
## Technical Impact
|
|
|
|
### Affected Systems
|
|
- **All player-based statistics queries** return incorrect results
|
|
- **Batting statistics API** (`/api/v3/plays/batting`)
|
|
- **Pitching statistics API** (`/api/v3/plays/pitching`)
|
|
- **Fielding statistics API** (`/api/v3/plays/fielding`)
|
|
- **Player lookup endpoints** (`/api/v3/players/{id}`)
|
|
- **Any endpoint that joins `stratplay` with `player` tables**
|
|
|
|
### Data Integrity Scope
|
|
- **stratplay table**: Contains ~48,000 records with original SQLite player IDs
|
|
- **player table**: Contains remapped IDs that don't match stratplay references
|
|
- **Foreign key relationships**: Completely broken between stratplay.batter_id and player.id
|
|
|
|
### Related Issues Fixed During Investigation
|
|
1. **PostgreSQL GROUP BY Error**: Fixed SQL query that was selecting `game_id` without including it in GROUP BY clause
|
|
2. **ORDER BY Conflicts**: Removed `StratPlay.id` ordering from grouped queries to prevent PostgreSQL GROUP BY violations
|
|
|
|
## Reproduction Steps
|
|
|
|
1. **Query PostgreSQL database**:
|
|
```bash
|
|
curl "http://localhost:801/api/v3/plays/batting?season=10&group_by=playerteam&limit=10&obc=111&sort=repri-desc"
|
|
```
|
|
|
|
2. **Query SQLite database** (correct source):
|
|
```bash
|
|
curl "https://sba.manticorum.com/api/v3/plays/batting?season=10&group_by=playerteam&limit=10&obc=111&sort=repri-desc"
|
|
```
|
|
|
|
3. **Compare results**: Player names and statistics will be misaligned
|
|
|
|
4. **Verify specific player**:
|
|
```bash
|
|
# PostgreSQL (wrong)
|
|
curl "http://localhost:801/api/v3/players/9916"
|
|
# SQLite (correct)
|
|
curl "https://sba.manticorum.com/api/v3/players/9916"
|
|
```
|
|
|
|
## Migration Script Issue
|
|
|
|
### Current Problematic Behavior
|
|
The migration script appears to:
|
|
1. Extract player data from SQLite
|
|
2. Insert into PostgreSQL without preserving original IDs
|
|
3. Allow PostgreSQL to auto-assign sequential primary keys
|
|
4. Migrate stratplay data with original foreign key references
|
|
|
|
### Required Fix
|
|
The migration script must:
|
|
1. **Preserve original SQLite primary keys** during player table migration
|
|
2. **Explicitly set ID values** during INSERT operations
|
|
3. **Adjust PostgreSQL sequence** to start after the highest migrated ID
|
|
4. **Validate foreign key integrity** post-migration
|
|
|
|
### Example Corrected Migration Logic
|
|
```python
|
|
# Instead of:
|
|
cursor.execute("INSERT INTO player (name, pos_1, season) VALUES (%s, %s, %s)",
|
|
(player.name, player.pos_1, player.season))
|
|
|
|
# Should be:
|
|
cursor.execute("INSERT INTO player (id, name, pos_1, season) VALUES (%s, %s, %s, %s)",
|
|
(player.id, player.name, player.pos_1, player.season))
|
|
|
|
# Then reset sequence:
|
|
cursor.execute("SELECT setval('player_id_seq', (SELECT MAX(id) FROM player));")
|
|
```
|
|
|
|
## Database Environment Details
|
|
|
|
### PostgreSQL Setup
|
|
- **Container**: sba_postgres
|
|
- **Database**: sba_master
|
|
- **User**: sba_admin
|
|
- **Port**: 5432
|
|
- **Version**: PostgreSQL 16-alpine
|
|
|
|
### SQLite Source
|
|
- **API Endpoint**: https://sba.manticorum.com/api/v3/
|
|
- **Database Files**: `sba_master.db`, `pd_master.db`
|
|
- **Status**: Confirmed working and accurate
|
|
|
|
## Immediate Recommendations
|
|
|
|
### Priority 1: Stop Using PostgreSQL Database
|
|
- **All production queries should use SQLite API** until this is fixed
|
|
- **PostgreSQL database results are completely unreliable** for player statistics
|
|
|
|
### Priority 2: Fix Migration Script
|
|
- **Identify migration script location** (likely `migrate_to_postgres.py`)
|
|
- **Modify to preserve primary keys** from SQLite source
|
|
- **Add validation checks** for foreign key integrity
|
|
|
|
### Priority 3: Re-run Complete Migration
|
|
- **Drop and recreate PostgreSQL database**
|
|
- **Run corrected migration script**
|
|
- **Validate sample queries** against SQLite source before declaring fixed
|
|
|
|
### Priority 4: Add Data Validation Tests
|
|
- **Create automated tests** comparing PostgreSQL vs SQLite query results
|
|
- **Add foreign key constraint validation**
|
|
- **Implement post-migration data integrity checks**
|
|
|
|
## Files Involved in Investigation
|
|
|
|
### Modified During Debugging
|
|
- `/mnt/NV2/Development/major-domo/database/app/routers_v3/stratplay.py`
|
|
- Fixed GROUP BY and ORDER BY PostgreSQL compatibility issues
|
|
- Lines 317, 529, 1062: Removed/modified problematic query components
|
|
|
|
### Configuration Files
|
|
- `/mnt/NV2/Development/major-domo/database/docker-compose.yml`
|
|
- PostgreSQL connection details and credentials
|
|
|
|
### Migration Scripts (Suspected)
|
|
- `/mnt/NV2/Development/major-domo/database/migrate_to_postgres.py` (needs investigation)
|
|
- `/mnt/NV2/Development/major-domo/database/migrations.py`
|
|
|
|
## Test Queries for Validation
|
|
|
|
### Verify Player ID Mapping
|
|
```sql
|
|
-- Check specific problematic players
|
|
SELECT id, name, pos_1, season FROM player WHERE id IN (9916, 9958);
|
|
|
|
-- Verify Marcell Ozuna correct ID in season 10
|
|
SELECT id, name, season FROM player WHERE name = 'Marcell Ozuna' AND season = 10;
|
|
```
|
|
|
|
### Test Statistical Accuracy
|
|
```sql
|
|
-- Test bases-loaded batting performance (obc=111)
|
|
SELECT
|
|
t1.batter_id,
|
|
p.name,
|
|
p.pos_1,
|
|
SUM(t1.re24_primary) AS sum_repri
|
|
FROM stratplay AS t1
|
|
JOIN player p ON t1.batter_id = p.id
|
|
WHERE t1.game_id IN (SELECT t2.id FROM stratgame AS t2 WHERE t2.season = 10)
|
|
AND t1.batter_id IS NOT NULL
|
|
AND t1.on_base_code = '111'
|
|
GROUP BY t1.batter_id, p.name, p.pos_1
|
|
HAVING SUM(t1.pa) >= 1
|
|
ORDER BY sum_repri DESC
|
|
LIMIT 5;
|
|
```
|
|
|
|
## Contact Information
|
|
|
|
This issue was discovered during API endpoint debugging session on August 19, 2025. The investigation revealed systematic data corruption affecting all player-based statistics in the PostgreSQL migration.
|
|
|
|
**Next Steps**: Locate and fix the migration script to preserve SQLite primary keys, then re-run the complete database migration process. |