strat-gameplay-webapp/WEBSOCKET_ARCHITECTURE_ANALYSIS.md
Cal Corum e0c12467b0 CLAUDE: Improve UX with single-click OAuth, enhanced games list, and layout fix
Frontend UX improvements:
- Single-click Discord OAuth from home page (no intermediate /auth page)
- Auto-redirect authenticated users from home to /games
- Fixed Nuxt layout system - app.vue now wraps NuxtPage with NuxtLayout
- Games page now has proper card container with shadow/border styling
- Layout header includes working logout with API cookie clearing

Games list enhancements:
- Display team names (lname) instead of just team IDs
- Show current score for each team
- Show inning indicator (Top/Bot X) for active games
- Responsive header with wrapped buttons on mobile

Backend improvements:
- Added team caching to SbaApiClient (1-hour TTL)
- Enhanced GameListItem with team names, scores, inning data
- Games endpoint now enriches response with SBA API team data

Docker optimizations:
- Optimized Dockerfile using --chown flag on COPY (faster than chown -R)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-12-05 16:14:00 -06:00

830 lines
24 KiB
Markdown

# WebSocket Implementation Analysis Report
## Paper Dynasty Real-Time Game Engine Backend
**Date**: 2025-11-27
**Scope**: `/mnt/NV2/Development/strat-gameplay-webapp/backend/app/websocket/`
**Status**: Comprehensive exploration of production-ready implementation
---
## EXECUTIVE SUMMARY
The WebSocket implementation is **well-architected and production-ready** with 15 event handlers, proper error handling, and recovery mechanisms. However, there are several **identified risk areas** related to state synchronization, memory management, and race conditions that require attention before handling high-concurrency scenarios.
### Key Findings:
-**Solid Foundation**: Clean separation of concerns, proper async/await patterns
-**Good Error Handling**: 157+ error handling statements across handlers
- ⚠️ **State Mutation Risk**: Direct state object mutations without proper locking
- ⚠️ **Memory Leak Potential**: Game resource cleanup not guaranteed on all code paths
- ⚠️ **Race Condition Risk**: Multiple concurrent events could corrupt state
- ⚠️ **Reconnection Gap**: No explicit reconnection handling or session recovery strategy
- ⚠️ **Room Cleanup**: Stale game rooms may persist indefinitely
---
## 1. ARCHITECTURE OVERVIEW
### Directory Structure
```
app/websocket/
├── connection_manager.py (71 lines) # Room & session management
├── handlers.py (1,312 lines) # 15 event handlers
├── __init__.py # Package marker
└── CLAUDE.md # Documentation
```
### Socket.io Initialization (main.py)
```python
sio = socketio.AsyncServer(
async_mode="asgi",
cors_allowed_origins=settings.cors_origins,
logger=True,
engineio_logger=False
)
socket_app = socketio.ASGIApp(sio, app)
connection_manager = ConnectionManager(sio)
register_handlers(sio, connection_manager)
game_engine.set_connection_manager(connection_manager)
```
**Critical Point**: Connection manager is a singleton instance passed to all handlers via closure.
---
## 2. CONNECTION MANAGEMENT
### ConnectionManager Structure
```python
class ConnectionManager:
def __init__(self, sio: socketio.AsyncServer):
self.sio = sio
self.user_sessions: dict[str, str] = {} # sid -> user_id
self.game_rooms: dict[str, set[str]] = {} # game_id -> set of sids
```
### Connection Lifecycle
#### 1. Connect Handler
```python
@sio.event
async def connect(sid, environ, auth):
# Dual auth support:
# 1. HttpOnly cookies (from nginx proxy)
# 2. Auth object (direct JS clients)
token = extract_token(environ, auth)
user_data = verify_token(token)
user_id = user_data.get("user_id")
await manager.connect(sid, user_id)
await sio.emit("connected", {"user_id": user_id}, room=sid)
return True # Accept connection
```
**Status**: ✅ Properly authenticated with fallback support
#### 2. Disconnect Handler
```python
@sio.event
async def disconnect(sid):
await manager.disconnect(sid)
async def disconnect(self, sid: str) -> None:
user_id = self.user_sessions.pop(sid, None)
if user_id:
logger.info(f"User {user_id} disconnected (session {sid})")
# Remove from all game rooms
for game_id, sids in self.game_rooms.items():
if sid in sids:
sids.remove(sid)
await self.broadcast_to_game(
game_id, "user_disconnected", {"user_id": user_id}
)
```
**Status**: ✅ Removes user from all rooms
### Room Management
#### Join Game
```python
async def join_game(self, sid: str, game_id: str, role: str) -> None:
await self.sio.enter_room(sid, game_id)
if game_id not in self.game_rooms:
self.game_rooms[game_id] = set()
self.game_rooms[game_id].add(sid)
```
**Issue #1**: Redundant tracking - Socket.io already maintains room membership
- `self.sio.enter_room()` registers the sid in Socket.io's internal room
- `self.game_rooms[game_id]` duplicates this information
- **Risk**: Synchronization could diverge if Socket.io rooms are accessed directly
#### Leave Game
```python
async def leave_game(self, sid: str, game_id: str) -> None:
await self.sio.leave_room(sid, game_id)
if game_id in self.game_rooms:
self.game_rooms[game_id].discard(sid)
```
**Issue #2**: Manual set cleanup but no cleanup of empty game rooms
- After last user leaves, `self.game_rooms[game_id]` remains as empty set
- **Memory Impact**: Accumulates empty sets over time
- **Recommendation**: Remove empty game rooms after discard
---
## 3. EVENT HANDLERS (15 Total)
### Handler Categories
#### Connection Events (2)
- `connect` - JWT authentication
- `disconnect` - Session cleanup
#### Game Flow Events (4)
- `join_game` - Join game room
- `leave_game` - Leave game room
- `request_game_state` - Recovery after disconnect/initial load
- `heartbeat` - Keep-alive ping
#### Decision Submission (2)
- `submit_defensive_decision` - Defense strategy
- `submit_offensive_decision` - Offense strategy
#### Manual Outcome Flow (2)
- `roll_dice` - Roll dice for play
- `submit_manual_outcome` - Submit card result
#### Substitutions (3)
- `request_pinch_hitter` - Batting substitution
- `request_pitching_change` - Pitcher substitution
- `request_defensive_replacement` - Field substitution
#### Query Events (2)
- `get_lineup` - Get team lineup
- `get_box_score` - Get statistics
### Error Handling Pattern
**Consistent "Emit or Return" Pattern**:
```python
try:
# Validate input
if not game_id_str:
await manager.emit_to_user(sid, "error", {"message": "Missing game_id"})
return
# Parse and validate
try:
game_id = UUID(game_id_str)
except (ValueError, AttributeError):
await manager.emit_to_user(sid, "error", {"message": "Invalid format"})
return
# Get state
state = state_manager.get_state(game_id)
if not state:
await manager.emit_to_user(sid, "error", {"message": f"Game not found"})
return
# Process action
result = await game_engine.some_operation(...)
# Broadcast result
await manager.broadcast_to_game(str(game_id), "event_name", data)
except GameValidationError as e:
await manager.emit_to_user(sid, "outcome_rejected", {"message": str(e)})
except Exception as e:
logger.error(f"Error: {e}", exc_info=True)
await manager.emit_to_user(sid, "error", {"message": str(e)})
```
**Analysis**: 157+ error handling statements providing comprehensive coverage
---
## 4. STATE SYNCHRONIZATION PATTERNS
### State Flow Diagram
```
Handler gets state → Mutates state object → Updates state manager → Broadcasts
```
### Critical State Mutation: Dice Roll Example
```python
# Line 216-217 in handlers.py (roll_dice)
state.pending_manual_roll = ab_roll
state_manager.update_state(game_id, state)
```
**Issue #3 - RACE CONDITION RISK**: State mutation without locking
- Multiple concurrent `roll_dice` calls could overwrite `pending_manual_roll`
- `state_manager.update_state()` does NOT use locks
- **Scenario**: Player 1 rolls, Player 2 rolls before Player 1 submits outcome
- `pending_manual_roll` is overwritten
- Player 1's submitted outcome rejects: "No pending dice roll"
### Similar Risk in Other Handlers
**All decision submission handlers (lines 271-409)**:
```python
# submit_defensive_decision
async def submit_defensive_decision(sid, data):
# ...
updated_state = await game_engine.submit_defensive_decision(game_id, decision)
# ...
await manager.broadcast_to_game(str(game_id), "defensive_decision_submitted", {...})
```
**Race Condition Risk**: Two simultaneous defensive decision submissions
- Both handlers enter concurrently
- Both call `game_engine.submit_defensive_decision()`
- Both may be processed if game_engine locks are insufficient
**Analysis of game_engine locking**:
```python
class GameEngine:
def _get_game_lock(self, game_id: UUID) -> asyncio.Lock:
if game_id not in self._game_locks:
self._game_locks[game_id] = asyncio.Lock()
return self._game_locks[game_id]
```
**Problem**: Locks are in game_engine, but WebSocket handlers don't use them!
- Handlers call state_manager directly without locks
- Game engine operations are locked, but handler-level mutations are not
- **Two-phase risk**: Handler mutation + Engine operation both need protection
---
## 5. RECONNECTION AND RECOVERY
### Game State Recovery Handler
```python
@sio.event
async def request_game_state(sid, data):
"""Client requests full game state (recovery after disconnect)"""
game_id_str = data.get("game_id")
game_id = UUID(game_id_str)
# Try memory first
state = state_manager.get_state(game_id)
# If not in memory, recover from database
if not state:
logger.info(f"Game {game_id} not in memory, recovering from database")
state = await state_manager.recover_game(game_id)
if state:
await manager.emit_to_user(sid, "game_state", state.model_dump(mode="json"))
else:
await manager.emit_to_user(sid, "error", {"message": f"Game not found"})
```
**Status**: ✅ Recovery mechanism exists
**However**:
### Issue #4 - No Explicit Reconnection Handling
- Socket.io supports client reconnection automatically, but no handler for it
- When client reconnects with same session:
- New socket connection establishes
- Old session ID may still be in `user_sessions`
- **Duplicate session risk**: Same user_id mapped to two sids
```python
async def connect(sid, environ, auth):
# ...
await manager.connect(sid, user_id) # <- Could overwrite old session
```
**Missing Code**: No check for existing sessions
```python
# MISSING:
if user_id in [uid for uid in self.user_sessions.values()]:
# Existing session detected - close old session?
# Or allow multiple sessions per user?
```
### Issue #5 - No Session Expiration
- `user_sessions` dictionary grows indefinitely
- If disconnect event doesn't fire (network failure), sessions accumulate
- After 1000 disconnected users: 1000 orphaned entries in memory
---
## 6. MEMORY LEAK POTENTIAL
### Issue #6 - Game Room Cleanup
**Current cleanup in disconnect**:
```python
for game_id, sids in self.game_rooms.items():
if sid in sids:
sids.remove(sid) # <- Leaves empty set in dict
```
**Problem**: Empty sets persist indefinitely
```python
# After all players leave game 123:
game_rooms = {
"123": set(), # <- Still in dict, wastes memory
"456": {sid1, sid2}
}
```
**Over time**: Hundreds of dead game rooms accumulate
**Recommendation**:
```python
sids.discard(sid)
if not sids: # If set is now empty
del self.game_rooms[game_id] # Remove the game room
```
### Issue #7 - Event Listener Cleanup
**Socket.io events are registered once at startup**:
```python
def register_handlers(sio: AsyncServer, manager: ConnectionManager) -> None:
@sio.event
async def connect(sid, environ, auth):
# Registered once, called many times
```
**Analysis**: ✅ No memory leak from event listeners (they're registered once as closures)
**However**: Check if state_manager has internal leaks
- `_pending_decisions: dict[tuple, asyncio.Future]` - could hold stale futures
- `_last_access: dict[UUID, pendulum.DateTime]` - could grow indefinitely
- `_lineups: dict[UUID, dict[int, TeamLineupState]]` - needs cleanup when game ends
---
## 7. GAME LIFECYCLE MANAGEMENT
### Issue #8 - Resource Cleanup Not Guaranteed
**Where games should cleanup**:
1. ✅ When game completes naturally (27 outs)
2. ❓ When game is manually ended
3. ❓ When user force-closes connection
4. ❓ After idle timeout
**Current cleanup in game_engine.py**:
```python
async def _cleanup_game_resources(game_id: UUID):
if game_id in self._rolls_this_inning:
del self._rolls_this_inning[game_id]
if game_id in self._game_locks:
del self._game_locks[game_id]
```
**Problem**: This cleanup is called only from game completion path
- No cleanup on unexpected disconnection
- No cleanup on timeout
- Game state remains in state_manager indefinitely
**Recommendation**:
```python
# When game ends (any reason):
await game_engine._cleanup_game_resources(game_id)
await state_manager.remove_game(game_id)
await manager.cleanup_game_room(game_id)
```
### Missing endpoint: Force game cleanup
**No REST endpoint or WebSocket event to cleanup a game**
- Useful for: admin actions, manual cleanup after crashes
- Should be admin-only
- Should emit notification to all players in room
---
## 8. AUTHORIZATION GAPS
### Issue #9 - Authorization TODOs
Found 11 security TODOs indicating missing authorization checks:
```python
# Line 90: join_game
# TODO: Verify user has access to game
# Line 200: roll_dice
# TODO: Verify user is participant in this game
# Line 293: submit_manual_outcome
# TODO: Verify user is active batter or authorized to submit
# Lines 558, 723, 880: Substitution handlers
# TODO: Verify user is authorized to make substitution for this team
```
**Risk**: Currently **ANY connected user** can:
- Roll dice for any game
- Submit outcomes
- Make substitutions
- View any lineup
- View any box score
**Should implement**:
```python
def _verify_game_access(sid: str, game_id: UUID) -> bool:
"""Check if user in this session has access to game"""
user_id = manager.user_sessions.get(sid)
# Check against game participant list in DB
def _verify_is_batter(sid: str, game_id: UUID) -> bool:
"""Check if user is the current batter"""
def _verify_is_pitcher(sid: str, game_id: UUID) -> bool:
"""Check if user is the current pitcher"""
```
---
## 9. POTENTIAL RACE CONDITIONS
### Race Condition Scenario #1: Concurrent Outcome Submissions
```
Timeline:
T1: Player1.submit_manual_outcome() - gets pending_manual_roll
T2: Player2.submit_manual_outcome() - gets SAME pending_manual_roll
T3: Player1 clears pending_manual_roll
T4: Player2 tries to clear it again - but it's already None
```
**Actual Problem**: Both players process the same roll!
- Play is resolved twice
- State is corrupted
- Both receive "outcome_accepted"
**Fix**: Use lock in handlers
```python
async def submit_manual_outcome(sid, data):
lock = game_engine._get_game_lock(game_id)
async with lock:
# Safe: only one handler can be here at a time
state = state_manager.get_state(game_id)
if not state.pending_manual_roll:
# Safely reject
```
### Race Condition Scenario #2: Substitution During Play
```
Timeline:
T1: substitute_player() - modifies lineup
T2: resolve_play() - reads lineup (sees old player)
T3: substitute_player() completes - broadcasts change
T4: resolve_play() completes - broadcasts different player in play result
```
**Result**: Confusion about which player was involved
### Race Condition Scenario #3: State Recovery During Active Play
```
Timeline:
T1: Player A submits outcome
T2: Client B reconnects
T3: Client B requests_game_state()
T4: state_manager.recover_game() loads from DB (sees old state)
T5: Client B receives outdated state while play is resolving
```
**Recovery timing issue**: Overlapping with active game operations
---
## 10. BROADCAST PATTERN RISKS
### Issue #10 - No Guarantees on Broadcast Delivery
```python
async def broadcast_to_game(self, game_id: str, event: str, data: dict) -> None:
await self.sio.emit(event, data, room=game_id)
logger.debug(f"Broadcast {event} to game {game_id}")
```
**Characteristics**:
- No acknowledgment mechanism
- No retry logic
- No delivery confirmation
- Fire-and-forget
**Risk**: Client doesn't receive "play_resolved" event
- State is updated on server
- Client never knows about it
- Client thinks outcome is still pending
- Timeout eventually kicks in
**Mitigation**: Socket.io provides:
- Client-side event acknowledgment (ack callbacks)
- Currently not used
- Should implement for critical events
---
## 11. HEARTBEAT AND KEEPALIVE
### Heartbeat Implementation
```python
@sio.event
async def heartbeat(sid):
"""Handle heartbeat ping"""
await sio.emit("heartbeat_ack", {}, room=sid)
```
**Status**: ✅ Basic heartbeat exists
**However**:
- No server-initiated heartbeat
- No timeout tracking
- No automatic cleanup of idle sessions
- Client must ping at regular intervals
**Production Issue**:
- Disconnected clients (network issue) may not send disconnect event
- Sessions persist indefinitely
- After 1000 idle connections: significant memory waste
**Recommendation**:
```python
# Server-side keepalive
async def maintain_heartbeats():
while True:
await asyncio.sleep(30) # Every 30 seconds
# Check for stale sessions without recent activity
now = pendulum.now('UTC')
for sid, user_id in list(manager.user_sessions.items()):
last_activity = manager.last_activity.get(sid)
if last_activity and (now - last_activity).seconds > 300:
# 5 minutes idle - disconnect
await sio.disconnect(sid)
```
---
## 12. DATABASE INTERACTION RISKS
### Issue #11 - Async DB Operations Without Proper Isolation
```python
@sio.event
async def submit_manual_outcome(sid, data):
# ...
result = await game_engine.resolve_manual_play(
game_id=game_id,
ab_roll=ab_roll,
outcome=outcome,
hit_location=submission.hit_location,
)
# Updated state might lag behind DB
updated_state = state_manager.get_state(game_id)
if updated_state:
await manager.broadcast_to_game(
str(game_id), "game_state_update", updated_state.model_dump(mode="json")
)
```
**Issue**: State broadcast happens synchronously, but DB write is async
- Broadcast might happen before DB write completes
- DB connection failure silently fails (logged only)
- Client has state but DB might not have it yet
**Better approach**:
```python
result = await game_engine.resolve_manual_play(...) # Includes DB write
# Only then broadcast
await manager.broadcast_to_game(...)
```
---
## 13. SOCKET.IO CONFIGURATION
### Current Configuration (main.py)
```python
sio = socketio.AsyncServer(
async_mode="asgi",
cors_allowed_origins=settings.cors_origins,
logger=True,
engineio_logger=False,
)
```
### Missing Configuration Options
```python
# Should consider adding:
sio = socketio.AsyncServer(
async_mode="asgi",
cors_allowed_origins=settings.cors_origins,
logger=True,
engineio_logger=False,
# Connection options
ping_timeout=60, # Disconnect after 60s no response to ping
ping_interval=25, # Send ping every 25 seconds
max_http_buffer_size=1e6, # 1MB max message size
# Event handling
async_handlers=True, # Already default with async
# Namespaces
# Could partition different game types into separate namespaces
)
```
**Without ping_timeout**: Zombie connections can persist indefinitely
---
## 14. TEST COVERAGE
### Current Tests
- **File**: `tests/unit/websocket/test_manual_outcome_handlers.py`
- **Status**: Unit tests with mocks for critical handlers
### Test Coverage Analysis
**Covered**:
- `roll_dice` success path
- `roll_dice` missing game_id
- `roll_dice` invalid game_id
- `submit_manual_outcome` (partial)
**Not Covered**:
- Concurrent submissions (race condition tests)
- Connection cleanup
- Game room cleanup after disconnect
- Authorization verification (because TODOs indicate not implemented)
- Reconnection scenarios
- State recovery from DB
- All 15 handlers not fully tested
**Recommendation**: Add integration tests for:
```python
@pytest.mark.asyncio
async def test_concurrent_outcome_submissions():
"""Two players submit outcomes simultaneously"""
# Create game
# Player 1 starts submit_manual_outcome
# Player 2 starts submit_manual_outcome
# Both should fail or one should win
@pytest.mark.asyncio
async def test_connection_cleanup_on_disconnect():
"""Game room should be cleaned up after all players disconnect"""
# Connect 3 players to game 123
# Disconnect all 3
# Assert game_rooms doesn't contain empty set for game 123
```
---
## 15. RECOMMENDATIONS
### Priority 1: Critical (Production Risk)
1. **Add Locking to Handler-Level State Mutations**
- Location: All handlers that modify state before calling game_engine
- Impact: Prevents race conditions in critical paths
- Effort: Medium
- Example: Wrap state mutation in game_engine locks
```python
async def submit_manual_outcome(sid, data):
lock = game_engine._get_game_lock(game_id)
async with lock:
# Safe state modifications
```
2. **Implement Authorization Checks**
- Location: All 11 TODO comments in handlers.py
- Impact: Prevents unauthorized game access
- Effort: Medium
- Add helper functions for permission verification
3. **Add Session Expiration**
- Location: connection_manager.py, heartbeat handler
- Impact: Prevent memory leak from zombie sessions
- Effort: Low
- Track last_activity per sid, cleanup stale sessions
### Priority 2: Important (Data Integrity)
4. **Clean Up Empty Game Rooms**
- Location: disconnect() method in connection_manager
- Impact: Prevent memory leak
- Effort: Low
- Delete empty sets from game_rooms dict
5. **Implement Game Cleanup Lifecycle**
- Location: game_engine.py
- Impact: Proper resource cleanup
- Effort: High
- Add cleanup calls to state_manager on game completion
6. **Add Reconnection Handling**
- Location: connect handler
- Impact: Handle clients reconnecting with same user
- Effort: Medium
- Detect duplicate sessions, manage accordingly
### Priority 3: Recommended (Reliability)
7. **Implement Broadcast Acknowledgments**
- Location: broadcast_to_game calls
- Impact: Ensure critical events are delivered
- Effort: Medium
- Use Socket.io ack callbacks for game_state_update
8. **Add Server-Initiated Keepalive**
- Location: Lifespan context manager or separate task
- Impact: Detect disconnected clients promptly
- Effort: Medium
- Periodic server-side health check
9. **Configure Socket.io Ping Timeouts**
- Location: main.py socket initialization
- Impact: Automatic cleanup of dead connections
- Effort: Low
- Set ping_timeout=60, ping_interval=25
10. **Add Comprehensive Integration Tests**
- Location: tests/integration/websocket/
- Impact: Prevent regressions
- Effort: High
- Test concurrent scenarios, recovery, cleanup
### Priority 4: Enhancement (Performance)
11. **Implement Namespace Separation**
- Partition different game instances into separate namespaces
- Reduces broadcast overhead
- Effort: High
12. **Add Event Compression**
- For large game_state_update broadcasts
- Reduce network overhead
- Effort: Medium
---
## 16. SUMMARY TABLE
| Issue | Severity | Category | Impact | Fix Effort |
|-------|----------|----------|--------|-----------|
| State mutation race conditions | CRITICAL | Concurrency | Data corruption | Medium |
| Missing authorization | CRITICAL | Security | Unauthorized access | Medium |
| Session expiration | HIGH | Memory | Unbounded memory growth | Low |
| Empty game room cleanup | HIGH | Memory | Memory leak | Low |
| Game resource cleanup | HIGH | Lifecycle | Orphaned resources | High |
| Reconnection handling | HIGH | Reliability | Duplicate sessions | Medium |
| Broadcast delivery guarantees | MEDIUM | Reliability | Missed updates | Medium |
| Integration test coverage | MEDIUM | Testing | Undetected bugs | High |
| Socket.io config optimization | MEDIUM | Reliability | Long disconnection detection | Low |
| DB operation isolation | LOW | Data Integrity | Minor timing issues | Low |
---
## 17. IMPLEMENTATION CHECKLIST
### Phase 1: Critical Fixes (Week 1)
- [ ] Add locking to handler state mutations
- [ ] Implement authorization middleware
- [ ] Add session expiration logic
- [ ] Clean up empty game rooms
### Phase 2: Important Fixes (Week 2-3)
- [ ] Implement game cleanup lifecycle
- [ ] Add reconnection handling
- [ ] Add Socket.io ping configuration
- [ ] Implement broadcast acknowledgments
### Phase 3: Testing & Validation (Week 4)
- [ ] Add concurrency tests
- [ ] Add recovery tests
- [ ] Add cleanup verification tests
- [ ] Load testing with 100+ concurrent connections
---
**Report Generated**: 2025-11-27
**Analysis Tool**: Claude Code (File Search Specialist)
**Scope**: 2,643 lines analyzed across 3 core files + CLAUDE.md documentation