strat-gameplay-webapp/WEBSOCKET_ARCHITECTURE_ANALYSIS.md
Cal Corum e0c12467b0 CLAUDE: Improve UX with single-click OAuth, enhanced games list, and layout fix
Frontend UX improvements:
- Single-click Discord OAuth from home page (no intermediate /auth page)
- Auto-redirect authenticated users from home to /games
- Fixed Nuxt layout system - app.vue now wraps NuxtPage with NuxtLayout
- Games page now has proper card container with shadow/border styling
- Layout header includes working logout with API cookie clearing

Games list enhancements:
- Display team names (lname) instead of just team IDs
- Show current score for each team
- Show inning indicator (Top/Bot X) for active games
- Responsive header with wrapped buttons on mobile

Backend improvements:
- Added team caching to SbaApiClient (1-hour TTL)
- Enhanced GameListItem with team names, scores, inning data
- Games endpoint now enriches response with SBA API team data

Docker optimizations:
- Optimized Dockerfile using --chown flag on COPY (faster than chown -R)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-12-05 16:14:00 -06:00

24 KiB

WebSocket Implementation Analysis Report

Paper Dynasty Real-Time Game Engine Backend

Date: 2025-11-27
Scope: /mnt/NV2/Development/strat-gameplay-webapp/backend/app/websocket/
Status: Comprehensive exploration of production-ready implementation


EXECUTIVE SUMMARY

The WebSocket implementation is well-architected and production-ready with 15 event handlers, proper error handling, and recovery mechanisms. However, there are several identified risk areas related to state synchronization, memory management, and race conditions that require attention before handling high-concurrency scenarios.

Key Findings:

  • Solid Foundation: Clean separation of concerns, proper async/await patterns
  • Good Error Handling: 157+ error handling statements across handlers
  • ⚠️ State Mutation Risk: Direct state object mutations without proper locking
  • ⚠️ Memory Leak Potential: Game resource cleanup not guaranteed on all code paths
  • ⚠️ Race Condition Risk: Multiple concurrent events could corrupt state
  • ⚠️ Reconnection Gap: No explicit reconnection handling or session recovery strategy
  • ⚠️ Room Cleanup: Stale game rooms may persist indefinitely

1. ARCHITECTURE OVERVIEW

Directory Structure

app/websocket/
├── connection_manager.py (71 lines)   # Room & session management
├── handlers.py (1,312 lines)          # 15 event handlers
├── __init__.py                         # Package marker
└── CLAUDE.md                           # Documentation

Socket.io Initialization (main.py)

sio = socketio.AsyncServer(
    async_mode="asgi",
    cors_allowed_origins=settings.cors_origins,
    logger=True,
    engineio_logger=False
)
socket_app = socketio.ASGIApp(sio, app)

connection_manager = ConnectionManager(sio)
register_handlers(sio, connection_manager)
game_engine.set_connection_manager(connection_manager)

Critical Point: Connection manager is a singleton instance passed to all handlers via closure.


2. CONNECTION MANAGEMENT

ConnectionManager Structure

class ConnectionManager:
    def __init__(self, sio: socketio.AsyncServer):
        self.sio = sio
        self.user_sessions: dict[str, str] = {}  # sid -> user_id
        self.game_rooms: dict[str, set[str]] = {}  # game_id -> set of sids

Connection Lifecycle

1. Connect Handler

@sio.event
async def connect(sid, environ, auth):
    # Dual auth support:
    # 1. HttpOnly cookies (from nginx proxy)
    # 2. Auth object (direct JS clients)
    token = extract_token(environ, auth)
    user_data = verify_token(token)
    user_id = user_data.get("user_id")
    
    await manager.connect(sid, user_id)
    await sio.emit("connected", {"user_id": user_id}, room=sid)
    return True  # Accept connection

Status: Properly authenticated with fallback support

2. Disconnect Handler

@sio.event
async def disconnect(sid):
    await manager.disconnect(sid)

async def disconnect(self, sid: str) -> None:
    user_id = self.user_sessions.pop(sid, None)
    if user_id:
        logger.info(f"User {user_id} disconnected (session {sid})")
        
        # Remove from all game rooms
        for game_id, sids in self.game_rooms.items():
            if sid in sids:
                sids.remove(sid)
                await self.broadcast_to_game(
                    game_id, "user_disconnected", {"user_id": user_id}
                )

Status: Removes user from all rooms

Room Management

Join Game

async def join_game(self, sid: str, game_id: str, role: str) -> None:
    await self.sio.enter_room(sid, game_id)
    
    if game_id not in self.game_rooms:
        self.game_rooms[game_id] = set()
    self.game_rooms[game_id].add(sid)

Issue #1: Redundant tracking - Socket.io already maintains room membership

  • self.sio.enter_room() registers the sid in Socket.io's internal room
  • self.game_rooms[game_id] duplicates this information
  • Risk: Synchronization could diverge if Socket.io rooms are accessed directly

Leave Game

async def leave_game(self, sid: str, game_id: str) -> None:
    await self.sio.leave_room(sid, game_id)
    
    if game_id in self.game_rooms:
        self.game_rooms[game_id].discard(sid)

Issue #2: Manual set cleanup but no cleanup of empty game rooms

  • After last user leaves, self.game_rooms[game_id] remains as empty set
  • Memory Impact: Accumulates empty sets over time
  • Recommendation: Remove empty game rooms after discard

3. EVENT HANDLERS (15 Total)

Handler Categories

Connection Events (2)

  • connect - JWT authentication
  • disconnect - Session cleanup

Game Flow Events (4)

  • join_game - Join game room
  • leave_game - Leave game room
  • request_game_state - Recovery after disconnect/initial load
  • heartbeat - Keep-alive ping

Decision Submission (2)

  • submit_defensive_decision - Defense strategy
  • submit_offensive_decision - Offense strategy

Manual Outcome Flow (2)

  • roll_dice - Roll dice for play
  • submit_manual_outcome - Submit card result

Substitutions (3)

  • request_pinch_hitter - Batting substitution
  • request_pitching_change - Pitcher substitution
  • request_defensive_replacement - Field substitution

Query Events (2)

  • get_lineup - Get team lineup
  • get_box_score - Get statistics

Error Handling Pattern

Consistent "Emit or Return" Pattern:

try:
    # Validate input
    if not game_id_str:
        await manager.emit_to_user(sid, "error", {"message": "Missing game_id"})
        return
    
    # Parse and validate
    try:
        game_id = UUID(game_id_str)
    except (ValueError, AttributeError):
        await manager.emit_to_user(sid, "error", {"message": "Invalid format"})
        return
    
    # Get state
    state = state_manager.get_state(game_id)
    if not state:
        await manager.emit_to_user(sid, "error", {"message": f"Game not found"})
        return
    
    # Process action
    result = await game_engine.some_operation(...)
    
    # Broadcast result
    await manager.broadcast_to_game(str(game_id), "event_name", data)
    
except GameValidationError as e:
    await manager.emit_to_user(sid, "outcome_rejected", {"message": str(e)})
except Exception as e:
    logger.error(f"Error: {e}", exc_info=True)
    await manager.emit_to_user(sid, "error", {"message": str(e)})

Analysis: 157+ error handling statements providing comprehensive coverage


4. STATE SYNCHRONIZATION PATTERNS

State Flow Diagram

Handler gets state → Mutates state object → Updates state manager → Broadcasts

Critical State Mutation: Dice Roll Example

# Line 216-217 in handlers.py (roll_dice)
state.pending_manual_roll = ab_roll
state_manager.update_state(game_id, state)

Issue #3 - RACE CONDITION RISK: State mutation without locking

  • Multiple concurrent roll_dice calls could overwrite pending_manual_roll
  • state_manager.update_state() does NOT use locks
  • Scenario: Player 1 rolls, Player 2 rolls before Player 1 submits outcome
    • pending_manual_roll is overwritten
    • Player 1's submitted outcome rejects: "No pending dice roll"

Similar Risk in Other Handlers

All decision submission handlers (lines 271-409):

# submit_defensive_decision
async def submit_defensive_decision(sid, data):
    # ...
    updated_state = await game_engine.submit_defensive_decision(game_id, decision)
    # ...
    await manager.broadcast_to_game(str(game_id), "defensive_decision_submitted", {...})

Race Condition Risk: Two simultaneous defensive decision submissions

  • Both handlers enter concurrently
  • Both call game_engine.submit_defensive_decision()
  • Both may be processed if game_engine locks are insufficient

Analysis of game_engine locking:

class GameEngine:
    def _get_game_lock(self, game_id: UUID) -> asyncio.Lock:
        if game_id not in self._game_locks:
            self._game_locks[game_id] = asyncio.Lock()
        return self._game_locks[game_id]

Problem: Locks are in game_engine, but WebSocket handlers don't use them!

  • Handlers call state_manager directly without locks
  • Game engine operations are locked, but handler-level mutations are not
  • Two-phase risk: Handler mutation + Engine operation both need protection

5. RECONNECTION AND RECOVERY

Game State Recovery Handler

@sio.event
async def request_game_state(sid, data):
    """Client requests full game state (recovery after disconnect)"""
    game_id_str = data.get("game_id")
    game_id = UUID(game_id_str)
    
    # Try memory first
    state = state_manager.get_state(game_id)
    
    # If not in memory, recover from database
    if not state:
        logger.info(f"Game {game_id} not in memory, recovering from database")
        state = await state_manager.recover_game(game_id)
    
    if state:
        await manager.emit_to_user(sid, "game_state", state.model_dump(mode="json"))
    else:
        await manager.emit_to_user(sid, "error", {"message": f"Game not found"})

Status: Recovery mechanism exists

However:

Issue #4 - No Explicit Reconnection Handling

  • Socket.io supports client reconnection automatically, but no handler for it
  • When client reconnects with same session:
    • New socket connection establishes
    • Old session ID may still be in user_sessions
    • Duplicate session risk: Same user_id mapped to two sids
async def connect(sid, environ, auth):
    # ...
    await manager.connect(sid, user_id)  # <- Could overwrite old session

Missing Code: No check for existing sessions

# MISSING: 
if user_id in [uid for uid in self.user_sessions.values()]:
    # Existing session detected - close old session?
    # Or allow multiple sessions per user?

Issue #5 - No Session Expiration

  • user_sessions dictionary grows indefinitely
  • If disconnect event doesn't fire (network failure), sessions accumulate
  • After 1000 disconnected users: 1000 orphaned entries in memory

6. MEMORY LEAK POTENTIAL

Issue #6 - Game Room Cleanup

Current cleanup in disconnect:

for game_id, sids in self.game_rooms.items():
    if sid in sids:
        sids.remove(sid)  # <- Leaves empty set in dict

Problem: Empty sets persist indefinitely

# After all players leave game 123:
game_rooms = {
    "123": set(),      # <- Still in dict, wastes memory
    "456": {sid1, sid2}
}

Over time: Hundreds of dead game rooms accumulate

Recommendation:

sids.discard(sid)
if not sids:  # If set is now empty
    del self.game_rooms[game_id]  # Remove the game room

Issue #7 - Event Listener Cleanup

Socket.io events are registered once at startup:

def register_handlers(sio: AsyncServer, manager: ConnectionManager) -> None:
    @sio.event
    async def connect(sid, environ, auth):
        # Registered once, called many times

Analysis: No memory leak from event listeners (they're registered once as closures)

However: Check if state_manager has internal leaks

  • _pending_decisions: dict[tuple, asyncio.Future] - could hold stale futures
  • _last_access: dict[UUID, pendulum.DateTime] - could grow indefinitely
  • _lineups: dict[UUID, dict[int, TeamLineupState]] - needs cleanup when game ends

7. GAME LIFECYCLE MANAGEMENT

Issue #8 - Resource Cleanup Not Guaranteed

Where games should cleanup:

  1. When game completes naturally (27 outs)
  2. When game is manually ended
  3. When user force-closes connection
  4. After idle timeout

Current cleanup in game_engine.py:

async def _cleanup_game_resources(game_id: UUID):
    if game_id in self._rolls_this_inning:
        del self._rolls_this_inning[game_id]
    if game_id in self._game_locks:
        del self._game_locks[game_id]

Problem: This cleanup is called only from game completion path

  • No cleanup on unexpected disconnection
  • No cleanup on timeout
  • Game state remains in state_manager indefinitely

Recommendation:

# When game ends (any reason):
await game_engine._cleanup_game_resources(game_id)
await state_manager.remove_game(game_id)
await manager.cleanup_game_room(game_id)

Missing endpoint: Force game cleanup

No REST endpoint or WebSocket event to cleanup a game

  • Useful for: admin actions, manual cleanup after crashes
  • Should be admin-only
  • Should emit notification to all players in room

8. AUTHORIZATION GAPS

Issue #9 - Authorization TODOs

Found 11 security TODOs indicating missing authorization checks:

# Line 90: join_game
# TODO: Verify user has access to game

# Line 200: roll_dice
# TODO: Verify user is participant in this game

# Line 293: submit_manual_outcome
# TODO: Verify user is active batter or authorized to submit

# Lines 558, 723, 880: Substitution handlers
# TODO: Verify user is authorized to make substitution for this team

Risk: Currently ANY connected user can:

  • Roll dice for any game
  • Submit outcomes
  • Make substitutions
  • View any lineup
  • View any box score

Should implement:

def _verify_game_access(sid: str, game_id: UUID) -> bool:
    """Check if user in this session has access to game"""
    user_id = manager.user_sessions.get(sid)
    # Check against game participant list in DB
    
def _verify_is_batter(sid: str, game_id: UUID) -> bool:
    """Check if user is the current batter"""
    
def _verify_is_pitcher(sid: str, game_id: UUID) -> bool:
    """Check if user is the current pitcher"""

9. POTENTIAL RACE CONDITIONS

Race Condition Scenario #1: Concurrent Outcome Submissions

Timeline:
T1: Player1.submit_manual_outcome() - gets pending_manual_roll
T2: Player2.submit_manual_outcome() - gets SAME pending_manual_roll
T3: Player1 clears pending_manual_roll
T4: Player2 tries to clear it again - but it's already None

Actual Problem: Both players process the same roll!

  • Play is resolved twice
  • State is corrupted
  • Both receive "outcome_accepted"

Fix: Use lock in handlers

async def submit_manual_outcome(sid, data):
    lock = game_engine._get_game_lock(game_id)
    async with lock:
        # Safe: only one handler can be here at a time
        state = state_manager.get_state(game_id)
        if not state.pending_manual_roll:
            # Safely reject

Race Condition Scenario #2: Substitution During Play

Timeline:
T1: substitute_player() - modifies lineup
T2: resolve_play() - reads lineup (sees old player)
T3: substitute_player() completes - broadcasts change
T4: resolve_play() completes - broadcasts different player in play result

Result: Confusion about which player was involved

Race Condition Scenario #3: State Recovery During Active Play

Timeline:
T1: Player A submits outcome
T2: Client B reconnects
T3: Client B requests_game_state()
T4: state_manager.recover_game() loads from DB (sees old state)
T5: Client B receives outdated state while play is resolving

Recovery timing issue: Overlapping with active game operations


10. BROADCAST PATTERN RISKS

Issue #10 - No Guarantees on Broadcast Delivery

async def broadcast_to_game(self, game_id: str, event: str, data: dict) -> None:
    await self.sio.emit(event, data, room=game_id)
    logger.debug(f"Broadcast {event} to game {game_id}")

Characteristics:

  • No acknowledgment mechanism
  • No retry logic
  • No delivery confirmation
  • Fire-and-forget

Risk: Client doesn't receive "play_resolved" event

  • State is updated on server
  • Client never knows about it
  • Client thinks outcome is still pending
  • Timeout eventually kicks in

Mitigation: Socket.io provides:

  • Client-side event acknowledgment (ack callbacks)
  • Currently not used
  • Should implement for critical events

11. HEARTBEAT AND KEEPALIVE

Heartbeat Implementation

@sio.event
async def heartbeat(sid):
    """Handle heartbeat ping"""
    await sio.emit("heartbeat_ack", {}, room=sid)

Status: Basic heartbeat exists

However:

  • No server-initiated heartbeat
  • No timeout tracking
  • No automatic cleanup of idle sessions
  • Client must ping at regular intervals

Production Issue:

  • Disconnected clients (network issue) may not send disconnect event
  • Sessions persist indefinitely
  • After 1000 idle connections: significant memory waste

Recommendation:

# Server-side keepalive
async def maintain_heartbeats():
    while True:
        await asyncio.sleep(30)  # Every 30 seconds
        # Check for stale sessions without recent activity
        now = pendulum.now('UTC')
        for sid, user_id in list(manager.user_sessions.items()):
            last_activity = manager.last_activity.get(sid)
            if last_activity and (now - last_activity).seconds > 300:
                # 5 minutes idle - disconnect
                await sio.disconnect(sid)

12. DATABASE INTERACTION RISKS

Issue #11 - Async DB Operations Without Proper Isolation

@sio.event
async def submit_manual_outcome(sid, data):
    # ...
    result = await game_engine.resolve_manual_play(
        game_id=game_id,
        ab_roll=ab_roll,
        outcome=outcome,
        hit_location=submission.hit_location,
    )
    
    # Updated state might lag behind DB
    updated_state = state_manager.get_state(game_id)
    if updated_state:
        await manager.broadcast_to_game(
            str(game_id), "game_state_update", updated_state.model_dump(mode="json")
        )

Issue: State broadcast happens synchronously, but DB write is async

  • Broadcast might happen before DB write completes
  • DB connection failure silently fails (logged only)
  • Client has state but DB might not have it yet

Better approach:

result = await game_engine.resolve_manual_play(...)  # Includes DB write
# Only then broadcast
await manager.broadcast_to_game(...)

13. SOCKET.IO CONFIGURATION

Current Configuration (main.py)

sio = socketio.AsyncServer(
    async_mode="asgi",
    cors_allowed_origins=settings.cors_origins,
    logger=True,
    engineio_logger=False,
)

Missing Configuration Options

# Should consider adding:
sio = socketio.AsyncServer(
    async_mode="asgi",
    cors_allowed_origins=settings.cors_origins,
    logger=True,
    engineio_logger=False,
    
    # Connection options
    ping_timeout=60,              # Disconnect after 60s no response to ping
    ping_interval=25,             # Send ping every 25 seconds
    max_http_buffer_size=1e6,     # 1MB max message size
    
    # Event handling
    async_handlers=True,          # Already default with async
    
    # Namespaces
    # Could partition different game types into separate namespaces
)

Without ping_timeout: Zombie connections can persist indefinitely


14. TEST COVERAGE

Current Tests

  • File: tests/unit/websocket/test_manual_outcome_handlers.py
  • Status: Unit tests with mocks for critical handlers

Test Coverage Analysis

Covered:

  • roll_dice success path
  • roll_dice missing game_id
  • roll_dice invalid game_id
  • submit_manual_outcome (partial)

Not Covered:

  • Concurrent submissions (race condition tests)
  • Connection cleanup
  • Game room cleanup after disconnect
  • Authorization verification (because TODOs indicate not implemented)
  • Reconnection scenarios
  • State recovery from DB
  • All 15 handlers not fully tested

Recommendation: Add integration tests for:

@pytest.mark.asyncio
async def test_concurrent_outcome_submissions():
    """Two players submit outcomes simultaneously"""
    # Create game
    # Player 1 starts submit_manual_outcome
    # Player 2 starts submit_manual_outcome
    # Both should fail or one should win
    
@pytest.mark.asyncio
async def test_connection_cleanup_on_disconnect():
    """Game room should be cleaned up after all players disconnect"""
    # Connect 3 players to game 123
    # Disconnect all 3
    # Assert game_rooms doesn't contain empty set for game 123

15. RECOMMENDATIONS

Priority 1: Critical (Production Risk)

  1. Add Locking to Handler-Level State Mutations

    • Location: All handlers that modify state before calling game_engine
    • Impact: Prevents race conditions in critical paths
    • Effort: Medium
    • Example: Wrap state mutation in game_engine locks
    async def submit_manual_outcome(sid, data):
        lock = game_engine._get_game_lock(game_id)
        async with lock:
            # Safe state modifications
    
  2. Implement Authorization Checks

    • Location: All 11 TODO comments in handlers.py
    • Impact: Prevents unauthorized game access
    • Effort: Medium
    • Add helper functions for permission verification
  3. Add Session Expiration

    • Location: connection_manager.py, heartbeat handler
    • Impact: Prevent memory leak from zombie sessions
    • Effort: Low
    • Track last_activity per sid, cleanup stale sessions

Priority 2: Important (Data Integrity)

  1. Clean Up Empty Game Rooms

    • Location: disconnect() method in connection_manager
    • Impact: Prevent memory leak
    • Effort: Low
    • Delete empty sets from game_rooms dict
  2. Implement Game Cleanup Lifecycle

    • Location: game_engine.py
    • Impact: Proper resource cleanup
    • Effort: High
    • Add cleanup calls to state_manager on game completion
  3. Add Reconnection Handling

    • Location: connect handler
    • Impact: Handle clients reconnecting with same user
    • Effort: Medium
    • Detect duplicate sessions, manage accordingly
  1. Implement Broadcast Acknowledgments

    • Location: broadcast_to_game calls
    • Impact: Ensure critical events are delivered
    • Effort: Medium
    • Use Socket.io ack callbacks for game_state_update
  2. Add Server-Initiated Keepalive

    • Location: Lifespan context manager or separate task
    • Impact: Detect disconnected clients promptly
    • Effort: Medium
    • Periodic server-side health check
  3. Configure Socket.io Ping Timeouts

    • Location: main.py socket initialization
    • Impact: Automatic cleanup of dead connections
    • Effort: Low
    • Set ping_timeout=60, ping_interval=25
  4. Add Comprehensive Integration Tests

    • Location: tests/integration/websocket/
    • Impact: Prevent regressions
    • Effort: High
    • Test concurrent scenarios, recovery, cleanup

Priority 4: Enhancement (Performance)

  1. Implement Namespace Separation

    • Partition different game instances into separate namespaces
    • Reduces broadcast overhead
    • Effort: High
  2. Add Event Compression

    • For large game_state_update broadcasts
    • Reduce network overhead
    • Effort: Medium

16. SUMMARY TABLE

Issue Severity Category Impact Fix Effort
State mutation race conditions CRITICAL Concurrency Data corruption Medium
Missing authorization CRITICAL Security Unauthorized access Medium
Session expiration HIGH Memory Unbounded memory growth Low
Empty game room cleanup HIGH Memory Memory leak Low
Game resource cleanup HIGH Lifecycle Orphaned resources High
Reconnection handling HIGH Reliability Duplicate sessions Medium
Broadcast delivery guarantees MEDIUM Reliability Missed updates Medium
Integration test coverage MEDIUM Testing Undetected bugs High
Socket.io config optimization MEDIUM Reliability Long disconnection detection Low
DB operation isolation LOW Data Integrity Minor timing issues Low

17. IMPLEMENTATION CHECKLIST

Phase 1: Critical Fixes (Week 1)

  • Add locking to handler state mutations
  • Implement authorization middleware
  • Add session expiration logic
  • Clean up empty game rooms

Phase 2: Important Fixes (Week 2-3)

  • Implement game cleanup lifecycle
  • Add reconnection handling
  • Add Socket.io ping configuration
  • Implement broadcast acknowledgments

Phase 3: Testing & Validation (Week 4)

  • Add concurrency tests
  • Add recovery tests
  • Add cleanup verification tests
  • Load testing with 100+ concurrent connections

Report Generated: 2025-11-27
Analysis Tool: Claude Code (File Search Specialist)
Scope: 2,643 lines analyzed across 3 core files + CLAUDE.md documentation