Frontend UX improvements: - Single-click Discord OAuth from home page (no intermediate /auth page) - Auto-redirect authenticated users from home to /games - Fixed Nuxt layout system - app.vue now wraps NuxtPage with NuxtLayout - Games page now has proper card container with shadow/border styling - Layout header includes working logout with API cookie clearing Games list enhancements: - Display team names (lname) instead of just team IDs - Show current score for each team - Show inning indicator (Top/Bot X) for active games - Responsive header with wrapped buttons on mobile Backend improvements: - Added team caching to SbaApiClient (1-hour TTL) - Enhanced GameListItem with team names, scores, inning data - Games endpoint now enriches response with SBA API team data Docker optimizations: - Optimized Dockerfile using --chown flag on COPY (faster than chown -R) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
24 KiB
WebSocket Implementation Analysis Report
Paper Dynasty Real-Time Game Engine Backend
Date: 2025-11-27
Scope: /mnt/NV2/Development/strat-gameplay-webapp/backend/app/websocket/
Status: Comprehensive exploration of production-ready implementation
EXECUTIVE SUMMARY
The WebSocket implementation is well-architected and production-ready with 15 event handlers, proper error handling, and recovery mechanisms. However, there are several identified risk areas related to state synchronization, memory management, and race conditions that require attention before handling high-concurrency scenarios.
Key Findings:
- ✅ Solid Foundation: Clean separation of concerns, proper async/await patterns
- ✅ Good Error Handling: 157+ error handling statements across handlers
- ⚠️ State Mutation Risk: Direct state object mutations without proper locking
- ⚠️ Memory Leak Potential: Game resource cleanup not guaranteed on all code paths
- ⚠️ Race Condition Risk: Multiple concurrent events could corrupt state
- ⚠️ Reconnection Gap: No explicit reconnection handling or session recovery strategy
- ⚠️ Room Cleanup: Stale game rooms may persist indefinitely
1. ARCHITECTURE OVERVIEW
Directory Structure
app/websocket/
├── connection_manager.py (71 lines) # Room & session management
├── handlers.py (1,312 lines) # 15 event handlers
├── __init__.py # Package marker
└── CLAUDE.md # Documentation
Socket.io Initialization (main.py)
sio = socketio.AsyncServer(
async_mode="asgi",
cors_allowed_origins=settings.cors_origins,
logger=True,
engineio_logger=False
)
socket_app = socketio.ASGIApp(sio, app)
connection_manager = ConnectionManager(sio)
register_handlers(sio, connection_manager)
game_engine.set_connection_manager(connection_manager)
Critical Point: Connection manager is a singleton instance passed to all handlers via closure.
2. CONNECTION MANAGEMENT
ConnectionManager Structure
class ConnectionManager:
def __init__(self, sio: socketio.AsyncServer):
self.sio = sio
self.user_sessions: dict[str, str] = {} # sid -> user_id
self.game_rooms: dict[str, set[str]] = {} # game_id -> set of sids
Connection Lifecycle
1. Connect Handler
@sio.event
async def connect(sid, environ, auth):
# Dual auth support:
# 1. HttpOnly cookies (from nginx proxy)
# 2. Auth object (direct JS clients)
token = extract_token(environ, auth)
user_data = verify_token(token)
user_id = user_data.get("user_id")
await manager.connect(sid, user_id)
await sio.emit("connected", {"user_id": user_id}, room=sid)
return True # Accept connection
Status: ✅ Properly authenticated with fallback support
2. Disconnect Handler
@sio.event
async def disconnect(sid):
await manager.disconnect(sid)
async def disconnect(self, sid: str) -> None:
user_id = self.user_sessions.pop(sid, None)
if user_id:
logger.info(f"User {user_id} disconnected (session {sid})")
# Remove from all game rooms
for game_id, sids in self.game_rooms.items():
if sid in sids:
sids.remove(sid)
await self.broadcast_to_game(
game_id, "user_disconnected", {"user_id": user_id}
)
Status: ✅ Removes user from all rooms
Room Management
Join Game
async def join_game(self, sid: str, game_id: str, role: str) -> None:
await self.sio.enter_room(sid, game_id)
if game_id not in self.game_rooms:
self.game_rooms[game_id] = set()
self.game_rooms[game_id].add(sid)
Issue #1: Redundant tracking - Socket.io already maintains room membership
self.sio.enter_room()registers the sid in Socket.io's internal roomself.game_rooms[game_id]duplicates this information- Risk: Synchronization could diverge if Socket.io rooms are accessed directly
Leave Game
async def leave_game(self, sid: str, game_id: str) -> None:
await self.sio.leave_room(sid, game_id)
if game_id in self.game_rooms:
self.game_rooms[game_id].discard(sid)
Issue #2: Manual set cleanup but no cleanup of empty game rooms
- After last user leaves,
self.game_rooms[game_id]remains as empty set - Memory Impact: Accumulates empty sets over time
- Recommendation: Remove empty game rooms after discard
3. EVENT HANDLERS (15 Total)
Handler Categories
Connection Events (2)
connect- JWT authenticationdisconnect- Session cleanup
Game Flow Events (4)
join_game- Join game roomleave_game- Leave game roomrequest_game_state- Recovery after disconnect/initial loadheartbeat- Keep-alive ping
Decision Submission (2)
submit_defensive_decision- Defense strategysubmit_offensive_decision- Offense strategy
Manual Outcome Flow (2)
roll_dice- Roll dice for playsubmit_manual_outcome- Submit card result
Substitutions (3)
request_pinch_hitter- Batting substitutionrequest_pitching_change- Pitcher substitutionrequest_defensive_replacement- Field substitution
Query Events (2)
get_lineup- Get team lineupget_box_score- Get statistics
Error Handling Pattern
Consistent "Emit or Return" Pattern:
try:
# Validate input
if not game_id_str:
await manager.emit_to_user(sid, "error", {"message": "Missing game_id"})
return
# Parse and validate
try:
game_id = UUID(game_id_str)
except (ValueError, AttributeError):
await manager.emit_to_user(sid, "error", {"message": "Invalid format"})
return
# Get state
state = state_manager.get_state(game_id)
if not state:
await manager.emit_to_user(sid, "error", {"message": f"Game not found"})
return
# Process action
result = await game_engine.some_operation(...)
# Broadcast result
await manager.broadcast_to_game(str(game_id), "event_name", data)
except GameValidationError as e:
await manager.emit_to_user(sid, "outcome_rejected", {"message": str(e)})
except Exception as e:
logger.error(f"Error: {e}", exc_info=True)
await manager.emit_to_user(sid, "error", {"message": str(e)})
Analysis: 157+ error handling statements providing comprehensive coverage
4. STATE SYNCHRONIZATION PATTERNS
State Flow Diagram
Handler gets state → Mutates state object → Updates state manager → Broadcasts
Critical State Mutation: Dice Roll Example
# Line 216-217 in handlers.py (roll_dice)
state.pending_manual_roll = ab_roll
state_manager.update_state(game_id, state)
Issue #3 - RACE CONDITION RISK: State mutation without locking
- Multiple concurrent
roll_dicecalls could overwritepending_manual_roll state_manager.update_state()does NOT use locks- Scenario: Player 1 rolls, Player 2 rolls before Player 1 submits outcome
pending_manual_rollis overwritten- Player 1's submitted outcome rejects: "No pending dice roll"
Similar Risk in Other Handlers
All decision submission handlers (lines 271-409):
# submit_defensive_decision
async def submit_defensive_decision(sid, data):
# ...
updated_state = await game_engine.submit_defensive_decision(game_id, decision)
# ...
await manager.broadcast_to_game(str(game_id), "defensive_decision_submitted", {...})
Race Condition Risk: Two simultaneous defensive decision submissions
- Both handlers enter concurrently
- Both call
game_engine.submit_defensive_decision() - Both may be processed if game_engine locks are insufficient
Analysis of game_engine locking:
class GameEngine:
def _get_game_lock(self, game_id: UUID) -> asyncio.Lock:
if game_id not in self._game_locks:
self._game_locks[game_id] = asyncio.Lock()
return self._game_locks[game_id]
Problem: Locks are in game_engine, but WebSocket handlers don't use them!
- Handlers call state_manager directly without locks
- Game engine operations are locked, but handler-level mutations are not
- Two-phase risk: Handler mutation + Engine operation both need protection
5. RECONNECTION AND RECOVERY
Game State Recovery Handler
@sio.event
async def request_game_state(sid, data):
"""Client requests full game state (recovery after disconnect)"""
game_id_str = data.get("game_id")
game_id = UUID(game_id_str)
# Try memory first
state = state_manager.get_state(game_id)
# If not in memory, recover from database
if not state:
logger.info(f"Game {game_id} not in memory, recovering from database")
state = await state_manager.recover_game(game_id)
if state:
await manager.emit_to_user(sid, "game_state", state.model_dump(mode="json"))
else:
await manager.emit_to_user(sid, "error", {"message": f"Game not found"})
Status: ✅ Recovery mechanism exists
However:
Issue #4 - No Explicit Reconnection Handling
- Socket.io supports client reconnection automatically, but no handler for it
- When client reconnects with same session:
- New socket connection establishes
- Old session ID may still be in
user_sessions - Duplicate session risk: Same user_id mapped to two sids
async def connect(sid, environ, auth):
# ...
await manager.connect(sid, user_id) # <- Could overwrite old session
Missing Code: No check for existing sessions
# MISSING:
if user_id in [uid for uid in self.user_sessions.values()]:
# Existing session detected - close old session?
# Or allow multiple sessions per user?
Issue #5 - No Session Expiration
user_sessionsdictionary grows indefinitely- If disconnect event doesn't fire (network failure), sessions accumulate
- After 1000 disconnected users: 1000 orphaned entries in memory
6. MEMORY LEAK POTENTIAL
Issue #6 - Game Room Cleanup
Current cleanup in disconnect:
for game_id, sids in self.game_rooms.items():
if sid in sids:
sids.remove(sid) # <- Leaves empty set in dict
Problem: Empty sets persist indefinitely
# After all players leave game 123:
game_rooms = {
"123": set(), # <- Still in dict, wastes memory
"456": {sid1, sid2}
}
Over time: Hundreds of dead game rooms accumulate
Recommendation:
sids.discard(sid)
if not sids: # If set is now empty
del self.game_rooms[game_id] # Remove the game room
Issue #7 - Event Listener Cleanup
Socket.io events are registered once at startup:
def register_handlers(sio: AsyncServer, manager: ConnectionManager) -> None:
@sio.event
async def connect(sid, environ, auth):
# Registered once, called many times
Analysis: ✅ No memory leak from event listeners (they're registered once as closures)
However: Check if state_manager has internal leaks
_pending_decisions: dict[tuple, asyncio.Future]- could hold stale futures_last_access: dict[UUID, pendulum.DateTime]- could grow indefinitely_lineups: dict[UUID, dict[int, TeamLineupState]]- needs cleanup when game ends
7. GAME LIFECYCLE MANAGEMENT
Issue #8 - Resource Cleanup Not Guaranteed
Where games should cleanup:
- ✅ When game completes naturally (27 outs)
- ❓ When game is manually ended
- ❓ When user force-closes connection
- ❓ After idle timeout
Current cleanup in game_engine.py:
async def _cleanup_game_resources(game_id: UUID):
if game_id in self._rolls_this_inning:
del self._rolls_this_inning[game_id]
if game_id in self._game_locks:
del self._game_locks[game_id]
Problem: This cleanup is called only from game completion path
- No cleanup on unexpected disconnection
- No cleanup on timeout
- Game state remains in state_manager indefinitely
Recommendation:
# When game ends (any reason):
await game_engine._cleanup_game_resources(game_id)
await state_manager.remove_game(game_id)
await manager.cleanup_game_room(game_id)
Missing endpoint: Force game cleanup
No REST endpoint or WebSocket event to cleanup a game
- Useful for: admin actions, manual cleanup after crashes
- Should be admin-only
- Should emit notification to all players in room
8. AUTHORIZATION GAPS
Issue #9 - Authorization TODOs
Found 11 security TODOs indicating missing authorization checks:
# Line 90: join_game
# TODO: Verify user has access to game
# Line 200: roll_dice
# TODO: Verify user is participant in this game
# Line 293: submit_manual_outcome
# TODO: Verify user is active batter or authorized to submit
# Lines 558, 723, 880: Substitution handlers
# TODO: Verify user is authorized to make substitution for this team
Risk: Currently ANY connected user can:
- Roll dice for any game
- Submit outcomes
- Make substitutions
- View any lineup
- View any box score
Should implement:
def _verify_game_access(sid: str, game_id: UUID) -> bool:
"""Check if user in this session has access to game"""
user_id = manager.user_sessions.get(sid)
# Check against game participant list in DB
def _verify_is_batter(sid: str, game_id: UUID) -> bool:
"""Check if user is the current batter"""
def _verify_is_pitcher(sid: str, game_id: UUID) -> bool:
"""Check if user is the current pitcher"""
9. POTENTIAL RACE CONDITIONS
Race Condition Scenario #1: Concurrent Outcome Submissions
Timeline:
T1: Player1.submit_manual_outcome() - gets pending_manual_roll
T2: Player2.submit_manual_outcome() - gets SAME pending_manual_roll
T3: Player1 clears pending_manual_roll
T4: Player2 tries to clear it again - but it's already None
Actual Problem: Both players process the same roll!
- Play is resolved twice
- State is corrupted
- Both receive "outcome_accepted"
Fix: Use lock in handlers
async def submit_manual_outcome(sid, data):
lock = game_engine._get_game_lock(game_id)
async with lock:
# Safe: only one handler can be here at a time
state = state_manager.get_state(game_id)
if not state.pending_manual_roll:
# Safely reject
Race Condition Scenario #2: Substitution During Play
Timeline:
T1: substitute_player() - modifies lineup
T2: resolve_play() - reads lineup (sees old player)
T3: substitute_player() completes - broadcasts change
T4: resolve_play() completes - broadcasts different player in play result
Result: Confusion about which player was involved
Race Condition Scenario #3: State Recovery During Active Play
Timeline:
T1: Player A submits outcome
T2: Client B reconnects
T3: Client B requests_game_state()
T4: state_manager.recover_game() loads from DB (sees old state)
T5: Client B receives outdated state while play is resolving
Recovery timing issue: Overlapping with active game operations
10. BROADCAST PATTERN RISKS
Issue #10 - No Guarantees on Broadcast Delivery
async def broadcast_to_game(self, game_id: str, event: str, data: dict) -> None:
await self.sio.emit(event, data, room=game_id)
logger.debug(f"Broadcast {event} to game {game_id}")
Characteristics:
- No acknowledgment mechanism
- No retry logic
- No delivery confirmation
- Fire-and-forget
Risk: Client doesn't receive "play_resolved" event
- State is updated on server
- Client never knows about it
- Client thinks outcome is still pending
- Timeout eventually kicks in
Mitigation: Socket.io provides:
- Client-side event acknowledgment (ack callbacks)
- Currently not used
- Should implement for critical events
11. HEARTBEAT AND KEEPALIVE
Heartbeat Implementation
@sio.event
async def heartbeat(sid):
"""Handle heartbeat ping"""
await sio.emit("heartbeat_ack", {}, room=sid)
Status: ✅ Basic heartbeat exists
However:
- No server-initiated heartbeat
- No timeout tracking
- No automatic cleanup of idle sessions
- Client must ping at regular intervals
Production Issue:
- Disconnected clients (network issue) may not send disconnect event
- Sessions persist indefinitely
- After 1000 idle connections: significant memory waste
Recommendation:
# Server-side keepalive
async def maintain_heartbeats():
while True:
await asyncio.sleep(30) # Every 30 seconds
# Check for stale sessions without recent activity
now = pendulum.now('UTC')
for sid, user_id in list(manager.user_sessions.items()):
last_activity = manager.last_activity.get(sid)
if last_activity and (now - last_activity).seconds > 300:
# 5 minutes idle - disconnect
await sio.disconnect(sid)
12. DATABASE INTERACTION RISKS
Issue #11 - Async DB Operations Without Proper Isolation
@sio.event
async def submit_manual_outcome(sid, data):
# ...
result = await game_engine.resolve_manual_play(
game_id=game_id,
ab_roll=ab_roll,
outcome=outcome,
hit_location=submission.hit_location,
)
# Updated state might lag behind DB
updated_state = state_manager.get_state(game_id)
if updated_state:
await manager.broadcast_to_game(
str(game_id), "game_state_update", updated_state.model_dump(mode="json")
)
Issue: State broadcast happens synchronously, but DB write is async
- Broadcast might happen before DB write completes
- DB connection failure silently fails (logged only)
- Client has state but DB might not have it yet
Better approach:
result = await game_engine.resolve_manual_play(...) # Includes DB write
# Only then broadcast
await manager.broadcast_to_game(...)
13. SOCKET.IO CONFIGURATION
Current Configuration (main.py)
sio = socketio.AsyncServer(
async_mode="asgi",
cors_allowed_origins=settings.cors_origins,
logger=True,
engineio_logger=False,
)
Missing Configuration Options
# Should consider adding:
sio = socketio.AsyncServer(
async_mode="asgi",
cors_allowed_origins=settings.cors_origins,
logger=True,
engineio_logger=False,
# Connection options
ping_timeout=60, # Disconnect after 60s no response to ping
ping_interval=25, # Send ping every 25 seconds
max_http_buffer_size=1e6, # 1MB max message size
# Event handling
async_handlers=True, # Already default with async
# Namespaces
# Could partition different game types into separate namespaces
)
Without ping_timeout: Zombie connections can persist indefinitely
14. TEST COVERAGE
Current Tests
- File:
tests/unit/websocket/test_manual_outcome_handlers.py - Status: Unit tests with mocks for critical handlers
Test Coverage Analysis
✅ Covered:
roll_dicesuccess pathroll_dicemissing game_idroll_diceinvalid game_idsubmit_manual_outcome(partial)
❌ Not Covered:
- Concurrent submissions (race condition tests)
- Connection cleanup
- Game room cleanup after disconnect
- Authorization verification (because TODOs indicate not implemented)
- Reconnection scenarios
- State recovery from DB
- All 15 handlers not fully tested
Recommendation: Add integration tests for:
@pytest.mark.asyncio
async def test_concurrent_outcome_submissions():
"""Two players submit outcomes simultaneously"""
# Create game
# Player 1 starts submit_manual_outcome
# Player 2 starts submit_manual_outcome
# Both should fail or one should win
@pytest.mark.asyncio
async def test_connection_cleanup_on_disconnect():
"""Game room should be cleaned up after all players disconnect"""
# Connect 3 players to game 123
# Disconnect all 3
# Assert game_rooms doesn't contain empty set for game 123
15. RECOMMENDATIONS
Priority 1: Critical (Production Risk)
-
Add Locking to Handler-Level State Mutations
- Location: All handlers that modify state before calling game_engine
- Impact: Prevents race conditions in critical paths
- Effort: Medium
- Example: Wrap state mutation in game_engine locks
async def submit_manual_outcome(sid, data): lock = game_engine._get_game_lock(game_id) async with lock: # Safe state modifications -
Implement Authorization Checks
- Location: All 11 TODO comments in handlers.py
- Impact: Prevents unauthorized game access
- Effort: Medium
- Add helper functions for permission verification
-
Add Session Expiration
- Location: connection_manager.py, heartbeat handler
- Impact: Prevent memory leak from zombie sessions
- Effort: Low
- Track last_activity per sid, cleanup stale sessions
Priority 2: Important (Data Integrity)
-
Clean Up Empty Game Rooms
- Location: disconnect() method in connection_manager
- Impact: Prevent memory leak
- Effort: Low
- Delete empty sets from game_rooms dict
-
Implement Game Cleanup Lifecycle
- Location: game_engine.py
- Impact: Proper resource cleanup
- Effort: High
- Add cleanup calls to state_manager on game completion
-
Add Reconnection Handling
- Location: connect handler
- Impact: Handle clients reconnecting with same user
- Effort: Medium
- Detect duplicate sessions, manage accordingly
Priority 3: Recommended (Reliability)
-
Implement Broadcast Acknowledgments
- Location: broadcast_to_game calls
- Impact: Ensure critical events are delivered
- Effort: Medium
- Use Socket.io ack callbacks for game_state_update
-
Add Server-Initiated Keepalive
- Location: Lifespan context manager or separate task
- Impact: Detect disconnected clients promptly
- Effort: Medium
- Periodic server-side health check
-
Configure Socket.io Ping Timeouts
- Location: main.py socket initialization
- Impact: Automatic cleanup of dead connections
- Effort: Low
- Set ping_timeout=60, ping_interval=25
-
Add Comprehensive Integration Tests
- Location: tests/integration/websocket/
- Impact: Prevent regressions
- Effort: High
- Test concurrent scenarios, recovery, cleanup
Priority 4: Enhancement (Performance)
-
Implement Namespace Separation
- Partition different game instances into separate namespaces
- Reduces broadcast overhead
- Effort: High
-
Add Event Compression
- For large game_state_update broadcasts
- Reduce network overhead
- Effort: Medium
16. SUMMARY TABLE
| Issue | Severity | Category | Impact | Fix Effort |
|---|---|---|---|---|
| State mutation race conditions | CRITICAL | Concurrency | Data corruption | Medium |
| Missing authorization | CRITICAL | Security | Unauthorized access | Medium |
| Session expiration | HIGH | Memory | Unbounded memory growth | Low |
| Empty game room cleanup | HIGH | Memory | Memory leak | Low |
| Game resource cleanup | HIGH | Lifecycle | Orphaned resources | High |
| Reconnection handling | HIGH | Reliability | Duplicate sessions | Medium |
| Broadcast delivery guarantees | MEDIUM | Reliability | Missed updates | Medium |
| Integration test coverage | MEDIUM | Testing | Undetected bugs | High |
| Socket.io config optimization | MEDIUM | Reliability | Long disconnection detection | Low |
| DB operation isolation | LOW | Data Integrity | Minor timing issues | Low |
17. IMPLEMENTATION CHECKLIST
Phase 1: Critical Fixes (Week 1)
- Add locking to handler state mutations
- Implement authorization middleware
- Add session expiration logic
- Clean up empty game rooms
Phase 2: Important Fixes (Week 2-3)
- Implement game cleanup lifecycle
- Add reconnection handling
- Add Socket.io ping configuration
- Implement broadcast acknowledgments
Phase 3: Testing & Validation (Week 4)
- Add concurrency tests
- Add recovery tests
- Add cleanup verification tests
- Load testing with 100+ concurrent connections
Report Generated: 2025-11-27
Analysis Tool: Claude Code (File Search Specialist)
Scope: 2,643 lines analyzed across 3 core files + CLAUDE.md documentation