Frontend UX improvements: - Single-click Discord OAuth from home page (no intermediate /auth page) - Auto-redirect authenticated users from home to /games - Fixed Nuxt layout system - app.vue now wraps NuxtPage with NuxtLayout - Games page now has proper card container with shadow/border styling - Layout header includes working logout with API cookie clearing Games list enhancements: - Display team names (lname) instead of just team IDs - Show current score for each team - Show inning indicator (Top/Bot X) for active games - Responsive header with wrapped buttons on mobile Backend improvements: - Added team caching to SbaApiClient (1-hour TTL) - Enhanced GameListItem with team names, scores, inning data - Games endpoint now enriches response with SBA API team data Docker optimizations: - Optimized Dockerfile using --chown flag on COPY (faster than chown -R) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
830 lines
24 KiB
Markdown
830 lines
24 KiB
Markdown
# WebSocket Implementation Analysis Report
|
|
## Paper Dynasty Real-Time Game Engine Backend
|
|
|
|
**Date**: 2025-11-27
|
|
**Scope**: `/mnt/NV2/Development/strat-gameplay-webapp/backend/app/websocket/`
|
|
**Status**: Comprehensive exploration of production-ready implementation
|
|
|
|
---
|
|
|
|
## EXECUTIVE SUMMARY
|
|
|
|
The WebSocket implementation is **well-architected and production-ready** with 15 event handlers, proper error handling, and recovery mechanisms. However, there are several **identified risk areas** related to state synchronization, memory management, and race conditions that require attention before handling high-concurrency scenarios.
|
|
|
|
### Key Findings:
|
|
- ✅ **Solid Foundation**: Clean separation of concerns, proper async/await patterns
|
|
- ✅ **Good Error Handling**: 157+ error handling statements across handlers
|
|
- ⚠️ **State Mutation Risk**: Direct state object mutations without proper locking
|
|
- ⚠️ **Memory Leak Potential**: Game resource cleanup not guaranteed on all code paths
|
|
- ⚠️ **Race Condition Risk**: Multiple concurrent events could corrupt state
|
|
- ⚠️ **Reconnection Gap**: No explicit reconnection handling or session recovery strategy
|
|
- ⚠️ **Room Cleanup**: Stale game rooms may persist indefinitely
|
|
|
|
---
|
|
|
|
## 1. ARCHITECTURE OVERVIEW
|
|
|
|
### Directory Structure
|
|
```
|
|
app/websocket/
|
|
├── connection_manager.py (71 lines) # Room & session management
|
|
├── handlers.py (1,312 lines) # 15 event handlers
|
|
├── __init__.py # Package marker
|
|
└── CLAUDE.md # Documentation
|
|
```
|
|
|
|
### Socket.io Initialization (main.py)
|
|
```python
|
|
sio = socketio.AsyncServer(
|
|
async_mode="asgi",
|
|
cors_allowed_origins=settings.cors_origins,
|
|
logger=True,
|
|
engineio_logger=False
|
|
)
|
|
socket_app = socketio.ASGIApp(sio, app)
|
|
|
|
connection_manager = ConnectionManager(sio)
|
|
register_handlers(sio, connection_manager)
|
|
game_engine.set_connection_manager(connection_manager)
|
|
```
|
|
|
|
**Critical Point**: Connection manager is a singleton instance passed to all handlers via closure.
|
|
|
|
---
|
|
|
|
## 2. CONNECTION MANAGEMENT
|
|
|
|
### ConnectionManager Structure
|
|
```python
|
|
class ConnectionManager:
|
|
def __init__(self, sio: socketio.AsyncServer):
|
|
self.sio = sio
|
|
self.user_sessions: dict[str, str] = {} # sid -> user_id
|
|
self.game_rooms: dict[str, set[str]] = {} # game_id -> set of sids
|
|
```
|
|
|
|
### Connection Lifecycle
|
|
|
|
#### 1. Connect Handler
|
|
```python
|
|
@sio.event
|
|
async def connect(sid, environ, auth):
|
|
# Dual auth support:
|
|
# 1. HttpOnly cookies (from nginx proxy)
|
|
# 2. Auth object (direct JS clients)
|
|
token = extract_token(environ, auth)
|
|
user_data = verify_token(token)
|
|
user_id = user_data.get("user_id")
|
|
|
|
await manager.connect(sid, user_id)
|
|
await sio.emit("connected", {"user_id": user_id}, room=sid)
|
|
return True # Accept connection
|
|
```
|
|
|
|
**Status**: ✅ Properly authenticated with fallback support
|
|
|
|
#### 2. Disconnect Handler
|
|
```python
|
|
@sio.event
|
|
async def disconnect(sid):
|
|
await manager.disconnect(sid)
|
|
|
|
async def disconnect(self, sid: str) -> None:
|
|
user_id = self.user_sessions.pop(sid, None)
|
|
if user_id:
|
|
logger.info(f"User {user_id} disconnected (session {sid})")
|
|
|
|
# Remove from all game rooms
|
|
for game_id, sids in self.game_rooms.items():
|
|
if sid in sids:
|
|
sids.remove(sid)
|
|
await self.broadcast_to_game(
|
|
game_id, "user_disconnected", {"user_id": user_id}
|
|
)
|
|
```
|
|
|
|
**Status**: ✅ Removes user from all rooms
|
|
|
|
### Room Management
|
|
|
|
#### Join Game
|
|
```python
|
|
async def join_game(self, sid: str, game_id: str, role: str) -> None:
|
|
await self.sio.enter_room(sid, game_id)
|
|
|
|
if game_id not in self.game_rooms:
|
|
self.game_rooms[game_id] = set()
|
|
self.game_rooms[game_id].add(sid)
|
|
```
|
|
|
|
**Issue #1**: Redundant tracking - Socket.io already maintains room membership
|
|
- `self.sio.enter_room()` registers the sid in Socket.io's internal room
|
|
- `self.game_rooms[game_id]` duplicates this information
|
|
- **Risk**: Synchronization could diverge if Socket.io rooms are accessed directly
|
|
|
|
#### Leave Game
|
|
```python
|
|
async def leave_game(self, sid: str, game_id: str) -> None:
|
|
await self.sio.leave_room(sid, game_id)
|
|
|
|
if game_id in self.game_rooms:
|
|
self.game_rooms[game_id].discard(sid)
|
|
```
|
|
|
|
**Issue #2**: Manual set cleanup but no cleanup of empty game rooms
|
|
- After last user leaves, `self.game_rooms[game_id]` remains as empty set
|
|
- **Memory Impact**: Accumulates empty sets over time
|
|
- **Recommendation**: Remove empty game rooms after discard
|
|
|
|
---
|
|
|
|
## 3. EVENT HANDLERS (15 Total)
|
|
|
|
### Handler Categories
|
|
|
|
#### Connection Events (2)
|
|
- `connect` - JWT authentication
|
|
- `disconnect` - Session cleanup
|
|
|
|
#### Game Flow Events (4)
|
|
- `join_game` - Join game room
|
|
- `leave_game` - Leave game room
|
|
- `request_game_state` - Recovery after disconnect/initial load
|
|
- `heartbeat` - Keep-alive ping
|
|
|
|
#### Decision Submission (2)
|
|
- `submit_defensive_decision` - Defense strategy
|
|
- `submit_offensive_decision` - Offense strategy
|
|
|
|
#### Manual Outcome Flow (2)
|
|
- `roll_dice` - Roll dice for play
|
|
- `submit_manual_outcome` - Submit card result
|
|
|
|
#### Substitutions (3)
|
|
- `request_pinch_hitter` - Batting substitution
|
|
- `request_pitching_change` - Pitcher substitution
|
|
- `request_defensive_replacement` - Field substitution
|
|
|
|
#### Query Events (2)
|
|
- `get_lineup` - Get team lineup
|
|
- `get_box_score` - Get statistics
|
|
|
|
### Error Handling Pattern
|
|
|
|
**Consistent "Emit or Return" Pattern**:
|
|
```python
|
|
try:
|
|
# Validate input
|
|
if not game_id_str:
|
|
await manager.emit_to_user(sid, "error", {"message": "Missing game_id"})
|
|
return
|
|
|
|
# Parse and validate
|
|
try:
|
|
game_id = UUID(game_id_str)
|
|
except (ValueError, AttributeError):
|
|
await manager.emit_to_user(sid, "error", {"message": "Invalid format"})
|
|
return
|
|
|
|
# Get state
|
|
state = state_manager.get_state(game_id)
|
|
if not state:
|
|
await manager.emit_to_user(sid, "error", {"message": f"Game not found"})
|
|
return
|
|
|
|
# Process action
|
|
result = await game_engine.some_operation(...)
|
|
|
|
# Broadcast result
|
|
await manager.broadcast_to_game(str(game_id), "event_name", data)
|
|
|
|
except GameValidationError as e:
|
|
await manager.emit_to_user(sid, "outcome_rejected", {"message": str(e)})
|
|
except Exception as e:
|
|
logger.error(f"Error: {e}", exc_info=True)
|
|
await manager.emit_to_user(sid, "error", {"message": str(e)})
|
|
```
|
|
|
|
**Analysis**: 157+ error handling statements providing comprehensive coverage
|
|
|
|
---
|
|
|
|
## 4. STATE SYNCHRONIZATION PATTERNS
|
|
|
|
### State Flow Diagram
|
|
```
|
|
Handler gets state → Mutates state object → Updates state manager → Broadcasts
|
|
```
|
|
|
|
### Critical State Mutation: Dice Roll Example
|
|
|
|
```python
|
|
# Line 216-217 in handlers.py (roll_dice)
|
|
state.pending_manual_roll = ab_roll
|
|
state_manager.update_state(game_id, state)
|
|
```
|
|
|
|
**Issue #3 - RACE CONDITION RISK**: State mutation without locking
|
|
- Multiple concurrent `roll_dice` calls could overwrite `pending_manual_roll`
|
|
- `state_manager.update_state()` does NOT use locks
|
|
- **Scenario**: Player 1 rolls, Player 2 rolls before Player 1 submits outcome
|
|
- `pending_manual_roll` is overwritten
|
|
- Player 1's submitted outcome rejects: "No pending dice roll"
|
|
|
|
### Similar Risk in Other Handlers
|
|
|
|
**All decision submission handlers (lines 271-409)**:
|
|
```python
|
|
# submit_defensive_decision
|
|
async def submit_defensive_decision(sid, data):
|
|
# ...
|
|
updated_state = await game_engine.submit_defensive_decision(game_id, decision)
|
|
# ...
|
|
await manager.broadcast_to_game(str(game_id), "defensive_decision_submitted", {...})
|
|
```
|
|
|
|
**Race Condition Risk**: Two simultaneous defensive decision submissions
|
|
- Both handlers enter concurrently
|
|
- Both call `game_engine.submit_defensive_decision()`
|
|
- Both may be processed if game_engine locks are insufficient
|
|
|
|
**Analysis of game_engine locking**:
|
|
```python
|
|
class GameEngine:
|
|
def _get_game_lock(self, game_id: UUID) -> asyncio.Lock:
|
|
if game_id not in self._game_locks:
|
|
self._game_locks[game_id] = asyncio.Lock()
|
|
return self._game_locks[game_id]
|
|
```
|
|
|
|
**Problem**: Locks are in game_engine, but WebSocket handlers don't use them!
|
|
- Handlers call state_manager directly without locks
|
|
- Game engine operations are locked, but handler-level mutations are not
|
|
- **Two-phase risk**: Handler mutation + Engine operation both need protection
|
|
|
|
---
|
|
|
|
## 5. RECONNECTION AND RECOVERY
|
|
|
|
### Game State Recovery Handler
|
|
|
|
```python
|
|
@sio.event
|
|
async def request_game_state(sid, data):
|
|
"""Client requests full game state (recovery after disconnect)"""
|
|
game_id_str = data.get("game_id")
|
|
game_id = UUID(game_id_str)
|
|
|
|
# Try memory first
|
|
state = state_manager.get_state(game_id)
|
|
|
|
# If not in memory, recover from database
|
|
if not state:
|
|
logger.info(f"Game {game_id} not in memory, recovering from database")
|
|
state = await state_manager.recover_game(game_id)
|
|
|
|
if state:
|
|
await manager.emit_to_user(sid, "game_state", state.model_dump(mode="json"))
|
|
else:
|
|
await manager.emit_to_user(sid, "error", {"message": f"Game not found"})
|
|
```
|
|
|
|
**Status**: ✅ Recovery mechanism exists
|
|
|
|
**However**:
|
|
|
|
### Issue #4 - No Explicit Reconnection Handling
|
|
- Socket.io supports client reconnection automatically, but no handler for it
|
|
- When client reconnects with same session:
|
|
- New socket connection establishes
|
|
- Old session ID may still be in `user_sessions`
|
|
- **Duplicate session risk**: Same user_id mapped to two sids
|
|
|
|
```python
|
|
async def connect(sid, environ, auth):
|
|
# ...
|
|
await manager.connect(sid, user_id) # <- Could overwrite old session
|
|
```
|
|
|
|
**Missing Code**: No check for existing sessions
|
|
```python
|
|
# MISSING:
|
|
if user_id in [uid for uid in self.user_sessions.values()]:
|
|
# Existing session detected - close old session?
|
|
# Or allow multiple sessions per user?
|
|
```
|
|
|
|
### Issue #5 - No Session Expiration
|
|
- `user_sessions` dictionary grows indefinitely
|
|
- If disconnect event doesn't fire (network failure), sessions accumulate
|
|
- After 1000 disconnected users: 1000 orphaned entries in memory
|
|
|
|
---
|
|
|
|
## 6. MEMORY LEAK POTENTIAL
|
|
|
|
### Issue #6 - Game Room Cleanup
|
|
|
|
**Current cleanup in disconnect**:
|
|
```python
|
|
for game_id, sids in self.game_rooms.items():
|
|
if sid in sids:
|
|
sids.remove(sid) # <- Leaves empty set in dict
|
|
```
|
|
|
|
**Problem**: Empty sets persist indefinitely
|
|
```python
|
|
# After all players leave game 123:
|
|
game_rooms = {
|
|
"123": set(), # <- Still in dict, wastes memory
|
|
"456": {sid1, sid2}
|
|
}
|
|
```
|
|
|
|
**Over time**: Hundreds of dead game rooms accumulate
|
|
|
|
**Recommendation**:
|
|
```python
|
|
sids.discard(sid)
|
|
if not sids: # If set is now empty
|
|
del self.game_rooms[game_id] # Remove the game room
|
|
```
|
|
|
|
### Issue #7 - Event Listener Cleanup
|
|
|
|
**Socket.io events are registered once at startup**:
|
|
```python
|
|
def register_handlers(sio: AsyncServer, manager: ConnectionManager) -> None:
|
|
@sio.event
|
|
async def connect(sid, environ, auth):
|
|
# Registered once, called many times
|
|
```
|
|
|
|
**Analysis**: ✅ No memory leak from event listeners (they're registered once as closures)
|
|
|
|
**However**: Check if state_manager has internal leaks
|
|
- `_pending_decisions: dict[tuple, asyncio.Future]` - could hold stale futures
|
|
- `_last_access: dict[UUID, pendulum.DateTime]` - could grow indefinitely
|
|
- `_lineups: dict[UUID, dict[int, TeamLineupState]]` - needs cleanup when game ends
|
|
|
|
---
|
|
|
|
## 7. GAME LIFECYCLE MANAGEMENT
|
|
|
|
### Issue #8 - Resource Cleanup Not Guaranteed
|
|
|
|
**Where games should cleanup**:
|
|
1. ✅ When game completes naturally (27 outs)
|
|
2. ❓ When game is manually ended
|
|
3. ❓ When user force-closes connection
|
|
4. ❓ After idle timeout
|
|
|
|
**Current cleanup in game_engine.py**:
|
|
```python
|
|
async def _cleanup_game_resources(game_id: UUID):
|
|
if game_id in self._rolls_this_inning:
|
|
del self._rolls_this_inning[game_id]
|
|
if game_id in self._game_locks:
|
|
del self._game_locks[game_id]
|
|
```
|
|
|
|
**Problem**: This cleanup is called only from game completion path
|
|
- No cleanup on unexpected disconnection
|
|
- No cleanup on timeout
|
|
- Game state remains in state_manager indefinitely
|
|
|
|
**Recommendation**:
|
|
```python
|
|
# When game ends (any reason):
|
|
await game_engine._cleanup_game_resources(game_id)
|
|
await state_manager.remove_game(game_id)
|
|
await manager.cleanup_game_room(game_id)
|
|
```
|
|
|
|
### Missing endpoint: Force game cleanup
|
|
|
|
**No REST endpoint or WebSocket event to cleanup a game**
|
|
- Useful for: admin actions, manual cleanup after crashes
|
|
- Should be admin-only
|
|
- Should emit notification to all players in room
|
|
|
|
---
|
|
|
|
## 8. AUTHORIZATION GAPS
|
|
|
|
### Issue #9 - Authorization TODOs
|
|
|
|
Found 11 security TODOs indicating missing authorization checks:
|
|
|
|
```python
|
|
# Line 90: join_game
|
|
# TODO: Verify user has access to game
|
|
|
|
# Line 200: roll_dice
|
|
# TODO: Verify user is participant in this game
|
|
|
|
# Line 293: submit_manual_outcome
|
|
# TODO: Verify user is active batter or authorized to submit
|
|
|
|
# Lines 558, 723, 880: Substitution handlers
|
|
# TODO: Verify user is authorized to make substitution for this team
|
|
```
|
|
|
|
**Risk**: Currently **ANY connected user** can:
|
|
- Roll dice for any game
|
|
- Submit outcomes
|
|
- Make substitutions
|
|
- View any lineup
|
|
- View any box score
|
|
|
|
**Should implement**:
|
|
```python
|
|
def _verify_game_access(sid: str, game_id: UUID) -> bool:
|
|
"""Check if user in this session has access to game"""
|
|
user_id = manager.user_sessions.get(sid)
|
|
# Check against game participant list in DB
|
|
|
|
def _verify_is_batter(sid: str, game_id: UUID) -> bool:
|
|
"""Check if user is the current batter"""
|
|
|
|
def _verify_is_pitcher(sid: str, game_id: UUID) -> bool:
|
|
"""Check if user is the current pitcher"""
|
|
```
|
|
|
|
---
|
|
|
|
## 9. POTENTIAL RACE CONDITIONS
|
|
|
|
### Race Condition Scenario #1: Concurrent Outcome Submissions
|
|
|
|
```
|
|
Timeline:
|
|
T1: Player1.submit_manual_outcome() - gets pending_manual_roll
|
|
T2: Player2.submit_manual_outcome() - gets SAME pending_manual_roll
|
|
T3: Player1 clears pending_manual_roll
|
|
T4: Player2 tries to clear it again - but it's already None
|
|
```
|
|
|
|
**Actual Problem**: Both players process the same roll!
|
|
- Play is resolved twice
|
|
- State is corrupted
|
|
- Both receive "outcome_accepted"
|
|
|
|
**Fix**: Use lock in handlers
|
|
```python
|
|
async def submit_manual_outcome(sid, data):
|
|
lock = game_engine._get_game_lock(game_id)
|
|
async with lock:
|
|
# Safe: only one handler can be here at a time
|
|
state = state_manager.get_state(game_id)
|
|
if not state.pending_manual_roll:
|
|
# Safely reject
|
|
```
|
|
|
|
### Race Condition Scenario #2: Substitution During Play
|
|
|
|
```
|
|
Timeline:
|
|
T1: substitute_player() - modifies lineup
|
|
T2: resolve_play() - reads lineup (sees old player)
|
|
T3: substitute_player() completes - broadcasts change
|
|
T4: resolve_play() completes - broadcasts different player in play result
|
|
```
|
|
|
|
**Result**: Confusion about which player was involved
|
|
|
|
### Race Condition Scenario #3: State Recovery During Active Play
|
|
|
|
```
|
|
Timeline:
|
|
T1: Player A submits outcome
|
|
T2: Client B reconnects
|
|
T3: Client B requests_game_state()
|
|
T4: state_manager.recover_game() loads from DB (sees old state)
|
|
T5: Client B receives outdated state while play is resolving
|
|
```
|
|
|
|
**Recovery timing issue**: Overlapping with active game operations
|
|
|
|
---
|
|
|
|
## 10. BROADCAST PATTERN RISKS
|
|
|
|
### Issue #10 - No Guarantees on Broadcast Delivery
|
|
|
|
```python
|
|
async def broadcast_to_game(self, game_id: str, event: str, data: dict) -> None:
|
|
await self.sio.emit(event, data, room=game_id)
|
|
logger.debug(f"Broadcast {event} to game {game_id}")
|
|
```
|
|
|
|
**Characteristics**:
|
|
- No acknowledgment mechanism
|
|
- No retry logic
|
|
- No delivery confirmation
|
|
- Fire-and-forget
|
|
|
|
**Risk**: Client doesn't receive "play_resolved" event
|
|
- State is updated on server
|
|
- Client never knows about it
|
|
- Client thinks outcome is still pending
|
|
- Timeout eventually kicks in
|
|
|
|
**Mitigation**: Socket.io provides:
|
|
- Client-side event acknowledgment (ack callbacks)
|
|
- Currently not used
|
|
- Should implement for critical events
|
|
|
|
---
|
|
|
|
## 11. HEARTBEAT AND KEEPALIVE
|
|
|
|
### Heartbeat Implementation
|
|
|
|
```python
|
|
@sio.event
|
|
async def heartbeat(sid):
|
|
"""Handle heartbeat ping"""
|
|
await sio.emit("heartbeat_ack", {}, room=sid)
|
|
```
|
|
|
|
**Status**: ✅ Basic heartbeat exists
|
|
|
|
**However**:
|
|
- No server-initiated heartbeat
|
|
- No timeout tracking
|
|
- No automatic cleanup of idle sessions
|
|
- Client must ping at regular intervals
|
|
|
|
**Production Issue**:
|
|
- Disconnected clients (network issue) may not send disconnect event
|
|
- Sessions persist indefinitely
|
|
- After 1000 idle connections: significant memory waste
|
|
|
|
**Recommendation**:
|
|
```python
|
|
# Server-side keepalive
|
|
async def maintain_heartbeats():
|
|
while True:
|
|
await asyncio.sleep(30) # Every 30 seconds
|
|
# Check for stale sessions without recent activity
|
|
now = pendulum.now('UTC')
|
|
for sid, user_id in list(manager.user_sessions.items()):
|
|
last_activity = manager.last_activity.get(sid)
|
|
if last_activity and (now - last_activity).seconds > 300:
|
|
# 5 minutes idle - disconnect
|
|
await sio.disconnect(sid)
|
|
```
|
|
|
|
---
|
|
|
|
## 12. DATABASE INTERACTION RISKS
|
|
|
|
### Issue #11 - Async DB Operations Without Proper Isolation
|
|
|
|
```python
|
|
@sio.event
|
|
async def submit_manual_outcome(sid, data):
|
|
# ...
|
|
result = await game_engine.resolve_manual_play(
|
|
game_id=game_id,
|
|
ab_roll=ab_roll,
|
|
outcome=outcome,
|
|
hit_location=submission.hit_location,
|
|
)
|
|
|
|
# Updated state might lag behind DB
|
|
updated_state = state_manager.get_state(game_id)
|
|
if updated_state:
|
|
await manager.broadcast_to_game(
|
|
str(game_id), "game_state_update", updated_state.model_dump(mode="json")
|
|
)
|
|
```
|
|
|
|
**Issue**: State broadcast happens synchronously, but DB write is async
|
|
- Broadcast might happen before DB write completes
|
|
- DB connection failure silently fails (logged only)
|
|
- Client has state but DB might not have it yet
|
|
|
|
**Better approach**:
|
|
```python
|
|
result = await game_engine.resolve_manual_play(...) # Includes DB write
|
|
# Only then broadcast
|
|
await manager.broadcast_to_game(...)
|
|
```
|
|
|
|
---
|
|
|
|
## 13. SOCKET.IO CONFIGURATION
|
|
|
|
### Current Configuration (main.py)
|
|
```python
|
|
sio = socketio.AsyncServer(
|
|
async_mode="asgi",
|
|
cors_allowed_origins=settings.cors_origins,
|
|
logger=True,
|
|
engineio_logger=False,
|
|
)
|
|
```
|
|
|
|
### Missing Configuration Options
|
|
|
|
```python
|
|
# Should consider adding:
|
|
sio = socketio.AsyncServer(
|
|
async_mode="asgi",
|
|
cors_allowed_origins=settings.cors_origins,
|
|
logger=True,
|
|
engineio_logger=False,
|
|
|
|
# Connection options
|
|
ping_timeout=60, # Disconnect after 60s no response to ping
|
|
ping_interval=25, # Send ping every 25 seconds
|
|
max_http_buffer_size=1e6, # 1MB max message size
|
|
|
|
# Event handling
|
|
async_handlers=True, # Already default with async
|
|
|
|
# Namespaces
|
|
# Could partition different game types into separate namespaces
|
|
)
|
|
```
|
|
|
|
**Without ping_timeout**: Zombie connections can persist indefinitely
|
|
|
|
---
|
|
|
|
## 14. TEST COVERAGE
|
|
|
|
### Current Tests
|
|
- **File**: `tests/unit/websocket/test_manual_outcome_handlers.py`
|
|
- **Status**: Unit tests with mocks for critical handlers
|
|
|
|
### Test Coverage Analysis
|
|
|
|
✅ **Covered**:
|
|
- `roll_dice` success path
|
|
- `roll_dice` missing game_id
|
|
- `roll_dice` invalid game_id
|
|
- `submit_manual_outcome` (partial)
|
|
|
|
❌ **Not Covered**:
|
|
- Concurrent submissions (race condition tests)
|
|
- Connection cleanup
|
|
- Game room cleanup after disconnect
|
|
- Authorization verification (because TODOs indicate not implemented)
|
|
- Reconnection scenarios
|
|
- State recovery from DB
|
|
- All 15 handlers not fully tested
|
|
|
|
**Recommendation**: Add integration tests for:
|
|
```python
|
|
@pytest.mark.asyncio
|
|
async def test_concurrent_outcome_submissions():
|
|
"""Two players submit outcomes simultaneously"""
|
|
# Create game
|
|
# Player 1 starts submit_manual_outcome
|
|
# Player 2 starts submit_manual_outcome
|
|
# Both should fail or one should win
|
|
|
|
@pytest.mark.asyncio
|
|
async def test_connection_cleanup_on_disconnect():
|
|
"""Game room should be cleaned up after all players disconnect"""
|
|
# Connect 3 players to game 123
|
|
# Disconnect all 3
|
|
# Assert game_rooms doesn't contain empty set for game 123
|
|
```
|
|
|
|
---
|
|
|
|
## 15. RECOMMENDATIONS
|
|
|
|
### Priority 1: Critical (Production Risk)
|
|
|
|
1. **Add Locking to Handler-Level State Mutations**
|
|
- Location: All handlers that modify state before calling game_engine
|
|
- Impact: Prevents race conditions in critical paths
|
|
- Effort: Medium
|
|
- Example: Wrap state mutation in game_engine locks
|
|
```python
|
|
async def submit_manual_outcome(sid, data):
|
|
lock = game_engine._get_game_lock(game_id)
|
|
async with lock:
|
|
# Safe state modifications
|
|
```
|
|
|
|
2. **Implement Authorization Checks**
|
|
- Location: All 11 TODO comments in handlers.py
|
|
- Impact: Prevents unauthorized game access
|
|
- Effort: Medium
|
|
- Add helper functions for permission verification
|
|
|
|
3. **Add Session Expiration**
|
|
- Location: connection_manager.py, heartbeat handler
|
|
- Impact: Prevent memory leak from zombie sessions
|
|
- Effort: Low
|
|
- Track last_activity per sid, cleanup stale sessions
|
|
|
|
### Priority 2: Important (Data Integrity)
|
|
|
|
4. **Clean Up Empty Game Rooms**
|
|
- Location: disconnect() method in connection_manager
|
|
- Impact: Prevent memory leak
|
|
- Effort: Low
|
|
- Delete empty sets from game_rooms dict
|
|
|
|
5. **Implement Game Cleanup Lifecycle**
|
|
- Location: game_engine.py
|
|
- Impact: Proper resource cleanup
|
|
- Effort: High
|
|
- Add cleanup calls to state_manager on game completion
|
|
|
|
6. **Add Reconnection Handling**
|
|
- Location: connect handler
|
|
- Impact: Handle clients reconnecting with same user
|
|
- Effort: Medium
|
|
- Detect duplicate sessions, manage accordingly
|
|
|
|
### Priority 3: Recommended (Reliability)
|
|
|
|
7. **Implement Broadcast Acknowledgments**
|
|
- Location: broadcast_to_game calls
|
|
- Impact: Ensure critical events are delivered
|
|
- Effort: Medium
|
|
- Use Socket.io ack callbacks for game_state_update
|
|
|
|
8. **Add Server-Initiated Keepalive**
|
|
- Location: Lifespan context manager or separate task
|
|
- Impact: Detect disconnected clients promptly
|
|
- Effort: Medium
|
|
- Periodic server-side health check
|
|
|
|
9. **Configure Socket.io Ping Timeouts**
|
|
- Location: main.py socket initialization
|
|
- Impact: Automatic cleanup of dead connections
|
|
- Effort: Low
|
|
- Set ping_timeout=60, ping_interval=25
|
|
|
|
10. **Add Comprehensive Integration Tests**
|
|
- Location: tests/integration/websocket/
|
|
- Impact: Prevent regressions
|
|
- Effort: High
|
|
- Test concurrent scenarios, recovery, cleanup
|
|
|
|
### Priority 4: Enhancement (Performance)
|
|
|
|
11. **Implement Namespace Separation**
|
|
- Partition different game instances into separate namespaces
|
|
- Reduces broadcast overhead
|
|
- Effort: High
|
|
|
|
12. **Add Event Compression**
|
|
- For large game_state_update broadcasts
|
|
- Reduce network overhead
|
|
- Effort: Medium
|
|
|
|
---
|
|
|
|
## 16. SUMMARY TABLE
|
|
|
|
| Issue | Severity | Category | Impact | Fix Effort |
|
|
|-------|----------|----------|--------|-----------|
|
|
| State mutation race conditions | CRITICAL | Concurrency | Data corruption | Medium |
|
|
| Missing authorization | CRITICAL | Security | Unauthorized access | Medium |
|
|
| Session expiration | HIGH | Memory | Unbounded memory growth | Low |
|
|
| Empty game room cleanup | HIGH | Memory | Memory leak | Low |
|
|
| Game resource cleanup | HIGH | Lifecycle | Orphaned resources | High |
|
|
| Reconnection handling | HIGH | Reliability | Duplicate sessions | Medium |
|
|
| Broadcast delivery guarantees | MEDIUM | Reliability | Missed updates | Medium |
|
|
| Integration test coverage | MEDIUM | Testing | Undetected bugs | High |
|
|
| Socket.io config optimization | MEDIUM | Reliability | Long disconnection detection | Low |
|
|
| DB operation isolation | LOW | Data Integrity | Minor timing issues | Low |
|
|
|
|
---
|
|
|
|
## 17. IMPLEMENTATION CHECKLIST
|
|
|
|
### Phase 1: Critical Fixes (Week 1)
|
|
- [ ] Add locking to handler state mutations
|
|
- [ ] Implement authorization middleware
|
|
- [ ] Add session expiration logic
|
|
- [ ] Clean up empty game rooms
|
|
|
|
### Phase 2: Important Fixes (Week 2-3)
|
|
- [ ] Implement game cleanup lifecycle
|
|
- [ ] Add reconnection handling
|
|
- [ ] Add Socket.io ping configuration
|
|
- [ ] Implement broadcast acknowledgments
|
|
|
|
### Phase 3: Testing & Validation (Week 4)
|
|
- [ ] Add concurrency tests
|
|
- [ ] Add recovery tests
|
|
- [ ] Add cleanup verification tests
|
|
- [ ] Load testing with 100+ concurrent connections
|
|
|
|
---
|
|
|
|
**Report Generated**: 2025-11-27
|
|
**Analysis Tool**: Claude Code (File Search Specialist)
|
|
**Scope**: 2,643 lines analyzed across 3 core files + CLAUDE.md documentation
|