strat-gameplay-webapp/.claude/plans/002-websocket-locking.md
Cal Corum e0c12467b0 CLAUDE: Improve UX with single-click OAuth, enhanced games list, and layout fix
Frontend UX improvements:
- Single-click Discord OAuth from home page (no intermediate /auth page)
- Auto-redirect authenticated users from home to /games
- Fixed Nuxt layout system - app.vue now wraps NuxtPage with NuxtLayout
- Games page now has proper card container with shadow/border styling
- Layout header includes working logout with API cookie clearing

Games list enhancements:
- Display team names (lname) instead of just team IDs
- Show current score for each team
- Show inning indicator (Top/Bot X) for active games
- Responsive header with wrapped buttons on mobile

Backend improvements:
- Added team caching to SbaApiClient (1-hour TTL)
- Enhanced GameListItem with team names, scores, inning data
- Games endpoint now enriches response with SBA API team data

Docker optimizations:
- Optimized Dockerfile using --chown flag on COPY (faster than chown -R)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-12-05 16:14:00 -06:00

267 lines
7.9 KiB
Markdown

# Plan 002: WebSocket Handler Locking
**Priority**: CRITICAL
**Effort**: 2-3 hours
**Status**: NOT STARTED
**Risk Level**: HIGH - Data corruption
---
## Problem Statement
WebSocket handlers mutate `GameState` without acquiring per-game locks, allowing concurrent handlers to corrupt game state.
**Example Race Condition**:
```
Time Client A (roll_dice) Client B (roll_dice)
---- -------------------- --------------------
T1 Read pending_roll = None
T2 Read pending_roll = None
T3 Set pending_roll = RollA
T4 Set pending_roll = RollB (overwrites!)
T5 Process with RollA
T6 Process with RollB (uses wrong roll!)
```
## Impact
- **Data Corruption**: Invalid game states
- **Invalid Outcomes**: Wrong play resolutions
- **User Trust**: Inconsistent game behavior
## Files to Modify
| File | Changes |
|------|---------|
| `backend/app/websocket/handlers.py` | Add lock acquisition to handlers |
| `backend/app/core/state_manager.py` | Expose lock acquisition method |
## Current Lock Implementation
The `StateManager` already has per-game locks:
```python
# backend/app/core/state_manager.py
class StateManager:
def __init__(self):
self._game_locks: dict[UUID, asyncio.Lock] = {}
def _get_game_lock(self, game_id: UUID) -> asyncio.Lock:
if game_id not in self._game_locks:
self._game_locks[game_id] = asyncio.Lock()
return self._game_locks[game_id]
```
**Problem**: Locks exist but handlers don't use them!
## Implementation Steps
### Step 1: Expose Lock Context Manager (15 min)
Update `backend/app/core/state_manager.py`:
```python
from contextlib import asynccontextmanager
class StateManager:
# ... existing code ...
@asynccontextmanager
async def game_lock(self, game_id: UUID):
"""
Acquire exclusive lock for game operations.
Use this for any handler that modifies game state.
Usage:
async with state_manager.game_lock(game_id):
# Perform state modifications
"""
lock = self._get_game_lock(game_id)
async with lock:
yield
```
### Step 2: Identify Handlers Requiring Locks (15 min)
Handlers that modify state (MUST lock):
- [x] `submit_defensive_decision` - modifies pending decisions
- [x] `submit_offensive_decision` - modifies pending decisions
- [x] `roll_dice` - modifies pending_manual_roll
- [x] `submit_manual_outcome` - modifies game state
- [x] `request_pinch_hitter` - modifies lineup
- [x] `request_defensive_replacement` - modifies lineup
- [x] `request_pitching_change` - modifies lineup
Handlers that only read (no lock needed):
- [ ] `join_game` - read only
- [ ] `leave_game` - read only
- [ ] `get_lineup` - read only
- [ ] `get_box_score` - read only
### Step 3: Update Decision Handlers (30 min)
```python
@sio.event
async def submit_defensive_decision(sid, data):
game_id = UUID(data.get("game_id"))
async with state_manager.game_lock(game_id):
# All state modifications inside lock
game_state = state_manager.get_game_state(game_id)
if not game_state:
await sio.emit("error", {"message": "Game not found"}, to=sid)
return
# Process decision
decision = DefensiveDecision(**data.get("decision"))
game_state.pending_defensive_decision = decision
# Check if both decisions received
if game_state.pending_offensive_decision:
# Resolve play within lock
result = await game_engine.resolve_play(game_state)
await manager.broadcast_to_game(game_id, "play_resolved", result)
```
### Step 4: Update Roll/Outcome Handlers (30 min)
```python
@sio.event
async def roll_dice(sid, data):
game_id = UUID(data.get("game_id"))
roll_type = data.get("roll_type")
async with state_manager.game_lock(game_id):
game_state = state_manager.get_game_state(game_id)
# Validate no pending roll
if game_state.pending_manual_roll:
await sio.emit("error", {"message": "Roll already pending"}, to=sid)
return
# Perform roll and set pending
roll = dice.roll(roll_type, game_state.league_id)
game_state.pending_manual_roll = roll
await manager.broadcast_to_game(game_id, "dice_rolled", roll.to_dict())
```
### Step 5: Update Substitution Handlers (30 min)
```python
@sio.event
async def request_pinch_hitter(sid, data):
game_id = UUID(data.get("game_id"))
async with state_manager.game_lock(game_id):
game_state = state_manager.get_game_state(game_id)
# Validate and process substitution
result = await substitution_manager.process_pinch_hitter(
game_state,
entering_player_id=data.get("entering_player_id"),
exiting_player_id=data.get("exiting_player_id")
)
if result.success:
await manager.broadcast_to_game(game_id, "player_substituted", result.to_dict())
else:
await sio.emit("error", {"message": result.error}, to=sid)
```
### Step 6: Add Lock Timeout (15 min)
Prevent deadlocks with timeout:
```python
@asynccontextmanager
async def game_lock(self, game_id: UUID, timeout: float = 30.0):
"""Acquire game lock with timeout."""
lock = self._get_game_lock(game_id)
try:
await asyncio.wait_for(lock.acquire(), timeout=timeout)
yield
except asyncio.TimeoutError:
logger.error(f"Lock timeout for game {game_id}")
raise RuntimeError(f"Could not acquire lock for game {game_id}")
finally:
if lock.locked():
lock.release()
```
### Step 7: Write Concurrency Tests (30 min)
Create `backend/tests/unit/websocket/test_handler_locking.py`:
```python
import pytest
import asyncio
from unittest.mock import AsyncMock, patch
class TestHandlerLocking:
"""Tests for WebSocket handler concurrency."""
@pytest.mark.asyncio
async def test_concurrent_rolls_serialized(self):
"""Two concurrent roll_dice calls should be serialized."""
results = []
async def mock_roll_dice(sid, data):
async with state_manager.game_lock(game_id):
results.append(f"start_{sid}")
await asyncio.sleep(0.1) # Simulate work
results.append(f"end_{sid}")
# Launch concurrent handlers
await asyncio.gather(
mock_roll_dice("sid1", {}),
mock_roll_dice("sid2", {})
)
# Should be serialized: start_X, end_X, start_Y, end_Y
assert results[0].startswith("start")
assert results[1].startswith("end")
assert results[0].split("_")[1] == results[1].split("_")[1]
@pytest.mark.asyncio
async def test_lock_timeout_raises_error(self):
"""Lock acquisition should timeout if held too long."""
@pytest.mark.asyncio
async def test_different_games_not_blocked(self):
"""Locks for different games should not block each other."""
```
## Verification Checklist
- [ ] All state-modifying handlers use `game_lock`
- [ ] Lock timeout prevents deadlocks
- [ ] Different games can proceed concurrently
- [ ] Same game operations are serialized
- [ ] Concurrency tests pass
- [ ] Manual test: rapid-fire decisions don't corrupt state
## Performance Considerations
- Lock contention should be minimal (turn-based game)
- 30-second timeout is generous for any operation
- Per-game locks allow parallel games to proceed
## Rollback Plan
If issues arise:
1. Remove lock calls from handlers
2. Add rate limiting as temporary mitigation
3. Investigate specific race condition scenarios
## Dependencies
- None (can be implemented independently)
- Recommended: Implement after Plan 001 (Authorization)
## Notes
- Consider adding lock metrics for monitoring
- May want to add retry logic for timeout scenarios
- Future: Consider optimistic locking for read-heavy operations