strat-gameplay-webapp/.claude/plans/002-websocket-locking.md

# Plan 002: WebSocket Handler Locking

**Priority**: CRITICAL
**Effort**: 2-3 hours
**Status**: NOT STARTED
**Risk Level**: HIGH - Data corruption

---

## Problem Statement

WebSocket handlers mutate `GameState` without acquiring per-game locks, allowing concurrent handlers to corrupt game state.

**Example Race Condition**:
```
Time    Client A (roll_dice)              Client B (roll_dice)
----    --------------------              --------------------
T1      Read pending_roll = None
T2                                        Read pending_roll = None
T3      Set pending_roll = RollA
T4                                        Set pending_roll = RollB  (overwrites!)
T5      Process with RollA
T6                                        Process with RollB (uses wrong roll!)
```
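In a single-threaded asyncio server, this interleaving requires the handler to yield control (any `await`) between the read and the write. The sketch below reproduces the timeline under that assumption. `FakeGameState`, `roll`, and `run` are hypothetical stand-ins, not code from the project.

```python
import asyncio
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FakeGameState:
    """Hypothetical stand-in for GameState; only the field involved in the race."""
    pending_roll: Optional[str] = None
    accepted: List[str] = field(default_factory=list)


async def roll(state: FakeGameState, roll_id: str,
               lock: Optional[asyncio.Lock] = None) -> None:
    """Check-then-set with an await in between, optionally guarded by a lock."""

    async def check_then_set() -> None:
        if state.pending_roll is not None:  # T1/T2: read
            return                          # a roll is already pending, reject
        await asyncio.sleep(0)              # any await lets another handler run
        state.pending_roll = roll_id        # T3/T4: write
        state.accepted.append(roll_id)

    if lock is None:
        await check_then_set()
    else:
        async with lock:
            await check_then_set()


async def run(locked: bool) -> FakeGameState:
    """Run two competing rolls against one state, with or without a lock."""
    state = FakeGameState()
    lock = asyncio.Lock() if locked else None
    await asyncio.gather(roll(state, "RollA", lock), roll(state, "RollB", lock))
    return state
```

Without the lock both rolls pass the check and the second overwrites the first. With the lock the second roll sees the pending value and is rejected.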

## Impact

- **Data Corruption**: Invalid game states
- **Invalid Outcomes**: Wrong play resolutions
- **User Trust**: Inconsistent game behavior

## Files to Modify

| File | Changes |
|------|---------|
| `backend/app/websocket/handlers.py` | Add lock acquisition to handlers |
| `backend/app/core/state_manager.py` | Expose lock acquisition method |

## Current Lock Implementation

The `StateManager` already has per-game locks:

```python
# backend/app/core/state_manager.py
class StateManager:
    def __init__(self):
        self._game_locks: dict[UUID, asyncio.Lock] = {}

    def _get_game_lock(self, game_id: UUID) -> asyncio.Lock:
        if game_id not in self._game_locks:
            self._game_locks[game_id] = asyncio.Lock()
        return self._game_locks[game_id]
```

**Problem**: Locks exist but handlers don't use them!

## Implementation Steps

### Step 1: Expose Lock Context Manager (15 min)

Update `backend/app/core/state_manager.py`:

```python
from contextlib import asynccontextmanager

class StateManager:
    # ... existing code ...

    @asynccontextmanager
    async def game_lock(self, game_id: UUID):
        """
        Acquire exclusive lock for game operations.
        Use this for any handler that modifies game state.

        Usage:
            async with state_manager.game_lock(game_id):
                # Perform state modifications
        """
        lock = self._get_game_lock(game_id)
        async with lock:
            yield
```
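Because `game_lock` wraps `async with lock`, the lock should be released even when the handler body raises, so a failed handler does not leave the game stuck. A minimal sketch checking that property, using `MiniStateManager` as a hypothetical trimmed copy of the locking parts of `StateManager`:

```python
import asyncio
from contextlib import asynccontextmanager
from uuid import UUID, uuid4


class MiniStateManager:
    """Hypothetical stand-in containing only the locking parts of StateManager."""

    def __init__(self) -> None:
        self._game_locks = {}  # maps game UUID -> asyncio.Lock

    def _get_game_lock(self, game_id: UUID) -> asyncio.Lock:
        if game_id not in self._game_locks:
            self._game_locks[game_id] = asyncio.Lock()
        return self._game_locks[game_id]

    @asynccontextmanager
    async def game_lock(self, game_id: UUID):
        async with self._get_game_lock(game_id):
            yield


async def failing_handler(manager: MiniStateManager, game_id: UUID) -> None:
    """Simulate a handler that raises while holding the lock."""
    async with manager.game_lock(game_id):
        raise ValueError("handler failed mid-update")


async def demo() -> bool:
    """Return whether the lock is still held after the handler failed."""
    manager = MiniStateManager()
    game_id = uuid4()
    try:
        await failing_handler(manager, game_id)
    except ValueError:
        pass
    return manager._get_game_lock(game_id).locked()
```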

### Step 2: Identify Handlers Requiring Locks (15 min)

Handlers that modify state (MUST lock):
- [x] `submit_defensive_decision` - modifies pending decisions
- [x] `submit_offensive_decision` - modifies pending decisions
- [x] `roll_dice` - modifies pending_manual_roll
- [x] `submit_manual_outcome` - modifies game state
- [x] `request_pinch_hitter` - modifies lineup
- [x] `request_defensive_replacement` - modifies lineup
- [x] `request_pitching_change` - modifies lineup

Handlers that only read (no lock needed):
- [ ] `join_game` - read only
- [ ] `leave_game` - read only
- [ ] `get_lineup` - read only
- [ ] `get_box_score` - read only
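One possible way to apply the lock uniformly to the handlers in the first list is a decorator, sketched below. This is an alternative to the explicit `async with` blocks used in the following steps, not part of the plan itself. `with_game_lock` is a hypothetical name, and the module-level `_locks` dict stands in for `StateManager._game_locks`.

```python
import asyncio
import functools
from uuid import UUID, uuid4

_locks = {}  # stand-in for StateManager._game_locks: game UUID -> asyncio.Lock


def with_game_lock(handler):
    """Hypothetical decorator: run handler(sid, data) under the lock for data["game_id"]."""

    @functools.wraps(handler)
    async def wrapper(sid, data):
        game_id = UUID(str(data["game_id"]))
        lock = _locks.setdefault(game_id, asyncio.Lock())
        async with lock:
            return await handler(sid, data)

    return wrapper


events = []


@with_game_lock
async def roll_dice(sid, data):
    """Toy handler body that records when it starts and finishes."""
    events.append(f"start_{sid}")
    await asyncio.sleep(0.01)  # simulate work while holding the lock
    events.append(f"end_{sid}")


async def demo():
    """Fire two handlers for the same game and return the recorded order."""
    events.clear()
    data = {"game_id": str(uuid4())}
    await asyncio.gather(roll_dice("sid1", data), roll_dice("sid2", data))
    return list(events)
```

Since `functools.wraps` preserves the function name, such a decorator would presumably sit below `@sio.event`, but that interaction is an assumption to verify against the project's Socket.IO setup.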

### Step 3: Update Decision Handlers (30 min)

```python
@sio.event
async def submit_defensive_decision(sid, data):
    game_id = UUID(data.get("game_id"))

    async with state_manager.game_lock(game_id):
        # All state modifications inside lock
        game_state = state_manager.get_game_state(game_id)
        if not game_state:
            await sio.emit("error", {"message": "Game not found"}, to=sid)
            return

        # Process decision
        decision = DefensiveDecision(**data.get("decision"))
        game_state.pending_defensive_decision = decision

        # Check if both decisions received
        if game_state.pending_offensive_decision:
            # Resolve play within lock
            result = await game_engine.resolve_play(game_state)
            await manager.broadcast_to_game(game_id, "play_resolved", result)
```

### Step 4: Update Roll/Outcome Handlers (30 min)

```python
@sio.event
async def roll_dice(sid, data):
    game_id = UUID(data.get("game_id"))
    roll_type = data.get("roll_type")

    async with state_manager.game_lock(game_id):
        # Read, validate, and write all happen inside the lock
        game_state = state_manager.get_game_state(game_id)
        if not game_state:
            await sio.emit("error", {"message": "Game not found"}, to=sid)
            return

        # Validate no pending roll
        if game_state.pending_manual_roll:
            await sio.emit("error", {"message": "Roll already pending"}, to=sid)
            return

        # Perform roll and set pending
        roll = dice.roll(roll_type, game_state.league_id)
        game_state.pending_manual_roll = roll

        await manager.broadcast_to_game(game_id, "dice_rolled", roll.to_dict())
```

### Step 5: Update Substitution Handlers (30 min)

```python
@sio.event
async def request_pinch_hitter(sid, data):
    game_id = UUID(data.get("game_id"))

    async with state_manager.game_lock(game_id):
        game_state = state_manager.get_game_state(game_id)
        if not game_state:
            await sio.emit("error", {"message": "Game not found"}, to=sid)
            return

        # Validate and process substitution
        result = await substitution_manager.process_pinch_hitter(
            game_state,
            entering_player_id=data.get("entering_player_id"),
            exiting_player_id=data.get("exiting_player_id")
        )

        if result.success:
            await manager.broadcast_to_game(game_id, "player_substituted", result.to_dict())
        else:
            await sio.emit("error", {"message": result.error}, to=sid)
```

### Step 6: Add Lock Timeout (15 min)

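Bound how long a handler waits for the lock. A timeout does not prevent a deadlock, but it turns an indefinite hang into an error that can be logged and reported. The lock must be released only by the call that actually acquired it: releasing in a shared `finally` guarded by `lock.locked()` would free a lock that is still held by another handler after a timeout. Note that `asyncio.Lock` is not reentrant, so code that already holds a game's lock must not call `game_lock` for the same game again.

```python
@asynccontextmanager
async def game_lock(self, game_id: UUID, timeout: float = 30.0):
    """Acquire game lock with timeout."""
    lock = self._get_game_lock(game_id)
    try:
        await asyncio.wait_for(lock.acquire(), timeout=timeout)
    except asyncio.TimeoutError:
        # Acquisition failed: the lock belongs to another handler, do not release it
        logger.error(f"Lock timeout for game {game_id}")
        raise RuntimeError(f"Could not acquire lock for game {game_id}") from None
    try:
        yield
    finally:
        # Only reached when this call acquired the lock
        lock.release()
```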
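A timed-out waiter must leave the lock in the hands of whichever handler currently holds it. The self-contained sketch below checks that property for a timeout-guarded acquisition. `lock_with_timeout` and `demo` are hypothetical names used only for illustration.

```python
import asyncio
from contextlib import asynccontextmanager


@asynccontextmanager
async def lock_with_timeout(lock: asyncio.Lock, timeout: float):
    """Acquire `lock`, or raise RuntimeError after `timeout` seconds."""
    try:
        await asyncio.wait_for(lock.acquire(), timeout=timeout)
    except asyncio.TimeoutError:
        raise RuntimeError("could not acquire lock") from None
    try:
        yield
    finally:
        lock.release()  # only reached if this call acquired the lock


async def demo():
    """Return (timed_out, still_held) for a waiter blocked by a held lock."""
    lock = asyncio.Lock()
    await lock.acquire()  # simulate a slow handler holding the lock
    timed_out = False
    try:
        async with lock_with_timeout(lock, timeout=0.05):
            pass
    except RuntimeError:
        timed_out = True
    still_held = lock.locked()  # the waiter must not have released the holder's lock
    lock.release()
    return timed_out, still_held
```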

### Step 7: Write Concurrency Tests (30 min)

Create `backend/tests/unit/websocket/test_handler_locking.py`:

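```python
import asyncio
from uuid import uuid4

import pytest

# Import path assumed from the backend/app layout; adjust to the project's packaging.
from app.core.state_manager import StateManager


class TestHandlerLocking:
    """Tests for WebSocket handler concurrency."""

    @pytest.mark.asyncio
    async def test_concurrent_rolls_serialized(self):
        """Two concurrent handlers for the same game should be serialized."""
        state_manager = StateManager()
        game_id = uuid4()
        results = []

        async def mock_roll_dice(sid):
            async with state_manager.game_lock(game_id):
                results.append(f"start_{sid}")
                await asyncio.sleep(0.1)  # Simulate work
                results.append(f"end_{sid}")

        # Launch concurrent handlers
        await asyncio.gather(mock_roll_dice("sid1"), mock_roll_dice("sid2"))

        # Should be serialized: start_X, end_X, start_Y, end_Y
        assert len(results) == 4
        for start, end in (results[0:2], results[2:4]):
            assert start.startswith("start_")
            assert end == start.replace("start_", "end_")

    @pytest.mark.asyncio
    async def test_lock_timeout_raises_error(self):
        """Lock acquisition should time out if the lock is held too long."""
        state_manager = StateManager()
        game_id = uuid4()

        async with state_manager.game_lock(game_id):
            with pytest.raises(RuntimeError):
                async with state_manager.game_lock(game_id, timeout=0.05):
                    pass

    @pytest.mark.asyncio
    async def test_different_games_not_blocked(self):
        """Locks for different games should not block each other."""
        state_manager = StateManager()

        async with state_manager.game_lock(uuid4()):
            # Raises RuntimeError if the first game's lock blocks this one
            async with state_manager.game_lock(uuid4(), timeout=0.05):
                pass
```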

## Verification Checklist

- [ ] All state-modifying handlers use `game_lock`
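- [ ] Lock timeout turns a stuck lock into an error instead of an indefinite hang
- [ ] A timed-out waiter does not release a lock held by another handler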
- [ ] Different games can proceed concurrently
- [ ] Same game operations are serialized
- [ ] Concurrency tests pass
- [ ] Manual test: rapid-fire decisions don't corrupt state

## Performance Considerations

- Lock contention should be minimal (turn-based game)
- 30-second timeout is generous for any operation
- Per-game locks allow parallel games to proceed
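To check the contention assumption in practice, the time each handler spends waiting for the lock can be recorded. The sketch below is one way to do that. `timed_lock` and the `waits` list are hypothetical and would presumably be replaced by whatever metrics facility the project uses.

```python
import asyncio
import time
from contextlib import asynccontextmanager


@asynccontextmanager
async def timed_lock(lock: asyncio.Lock, waits: list):
    """Hypothetical helper: acquire `lock` and record the wait in seconds."""
    started = time.monotonic()
    async with lock:
        waits.append(time.monotonic() - started)
        yield


async def handler(lock: asyncio.Lock, waits: list, hold: float) -> None:
    """Hold the lock for `hold` seconds to simulate handler work."""
    async with timed_lock(lock, waits):
        await asyncio.sleep(hold)


async def demo() -> list:
    """Run two handlers on the same lock and return their recorded waits."""
    lock = asyncio.Lock()
    waits = []
    # Both handlers target the same game, so the second waits for the first.
    await asyncio.gather(handler(lock, waits, 0.1), handler(lock, waits, 0.0))
    return waits
```

The first recorded wait should be close to zero, and the second should be roughly the time the first handler held the lock.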

## Rollback Plan

If issues arise:
1. Remove lock calls from handlers
2. Add rate limiting as temporary mitigation
3. Investigate specific race condition scenarios

## Dependencies

- None (can be implemented independently)
- Recommended: Implement after Plan 001 (Authorization)

## Notes

- Consider adding lock metrics for monitoring
- May want to add retry logic for timeout scenarios
- Future: Consider optimistic locking for read-heavy operations