diff --git a/.claude/WEBSOCKET_TROUBLESHOOTING.md b/.claude/WEBSOCKET_TROUBLESHOOTING.md
new file mode 100644
index 0000000..434c0e1
--- /dev/null
+++ b/.claude/WEBSOCKET_TROUBLESHOOTING.md
@@ -0,0 +1,227 @@
+# WebSocket Connection Troubleshooting Guide
+
+**Last Updated**: 2025-01-29
+**Issue Resolved**: Intermittent WebSocket connection failures through reverse proxy
+
+## Problem Statement
+
+WebSocket connections would intermittently fail when accessing the application through the production URL (`gameplay-demo.manticorum.com`). The connection would show "Socket exists: no" and the browser wouldn't even attempt to make a WebSocket connection. Mysteriously, adding debug code or making minor changes would cause the connection to "randomly start working."
+
+## Architecture Overview
+
+```
+Browser → Cloudflare → Nginx Proxy Manager → Backend (FastAPI + Socket.io)
+                                                ↓
+                                             Port 8000
+```
+
+## Root Causes Identified
+
+### 1. Nginx Proxy Manager Configuration Issues
+
+**Symptoms**:
+- Direct connection to backend (localhost:8000) returned HTTP 101 (success)
+- Connection through NPM returned HTTP 400 or 500
+
+**Causes Found**:
+- HTTP/2 stripping WebSocket upgrade headers (Cloudflare enables HTTP/2 by default)
+- Access lists blocking non-Cloudflare IPs when Cloudflare proxy was disabled for testing
+- Custom location blocks not inheriting WebSocket headers from server-level config
+
+**Solution**:
+- Disable HTTP/2 in NPM for WebSocket hosts (or use Cloudflare proxy which handles this)
+- Remove restrictive access lists during development
+- Use NPM's built-in "WebSocket Support" toggle rather than manual header configuration
+- Route `/socket.io` to port 8000 via NPM GUI custom location
+
+### 2. Conflicting Socket.io Plugin
+
+**Symptoms**:
+- "Socket exists: no" in UI debug output
+- No `/socket.io` network requests appearing in logs
+- Browser not even attempting WebSocket connection
+
+**Cause**:
+File `frontend-sba/plugins/socket.client.ts` was using JWT token authentication, conflicting with the cookie-based authentication in `useWebSocket.ts`. Both were trying to manage the socket connection.
+
+**Solution**:
+Deleted/disabled `plugins/socket.client.ts`. The `useWebSocket.ts` composable handles all WebSocket management with cookie-based auth.
+
+### 3. SSR/Hydration State Corruption (Primary Root Cause)
+
+**Symptoms**:
+- Connection would "randomly work" after adding debug code or making changes
+- No functional difference in code changes, but behavior changed
+- Intermittent failures that seemed timing-dependent
+
+**Cause**:
+Module-level singleton state in `useWebSocket.ts` was being initialized during SSR (server-side rendering), then persisting in a corrupted state during client hydration. The state variables:
+
+```typescript
+// BEFORE (problematic)
+let socketInstance: Socket | null = null
+let reconnectionAttempts = 0
+let reconnectionTimeout: NodeJS.Timeout | null = null
+```
+
+These were initialized on the server during SSR, potentially set to intermediate values, then the same state object was "reused" on the client during hydration instead of being freshly initialized.
+
+**Solution**:
+Refactored to use lazy client-only initialization:
+
+```typescript
+// AFTER (fixed)
+interface ClientState {
+  socketInstance: Socket | null
+  reconnectionAttempts: number
+  reconnectionTimeout: ReturnType<typeof setTimeout> | null
+  // ... other state
+  initialized: boolean
+}
+
+let clientState: ClientState | null = null
+
+function getClientState(): ClientState {
+  if (import.meta.client && !clientState) {
+    clientState = { /* fresh initialization */ }
+  }
+  return clientState || { /* empty fallback for SSR */ }
+}
+
+// Reset reactive state on client hydration
+if (import.meta.client) {
+  isConnected.value = false
+  isConnecting.value = false
+  // ...
+}
+```
+
+Added `import.meta.client` guards to all functions that interact with the socket.
+
+## Diagnostic Commands
+
+### Test Direct Backend Connection
+```bash
+curl -v -H "Upgrade: websocket" \
+  -H "Connection: Upgrade" \
+  -H "Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==" \
+  -H "Sec-WebSocket-Version: 13" \
+  "http://localhost:8000/socket.io/?EIO=4&transport=websocket"
+# Should return HTTP 101 Switching Protocols
+```
+
+### Test Through Proxy
+```bash
+curl -v -H "Upgrade: websocket" \
+  -H "Connection: Upgrade" \
+  -H "Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==" \
+  -H "Sec-WebSocket-Version: 13" \
+  "https://gameplay-demo.manticorum.com/socket.io/?EIO=4&transport=websocket"
+# Should also return HTTP 101
+```
+
+### Check NPM Configuration
+```bash
+ssh homelab # or your NPM server
+cat /data/nginx/proxy_host/*.conf | grep -A 20 "gameplay-demo"
+```
+
+### Monitor Backend Logs
+```bash
+tail -f /tmp/backend_live.log | grep -i socket
+```
+
+## NPM Configuration for WebSocket
+
+Working configuration for `gameplay-demo.manticorum.com`:
+
+1. **Proxy Host Settings**:
+   - Forward Hostname: `10.10.0.16` (or your backend host)
+   - Forward Port: `8000`
+   - WebSocket Support: **ENABLED** (toggle in GUI)
+   - HTTP/2 Support: **DISABLED** (for WebSocket compatibility)
+
+2. **Custom Location** (via GUI, not Advanced config):
+   - Location: `/socket.io`
+   - Forward Host: `10.10.0.16`
+   - Forward Port: `8000`
+
+3. **No Access Lists** blocking during development
+
+## Key Files
+
+| File | Purpose |
+|------|---------|
+| `frontend-sba/composables/useWebSocket.ts` | Main WebSocket management with SSR-safe patterns |
+| `frontend-sba/composables/useGameActions.ts` | Game action wrappers using the socket |
+| `backend/app/websocket/handlers.py` | Backend Socket.io event handlers |
+| `backend/app/websocket/auth.py` | Cookie-based authentication for WebSocket |
+
+## SSR-Safe Patterns for WebSocket in Nuxt
+
+### DO:
+```typescript
+// Lazy initialization only on client
+function getClientState(): ClientState {
+  if (import.meta.client && !clientState) {
+    clientState = { /* init */ }
+  }
+  return clientState || { /* fallback */ }
+}
+
+// Guard all socket operations
+function connect() {
+  if (!import.meta.client) return
+  // ...
+}
+
+// Reset state on hydration
+if (import.meta.client) {
+  isConnected.value = false
+}
+
+// Use ReturnType<typeof setTimeout> for timer types (SSR compatible)
+let timeout: ReturnType<typeof setTimeout> | null = null
+```
+
+### DON'T:
+```typescript
+// Module-level initialization (runs on server!)
+let socket = io(url) // BAD
+
+// NodeJS types (not available in browser)
+let timeout: NodeJS.Timeout // BAD
+
+// Immediate watchers without guards
+watch(() => auth, () => connect(), { immediate: true }) // BAD
+```
+
+## Cloudflare Considerations
+
+- Cloudflare Pro/Business/Enterprise can disable HTTP/2 per-domain
+- Free plan cannot disable HTTP/2, but Cloudflare proxy handles WebSocket upgrade correctly
+- When testing without Cloudflare proxy (orange cloud off), ensure NPM handles HTTP/2 → HTTP/1.1 downgrade
+- WebSocket connections through Cloudflare have a 100-second idle timeout
+
+## Debugging Tips
+
+1. **Check if socket instance exists**: Add `console.log('[WebSocket] Socket exists:', !!state.socketInstance)`
+
+2. **Verify SSR vs Client**: Add `console.log('[WebSocket] Running on:', import.meta.client ? 'client' : 'server')`
+
+3. **Monitor hydration**: Log in module scope `if (import.meta.client) { console.log('Module loaded on client') }`
+
+4. **Check network tab**: Filter by "WS" to see WebSocket connections, look for 101 status
+
+5. **Backend logs**: Look for "Socket.io connection" messages to confirm backend receives the connection
+
+## Related Commits
+
+- `CLAUDE: Improve service scripts and fix WebSocket plugin conflict` - Removed conflicting plugin
+- `CLAUDE: Fix SSR/hydration issues in WebSocket composable` - Main SSR fix
+
+## References
+
+- [Nuxt 3 SSR Documentation](https://nuxt.com/docs/guide/concepts/rendering)
+- [Socket.io Client Documentation](https://socket.io/docs/v4/client-initialization/)
+- [Nginx Proxy Manager WebSocket Support](https://nginxproxymanager.com/advanced-config/)
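
Review note: the guide's `getClientState()` snippets elide the actual initialization, so the lazy client-only pattern may be easier to verify with a filled-in version. The following is a minimal sketch, not the code in `useWebSocket.ts`. It assumes a hypothetical `createStateAccessor` helper and an injected `isClient` callback standing in for Nuxt's `import.meta.client`, so that it can run outside Nuxt. The socket field is omitted to keep the sketch dependency-free.

```typescript
// Hypothetical, trimmed-down version of the guide's ClientState.
interface ClientState {
  reconnectionAttempts: number
  reconnectionTimeout: ReturnType<typeof setTimeout> | null
  initialized: boolean
}

// Builds a fresh state object; `initialized` records which side created it.
function freshState(initialized: boolean): ClientState {
  return { reconnectionAttempts: 0, reconnectionTimeout: null, initialized }
}

// Hypothetical helper: returns an accessor that caches state only on the client.
// `isClient` is an assumption standing in for `import.meta.client`.
function createStateAccessor(isClient: () => boolean): () => ClientState {
  let clientState: ClientState | null = null

  return function getClientState(): ClientState {
    if (isClient()) {
      // First client-side call creates the singleton; later calls reuse it.
      if (!clientState) {
        clientState = freshState(true)
      }
      return clientState
    }
    // On the server, hand back a throwaway object that is never cached,
    // so nothing written during SSR can leak into a later call.
    return freshState(false)
  }
}
```

In a Nuxt composable, the accessor would presumably be created once with `() => import.meta.client`. The point of the design is that server-side writes land on a disposable object, while the client always starts from a freshly built singleton.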