diff --git a/major-domo/troubleshooting-gunicorn-worker-timeouts.md b/major-domo/troubleshooting-gunicorn-worker-timeouts.md new file mode 100644 index 0000000..b474b6a --- /dev/null +++ b/major-domo/troubleshooting-gunicorn-worker-timeouts.md @@ -0,0 +1,59 @@ +--- +title: "Fix: Gunicorn Worker Timeouts from Unbounded API Queries" +description: "External clients sent limit=99999 and empty filter params through the reverse proxy, causing API workers to timeout and get killed." +type: troubleshooting +domain: major-domo +tags: [troubleshooting, major-domo, database, deployment, docker] +--- + +# Fix: Gunicorn Worker Timeouts from Unbounded API Queries + +**Date:** 2026-04-01 +**PR:** cal/major-domo-database#99 +**Issues:** #98 (main), #100 (fieldingstats count bug), #101 (totalstats count overwrite, pre-existing) +**Severity:** Critical — active production instability during Season 12, 12 worker timeouts in 2 days and accelerating + +## Problem + +The monitoring app kept flagging the SBA API container (`sba_db_api`) as unhealthy and restarting it. Container logs showed repeated `CRITICAL WORKER TIMEOUT` and `WARNING Worker was sent SIGABRT` messages from Gunicorn. The container itself wasn't restarting (0 Docker restarts, up 2 weeks), but individual workers were being killed and respawned, causing brief API unavailability windows. + +## Root Cause + +External clients (via nginx-proxy-manager at `172.25.0.3`) were sending requests with `limit=99999` and empty filter parameters (e.g., `?game_id=&pitcher_id=`). The API had no defenses: + +- **No max limit cap** on any endpoint except `/players/search` (which had `le=50`). Clients could request 99,999 rows. +- **Empty string params passed validation** — FastAPI parsed `game_id=` as `['']`, which passed `if param is not None` checks but generated wasteful full-table-scan queries. +- **`/transactions` had no limit parameter at all** — always returned every matching row with recursive serialization (`model_to_dict(recurse=True)`). +- **Recursive serialization amplified cost** — each row triggered additional DB lookups for FK relations (player, team, etc.). + +Combined, these caused queries to exceed the 120-second Gunicorn timeout, killing the worker. + +### IP Attribution Gotcha + +Initial assumption was the Discord bot was the source (IP `172.25.0.3` was assumed to be the bot container). Docker IP mapping revealed `172.25.0.3` was actually **nginx-proxy-manager** — the queries came from external clients through the reverse proxy. The Discord bot is at `172.18.0.2` on a completely separate Docker network and generates none of these queries. + +```bash +# Command to map container IPs +docker inspect --format='{{.Name}} {{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' $(docker ps -q) +``` + +## Fix + +PR #99 merged into main with the following changes (27 files, 503 insertions): + +1. **`MAX_LIMIT=500` and `DEFAULT_LIMIT=200` constants** in `app/dependencies.py`, enforced with `le=MAX_LIMIT` across all list endpoints +2. **`strip_empty_query_params` middleware** in `app/main.py` — strips empty string values from query params before FastAPI parses them, so `?game_id=` is treated as absent +3. **`limit`/`offset` added to `/transactions`** — previously returned all rows; now defaults to 200, max 500, with `total_count` computed before pagination +4. **11 existing limit params capped** with `le=MAX_LIMIT` +5. **13 endpoints with no limit** received `limit`/`offset` params +6. **Manual `if limit < 1` guards removed** — now handled by FastAPI's `ge=1` validation +7. **5 unit tests** covering limit validation (422 on exceeding max, zero, negative), transaction response shape, and empty string stripping +8. **fieldingstats count bug fixed** — `.count()` was being called after `.limit()`, capping the reported count at the page size instead of total matching rows (#100) + +## Lessons + +- **Always verify container IP attribution** before investigating the wrong service. `docker inspect` with format string is the canonical way to map IPs to container names. Don't assume based on Docker network proximity. +- **APIs should never trust client-provided limits** — enforce `le=MAX_LIMIT` on every list endpoint. The only safe endpoint was `/players/search` which had been properly capped at `le=50`. +- **Empty string params are a silent danger** — FastAPI parses `?param=` as `['']`, not `None`. A global middleware is the right fix since it protects all endpoints including future ones. +- **Recursive serialization (`model_to_dict(recurse=True)`) is O(n * related_objects)** — dangerous on unbounded queries. Consider forcing `short_output=True` for large result sets. +- **Heavy reformatting mixed with functional changes obscures bugs** — the fieldingstats count bug was missed in review because the file had 262 lines of diff from quote/formatting changes. Separate cosmetic and functional changes into different commits.