4.5 KiB
| title | description | type | domain | tags | |||||
|---|---|---|---|---|---|---|---|---|---|
| Fix: Gunicorn Worker Timeouts from Unbounded API Queries | External clients sent limit=99999 and empty filter params through the reverse proxy, causing API workers to timeout and get killed. | troubleshooting | major-domo |
|
Fix: Gunicorn Worker Timeouts from Unbounded API Queries
Date: 2026-04-01 PR: cal/major-domo-database#99 Issues: #98 (main), #100 (fieldingstats count bug), #101 (totalstats count overwrite, pre-existing) Severity: Critical — active production instability during Season 12, 12 worker timeouts in 2 days and accelerating
Problem
The monitoring app kept flagging the SBA API container (sba_db_api) as unhealthy and restarting it. Container logs showed repeated CRITICAL WORKER TIMEOUT and WARNING Worker was sent SIGABRT messages from Gunicorn. The container itself wasn't restarting (0 Docker restarts, up 2 weeks), but individual workers were being killed and respawned, causing brief API unavailability windows.
Root Cause
External clients (via nginx-proxy-manager at 172.25.0.3) were sending requests with limit=99999 and empty filter parameters (e.g., ?game_id=&pitcher_id=). The API had no defenses:
- No max limit cap on any endpoint except
/players/search(which hadle=50). Clients could request 99,999 rows. - Empty string params passed validation — FastAPI parsed
game_id=as[''], which passedif param is not Nonechecks but generated wasteful full-table-scan queries. /transactionshad no limit parameter at all — always returned every matching row with recursive serialization (model_to_dict(recurse=True)).- Recursive serialization amplified cost — each row triggered additional DB lookups for FK relations (player, team, etc.).
Combined, these caused queries to exceed the 120-second Gunicorn timeout, killing the worker.
IP Attribution Gotcha
Initial assumption was the Discord bot was the source (IP 172.25.0.3 was assumed to be the bot container). Docker IP mapping revealed 172.25.0.3 was actually nginx-proxy-manager — the queries came from external clients through the reverse proxy. The Discord bot is at 172.18.0.2 on a completely separate Docker network and generates none of these queries.
# Command to map container IPs
docker inspect --format='{{.Name}} {{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' $(docker ps -q)
Fix
PR #99 merged into main with the following changes (27 files, 503 insertions):
MAX_LIMIT=500andDEFAULT_LIMIT=200constants inapp/dependencies.py, enforced withle=MAX_LIMITacross all list endpointsstrip_empty_query_paramsmiddleware inapp/main.py— strips empty string values from query params before FastAPI parses them, so?game_id=is treated as absentlimit/offsetadded to/transactions— previously returned all rows; now defaults to 200, max 500, withtotal_countcomputed before pagination- 11 existing limit params capped with
le=MAX_LIMIT - 13 endpoints with no limit received
limit/offsetparams - Manual
if limit < 1guards removed — now handled by FastAPI'sge=1validation - 5 unit tests covering limit validation (422 on exceeding max, zero, negative), transaction response shape, and empty string stripping
- fieldingstats count bug fixed —
.count()was being called after.limit(), capping the reported count at the page size instead of total matching rows (#100)
Lessons
- Always verify container IP attribution before investigating the wrong service.
docker inspectwith format string is the canonical way to map IPs to container names. Don't assume based on Docker network proximity. - APIs should never trust client-provided limits — enforce
le=MAX_LIMITon every list endpoint. The only safe endpoint was/players/searchwhich had been properly capped atle=50. - Empty string params are a silent danger — FastAPI parses
?param=as[''], notNone. A global middleware is the right fix since it protects all endpoints including future ones. - Recursive serialization (
model_to_dict(recurse=True)) is O(n * related_objects) — dangerous on unbounded queries. Consider forcingshort_output=Truefor large result sets. - Heavy reformatting mixed with functional changes obscures bugs — the fieldingstats count bug was missed in review because the file had 262 lines of diff from quote/formatting changes. Separate cosmetic and functional changes into different commits.