claude-home/major-domo/troubleshooting-gunicorn-worker-timeouts.md

4.5 KiB

title description type domain tags
Fix: Gunicorn Worker Timeouts from Unbounded API Queries External clients sent limit=99999 and empty filter params through the reverse proxy, causing API workers to timeout and get killed. troubleshooting major-domo
troubleshooting
major-domo
database
deployment
docker

Fix: Gunicorn Worker Timeouts from Unbounded API Queries

Date: 2026-04-01 PR: cal/major-domo-database#99 Issues: #98 (main), #100 (fieldingstats count bug), #101 (totalstats count overwrite, pre-existing) Severity: Critical — active production instability during Season 12, 12 worker timeouts in 2 days and accelerating

Problem

The monitoring app kept flagging the SBA API container (sba_db_api) as unhealthy and restarting it. Container logs showed repeated CRITICAL WORKER TIMEOUT and WARNING Worker was sent SIGABRT messages from Gunicorn. The container itself wasn't restarting (0 Docker restarts, up 2 weeks), but individual workers were being killed and respawned, causing brief API unavailability windows.

Root Cause

External clients (via nginx-proxy-manager at 172.25.0.3) were sending requests with limit=99999 and empty filter parameters (e.g., ?game_id=&pitcher_id=). The API had no defenses:

  • No max limit cap on any endpoint except /players/search (which had le=50). Clients could request 99,999 rows.
  • Empty string params passed validation — FastAPI parsed game_id= as [''], which passed if param is not None checks but generated wasteful full-table-scan queries.
  • /transactions had no limit parameter at all — always returned every matching row with recursive serialization (model_to_dict(recurse=True)).
  • Recursive serialization amplified cost — each row triggered additional DB lookups for FK relations (player, team, etc.).

Combined, these caused queries to exceed the 120-second Gunicorn timeout, killing the worker.

IP Attribution Gotcha

Initial assumption was the Discord bot was the source (IP 172.25.0.3 was assumed to be the bot container). Docker IP mapping revealed 172.25.0.3 was actually nginx-proxy-manager — the queries came from external clients through the reverse proxy. The Discord bot is at 172.18.0.2 on a completely separate Docker network and generates none of these queries.

# Command to map container IPs
docker inspect --format='{{.Name}} {{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' $(docker ps -q)

Fix

PR #99 merged into main with the following changes (27 files, 503 insertions):

  1. MAX_LIMIT=500 and DEFAULT_LIMIT=200 constants in app/dependencies.py, enforced with le=MAX_LIMIT across all list endpoints
  2. strip_empty_query_params middleware in app/main.py — strips empty string values from query params before FastAPI parses them, so ?game_id= is treated as absent
  3. limit/offset added to /transactions — previously returned all rows; now defaults to 200, max 500, with total_count computed before pagination
  4. 11 existing limit params capped with le=MAX_LIMIT
  5. 13 endpoints with no limit received limit/offset params
  6. Manual if limit < 1 guards removed — now handled by FastAPI's ge=1 validation
  7. 5 unit tests covering limit validation (422 on exceeding max, zero, negative), transaction response shape, and empty string stripping
  8. fieldingstats count bug fixed.count() was being called after .limit(), capping the reported count at the page size instead of total matching rows (#100)

Lessons

  • Always verify container IP attribution before investigating the wrong service. docker inspect with format string is the canonical way to map IPs to container names. Don't assume based on Docker network proximity.
  • APIs should never trust client-provided limits — enforce le=MAX_LIMIT on every list endpoint. The only safe endpoint was /players/search which had been properly capped at le=50.
  • Empty string params are a silent danger — FastAPI parses ?param= as [''], not None. A global middleware is the right fix since it protects all endpoints including future ones.
  • Recursive serialization (model_to_dict(recurse=True)) is O(n * related_objects) — dangerous on unbounded queries. Consider forcing short_output=True for large result sets.
  • Heavy reformatting mixed with functional changes obscures bugs — the fieldingstats count bug was missed in review because the file had 262 lines of diff from quote/formatting changes. Separate cosmetic and functional changes into different commits.