docs: sync KB — troubleshooting-gunicorn-worker-timeouts.md
This commit is contained in:
parent
50125d8b39
commit
98c69617ff
59
major-domo/troubleshooting-gunicorn-worker-timeouts.md
Normal file
59
major-domo/troubleshooting-gunicorn-worker-timeouts.md
Normal file
@ -0,0 +1,59 @@
|
||||
---
|
||||
title: "Fix: Gunicorn Worker Timeouts from Unbounded API Queries"
|
||||
description: "External clients sent limit=99999 and empty filter params through the reverse proxy, causing API workers to timeout and get killed."
|
||||
type: troubleshooting
|
||||
domain: major-domo
|
||||
tags: [troubleshooting, major-domo, database, deployment, docker]
|
||||
---
|
||||
|
||||
# Fix: Gunicorn Worker Timeouts from Unbounded API Queries
|
||||
|
||||
**Date:** 2026-04-01
|
||||
**PR:** cal/major-domo-database#99
|
||||
**Issues:** #98 (main), #100 (fieldingstats count bug), #101 (totalstats count overwrite, pre-existing)
|
||||
**Severity:** Critical — active production instability during Season 12, 12 worker timeouts in 2 days and accelerating
|
||||
|
||||
## Problem
|
||||
|
||||
The monitoring app kept flagging the SBA API container (`sba_db_api`) as unhealthy and restarting it. Container logs showed repeated `CRITICAL WORKER TIMEOUT` and `WARNING Worker was sent SIGABRT` messages from Gunicorn. The container itself wasn't restarting (0 Docker restarts, up 2 weeks), but individual workers were being killed and respawned, causing brief API unavailability windows.
|
||||
|
||||
## Root Cause
|
||||
|
||||
External clients (via nginx-proxy-manager at `172.25.0.3`) were sending requests with `limit=99999` and empty filter parameters (e.g., `?game_id=&pitcher_id=`). The API had no defenses:
|
||||
|
||||
- **No max limit cap** on any endpoint except `/players/search` (which had `le=50`). Clients could request 99,999 rows.
|
||||
- **Empty string params passed validation** — FastAPI parsed `game_id=` as `['']`, which passed `if param is not None` checks but generated wasteful full-table-scan queries.
|
||||
- **`/transactions` had no limit parameter at all** — always returned every matching row with recursive serialization (`model_to_dict(recurse=True)`).
|
||||
- **Recursive serialization amplified cost** — each row triggered additional DB lookups for FK relations (player, team, etc.).
|
||||
|
||||
Combined, these caused queries to exceed the 120-second Gunicorn timeout, killing the worker.
|
||||
|
||||
### IP Attribution Gotcha
|
||||
|
||||
Initial assumption was the Discord bot was the source (IP `172.25.0.3` was assumed to be the bot container). Docker IP mapping revealed `172.25.0.3` was actually **nginx-proxy-manager** — the queries came from external clients through the reverse proxy. The Discord bot is at `172.18.0.2` on a completely separate Docker network and generates none of these queries.
|
||||
|
||||
```bash
|
||||
# Command to map container IPs
|
||||
docker inspect --format='{{.Name}} {{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' $(docker ps -q)
|
||||
```
|
||||
|
||||
## Fix
|
||||
|
||||
PR #99 merged into main with the following changes (27 files, 503 insertions):
|
||||
|
||||
1. **`MAX_LIMIT=500` and `DEFAULT_LIMIT=200` constants** in `app/dependencies.py`, enforced with `le=MAX_LIMIT` across all list endpoints
|
||||
2. **`strip_empty_query_params` middleware** in `app/main.py` — strips empty string values from query params before FastAPI parses them, so `?game_id=` is treated as absent
|
||||
3. **`limit`/`offset` added to `/transactions`** — previously returned all rows; now defaults to 200, max 500, with `total_count` computed before pagination
|
||||
4. **11 existing limit params capped** with `le=MAX_LIMIT`
|
||||
5. **13 endpoints with no limit** received `limit`/`offset` params
|
||||
6. **Manual `if limit < 1` guards removed** — now handled by FastAPI's `ge=1` validation
|
||||
7. **5 unit tests** covering limit validation (422 on exceeding max, zero, negative), transaction response shape, and empty string stripping
|
||||
8. **fieldingstats count bug fixed** — `.count()` was being called after `.limit()`, capping the reported count at the page size instead of total matching rows (#100)
|
||||
|
||||
## Lessons
|
||||
|
||||
- **Always verify container IP attribution** before investigating the wrong service. `docker inspect` with format string is the canonical way to map IPs to container names. Don't assume based on Docker network proximity.
|
||||
- **APIs should never trust client-provided limits** — enforce `le=MAX_LIMIT` on every list endpoint. The only safe endpoint was `/players/search` which had been properly capped at `le=50`.
|
||||
- **Empty string params are a silent danger** — FastAPI parses `?param=` as `['']`, not `None`. A global middleware is the right fix since it protects all endpoints including future ones.
|
||||
- **Recursive serialization (`model_to_dict(recurse=True)`) is O(n * related_objects)** — dangerous on unbounded queries. Consider forcing `short_output=True` for large result sets.
|
||||
- **Heavy reformatting mixed with functional changes obscures bugs** — the fieldingstats count bug was missed in review because the file had 262 lines of diff from quote/formatting changes. Separate cosmetic and functional changes into different commits.
|
||||
Loading…
Reference in New Issue
Block a user