API allows unbounded queries causing Gunicorn worker timeouts #98

New Issue

cal · 2026-04-01T21:46:04Z

cal commented

2026-04-01 21:46:04 +00:00

Problem

Gunicorn workers are repeatedly timing out (120s) and getting SIGABRT'd — 12 occurrences in the last 2 days. The monitoring stack detects the unhealthy window during worker recycling and restarts the container.

Root Cause

Requests arriving with limit=99999 or empty required filter parameters cause the API to attempt returning enormous result sets, tying up workers past the timeout threshold.

Source: External clients via reverse proxy

UPDATED: The source IP 172.25.0.3 is nginx-proxy-manager, NOT the Discord bot. These queries come from external clients (website users or direct API callers) through the reverse proxy. The Discord bot (172.18.0.2) is on a separate Docker network and does not generate these queries.

Observed bad requests (from container logs)

Request	Issue
`GET /plays/?limit=99999&game_id=&short_output=FALSE` → 500	Empty `game_id` = fetch ALL plays
`GET /plays/?game_id=&pitcher_id=&sort=oldest` → 500	Empty filters = entire plays table
`GET /players?season=9&team_id=&team_id=391` → 500	Duplicate/empty `team_id` params
`GET /plays/batting?limit=99999&...`	99,999 row limit on batting plays
`GET /transactions?limit=99999&...`	99,999 row limit on transactions

Current State of the API

Only /players/search has a max limit cap (le=50). Every other endpoint trusts the client.
Empty string params silently pass validation — FastAPI parses game_id= as [''], which passes if param is not None checks but generates wasteful queries.
/transactions has no limit parameter at all — always returns all matching rows with recursive serialization.
model_to_dict(recurse=True) amplifies the problem — each row triggers additional DB lookups for FK relations.

Proposed Fix

Priority 1 — Immediate (stop the crashes)

Add a global MAX_LIMIT constant (e.g., 500) and enforce le=500 across all list endpoints
Strip empty strings from list params — convert [''] to None via a shared dependency
Add a limit param to /transactions with default=200, max=500

Priority 2 — Short-term hardening

Add PostgreSQL statement_timeout = 30s as a safety net below the Gunicorn timeout
Force short_output=True when limit > 100 to prevent recursive serialization on large result sets
Log when clients request limit > 500 to identify problematic callers

Priority 3 — Architectural

Standardize pagination with a shared PaginationParams dependency across all list endpoints

Impact

Workers timeout and get killed every few hours
Brief API unavailability during worker respawn — can affect any client, including transaction submissions
Monitoring flags container as unhealthy, triggers unnecessary restarts
P1 priority — active production instability during Season 12

## Problem Gunicorn workers are repeatedly timing out (120s) and getting SIGABRT'd — 12 occurrences in the last 2 days. The monitoring stack detects the unhealthy window during worker recycling and restarts the container. ## Root Cause Requests arriving with `limit=99999` or empty required filter parameters cause the API to attempt returning enormous result sets, tying up workers past the timeout threshold. ### Source: External clients via reverse proxy **UPDATED:** The source IP `172.25.0.3` is **nginx-proxy-manager**, NOT the Discord bot. These queries come from external clients (website users or direct API callers) through the reverse proxy. The Discord bot (`172.18.0.2`) is on a separate Docker network and does not generate these queries. ### Observed bad requests (from container logs) | Request | Issue | |---|---| | `GET /plays/?limit=99999&game_id=&short_output=FALSE` → 500 | Empty `game_id` = fetch ALL plays | | `GET /plays/?game_id=&pitcher_id=&sort=oldest` → 500 | Empty filters = entire plays table | | `GET /players?season=9&team_id=&team_id=391` → 500 | Duplicate/empty `team_id` params | | `GET /plays/batting?limit=99999&...` | 99,999 row limit on batting plays | | `GET /transactions?limit=99999&...` | 99,999 row limit on transactions | ## Current State of the API - **Only `/players/search` has a max limit cap** (`le=50`). Every other endpoint trusts the client. - **Empty string params silently pass validation** — FastAPI parses `game_id=` as `['']`, which passes `if param is not None` checks but generates wasteful queries. - **`/transactions` has no limit parameter at all** — always returns all matching rows with recursive serialization. - **`model_to_dict(recurse=True)`** amplifies the problem — each row triggers additional DB lookups for FK relations. ## Proposed Fix ### Priority 1 — Immediate (stop the crashes) 1. **Add a global `MAX_LIMIT` constant** (e.g., 500) and enforce `le=500` across all list endpoints 2. **Strip empty strings from list params** — convert `['']` to `None` via a shared dependency 3. **Add a `limit` param to `/transactions`** with default=200, max=500 ### Priority 2 — Short-term hardening 4. **Add PostgreSQL `statement_timeout = 30s`** as a safety net below the Gunicorn timeout 5. **Force `short_output=True` when limit > 100** to prevent recursive serialization on large result sets 6. **Log when clients request limit > 500** to identify problematic callers ### Priority 3 — Architectural 7. **Standardize pagination** with a shared `PaginationParams` dependency across all list endpoints ## Impact - Workers timeout and get killed every few hours - Brief API unavailability during worker respawn — can affect any client, including transaction submissions - Monitoring flags container as unhealthy, triggers unnecessary restarts - **P1 priority** — active production instability during Season 12

cal added the

ai-working

label 2026-04-01 21:55:40 +00:00

cal referenced this issue from a commit

2026-04-01 22:23:35 +00:00

Fix unbounded API queries causing Gunicorn worker timeouts

cal referenced a pull request that will close this issue

2026-04-01 22:23:50 +00:00

Fix unbounded API queries causing worker timeouts #99

cal closed this issue

2026-04-01 22:44:41 +00:00

Sign in to join this conversation.