feat: concurrent upload pipeline (WP-04) (#91) #25

Closed
Claude wants to merge 0 commits from ai/paper-dynasty-database#91 into main
Collaborator

Closes paper-dynasty-database#91

Summary

Replaces the sequential for x in all_players: upload loop in pd_cards/core/upload.py with a semaphore-bounded asyncio.gather pipeline, satisfying all WP-04 acceptance criteria.

Changes

pd_cards/core/upload.py

  • New process_single_card(player) coroutine wrapping all per-card work (fetch + S3 upload + URL patch)
  • asyncio.Semaphore(concurrency) limits concurrent tasks (default 8)
  • asyncio.gather(*tasks, return_exceptions=True) replaces the sequential loop
  • loop.run_in_executor(None, upload_card_to_s3, ...) offloads synchronous boto3 calls to thread pool
  • fetch_card_image timeout increased from 6s → 10s
  • Individual card failures logged but do not abort the batch
  • Progress reported via logger.info every 20 completions
  • concurrency parameter added to upload_cards_to_s3()
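For illustration, the pipeline these bullets describe can be sketched as below. upload_cards_concurrently, process_one, and the player values are hypothetical stand-ins for the real upload_cards_to_s3 / process_single_card code, not the actual implementation:

```python
import asyncio

async def upload_cards_concurrently(players, process_one, concurrency=8):
    """Run process_one(player) for every player, at most `concurrency` at a time."""
    semaphore = asyncio.Semaphore(concurrency)

    async def bounded(player):
        # The semaphore caps how many per-card coroutines run concurrently.
        async with semaphore:
            return await process_one(player)

    tasks = [bounded(p) for p in players]
    # return_exceptions=True keeps one failed card from aborting the whole batch.
    return await asyncio.gather(*tasks, return_exceptions=True)
```

The semaphore bounds concurrency without pre-chunking the work list, and gather preserves input order in its results.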

pd_cards/commands/upload.py

  • --concurrency / -j CLI option added (default: 8)
  • Passed through to upload_cards_to_s3()
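The flag wiring is roughly equivalent to the following sketch. This uses argparse for self-containment; the actual pd-cards CLI may be built on a different framework, and the parser name here is hypothetical:

```python
import argparse

# Hypothetical stand-in for the pd-cards upload command's argument parsing.
parser = argparse.ArgumentParser(prog="pd-cards upload")
parser.add_argument(
    "--concurrency", "-j", type=int, default=8,
    help="maximum number of cards processed in parallel (default: 8)",
)

args = parser.parse_args(["-j", "16"])
# args.concurrency is then passed through to the core upload function.
```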

check_cards_and_upload.py (WP-05)

  • Same concurrency pattern applied to the legacy script

scripts/benchmark_render.sh + scripts/benchmark_results.txt (WP-00)

  • Benchmark script and baseline results for Phase 0 measurements

Acceptance Criteria Verified

  1. Default concurrency of 8 parallel card processes (Semaphore(concurrency) default=8)
  2. Individual failures logged, batch continues (return_exceptions=True + per-card try/except)
  3. fetch_card_image timeout is 10s
  4. --concurrency flag available on CLI

Other observations

  • refresh_card_images() still uses sequential loops — it was out of scope for WP-04 but could be a follow-up
  • Two asyncio.Lock objects (progress_lock, results_lock) guard shared mutable state; these could be replaced with atomic counters if overhead becomes a concern at high concurrency
Claude added 1 commit 2026-03-13 11:04:34 +00:00
- Replace sequential upload loop with asyncio.gather + Semaphore(8) (WP-04)
- Offload synchronous boto3 S3 calls to thread pool executor
- Increase fetch_card_image timeout from 6s to 10s
- Add --concurrency/-j CLI flag to pd-cards upload
- Add progress reporting every 20 completions
- Individual card failures no longer abort batch
- Apply same concurrency pattern to legacy check_cards_and_upload.py (WP-05)
- Add benchmark script for render pipeline measurements (WP-00)

Target: 800-card upload from ~40 min to <5 min (with server-side
persistent browser deployed).

Refs: #87, #91, #92

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Owner

Superseded by #27 which includes WP-00, WP-04, and WP-05 in a single commit with the same implementation. Closing as duplicate.

cal closed this pull request 2026-03-13 13:49:02 +00:00
cal reviewed 2026-03-13 13:50:00 +00:00
cal left a comment
Owner

AI Code Review

Files Reviewed

  • pd_cards/core/upload.py (modified — concurrent pipeline WP-04)
  • pd_cards/commands/upload.py (modified — --concurrency / -j CLI flag)
  • check_cards_and_upload.py (modified — WP-05 legacy script)
  • scripts/benchmark_render.sh (added — WP-00)
  • scripts/benchmark_results.txt (added — WP-00 baseline)

IMPORTANT: Duplicate Work Notice

This PR's head commit (979f308) is identical to the tip of the feature/render-pipeline-optimization branch. Both branches contain the same single commit on top of main (336014b). The PR title says "auto-generated by the issue worker" but the commit author is cal with timestamp 2026-03-12 23:53:56. This is not duplicate AI work that diverged from a manual implementation — it IS the manual implementation, pushed onto the ai/paper-dynasty-database#91 branch. Merging this PR would apply the same changes as merging feature/render-pipeline-optimization. No code conflicts to resolve, but you only need to merge one of the two — or close this PR and merge via feature/render-pipeline-optimization directly.


Findings

Correctness

asyncio.get_event_loop() inside a coroutine — both pd_cards/core/upload.py (line 211) and check_cards_and_upload.py (line 173) call asyncio.get_event_loop() from inside an async function. When called from a running coroutine this works correctly (it returns the running loop), but the idiomatic and forward-compatible form is asyncio.get_running_loop(), which was added in Python 3.7 precisely for this use-case. get_event_loop() is deprecated-in-context on Python 3.10+ when there is no current loop (though that is not the situation here). This is a minor style/future-proofing issue, not a runtime bug.
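A minimal demonstration of the behavioral difference described above:

```python
import asyncio

async def inside():
    # Inside a running coroutine, both calls return the same loop object.
    return asyncio.get_running_loop() is asyncio.get_event_loop()

same_loop = asyncio.run(inside())

# Outside any running loop, get_running_loop() fails loudly with RuntimeError
# instead of silently creating (or fetching) a loop.
try:
    asyncio.get_running_loop()
    raised = False
except RuntimeError:
    raised = True
```

The loud failure is the safeguard the review refers to: a call site accidentally invoked outside a loop surfaces immediately rather than acquiring a loop that nothing is running.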

functools.partial asymmetry — check_cards_and_upload.py imports functools and uses functools.partial to pass arguments to run_in_executor because upload_card_to_s3 in that file uses module-level globals for the client and bucket. pd_cards/core/upload.py passes positional arguments directly to run_in_executor (no functools.partial needed since upload_card_to_s3 there accepts all arguments explicitly). This is correct and intentional given the design difference between the two files.
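The partial-vs-positional distinction matters because run_in_executor forwards only positional arguments. A sketch with a stand-in for the synchronous boto3 call (upload_blocking and the bucket/key names are illustrative, not the project's API):

```python
import asyncio
import functools

def upload_blocking(key, body, bucket="example-bucket"):
    # Stand-in for a synchronous boto3 put_object call.
    return f"s3://{bucket}/{key}"

async def upload(key, body):
    loop = asyncio.get_running_loop()
    # run_in_executor(executor, func, *args) accepts positional args only,
    # so keyword arguments must be bound ahead of time with functools.partial.
    fn = functools.partial(upload_blocking, key, body, bucket="cards")
    return await loop.run_in_executor(None, fn)

url = asyncio.run(upload("smith.png", b"..."))
```

When every argument can be passed positionally, the partial wrapper is unnecessary, which is exactly the asymmetry between the two files.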

session captured via closure — process_single_card is defined as an inner function inside the async with aiohttp.ClientSession() as session: block in both files, and tasks are spawned within that same block. The session is properly in scope for all concurrent tasks throughout asyncio.gather. No use-after-close risk.

asyncio.gather(..., return_exceptions=True) with inner try/except — return values from process_single_card are None (or exceptions swallowed by return_exceptions=True). Actual errors are routed into the errors list inside the coroutine. The outer return_exceptions=True is a safety net against any exception that slips past the inner handlers. This is correct.
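The two-layer error handling described here can be reduced to a small sketch (the safe wrapper and the ok/boom coroutines are illustrative names, not code from the PR):

```python
import asyncio

async def safe(coro, errors):
    try:
        return await coro
    except Exception as exc:
        # Inner handler: record the failure and let the batch continue.
        errors.append(exc)
        return None

async def main():
    errors = []

    async def ok():
        return "done"

    async def boom():
        raise ValueError("bad card")

    results = await asyncio.gather(
        safe(ok(), errors),
        safe(boom(), errors),
        # Outer safety net: any exception that slips past the inner handlers
        # is returned in-place instead of cancelling sibling tasks.
        return_exceptions=True,
    )
    return results, errors

results, errors = asyncio.run(main())
```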

Progress counter in report_progress — completed is mutated under progress_lock and the nonlocal declaration is correct. No race condition.
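The nonlocal-counter-under-lock pattern being checked here looks roughly like this (run_batch and the 20-completion threshold mirror the description; the function names are illustrative):

```python
import asyncio

async def run_batch(n):
    completed = 0
    progress_lock = asyncio.Lock()

    async def report_progress():
        nonlocal completed
        # The lock serializes the read-modify-write on the shared counter.
        async with progress_lock:
            completed += 1
            if completed % 20 == 0:
                pass  # logger.info(...) progress line in the real code

    await asyncio.gather(*(report_progress() for _ in range(n)))
    return completed

total = asyncio.run(run_batch(50))
```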

image2 secondary-card success tracking in core/upload.py — when a two-way player has image2, the successes.append(x) call is placed only in the else branch of the image2 block. If image2 processing raises an exception, x is added to errors but not successes — which is the correct behavior. The primary-only path for non-two-way players also correctly appends to successes in the else: successes.append(x) at the bottom.

No correctness bugs found.

Security

No new security surface introduced. S3 credentials flow through boto3's standard credential chain (environment / ~/.aws/credentials), consistent with the existing pattern. No hardcoded secrets. The semaphore prevents accidental unbounded parallelism that could exhaust file descriptors or overwhelm the API.

Style & Conventions

benchmark_results.txt committed to repository — the results file is a point-in-time snapshot from a single dev-API run (2026-03-12). Committing benchmark output alongside the benchmark script is reasonable for a Phase 0 baseline, but consider whether this file should be .gitignored for future runs so stale results do not accumulate. Currently not a blocker.

check_cards_and_upload.py: print() vs logger — the legacy script mixes print() for progress and logger for structured output. The new code adds more print() calls for the summary header. This is consistent with the existing style of that file (it predates the logger module) and is not a regression.

Two-lock design (progress_lock, results_lock) — the PR description acknowledges these could be replaced with atomic counters. At concurrency=8 the lock contention is negligible. No action needed.

Suggestions

  • Replace asyncio.get_event_loop() with asyncio.get_running_loop() in both pd_cards/core/upload.py (line 211) and check_cards_and_upload.py (line 173). get_running_loop() raises RuntimeError if called outside a running loop (a helpful safeguard) whereas get_event_loop() would silently create a new loop in that scenario.
  • Consider adding scripts/benchmark_results.txt to .gitignore or renaming it to benchmark_results_baseline.txt to signal it is a one-time reference rather than an accumulating log.

Verdict: COMMENT

The implementation is correct and satisfies all WP-04 acceptance criteria. The concurrency model (semaphore-bounded asyncio.gather, boto3 offloaded to thread pool, individual failure isolation) is sound. The two suggestions above are minor and do not block merging.

Primary action required: determine which branch to merge. This PR (ai/paper-dynasty-database#91) and feature/render-pipeline-optimization both point to the same commit. Close one and merge the other, or simply merge this PR and delete the duplicate branch.


Automated review by Claude PR Reviewer


Pull request closed

Reference: cal/paper-dynasty-card-creation#25