docs: sync KB — docker-buildx-cache-400-error.md

Cal Corum 2026-03-23 22:00:43 -05:00
parent 7bea39b39b
commit 36aa78e591

---
title: "Fix: Docker buildx cache 400 error — migrated to local volume cache"
description: "Registry buildx cache caused 400 errors; permanent fix is local volume cache on the Gitea Actions runner."
type: troubleshooting
domain: development
tags: [troubleshooting, docker, gitea, ci]
## Lessons
- Monitor buildx builder container accumulation on the Gitea runner — if more than 2-3 are lingering, clean them up proactively
- Consider adding a cleanup step to the CI workflow that prunes old builders after successful builds
- The `cache-to: type=registry` directive in the workflow is the trigger — without registry caching this wouldn't happen, but removing it would slow builds significantly
- `type=registry` cache is unreliable on a single-runner setup — stale builders accumulate and corrupt cache state
- Killing stale builders is a temporary fix only
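The cleanup step suggested above could be sketched as a small shell helper. This is a sketch only: the `buildx_buildkit_` prefix matches the default names `docker buildx` gives its builder containers, and the destructive `docker rm` pipeline is left commented out.

```bash
#!/bin/sh
# List container names that look like stale buildx builders so they can
# be pruned after a successful build. Reads names on stdin, one per line.
list_stale_builders() {
  grep '^buildx_buildkit_' || true
}

# In a CI cleanup step this would be fed from docker, e.g.:
#   docker ps -a --format '{{.Names}}' | list_stale_builders | xargs -r docker rm -f
```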
---
## Permanent Fix: Local Volume Buildx Cache (2026-03-24)
**Severity:** N/A — preventive infrastructure change
**Problem:** The `type=registry` cache kept failing with 400 errors. Killing stale builders was a manual band-aid.
**Root Cause:** Each CI build creates a new buildx builder container. On a single persistent runner (`gitea/act_runner`, `--restart unless-stopped`), these accumulate and corrupt the Docker Hub registry cache.
**Fix:** Switched all workflows from `type=registry` to `type=local` backed by a named Docker volume.
### Setup (one-time, on gitea runner host)
```bash
# Create named volume
docker volume create pd-buildx-cache
# Update /etc/gitea/runner-config.yaml
# valid_volumes:
# - pd-buildx-cache
# Recreate runner container with new volume mount
docker run -d --name gitea-runner --restart unless-stopped \
-v /etc/gitea/runner-config.yaml:/config.yaml:ro \
-v /var/run/docker.sock:/var/run/docker.sock \
-v gitea-runner-data:/data \
-v pd-buildx-cache:/opt/buildx-cache \
gitea/act_runner:latest
```
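Before recreating the runner it may be worth confirming the volume is actually whitelisted; a minimal sketch, assuming the config path and `valid_volumes` key from the setup above (the helper name is hypothetical, and it greps for the list entry anywhere in the file rather than parsing YAML, so treat it as a sanity check only):

```bash
#!/bin/sh
# Check that a volume name appears as a YAML list entry ("- name") in
# the runner config. $1 = config file, $2 = volume name.
has_valid_volume() {
  grep -Eq "^[[:space:]]*-[[:space:]]*$2[[:space:]]*$" "$1"
}

# Example: has_valid_volume /etc/gitea/runner-config.yaml pd-buildx-cache
```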
### Workflow changes
1. Add `container.volumes` to mount the named volume into job containers:
```yaml
jobs:
build:
runs-on: ubuntu-latest
container:
volumes:
- pd-buildx-cache:/opt/buildx-cache
```
2. Replace cache directives (each repo uses its own subdirectory):
```yaml
cache-from: type=local,src=/opt/buildx-cache/<repo-name>
cache-to: type=local,dest=/opt/buildx-cache/<repo-name>-new,mode=max
```
3. Add cache rotation step (prevents unbounded growth):
```yaml
- name: Rotate cache
run: |
rm -rf /opt/buildx-cache/<repo-name>
mv /opt/buildx-cache/<repo-name>-new /opt/buildx-cache/<repo-name>
```
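Note that the `mv` in step 3 fails when the `-new` directory is absent (a cold first build, or a build whose cache export failed), after the old cache has already been deleted. A guarded variant, sketched as a shell function (the function name and arguments are illustrative; `$root/$repo` maps to `/opt/buildx-cache/<repo-name>`):

```bash
#!/bin/sh
# Rotate the local buildx cache only when a fresh export exists, so a
# cold or failed build leaves the previous cache intact.
rotate_cache() {
  root="$1"   # e.g. /opt/buildx-cache
  repo="$2"   # e.g. pd-database
  if [ -d "$root/$repo-new" ]; then
    rm -rf "$root/$repo"
    mv "$root/$repo-new" "$root/$repo"
  fi
}
```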
### Key details
- `type=gha` does NOT work on Gitea act_runner (requires GitHub's cache service API)
- Named volumes (not bind mounts) are required because job containers are sibling containers spawned via Docker socket
- `mode=max` caches all intermediate layers, not just the final image's layers — important for multi-stage builds
- First build after migration is cold; subsequent builds hit local cache
- Cache size is bounded by the rotation step (~200-600MB per repo)
- Applied to: Paper Dynasty database, Paper Dynasty discord. Major Domo repos still use registry cache (follow-up)
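Putting the pieces together, a build job using the local cache might look like the following sketch. The action versions, image tag, and step names are illustrative assumptions, not from the source; the `cache-from`/`cache-to` lines are the directives from step 2 with `pd-database` as the subdirectory:

```yaml
jobs:
  build:
    runs-on: ubuntu-latest
    container:
      volumes:
        - pd-buildx-cache:/opt/buildx-cache
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      - uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: example/pd-database:latest   # illustrative tag
          cache-from: type=local,src=/opt/buildx-cache/pd-database
          cache-to: type=local,dest=/opt/buildx-cache/pd-database-new,mode=max
      - name: Rotate cache
        run: |
          rm -rf /opt/buildx-cache/pd-database
          mv /opt/buildx-cache/pd-database-new /opt/buildx-cache/pd-database
```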
### Repos using local cache
| Repo | Cache subdirectory |
|---|---|
| paper-dynasty-database | `/opt/buildx-cache/pd-database` |
| paper-dynasty-discord | `/opt/buildx-cache/pd-discord` |
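To confirm rotation is keeping caches in the expected ~200-600MB range, per-repo usage can be checked on the runner host. A small sketch with a parameterized root, so it can be pointed at `/opt/buildx-cache`:

```bash
#!/bin/sh
# Print a size summary for each per-repo cache subdirectory under $1.
cache_sizes() {
  du -sh "$1"/*/ 2>/dev/null
}

# Example: cache_sizes /opt/buildx-cache
```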