docs: sync KB — kb-rag-system.md

Cal Corum 2026-03-17 22:44:09 -05:00
parent be896b4c2a
commit 1ca0458a66

@@ -86,7 +86,7 @@ The KB data lives at `~/docker/md-kb-rag/data/repo/` on manticore as a proper gi
1. Push `.md` files to `main` branch on Gitea
2. Gitea Actions workflow (`.gitea/workflows/kb-reindex.yml`) fires
3. Workflow sends HMAC-SHA256 signed POST to `http://10.10.0.226:8001/hooks/reindex`
4. md-kb-rag receives webhook → runs `git pull --ff-only` → runs incremental reindex
4. md-kb-rag receives webhook → runs `git fetch` + `git merge --ff-only` (using `GIT_PULL_TOKEN`) → runs incremental reindex
5. Only changed files are re-embedded (content hash comparison via SQLite state DB)
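Step 5 can be illustrated with a minimal Python sketch of content-hash change detection backed by SQLite. This is not the service's actual code: the `file_state` table, its columns, and the `changed_files` helper are hypothetical names chosen for the example, and the real state DB schema is not shown in this document.

```python
import hashlib
import sqlite3


def changed_files(conn: sqlite3.Connection, files: dict[str, bytes]) -> list[str]:
    """Return the paths whose SHA-256 differs from the stored hash.

    Hypothetical schema: one row per path holding the last indexed hash.
    The stored hash is updated for every path that is reported as changed.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS file_state ("
        "path TEXT PRIMARY KEY, sha256 TEXT NOT NULL)"
    )
    changed = []
    for path, content in sorted(files.items()):
        digest = hashlib.sha256(content).hexdigest()
        row = conn.execute(
            "SELECT sha256 FROM file_state WHERE path = ?", (path,)
        ).fetchone()
        if row is None or row[0] != digest:
            # New or modified file: this is what would be re-embedded.
            changed.append(path)
            conn.execute(
                "INSERT OR REPLACE INTO file_state (path, sha256) VALUES (?, ?)",
                (path, digest),
            )
    conn.commit()
    return changed


# Two simulated reindex runs against an in-memory state DB.
conn = sqlite3.connect(":memory:")
first_run = changed_files(conn, {"a.md": b"one", "b.md": b"two"})
second_run = changed_files(conn, {"a.md": b"one", "b.md": b"two (edited)"})
```

On the first run every file is new, so both are reported; on the second run only `b.md` is reported because its content hash changed.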
### Webhook Authentication
@@ -185,6 +185,7 @@ File: `~/docker/md-kb-rag/.env`
| `MCP_PORT` | MCP server port (8001) |
| `MCP_BEARER_TOKEN` | Auth token for MCP endpoint |
| `WEBHOOK_SECRET` | HMAC secret for webhook auth (shared with Gitea repo secret) |
| `GIT_PULL_TOKEN` | Gitea token for authenticated git fetch during webhook reindex |
| `RUST_LOG` | Log level (info) |
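To make the role of `WEBHOOK_SECRET` concrete, here is a minimal Python sketch of HMAC-SHA256 signing and verification of a webhook body. It assumes the signature is the hex digest of the raw request body; the header that carries the signature and the exact encoding the service expects are not stated here, so treat `sign_body` and `verify_body` as illustrative names only.

```python
import hashlib
import hmac


def sign_body(secret: str, body: bytes) -> str:
    """Sender side: hex HMAC-SHA256 of the raw request body."""
    return hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()


def verify_body(secret: str, body: bytes, signature_hex: str) -> bool:
    """Receiver side: recompute the HMAC and compare in constant time."""
    expected = sign_body(secret, body)
    return hmac.compare_digest(expected, signature_hex)


# Example values are placeholders, not real secrets or payloads.
secret = "example-shared-secret"
body = b'{"ref": "refs/heads/main"}'
signature = sign_body(secret, body)
```

`hmac.compare_digest` is used instead of `==` so that the comparison time does not leak how many leading characters of the signature matched.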
## Troubleshooting
@@ -223,3 +224,29 @@ The kb-rag service has these non-obvious requirements:
- `user: "1000:1000"` — must match the uid/gid that owns `data/repo/` for git pull to work
- `config.yaml` mount — provides `source.git_url` and `branch` so the webhook handler knows to run `git pull`
- `.gitconfig` mount + `GIT_CONFIG_GLOBAL` env var — git needs `safe.directory = /data` since the volume owner differs from the container's default user
## Changelog
### 2026-03-17 — Image Update + Config Fixes
**Image pull**: Updated `ghcr.io/st0nefish/md-kb-rag:latest` (8 upstream commits since initial deploy on 2026-03-11).
**Key upstream changes applied:**
- **`GIT_PULL_TOKEN` support** — Webhook-triggered reindex now uses explicit `git fetch` + `git merge --ff-only` with a token injected into the HTTPS URL. Previously the git pull inside Docker was silently failing (no SSH client, dubious ownership errors).
- **Auto-clone on startup** — Setting `source.git_url` allows the container to shallow-clone the repo into an empty volume on first boot. Not adopted (we use a bind-mount), but available.
- **`EMBEDDING_API_KEY` support** — Optional env var for authenticated embedding providers. Not needed for local llama.cpp.
- **Custom MCP instructions** — New `mcp.instructions` config field sets the server-level instructions block sent to MCP clients. Server auto-appends discovered filter metadata (domains, types, tags).
- **Bug fixes** — Webhook rate limiter gap, globset deny-all fallback, RwLock panic in MCP startup, HTTP 429/503 retry logic for embedding API.
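The `GIT_PULL_TOKEN` change above can be sketched roughly as follows in Python. This is an illustration of the general technique (token injected into the HTTPS remote URL, then `git fetch` followed by `git merge --ff-only`), not the upstream implementation. The `with_token` and `fetch_and_fast_forward` helpers, the `git` username, and the example URL are all assumptions made for the sketch.

```python
import subprocess
from urllib.parse import quote, urlsplit, urlunsplit


def with_token(git_url: str, token: str, user: str = "git") -> str:
    """Return git_url with user:token credentials embedded (HTTPS only).

    The username is a placeholder; which username the server expects
    alongside a token is an assumption here.
    """
    parts = urlsplit(git_url)
    if parts.scheme != "https":
        raise ValueError("token injection only applies to HTTPS remotes")
    host = parts.hostname or ""
    if parts.port:
        host = f"{host}:{parts.port}"
    # Percent-encode so special characters in the token cannot break the URL.
    netloc = f"{quote(user, safe='')}:{quote(token, safe='')}@{host}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


def fetch_and_fast_forward(repo_dir: str, git_url: str, token: str, branch: str = "main") -> None:
    """Fetch the branch over authenticated HTTPS, then fast-forward only."""
    subprocess.run(
        ["git", "-C", repo_dir, "fetch", with_token(git_url, token), branch],
        check=True,
    )
    # FETCH_HEAD points at what was just fetched; --ff-only refuses to
    # create a merge commit if the local branch has diverged.
    subprocess.run(
        ["git", "-C", repo_dir, "merge", "--ff-only", "FETCH_HEAD"],
        check=True,
    )


# Hypothetical remote and token, for illustration only.
example_url = with_token("https://gitea.example.com/owner/kb.git", "s3cr/et")
```

Passing the authenticated URL directly to `git fetch` keeps the token out of the repository's stored remote configuration, at the cost of it being visible in the process arguments while the command runs.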
**Config changes made:**
- Added `GIT_PULL_TOKEN` env var to `.env` (Gitea token with repo read access)
- Added `GIT_PULL_TOKEN=${GIT_PULL_TOKEN:-}` to `docker-compose.yml` environment section
- Added `mcp.instructions` to `config.yaml` with proactive search trigger keywords matching the claude-home topic areas
**Env vars table update:**
| Variable | Purpose |
|----------|---------|
| `GIT_PULL_TOKEN` | Gitea token for authenticated git fetch during webhook reindex |
**Result**: Webhook reindex pipeline now works end-to-end (push → Gitea Action → webhook → git fetch with auth → incremental reindex). Verified with live push test.