claude-memory/embedding-model-size-barely-affects-speed-gpu-memory-bandwid-329d3c.md at 38d7d85339ddf45ae8ec734f955f39d37672b72d

cal/claude-memory

Cal Corum 99114ea561 store: Embedding model size barely affects speed — GPU memory bandwidth is the bottleneck

2026-02-19 14:53:16 -06:00

835 B

id: 329d3c3d-4cb5-4274-8613-df8bdfa9e3b2
type: insight
title: Embedding model size barely affects speed — GPU memory bandwidth is the bottleneck
tags: ollama, embedding, performance, gpu, insight
importance: 0.7
confidence: 0.8
created: 2026-02-19T20:53:16.955487+00:00
updated:

nomic-embed-text (137M parameters, F16) and qwen3-embedding:8b (7.6B parameters, Q4_K_M) embed 430 memories in roughly the same time (~27-30s) on an RTX 4080 SUPER. Reason: embedding is a single forward pass per batch rather than autoregressive generation, the texts are short (50-100 tokens each), and they are batched 50 at a time, so the whole run is only ~9 batches. GPU memory bandwidth, not compute, is the bottleneck. The quantized 8B model fits in ~5.7GB of VRAM. In practice, this means there is no meaningful speed penalty for using the highest-quality embedding model that fits in VRAM.
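
One way to reproduce this kind of comparison is sketched below. It is a minimal harness, not the script that produced the numbers above: the names `batched`, `time_embedding`, and `make_ollama_embedder` are hypothetical, and the embedder assumes a local Ollama server on its default port exposing the `/api/embed` endpoint. The timing loop takes the embedding function as a parameter, so it can be exercised with a stub before pointing it at a real model.

```python
import json
import time
import urllib.request
from typing import Callable, Iterator, List, Sequence, Tuple

Vector = List[float]
EmbedFn = Callable[[Sequence[str]], List[Vector]]


def batched(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def time_embedding(
    texts: Sequence[str],
    embed_batch: EmbedFn,
    batch_size: int = 50,
) -> Tuple[List[Vector], int, float]:
    """Embed `texts` batch by batch; return (vectors, batch count, seconds)."""
    vectors: List[Vector] = []
    n_batches = 0
    start = time.perf_counter()
    for batch in batched(texts, batch_size):
        # One call per batch, i.e. one forward pass rather than token-by-token generation.
        vectors.extend(embed_batch(batch))
        n_batches += 1
    elapsed = time.perf_counter() - start
    return vectors, n_batches, elapsed


def make_ollama_embedder(model: str, host: str = "http://localhost:11434") -> EmbedFn:
    """Build an embedder for a local Ollama server (assumed host and endpoint)."""

    def embed_batch(batch: Sequence[str]) -> List[Vector]:
        payload = json.dumps({"model": model, "input": list(batch)}).encode("utf-8")
        request = urllib.request.Request(
            f"{host}/api/embed",
            data=payload,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(request) as response:
            return json.loads(response.read())["embeddings"]

    return embed_batch
```

With 430 texts and a batch size of 50, `time_embedding` makes 9 calls (eight full batches and one of 30). Calling it once with `make_ollama_embedder("nomic-embed-text")` and once with `make_ollama_embedder("qwen3-embedding:8b")` on the same texts gives the two wall-clock times to compare. A warm-up call per model beforehand keeps model load time out of the measurement.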
