From 99114ea561113966962a1a249cefdb48d5582b55 Mon Sep 17 00:00:00 2001
From: Cal Corum
Date: Thu, 19 Feb 2026 14:53:16 -0600
Subject: [PATCH] =?UTF-8?q?store:=20Embedding=20model=20size=20barely=20af?=
 =?UTF-8?q?fects=20speed=20=E2=80=94=20GPU=20memory=20bandwidth=20is=20the?=
 =?UTF-8?q?=20bottleneck?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 ...barely-affects-speed-gpu-memory-bandwid-329d3c.md | 12 ++++++++++++
 1 file changed, 12 insertions(+)
 create mode 100644 graph/insights/embedding-model-size-barely-affects-speed-gpu-memory-bandwid-329d3c.md

diff --git a/graph/insights/embedding-model-size-barely-affects-speed-gpu-memory-bandwid-329d3c.md b/graph/insights/embedding-model-size-barely-affects-speed-gpu-memory-bandwid-329d3c.md
new file mode 100644
index 00000000000..f61057a4f73
--- /dev/null
+++ b/graph/insights/embedding-model-size-barely-affects-speed-gpu-memory-bandwid-329d3c.md
@@ -0,0 +1,12 @@
+---
+id: 329d3c3d-4cb5-4274-8613-df8bdfa9e3b2
+type: insight
+title: "Embedding model size barely affects speed — GPU memory bandwidth is the bottleneck"
+tags: [ollama, embedding, performance, gpu, insight]
+importance: 0.7
+confidence: 0.8
+created: "2026-02-19T20:53:16.955487+00:00"
+updated: "2026-02-19T20:53:16.955487+00:00"
+---
+
+nomic-embed-text (137M, F16) and qwen3-embedding:8b (7.6B, Q4_K_M) embed 430 memories in roughly the same time (~27-30s) on an RTX 4080 SUPER. Reason: embedding is a single forward pass per batch (not autoregressive generation), the texts are short (50-100 tokens), and they are sent 50 at a time (only ~9 batches). GPU memory bandwidth, not compute, is the bottleneck; the quantized 8B model fits in ~5.7 GB of VRAM. This means there is no practical speed penalty for using the highest-quality embedding model that fits in VRAM.
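
The batching arithmetic the insight relies on (430 memories, 50 per request, so ~9 forward passes) can be sketched as below. This is a minimal illustration, not the repository's actual embedding code: the `batches` helper and the `embed_batch` wrapper around Ollama's `/api/embed` endpoint (which accepts a list of inputs and returns one embedding per input) are assumptions introduced here, and the HTTP call only runs when an Ollama server is reachable locally.

```python
# Sketch of batched embedding against a local Ollama server.
# Assumptions (not from the patch): helper names, the localhost URL,
# and driving Ollama via its /api/embed endpoint with a list input.
import json
import urllib.request


def batches(items, size=50):
    """Split items into consecutive chunks of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def embed_batch(texts, model="qwen3-embedding:8b",
                url="http://localhost:11434/api/embed"):
    """One forward pass per batch: POST the whole list of texts at once."""
    req = urllib.request.Request(
        url,
        data=json.dumps({"model": model, "input": texts}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["embeddings"]


if __name__ == "__main__":
    # 430 short texts in batches of 50 -> 9 requests total, which is why
    # per-token compute barely matters next to weight-loading bandwidth.
    memories = [f"memory {i}" for i in range(430)]
    chunks = batches(memories)
    print(len(chunks))  # 9
```

Because only ~9 requests are made and each is a single non-autoregressive forward pass, the per-batch cost is dominated by streaming the model weights through GPU memory, which is the same bottleneck for both models.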