store: Embedding model size barely affects speed — GPU memory bandwidth is the bottleneck
parent 8250be4b9c
commit 99114ea561
@@ -0,0 +1,12 @@
---
id: 329d3c3d-4cb5-4274-8613-df8bdfa9e3b2
type: insight
title: "Embedding model size barely affects speed — GPU memory bandwidth is the bottleneck"
tags: [ollama, embedding, performance, gpu, insight]
importance: 0.7
confidence: 0.8
created: "2026-02-19T20:53:16.955487+00:00"
updated: "2026-02-19T20:53:16.955487+00:00"
---
nomic-embed-text (137M parameters, F16) and qwen3-embedding:8b (7.6B parameters, Q4_K_M) embed 430 memories in roughly the same time (~27-30s) on an RTX 4080 SUPER. Reason: embedding is a single forward pass per batch rather than autoregressive generation, the texts are short (50-100 tokens), and they are batched 50 at a time, so the whole run is only ~9 batches. GPU memory bandwidth, not compute, is the bottleneck. The quantized 8B model fits in ~5.7GB of VRAM. In practice, there is no meaningful speed penalty for using the highest-quality embedding model that fits in VRAM.
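
The batch arithmetic above can be checked with a rough back-of-envelope sketch. It assumes one full read of the model weights per forward pass and a peak memory bandwidth of about 736 GB/s for the RTX 4080 SUPER (a published spec figure, not something measured here). It also treats the ~5.7GB VRAM footprint as the weight size of the 8B model, which is only an approximation. The function names are illustrative.

```python
import math

# Assumption: published peak memory bandwidth for the RTX 4080 SUPER, in GB/s.
ASSUMED_BANDWIDTH_GB_S = 736.0


def batch_count(n_texts: int, batch_size: int) -> int:
    """Number of forward passes needed to embed n_texts in fixed-size batches."""
    return math.ceil(n_texts / batch_size)


def weight_stream_seconds(weights_gb: float, batches: int, bandwidth_gb_s: float) -> float:
    """Rough lower bound on time spent reading weights: one full read per batch."""
    return weights_gb * batches / bandwidth_gb_s


batches = batch_count(430, 50)       # 430 memories, 50 per batch
small_gb = 137e6 * 2 / 1e9           # 137M parameters at F16 (2 bytes each)
large_gb = 5.7                       # VRAM footprint from the note, used as a proxy

small_s = weight_stream_seconds(small_gb, batches, ASSUMED_BANDWIDTH_GB_S)
large_s = weight_stream_seconds(large_gb, batches, ASSUMED_BANDWIDTH_GB_S)

print(f"batches: {batches}")
print(f"nomic-embed-text weight reads:   ~{small_s:.3f}s total")
print(f"qwen3-embedding:8b weight reads: ~{large_s:.3f}s total")
```

Under these assumptions, the size-dependent weight reads for either model come to well under a second across all batches, which is consistent with the two models finishing in about the same ~27-30s.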