I Benchmarked Every Embedding Model Worth Running. Here's What I'd Actually Deploy.

Our AI memory system couldn’t find “Prefers TypeScript over Python” when asked “what programming language does Justin prefer?”

That’s a near-paraphrase. The words overlap. Any decent embedding model should nail it. Ours missed it entirely.

We were running nomic-embed-text via Ollama on a Hetzner VPS. 768 dimensions, ~137M parameters, the model every YouTube tutorial tells you to use. It was fast, it was small, and it was producing search results that ranged from “occasionally useful” to “why did you return that?”

So we switched to OpenAI’s text-embedding-3-small via OpenRouter. 1536 dimensions, $0.02 per million tokens. The TypeScript query finally returned a result — at a 0.49 similarity score.

0.49 is not good. That’s barely above noise. For a direct semantic match, you want 0.7 or higher. We’d moved from “doesn’t work” to “technically works but not well.”
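For context, a score like 0.49 is presumably cosine similarity between the query vector and the stored memory vector (the usual default for this kind of search, though the metric isn’t named above). A minimal sketch in plain Python; `is_strong_match` is a hypothetical helper that just encodes the 0.7 rule of thumb:

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def is_strong_match(score: float, threshold: float = 0.7) -> bool:
    """Rule-of-thumb cut-off from the text; tune it per model and dataset."""
    return score >= threshold
```

Score distributions differ between models, so a threshold that works for one model won’t necessarily transfer to another.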

That prompted a deeper question: what should we actually be running?

The Landscape in 2026

The embedding model space has quietly gotten very good while most of us were paying attention to chat models. Here’s what exists, organized by the four use cases that matter.

English-Only Text

If your data is English and text-only, these are the current leaders:

Model	MTEB Score	Dims	Params	VRAM	License
NV-Embed-v2	72.31	4096	7.8B	~30 GB	CC-BY-NC-4.0
GTE-Qwen2-7B-instruct	70.24	3584	7.6B	~16 GB	Apache 2.0
ModernBERT-Embed Large	~65	768	395M	~0.8 GB	Apache 2.0
Snowflake Arctic Embed L v2.0	55.6	1024	303M	~0.6 GB	Apache 2.0

NV-Embed-v2 is the MTEB champion but requires a data center GPU and has a non-commercial license. GTE-Qwen2-7B is the best commercially usable option in this table if you have the VRAM.

General Text (All-Rounders)

Models that work well across retrieval, classification, clustering, and semantic similarity:

Model	MTEB Score	Dims	Params	VRAM	License
Qwen3-Embedding-8B	70.58	up to 4096	8B	~16 GB	Apache 2.0
Qwen3-Embedding-4B	~68	up to 2560	4B	~8 GB	Apache 2.0
GTE-Qwen2-1.5B-instruct	67.20	1536	1.5B	~3 GB	Apache 2.0
Nomic Embed Text v2 (MoE)	~65	768	305M active	~1 GB	Apache 2.0
Qwen3-Embedding-0.6B	~63	1024	0.6B	~1.2 GB	Apache 2.0

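Qwen3-Embedding-8B currently tops the multilingual MTEB leaderboard, which is where its 70.58 comes from. NV-Embed-v2’s 72.31 is on the older English-only benchmark, so the two numbers aren’t directly comparable. The family scales well: the 0.6B version still beats OpenAI’s text-embedding-3-small while running on a CPU.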
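The “up to” in the Dims column reflects that the Qwen3-Embedding models are trained so a prefix of the full vector still works as a smaller embedding (Matryoshka-style). Assuming that property holds for the model you run, shrinking a vector is just truncate-and-renormalize. A minimal sketch; `truncate_embedding` is a hypothetical helper, not part of any library:

```python
import math
from typing import List, Sequence


def truncate_embedding(vec: Sequence[float], dims: int) -> List[float]:
    """Keep the first `dims` components and rescale to unit length."""
    if not 0 < dims <= len(vec):
        raise ValueError("dims must be between 1 and len(vec)")
    head = list(vec[:dims])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("truncated vector has zero norm")
    return [x / norm for x in head]
```

Smaller vectors cut storage and search cost at some loss in retrieval quality. Every vector in a collection has to be truncated to the same size, and truncating a model that wasn’t trained this way generally hurts much more.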

Nomic v2 is interesting: it’s a Mixture-of-Experts model, so only 305M of its 475M parameters activate per inference. Extremely efficient, but limited to 512 token context.

Multilingual

If your data includes non-English text:

Model	MTEB Score	Dims	Params	VRAM	Languages	License
Qwen3-Embedding-8B	70.58	up to 4096	8B	~16 GB	100+	Apache 2.0
Jina Embeddings v4	66.49	2048	3.8B	~8 GB	Multi	Apache 2.0
BGE-M3	63-64	1024	568M	~1.1 GB	100+	MIT
Nomic Embed Text v2	~65	768	305M active	~1 GB	~100	Apache 2.0

BGE-M3 deserves special mention. At 568M parameters and 1.1 GB of VRAM, it supports dense, sparse, and ColBERT retrieval in a single model. MIT licensed. 8K context. 100+ languages. It’s the most versatile embedding model you can run on virtually anything.
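To make “dense, sparse, and ColBERT” concrete, here is a plain-Python sketch of what each score computes once the model has produced its outputs. This is not the FlagEmbedding API. The function names and the fusion weights are illustrative assumptions, and the inputs are toy stand-ins for real model outputs:

```python
from typing import Dict, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def dense_score(q: Vector, d: Vector) -> float:
    """One vector per text; assumes both are already L2-normalized."""
    return dot(q, d)


def sparse_score(q: Dict[str, float], d: Dict[str, float]) -> float:
    """Lexical match: sum of weight products over tokens present in both."""
    return sum(w * d[tok] for tok, w in q.items() if tok in d)


def colbert_score(q_tokens: Sequence[Vector], d_tokens: Sequence[Vector]) -> float:
    """Late interaction (MaxSim): best document token per query token, averaged."""
    if not q_tokens or not d_tokens:
        raise ValueError("both texts need at least one token vector")
    best = [max(dot(qt, dt) for dt in d_tokens) for qt in q_tokens]
    return sum(best) / len(best)


def hybrid_score(
    dense: float,
    sparse: float,
    colbert: float,
    weights: Tuple[float, float, float] = (0.4, 0.2, 0.4),  # illustrative, not tuned
) -> float:
    """Weighted sum of the three signals."""
    w_dense, w_sparse, w_colbert = weights
    return w_dense * dense + w_sparse * sparse + w_colbert * colbert
```

The practical upshot is that the sparse score can catch exact terms like “TypeScript” even when the dense vectors land far apart, which is the failure mode that started this post.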

Multimodal (Text + Image)

When you need to embed images, PDFs, or charts alongside text:

Model	Benchmark	Dims	Params	VRAM	Modalities	License
Jina Embeddings v4	66.49 MMTEB	2048	3.8B	~8 GB	Text/Image/Docs	Apache 2.0
Nomic Embed Multimodal 7B	58.8 Vidore-v2	768	7B	~14 GB	Text/Image/PDF	Apache 2.0
SigLIP 2 (So400m)	SOTA ImageNet	1152	400M	~1 GB	Text/Image	Apache 2.0
Nomic Embed Vision v1.5	Aligned to text	768	137M	~0.3 GB	Image	Apache 2.0

Jina v4 is the Swiss Army knife here — text, images, and visual documents in one model. SigLIP 2 is the lightweight option for pure text-image alignment.

If your data includes video or audio: these self-hosted models don’t cover that. Google’s Gemini Embedding 2 (API-only, $0.25/1M tokens) embeds all four modalities in a unified vector space — no transcription pipeline required. When that’s worth it is a different post.

Where OpenAI Sits

Here’s the uncomfortable truth about paying for embeddings:

MTEB score scale (parentheses show the hardware tier each model fits in; dashed lines mark OpenAI’s paid models):

72 | #### NV-Embed-v2 (64GB, non-commercial)
71 | #### Qwen3-8B (32GB)
70 | #### GTE-Qwen2-7B (64GB)
69 |
68 | #### Qwen3-4B (16GB)
67 | #### GTE-Qwen2-1.5B (16GB)
66 | #### Jina v4 (32GB)
65 | #### Nomic v2 MoE (8GB) / ModernBERT (8GB)
   | - - - text-embedding-3-large ($0.13/1M) - - -  64.6
64 | #### BGE-M3 (8GB)
63 | #### Qwen3-0.6B (8GB)
   | - - - text-embedding-3-small ($0.02/1M) - - -  62.3
62 |

OpenAI’s embedding models sit at roughly the bottom of what you can self-host for free. Most of the small open-source models above match or exceed text-embedding-3-small (Snowflake Arctic Embed L v2.0, at 55.6, is the exception). The larger text-embedding-3-large gets beaten by models that run on consumer hardware.

The only reason to pay OpenAI for embeddings is zero infrastructure overhead. Not quality.

Picking by Hardware Budget

Most of us don’t have a dedicated GPU for embeddings. We’re running them alongside other workloads — an LLM, a database, the rest of our stack. Here’s what fits when you account for that:
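The VRAM figures in these tables appear to track weight size at 16-bit precision: 568M parameters × 2 bytes ≈ 1.1 GB for BGE-M3, 8B × 2 bytes ≈ 16 GB for Qwen3-Embedding-8B. That is a reading of the numbers, not something the tables state, and activations, batch size, and context length add overhead on top. Under that assumption, a rough budget check looks like this (both helpers are hypothetical):

```python
from typing import Iterable


def weights_gb(params_millions: float, bytes_per_param: int = 2) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes); 2 bytes = fp16/bf16."""
    return params_millions * 1e6 * bytes_per_param / 1e9


def fits(model_sizes_gb: Iterable[float], available_gb: float) -> bool:
    """True if the combined weights fit in the memory left for embeddings."""
    return sum(model_sizes_gb) <= available_gb


# Example: BGE-M3 (568M) plus Nomic Embed Vision v1.5 (137M) in a ~4 GB slice.
pair = [weights_gb(568), weights_gb(137)]
print(round(sum(pair), 2), fits(pair, available_gb=4.0))
```

Passing `bytes_per_param=4` gives the fp32 footprint, which would explain the ~30 GB listed for the 7.8B-parameter NV-Embed-v2.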

8 GB (e.g., RTX 4060, M1/M2 base, small VPS)

~3-4 GB available for embeddings after other workloads.

Model	VRAM	MTEB	Notes
Nomic Embed Vision v1.5	~0.3 GB	Aligned	Image embeddings
Snowflake Arctic Embed L v2.0	~0.6 GB	55.6	Lightweight retrieval
ModernBERT-Embed Large	~0.8 GB	~65	Solid English-only
Nomic Embed Text v2 MoE	~1 GB	~65	General text, very efficient
BGE-M3	~1.1 GB	63-64	Best value at this tier
Qwen3-Embedding-0.6B	~1.2 GB	~63	32K context window

Sweet spot: BGE-M3 (1.1 GB) + Nomic Vision (0.3 GB) = 1.4 GB total for text + image, leaving 6+ GB for everything else.

16 GB (e.g., RTX 4080, M2 Pro)

~6-8 GB available.

Everything above, plus:

Model	VRAM	MTEB	Notes
SigLIP 2 (ViT-g)	~2 GB	Best accuracy	Premium text-image
GTE-Qwen2-1.5B-instruct	~3 GB	67.20	Big quality jump

Sweet spot: GTE-Qwen2-1.5B (3 GB) — a significant jump over the budget tier while leaving room for a 7B LLM.

32 GB (e.g., RTX 5090, M2 Max)

~12-16 GB available.

Model	VRAM	MTEB	Notes
Jina Embeddings v4	~8 GB	66.49	Best multimodal
Qwen3-Embedding-8B	~16 GB	70.58	Top commercially usable score

Sweet spot: Qwen3-Embedding-8B if you’re all-in on text. Jina v4 if you need multimodal and want room for an LLM alongside it.

64 GB (e.g., M2 Ultra; an 80 GB A100 also covers it)

Model	VRAM	MTEB	Notes
NV-Embed-v2	~30 GB	72.31	Absolute best (non-commercial)

Sweet spot: Qwen3-Embedding-8B (16 GB) + Jina v4 (8 GB) = 24 GB for best-in-class text + multimodal, with 40 GB left for inference.

Why Influencers Keep Recommending nomic-embed-text

You’ll notice nomic-embed-text (v1) still dominates tutorials and YouTube walkthroughs. It’s the “install Ubuntu” of embedding models — recommended everywhere because it was first and easy, not because it’s best.

The reasons:

First-mover on Ollama: it was one of the earliest quality embedding models available via ollama pull. Most tutorials were written when it was genuinely the best local option.
Tiny and fast: at 137M parameters and 0.3 GB, it runs on anything. “Runs on a Raspberry Pi” gets more views than “scores 3 points higher on MTEB.”
Good enough for demos: MTEB 62 vs 65 doesn’t matter when you’re embedding 20 documents for a tutorial. The gap shows up at scale.
Most content creators don’t benchmark: they pull the model, embed a few docs, get results, say “works great.” They never compare retrieval precision against alternatives on the same dataset.
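Comparing models on the same dataset doesn’t take much. A minimal sketch, assuming you have a handful of queries with hand-labelled relevant document IDs and each model’s ranked results for those queries; the two metrics are standard precision@k and recall@k, and the function names are my own:

```python
from typing import Dict, List, Set


def precision_at_k(ranked_ids: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant)
    return hits / k


def recall_at_k(ranked_ids: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k."""
    if not relevant:
        raise ValueError("each query needs at least one relevant document")
    hits = len(set(ranked_ids[:k]) & relevant)
    return hits / len(relevant)


def mean_recall_at_k(
    results: Dict[str, List[str]], gold: Dict[str, Set[str]], k: int
) -> float:
    """Average recall@k over every labelled query."""
    return sum(recall_at_k(results[q], gold[q], k) for q in gold) / len(gold)
```

Run each candidate model over the same queries, feed its rankings through `mean_recall_at_k`, and compare the numbers. A few dozen labelled queries from your own data will tell you more than a leaderboard delta.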

If you’re following a tutorial from 2024, update the model. Nomic v2 MoE, BGE-M3, and Qwen3-Embedding-0.6B all score higher on the benchmarks above and are just as easy to run.

Our Decision: Why We’re (Still) on text-embedding-3-small

We run two AI agents with shared memory on a Hetzner VPS (8 GB RAM, no GPU). Our embedding model handles mem0 semantic search — writing and retrieving memories across both agents via a shared Qdrant collection.

We started with nomic-embed-text at 768 dimensions. Search quality was bad enough that basic paraphrase queries missed entirely. We switched to text-embedding-3-small via OpenRouter at 1536 dimensions. It costs effectively nothing (fractions of a cent per month at our volume) and the search quality improved meaningfully — though a 0.49 score on a near-paraphrase still isn’t impressive.

We haven’t switched to a self-hosted alternative yet for a boring reason: it works, it’s cheap, and migrating means recreating the Qdrant collection and re-embedding all memories. The juice hasn’t been worth the squeeze.

But if we were starting fresh today, we’d run BGE-M3 at 1024 dimensions. It fits in 1.1 GB, scores higher than what we’re paying for, supports dense+sparse+ColBERT retrieval, and eliminates the OpenRouter dependency entirely. On our 8 GB VPS with ~4 GB free, it runs comfortably alongside everything else.

The cost savings are marginal — we’re talking pennies per month. The quality improvement and elimination of a network dependency are the real reasons to switch.

The Quick Decision Matrix

Your Budget	Text Pick	Multimodal Pick	Total VRAM	Headroom
8 GB	BGE-M3 (1.1 GB)	Nomic Vision (0.3 GB)	1.4 GB	6.6 GB
16 GB	GTE-Qwen2-1.5B (3 GB)	SigLIP 2 So400m (1 GB)	4 GB	12 GB
32 GB	Qwen3-Embedding-8B (16 GB)	SigLIP 2 So400m (1 GB)	17 GB	15 GB
64 GB	Qwen3-Embedding-8B (16 GB)	Jina v4 (8 GB)	24 GB	40 GB

All Apache 2.0 or MIT licensed. All self-hostable via Ollama, Hugging Face TEI, or sentence-transformers.

Stop paying for embeddings that open-source models beat on a laptop.

We’re Tensaku Labs. We build the tools we wish our AI agents had.