I Benchmarked Every Embedding Model Worth Running. Here's What I'd Actually Deploy.

Our AI memory system couldn’t find “Prefers TypeScript over Python” when asked “what programming language does Justin prefer?”

That’s a near-paraphrase. The words overlap. Any decent embedding model should nail it. Ours missed it entirely.

We were running nomic-embed-text via Ollama on a Hetzner VPS. 768 dimensions, ~137M parameters, the model every YouTube tutorial tells you to use. It was fast, it was small, and it was producing search results that ranged from “occasionally useful” to “why did you return that?”

So we switched to OpenAI’s text-embedding-3-small via OpenRouter. 1536 dimensions, $0.02 per million tokens. The TypeScript query finally returned a result — at a 0.49 similarity score.

0.49 is not good. That’s barely above noise. For a direct semantic match, you want 0.7 or higher. We’d moved from “doesn’t work” to “technically works but not well.”
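To make the number concrete: the score is assumed here to be cosine similarity (the post does not say which metric the collection uses). Below is a minimal sketch of the calculation and the 0.7 rule of thumb. The vectors are made-up three-dimensional toys, not real model output, and `is_strong_match` is a hypothetical helper name.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Rule-of-thumb threshold from the text; tune it per model and dataset.
STRONG_MATCH = 0.7

def is_strong_match(score, threshold=STRONG_MATCH):
    """Hypothetical helper: True when a score clears the threshold."""
    return score >= threshold

# Toy vectors for illustration only (real embeddings have 768+ dimensions).
query = [1.0, 0.0, 0.0]
memory = [1.0, 1.0, 0.0]
score = cosine_similarity(query, memory)
print(round(score, 3), is_strong_match(score))  # 0.707 True
print(is_strong_match(0.49))                    # False
```

Absolute scores are not comparable across models, so a threshold like this is best calibrated against known-good and known-bad pairs from your own data.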

That prompted a deeper question: what should we actually be running?

The Landscape in 2026

The embedding model space has quietly gotten very good while most of us were paying attention to chat models. Here’s what exists, organized by the four use cases that matter.

English-Only Text

If your data is English and text-only, these are the current leaders:

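| Model | MTEB Score | Dims | Params | VRAM | License |
| --- | --- | --- | --- | --- | --- |
| NV-Embed-v2 | 72.31 | 384-1024 | 7.8B | ~30 GB | CC-BY-NC-4.0 |
| GTE-Qwen2-7B-instruct | 70.24 | 3584 | 7.6B | ~16 GB | Apache 2.0 |
| ModernBERT-Embed Large | ~65 | 768 | 395M | ~0.8 GB | Apache 2.0 |
| Snowflake Arctic Embed L v2.0 | 55.6 | 1024 | 303M | ~0.6 GB | Apache 2.0 |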

NV-Embed-v2 is the MTEB champion but requires a data center GPU and has a non-commercial license. GTE-Qwen2-7B is the best commercially usable option if you have the VRAM.
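A note on the VRAM column: most of the figures in these tables line up with a weights-only estimate of 2 bytes per parameter (fp16/bf16), and NV-Embed-v2's ~30 GB is closer to 4 bytes per parameter (fp32). That is an inference from the numbers, not a method the tables state. The sketch below also ignores activations, batch size, and runtime overhead, so treat the results as a floor.

```python
# Bytes per parameter for common weight formats.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

def weight_memory_gb(params, dtype="fp16"):
    """Approximate memory for model weights only, in decimal GB."""
    return params * BYTES_PER_PARAM[dtype] / 1e9

# Parameter counts taken from the tables in this post.
for name, params in [
    ("BGE-M3", 568e6),
    ("Qwen3-Embedding-0.6B", 0.6e9),
    ("Qwen3-Embedding-8B", 8e9),
]:
    print(f"{name}: ~{weight_memory_gb(params):.1f} GB")
# BGE-M3: ~1.1 GB
# Qwen3-Embedding-0.6B: ~1.2 GB
# Qwen3-Embedding-8B: ~16.0 GB
```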

General Text (All-Rounders)

Models that work well across retrieval, classification, clustering, and semantic similarity:

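| Model | MTEB Score | Dims | Params | VRAM | License |
| --- | --- | --- | --- | --- | --- |
| Qwen3-Embedding-8B | 70.58 | up to 4096 | 8B | ~16 GB | Apache 2.0 |
| Qwen3-Embedding-4B | ~68 | up to 2048 | 4B | ~8 GB | Apache 2.0 |
| GTE-Qwen2-1.5B-instruct | 67.20 | 1536 | 1.5B | ~3 GB | Apache 2.0 |
| Nomic Embed Text v2 (MoE) | ~65 | 768 | 305M active | ~1 GB | Apache 2.0 |
| Qwen3-Embedding-0.6B | ~63 | 1024 | 0.6B | ~1.2 GB | Apache 2.0 |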

Qwen3-Embedding-8B currently tops the multilingual MTEB leaderboard. The family scales well: the 0.6B version still beats OpenAI’s text-embedding-3-small while running on a CPU.

Nomic v2 is interesting: it’s a Mixture-of-Experts model, so only 305M of its 475M parameters activate per inference. That makes it extremely efficient, but it is limited to a 512-token context.

Multilingual

If your data includes non-English text:

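| Model | MTEB Score | Dims | Params | VRAM | Languages | License |
| --- | --- | --- | --- | --- | --- | --- |
| Qwen3-Embedding-8B | 70.58 | up to 4096 | 8B | ~16 GB | 100+ | Apache 2.0 |
| Jina Embeddings v4 | 66.49 | 2048 | 3.8B | ~8 GB | Multi | Apache 2.0 |
| BGE-M3 | 63-64 | 1024 | 568M | ~1.1 GB | 100+ | MIT |
| Nomic Embed Text v2 | ~65 | 768 | 305M active | ~1 GB | ~100 | Apache 2.0 |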

BGE-M3 deserves special mention. At 568M parameters and 1.1 GB of VRAM, it supports dense, sparse, and ColBERT retrieval in a single model. MIT licensed. 8K context. 100+ languages. It’s the most versatile embedding model you can run on virtually anything.
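To show why having dense and sparse outputs from one model is useful, here is a rough sketch of blending the two signals into a single ranking score. It is not BGE-M3's API: the vectors and token weights are hand-written stand-ins, the function names are hypothetical, the 0.7 weight is arbitrary, and the ColBERT (multi-vector) signal is left out.

```python
import math

def dense_score(q, d):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(q, d))
    return dot / (math.sqrt(sum(x * x for x in q)) * math.sqrt(sum(y * y for y in d)))

def sparse_score(q_weights, d_weights):
    """Lexical match: sum of weight products over tokens present in both texts."""
    return sum(w * d_weights[tok] for tok, w in q_weights.items() if tok in d_weights)

def hybrid_score(q_dense, d_dense, q_sparse, d_sparse, dense_weight=0.7):
    """Weighted blend of the two scores. The default weight is an assumption."""
    dense = dense_score(q_dense, d_dense)
    sparse = sparse_score(q_sparse, d_sparse)
    return dense_weight * dense + (1 - dense_weight) * sparse

# Hand-written toy inputs, not real model output.
q_dense, d_dense = [1.0, 0.0], [1.0, 0.0]
q_sparse = {"typescript": 0.5}
d_sparse = {"typescript": 0.4, "python": 0.3}

print(round(hybrid_score(q_dense, d_dense, q_sparse, d_sparse), 2))  # 0.76
```

The sparse term rewards exact token overlap, which helps with names and identifiers that dense vectors tend to blur.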

Multimodal (Text + Image)

When you need to embed images, PDFs, or charts alongside text:

| Model | Benchmark | Dims | Params | VRAM | Modalities | License |
| --- | --- | --- | --- | --- | --- | --- |
| Jina Embeddings v4 | 66.49 MMTEB | 2048 | 3.8B | ~8 GB | Text/Image/Docs | Apache 2.0 |
| Nomic Embed Multimodal 7B | 58.8 Vidore-v2 | 768 | 7B | ~14 GB | Text/Image/PDF | Apache 2.0 |
| SigLIP 2 (So400m) | SOTA ImageNet | 1152 | 400M | ~1 GB | Text/Image | Apache 2.0 |
| Nomic Embed Vision v1.5 | Aligned to text | 768 | 137M | ~0.3 GB | Image | Apache 2.0 |

Jina v4 is the Swiss Army knife here — text, images, and visual documents in one model. SigLIP 2 is the lightweight option for pure text-image alignment.

If your data includes video or audio: these self-hosted models don’t cover that. Google’s Gemini Embedding 2 (API-only, $0.25/1M tokens) embeds all four modalities in a unified vector space — no transcription pipeline required. When that’s worth it is a different post.

Where OpenAI Sits

Here’s the uncomfortable truth about paying for embeddings:

MTEB Score Scale:

72 | #### NV-Embed-v2 (64GB, non-commercial)
71 | #### Qwen3-8B (32GB)
70 | #### GTE-Qwen2-7B (64GB)
69 |
68 | #### Qwen3-4B (16GB)
67 | #### GTE-Qwen2-1.5B (16GB)
66 | #### Jina v4 (32GB)
65 | #### Nomic v2 MoE (8GB) / ModernBERT (8GB)
   | - - - text-embedding-3-large ($0.13/1M) - - -  64.6
64 | #### BGE-M3 (8GB)
63 | #### Qwen3-0.6B (8GB)
   | - - - text-embedding-3-small ($0.02/1M) - - -  62.3
62 |

OpenAI’s embedding models sit near the bottom of what you can self-host for free. Of the text models above, only Snowflake Arctic Embed L v2.0 (55.6) scores below text-embedding-3-small; even the 0.6B Qwen3 model edges it out. The larger text-embedding-3-large gets beaten by models that run on consumer hardware.

The only reason to pay OpenAI for embeddings is zero infrastructure overhead. Not quality.

Picking by Hardware Budget

Most of us don’t have a dedicated GPU for embeddings. We’re running them alongside other workloads — an LLM, a database, the rest of our stack. Here’s what fits when you account for that:

8 GB (e.g., RTX 4060, M1/M2 base, small VPS)

~3-4 GB available for embeddings after other workloads.

| Model | VRAM | MTEB | Notes |
| --- | --- | --- | --- |
| Nomic Embed Vision v1.5 | ~0.3 GB | Aligned | Image embeddings |
| Snowflake Arctic Embed L v2.0 | ~0.6 GB | 55.6 | Lightweight retrieval |
| ModernBERT-Embed Large | ~0.8 GB | ~65 | Solid English-only |
| Nomic Embed Text v2 MoE | ~1 GB | ~65 | General text, very efficient |
| BGE-M3 | ~1.1 GB | 63-64 | Best value at this tier |
| Qwen3-Embedding-0.6B | ~1.2 GB | ~63 | 32K context window |

Sweet spot: BGE-M3 (1.1 GB) + Nomic Vision (0.3 GB) = 1.4 GB total for text + image, leaving 6+ GB for everything else.

16 GB (e.g., RTX 4080, M2 Pro)

~6-8 GB available.

Everything above, plus:

| Model | VRAM | MTEB | Notes |
| --- | --- | --- | --- |
| SigLIP 2 (ViT-g) | ~2 GB | Best accuracy | Premium text-image |
| GTE-Qwen2-1.5B-instruct | ~3 GB | 67.20 | Big quality jump |

Sweet spot: GTE-Qwen2-1.5B (3 GB) — a significant jump over the budget tier while leaving room for a 7B LLM.

32 GB (e.g., RTX 5090, M2 Max)

~12-16 GB available.

| Model | VRAM | MTEB | Notes |
| --- | --- | --- | --- |
| Jina Embeddings v4 | ~8 GB | 66.49 | Best multimodal |
| Qwen3-Embedding-8B | ~16 GB | 70.58 | Current #1 overall |

Sweet spot: Qwen3-Embedding-8B if you’re all-in on text. Jina v4 if you need multimodal and want room for an LLM alongside it.

64 GB (e.g., A100, M3 Ultra)

| Model | VRAM | MTEB | Notes |
| --- | --- | --- | --- |
| NV-Embed-v2 | ~30 GB | 72.31 | Absolute best (non-commercial) |

Sweet spot: Qwen3-Embedding-8B (16 GB) + Jina v4 (8 GB) = 24 GB for best-in-class text + multimodal, with 40 GB left for inference.

Why Influencers Keep Recommending nomic-embed-text

You’ll notice nomic-embed-text (v1) still dominates tutorials and YouTube walkthroughs. It’s the “install Ubuntu” of embedding models — recommended everywhere because it was first and easy, not because it’s best.

The reasons:

  1. First-mover on Ollama. It was one of the earliest quality embedding models available via ollama pull. Most tutorials were written when it was genuinely the best local option.
  2. Tiny and fast. At 137M parameters and 0.3 GB, it runs on anything. “Runs on a Raspberry Pi” gets more views than “scores 3 points higher on MTEB.”
  3. Good enough for demos. MTEB 62 vs 65 doesn’t matter when you’re embedding 20 documents for a tutorial. The gap shows up at scale.
  4. Most content creators don’t benchmark. They pull the model, embed a few docs, get results, say “works great.” They never compare retrieval precision against alternatives on the same dataset.

If you’re following a tutorial from 2024, update the model. Nomic v2 MoE, BGE-M3, and Qwen3-Embedding-0.6B all score higher on MTEB and are just as easy to run.
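Point 4 is cheap to fix. Here is a minimal sketch of a recall@k harness that scores two embedders on the same labeled queries. The bag-of-words embedders are stand-ins so the example runs without a model download. In practice `embed` would wrap a real model call, and the corpus, queries, and vocabularies below are invented for illustration.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def recall_at_k(embed, corpus, labeled_queries, k=1):
    """Fraction of queries whose expected document lands in the top k results."""
    doc_vecs = [embed(doc) for doc in corpus]
    hits = 0
    for query, expected_idx in labeled_queries:
        q = embed(query)
        ranked = sorted(range(len(corpus)),
                        key=lambda i: cosine(q, doc_vecs[i]), reverse=True)
        hits += expected_idx in ranked[:k]
    return hits / len(labeled_queries)

def make_bow_embedder(vocab):
    """Stand-in embedder: word counts over a fixed vocabulary."""
    def embed(text):
        counts = Counter(text.lower().split())
        return [counts[w] for w in vocab]
    return embed

corpus = ["Lives in Berlin", "Prefers TypeScript over Python", "Drinks black coffee"]
queries = [("typescript or python", 1), ("lives where", 0)]

weak = make_bow_embedder(["berlin", "coffee", "lives"])
better = make_bow_embedder(["berlin", "coffee", "lives", "typescript", "python"])

print(recall_at_k(weak, corpus, queries), recall_at_k(better, corpus, queries))  # 0.5 1.0
```

A few dozen real queries with known answers from your own data will tell you more than a leaderboard delta.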

Our Decision: Why We’re (Still) on text-embedding-3-small

We run two AI agents with shared memory on a Hetzner VPS (8 GB RAM, no GPU). Our embedding model handles mem0 semantic search — writing and retrieving memories across both agents via a shared Qdrant collection.

We started with nomic-embed-text at 768 dimensions. Search quality was bad enough that basic paraphrase queries missed entirely. We switched to text-embedding-3-small via OpenRouter at 1536 dimensions. It costs effectively nothing (fractions of a cent per month at our volume) and the search quality improved meaningfully — though a 0.49 score on a near-paraphrase still isn’t impressive.

We haven’t switched to a self-hosted alternative yet for a boring reason: it works, it’s cheap, and migrating means recreating the Qdrant collection and re-embedding all memories. The juice hasn’t been worth the squeeze.
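The migration cost comes from two facts: a Qdrant collection has a fixed vector size, and vectors from different models are not comparable. Every stored memory therefore has to be re-embedded into a fresh collection. Below is a sketch of that loop against plain in-memory dicts rather than a real Qdrant client. It assumes each stored point still carries its original text, and `migrate_collection` and `fake_embed` are hypothetical names.

```python
def migrate_collection(old_points, embed, batch_size=64):
    """Re-embed stored memories with a new model.

    old_points: iterable of dicts with "id" and "text".
    embed: callable taking a list of texts and returning a list of vectors.
    Returns new points; the old collection is left untouched so it can
    keep serving reads until the new one is verified.
    """
    old_points = list(old_points)
    new_points = []
    for start in range(0, len(old_points), batch_size):
        batch = old_points[start:start + batch_size]
        vectors = embed([p["text"] for p in batch])
        for point, vector in zip(batch, vectors):
            new_points.append({"id": point["id"], "text": point["text"], "vector": vector})
    # A collection has a single vector size, so fail early on a mismatch.
    dims = {len(p["vector"]) for p in new_points}
    if len(dims) > 1:
        raise ValueError(f"inconsistent vector sizes: {sorted(dims)}")
    return new_points

# Stand-in embedder for illustration; a real one would call the new model.
def fake_embed(texts):
    return [[float(len(t))] * 4 for t in texts]

old = [
    {"id": 1, "text": "Prefers TypeScript over Python"},
    {"id": 2, "text": "Lives in Berlin"},
    {"id": 3, "text": "Drinks black coffee"},
]
new_points = migrate_collection(old, fake_embed, batch_size=2)
print(len(new_points), len(new_points[0]["vector"]))  # 3 4
```

Writing into a new collection and switching reads over afterwards keeps the old index available as a rollback.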

But if we were starting fresh today, we’d run BGE-M3 at 1024 dimensions. It fits in 1.1 GB, scores higher than what we’re paying for, supports dense+sparse+ColBERT retrieval, and eliminates the OpenRouter dependency entirely. On our 8 GB VPS with ~4 GB free, it runs comfortably alongside everything else.

The cost savings are marginal — we’re talking pennies per month. The quality improvement and elimination of a network dependency are the real reasons to switch.
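For a sense of scale, here is the arithmetic behind "pennies per month". The price is the $0.02 per million tokens quoted earlier. The volume figures are hypothetical placeholders, not our actual traffic.

```python
PRICE_PER_MILLION_TOKENS = 0.02  # USD, text-embedding-3-small as quoted in this post

def monthly_cost_usd(items_per_day, avg_tokens_per_item, days=30,
                     price_per_million=PRICE_PER_MILLION_TOKENS):
    """Embedding spend for one month at a steady daily volume."""
    tokens = items_per_day * avg_tokens_per_item * days
    return tokens / 1_000_000 * price_per_million

# Hypothetical volume: 200 memory writes and queries per day, 40 tokens each.
cost = monthly_cost_usd(items_per_day=200, avg_tokens_per_item=40)
print(f"${cost:.4f} per month")  # $0.0048 per month
```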

The Quick Decision Matrix

| Your Budget | Text Pick | Multimodal Pick | Total VRAM | Headroom |
| --- | --- | --- | --- | --- |
| 8 GB | BGE-M3 (1.1 GB) | Nomic Vision (0.3 GB) | 1.4 GB | 6.6 GB |
| 16 GB | GTE-Qwen2-1.5B (3 GB) | SigLIP 2 So400m (1 GB) | 4 GB | 12 GB |
| 32 GB | Qwen3-Embedding-8B (16 GB) | SigLIP 2 So400m (1 GB) | 17 GB | 15 GB |
| 64 GB | Qwen3-Embedding-8B (16 GB) | Jina v4 (8 GB) | 24 GB | 40 GB |

All Apache 2.0 or MIT licensed. All self-hostable via Ollama, Hugging Face TEI, or sentence-transformers.
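The matrix is simple addition and subtraction, so it is easy to redo for your own mix of models. This small sketch uses the approximate VRAM figures from this post; `plan` is a hypothetical helper, and the numbers cover weights only.

```python
# Approximate VRAM per model in GB, as listed in this post.
MODEL_VRAM_GB = {
    "BGE-M3": 1.1,
    "Nomic Embed Vision v1.5": 0.3,
    "GTE-Qwen2-1.5B-instruct": 3.0,
    "SigLIP 2 So400m": 1.0,
    "Jina Embeddings v4": 8.0,
    "Qwen3-Embedding-8B": 16.0,
}

def plan(budget_gb, models):
    """Return (total VRAM used, headroom left) in GB for a set of models."""
    total = sum(MODEL_VRAM_GB[m] for m in models)
    if total > budget_gb:
        raise ValueError(f"{total:.1f} GB needed, only {budget_gb} GB available")
    return round(total, 1), round(budget_gb - total, 1)

print(plan(8, ["BGE-M3", "Nomic Embed Vision v1.5"]))         # (1.4, 6.6)
print(plan(64, ["Qwen3-Embedding-8B", "Jina Embeddings v4"]))  # (24.0, 40.0)
```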

Stop paying for embeddings that open-source models beat on a laptop.


We’re Tensaku Labs. We build the tools we wish our AI agents had.
