Our AI memory system couldn’t find “Prefers TypeScript over Python” when asked “what programming language does Justin prefer?”
That’s a near-paraphrase. The words overlap. Any decent embedding model should nail it. Ours missed it entirely.
We were running nomic-embed-text via Ollama on a Hetzner VPS. 768 dimensions, ~137M parameters, the model every YouTube tutorial tells you to use. It was fast, it was small, and it was producing search results that ranged from “occasionally useful” to “why did you return that?”
So we switched to OpenAI’s text-embedding-3-small via OpenRouter. 1536 dimensions, $0.02 per million tokens. The TypeScript query finally returned a result — at a 0.49 similarity score.
0.49 is not good. That’s barely above noise. For a direct semantic match, you want 0.7 or higher. We’d moved from “doesn’t work” to “technically works but not well.”
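For context on what a score like 0.49 measures: we're assuming the number is cosine similarity, which is the usual metric for text embeddings in a vector store, though the metric isn't stated above. A minimal sketch with toy vectors (not real embeddings) shows the scale, where 1.0 means the same direction and 0.0 means unrelated:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy 3-dimensional vectors, purely illustrative.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(round(same_direction, 3), round(orthogonal, 3))
```

Vector length doesn't affect the score, only direction, which is why scores from a 768-dimension model and a 1536-dimension model sit on the same nominal scale even though the vectors themselves can't be mixed.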
That prompted a deeper question: what should we actually be running?
The Landscape in 2026
The embedding model space has quietly gotten very good while most of us were paying attention to chat models. Here’s what exists, organized by the four use cases that matter.
English-Only Text
If your data is English and text-only, these are the current leaders:
| Model | MTEB Score | Dims | Params | VRAM | License |
|---|---|---|---|---|---|
| NV-Embed-v2 | 72.31 | 4096 | 7.8B | ~30 GB | CC-BY-NC-4.0 |
| GTE-Qwen2-7B-instruct | 70.24 | 3584 | 7.6B | ~16 GB | Apache 2.0 |
| ModernBERT-Embed Large | ~65 | 768 | 395M | ~0.8 GB | Apache 2.0 |
| Snowflake Arctic Embed L v2.0 | 55.6 | 1024 | 303M | ~0.6 GB | Apache 2.0 |
NV-Embed-v2 posts the highest MTEB score in this table, but it requires a data center GPU and carries a non-commercial license. GTE-Qwen2-7B is the best commercially usable option if you have the VRAM.
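The VRAM column looks consistent with a simple rule of thumb: parameter count times bytes per parameter. That's an assumption about how the figures were derived, not something the table states, and it covers weights only (no activations, batch size, or runtime overhead). A rough sketch, where `weight_memory_gb` is a hypothetical helper:

```python
def weight_memory_gb(params, bytes_per_param=2):
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes).

    bytes_per_param: 2 for 16-bit weights, 4 for 32-bit weights.
    """
    return params * bytes_per_param / 1e9

# Parameter counts taken from the table above; precision is assumed.
modernbert_fp16 = weight_memory_gb(395e6)       # ModernBERT-Embed Large, 16-bit
gte_7b_fp16 = weight_memory_gb(7.6e9)           # GTE-Qwen2-7B-instruct, 16-bit
nv_embed_fp32 = weight_memory_gb(7.8e9, 4)      # NV-Embed-v2, 32-bit

print(modernbert_fp16, gte_7b_fp16, nv_embed_fp32)
```

Under those assumptions the estimates land close to the table's ~0.8 GB, ~16 GB, and ~30 GB, which makes the rule handy for sizing a model that isn't listed here.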
General Text (All-Rounders)
Models that work well across retrieval, classification, clustering, and semantic similarity:
| Model | MTEB Score | Dims | Params | VRAM | License |
|---|---|---|---|---|---|
| Qwen3-Embedding-8B | 70.58 | up to 4096 | 8B | ~16 GB | Apache 2.0 |
| Qwen3-Embedding-4B | ~68 | up to 2560 | 4B | ~8 GB | Apache 2.0 |
| GTE-Qwen2-1.5B-instruct | 67.20 | 1536 | 1.5B | ~3 GB | Apache 2.0 |
| Nomic Embed Text v2 (MoE) | ~65 | 768 | 305M active | ~1 GB | Apache 2.0 |
| Qwen3-Embedding-0.6B | ~63 | 1024 | 0.6B | ~1.2 GB | Apache 2.0 |
Qwen3-Embedding-8B is the current #1 on the MTEB leaderboard. The family scales well — the 0.6B version still beats OpenAI’s text-embedding-3-small while running on a CPU.
Nomic v2 is interesting: it’s a Mixture-of-Experts model, so only 305M of its 475M parameters activate per inference. Extremely efficient, but limited to 512 token context.
Multilingual
If your data includes non-English text:
| Model | MTEB Score | Dims | Params | VRAM | Languages | License |
|---|---|---|---|---|---|---|
| Qwen3-Embedding-8B | 70.58 | up to 4096 | 8B | ~16 GB | 100+ | Apache 2.0 |
| Jina Embeddings v4 | 66.49 | 2048 | 3.8B | ~8 GB | Multi | Apache 2.0 |
| BGE-M3 | 63-64 | 1024 | 568M | ~1.1 GB | 100+ | MIT |
| Nomic Embed Text v2 | ~65 | 768 | 305M active | ~1 GB | ~100 | Apache 2.0 |
BGE-M3 deserves special mention. At 568M parameters and 1.1 GB of VRAM, it supports dense, sparse, and ColBERT retrieval in a single model. MIT licensed. 8K context. 100+ languages. It’s the most versatile embedding model you can run on virtually anything.
Multimodal (Text + Image)
When you need to embed images, PDFs, or charts alongside text:
| Model | Benchmark | Dims | Params | VRAM | Modalities | License |
|---|---|---|---|---|---|---|
| Jina Embeddings v4 | 66.49 MMTEB | 2048 | 3.8B | ~8 GB | Text/Image/Docs | Apache 2.0 |
| Nomic Embed Multimodal 7B | 58.8 Vidore-v2 | 768 | 7B | ~14 GB | Text/Image/PDF | Apache 2.0 |
| SigLIP 2 (So400m) | SOTA ImageNet | 1152 | 400M | ~1 GB | Text/Image | Apache 2.0 |
| Nomic Embed Vision v1.5 | Aligned to text | 768 | 137M | ~0.3 GB | Image | Apache 2.0 |
Jina v4 is the Swiss Army knife here — text, images, and visual documents in one model. SigLIP 2 is the lightweight option for pure text-image alignment.
If your data includes video or audio: these self-hosted models don’t cover that. Google’s Gemini Embedding 2 (API-only, $0.25/1M tokens) embeds all four modalities in a unified vector space — no transcription pipeline required. When that’s worth it is a different post.
Where OpenAI Sits
Here’s the uncomfortable truth about paying for embeddings:
MTEB Score Scale:
72 | #### NV-Embed-v2 (64GB, non-commercial)
71 | #### Qwen3-8B (32GB)
70 | #### GTE-Qwen2-7B (64GB)
69 |
68 | #### Qwen3-4B (16GB)
67 | #### GTE-Qwen2-1.5B (16GB)
66 | #### Jina v4 (32GB)
65 | #### Nomic v2 MoE (8GB) / ModernBERT (8GB)
| - - - text-embedding-3-large ($0.13/1M) - - - 64.6
64 | #### BGE-M3 (8GB)
63 | #### Qwen3-0.6B (8GB)
| - - - text-embedding-3-small ($0.02/1M) - - - 62.3
62 |
OpenAI’s embedding models sit at roughly the bottom of what you can self-host for free. Small open-source models such as Qwen3-Embedding-0.6B and BGE-M3 match or exceed text-embedding-3-small, and the larger text-embedding-3-large gets beaten by models that run on consumer hardware.
The only reason to pay OpenAI for embeddings is zero infrastructure overhead. Not quality.
Picking by Hardware Budget
Most of us don’t have a dedicated GPU for embeddings. We’re running them alongside other workloads — an LLM, a database, the rest of our stack. Here’s what fits when you account for that:
8 GB (e.g., RTX 4060, M1/M2 base, small VPS)
~3-4 GB available for embeddings after other workloads.
| Model | VRAM | MTEB | Notes |
|---|---|---|---|
| Nomic Embed Vision v1.5 | ~0.3 GB | Aligned | Image embeddings |
| Snowflake Arctic Embed L v2.0 | ~0.6 GB | 55.6 | Lightweight retrieval |
| ModernBERT-Embed Large | ~0.8 GB | ~65 | Solid English-only |
| Nomic Embed Text v2 MoE | ~1 GB | ~65 | General text, very efficient |
| BGE-M3 | ~1.1 GB | 63-64 | Best value at this tier |
| Qwen3-Embedding-0.6B | ~1.2 GB | ~63 | 32K context window |
Sweet spot: BGE-M3 (1.1 GB) + Nomic Vision (0.3 GB) = 1.4 GB total for text + image, leaving 6+ GB for everything else.
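The budget math is simple enough to script if you want to try other combinations. A small sketch, using the VRAM figures from the table above; `headroom_gb` and the 3 GB embedding allowance are illustrative choices, not fixed rules:

```python
def headroom_gb(total_gb, model_sizes_gb):
    """Memory left after loading the given models; negative means they don't fit."""
    return total_gb - sum(model_sizes_gb.values())

# Sizes in GB, copied from the 8 GB tier table.
stack = {"BGE-M3": 1.1, "Nomic Embed Vision v1.5": 0.3}

left = headroom_gb(8, stack)
# Conservative end of the "~3-4 GB available for embeddings" estimate.
fits_embedding_allowance = sum(stack.values()) <= 3.0

print(round(left, 1), fits_embedding_allowance)
```

Swapping in Qwen3-Embedding-0.6B (1.2 GB) for BGE-M3 changes the total by a tenth of a gigabyte, so at this tier the choice comes down to features rather than memory.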
16 GB (e.g., RTX 4080, M2 Pro)
~6-8 GB available.
Everything above, plus:
| Model | VRAM | MTEB | Notes |
|---|---|---|---|
| SigLIP 2 (ViT-g) | ~2 GB | Best accuracy | Premium text-image |
| GTE-Qwen2-1.5B-instruct | ~3 GB | 67.20 | Big quality jump |
Sweet spot: GTE-Qwen2-1.5B (3 GB) — a significant jump over the budget tier while leaving room for a 7B LLM.
32 GB (e.g., RTX 5090, M2 Max)
~12-16 GB available.
| Model | VRAM | MTEB | Notes |
|---|---|---|---|
| Jina Embeddings v4 | ~8 GB | 66.49 | Best multimodal |
| Qwen3-Embedding-8B | ~16 GB | 70.58 | Current #1 overall |
Sweet spot: Qwen3-Embedding-8B if you’re all-in on text. Jina v4 if you need multimodal and want room for an LLM alongside it.
64 GB (e.g., A100, M3 Ultra)
| Model | VRAM | MTEB | Notes |
|---|---|---|---|
| NV-Embed-v2 | ~30 GB | 72.31 | Absolute best (non-commercial) |
Sweet spot: Qwen3-Embedding-8B (16 GB) + Jina v4 (8 GB) = 24 GB for best-in-class text + multimodal, with 40 GB left for inference.
Why Influencers Keep Recommending nomic-embed-text
You’ll notice nomic-embed-text (v1) still dominates tutorials and YouTube walkthroughs. It’s the “install Ubuntu” of embedding models — recommended everywhere because it was first and easy, not because it’s best.
The reasons:
- First-mover on Ollama. It was one of the earliest quality embedding models available via `ollama pull`. Most tutorials were written when it was genuinely the best local option.
- Tiny and fast. At 137M parameters and 0.3 GB, it runs on anything. “Runs on a Raspberry Pi” gets more views than “scores 3 points higher on MTEB.”
- Good enough for demos. MTEB 62 vs 65 doesn’t matter when you’re embedding 20 documents for a tutorial. The gap shows up at scale.
- Most content creators don’t benchmark. They pull the model, embed a few docs, get results, say “works great.” They never compare retrieval precision against alternatives on the same dataset.
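Comparing models on the same dataset doesn't take much. Here's a minimal sketch of a hit-rate check: for each query, does the expected document land in the top k results? The `toy_embed` function is a keyword-counting stand-in so the example runs without any model; in practice you'd replace it with a call to each candidate model. All names here are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def hit_rate_at_k(embed, docs, queries, k=1):
    """Fraction of queries whose expected doc id appears in the top-k results.

    embed:   function mapping a string to a vector
    docs:    dict of doc id -> text
    queries: list of (query text, expected doc id) pairs
    """
    if not queries:
        raise ValueError("need at least one query")
    doc_vecs = {doc_id: embed(text) for doc_id, text in docs.items()}
    hits = 0
    for query, expected_id in queries:
        q = embed(query)
        ranked = sorted(doc_vecs, key=lambda d: cosine(q, doc_vecs[d]), reverse=True)
        if expected_id in ranked[:k]:
            hits += 1
    return hits / len(queries)

# Stand-in embedder for demonstration only: counts a few keywords.
VOCAB = ["typescript", "python", "coffee", "tea"]

def toy_embed(text):
    words = text.lower().replace("?", "").split()
    return [float(words.count(w)) for w in VOCAB]

docs = {"m1": "Prefers TypeScript over Python", "m2": "Drinks coffee not tea"}
queries = [("typescript or python", "m1"), ("coffee or tea", "m2")]
score = hit_rate_at_k(toy_embed, docs, queries, k=1)
print(score)
```

Run the same `docs` and `queries` through each model's embedder and compare the numbers. A few dozen real queries from your own data will tell you more than a leaderboard delta.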
If you’re following a tutorial from 2024, update the model. Nomic v2 MoE, BGE-M3, and Qwen3-Embedding-0.6B all score higher on MTEB and are just as easy to run.
Our Decision: Why We’re (Still) on text-embedding-3-small
We run two AI agents with shared memory on a Hetzner VPS (8 GB RAM, no GPU). Our embedding model handles mem0 semantic search — writing and retrieving memories across both agents via a shared Qdrant collection.
We started with nomic-embed-text at 768 dimensions. Search quality was bad enough that basic paraphrase queries missed entirely. We switched to text-embedding-3-small via OpenRouter at 1536 dimensions. It costs effectively nothing (fractions of a cent per month at our volume) and the search quality improved meaningfully — though a 0.49 score on a near-paraphrase still isn’t impressive.
We haven’t switched to a self-hosted alternative yet for a boring reason: it works, it’s cheap, and migrating means recreating the Qdrant collection and re-embedding all memories. The juice hasn’t been worth the squeeze.
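The re-embedding is unavoidable because vectors from different models, or with different dimensions, aren't comparable: a collection built at 1536 dimensions can't be queried with 1024-dimension vectors. The shape of the migration is roughly this. It's a sketch only: a plain dict stands in for the Qdrant collection, and `reembed`, `batched`, and `fake_embed_batch` are hypothetical names rather than mem0 or Qdrant APIs.

```python
def batched(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def reembed(memories, embed_batch, batch_size=64):
    """Build a fresh id -> vector mapping using the new model.

    memories:    dict of memory id -> original text
    embed_batch: function mapping a list of texts to a list of vectors
    """
    ids = list(memories)
    new_vectors = {}
    for chunk in batched(ids, batch_size):
        vectors = embed_batch([memories[i] for i in chunk])
        if len(vectors) != len(chunk):
            raise RuntimeError("embedder returned a different number of vectors than inputs")
        for memory_id, vec in zip(chunk, vectors):
            new_vectors[memory_id] = list(vec)
    return new_vectors

# Fake embedder so the sketch runs without a model: 2-dimensional vectors.
def fake_embed_batch(texts):
    return [[float(len(t)), 1.0] for t in texts]

memories = {"a": "Prefers TypeScript over Python", "b": "Runs on a Hetzner VPS"}
new_vectors = reembed(memories, fake_embed_batch, batch_size=1)
print(new_vectors)
```

The important detail is that the original text has to be available for every memory. If only vectors were stored, there would be nothing to re-embed.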
But if we were starting fresh today, we’d run BGE-M3 at 1024 dimensions. It fits in 1.1 GB, scores higher than what we’re paying for, supports dense+sparse+ColBERT retrieval, and eliminates the OpenRouter dependency entirely. On our 8 GB VPS with ~4 GB free, it runs comfortably alongside everything else.
The cost savings are marginal — we’re talking pennies per month. The quality improvement and elimination of a network dependency are the real reasons to switch.
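To put "pennies" in numbers, here's the arithmetic using the per-million-token prices quoted earlier. The monthly token volume is a made-up figure for illustration; our actual volume isn't stated in this post.

```python
def monthly_cost_usd(tokens_per_month, price_per_million_usd):
    """API cost for a month of embedding calls at a flat per-token price."""
    return tokens_per_month / 1_000_000 * price_per_million_usd

HYPOTHETICAL_TOKENS = 100_000  # assumed volume, not a measured one

small = monthly_cost_usd(HYPOTHETICAL_TOKENS, 0.02)  # text-embedding-3-small
large = monthly_cost_usd(HYPOTHETICAL_TOKENS, 0.13)  # text-embedding-3-large

print(f"${small:.4f} vs ${large:.4f} per month")
```

At that volume both models cost well under two cents a month, so price alone won't justify a migration in either direction.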
The Quick Decision Matrix
| Your Budget | Text Pick | Multimodal Pick | Total VRAM | Headroom |
|---|---|---|---|---|
| 8 GB | BGE-M3 (1.1 GB) | Nomic Vision (0.3 GB) | 1.4 GB | 6.6 GB |
| 16 GB | GTE-Qwen2-1.5B (3 GB) | SigLIP 2 So400m (1 GB) | 4 GB | 12 GB |
| 32 GB | Qwen3-Embedding-8B (16 GB) | SigLIP 2 So400m (1 GB) | 17 GB | 15 GB |
| 64 GB | Qwen3-Embedding-8B (16 GB) | Jina v4 (8 GB) | 24 GB | 40 GB |
All Apache 2.0 or MIT licensed. All self-hostable via Ollama, Hugging Face TEI, or sentence-transformers.
Stop paying for embeddings that open-source models beat on a laptop.
We’re Tensaku Labs. We build the tools we wish our AI agents had.