Bigger Context Windows Don't Make Better AI

Every few months, a lab announces a new context window record. 128K. 200K. 1M tokens. The implicit promise is always the same: more context means smarter AI. Just put everything in. No need to curate.

This assumption is wrong. Chroma’s research team made the case rigorously last summer. Labs responded by shipping bigger context windows anyway. The research since then has only made the case stronger.

What Context Rot Actually Is

The Chroma research tested 18 LLMs across 194,480 calls. The central finding:

“Models do not use their context uniformly; instead, their performance grows increasingly unreliable as input length grows.”

This isn’t a theoretical concern. The researchers found concrete, measurable degradation across every model family tested - Anthropic, OpenAI, Google, Alibaba. Performance dropped. Hallucinations increased. Some models started producing nonsensical output: Gemini 2.5 Pro generated random character sequences starting around 500–750 words of input. GPT-4.1 nano produced unexpected character variations. Qwen3-8B went off-topic at 5,000+ words.

The problem isn’t that models can’t handle long context. It’s that they don’t handle it uniformly - and that non-uniformity grows worse as context grows longer.

The Failure Modes

Two findings stand out.

Distractors hurt non-uniformly. Adding irrelevant context doesn’t degrade performance evenly - specific chunks are far more damaging than others. GPT models showed the highest hallucination rates when distractors were present, “often generating confident but incorrect responses.” You can’t predict which content in your context is acting as a distractor. You just know some of it is.

This is what makes large context windows actively dangerous: every token you add beyond what’s necessary is a potential distractor. A 1M context window doesn’t give you more signal - it gives you a larger budget for noise. The labs selling you bigger windows are selling you more room to hurt yourself. The answer to “my model can’t find the relevant information” is almost never “add more context.” It’s almost always “remove what’s confusing it.”

Coherent structure makes things worse. All 18 models performed better on shuffled haystacks than on logically structured ones. When the context preserves a coherent narrative flow, models follow the narrative and miss the target. Structured, well-organized context - the kind you’d naturally produce - is more likely to produce retrieval failures than randomized chunks.

If you’re building a RAG system that carefully preserves document structure, this result suggests you may be hurting retrieval accuracy compared to one that mixes chunks from different sources.
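Whether shuffling helps on your own data is something to measure rather than assume. A minimal sketch of how the two variants could be built for a side-by-side evaluation follows. The function name `context_variants` and the blank-line chunk separator are assumptions made for illustration, not part of the Chroma methodology.

```python
import random


def context_variants(chunks: list[str], seed: int = 0) -> dict[str, str]:
    """Build a structure-preserving and a shuffled context from the same chunks.

    Hypothetical helper: both variants hold identical content, so any
    difference in answer quality comes from ordering alone.
    """
    shuffled = chunks[:]                      # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)     # seeded for reproducible comparisons
    return {
        "structured": "\n\n".join(chunks),
        "shuffled": "\n\n".join(shuffled),
    }


chunks = ["Intro to refunds.", "Refund window is 30 days.", "How to contact support."]
variants = context_variants(chunks, seed=42)
```

Run the same questions against both strings and compare accuracy before deciding which chunking strategy to ship.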

A follow-up paper published at EMNLP 2025 pushed the finding further: even when models can perfectly retrieve all relevant evidence, performance still degrades by 14–85% as input length grows. The problem isn’t retrieval quality. It’s the context itself.

The Benchmark Problem

The industry has been measuring long-context performance with Needle-in-a-Haystack (NIAH) evaluations: hide a fact in a long document, ask the model to find it. The Chroma research shows NIAH is a poor proxy for real-world performance.

NIAH tests “direct lexical matching” - the answer is almost always phrased similarly to the question. Real retrieval tasks require semantic reasoning, not just token matching.
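One rough way to check whether your own evaluation set leans on lexical matching is to measure word overlap between each question and the passage that answers it. The sketch below uses Jaccard overlap with a deliberately crude tokenizer. The function names and the example sentences are invented for illustration and are not taken from any benchmark.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercased alphanumeric word set; a crude tokenizer for illustration."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def lexical_overlap(question: str, needle: str) -> float:
    """Jaccard overlap between the word sets of a question and a needle."""
    q, n = tokens(question), tokens(needle)
    if not q or not n:
        return 0.0
    return len(q & n) / len(q | n)


question = "What city did Maya study in?"
# Needle phrased almost like the question: findable by token matching alone.
literal = lexical_overlap(question, "Maya did study in the city of Lisbon.")
# Needle that shares no words with the question: needs semantic reasoning.
indirect = lexical_overlap(question, "Her alma mater overlooks the Tagus river.")
```

If most pairs in an evaluation set score high on a measure like this, the set is mostly testing token matching.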

And 72.4% of needle-question pairs from a common benchmark called NoLiMa require external knowledge - meaning the benchmark is partially testing what the model already knows, not whether it can retrieve from context.

Models have been benchmarked on a task that doesn’t represent what they’ll actually be asked to do. The improvements we’ve been celebrating may not transfer.

The Thing Nobody Is Talking About

Context window size is a hardware spec. It tells you the upper bound - how many tokens the model can technically accept. It says nothing about what happens to reasoning quality as you approach that bound.

The Chroma team’s conclusion cuts through the arms race directly:

“Whether relevant information is present in a model’s context is not all that matters; what matters more is how that information is presented.”

This is context engineering: the deliberate construction and management of what goes into the context window, not just how large that window is. What you include. What you exclude. What order things appear in. How similar the surrounding content is to the answer you need. Whether stale or contradictory information has been removed.

None of this gets solved by increasing the token limit. A larger token limit makes it easier to avoid doing this work - and easier to accidentally stuff in content that actively degrades the answer.

Research from February 2026 explains why: attention dilution is architectural, not a training problem. Soft-attention Transformers have fixed capacity - as context grows, attention spreads thinner across more tokens. A 1M context window doesn’t give you 1M tokens of usable attention. Effective context scales sub-linearly with nominal length. The labs know this. They’re selling you a number.
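The intuition can be seen in a toy calculation. Softmax weights always sum to 1, so every extra token competes for the same budget. The sketch below assumes a single attention head, one relevant token, and equal scores for every other token. That is a simplification chosen for illustration, not the analysis from the cited research, and the `score_gap` value is arbitrary.

```python
import math


def relevant_attention_share(n_tokens: int, score_gap: float) -> float:
    """Softmax weight on one relevant token among n_tokens.

    Toy model: the relevant token's score exceeds the score shared by the
    other n_tokens - 1 tokens by score_gap.
    """
    if n_tokens < 1:
        raise ValueError("n_tokens must be at least 1")
    boost = math.exp(score_gap)
    return boost / (boost + (n_tokens - 1))


# Same score advantage, growing context: the relevant token's share shrinks.
shares = {n: relevant_attention_share(n, score_gap=5.0)
          for n in (1_000, 10_000, 100_000, 1_000_000)}
for n, share in shares.items():
    print(f"{n:>9} tokens -> {share:.6f}")
```

In this toy model, holding the share constant as context grows would require the score gap to grow with the logarithm of the context length.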

What This Means in Practice

If you’re building on top of an LLM today, stop asking “does our context fit?” Ask these instead:

  • What’s in this context that shouldn’t be?
  • Is any of this content semantically similar to the answer in a way that creates distractor noise?
  • Is the most relevant content positioned where the model is most likely to attend to it?
  • Are we preserving document structure out of habit when randomized chunking might perform better?
  • When did this content last change, and is any of it now contradictory to other content in the window?
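Several of these questions can be turned into a filtering pass that runs before anything reaches the model. The following is a minimal sketch, not a definitive implementation. The `Chunk` fields, the thresholds, the character budget, and the choice to place the most relevant content first are all assumptions you would replace with values measured on your own system.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Chunk:
    text: str
    relevance: float  # hypothetical score in [0, 1] from your own retriever
    updated: date     # when the source content last changed


def curate(chunks: list[Chunk], *, not_before: date,
           min_relevance: float = 0.5, max_chars: int = 2000) -> list[Chunk]:
    """Keep only fresh, relevant, non-duplicate chunks, most relevant first."""
    seen: set[str] = set()
    kept: list[Chunk] = []
    used = 0
    for chunk in sorted(chunks, key=lambda c: c.relevance, reverse=True):
        normalized = " ".join(chunk.text.lower().split())
        if chunk.relevance < min_relevance:
            continue  # likely noise
        if chunk.updated < not_before:
            continue  # stale, may contradict newer content
        if normalized in seen:
            continue  # duplicate after case and whitespace normalization
        if used + len(chunk.text) > max_chars:
            continue  # over budget; a smaller later chunk may still fit
        seen.add(normalized)
        kept.append(chunk)
        used += len(chunk.text)
    return kept


sample = [
    Chunk("Refund window is 30 days.", 0.9, date(2025, 6, 1)),
    Chunk("refund window is  30 days.", 0.8, date(2025, 6, 2)),
    Chunk("Refund window is 14 days.", 0.95, date(2023, 1, 1)),
    Chunk("Office dog policy.", 0.1, date(2025, 6, 1)),
    Chunk("Refunds go to the original payment method.", 0.7, date(2025, 5, 1)),
]
kept = curate(sample, not_before=date(2025, 1, 1))
```

Here the outdated 14-day policy is dropped despite its high relevance score, which is the kind of contradiction a larger window would have let through silently.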

The industry is building larger buckets. The work is figuring out what not to put in them.

