Context Rot: How increasing input tokens impacts LLM performance

research.trychroma.com

123 points by kellyhongsn 10 hours ago

I work on research at Chroma, and I just published our latest technical report on context rot.

TLDR: Model performance is non-uniform across context lengths, even for state-of-the-art models including GPT-4.1, Claude 4, Gemini 2.5, and Qwen3.

This highlights the need for context engineering. Whether relevant information is present in a model’s context is not all that matters; what matters more is how that information is presented.

Here is the complete open-source codebase to replicate our results: https://github.com/chroma-core/context-rot

posnet 7 hours ago

I've definitely noticed this anecdotally.

Especially with Gemini Pro when supplying long-form textual references: providing many documents in a single context window gives worse answers than having it summarize the documents first, asking questions against the summaries only, and then providing the full text of the sub-documents on request (RAG-style, or just a simple agent loop).
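
Roughly the loop I mean, as a minimal sketch (ask_llm is just a placeholder for whatever chat-completion call you actually use; nothing here is a real API):

    # Sketch of the summarize-first, drill-down-on-request pattern.
    # `ask_llm` stands in for your model client.

    def ask_llm(prompt: str) -> str:
        raise NotImplementedError  # plug in your model call here

    def answer(question: str, documents: dict[str, str]) -> str:
        # 1. Summarize each document in its own small context.
        summaries = {name: ask_llm(f"Summarize this document:\n\n{text}")
                     for name, text in documents.items()}

        # 2. Ask the question against the summaries only, and have the model
        #    name which documents it actually needs in full.
        index = "\n\n".join(f"[{name}]\n{s}" for name, s in summaries.items())
        picks = ask_llm(f"Question: {question}\n\nDocument summaries:\n{index}\n\n"
                        "Reply with the names of the documents needed, comma-separated.")

        # 3. Provide the full text of only the requested documents.
        chosen = [n.strip() for n in picks.split(",") if n.strip() in documents]
        full_text = "\n\n".join(f"[{n}]\n{documents[n]}" for n in chosen)
        return ask_llm(f"Question: {question}\n\nRelevant documents:\n{full_text}")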

Similarly, I've personally noticed that Claude Code with Opus or Sonnet gets worse the more compactions happen. It's unclear to me whether the summary itself gets worse, or whether the context window ends up with a higher percentage of less relevant data, but even clearing the context and asking it to re-read the relevant files (even if they were mentioned and summarized in the compaction) gives better results.

  • zwaps 7 hours ago

    Gemini loses coherence and reasoning ability well before the chat hits the context limit, and according to this report, it is the best model on several dimensions.

    Long story short: Context engineering is still king, RAG is not dead

    • tvshtr 6 hours ago

      Yep, it can decohere really badly with bigger context. It's not only context-related though. Sometimes it loses focus early on in a way that makes it impossible to get it back on track.

    • Inviz 4 hours ago

      Cursor lifted the "Start a new chat" limitation on Gemini, and I'm actually now enjoying keeping longer sessions within one window, because it's still very reasonable at recall but doesn't need to restate everything each time.

    • deadbabe 5 hours ago

      RAG was never going away; the people who say that are the same types who say software engineers will be totally replaced by AI.

      LLMs will need RAG one way or another. You can hide it from the user, but it still must be there.

    • risyachka 7 hours ago

      Yep. The easiest way to tell someone has no experience with LLMs is if they say “RAG is dead”

      • apwell23 5 hours ago

        > someone has no experience with LLMs

        That's 99% of coders. No need to gatekeep.

  • irskep an hour ago

    "Compactions" are just reducing the transcript to a summary of the transcript, right? So it makes sense that it would get worse because the agent is literally losing information, but it wouldn't be due to context rot.

    The thing that would signal context rot is when you approach the auto-compact threshold. Am I thinking about this right?

  • bayesianbot 5 hours ago

    I feel like the optimal coding agent would do this automatically - collect and (sometimes) summarize the required parts of code, MCP responses, repo maps etc., then combine the results into a new message in a new 'chat' that would contain all the required parts and nothing else. It's basically what I already do with aider, and I feel the performance (in situations with a lot of context) is way better than any agentic / more automated workflow I've tried so far, but it is a lot of work.
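
    Something like this, conceptually (just a sketch; ask_llm stands in for the actual model call, and the inputs would come from grep/repo-map/tool runs):

        # Conceptual sketch only; ask_llm is a stand-in for the real model client.

        def ask_llm(prompt: str) -> str:
            raise NotImplementedError  # plug in your client here

        def fresh_context_answer(task: str, code_snippets: list[str],
                                 tool_outputs: list[str], repo_map: str) -> str:
            # Summarize noisy tool/MCP output separately so only the gist carries over.
            tool_notes = ask_llm("Summarize for a coding task:\n\n" + "\n\n".join(tool_outputs))

            # Assemble one clean message for a brand-new chat: the required parts, nothing else.
            prompt = "\n\n".join([
                f"Task: {task}",
                f"Repo map:\n{repo_map}",
                "Relevant code:\n" + "\n\n".join(code_snippets),
                f"Tool output (summarized):\n{tool_notes}",
            ])
            return ask_llm(prompt)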

  • tough 7 hours ago

    Have you tried NotebookLM? It basically does this as an app in the background (chunking and summarizing many docs), and you can chat with the full corpus using RAG.

lukev 6 hours ago

This effect is well known but not well documented so far, so great job here.

It's actually even more significant than it's possible to benchmark easily (though I'm glad this paper has done so).

Truly useful LLM applications live at the boundaries of what the model can do. That is, attending to some aspect of the context that might be several logical "hops" away from the actual question or task.

I suspect that the context rot problem gets much worse for these more complex tasks... in fact, exponentially so for each logical "hop" which is required to answer successfully. Each hop compounds the "attention difficulty" which is increased by long/distracting contexts.

Workaccount2 an hour ago

What's really needed is a way to easily prune context. If I could go and manually manage the entire chat with a model, I could squeeze way more juice out of a typical ~200k token coding session.

Instead I have a good instance going, but the model fumbles for 20k tokens and then that session is heavily rotted. Let me cut it out!
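
i.e. something like this, assuming the usual list-of-messages chat format (sketch only, not any particular tool's API):

    # Sketch: drop a bad stretch of messages before resending the history.
    # Assumes the common list of {"role": ..., "content": ...} dicts.

    def prune(messages: list[dict], start: int, end: int) -> list[dict]:
        """Remove messages[start:end], e.g. the stretch where the model fumbled."""
        return messages[:start] + messages[end:]

    # pruned = prune(history, 40, 80)  # cut the rotted detour, keep the good parts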

  • aaronblohowiak an hour ago

    Even just a rollback to a previous checkpoint would be a killer feature.

  • lordswork an hour ago

    /compress is the command to do this in most CLI agents.

lifthrasiir 3 hours ago

I recently wrote several novels using Gemini 2.5 Flash, and the context rot is noticeable but happens far later than this report implies. In my experience, 50K to 100K tokens were required before it started to disregard the initial context (e.g. the output language). Maybe a complex task like creative writing makes the impact harder to measure or observe; in any case it remained okay enough for me once I re-supplied missing context from time to time.

tjkrusinski 8 hours ago

Interesting report. Are there recommended sizes for different models? How do I know what works or doesn't for my use case?

zwaps 7 hours ago

Very cool results, very comprehensive article, many insights!

Media literacy disclaimer: Chroma is a vectorDB company.

  • philip1209 7 hours ago

    Chroma does vector, full-text, and regex search. And it's designed for multitenant workloads typical of AI applications. So, not just a "vectorDB company".

    • firejake308 2 hours ago

      Yeah, but they benefit from convincing people not to dump everything in context, because the alternative is to dump everything into a DB (like Chroma) and then retrieve only the relevant parts (whether that's via vector search, regex search, full-text search, or whatever). I still think their thesis is correct, but readers should be aware of the author's bias and make their own judgment.
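
      For concreteness, the retrieve-only-what's-relevant pattern looks roughly like this with Chroma's Python client (check the current docs; the exact API may differ by version):

          import chromadb

          client = chromadb.Client()  # in-memory; use a persistent client for real data
          collection = client.create_collection("docs")

          # Dump everything into the DB instead of into the prompt.
          collection.add(
              ids=["doc1", "doc2"],
              documents=["first document text...", "second document text..."],
          )

          # Pull back only the few most relevant chunks for the question.
          results = collection.query(
              query_texts=["what does the spec say about X?"],
              n_results=2,
          )
          print(results["documents"])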

magicalhippo 5 hours ago

Is this due to a lack of specific long-context training, or is it more a limitation of the encoding or something similar?

I've noticed this issue as well with smaller local models that have relatively long contexts, say an 8B model with 128k context.

I imagined they performed special recall training for these long context models, but the results seem... not so great.

tough 7 hours ago

This felt intuitively true; great to see some research putting hard numbers on it.