Claude 2 and why context windows are the new RAM

Anthropic shipped Claude 2 yesterday with a 100K-token context window. The capability is genuinely interesting; the architectural shift it forces is the bigger story.

Anthropic released Claude 2 yesterday. Two things stand out from the announcement: the model is now generally available without a waitlist (Claude 1 was invite-only and felt like it), and the context window is 100,000 tokens. The capability story is real; the architectural story is bigger.

For comparison: GPT-4 ships with an 8K context (32K available to some accounts), and GPT-3.5 tops out at 16K in its newest variant. Claude 2 at 100K is more than a 6x jump over what most production systems are working against today. A 6x jump is not a generational improvement on a single dimension; it is a category change in what you can do with the prompt.

What 100K tokens actually buys you

A few practical anchors for how big 100K tokens really is:

  • A typical novel is 80K–120K words, and one English word averages about 1.3 tokens, so 100K tokens is roughly 75K words. A shorter novel fits outright, with room for the prompt and the response; a longer one nearly does. (A rough budget check is sketched after this list.)
  • A small-to-medium codebase (50–100 files of a few hundred lines each, roughly 10,000–15,000 lines in total) can be concatenated and fit in the window, with budget left over for analysis.
  • A full quarter's worth of a typical Slack channel's history, or a year of a busy email thread, fits.
  • The complete documentation for most software products fits.

That changes the integration patterns. The standard RAG pipeline (chunk the corpus, embed, retrieve top-K, inject) exists because you can't fit the whole corpus in the prompt. With 100K, for many corpora, you can. The retrieval step becomes optional, or shifts from "find the relevant chunks" to "find the relevant files."

This is why I keep saying context windows are the new RAM. The way you architect a system depends on what you can fit in working memory. When working memory was 4K (the original GPT-3.5 default), you had to be aggressive about chunking and selection; you were paging things in and out constantly. When working memory is 100K, you can hold the whole problem at once for a lot of common cases. That's a different programming model.

The patterns that change

Walking through the integration patterns that have to be rethought when the context window is this big:

Retrieval

The standard RAG pattern (embed query, retrieve top-K chunks, inject) was a workaround for small context windows. (The encoding-a-person stack from May treated retrieval as a load-bearing layer for exactly this reason.) With 100K, three things change.

First, for small-to-medium corpora (say, under 50K tokens), retrieval is no longer needed. You can just stuff the whole corpus into the prompt and let the model find what's relevant. This is faster, simpler, and avoids the failure mode where the embedding model picks the wrong chunks and the downstream model has no way to know.

Second, for larger corpora, retrieval shifts from "fine-grained chunk selection" to "coarse document selection." Find the right ten files, dump them in, let the model do the in-context reasoning. The fine-grained chunk-and-overlap dance becomes less important.

Third, for very large corpora, you still need retrieval, but the selection criteria can be more generous. You can retrieve a hundred chunks instead of five, because there's room.
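Here is a minimal sketch of what that coarse, file-level selection can look like, assuming you already have an embedding vector and a token count per file. The tuple format, the 80K-token budget, and the select_files helper are illustrative choices, not any particular library's API.

```python
# Sketch of coarse, file-level retrieval: score whole files against the query,
# take as many of the top files as fit in the budget, and concatenate them.
# The cosine-similarity ranking and the token budget are the only ideas here.

import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def select_files(query_vec, files, budget_tokens=80_000):
    """files: list of (name, text, vector, token_count) tuples."""
    ranked = sorted(files, key=lambda f: cosine(query_vec, f[2]), reverse=True)
    picked, used = [], 0
    for name, text, _vec, n_tokens in ranked:
        if used + n_tokens > budget_tokens:
            continue            # skip files that would blow the budget
        picked.append(f"### {name}\n{text}")
        used += n_tokens
    return "\n\n".join(picked)
```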

Multi-step reasoning

A lot of the multi-step reasoning patterns that have shown up over the last six months (chain-of-thought prompting, tree-of-thought search, agent loops with intermediate state) exist partly because the model couldn't keep enough state in its context to reason coherently across steps. With 100K, you can keep the entire reasoning trace in context. You can re-read it. You can ask the model to critique its own intermediate steps.

I don't yet know what this does to the multi-step reasoning patterns in practice. Some of them are about more than just context (the search structure of tree-of-thought is genuinely useful even with infinite context). Some of them might collapse to "just give the model the whole problem and let it think out loud at length."
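A rough sketch of the trace-in-context idea, with a placeholder call_model function standing in for whatever completion call you use (this is not the Anthropic SDK's actual interface): every step's output is appended to one transcript, and the final call asks the model to re-read and critique the whole trace.

```python
# Sketch of trace-in-context reasoning: every step's output stays in the prompt,
# and a final pass asks the model to critique its own intermediate steps.
# call_model(prompt) is a placeholder for your actual completion call.

def call_model(prompt: str) -> str:
    raise NotImplementedError("wire this to your Claude 2 completion call")

def solve_with_trace(problem: str, steps: list[str]) -> str:
    transcript = f"Problem:\n{problem}\n"
    for instruction in steps:
        transcript += f"\nStep: {instruction}\n"
        transcript += call_model(transcript) + "\n"   # the whole trace rides along
    # Final pass: the model re-reads everything it wrote and looks for errors.
    transcript += "\nReview the reasoning above. Point out any mistakes, then give a final answer.\n"
    return call_model(transcript)
```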

Code understanding

The most immediate practical use case I keep coming back to is code understanding. Today, working with a model on a non-trivial codebase means selecting the relevant files manually and pasting them into the prompt. With 100K, for most projects you can paste the whole thing. That changes the kinds of questions you can ask.

"Why does this function behave this way" becomes answerable in a way it wasn't before, because the model can see all the call sites, all the type definitions, all the configuration. "What would change if I refactored X" becomes answerable, because the model can trace through the impact across files. None of this is reliably solved (the model will still hallucinate, miss things, or get distracted) but the ceiling on what's possible has moved up substantially.

Document analysis

Long documents (contracts, RFCs, research papers, regulatory filings) fit in the window now. You don't have to chunk them. You don't have to summarize them down to fit. You can paste the whole thing and ask questions across the whole structure.

This is the use case that's going to land in production fastest, I think. Lawyers analyzing contracts. Compliance teams reading regulations. Researchers digesting papers. The integration is shallow ("paste the document, ask the question") and the value is immediate.

What's still unsolved

Three things to be careful about with the 100K context.

Cost scales with input. Anthropic charges per token. A 100K-token prompt is roughly 12x more expensive than an 8K-token prompt. For low-volume use this is fine; for high-volume use it adds up fast. The question of whether to paste the whole codebase or do retrieval is now partly an economic question, not just a capability question.
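A back-of-envelope way to see how that adds up, with a deliberately made-up per-token price (check Anthropic's current pricing for real numbers); the point is only that input cost scales linearly with prompt size.

```python
# Back-of-envelope input-cost comparison. PRICE_PER_1K_INPUT is a placeholder,
# not Anthropic's actual rate; the point is that cost scales linearly with
# prompt size, so 100K vs 8K is the same ~12.5x ratio at any price.

PRICE_PER_1K_INPUT = 0.01   # dollars per 1K input tokens (placeholder)

def input_cost(prompt_tokens: int, calls_per_day: int) -> float:
    return prompt_tokens / 1000 * PRICE_PER_1K_INPUT * calls_per_day

small = input_cost(8_000, calls_per_day=10_000)    # retrieval-trimmed prompts
large = input_cost(100_000, calls_per_day=10_000)  # whole-corpus prompts
print(f"8K prompts:   ${small:,.0f}/day")
print(f"100K prompts: ${large:,.0f}/day")          # ~12.5x the daily spend
```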

Latency scales too. A bigger prompt takes longer to process. Anthropic's published numbers suggest the time-to-first-token on a 100K prompt is several seconds. For interactive use, that's noticeable.

Quality on long context is uneven. The "lost in the middle" finding from a recent paper out of Stanford and Berkeley is real: models tend to attend most strongly to the start and end of their context, with degradation in the middle. A 100K context where the relevant fact is at token 47,000 may not perform as well as a 4K context where it's at token 2,000. This is an active research problem and the next generation of models will probably address it, but for now it's worth knowing.

The competitive position

The 100K context window puts Anthropic in a different competitive position than they were in a month ago. GPT-4 is more capable on most benchmarks; Claude 2 has a structural advantage on any task where the input is large. For document analysis, code understanding, and long-form reasoning over substantial corpora, Claude 2 is now the obvious choice. For pure capability per token on smaller inputs, GPT-4 is still ahead.

OpenAI will respond. The 32K GPT-4 variant exists; presumably bigger context is on the roadmap. Whether they catch up, leapfrog, or settle into a market split is going to depend on the next two or three release cycles.

For practical use today, the right pattern is probably: GPT-4 for high-quality reasoning on bounded problems, Claude 2 for problems where the input is large or the analysis spans a lot of context. Both, in the same product, routed by problem shape. That's how my own setup is going to work for a while.
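Reduced to its simplest form, that routing is just a threshold on input size. The 6,000-token cutoff and the chars-per-token heuristic below are arbitrary placeholders, and the actual API calls are left out.

```python
# Sketch of routing by problem shape: small assembled inputs go to the stronger
# reasoner, large ones go to the big-context model. Threshold is arbitrary.

def estimate_tokens(text: str) -> int:
    return len(text) // 4            # rough chars-per-token heuristic

def route(prompt: str) -> str:
    if estimate_tokens(prompt) <= 6_000:
        return "gpt-4"               # bounded problem, prioritize reasoning quality
    return "claude-2"                # large input, prioritize context capacity
```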

The longer pattern

Stepping back: every major capability release in the last six months has expanded the surface of what an integrated AI system can do. Bigger models, longer contexts, plugins, function calling, retrieval. Each release individually feels incremental. Cumulatively, the foundation has changed dramatically. The architectures we'll be building against in twelve months will look meaningfully different from the ones we're building today.

100K context windows are part of that. So is open-weights model availability. So is the agent foundation. The pieces are converging. The integrations are getting easier. The constraints are loosening.

The interesting question, which I don't have an answer to yet, is what stops being interesting. RAG has been interesting partly because of context-window constraints. As those loosen, RAG becomes a more specialized tool. Agent frameworks have been interesting partly because the model couldn't hold enough state. As context grows, some agent patterns simplify to "just keep the state in context." The foundation is moving fast enough that anything you optimize for now is the wrong thing to optimize for in eighteen months.

For the immediate work: 100K context is genuinely useful, and worth restructuring some workflows around. Just don't carve them in stone.