What "thinking models" do to your context-window math
Reasoning tokens have to live somewhere. The somewhere is your context window. Worth working through what the budget actually looks like once you stop pretending output is the only thing that costs.
When Claude 2 shipped a 100k context window in 2023, the framing was that context was the new RAM, the thing that decides whether a workload fits in one pass or has to be sliced. That framing held up well through 2024. Then reasoning models showed up and changed what "fitting" means.
The change is small enough to ignore for most use cases and big enough to matter for the ones it matters for. Worth being explicit about how the math actually works once a model is allowed to think before it speaks.
The pre-reasoning model of the budget
For a non-reasoning model, the context-window budget is straightforward to plan against. Total context is some advertised number, 200k for Claude 3.5 Sonnet, 128k for GPT-4o, 1M for Gemini. The budget gets spent on three things:
- System prompt, fixed per application, usually a few hundred to a few thousand tokens.
- User and assistant turns up to this point, grows linearly with conversation length.
- The next answer, bounded by max_tokens on the request.
The arithmetic is system + history + max_tokens ≤ context_window. You can plan against this. You can predict the cost. You can decide ahead of time whether a long document fits or has to be summarized. The whole "context window engineering" practice that emerged through 2023–24 was built on this model.
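As a concrete version of that arithmetic, here's a minimal budget-check sketch. The window size and token counts are illustrative assumptions; real code would measure them with a tokenizer.

```python
# Minimal sketch of the pre-reasoning budget: system + history + max_tokens <= window.
# The window size and token counts below are illustrative, not measured.
CONTEXT_WINDOW = 200_000  # advertised window for the model in use

def fits(system_tokens: int, history_tokens: int, max_tokens: int,
         context_window: int = CONTEXT_WINDOW) -> bool:
    return system_tokens + history_tokens + max_tokens <= context_window

# Planning ahead: a 120k-token document, a small system prompt, a 4k answer.
print(fits(system_tokens=1_500, history_tokens=120_000, max_tokens=4_000))  # True
```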
Reasoning models don't fit it cleanly.
What changed
When you enable extended thinking (or hit a reasoning model like o1 or R1), the model produces a block of internal reasoning tokens before producing the user-visible answer. Those reasoning tokens are output tokens for billing and (crucially) for context-budget purposes. They occupy real space in the window, they count against max_tokens, and they stay in the conversation history if the API surfaces them (Claude 3.7 does; OpenAI's o-series hides them, but the budget is still spent).
The arithmetic now reads system + history + reasoning + answer ≤ context_window, and reasoning is the variable you have the least control over. The model decides how much it wants to think. You set the ceiling; the model picks somewhere underneath it. For an easy question it might spend 200 tokens reasoning. For a hard one it can spend tens of thousands.
Those numbers are schematic (the actual reasoning-token spend on a hard problem can be larger or smaller), but the pattern holds. A request that previously left ~50k of headroom can land you uncomfortably close to the wall once thinking is on. And the failure mode at the wall is not always graceful: depending on the model and the API, the answer can get truncated, the reasoning can get truncated, or the request can fail outright.
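The same sketch with the extra term, treating the thinking ceiling as spent in full, since for planning purposes you have to assume the model can use all of it:

```python
def fits_with_thinking(system_tokens: int, history_tokens: int,
                       thinking_budget: int, answer_budget: int,
                       context_window: int = 200_000) -> bool:
    # Reasoning-era arithmetic: system + history + reasoning + answer <= window.
    # thinking_budget is a ceiling you set; the model spends somewhere under it,
    # but you have to plan as if it can spend all of it.
    return (system_tokens + history_tokens
            + thinking_budget + answer_budget) <= context_window

# The same 120k-token document no longer fits once a generous thinking ceiling is added.
print(fits_with_thinking(1_500, 120_000,
                         thinking_budget=80_000, answer_budget=4_000))  # False
```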
What to actually plan against
A few practical changes to how you size requests once reasoning is in play:
Reserve real budget for thinking, separately from max_tokens for the answer. The naive "set max_tokens to whatever the longest answer might need" doesn't account for the reasoning prefix. A useful starting heuristic: if the task is reasoning-shaped, reserve 2–3× the expected answer length for thinking, on top of the answer budget. Tune from there based on the actual telemetry.
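A sketch of that heuristic against the Anthropic Messages API's extended-thinking parameters. The model alias and the exact parameter shape (thinking.budget_tokens, with max_tokens covering both thinking and answer) are assumptions to verify against the current docs.

```python
import anthropic

client = anthropic.Anthropic()

def size_request(expected_answer_tokens: int, multiplier: float = 2.5):
    """Reserve 2-3x the expected answer length for thinking, on top of the answer budget."""
    thinking_budget = int(expected_answer_tokens * multiplier)
    max_tokens = thinking_budget + expected_answer_tokens  # ceiling for thinking + answer
    return thinking_budget, max_tokens

thinking_budget, max_tokens = size_request(expected_answer_tokens=4_000)

response = client.messages.create(
    model="claude-3-7-sonnet-latest",  # assumed model alias
    max_tokens=max_tokens,
    thinking={"type": "enabled", "budget_tokens": thinking_budget},
    messages=[{"role": "user", "content": "A reasoning-shaped question."}],
)
```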
Don't carry reasoning tokens forward in conversation history if you don't have to. Claude 3.7 surfaces the reasoning blocks in API responses; if you're storing the full conversation server-side, you can choose to strip them before sending the next turn. The model doesn't need its own previous reasoning to inform the next turn, only the visible answers.
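A sketch of the stripping step, assuming the content-block shape Claude 3.7 returns (assistant content as a list of blocks with a type field). Worth verifying against the current docs before stripping unconditionally, since some tool-use flows expect recent thinking blocks to be passed back.

```python
def strip_thinking(messages: list[dict]) -> list[dict]:
    # Drop thinking blocks from stored assistant turns before sending the next request,
    # so previous reasoning doesn't keep eating the window.
    cleaned = []
    for msg in messages:
        content = msg.get("content")
        if msg.get("role") == "assistant" and isinstance(content, list):
            content = [block for block in content
                       if block.get("type") not in ("thinking", "redacted_thinking")]
        cleaned.append({**msg, "content": content})
    return cleaned
```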
Treat extended thinking as an opt-in for hard turns. If you're routing every turn through a reasoning model by default, you're paying the budget cost on every turn, including the ones that don't need it. Claude 3.7's hybrid design makes thinking toggleable precisely so you can make this decision per request. Use it.
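A sketch of that per-request decision, reusing the assumed parameter names from above and a hypothetical is_hard flag supplied by whatever routing logic you already have:

```python
def build_request(prompt: str, is_hard: bool, answer_budget: int = 2_000) -> dict:
    # Only pay the thinking budget on turns that need it.
    request = {
        "model": "claude-3-7-sonnet-latest",  # assumed model alias
        "max_tokens": answer_budget,
        "messages": [{"role": "user", "content": prompt}],
    }
    if is_hard:
        thinking_budget = 3 * answer_budget  # the 2-3x heuristic from above
        request["max_tokens"] = answer_budget + thinking_budget
        request["thinking"] = {"type": "enabled", "budget_tokens": thinking_budget}
    return request
```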
Watch for the silent ceiling on reasoning effort. Some implementations cap how much thinking the model is allowed to do even if your max_tokens would permit more, and the cap is sometimes invisible from the API. If capability drops on a hard problem with thinking enabled, the answer is sometimes that the budget the model actually used for thinking was lower than the budget you thought you'd allocated.
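One way to catch that in telemetry, assuming a provider that reports reasoning tokens separately in the usage payload (OpenAI's o-series exposes usage.completion_tokens_details.reasoning_tokens; field names vary by provider):

```python
def thinking_utilization(usage, allocated_budget: int) -> float:
    # Fraction of the allocated thinking budget the model actually used on this request.
    details = getattr(usage, "completion_tokens_details", None)
    used = getattr(details, "reasoning_tokens", 0) if details else 0
    return used / allocated_budget if allocated_budget else 0.0

# If utilization sits far below 1.0 on problems that clearly warrant more thinking,
# suspect an invisible cap below the budget you thought you'd allocated.
```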
The case where this matters most
The combination of large context window + reasoning enabled is where you actually run out of room. For a 200k-token document analysis task with extended thinking enabled and a long expected answer, you can plausibly spend 130k on the input, 40–80k on reasoning, and have the answer truncate at the wall. Same request without thinking enabled fits comfortably; with thinking it doesn't.
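Running those numbers against a 200k window (the system-prompt size is an assumption):

```python
window = 200_000
system = 1_500      # assumed small system prompt
document = 130_000  # the input
thinking = 60_000   # mid-range of the 40-80k reasoning spend
headroom = window - (system + document + thinking)
print(headroom)     # ~8,500 tokens left for the answer; a long answer truncates here
```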
That's a category of failure that didn't exist as a planning concern in 2024. The advertised context-window number is now an upper bound that assumes you're not also asking the model to think. Once you are, the effective window is smaller (sometimes substantially smaller), and how much smaller depends on the problem you handed it.
Context is still the new RAM, in the sense that running out of it is still the boundary that matters. The arithmetic for staying inside it just got an extra term.