Cost-modeling AI workloads with FinOps eyes
The per-token price is the easy line. Everything else (the retries, the context overhead, the agentic tool calls, the egress, the GPU reservation underneath the API) is where the actual bill comes from.
The per-token price of a hosted model is the easy line on the cost model. It’s also the line that gets quoted in every vendor pitch, every comparison post, every back-of-envelope estimate. It is not, in any FinOps modeling I’ve done, the dominant line on the bill.
The actual bill is built from a stack of cost dimensions that most cost models don’t capture. Worth being explicit about what an honest cost model looks like in 2025, because the planning math at the top doesn’t survive contact with the operational reality at the bottom.
[Figure: FinOps cost waterfall]
The cost dimensions that actually matter
Six layers, in roughly the order they show up on the bill (a sketch of how they combine follows the list):
1. Raw model inference (per-token). The headline price. $2/$8 per million for GPT-4.1; $3/$15 for Claude 3.7 Sonnet; $0.27/$1.10 for hosted DeepSeek V3; and so on. Easy to model from the user-facing volume. This is where most cost estimates start and stop.
2. Reasoning-token overhead. Reasoning models burn tokens before producing output. A typical extended-thinking turn might spend 5-10× the visible-output tokens on reasoning that the user never sees but the bill includes. For workloads where reasoning is on by default, this is a real multiplier on the per-token cost.
3. Retries and validation passes. Production systems retry. They retry on rate-limit errors, on timeout, on schema-validation failures, on the model’s first attempt being wrong. They do parallel sampling for ensemble validation. Each of these multiplies the per-call token spend by some factor that varies by workload, typically 1.2× to 2× for well-tuned systems, 3× or more for systems that haven’t been tuned.
4. Context overhead. The same prompt with retrieved context is dramatically more expensive than the same prompt without. RAG pipelines retrieve 5-50 chunks per request; each chunk is a few hundred to a few thousand tokens; system prompts add their own overhead. The user sees a one-line query and an answer; the model sees thousands of input tokens. The cost model needs to count the actually-sent input volume, not the user-typed volume.
5. Agentic tool-call multiplication. An agent that takes ten tool-call turns to complete a request is paying for ten model calls, each carrying the cumulative conversation context. The per-user-request cost on agentic workloads can be 10-50× the cost of a single non-agentic completion. This is the dimension that breaks naive cost models worst; the agentic-vs-non-agentic distinction is the largest cost multiplier in the stack right now.
6. Platform overhead (network, storage, observability). Egress charges on cloud providers when responses leave the perimeter. Storage costs for conversation history, RAG indexes, audit logs. Observability infrastructure: logging prompts and responses for debugging is itself a measurable cost at scale. None of these are the model bill; all of them are part of the AI workload bill.
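A minimal sketch of how these dimensions combine into a per-request number, in Python. The structure and field names here are illustrative, not any vendor's billing API; context overhead (4) and tool-call multiplication (5) show up indirectly, as the actually-sent, cumulative input token count.

```python
from dataclasses import dataclass

@dataclass
class RequestCostInputs:
    """Illustrative per-request inputs covering the six dimensions above."""
    input_tokens: int             # actually-sent input, cumulative across agent turns (incl. RAG context)
    output_tokens: int            # visible output tokens
    reasoning_tokens: int         # hidden reasoning tokens, billed at the output rate
    retry_multiplier: float       # ~1.2-2x for tuned systems, 3x or more for untuned ones
    platform_overhead_usd: float  # storage, egress, observability, vector store, amortized per request

def per_request_cost(req: RequestCostInputs,
                     input_price_per_mtok: float,
                     output_price_per_mtok: float) -> float:
    """Model spend (with retries) plus platform overhead, in dollars per user request."""
    model_cost = (req.input_tokens * input_price_per_mtok
                  + (req.output_tokens + req.reasoning_tokens) * output_price_per_mtok) / 1e6
    return model_cost * req.retry_multiplier + req.platform_overhead_usd
```

The deliberate choice here is that dimensions 4 and 5 don't get their own fields: the honest input number is what was actually sent across every turn, not what the user typed.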
A worked example
A concrete agentic AI workload: say, a customer-support agent that handles inquiries by calling tools to look up account state, search documentation, and execute account changes. Modeled at honest cost:
- Per user request: roughly 8 model turns (5 tool-using, 3 summarization/reasoning), 5k input tokens average per turn (cumulative context grows), 200 visible output tokens average per turn, and 500 reasoning tokens average on each of the 4 turns that use extended thinking.
- Token totals per request: 40k input, 1.6k output, 2k reasoning. Total billable: 40k input + 3.6k output (output + reasoning, both billed at output rate).
- At GPT-4.1 prices ($2/$8 per million): $0.080 input + $0.029 output = $0.109 per user request.
- At Claude 3.7 Sonnet prices ($3/$15): $0.120 input + $0.054 output = $0.174 per user request.
- At DeepSeek V3 prices ($0.27/$1.10): $0.011 input + $0.004 output = $0.015 per user request.
That’s the model-only cost. Add to it (tallied in the sketch after this list):
- Retry overhead (15% on average): ×1.15
- Vector store queries (5 retrievals per request, each costing roughly $0.0001): +$0.0005
- Storage for conversation logs and audit: roughly $0.001/request amortized
- Observability infrastructure: roughly $0.001/request amortized
- Egress on cloud-hosted variant: roughly $0.0005/request
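The worked example as a throwaway script, assuming the per-turn numbers above. Prices are the ones quoted earlier (USD per million tokens, input/output); the overhead constants are the rough per-request estimates from the list, not measured values.

```python
# Prices in $/1M tokens: (input, output)
PRICES = {
    "gpt-4.1": (2.00, 8.00),
    "claude-3.7-sonnet": (3.00, 15.00),
    "deepseek-v3": (0.27, 1.10),
}

INPUT_TOK = 40_000         # 8 turns x 5k average cumulative-context input tokens
BILLED_OUTPUT_TOK = 3_600  # 1.6k visible output + 2k reasoning, both at the output rate
RETRY_MULT = 1.15          # ~15% retry overhead
OVERHEAD = 0.0005 + 0.001 + 0.001 + 0.0005  # vector store + storage + observability + egress
MONTHLY_REQUESTS = 100_000

for model, (in_price, out_price) in PRICES.items():
    model_only = (INPUT_TOK * in_price + BILLED_OUTPUT_TOK * out_price) / 1e6
    all_in = model_only * RETRY_MULT + OVERHEAD
    print(f"{model}: model-only ${model_only:.3f}, all-in ${all_in:.3f}, "
          f"monthly ${all_in * MONTHLY_REQUESTS:,.0f}")
```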
The all-in cost per request lands around $0.13 on GPT-4.1, $0.20 on Claude 3.7 Sonnet, and $0.02 on DeepSeek V3. The model bill is 80-90% of the all-in for the closed-frontier options and a smaller fraction for the cheap-tier option (because the platform overhead is roughly fixed across model choices).
A workload of 100,000 user requests per month therefore lands at roughly $13K, $20K, or $2K per month depending on the model. Those are real numbers, and the gap between them is real. What most cost models miss is the shape of that stack: the retry, reasoning, and tool-call multipliers that aren’t visible on the headline pricing page.
What a useful cost model includes
A cost model that’s actually predictive needs at minimum (sketched as an input schema after the list):
- Input and output token volume per user request, measured rather than estimated, with retries and reasoning counted separately.
- Tool-call multiplier per request type. Different request shapes have different agentic depths.
- Context-overhead estimate based on actual RAG retrieval sizes, not theoretical maximums.
- Platform overhead for storage, egress, and observability. These are small per-request but compound at scale.
- Vendor-specific pricing nuances. Bedrock vs direct, regional pricing differences, prompt-caching discounts where they apply.
- Failure-mode costs: what happens when the model errors and the system retries, and what happens when the system falls back to a more expensive model on failure.
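One way to keep those inputs honest is to write them down as a schema the measurement pipeline has to fill in. A minimal sketch, with illustrative field names (nothing here is a standard FinOps artifact):

```python
from dataclasses import dataclass, field

@dataclass
class WorkloadCostInputs:
    """Illustrative input schema for a predictive AI-workload cost model."""
    # Measured per-request token volumes (not estimates), retries and reasoning counted separately
    input_tokens: float
    output_tokens: float
    reasoning_tokens: float
    retry_multiplier: float
    # Agentic depth by request type, e.g. {"account_change": 12, "faq": 2}
    turns_by_request_type: dict[str, float] = field(default_factory=dict)
    # Context overhead from actual RAG retrieval sizes, not theoretical maximums
    avg_retrieved_chunks: float = 0.0
    avg_tokens_per_chunk: float = 0.0
    # Platform overhead per request, in dollars
    storage_usd: float = 0.0
    egress_usd: float = 0.0
    observability_usd: float = 0.0
    # Vendor-specific nuances and failure modes
    prompt_cache_hit_rate: float = 0.0  # fraction of input billed at a discounted cached rate, where offered
    fallback_rate: float = 0.0          # fraction of requests escalated to a more expensive model
```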
The ones that get the least attention in public cost reporting: the agentic tool-call multiplier (which is huge) and the reasoning-token overhead (which is meaningful, and growing as more workloads default to reasoning-on).
What this means for planning
Three implications worth being concrete about:
The vendor with the lowest per-token price isn’t always the cheapest in production. Switching costs, integration overhead, and the per-vendor reliability story all matter. A model that’s 3× cheaper per token but fails 20% of the time on your workload can end up costing more than a “more expensive” model with 99% reliability, once the failures drive whole-trajectory retries and fallbacks to the pricier model.
Agentic systems need a fundamentally different cost lens than chat systems. The per-user-request cost on a multi-turn agent is dominated by the cumulative context across turns. Optimizing the input prompt size of any single turn is rounding error compared to optimizing the depth and shape of the agentic loop.
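A toy illustration of why loop depth dominates: if every turn resends the base context plus everything earlier turns appended, total input tokens grow roughly quadratically with turn count. The numbers below are assumptions for illustration, not measurements from any particular agent.

```python
def total_input_tokens(turns: int, base_context: int, added_per_turn: int) -> int:
    """Sum input tokens across an agent loop where each turn resends the base
    context plus everything appended by earlier turns."""
    return sum(base_context + t * added_per_turn for t in range(turns))

# Assumed: 2k-token system prompt + retrieval base, ~1k tokens appended per tool turn.
for turns in (4, 8, 16):
    print(turns, total_input_tokens(turns, base_context=2_000, added_per_turn=1_000))
# 4 -> 14,000; 8 -> 44,000; 16 -> 152,000. Doubling the depth roughly triples input spend,
# while shaving 10% off the base prompt saves only a few percent of the total at depth 8.
```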
Reasoning-on-by-default needs explicit budget control. Models with extended thinking enabled by default will spend reasoning tokens on every turn unless you explicitly turn it off for the turns that don’t need it. The cost overhead of “always on” can be 3-5× the cost of “on when needed.” For workloads where most turns don’t benefit from reasoning, the right architecture is per-turn routing rather than blanket enable.
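What per-turn routing can look like, as a sketch. It assumes an API that exposes a per-request toggle for extended thinking; the `thinking` parameter, the `client.complete` call, and the classifier are placeholders, not any specific vendor's interface.

```python
def needs_reasoning(turn: dict) -> bool:
    """Placeholder heuristic: enable extended thinking only for turns that plan or
    reconcile conflicting tool results; simple lookups and summaries skip it."""
    return turn.get("kind") in {"plan", "reconcile", "multi_step_math"}

def call_model(client, turn: dict):
    # Hypothetical client call; "thinking" stands in for whatever per-request
    # extended-thinking toggle the provider actually exposes.
    return client.complete(
        messages=turn["messages"],
        thinking="enabled" if needs_reasoning(turn) else "disabled",
    )
```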
The honest cost model doesn’t fit on the back of an envelope, and that’s the actual point. The headlines say AI inference is getting dramatically cheaper. That’s true at the per-token layer. The all-in cost of a production AI workload is moving more slowly than the headlines suggest because the layers above per-token are growing faster than the per-token layer is shrinking. Worth knowing the difference when you’re modeling spend.