Spend-by-conversation: tracking AI cost without going crazy

The right unit for AI cost tracking isn't tokens, isn't requests, isn't users. It's conversations. The cost story makes sense when you measure it that way and gets impossible when you don't.

The default unit for AI cost tracking is per-token, because that's what the vendors bill you in. The default unit for AI cost reporting in public guidance is per-month, because that's what shows up on the invoice. Neither is the unit that makes the spend story actionable.

The unit that does is the conversation. After a few quarters of doing FinOps thinking on AI workloads in my own stack and across the public modeling I read, here's the pattern that keeps emerging: conversation is the natural atomic unit of cost, the smallest unit where the spend is meaningful enough to act on, the largest unit where the spend is still attributable to a specific shape of work.

[Figure: cost-unit-comparison. A single conversation costing $1.34 total, broken down into a search tool call ($0.42), a fetch tool call ($0.28), and a plain message ($0.64), each rolled up from its own input and output token counts. Caption: Cost rolls up. Track tokens, report dollars per conversation. That's the unit your finance person can act on.]

Worth being plain about why that is, and about what tracking-by-conversation actually requires.

Why per-token doesn't work

Per-token cost is real for billing. It's useless for management. The reason is that per-token volume varies wildly by workload type, and adding tokens across workload types washes out the signal you need. A team's monthly token spend tells you nothing about whether they're using the AI well: they could be doing a lot of low-value work cheaply or a little high-value work expensively, and the token total looks the same either way.

The other failure mode of per-token: it doesn't account for the retry, reasoning, and tool-call multipliers that compound real cost. A "200-token query" can be a 200-token cost or a 2000-token cost depending on how many turns of agentic work it spawned. The per-token unit mashes very different workloads together.

Why per-request doesn't work

Slightly better than per-token, still wrong. A request in an agentic workflow is one of many turns; the meaningful unit is the user-facing query that triggered the workflow, not the individual model call. Tracking per-request makes the agentic-vs-non-agentic distinction invisible, which is the single most important cost dimension in 2025 AI workloads.

Why per-user doesn't work

Better grouping level for management; still the wrong unit. Users do different kinds of work in different conversations. A user who runs ten "quick question" conversations and one big "help me think through this complex thing" conversation has very different costs from a user who runs eleven of the same medium-complexity conversation. Per-user spend flattens the workload diversity within a single user's activity.

Why per-conversation works

The conversation is the right unit because:

  • It's the unit the user understands. "I had a conversation with the AI about X" is the user's natural mental model. The cost of that conversation is the cost they intuitively attribute to the work.
  • It scopes the agentic multipliers. All the turns within a conversation share a goal; adding their cost gives you the actual cost of accomplishing that goal.
  • It's actionable. A conversation that cost $0.03 is fine; a conversation that cost $4.20 is worth investigating. The unit gives you a meaningful signal at a meaningful threshold.
  • It's portable across providers. Whether the conversation ran on Claude or GPT-5 or Gemini, the conversation-level cost is comparable. Per-token costs are not directly comparable across providers because the token economics differ.

The pattern that emerges: track every conversation, attribute costs to it, surface the cost at conversation-end. The metrics layer then aggregates conversations into useful slices (by user, by team, by use case, by model).

What it takes to instrument

The instrumentation isn't complicated; it's not free either.

Conversation IDs. Every conversation needs a stable identifier. The user-facing UI usually has one already; the backend needs to thread it through every model call. Most platforms expose a conversation_id or thread_id; if your platform doesn't, you have to add it.
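
If your platform doesn't expose one, the mechanics of minting and propagating an id are small. A minimal sketch in Python, using an ambient context variable so the id doesn't have to be threaded through every function signature by hand (the function names are illustrative, not any particular SDK's API):

```python
import uuid
from contextvars import ContextVar

# Ambient conversation id, visible to any code running inside the conversation's scope.
_conversation_id: ContextVar[str] = ContextVar("conversation_id")

def start_conversation() -> str:
    """Mint a stable id when the conversation starts; every model call reuses it."""
    cid = str(uuid.uuid4())
    _conversation_id.set(cid)
    return cid

def current_conversation_id() -> str:
    # A model-call wrapper reads this instead of requiring callers to pass the id by hand.
    return _conversation_id.get()

if __name__ == "__main__":
    start_conversation()
    print(current_conversation_id())  # same id for every call made in this conversation
```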

Per-call cost capture. For each model call within a conversation, capture: model used, input token count (with retrieval context), output token count (visible plus reasoning), tools called and their costs. Store these against the conversation_id.
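
A minimal shape for that per-call record, with an illustrative price table (the model names and per-million-token rates below are placeholders, not anyone's current list prices):

```python
from dataclasses import dataclass

# Illustrative per-million-token rates; substitute your providers' actual rate card.
PRICE_PER_MTOK = {
    "small-model": {"input": 0.25, "output": 1.25},
    "large-model": {"input": 3.00, "output": 15.00},
}

@dataclass
class CallCost:
    conversation_id: str
    model: str
    input_tokens: int           # includes retrieval context
    output_tokens: int          # visible output plus reasoning tokens
    tool_cost_usd: float = 0.0  # metered tool calls (search, fetch, ...)

    @property
    def dollars(self) -> float:
        rates = PRICE_PER_MTOK[self.model]
        return (self.input_tokens * rates["input"]
                + self.output_tokens * rates["output"]) / 1_000_000 + self.tool_cost_usd
```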

Roll-up at conversation end. When the conversation ends (explicit close, timeout, or user-driven), sum the per-call costs. That's the conversation cost. Persist it.
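
The roll-up itself is just a sum over whatever per-call dollar figures share an id; a sketch, with the records reduced to (conversation_id, dollars) pairs:

```python
from collections import defaultdict
from typing import Iterable, Tuple

def roll_up(call_costs: Iterable[Tuple[str, float]]) -> dict[str, float]:
    """Sum per-call dollar costs into one total per conversation_id."""
    totals: dict[str, float] = defaultdict(float)
    for conversation_id, dollars in call_costs:
        totals[conversation_id] += dollars
    return dict(totals)

# Three calls across two conversations: conv-a rolls up to ~$0.70, conv-b to $0.64.
print(roll_up([("conv-a", 0.42), ("conv-a", 0.28), ("conv-b", 0.64)]))
```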

Tagging. Each conversation gets metadata: user, team, use case, model preference, agentic vs non-agentic. The tags are what make the totals useful.
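
A minimal shape for those tags; the field names here are illustrative, and the right set is whatever dimensions your org actually slices by:

```python
from typing import TypedDict

class ConversationTags(TypedDict):
    user: str
    team: str
    use_case: str          # e.g. "support-triage", "code-review"
    model_preference: str  # the model family the conversation was routed to
    agentic: bool          # did the conversation run tool-calling loops?

example: ConversationTags = {
    "user": "alice",
    "team": "platform",
    "use_case": "support-triage",
    "model_preference": "small-model",
    "agentic": False,
}
```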

Reporting surface. A dashboard or report that surfaces conversation costs in the dimensions the org cares about. P50, P95, P99 of conversation cost. Cost-per-conversation by model. Cost-per-conversation by use case. Outliers. Trends.
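
The percentile math is standard-library territory once the per-conversation totals exist; a sketch with made-up numbers:

```python
import statistics

def cost_percentiles(conversation_costs: list[float]) -> dict[str, float]:
    """P50 / P95 / P99 of conversation cost: the three numbers worth putting on a dashboard."""
    cuts = statistics.quantiles(conversation_costs, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

# Made-up distribution: lots of cheap conversations and a long expensive tail.
costs = [0.03] * 90 + [0.40] * 8 + [4.20, 11.00]
print(cost_percentiles(costs))
```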

That's the basic instrumentation. The work to set it up is real but bounded: a couple of weeks for a small org, longer for a larger one with more model surfaces.

What the data tells you

Once you have conversation-level cost data, a few patterns reliably show up.

The long-tail problem. Most conversations cost a few cents. A small fraction of conversations cost tens of dollars. The long tail dominates the bill. Optimization should focus on the long tail, not the median. The median conversation isn't a problem; the P99 is.

The agentic multiplier. Conversations that engage agentic loops cost 10-50× as much as those that don't. The data makes this visible immediately. Whether that multiplier is justified is a per-use-case decision.

The model-fit signal. Conversations on Opus that could have run on Sonnet (or vice versa) jump out. The per-conversation cost makes the model-routing decision auditable.

The user-pattern signal. Some users use the AI well (low cost per useful outcome); some use it badly (high cost, low outcome). The per-user roll-up shows you who needs which kind of training.

The use-case profitability signal. Some use cases are obviously paying back. Some look like they should be paying back but don't. The per-conversation data lets you separate the productive use cases from the cargo-cult ones.

What I do in my own setup

I run per-conversation cost tracking in my home setup, even though it's single-user and the hosted-API spend is in the low hundreds of dollars per month:

  • Conversation IDs threaded through every call. Both the local OpenAI-compatible endpoint and the hosted-API calls.
  • Per-call cost in a small Postgres table on the Synology, indexed by conversation_id (a sketch of the schema follows this list).
  • A small Grafana dashboard that shows conversation-cost histogram, top-10 expensive conversations of the week, model-mix per conversation.
  • A weekly review of the long-tail conversations to see if they were worth it. Sometimes yes; sometimes no; the reflection itself improves my routing decisions.
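
For the curious, a sketch of what that per-call table looks like; sqlite3 stands in for Postgres here so the example is self-contained, and the columns are just the fields the instrumentation section captures:

```python
import sqlite3

# Stand-in for the Postgres table on the Synology: one row per model call,
# indexed by conversation_id so roll-ups and tag slices stay cheap.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE call_costs (
    id              INTEGER PRIMARY KEY,
    conversation_id TEXT    NOT NULL,
    called_at       TEXT    NOT NULL,   -- ISO-8601 timestamp
    model           TEXT    NOT NULL,
    input_tokens    INTEGER NOT NULL,
    output_tokens   INTEGER NOT NULL,
    tool_cost_usd   REAL    NOT NULL DEFAULT 0,
    dollars         REAL    NOT NULL,
    user_id         TEXT,
    use_case        TEXT,
    agentic         INTEGER NOT NULL DEFAULT 0
);
CREATE INDEX idx_call_costs_conversation ON call_costs (conversation_id);
""")

# The weekly "top-10 expensive conversations" panel is one GROUP BY away.
top10 = conn.execute("""
    SELECT conversation_id, SUM(dollars) AS total
    FROM call_costs
    GROUP BY conversation_id
    ORDER BY total DESC
    LIMIT 10
""").fetchall()
```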

The setup is overkill for a single user. It's also useful. I caught two routing decisions that were worth reversing within the first month of using it, and the discipline of looking at the cost data weekly has improved my use of the tools generally.

The bigger frame

The AI cost conversation is in a similar place to the cloud cost conversation in 2014. The instrumentation is immature, the standards are unsettled, the people writing about doing it well are the ones who built their own. The pattern that holds: spend visibility precedes spend control. You can't optimize what you can't see, and the per-token / per-month / per-user views are too granular or too coarse to be useful.

Conversation is the right unit. The orgs that build their cost story around it have the actionable data. The orgs that don't end up surprised by their AI bill in the same way orgs were surprised by their cloud bill a decade ago. The lesson carries over; the unit specifics are different.

The costs buried in SaaS-AI bundling hide the same way: invisible at the per-token level, obvious at the per-conversation level. The invisibility is the problem. The fix is the unit change.

Tracking AI cost by conversation is the small investment that pays back fastest in any meaningful AI deployment. Worth doing now rather than after the bill becomes a problem.