Atomic-unit architecture for AI workloads (how I think about it)
The atomic unit of an AI workload isn't the model call, isn't the request, isn't the user. It's the conversation. The architectural decisions that follow from that (caching, billing, governance, ops) all get cleaner when you start there.
Most teams pick an atomic unit implicitly (the model call, the API request, the user session), and the choice drives downstream design in ways the team doesn't notice. The choice that holds up across the decisions that actually matter, based on public reporting and the testing I've done on my own stack, is the conversation. Not because conversations are obvious; because everything else falls into place once you start there.
Worth being explicit about why and what follows from it.
The competing candidates
The reasonable atomic units for an AI workload, in increasing order of scope:
The token. The smallest unit the model handles. The unit the vendors bill in. The unit most cost models start with.
The model call. A single round trip to the model. One request, one response. The unit the inference layer thinks in.
The request. A user-initiated query that may spawn multiple model calls (in agentic workflows). The unit the application layer often defaults to.
The conversation. A session of related requests with shared context. The unit the user mentally tracks.
The user. All conversations by a particular person. The unit the org thinks in for governance.
The workload. All conversations of a particular type, across users. The unit the FinOps and reliability teams care about.
Each of these is a real abstraction. Architectures differ based on which one is treated as the atomic unit: the level at which you cache, bill, audit, scope, monitor, and reason about behavior.
Why conversation wins
Five concrete reasons the conversation is the right atomic unit for most production AI workloads:
It matches the user’s mental model. When the user thinks about “the work I did with the AI today,” they think in conversations. “The conversation about the contract.” “The conversation about debugging the test.” The unit aligns with the mental model; the architectural decisions made at this level make sense to the user.
It captures the agentic multipliers cleanly. Per-conversation cost tracking handles the 10-50× multiplier that agentic loops introduce. Per-token or per-call tracking obscures it; per-conversation surfaces it.
It scopes context cleanly. The 70/30 prompt-vs-context ratio operates at the conversation level: the context varies per turn within a conversation, while the system prompt and persona apply across the whole conversation. Per-call thinking misses this; per-user thinking is too coarse.
It’s the right granularity for caching. Within a conversation, prompt caching pays back consistently because the prompt structure is stable. Across conversations, caching is more variable. Per-conversation cache scoping captures most of the gain without the complexity of cross-conversation cache management.
It’s the right granularity for governance. The agent design patterns (planner-executor, tool-scoped subagents, human-in-the-loop checkpoints) all operate within a conversation. Per-conversation policy decisions are coherent; per-call policy is fragmented; per-user policy is too coarse.
That’s the case for treating conversation as the atomic unit. The downstream architectural choices follow.
What follows architecturally
If conversation is the atomic unit, the architecture takes a particular shape:
Conversation IDs are first-class. Every request carries one, every model call inherits it, every log line includes it, and every cost record attributes to it. The infrastructure threads conversation IDs through the entire stack without dropping them.
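A minimal sketch of what first-class conversation IDs can look like in a Python service. The names (`start_conversation`, `model_call`) are illustrative, not any particular framework's API; the point is that the ID is bound once and everything downstream inherits it.

```python
import contextvars
import uuid

# A context variable carries the conversation ID across function calls
# without threading it through every signature by hand.
conversation_id = contextvars.ContextVar("conversation_id", default=None)

def start_conversation() -> str:
    """Generate a conversation ID at session start and bind it."""
    cid = str(uuid.uuid4())
    conversation_id.set(cid)
    return cid

def log(message: str) -> None:
    """Every log line includes the conversation ID."""
    print(f"[conv={conversation_id.get()}] {message}")

def model_call(prompt: str) -> dict:
    """Every model-call record inherits the conversation ID."""
    log(f"model call: {prompt[:40]}")
    return {"conversation_id": conversation_id.get(), "prompt": prompt}

cid = start_conversation()
record = model_call("Summarize the contract")
assert record["conversation_id"] == cid
```

The same pattern extends to cost records and audit events: anything emitted while the context variable is bound picks up the ID for free.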
State scopes to the conversation. The conversation's state (accumulated context, working memory, intermediate results, tool-call history) lives at conversation scope. Not per call (which loses continuity), not per user (which conflates separate conversations). The state lifecycle is bounded: conversation starts, accumulates, ends, garbage-collects.
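The bounded lifecycle above can be sketched as a small store. This is a hedged illustration, not a specific framework's API; the field names are made up.

```python
from dataclasses import dataclass, field

@dataclass
class ConversationState:
    conversation_id: str
    context: list = field(default_factory=list)     # accumulated context
    tool_calls: list = field(default_factory=list)  # tool-call history

class ConversationStore:
    """Lifecycle: start -> accumulate -> end (garbage-collect)."""

    def __init__(self):
        self._states = {}

    def start(self, cid: str) -> ConversationState:
        self._states[cid] = ConversationState(conversation_id=cid)
        return self._states[cid]

    def get(self, cid: str) -> ConversationState:
        return self._states[cid]

    def end(self, cid: str) -> None:
        # Ending the conversation bounds the state lifecycle:
        # nothing leaks into per-user or global scope.
        del self._states[cid]

store = ConversationStore()
state = store.start("c-1")
state.context.append({"role": "user", "content": "hello"})
store.end("c-1")
```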
Caching keys include the conversation context. Prompt caching, retrieval caching, model-output caching all key on the conversation context where appropriate. The cache hit rate goes up because the conversation context is the natural unit of cache locality.
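One way to key a cache on conversation context, assuming a content-hash scheme; `cache_key` and its parameters are hypothetical, and a real prompt cache would key on the stable prefix the vendor's caching mechanism expects.

```python
import hashlib

def cache_key(conversation_id: str, system_prompt: str, context_prefix: str) -> str:
    """Key on the stable conversation prefix so hits accrue within a
    conversation, where the prompt structure repeats turn over turn."""
    payload = f"{conversation_id}|{system_prompt}|{context_prefix}"
    return hashlib.sha256(payload.encode()).hexdigest()

# Within a conversation the prefix repeats, so the key repeats: a hit.
k1 = cache_key("c-1", "You are a contract assistant.", "turn-1 context")
k2 = cache_key("c-1", "You are a contract assistant.", "turn-1 context")
assert k1 == k2
```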
Cost attribution rolls up by conversation. Per-call costs get tagged with the conversation ID and aggregated into a per-conversation cost. Reporting rolls conversation costs up to user, team, workload. The aggregation hierarchy is well-defined; the attribution is unambiguous.
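The rollup hierarchy is simple once the IDs are in place. A sketch with made-up call records; the record shape is illustrative.

```python
from collections import defaultdict

# Per-call records, each already tagged with a conversation ID and user.
calls = [
    {"conversation_id": "c-1", "user": "ana", "cost_usd": 0.012},
    {"conversation_id": "c-1", "user": "ana", "cost_usd": 0.031},
    {"conversation_id": "c-2", "user": "ben", "cost_usd": 0.008},
]

# Calls roll up to conversations.
per_conversation = defaultdict(float)
conv_user = {}
for call in calls:
    per_conversation[call["conversation_id"]] += call["cost_usd"]
    conv_user[call["conversation_id"]] = call["user"]

# Conversations roll up to users (and, the same way, to teams or workloads).
per_user = defaultdict(float)
for cid, cost in per_conversation.items():
    per_user[conv_user[cid]] += cost
```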
Audit trails span the conversation. When something goes wrong, the audit trail per conversation is the queryable unit. Not per call (which fragments the story); not per user (which conflates separate conversations). The conversation is the natural narrative unit.
Policy decisions cache per conversation. Some policy decisions are conversation-stable (this user is or isn’t allowed to use this kind of agent). Caching those at conversation scope avoids re-evaluating per call without losing per-call enforcement of the dynamic decisions.
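A sketch of caching a conversation-stable decision, assuming a hypothetical `is_agent_allowed` check that is expensive to evaluate; since the user is fixed for the life of a conversation, keying on (conversation, agent kind) is enough.

```python
policy_cache = {}

def is_agent_allowed(user: str, agent_kind: str) -> bool:
    # Stand-in for a real policy-engine call (the expensive part).
    return agent_kind != "autonomous-payments"

def check_policy(conversation_id: str, user: str, agent_kind: str) -> bool:
    """Evaluate once per conversation; reuse for subsequent calls.
    Dynamic, per-call decisions would bypass this cache entirely."""
    key = (conversation_id, agent_kind)
    if key not in policy_cache:
        policy_cache[key] = is_agent_allowed(user, agent_kind)
    return policy_cache[key]

assert check_policy("c-1", "ana", "research") is True
```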
Observability metrics report per conversation. P50, P95, P99 of conversation cost. Conversation completion rate. Conversation latency. Conversation error rate. The metrics are actionable at this granularity in a way per-call metrics aren’t.
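Computing those percentiles over per-conversation costs is a one-liner with the standard library; the cost numbers below are made up for illustration.

```python
import statistics

# Total cost per conversation, in USD (illustrative values).
conversation_costs = [0.04, 0.11, 0.09, 0.52, 0.07, 0.15, 0.08, 1.90]

# quantiles(n=100) returns 99 percentile cut points; index 49 is ~P50.
q = statistics.quantiles(conversation_costs, n=100)
p50, p95, p99 = q[49], q[94], q[98]
```

The P95/P99 tail is where the agentic cost multipliers show up; per-call percentiles hide exactly that signal.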
These are all design choices. Each one is independently small. Combined, they form an architecture that lines up with how the workload actually behaves.
What changes if you pick a different unit
For comparison, what happens with the alternative atomic units:
Per-token architecture. Optimal for vendor billing but useless for actual decisions. You can’t act on “this user spent 4.2M tokens this month” without translating to a higher unit. Architectures that treat tokens as the atomic unit end up rebuilding the higher abstractions ad-hoc.
Per-call architecture. What most platforms default to. Loses the conversation continuity. Caching is harder. Cost attribution is fragmented. Works for non-agentic workloads; falls over for agentic ones.
Per-request architecture. Better than per-call, but it still misses the multi-turn conversation case. Each user-initiated request is treated independently even when the requests are clearly related: the AI doesn't get the benefit of prior context, the user has to repeat themselves, and the cost accounting misses the savings that conversation-level caching would provide.
Per-user architecture. Right for some kinds of governance and analytics; too coarse for operational decisions. A user’s behavior combines conversations of very different shapes; treating them as one bucket loses the signal that makes operational decisions tractable.
Per-workload architecture. Right for FinOps reporting and capacity planning; too coarse for individual operational decisions. A useful aggregation level, but not the atomic unit.
The conversation hits the right balance. Small enough to be meaningful per instance; large enough to capture the relevant continuity.
What the conversation isn’t
Worth being explicit about cases where conversation is the wrong atomic unit:
Pure batch workloads. When you’re processing millions of items in parallel and there’s no conversational structure, the conversation abstraction is a poor fit. Use per-job or per-batch as the atomic unit. The conversation pattern is for interactive and agentic workloads.
Stateless API gateways. When the system is simply forwarding model calls without any session state, there’s no conversation to track. The atomic unit is the call. Adding conversation tracking to a stateless gateway is overhead for no benefit.
Embedding-only services. When you’re computing embeddings on documents and not maintaining session state, conversation isn’t the right unit. Per-document or per-batch is.
These are the exceptions. Most user-facing AI workloads aren’t these.
What I do in my own setup
The home AI setup I’ve described follows this architecture in a small way:
- Every interaction with the assistant carries a conversation ID generated at session start.
- The local Postgres on the Synology indexes per-call records by conversation ID.
- The Grafana dashboard rolls per-call costs to per-conversation, then to per-day.
- Audit logs are queryable per conversation, which is what I actually want when something looks off.
- Memory hygiene operates at conversation scope: the periodic resets, the explicit memory commits, and the persona reaffirmations all live within a conversation.
The setup is small. The architecture is the same shape it would be at production scale. The conversation-as-atomic-unit choice scales up cleanly because it was the right choice at the small scale.
The pattern in summary
Pick the right atomic unit. The conversation is right for most user-facing AI workloads. The architectural decisions that follow (caching, billing, governance, observability) all get cleaner when you start there. The teams that pick a different atomic unit end up rebuilding conversation as a higher-level abstraction; the teams that start there avoid the rebuild.
Worth being deliberate about this choice early. The downstream cost of changing the atomic unit later is meaningful; the cost of getting it right at the start is essentially zero.