A short defense of the boring middleware
The interesting work in any AI system lives in the model layer and the application layer. The model is the headline; the application is the user-facing value. Sitting between them (and rarely featured in any conversation about AI architecture) is the boring middleware. Auth and rate limits and retries and logging and request shaping and response normalization and the gateway that routes requests across vendors. The part that nobody puts in their architecture diagrams.
This is a short defense of that layer. The production AI systems that work well in 2025 are differentiated more by the boring middleware than by model choice or application logic, and the conversation among people who actually use this stuff hasn't caught up.
What the boring middleware does
In any production AI system that's matured past the prototype phase, there's a layer that handles:
Authentication and authorization. Who's allowed to make this request. With which credentials. Against which model. With what scope.
Rate limiting and quota enforcement. Per user, per team, per model, per workload. Both for cost control and for vendor-quota management.
Retry logic with proper backoff. Vendor APIs fail. Some failures are retryable; some aren't. Telling them apart correctly without hammering the upstream is non-trivial (a minimal sketch follows this list).
Request shaping. Normalizing the application's request into the shape each vendor expects. Adding the right headers. Setting the right model parameters. Handling the differences between Anthropic's request shape and OpenAI's and Google's.
Response normalization. Vendor responses have different shapes. The middleware reshapes them into a common form so the application doesn't have to know which vendor served the request (also sketched below).
Logging and audit. Every request, every response, every model call, conversation IDs threaded through, structured in a way that's queryable later for debugging, cost analysis, and security.
Failover and routing. When the primary vendor is degraded, the middleware fails over to the secondary. When a specific model is overloaded, it routes to the equivalent. The application doesn't see the routing logic.
Caching where appropriate. Prompt caching, embedding caching, response caching for cases where the inputs are repeatable.
Cost attribution. Per-call cost capture against the right tags for conversation-level cost tracking.
Schema validation on responses. When the application expects structured output, validating that the response matches the expected schema before passing it on, and triggering re-prompts or fallbacks when it doesn't.
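Two of those items are concrete enough to sketch. First, the retry logic. The status taxonomy below is a hypothetical stand-in (each vendor SDK exposes its own exception types), but the classify-then-backoff structure is the point:

```python
import random
import time

RETRYABLE_STATUSES = {429, 500, 502, 503, 529}  # typical; varies by vendor

def is_retryable(status_code: int) -> bool:
    """Hypothetical classifier: rate limits and transient server errors
    are worth retrying; auth and validation errors are not."""
    return status_code in RETRYABLE_STATUSES

def call_with_backoff(send, max_attempts=5, base_delay=0.5, max_delay=30.0):
    """Call `send()` (which returns (status_code, body)), retrying
    retryable failures with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        status, body = send()
        if status < 400:
            return body
        if not is_retryable(status) or attempt == max_attempts - 1:
            raise RuntimeError(f"upstream failed with {status}")
        # Exponential backoff with full jitter, capped, so a fleet of
        # clients doesn't hammer a recovering upstream in lockstep.
        delay = min(max_delay, base_delay * 2 ** attempt)
        time.sleep(random.uniform(0, delay))
```

Second, response normalization. The field paths here are illustrative approximations of the OpenAI- and Anthropic-style shapes; the pattern, one adapter per vendor producing a single internal type, is the whole idea:

```python
from dataclasses import dataclass

@dataclass
class NormalizedResponse:
    """The one shape the application ever sees."""
    text: str
    model: str
    input_tokens: int
    output_tokens: int

# Field paths below are illustrative; each adapter owns the real mapping
# for its vendor and is the only place in the system that knows it.
def from_openai_shape(raw: dict) -> NormalizedResponse:
    return NormalizedResponse(
        text=raw["choices"][0]["message"]["content"],
        model=raw["model"],
        input_tokens=raw["usage"]["prompt_tokens"],
        output_tokens=raw["usage"]["completion_tokens"],
    )

def from_anthropic_shape(raw: dict) -> NormalizedResponse:
    return NormalizedResponse(
        text=raw["content"][0]["text"],
        model=raw["model"],
        input_tokens=raw["usage"]["input_tokens"],
        output_tokens=raw["usage"]["output_tokens"],
    )
```

Because each adapter is the only code that knows its vendor's shape, the application layer stays simple no matter how many vendors sit behind the gateway.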
That's a non-trivial list. The systems that do this layer well end up reliable, observable, governable. Projects that skip it end up fragile and opaque.
Why this layer doesn't get the attention it deserves
Three reasons.
The vendors don't ship this for you. OpenAI sells you the model. Anthropic sells you the model. AWS sells you Bedrock with some middleware features baked in but not enough to skip building your own. The vendors' interest is in selling you the model; the middleware is your problem.
The middleware doesn't make for good demos. Nobody gives a keynote about their request-normalization layer. The interesting demos are about what the model does. The middleware is the unsexy reason the demo can be repeated reliably tomorrow.
The skill set is "platform engineering" not "AI engineering." The people who build great AI middleware are usually senior platform engineers who understand the model layer. The people who get hired into "AI engineer" roles are usually working on the application or the model side. The skill mismatch produces underbuilt middleware.
The result: the middleware exists, but it's frequently underbuilt or built ad-hoc rather than as a deliberate platform layer. The teams that recognize this and invest are differentiated; the teams that don't are stuck.
The differentiation that comes from this layer
A few things that mature middleware enables, and that its absence prevents:
Multi-vendor routing without application changes. When you can swap the primary vendor at the middleware layer, vendor lock-in becomes manageable rather than crippling. Teams without a routing middleware are committed to whoever they integrated with first.
Reliable agentic workflows. Agent design patterns like planner-executor, retry-with-reflection, and tool-scoped subagents all require a middleware that can express them cleanly. Without that middleware, the agent code grows tangled trying to handle the patterns inline.
Cost visibility. The conversation-level cost tracking I wrote about depends on middleware that captures per-call cost against conversation IDs. Without that capture happening at the middleware layer, the cost data is fragmented or absent.
Failure recovery. The runaway tool-call class of failures gets caught by middleware that notices max-iteration conditions, surfaces them to the human, and prevents the agent from spiraling. Without that layer, every application has to handle this individually, and most don't (a minimal guard is sketched after this list).
Audit and compliance. The audit-trail story for AI in regulated environments depends on middleware that logs every model interaction in queryable form. The audit story is essentially a middleware feature.
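The runaway-tool-call guard is small enough to sketch. Everything here is a hypothetical shape; the point is only that the loop cap lives in the middleware, not in each application:

```python
class IterationLimitExceeded(Exception):
    """Raised so a human (or a supervising layer) sees the spiral
    instead of the agent silently burning budget."""

def run_agent_loop(step, max_iterations=20):
    """`step()` returns (done, result). The middleware owns the cap;
    applications never re-implement it."""
    for _ in range(max_iterations):
        done, result = step()
        if done:
            return result
    raise IterationLimitExceeded(
        f"agent did not converge within {max_iterations} tool calls"
    )
```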
These differentiators all come from the layer that nobody's talking about. The teams that build them well have noticeably better AI systems; the difference compounds over time.
What the boring middleware looks like in practice
A typical mature middleware layer in 2025 is something like:
- A small service sitting between the application and the model vendors. Stateless or near-stateless. Horizontally scalable.
- A router that picks the right model/vendor based on the request shape, the configured policies, the current vendor health, and the workload tags.
- A normalizer that handles the request and response shape conversion.
- A retry handler with exponential backoff, telling retryable from non-retryable errors per vendor.
- A logger that captures structured records into a queryable store.
- A cost-attribution layer that tags each call with the conversation ID, user, team, workload type, and the resolved cost.
- A schema-validator for the cases where the application expects structured output.
- An auth layer hooked into the broader org auth.
- A configuration surface (usually YAML or similar) that lets ops change routing, quotas, and policies without code changes (see the sketch below).
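Sketching the router and the config surface together. The policy shape and field names are hypothetical, and a real deployment would load the YAML from a file rather than inline, but the division of labor is the point: ops edit the policy, not the code.

```python
import yaml  # pip install pyyaml

# A hypothetical routing policy: primary/fallback vendor per workload tag.
POLICY = yaml.safe_load("""
routes:
  chat:       {primary: anthropic, fallback: openai}
  embeddings: {primary: openai,    fallback: google}
""")

def pick_vendor(workload: str, healthy: set[str]) -> str:
    """Route to the workload's primary vendor unless health checks
    say otherwise; fall back before failing outright."""
    route = POLICY["routes"][workload]
    if route["primary"] in healthy:
        return route["primary"]
    if route["fallback"] in healthy:
        return route["fallback"]
    raise RuntimeError(f"no healthy vendor for workload {workload!r}")

# e.g. pick_vendor("chat", healthy={"openai"}) -> "openai"
```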
That's not a wild pile of features. It's a reasonable platform-engineering project. A small team can build this in a few weeks; the value compounds for years.
The build-vs-buy question
A few options exist commercially: LiteLLM and OpenLLM as open-source middleware bases, the various "AI gateway" SaaS products (Portkey, Helicone, Lakera, etc.), the cloud-vendor offerings (Bedrock has some of this; Vertex has some; OpenAI's own platform has bits of it). None of them is a complete answer; most of them get you 50-70% of the way there and you build the rest.
The honest pattern visible across the projects I follow and my own work: start with one of the open-source middleware bases (LiteLLM is the most mature for this purpose), extend it for the org-specific needs, and treat it as a platform component you own rather than as off-the-shelf SaaS. The build cost is modest; the operational benefit is substantial.
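To make the appeal concrete: LiteLLM already does the cross-vendor normalization behind an OpenAI-style interface, so the base you extend starts with that problem solved. A minimal sketch (the model identifier is illustrative and changes over time; check the current docs):

```python
# pip install litellm
from litellm import completion

# LiteLLM routes the call to the right vendor based on the model string
# and returns an OpenAI-format response regardless of who served it.
response = completion(
    model="claude-3-haiku-20240307",  # illustrative; routed to Anthropic
    messages=[{"role": "user", "content": "Summarize this ticket: ..."}],
)
print(response.choices[0].message.content)
```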
The defense, summarized
The boring middleware is what separates production-grade AI systems from prototypes. It's the layer that makes the model layer reliable, the application layer simple, and the operational story coherent. It doesn't make demos; it makes the demos repeatable.
Worth investing in. Worth treating as a first-class platform concern. Worth giving the engineers who build it well the same recognition as the engineers who build the visible features. The boring middle is where the actual reliability lives.