Observability for AI workloads: Prometheus, Grafana, Loki

Latency, token throughput, cost-per-request, queue depth, drift. Prometheus for the metrics, Loki for the prompts and responses with PII discipline, Tempo for the tracing across agent calls. The dashboard JSON itself as a Decisions as Code surface, projected per-environment.

The first time you put a model into production, the dashboards you stand up are the dashboards you already have for any service: requests per second, p99 latency, error rate, CPU, memory, restart count. Those are still useful. They’re also missing the metrics that actually tell you whether the model is doing its job.

The observability shape for AI workloads is the regular service-observability shape with a handful of new metrics layered in, and one substantial new concern: drift. The tooling is already mature. Prometheus for metrics, Loki for logs, Tempo for traces, Grafana to put it all together. None of that is new. What’s new is the metric set, the log discipline, and the trace topology that emerges when an agent makes ten downstream calls per user request.

This is what I run on the homelab cluster. Prometheus and Grafana on engine-01, Loki on the same node with retention on the Synology, Tempo for the agent traces because the cluster runs agents now and the topology stops being legible without distributed tracing. The size is small. The shape is the same shape that scales.

The metric set that matters

Start with the regular service metrics. Then layer in the AI-specific ones.

Latency, multi-stage. A model serving request has at least three latency stages worth measuring separately: queue time (request arrived, hasn’t started inference), prefill time (the model is processing the prompt), and decode time (the model is generating tokens). Combined latency is misleading; per-stage latency tells you where the bottleneck is. vLLM, KServe, and Triton all expose these as separate Prometheus metrics. Wire them up.
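
If your serving layer doesn’t already export these and you’re wrapping it yourself, the instrumentation is small. A minimal sketch with prometheus_client; the metric names and the engine interface (acquire, prefill, decode) are placeholders, not anyone’s real API:

    import time
    from prometheus_client import Histogram

    # Placeholder names; prefer the equivalents your serving stack exports.
    QUEUE_TIME = Histogram("llm_queue_seconds", "Time waiting before inference starts")
    PREFILL_TIME = Histogram("llm_prefill_seconds", "Time spent processing the prompt")
    DECODE_TIME = Histogram("llm_decode_seconds", "Time spent generating tokens")

    def handle(request, engine):
        t0 = time.monotonic()
        slot = engine.acquire()                      # blocks while the request queues
        QUEUE_TIME.observe(time.monotonic() - t0)

        t1 = time.monotonic()
        ctx = engine.prefill(slot, request.prompt)   # prompt processing
        PREFILL_TIME.observe(time.monotonic() - t1)

        t2 = time.monotonic()
        output = engine.decode(ctx)                  # token generation
        DECODE_TIME.observe(time.monotonic() - t2)
        return output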

Token throughput. Tokens-per-second on the prefill side and tokens-per-second on the decode side. These are the right capacity metrics. Requests-per-second is the wrong unit because requests vary wildly in size; tokens-per-second normalizes.
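
Exporting these yourself is two counters; a Prometheus rate() over either one yields tokens-per-second. Names are again illustrative:

    from prometheus_client import Counter

    PROMPT_TOKENS = Counter("llm_prompt_tokens_total", "Prefill-side tokens", ["model"])
    OUTPUT_TOKENS = Counter("llm_output_tokens_total", "Decode-side tokens", ["model"])

    def record_tokens(model, prompt_tokens, output_tokens):
        PROMPT_TOKENS.labels(model=model).inc(prompt_tokens)
        OUTPUT_TOKENS.labels(model=model).inc(output_tokens)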

Cost per request. The metric most teams under-instrument. For a self-hosted model, cost is a function of GPU-seconds; for an API model, it’s a function of input and output token counts at the published rate. Either way, expose a counter that increments by the per-request cost. Wire a Grafana panel that shows daily cost per workload, per team, per model. The first time someone sees the daily total, the conversation about prompt size and context engineering changes.
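
The counter shape, sketched for the API-model case; the rates here are made up, and the label set (model, team) is the one the Grafana panel groups by:

    from prometheus_client import Counter

    # Made-up rates, USD per 1K tokens (input, output); substitute the published ones.
    RATES = {"api-model-large": (0.0025, 0.0100)}

    COST_USD = Counter("llm_cost_usd_total", "Cumulative inference cost", ["model", "team"])

    def record_cost(model, team, prompt_tokens, output_tokens):
        rate_in, rate_out = RATES[model]
        cost = prompt_tokens / 1000 * rate_in + output_tokens / 1000 * rate_out
        COST_USD.labels(model=model, team=team).inc(cost)

For the self-hosted case, swap the token math for GPU-seconds times an amortized hourly rate; the counter shape stays the same.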

Queue depth. How many requests are waiting at the serving foundation. The leading indicator for SLO breaches. The lagging indicator is p99 latency; queue depth tells you about the breach 30 seconds before latency does.

Cache hit rate. If you’re using a prefix cache (vLLM’s automatic prefix caching, anyone’s KV cache reuse), the hit rate is the metric that explains your cost per request. Track it.

Eval scorecard, periodic. Not every request, but a scheduled job that runs the eval suite against the live deployment and pushes the scores as Prometheus metrics. Accuracy, F1, retrieval recall@k. The drift detection metric.
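
Since the eval job is batch rather than a long-lived server, the scores go through the Pushgateway. A sketch; the gateway address and the score names are assumptions:

    from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

    def push_scores(scores, model_version):
        registry = CollectorRegistry()
        g = Gauge("eval_score", "Scheduled eval suite score",
                  ["metric", "model_version"], registry=registry)
        for name, value in scores.items():
            g.labels(metric=name, model_version=model_version).set(value)
        push_to_gateway("pushgateway:9091", job="eval-scorecard", registry=registry)

    # From the scheduled job, e.g.:
    # push_scores({"accuracy": 0.91, "retrieval_recall_at_5": 0.78}, "v14")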

Tool call counts. For agent workloads, the count of tool calls per request and the per-tool error rate. The runaway-tool-call story I’ve written about before becomes visible only if you’re counting them.
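
Counting them is one counter with a tool label and a status label, wrapped around whatever your dispatch looks like (the tool interface here is hypothetical):

    from prometheus_client import Counter

    TOOL_CALLS = Counter("agent_tool_calls_total", "Tool invocations", ["tool", "status"])

    def call_tool(tool, args):
        try:
            result = tool.run(args)          # hypothetical dispatch interface
            TOOL_CALLS.labels(tool=tool.name, status="ok").inc()
            return result
        except Exception:
            TOOL_CALLS.labels(tool=tool.name, status="error").inc()
            raise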

Loki for prompts and responses, with PII discipline

The temptation with AI logging is to log every prompt and every response in full because you’ll want them for debugging. The temptation is right. The discipline has to come with it.

Loki handles the volume fine. The discipline question is what’s in the logs. Prompts can contain PII. Responses can contain PII. Tool call payloads can contain PII. If you log them naively, you’ve created a PII store the size of your inference traffic. The PII-aware prompting piece covered the pattern at the application level; the observability extension is that the logging path needs to honor the same discipline.

The pattern that works:

  • A logging middleware in the serving stack that runs PII detection on prompts and responses before they hit Loki.
  • Detected PII gets tokenized to a hash; the hash and the token type get logged; the original PII does not.
  • A separate, restricted-access stream for “raw” logging when explicit debugging requires it, with a short retention.
  • Loki labels for the workload, the team, the model version, and the user-class (so you can filter by tenant for multi-tenant deploys without inventing a new index).

The PII filter is not perfect. It’s the same regex-plus-classifier shape every team eventually builds. The discipline is to run it before write rather than at query time.
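
A sketch of the middleware’s scrub step, regex layer only (the classifier pass is elided, and these two patterns are illustrative, not a complete PII taxonomy):

    import hashlib
    import json
    import re

    PII_PATTERNS = {
        "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
        "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    }

    def scrub(text):
        for kind, pattern in PII_PATTERNS.items():
            def tokenize(match, kind=kind):
                digest = hashlib.sha256(match.group().encode()).hexdigest()[:12]
                return f"<{kind}:{digest}>"    # hash and token type logged, PII not
            text = pattern.sub(tokenize, text)
        return text

    def log_exchange(logger, prompt, response, labels):
        # Scrub before the write; raw text never reaches Loki.
        logger.info(json.dumps({**labels,
                                "prompt": scrub(prompt),
                                "response": scrub(response)}))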

Tempo for distributed tracing across agent calls

A single-call inference deploy doesn’t need distributed tracing. An agent does. The moment your “request” decomposes into a planning call, three retrieval calls, a reranker call, a tool call, and a final synthesis call, the only way to understand a slow response is to look at the trace.

Tempo plus OpenTelemetry instrumentation in the agent runtime gives you that. The OTel SDK auto-instruments the HTTP calls; the agent code adds spans for the planning, the tool selection, the tool execution. Each span carries the model used, the token counts, the latency. Grafana’s Tempo integration lets you click from a high-latency Loki line into the trace that produced it.

The discipline that makes traces useful is the trace ID propagation across the agent call graph. Tools that are called via MCP need to know how to forward the trace ID. Tools that are called via plain HTTP need the same. Without the propagation, you have a thousand independent spans and no graph.
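
With OTel the propagation is one call: inject() writes the current span’s W3C traceparent into the outgoing headers. A sketch for the plain-HTTP tool path; the span and attribute names are my own conventions:

    import requests
    from opentelemetry import trace
    from opentelemetry.propagate import inject

    tracer = trace.get_tracer("agent-runtime")

    def execute_tool(url, payload):
        with tracer.start_as_current_span("tool.execute") as span:
            span.set_attribute("tool.url", url)
            headers = {}
            inject(headers)    # adds traceparent from the current span context
            return requests.post(url, json=payload, headers=headers, timeout=30)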

Drift detection

Model drift is a slow failure mode. The model is healthy by every infra metric (latency is fine, error rate is zero, no OOMs) but the answers it’s producing are subtly worse than they were a month ago because the input distribution has shifted. The serving foundation doesn’t know. The infra dashboards don’t know. The user complaints will eventually arrive, with weeks of delay.

The drift detection metric set:

  • Embedding drift. Compute embeddings of recent inputs; compare the distribution to a reference distribution captured at deploy time. Population stability index (sketched after this list), KL divergence, or something simpler. Push as a Prometheus metric.
  • Output distribution drift. Sample outputs and run cheap classifiers (length, sentiment, toxicity, language). Compare to reference. Push as metrics.
  • Eval scorecard, against a held-out set. Same eval suite that gates the deploy, run on a schedule, with the score pushed to Prometheus. Alert when the score drops by more than a threshold.

None of these are perfect drift detectors. Together they’re enough to notice the slow failure mode before the user complaints route to the on-call.
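
For reference, the PSI computation from the first bullet is a few lines once both distributions are binned into matching histograms of proportions. The thresholds in the comment are the common rule of thumb, not a standard:

    import math

    def psi(expected, actual, eps=1e-6):
        """Population stability index over matching histogram bins (proportions)."""
        return sum((a - e) * math.log((a + eps) / (e + eps))
                   for e, a in zip(expected, actual))

    # Rule of thumb: < 0.1 stable, 0.1-0.25 drifting, > 0.25 worth an alert.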

Grafana JSON as the DaC surface

This is the part that keeps the K8s pieces in this batch coherent. Decisions as Code (DaC) is the methodology behind nearly every self-service and automation system I’ve designed: extract business decisions out of platform configuration into a small, curated layer (often five real decisions where the raw config exposed eighty-nine) and let the platform absorb the rest through templates and defaults. (I called this Property Toolkit during my OneFuse days; the foundation is different, the shape isn’t.)

A Grafana dashboard is JSON. A Grafana folder of dashboards is a directory of JSON files. Dashboards-as-code is a mature pattern with multiple loaders (Grafana Operator, Terraform Grafana provider, Grizzly, dashboard ConfigMaps via the sidecar). The dashboard definitions are the DaC surface for observability.

The methodology says: the dashboard definitions are organizational decisions. They live in a dashboards/ repo, parameterized by team / workload / environment. The standards repo is the source of truth. Each environment’s Grafana loads the dashboards from that source via the sidecar pattern; the values that vary per environment (the data source name, the team label) are projected in via Helm template substitution.
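
In my setup the projection is Helm template substitution, but the shape is easy to see in a few lines of Python: load the standard dashboard, substitute the per-environment values, validate, hand the result to the loader. The file path and placeholder names here are hypothetical:

    import json
    from pathlib import Path

    def project(template_path, values):
        text = Path(template_path).read_text()
        for key, value in values.items():
            text = text.replace("${" + key + "}", value)
        json.loads(text)    # fail at render time, not inside Grafana
        return text

    rendered = project("dashboards/ai-workload.json",
                       {"DATASOURCE": "prometheus-prod", "TEAM": "ml-platform"})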

The shape is the same as the Helm values piece. Centralized decisions. Platform-aware shapes (Grafana JSON is the platform shape). Variable interpolation (template substitution at load). Composition (panels reused across dashboards via panel libraries). Discovery convention (the dashboards folder structure tells the loader what to load).

The benefit at any scale: the on-call’s first dashboard is the same dashboard the next team’s on-call sees, projected onto their workload. The “everyone reinvents the AI dashboard” problem disappears because the AI dashboard is a parameterized standard, not a per-team artifact.

What I keep coming back to

The observability stack for AI workloads doesn’t need new tools. Prometheus, Grafana, Loki, Tempo, the same CNCF-graduated set the rest of the platform is already running on. The discipline is in the metric set, the log hygiene, the trace topology, and the drift detection loop. None of those are exotic. All of them are skipped by teams that thought “AI is just another service” and ran out of time.

The DaC-shaped piece is treating the dashboards themselves as the curated decision surface. That’s the thing that makes the observability story scale across teams without each team reinventing it. Define the dashboard once. Project it onto every workload. Validate the data source bindings at deploy. Pair with alerting policies that inherit the same standard thresholds. The methodology applies to the observability layer the same way it applies to the workload layer. Once you see the pattern, you see it everywhere.