Model serving on Kubernetes: KServe, vLLM, and the substrate question

KServe, vLLM, Triton, BentoML: four different answers to the same question. Each layer trades operational complexity for serving features. Worth being concrete about which layer is right for which workload, tested on a single GTX 1080 Ti.

Stand up a model serving stack on Kubernetes in late 2025 and the first decision you make is the most consequential: which serving foundation are you putting in front of the model? The choices, roughly in order of how much they ship in the box:

  • A plain Deployment running vLLM (or llama.cpp, or TGI, or Ollama) as a Service. Minimum foundation. You own everything above the inference runtime.
  • BentoML / Bento Cloud’s open core. A serving framework with packaging, versioning, and some lifecycle plumbing. K8s-native via Yatai or your own manifests.
  • NVIDIA Triton. The classic multi-model, multi-framework inference server. K8s-friendly via the Triton operator or a Deployment. Strong on tensor-parallel scheduling and dynamic batching.
  • KServe. A full inference platform on K8s. Knative-based autoscaling, multi-model serving via ModelMesh, transformer / predictor / explainer separation, payload logging, the works.

Each layer adds plumbing. Each layer adds operational surface. The foundation question is which trade-off is right for the workload you actually have, not for the workload the vendor’s demo had.

I’ve been running a KServe + vLLM combo on engine-01 (the Linux box with the GTX 1080 Ti) partly to test how the patterns hold at small scale and partly because the foundation question is easier to feel when there’s one GPU and the costs are tangible. The 1080 Ti can’t run a frontier model, but it can run a 7B at Q4 quite comfortably, and that’s enough to exercise the moving parts.

What the foundation question is actually asking

When somebody says “we’ll just run vLLM on K8s,” the implied plan, sketched in the manifests just after this list, is:

  • Build a container with vLLM + the model weights mounted from a PVC.
  • Wrap it in a Deployment.
  • Put a Service in front.
  • Add an HPA on some metric.
  • Done.
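
For concreteness, the whole plan is two manifests. A minimal sketch, with the image tag, model path, and PVC name as illustrative assumptions:

  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: vllm-7b
  spec:
    replicas: 1
    selector:
      matchLabels: { app: vllm-7b }
    template:
      metadata:
        labels: { app: vllm-7b }
      spec:
        containers:
          - name: vllm
            image: vllm/vllm-openai:latest          # pin a real tag in practice
            args: ["--model", "/models/7b-q4"]      # weights live on the mounted PVC
            ports:
              - containerPort: 8000                 # OpenAI-compatible API
            resources:
              limits:
                nvidia.com/gpu: "1"
            volumeMounts:
              - name: weights
                mountPath: /models
                readOnly: true
        volumes:
          - name: weights
            persistentVolumeClaim:
              claimName: model-weights              # hypothetical PVC name
  ---
  apiVersion: v1
  kind: Service
  metadata:
    name: vllm-7b
  spec:
    selector:
      app: vllm-7b
    ports:
      - port: 80
        targetPort: 8000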

That works. For one model, one team, one use case, that’s the right answer. Spend a day getting it right and move on. The foundation question stops mattering.

The foundation question matters when:

  • You have multiple models to serve, and you don’t want each one to consume a whole pod.
  • You have traffic that needs to autoscale to zero, because the GPU node-hours are expensive enough that idle is unacceptable.
  • You need request/response logging for eval, audit, or compliance, and you don’t want to bake it into every serving image.
  • You want a canary / shadow / A-B framework to test model versions in production without writing a custom router.
  • You’re running enough serving workloads that the operational pattern needs to be standardized.

At one model and one team, plain vLLM is fine. At fifteen models and four teams, plain vLLM is fifteen bespoke deployments and four different opinions about how to log inference traffic. The serving foundation is the standardization tool. The question is how much standardization you actually need.

What each layer brings

Plain vLLM on a Deployment

The minimum. vLLM is excellent at what it does: high-throughput, batched LLM inference with continuous batching and PagedAttention, the latter being much of the reason it’s the de facto open-source standard. Wrap it in a Deployment, mount the model weights from a PVC, expose an OpenAI-compatible endpoint. You get a service with tokens-per-second throughput comparable to anything else in open source.

What you don’t get: built-in cold-start optimization, scale-to-zero, multi-model serving on a single pod, request logging, eval hooks, canary deploys. All of those are your problem.

This is the right answer when you’re serving one model, you control the traffic, and you don’t need the platform-level features. Most teams’ first deployment looks like this. Most teams’ second deployment looks like this too, before they hit the wall.

BentoML

A serving framework that wraps a model in a Python-class API, builds an OCI image, and deploys it. The feature set includes adaptive batching, model versioning, and a packaging story for multi-stage pipelines (preprocessing → model → postprocessing). On K8s you either use Yatai (BentoML’s K8s operator) or roll your own manifests.
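
The packaging side is a bentofile.yaml next to the service code; bentoml build turns it into a versioned bundle and then an OCI image. A minimal sketch, with the module path and dependencies as illustrative assumptions:

  # bentofile.yaml
  service: "service:svc"        # import path of the bentoml Service object
  include:
    - "*.py"                    # ship the service code with the bundle
  python:
    packages:
      - torch
      - transformers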

The right answer when you have heterogeneous model types (not just LLMs: image models, embedding models, classical ML, all on the same platform) and you want a unified packaging and serving story. BentoML’s strength is that it doesn’t assume your model is an LLM; KServe and vLLM both lean LLM-shaped in 2025.

The wrong answer when you only have LLMs and you want LLM-specific optimizations like continuous batching and prefix caching. BentoML can call vLLM under the hood, but you’re stacking two abstractions when one would do.

NVIDIA Triton

The grandparent of the modern inference servers. Triton handles multiple frameworks (TensorRT, PyTorch, ONNX, TensorFlow, Python backend), multiple models per server, dynamic batching, model ensembles, and gRPC + HTTP endpoints. On K8s, run it as a Deployment or use the Triton Inference Server operator.
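
The plain-Deployment flavor has the same shape as the vLLM one above; only the container block changes. A fragment of the pod spec, with the image tag as an assumption:

  containers:
    - name: triton
      image: nvcr.io/nvidia/tritonserver:24.08-py3   # tag is an assumption
      args:
        - tritonserver
        - --model-repository=/models    # one subdirectory per model
      ports:
        - containerPort: 8000           # HTTP
        - containerPort: 8001           # gRPC
        - containerPort: 8002           # Prometheus metrics
      resources:
        limits:
          nvidia.com/gpu: "1"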

Where Triton shines: mixed-model serving (an embedding model and a classifier and an LLM all in one server), performance tuning (the perf analyzer is genuinely good), and NVIDIA-stack integration (TensorRT, NIM, Riva, etc.).

Where Triton doesn’t shine: it’s not the natural home for the latest open-weights LLM trick of the month. The vLLM ecosystem moves faster on LLM-specific optimizations. Triton’s model is “any framework”; vLLM’s model is “LLMs, very well.”

NVIDIA has been folding Triton into NIM (NVIDIA Inference Microservices), and the line between “use Triton” and “use NIM” is getting fuzzy. For teams that have already committed to the NVIDIA stack, NIM is where this is going. For teams that haven’t, Triton is one of the substrates, not the foundation.

KServe

The K8s-native AI serving platform. KServe wraps your model in an InferenceService CRD, handles autoscaling (Knative-based, scale-to-zero out of the box), supports canary and shadow deploys, has payload logging, transformer/predictor/explainer separation for ML pipelines, and ModelMesh for multi-model serving when you have hundreds of small models.

KServe doesn’t run the model itself. It wraps a serving runtime, and one of those runtimes is vLLM. So the natural pattern in 2025 is KServe as the platform layer, vLLM as the runtime. KServe handles the K8s plumbing; vLLM handles the inference.
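
In manifest form the pattern is compact. A sketch assuming KServe’s Hugging Face runtime, which is vLLM-backed for generative models; the name and storage URI are illustrative:

  apiVersion: serving.kserve.io/v1beta1
  kind: InferenceService
  metadata:
    name: llm-7b
  spec:
    predictor:
      model:
        modelFormat:
          name: huggingface               # resolves to the vLLM-backed runtime
        storageUri: pvc://model-weights/7b-q4
        resources:
          limits:
            nvidia.com/gpu: "1"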

This is the foundation that justifies its complexity when you’re running serving as a platform for multiple teams. The platform team owns KServe, the deployment patterns, the autoscaling defaults, the logging story; product teams declare InferenceService manifests and consume the platform. It’s the model that scales beyond “one team, one model.”

The cost is real: Knative is not lightweight, the CRD surface is large, and the operational burden requires somebody to actually own the platform.

Cold start, autoscaling, multi-model

These are the three operational levers worth thinking about explicitly.

Cold start. Loading a 7B model from a PVC takes 20-40 seconds on a SATA-class disk; from a fast NVMe-backed parallel filesystem, single-digit seconds. Loading from object storage at pod startup is a non-starter for any model worth serving. The fix is the staged-PVC pattern I covered in the patterns piece: stage the weights once, mount them read-only, accept that the cold start is bounded by disk I/O.
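
One way to express the staging half of that pattern: a one-shot Job pulls the weights onto the PVC, and every serving pod mounts the claim read-only afterward. Bucket and PVC names are hypothetical:

  apiVersion: batch/v1
  kind: Job
  metadata:
    name: stage-7b-weights
  spec:
    template:
      spec:
        restartPolicy: Never
        containers:
          - name: stage
            image: amazon/aws-cli                  # any image with an object-store client
            args: ["s3", "cp", "--recursive",
                   "s3://models/7b-q4", "/models/7b-q4"]
            volumeMounts:
              - name: weights
                mountPath: /models
        volumes:
          - name: weights
            persistentVolumeClaim:
              claimName: model-weights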

KServe + Knative can scale to zero, which is great for the cost case but interacts badly with cold start. The pattern: keep a minimum replica count for hot models, scale to zero only for the long-tail models where a 30-second first-request latency is acceptable. ModelMesh is the alternative (many models multiplexed onto fewer servers) but the operational model is more complex.
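
In KServe terms that pattern is one field per InferenceService. A sketch of the two profiles:

  # hot model: keep a replica warm, never cold-start on the request path
  spec:
    predictor:
      minReplicas: 1

  # long-tail model: let Knative scale it to zero and accept the
  # ~30-second first-request latency
  spec:
    predictor:
      minReplicas: 0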

Autoscaling. HPA on CPU is wrong for LLM workloads. CPU isn’t the bottleneck. The right signals are GPU utilization (via the NVIDIA DCGM exporter), queue depth, or in-flight tokens per second. Custom metrics via the K8s metrics API plus a Prometheus adapter is the path; vLLM’s metrics endpoint already exposes most of what you need.
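
A sketch of the queue-depth version: vLLM exposes a waiting-requests gauge (vllm:num_requests_waiting) on its metrics endpoint, and a Prometheus adapter surfaces it to the HPA. The adapter-side metric name and the threshold here are assumptions:

  apiVersion: autoscaling/v2
  kind: HorizontalPodAutoscaler
  metadata:
    name: vllm-7b
  spec:
    scaleTargetRef:
      apiVersion: apps/v1
      kind: Deployment
      name: vllm-7b
    minReplicas: 1
    maxReplicas: 4
    metrics:
      - type: Pods
        pods:
          metric:
            name: vllm_num_requests_waiting   # as renamed by the adapter config
          target:
            type: AverageValue
            averageValue: "8"                 # scale out past ~8 queued requests per pod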

Multi-model. ModelMesh on KServe is the K8s-native answer for “many small models, not enough traffic each to justify their own pod.” Triton handles multi-model in-process. vLLM doesn’t natively (one model per process); you’d run several pods. The right foundation depends on the model count and the per-model traffic profile.
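
On KServe, opting a model into ModelMesh is an annotation on the InferenceService rather than a different resource; a sketch with illustrative names:

  apiVersion: serving.kserve.io/v1beta1
  kind: InferenceService
  metadata:
    name: small-classifier
    annotations:
      serving.kserve.io/deploymentMode: ModelMesh   # multiplex onto shared serving pods
  spec:
    predictor:
      model:
        modelFormat:
          name: sklearn                             # ModelMesh suits many small models
        storageUri: s3://models/small-classifier    # bucket is hypothetical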

On the homelab

KServe + vLLM on engine-01 serves a quantized 7B model. The InferenceService manifest is fifty lines of YAML; the model weights are on a PVC backed by store-01’s NFS share; the GPU node taint pulls the pod onto engine-01; the HPA scales between 1 and 1 (single GPU, can’t do better). Scale-to-zero is off because the cold-start cost would dominate.
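
Condensed, the predictor block that does the work looks like this; the runtime and resource fields are the same shape as the earlier sketch, and the toleration key is an approximation of what’s on my cluster:

  spec:
    predictor:
      minReplicas: 1                  # scale-to-zero off: cold start would dominate
      maxReplicas: 1                  # one GPU, one replica
      tolerations:
        - key: nvidia.com/gpu         # lands the pod on engine-01’s tainted GPU node
          operator: Exists
          effect: NoSchedule
      model:
        storageUri: pvc://model-weights/7b-q4   # backed by store-01’s NFS share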

That setup is not what an enterprise would deploy. The shape of the manifests is. The InferenceService CRD, the storage spec, the runtime spec, those are identical at homelab and at fleet scale. The only thing that changes is the count.

That’s the test the homelab keeps passing for me: the patterns are portable. The shape that works at one GPU is the shape that works at a hundred, with different numbers in the values.yaml. (Which is the whole point of the Helm values piece earlier in this batch.)

What I keep coming back to

The foundation question doesn’t have a universal answer. It has an honest one: pick the layer that matches your operational maturity and your model count.

  • One model, small team, plain vLLM. Stop overthinking it.
  • A few models, one team, BentoML or vLLM behind a thin custom router. Still simple.
  • Many models, multiple teams, KServe with vLLM as the runtime, ModelMesh for the long tail. Platform layer earns its keep.
  • Mixed frameworks, NVIDIA-stack-committed, Triton or NIM. The right answer inside that ecosystem.

The trap is picking the heavyweight foundation too early. KServe’s complexity makes sense at fifteen models across four teams. At one model and one team, it’s a tax. Match the foundation to the operational reality, not to the slide that made it look impressive.