GPU scheduling in Kubernetes: node taints, device plugins, the practical reality

The NVIDIA device plugin, node taints and tolerations, the multi-tenancy reality of one GPU per pod vs MIG vs vGPU. The scheduling primitives that work at homelab scale on a single 1080 Ti and the ones that change shape entirely at fleet scale.

GPU scheduling in Kubernetes has the same property as IPv6: every five years the answer is supposedly close to finished, and every five years the actual fleet is still running half the old way and half the new way. The 2025 version is more mature than the 2020 version, but the gap between the documentation and the operational reality is still where teams lose weeks.

I’ve been running the NVIDIA device plugin and an open-source GPU operator on engine-01 (the Linux box with a single GTX 1080 Ti) for long enough to feel the parts of the story that work and the parts that don’t. The 1080 Ti is too old for MIG, too small for vGPU, and not exotic enough to justify any of the heavyweight scheduling primitives, which makes it a pretty honest test of the baseline pattern. The fleet-scale concerns I’m folding in from public reporting, conversations with practitioners, and the device plugin’s documented behavior on more interesting hardware.

The baseline: NVIDIA device plugin

The NVIDIA Kubernetes device plugin is the thing that makes a GPU node’s GPUs visible to the K8s scheduler. It runs as a DaemonSet on every node tagged for GPU work. It exposes nvidia.com/gpu as an allocatable resource. Pods request nvidia.com/gpu: 1 in their resource spec. The scheduler matches.

That’s the whole baseline. The plugin handles the device files (/dev/nvidia0, /dev/nvidia-uvm, etc.), the CUDA driver visibility into the pod, and the basic accounting. If you’ve ever installed the plugin and run nvidia-smi from inside a pod, you’ve seen the baseline working.
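As a concrete sketch, a minimal pod that exercises the baseline looks something like this (the image tag is illustrative; any CUDA-capable image works):

# Pod requesting one whole GPU via the device plugin
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1   # whole device; extended resources go in limits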

The catch (and it’s a meaningful one) is that the baseline gives you one GPU per pod. There’s no fractional allocation out of the box. A pod requesting nvidia.com/gpu: 1 gets the whole device. If your model needs half a GPU’s memory, the other half goes idle for the lifetime of the pod.

On engine-01 with one 1080 Ti, that’s the right model. There’s only one GPU. The 1080 Ti has 11 GB of VRAM, which is enough for a 7B at Q4 or two smaller models if you partition by hand. But that partitioning has to happen inside the serving process: vLLM doesn’t share the GPU gracefully with another vLLM instance, and the scheduler can’t help me there.

At fleet scale, the one-GPU-per-pod model gets expensive fast. An H100 has 80 GB of VRAM; if you’re serving a 7B model and consuming the whole device, you’re paying for 70 GB of unused VRAM. That’s the multi-tenancy problem.

The multi-tenancy problem and its three answers

The three answers, in increasing order of operational sophistication:

One GPU per pod (the baseline)

What I described above. It works. It’s wasteful for small models on big GPUs. It’s correct for large models on appropriately-sized GPUs.

The right answer when: your models are sized to fill the GPU (a 70B at 8-bit weights on an 80 GB H100 is a tight fit; you want the whole device), or you only have one GPU and you don’t have a choice, or you’re in the early days of platform maturity and the operational simplicity matters.

MIG (Multi-Instance GPU)

NVIDIA’s hardware-level partitioning, available on A100, H100, H200, and the newer B-series. The GPU is sliced into 2, 3, 4, or 7 instances (depending on the device), each with its own memory and compute slice, fully isolated at the hardware level. The K8s device plugin exposes each MIG instance as a separate allocatable resource.

The good: real hardware isolation. A noisy neighbor on one MIG instance can’t degrade another. The accounting is clean, and chargeback maps directly onto instances.

The bad: MIG configurations are static per node. You pick a partition profile (e.g. 7 × 10 GB on an A100 80 GB, or 2 × 40 GB) and the node is locked into that profile until you reconfigure. Dynamic re-partitioning is possible on newer hardware, but it still means draining the node. The slicing is also coarse: you can’t carve out a 3 GB instance, only the partition sizes the hardware supports.

The right answer when: you have plenty of A100/H100 hardware, multiple workloads that are too small for a full device but too big for time-sliced multi-tenancy, and a platform team that can manage the per-node partition profile as a Kubernetes-aware choice.
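For a sense of the request side: with the device plugin in its mixed MIG strategy, each profile shows up as its own resource name, and a pod fragment asks for a single slice rather than a whole device. A sketch, using the standard 1g.10gb profile on an A100 80 GB:

# Pod fragment: request one MIG slice instead of a full device
resources:
  limits:
    nvidia.com/mig-1g.10gb: 1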

vGPU and time-slicing

The software-level approaches. NVIDIA vGPU is a paid, licensed product that time-slices a physical GPU across virtual machines (originally a VDI feature, increasingly used for compute). The open-source equivalents (the device plugin’s time-slicing mode, plus third-party schedulers like HAMi or KAI) give you fractional GPU allocation by time-multiplexing the device across pods.

The good: works on any GPU (including ancient ones like the 1080 Ti). Fine-grained allocation. Cheap to set up.

The bad: no hardware isolation. A pod with a runaway CUDA kernel can take over the device and starve its co-tenants. Performance variance is real. Memory isn’t isolated either: one pod can OOM the whole GPU and bring down everything sharing it.

The right answer when: you have older hardware that doesn’t support MIG, you can tolerate the isolation tradeoff, and the workloads being co-located are tolerant of variance (eval jobs, batch inference at low priority, dev workloads).
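A minimal sketch of the device plugin’s time-slicing config, following the documented sharing format (the replica count is arbitrary):

# Config consumed by the NVIDIA device plugin (usually delivered via a ConfigMap)
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 4   # one physical GPU advertised as four schedulable units

With this in place the node advertises nvidia.com/gpu: 4 instead of 1. Nothing enforces that the four co-tenants actually fit in VRAM, which is exactly the isolation tradeoff above.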

Node taints and tolerations

The pattern I’ve covered before but worth being specific about: every GPU node should be tainted, and every workload that wants a GPU should tolerate the taint.

# On the node
taints:
  - key: nvidia.com/gpu
    value: "true"
    effect: NoSchedule

# On the pod
tolerations:
  - key: nvidia.com/gpu
    operator: Equal
    value: "true"
    effect: NoSchedule

This is the basic hygiene. Without it, every random pod the scheduler places lands on your GPU nodes and consumes their CPU/memory, leaving the GPU underutilized and the node priced wrong. With it, only the workloads that explicitly want a GPU land there.

The complement is nodeSelector or nodeAffinity for GPU type. If you have a heterogeneous GPU fleet (some H100, some A100, some L40S) you want workloads to land on the right device. Labels like nvidia.com/gpu.product=H100-80GB (the GPU operator sets these) plus a nodeSelector keep the scheduling honest.
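Put together, a workload pinned to a specific device class looks roughly like this (the label value is illustrative; use whatever GPU feature discovery reports for your SKU):

# On the pod
nodeSelector:
  nvidia.com/gpu.product: H100-80GB
tolerations:
  - key: nvidia.com/gpu
    operator: Equal
    value: "true"
    effect: NoSchedule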

At homelab scale this is straightforward: one GPU node, one taint, one toleration. At fleet scale it’s a Helm values concern, and this is exactly where Decisions as Code (DaC) pays back. DaC is the methodology behind nearly every self-service and automation system I’ve designed: extract business decisions out of platform configuration into a small, curated layer (often five real decisions where the raw config exposed eighty-nine) and let the platform absorb the rest through templates and defaults. (I called this Property Toolkit during my OneFuse days; the foundation is different, the shape isn’t.) Centralize the toleration template in the standards chart. Every workload that wants a GPU calls {{ include "standards.gpu.tolerations" . }}. Change the standard convention, propagate. The Helm values piece walks through the chart shape.
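A minimal sketch of what that centralization can look like in the standards chart. The template name comes from the text above; the file layout and indentation are my assumptions, not a prescription:

# templates/_helpers.tpl in the standards chart
{{- define "standards.gpu.tolerations" -}}
tolerations:
  - key: nvidia.com/gpu
    operator: Equal
    value: "true"
    effect: NoSchedule
{{- end -}}

# In each workload chart's pod spec, indented to match its nesting
{{ include "standards.gpu.tolerations" . | nindent 6 }}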

The NVIDIA GPU Operator

Half the story above is “install the device plugin, the driver, the container runtime, the DCGM exporter, the MIG manager.” That’s a lot of moving parts. The NVIDIA GPU Operator bundles all of it as a Kubernetes operator: install one Helm chart, get a working GPU-enabled cluster.

On engine-01, the GPU Operator handles the device plugin, the NVIDIA container toolkit, the DCGM metrics exporter, and the node-feature-discovery integration. The MIG-specific bits (the MIG manager and its partition profiles) aren’t relevant on a 1080 Ti, but the rest is exactly the same shape that runs on production fleets.
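The values fragment involved is small. A hedged sketch, using what I understand to be the operator chart’s standard toggles (verify the key names against the chart version you install):

# values.yaml fragment for the gpu-operator Helm chart (illustrative)
driver:
  enabled: true        # set false if the host already has the NVIDIA driver installed
toolkit:
  enabled: true        # NVIDIA container toolkit
dcgmExporter:
  enabled: true        # DCGM metrics for Prometheus
migManager:
  enabled: false       # a no-op on hardware without MIG, like a 1080 Ti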

The Operator is one of those pieces of K8s infrastructure that you should not roll yourself. The integration surface (kernel drivers, runtime shims, CRDs, metrics) is large enough that the official Operator is the right starting point. Spend the time configuring it; don’t spend the time recreating it.

What changes at fleet scale

The patterns above are the same at homelab and at fleet. What changes:

Heterogeneity. The homelab has one GPU. The fleet has H100s, A100s, L40Ses, and maybe some older T4s for cheap inference. Node labels become load-bearing. Workloads declare preferences. Bin-packing matters in a way it doesn’t with one device.

Multi-tenant scheduling. With one tenant (me), the priority story is trivial. With twenty tenants, the question becomes whose workload runs when capacity is constrained. Kueue is the K8s-native answer: gang scheduling for batch, queues with quotas, fair-share or priority-based preemption. It’s worth standing up the moment you have multiple teams competing for the same GPU pool.
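A hedged sketch of a per-team GPU quota in Kueue, as I understand the v1beta1 API (the flavor name and quota number are invented for illustration):

apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team-a
spec:
  namespaceSelector: {}              # which namespaces may submit to this queue
  resourceGroups:
    - coveredResources: ["nvidia.com/gpu"]
      flavors:
        - name: h100                 # a ResourceFlavor defined separately
          resources:
            - name: nvidia.com/gpu
              nominalQuota: 16       # team-a's slice of the shared pool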

MIG management. The static-partition reality means somebody owns the partition profile per node. You want enough flexibility in the partition mix to absorb the workload shape, but you don’t want to reconfigure constantly. Some teams dedicate a node pool per profile; others use the dynamic MIG features on H100 to repartition on a schedule.

Spot / preemptible GPUs. Fleet operators on AWS, GCP, or self-hosted with auctioned capacity use spot GPU instances for batch workloads. The K8s pattern is PriorityClass plus pod disruption budgets plus checkpointing in the workload itself. Homelab doesn’t have spot. Fleet does. Different operational story.

Cost attribution. Twenty teams sharing a GPU pool need cost attribution. Labels become the unit of accounting. The DaC centralization of labels matters here for the same reason it matters everywhere else: divergent label schemes destroy your cost attribution. Centralize it.

What I keep coming back to

GPU scheduling on K8s in 2025 is in a similar place to where general compute scheduling was on K8s in 2018: the primitives work, the operational pattern is real, the documentation lags the reality, and the second 80% of value comes from getting the multi-tenancy story right.

The homelab pattern (one GPU, one taint, one toleration, one workload) is the simplest version of the story. The fleet pattern adds MIG or time-slicing, heterogeneous fleets, Kueue for multi-team scheduling, and a real cost-attribution discipline. The shape of the K8s primitives doesn’t change. The numbers in the YAML do.

If you’re standing up GPU workloads on K8s for the first time: install the NVIDIA GPU Operator, taint your nodes, write your tolerations into a standards chart, start with one-GPU-per-pod, and only reach for MIG or time-slicing when the workload count or the device size makes the math hurt. The trap, as with most platform work, is reaching for the complex primitive before the simple one stops working.

The 1080 Ti in engine-01 is going to age out long before the patterns I’ve built around it do. That’s the point. The patterns are what travel.