CI/CD for AI models: the pipeline shape that holds up
Tekton, Argo CD, GitHub Actions, Jenkins X: four answers to model deploys. You can't unit-test a model, so eval suites become the test substitute. Versioning, rollback, blue-green serving. Pipeline config as the Decisions as Code surface, projected per environment.
The first time a team ships a model the way they ship a microservice, the CI/CD pipeline they reach for is the same one they were already using for the rest of the platform. That's correct as a starting point and wrong as a finishing point. The shape of an AI deploy is different enough from a service deploy that the pipeline has to grow new joints. Good news: the joints are well-understood now. The 2025 question isn't "what's the AI CI/CD tool?", it's "how do I bend my existing CI/CD setup around the parts of the model lifecycle that don't fit?"
I've been running variants of this on engine-01 (the Linux box with the GTX 1080 Ti) partly to see how the patterns behave on a single GPU node and partly because the problems are easier to feel when you can't hide behind a fleet. The shape that's emerged across industry conversations and the homelab work looks roughly the same regardless of what you're running on.
[Diagram: the pipeline shape that holds up. Data (frozen + signed) → Evals (fail fast) → Build (model + bundle) → Release Gate (policy + sign-off) → Deploy (canary → full). Every stage records a signed audit event. Five stages, one signed trail; the Release Gate is where humans veto.]
The four tool choices
Pick the tool that matches the rest of your platform. The differences aren't huge.
Tekton is the K8s-native pipeline runtime. CRDs all the way down: Pipelines, Tasks, PipelineRuns. Strong if your platform is K8s-first and you want pipeline definitions to live next to the workloads they ship. The Tekton Hub has reusable Tasks for most of what an AI pipeline needs (S3 upload, container build, conftest, kustomize). The pain point is the verbosity: every Task is a Pod, every step is a container, and the YAML adds up.
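For a feel of that verbosity, here's a minimal sketch of a single Task. It's a generic example, not a Hub Task; the param name, image tag, and bucket layout are all illustrative:

```yaml
apiVersion: tekton.dev/v1
kind: Task
metadata:
  name: push-model
spec:
  params:
    - name: model-uri        # where the artifact lands in the registry bucket
      type: string
  workspaces:
    - name: model            # the built model artifact, mounted by the PipelineRun
  steps:
    - name: upload
      image: amazon/aws-cli:2.15.0
      script: |
        aws s3 cp "$(workspaces.model.path)" "$(params.model-uri)" --recursive
```

One upload: one CRD, one Pod, one container. Multiply by every stage in the diagram above and the YAML bill is real.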
Argo CD is the GitOps continuous-deployment side. Pair it with Argo Workflows for the pipeline side and you have a clean K8s-native combination. Argo CD watches the manifest repo; Workflows runs the pipeline jobs. For model deploys this works well because the deployed object (a KServe InferenceService, a Helm release for vLLM) is a manifest that GitOps loves.
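The Argo CD half is a single Application pointed at the manifest repo. A sketch, with a repo URL, paths, and namespaces that are obviously hypothetical:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: summarizer-serving
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/platform-manifests   # hypothetical repo
    path: serving/summarizer
    targetRevision: main
  destination:
    server: https://kubernetes.default.svc
    namespace: ml-serving
  syncPolicy:
    automated:
      prune: true
      selfHeal: true    # drift in the cluster gets reconciled back to Git
```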
GitHub Actions is the path most teams are already on. Hosted runners for the lightweight steps, self-hosted runners on GPU nodes for the eval steps that need a GPU. The pain point is that the model artifact is too big for the artifact store; you push to a model registry instead and pass a reference. The benefit is that everyone already knows it.
Jenkins X is the most opinionated of the four. Cloud-native Jenkins with built-in GitOps and preview environments. Strong if your team wants the workflow defined for them and weak if your team wants to define it themselves. I haven't seen it picked for new AI work in 2025; the teams that are on it inherited it.
The honest answer for most teams: stay on what you have. The differentiation is in what the pipeline does, not the runner.
You can't unit-test a model
This is the central asymmetry. A regular service has unit tests, integration tests, contract tests, discrete pass/fail signals on small inputs. A model has none of that. The model's behavior is a continuous distribution; "did it pass" is a property of how it does across a representative dataset, not a property of any single input.
The substitute is the eval suite. The pipeline runs the model against a held-out evaluation dataset and computes metrics. Accuracy, F1, BLEU, ROUGE, exact-match, retrieval recall@k, latency at p50/p95/p99, whatever's relevant to the task. Then it compares those metrics to a baseline (the previously deployed model, a fixed reference, or both). The eval result is the deploy gate. If accuracy regresses by more than X percent, the pipeline fails. If latency regresses by more than Y percent, the pipeline fails. If a small set of canary prompts produces different outputs than it did before, the pipeline fails.
The eval suite IS the test suite. Treat it that way. Version the eval dataset alongside the model. Run it on every PR that touches the model artifact, the prompt, the system message, the retrieval index, anything that can change behavior.
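Concretely, the gate can be an ordinary CI job. A minimal sketch as a GitHub Actions job, assuming hypothetical eval scripts (run_suite.py, compare.py), a hypothetical dataset path, and thresholds you'd tune per task:

```yaml
jobs:
  eval-gate:
    runs-on: [self-hosted, gpu]   # GPU runner label is site-specific
    steps:
      - uses: actions/checkout@v4
      - name: Run the eval suite against the candidate model
        run: |
          python evals/run_suite.py \
            --model s3://models/summarizer/v1.4.7 \
            --dataset evals/holdout-v3 \
            --out scorecard.json
      - name: Fail on regression against the deployed baseline
        run: |
          # compare.py exits nonzero if accuracy drops more than 1%
          # or p95 latency grows more than 10% vs. the baseline scorecard
          python evals/compare.py \
            --candidate scorecard.json \
            --baseline baselines/deployed.json \
            --max-accuracy-drop 0.01 \
            --max-p95-latency-growth 0.10
```

The versioned eval dataset (holdout-v3 here) is pinned in the repo for exactly the reason above: change the dataset and you've changed the test suite.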
Model versioning
A model is an artifact like any other; it just lives in a model registry instead of a container registry. MLflow, Weights & Biases, BentoML's model store, Hugging Face's private hub, or a plain S3 bucket with a metadata file; the choice matters less than the discipline.
The minimum metadata per version: model name, version (semver or hash), training data version, base model and fine-tune deltas, eval scorecard, training config, signing key. The deploy artifact is a manifest that points at the model version, not a manifest that embeds the model bytes. The Helm chart for the serving stack takes a model.uri value; the pipeline updates that value; GitOps deploys the new pointer.
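As a sketch, the registry entry can be as small as a metadata file stored next to the artifact. Every field name here is illustrative, not any particular registry's schema:

```yaml
name: summarizer
version: v1.4.7
training_data: support-tickets@2025-06-01       # frozen + signed dataset version
base_model: meta-llama/Llama-3.1-8B             # illustrative base
fine_tune_deltas: s3://models/summarizer/v1.4.7/lora/
eval_scorecard: s3://eval-results/summarizer/v1.4.7/scorecard.json
training_config: s3://models/summarizer/v1.4.7/train-config.yaml
signed_by: release-key-2025
```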
The benefit of treating the model as a versioned artifact is that rollback becomes trivial. The deployed object points at v1.4.7. Roll back the manifest to point at v1.4.6. The serving stack's pod re-pulls the older artifact. Done. No "rebuild the model" step in an emergency.
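Under that scheme the rollback is a one-line Git change, reverted through the normal GitOps flow. Continuing the hypothetical values file:

```yaml
# values/prod.yaml: the only line a rollback touches
model:
  uri: s3://models/summarizer/v1.4.6   # reverted from v1.4.7; sync re-pulls the old artifact
```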
Blue-green for serving
Model rollouts are higher-risk than service rollouts because the regression mode is "subtly worse outputs," not "5xx everywhere." Subtle regressions don't show up in health checks. The pattern that's emerged is blue-green or shadow.
Blue-green: deploy the new model alongside the old one. Health checks pass on both. The router (Istio, Linkerd, KServe's traffic-splitting, an NGINX in front) shifts traffic from blue to green in stages. 10% for an hour. 50% for a day. 100% when the eval-on-live-traffic metrics confirm parity.
Shadow: send live traffic to both blue and green, but only return blue's response to the user. Compare outputs offline. Catch divergences before flipping. Costs double inference for the shadow window, but it's the only way to feel the regression on real traffic before users do.
KServe makes both patterns first-class. Other tools do them through the service mesh. Either way, the pipeline knows which mode it's in and gates the next traffic-shift step on the eval result.
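In KServe's v1beta1 API the traffic split is a field on the predictor. A sketch, with the model format and URI as placeholders: update storageUri to the candidate version, set canaryTrafficPercent, and KServe keeps the previous revision serving the remainder:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: summarizer
spec:
  predictor:
    canaryTrafficPercent: 10     # green gets 10%; raise in stages as evals confirm parity
    model:
      modelFormat:
        name: huggingface        # illustrative runtime
      storageUri: s3://models/summarizer/v1.4.7
```

The pipeline's traffic-shift step is then a patch to that one field, gated on the live eval result.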
The pipeline as a Decisions as Code surface
This is where the approach I've been threading through everything K8s-related shows up again. Decisions as Code (DaC) is the way I design nearly every self-service and automation system: pull business decisions out of platform config into a small, curated layer (often five real decisions where the raw config exposed eighty-nine) and let the platform absorb the rest through templates and defaults. (I called this Property Toolkit during my OneFuse days; the underlying tools changed, the shape didn't.)
A pipeline definition has the same problem. Each pipeline embeds the same standard concerns: the registry URL, the eval dataset bucket, the GPU node selector, the canary thresholds, the rollout strategy. If those are duplicated per pipeline, drift wins. If the decisions are centralized once and projected into every pipeline as parameters, DaC wins.
In Tekton, that's a Pipeline that takes Parameters from a centralized ConfigMap. In Argo Workflows, it's a WorkflowTemplate that downstream Workflows reference with parameter overrides. In GitHub Actions, it's a reusable workflow with workflow_call and inputs. In Jenkins X, it's a shared library.
The shape is the same one I describe in the Helm values article: define the standard pipeline shape once, project it into every consuming pipeline through parameters that map to the consuming pipeline's primitives. Change the standard eval threshold. Every pipeline picks it up. No PR storm.
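In the GitHub Actions dialect, the sketch looks like this: the centralized workflow owns the defaults (the decisions), and consumers override only what they own. File paths, input names, and the org name are hypothetical:

```yaml
# .github/workflows/model-deploy.yml in the platform repo: the DaC surface
on:
  workflow_call:
    inputs:
      model_uri:
        required: true
        type: string
      max_accuracy_drop:
        required: false
        type: number
        default: 0.01          # the standard eval threshold, defined once
      rollout:
        required: false
        type: string
        default: canary        # blue-green | shadow | canary

jobs:
  evals:
    runs-on: [self-hosted, gpu]
    steps:
      - uses: actions/checkout@v4
      - run: |
          python evals/gate.py \
            --model "${{ inputs.model_uri }}" \
            --max-drop "${{ inputs.max_accuracy_drop }}"
```

A consuming pipeline then carries only the decisions it owns:

```yaml
jobs:
  deploy:
    uses: example-org/platform/.github/workflows/model-deploy.yml@main
    with:
      model_uri: s3://models/summarizer/v1.4.7
```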
This is the same pattern the plan-output-as-data work in 2024 was building toward, pipeline output and pipeline input as structured data the rest of the platform can consume. The model-deploy pipeline emits an eval scorecard JSON; downstream automation consumes it; the policy gate reads it. Same data shape, different consumer.
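The scorecard shape doesn't need to be fancy. Shown as YAML here for consistency; the pipeline would emit the same structure as JSON, and every field is illustrative:

```yaml
model: summarizer
version: v1.4.7
baseline: v1.4.6
metrics:
  accuracy: 0.912
  f1: 0.887
  latency_p95_ms: 240
deltas_vs_baseline:
  accuracy: 0.004          # within the 1% regression budget
  latency_p95_ms: 12
verdict: pass              # what the policy gate reads
```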
Rollback discipline
The rollback story is half of the deploy story and gets half the attention. The minimum bar:
- The model artifact for the previous version is still in the registry, retained for at least 30 days.
- The deploy manifest for the previous version is still in the GitOps repo, recoverable from the last green commit.
- The eval scorecard for the previous version is still in the eval-results bucket, queryable to confirm parity.
- The router knows how to shift traffic back to the previous serving deployment within seconds, not minutes.
If any of those four are missing, the rollback story is a story you'll tell in a postmortem.
What I keep coming back to
The CI/CD-for-AI conversation in 2025 has settled on a shape that looks a lot like the CI/CD-for-services conversation looked in 2018: pick the tool that matches your platform, treat the artifact as versioned, gate the deploy on a meaningful test, build the rollback story before you need it, and centralize the pipeline definition so changes propagate by reference. The differences from a service pipeline are real but local: the test is an eval suite instead of a unit suite, the artifact lives in a model registry instead of a container registry, the rollout is blue-green or shadow because the regression mode is subtle.
The pipeline shape that holds up isn't novel. It's the same shape you already know, with the AI-specific joints bent into place. Don't over-engineer it. Don't under-test it. And treat the pipeline definition itself as a Decisions as Code surface, projected into every consumer. The approach is older than the workload it's running. It still works.