Argo Workflows for AI pipelines: RAG indexing, fine-tuning, eval suites

Argo Workflows for the long-running, branching, fan-out AI ops pipelines that don't fit a CI runner: RAG indexing jobs, fine-tuning runs, eval suite execution. WorkflowTemplates as the Decisions as Code surface: same pipeline shape, different inputs per project.

There’s a class of work in any AI platform that doesn’t fit a CI runner. RAG indexing for a corpus that takes six hours to chunk and embed. A fine-tune run with a checkpoint loop and a periodic eval. An eval suite that fans out into a hundred parallel inference calls and reduces the results into a single scorecard. None of those are CI jobs. They’re long-running, stateful, branching, fan-out workflows.

Argo Workflows is the K8s-native answer for that shape of work. It’s a step further than the CI/CD-for-AI-models discussion, which was about the deploy pipeline. This piece is about the orchestration pipeline, the AI ops work that runs separately from the deploy and produces the artifacts the deploy consumes.

I run Argo on the homelab cluster, with the workflows hitting engine-01 for the GPU work and the Synology for the durable inputs and outputs. The size doesn’t stress the foundation. The shape does, which is the part that’s worth being concrete about.

Why Argo and not the others

The contenders for this shape of work in 2025 are roughly:

  • Argo Workflows. K8s-native, CRD-driven, DAG-and-step-defined, mature. The Pod-per-step model is heavy for short jobs and exactly right for long ones.
  • Tekton. Better as a CI foundation, weaker as a long-running orchestration foundation. The Pipeline / Task primitives don’t compose into the fan-out shapes Argo does naturally.
  • Apache Airflow. Mature, opinionated, Python-native. Strong if your team thinks in DAGs of Python operators. Weaker as a K8s-native primitive. Airflow lives outside the K8s scheduler more than inside it.
  • Prefect / Dagster. Modern Python-first orchestrators. Strong for data engineering. Less idiomatic for the K8s-native, container-per-step pattern AI workloads tend to want.
  • Kubeflow Pipelines. Built on Argo Workflows under the hood. The wrapper is opinionated toward ML pipelines specifically.

For AI ops on K8s, Argo Workflows is the foundation I keep recommending. The Pod-per-step model is the right grain for steps that vary wildly in resource needs (a chunker pod is cheap and small; an embedding pod is GPU; a fine-tune pod is large GPU and long-lived). The DAG primitive composes well. The artifact-passing model (output of one step is input to the next, materialized via S3 or a PVC) is the shape AI workloads need.
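The artifact-passing model is concrete enough to sketch. A minimal example, with illustrative image and artifact names, wiring one step's output artifact into the next step's input via the configured artifact store:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: artifact-pass-
spec:
  entrypoint: main
  templates:
    - name: main
      dag:
        tasks:
          - name: chunk
            template: chunk
          - name: embed
            template: embed
            dependencies: [chunk]
            arguments:
              artifacts:
                - name: chunks
                  # wire the chunker's output into the embedder's input
                  from: "{{tasks.chunk.outputs.artifacts.chunks}}"
    - name: chunk
      container:
        image: ghcr.io/example/chunker:latest   # illustrative image
      outputs:
        artifacts:
          - name: chunks
            path: /out/chunks.jsonl   # materialized to S3/MinIO by Argo
    - name: embed
      inputs:
        artifacts:
          - name: chunks
            path: /in/chunks.jsonl    # pulled from the artifact store
      container:
        image: ghcr.io/example/embedder:latest  # illustrative image
```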

The three workflow shapes

The three workflows I keep seeing in AI platforms have stable shapes worth being concrete about.

RAG indexing

The DAG:

  1. Source discovery. A step that reads the manifest of source documents, diffs against the last successful run, and emits a list of changed documents.
  2. Fan-out chunk-and-embed. A withItems over the changed documents. Each item runs a Pod that chunks the document, embeds the chunks, and emits a JSONL of {vector, payload} records to the artifact store.
  3. Fan-in upsert. A reducer step that pulls the JSONL artifacts and upserts them into the vector DB (Qdrant, Weaviate, or Milvus, depending on what’s underneath).
  4. Validate. A step that runs a small set of canary queries against the updated index and asserts the expected documents are findable.
  5. Notify. A step that updates the deploy manifest with the new index version and pings the relevant Slack channel.

The fan-out is what makes Argo earn its keep here. Embedding a thousand documents serially is a six-hour job; embedding them with a parallelism of fifty against a GPU node is a fifteen-minute job. Argo handles the fan-out with a few lines of withItems and a parallelism cap.
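The fan-out really is a few lines of spec. A sketch with a hardcoded item list standing in for the discovery step's output; image and parameter names are assumptions:

```yaml
spec:
  entrypoint: main
  parallelism: 50               # cap concurrent embedding pods
  templates:
    - name: main
      dag:
        tasks:
          - name: chunk-and-embed
            template: embed-doc
            arguments:
              parameters:
                - name: doc
                  value: "{{item}}"
            # in the real pipeline this list comes from the discovery step
            withItems: ["guide.md", "api-ref.md", "changelog.md"]
    - name: embed-doc
      inputs:
        parameters:
          - name: doc
      container:
        image: ghcr.io/example/embedder:latest  # illustrative
        args: ["--doc", "{{inputs.parameters.doc}}"]
```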

Fine-tuning

The DAG:

  1. Dataset prep. Pull the dataset from object storage, validate the schema, optionally split for train / eval.
  2. Tokenize. A separate step because tokenization is CPU-bound and the GPU shouldn’t wait on it.
  3. Train. The expensive step. Pod with a GPU node selector, the training framework, the dataset and tokenizer mounted via PVC. Long-running. Checkpoints written to object storage on a schedule.
  4. Periodic eval. A side branch that pulls the latest checkpoint, runs the eval suite, and pushes a scorecard. Runs every N steps via a parallel sub-DAG.
  5. Promote. When the eval scorecard hits a threshold, the workflow promotes the checkpoint to the model registry as a new candidate version.
  6. Trigger downstream. A webhook step that triggers the deploy pipeline for the new candidate.

The branching shape is what Argo handles well. The training step runs for hours; the eval branch runs against checkpoints in parallel; the promote step gates on the eval. None of that fits a CI runner’s “ten minute job” model.
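The train step's scheduling shape in sketch form; node labels, image, resource numbers, and the deadline are illustrative, and the tokenize and promote templates are elided:

```yaml
templates:
  - name: finetune
    dag:
      tasks:
        - name: tokenize
          template: tokenize
        - name: train
          template: train
          dependencies: [tokenize]   # GPU step waits on the CPU-bound prep
        - name: promote
          template: promote
          dependencies: [train]
  - name: train
    nodeSelector:
      nvidia.com/gpu.present: "true"   # illustrative GPU node label
    container:
      image: ghcr.io/example/trainer:latest
      resources:
        limits:
          nvidia.com/gpu: 1            # pin the long-lived step to one GPU
    activeDeadlineSeconds: 43200       # bound the run at twelve hours
```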

Eval suite execution

The DAG:

  1. Load eval dataset. Pull from object storage with a version pin.
  2. Fan-out inference. A withItems over the eval examples. Each item hits the model serving endpoint and records the output.
  3. Fan-in scoring. A reducer step that runs the scoring functions over the outputs (accuracy, F1, BLEU, custom rubric scores, LLM-as-judge for the harder rubrics).
  4. Compare to baseline. A step that pulls the previous eval scorecard and computes regressions.
  5. Push to Prometheus. The scorecard gets pushed as metrics so the observability layer catches drift.
  6. Annotate. A step that updates the model’s metadata in the registry with the new scorecard.

The fan-out can be large. A serious eval suite is thousands of examples; the parallelism cap protects the serving foundation from being overwhelmed by your own eval traffic. Argo’s parallelism setting on the withItems step is the right primitive.
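When the item list comes from a previous step rather than the spec, withParam is the primitive: the loader prints a JSON array to stdout and the fan-out iterates over it. A sketch, names illustrative:

```yaml
- name: run-evals
  parallelism: 20                  # protect the serving endpoint
  dag:
    tasks:
      - name: load-dataset
        template: load-dataset     # prints a JSON array of examples to stdout
      - name: infer
        template: call-endpoint
        dependencies: [load-dataset]
        arguments:
          parameters:
            - name: example
              value: "{{item}}"
        # fan out over the JSON list captured in outputs.result
        withParam: "{{tasks.load-dataset.outputs.result}}"
```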

WorkflowTemplates as the DaC surface

This is where the methodology shows up, and it’s one of the cleanest expressions I’ve seen yet. Decisions as Code (DaC) is the methodology behind nearly every self-service and automation system I’ve designed: extract business decisions out of platform configuration into a small, curated layer (often five real decisions where the raw config exposed eighty-nine) and let the platform absorb the rest through templates and defaults. (I called this Property Toolkit during my OneFuse days; the foundation is different, the shape isn’t.)

A WorkflowTemplate in Argo is a parameterized, reusable workflow definition. The same template gets invoked by multiple Workflows, each with different parameters. The shape is DaC projected onto the workflow template surface.

In practice:

  • A rag-indexing-template WorkflowTemplate that takes parameters for the source manifest URI, the embedding model name, the target vector DB, the chunk size, the parallelism cap.
  • Per-project Workflows that reference the template via templateRef, supply project-specific parameters, and inherit the standard pipeline shape.
  • The template lives in a centralized standards namespace; the project Workflows live in the project namespaces.

Change the standard shape (add a step, tighten a default, fix a bug) and every project Workflow inherits the change on the next run. No PR storm across project repos. The pipeline definition itself is the curated decision surface; the project Workflow is what gets to vary.
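The pattern, sketched with illustrative names. One wrinkle worth knowing: a namespaced WorkflowTemplate is only visible within its own namespace, so sharing one template across project namespaces means a ClusterWorkflowTemplate referenced with clusterScope: true:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ClusterWorkflowTemplate
metadata:
  name: rag-indexing-template
spec:
  entrypoint: index
  arguments:
    parameters:
      - name: source-manifest-uri
      - name: embedding-model
      - name: parallelism-cap
        value: "50"                    # curated default; projects may override
  templates:
    - name: index
      container:
        image: ghcr.io/example/indexer:latest  # stand-in for the full DAG
---
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: project-a-index-
  namespace: project-a
spec:
  workflowTemplateRef:
    name: rag-indexing-template
    clusterScope: true
  arguments:
    parameters:
      - name: source-manifest-uri
        value: s3://project-a/manifest.json    # project-specific inputs
      - name: embedding-model
        value: nomic-embed-text
```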

This is the same five-primitive shape the Helm values article walked through. Centralization (the WorkflowTemplate). Platform-aware shape (Argo CRDs). Variable interpolation (Workflow parameters). Composition (template references). Discovery convention (the standards namespace).

The shape has held across substrates because the underlying problem (duplicated business logic across consumers) hasn’t changed.

Operational discipline

A few things that aren’t novel but are worth being explicit about, because skipping them is the most common failure mode.

Artifact retention. Workflow artifacts on the artifact store accumulate fast. A retention policy with the artifact store’s lifecycle rules (S3 lifecycle, MinIO retention) keeps the bill bounded.
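Alongside the bucket-side lifecycle rules, Argo (3.4 and later) can garbage-collect the artifacts it wrote itself:

```yaml
spec:
  artifactGC:
    strategy: OnWorkflowDeletion   # or OnWorkflowCompletion for short-lived runs
```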

Pod GC. Argo’s pod garbage collection has gotten cleaner over recent releases. The default is fine for most use; fast-failing test workflows can leave a pod backlog without a tighter setting.
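A tighter setting is two fields on the spec; the TTL value is an illustrative default:

```yaml
spec:
  podGC:
    strategy: OnPodSuccess           # delete succeeded pods, keep failures for debugging
  ttlStrategy:
    secondsAfterCompletion: 86400    # GC the Workflow object itself after a day
```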

Failure handling. A long-running workflow needs an onExit template that runs whether the workflow succeeded or failed. That’s where the cleanup, the alerting, the registry annotation lives. Skipping onExit is the most common reason a workflow’s failure mode is silent.
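A minimal exit handler, assuming a hypothetical webhook endpoint; it runs on success and failure alike, and {{workflow.status}} carries the outcome:

```yaml
spec:
  entrypoint: main
  onExit: exit-handler               # always runs, regardless of outcome
  templates:
    - name: exit-handler
      container:
        image: curlimages/curl:latest
        args:
          - "-X"
          - "POST"
          - "https://hooks.example.com/workflow-status"  # hypothetical endpoint
          - "-d"
          - '{"name": "{{workflow.name}}", "status": "{{workflow.status}}"}'
```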

Resource limits per step. Each Pod template needs requests and limits. The fan-out steps especially: without limits, fifty parallel embedding pods can saturate the cluster. The standards library chart pattern from the Helm article is the right place for these too.
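Per-step requests and limits live on the Pod template; the numbers here are illustrative starting points, not recommendations:

```yaml
- name: embed-doc
  container:
    image: ghcr.io/example/embedder:latest
    resources:
      requests:
        cpu: "1"
        memory: 2Gi
      limits:
        memory: 4Gi
        nvidia.com/gpu: 1    # GPU requests and limits must match
```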

Auth. The serving endpoint, the vector DB, the model registry, the artifact store all have auth. Argo’s per-Workflow ServiceAccount pattern with IRSA-style federation is the path that scales; per-step credentials hardcoded in the template is the path that doesn’t.
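The per-Workflow pattern in sketch form; the role ARN, account number, and names are illustrative:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: rag-indexer
  namespace: project-a
  annotations:
    # IRSA-style binding: the SA assumes a cloud role, no static keys in the template
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/rag-indexer
---
# referenced from the Workflow spec:
spec:
  serviceAccountName: rag-indexer    # every step Pod runs as this SA
```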

What I keep coming back to

Argo Workflows is the foundation for the AI ops work that doesn’t fit a CI runner. RAG indexing, fine-tuning, eval execution, three different DAG shapes that all benefit from K8s-native fan-out, artifact passing, and Pod-per-step resource scoping.

The methodology piece is treating the WorkflowTemplate as a parameterized business standard. Define the standard pipeline shape once. Project it into every consuming project. Vary the parameters per project; keep the shape standard. The discipline is the same discipline that’s worked at every other layer of the AI platform; the foundation happens to be Argo this time.

Pick the foundation that matches your platform. Centralize the workflow definition. Parameterize for projects. Pair with the observability layer so the workflow’s outputs are visible in the same dashboard as the workload. The Argo piece is the orchestration. The methodology is the part that scales.