Migrating a YAML monster to a DaC shape, step by step

You've inherited a 1,500-line values.yaml. The fix isn't refactoring it in place; it's the six-step migration to a DaC shape: catalog, cluster, identify, push down, version, ship. Here's the walkthrough, concrete enough to mirror.

You've inherited it. A 1,500-line values.yaml that nobody wants to touch. A 200-property vRA catalog item that grew one field per ticket for four years. A 73-question intake form that the request-management team refuses to retire because "somebody must need that field." Pick whichever monster you've been handed; the shape of the work is the same.

The instinct, when the monster lands on your desk, is to refactor it in place. Reorder the YAML. Add a table of contents to the catalog item. Group the 73 questions into accordions. Don't. None of those is the fix. The monster isn't poorly organized, it's poorly scoped. The fix is to migrate it to a Decisions as Code shape, which means deciding which of those 1,500 lines (or 200 properties, or 73 questions) are the user's decisions, and pushing the rest into templates, defaults, and computed values the platform owns.

Here's how I think about it. This piece is the step-by-step, the playbook I run when I inherit one of these. Six steps. None are optional. The order matters.

Step 1. Catalog what's actually there

The monster is opaque because nobody has counted it. You can't migrate what you can't see. The first move is mechanical: dump every property, every field, every question into a single flat list. One row per surface element. No editorializing yet.

Then add a column nobody else has bothered to add: usage frequency. For a values.yaml, this is "what fraction of consuming charts override this from the default." For a catalog item, it's "what fraction of submissions in the last twelve months touched this field at all." For an intake form, it's "what fraction of submitters answered something other than the placeholder."

Pull the data, don't guess. Grep the consumers, query the request DB, and export the form analytics. The numbers will surprise you. In every monster I've audited, the distribution is the same: a small head of properties that get touched on every request, a long tail of properties that nobody has touched in eighteen months, and a middle band of "sometimes." A typical 1,500-line values.yaml has 80 lines in the head, 200 in the middle, and 1,220 in the tail. The tail is your easy win.

Don't delete the tail yet. Just label it. The catalog is the artifact you'll work from for every later step.
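
Sketched as YAML so it can live next to the chart, a catalog entry carries nothing more than the key, where it lives, and what the usage data said. The field names and the legacy key are illustrative, not a schema you have to adopt:

    # catalog.yaml -- one row per surface element, built from real usage data
    - key: image.tag
      source: values.yaml
      usage_fraction: 0.97    # overridden by almost every consuming chart
      band: head
    - key: replicaCount
      source: values.yaml
      usage_fraction: 0.41
      band: middle
    - key: legacy.sidecar.injectorTimeoutSeconds
      source: values.yaml
      usage_fraction: 0.00    # untouched in eighteen months
      band: tail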

Step 2. Cluster by who-actually-decides-this

Now the harder pass. For each property, write down (in plain language) who makes that decision. Not who can set the field. Who decides. Those are different.

The decision-maker for replicaCount isn't the application team; it's the platform team's reliability standard ("production workloads run with at least 3"). The decision-maker for image.tag is the application team. The decision-maker for serviceAccount.annotations.iam.gke.io/gcp-service-account is the security team. The decision-maker for nodeSelector.gpu-type is the ML platform team's GPU sizing policy.

Cluster the catalog by those decision-makers. You'll usually end up with four or five groups: application team, platform team, security/compliance, finance/cost-attribution, and "nobody, this is dead config nobody owns."
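
The catalog grows one more column. A sketch, with the owner labels as examples rather than a fixed taxonomy:

    - key: image.tag
      band: head
      decided_by: application-team
    - key: replicaCount
      band: middle
      decided_by: platform-team      # reliability standard, not a per-app choice
    - key: serviceAccount.annotations.iam.gke.io/gcp-service-account
      band: middle
      decided_by: security
    - key: legacy.sidecar.injectorTimeoutSeconds
      band: tail
      decided_by: nobody             # dead config, nobody owns it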

The "nobody owns it" group is the easiest cleanup of your career. Delete it. If somebody complains, the field comes back. Nobody will complain, because nobody knew it was there.

The application-team group is your candidate decision surface. The other groups are your candidate platform-owned defaults. The clustering is what tells you which is which, not your gut, not the previous owner's documentation, the clustering.

Step 3. Identify the five real decisions

Now apply the five-not-eighty-nine test to the application-team group. The instinct after step 2 is to declare victory: "look, I have an application-team group of 30 fields; that's already a 50× reduction from 1,500." Thirty is still too many. The user came to the platform with a sentence; the sentence has five real decisions in it.

For each field in the application-team group, ask: would two different application teams, in good faith, pick different values for this? If the answer is "no, every team picks the same value or picks from a tiny enum," that field is a platform decision dressed up as an application decision. It moves to the platform-owned group.

After that pass, you'll usually have five to ten real application-team decisions left. Workload class. Environment. Sizing tier. Region. Ownership. Maybe data classification or cost-center attribution. That's the surface. That's what the new monster-replacement is going to expose.
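
Sketched as the values a team would actually submit, with hypothetical key names, the whole surface fits on a screen:

    # the entire decision surface the application team sees;
    # everything else is computed from these answers
    standards:
      workloadClass: web-api      # web-api | worker | batch | ml-training
      environment: production     # dev | staging | production
      sizeTier: small             # small | medium | large
      region: eu-west-1
      owner:
        team: payments
        costCenter: CC-4217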

The other twenty-five fields don't disappear. They just stop being on the surface. They get computed from the five.

Step 4. Push the rest into templates, defaults, and computed values

This is where most migrations stall, because it's where the actual platform work lives. Each of the deferred fields needs a home. There are exactly three places it can go.

Templates. If the field's value is determined by a combination of other surface decisions, it's a template output. nodeSelector is computed from workload class and GPU requirement. tolerations is computed from environment and workload class. podDisruptionBudget is computed from sizing tier and environment. The template engine (Helm helpers, Terraform locals, Crossplane Compositions, Argo WorkflowTemplates, whatever your foundation offers) is where this logic lives. You write it once. Every consumer renders the same correct output.
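
In Helm terms, a minimal sketch of such a helper; the helper name, the node pools, and the workload-class mapping are illustrative, not the only way to cut it:

    {{/* templates/_standards.tpl -- nodeSelector computed from the surface */}}
    {{- define "standards.nodeSelector" -}}
    {{- if eq .Values.standards.workloadClass "ml-training" -}}
    gpu-type: a100              # the ML platform team's GPU policy, not a user field
    {{- else if eq .Values.standards.environment "production" -}}
    node-pool: general-prod
    {{- else -}}
    node-pool: general-dev
    {{- end -}}
    {{- end -}}

The deployment template includes it once, under nodeSelector, and every consumer that answers the same way renders the same pool.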

Defaults. If the field's value is the same for everyone, all the time, but might rarely need to be overridden, it's a default. securityContext.runAsNonRoot: true is a default. resources.requests.memory: 256Mi for a small tier is a default. The default lives in the standards layer (a library chart, a module.standards Terraform module, a base XR, the standard-decisions home that the Helm-values-as-business-standards piece describes). Consumers inherit it without naming it. Override is rare and explicit when it happens.
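
A sketch of the defaults side, assuming the standards layer ships its own values file that consumers inherit; file paths and keys are illustrative:

    # standards/values.yaml -- platform-owned defaults, inherited without being named
    securityContext:
      runAsNonRoot: true
      allowPrivilegeEscalation: false
    resources:
      requests:
        memory: 256Mi           # the small tier's baseline

    # a consumer's values.yaml -- override is rare and explicit when it happens
    securityContext:
      runAsNonRoot: false       # documented exception for a legacy vendor image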

Computed values. If the field is derived from organizational state outside the user's request (what region the cost center prefers, what cluster the workload class maps to, what registry the namespace pulls from), it's a computed value. The lookup runs at render time, against the standards layer or an external source of truth. The user never sees it because the user couldn't have known the answer anyway.
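
Still in Helm terms, a sketch of a render-time lookup using sprig's dig; the platform.registryByNamespace map and the fallback registry are assumptions for illustration, standing in for whatever source of truth the platform owns:

    {{/* templates/_standards.tpl -- registry resolved from platform-owned data, never asked of the user */}}
    {{- define "standards.registry" -}}
    {{- /* walk .Values.platform.registryByNamespace.<namespace>, with a safe fallback */ -}}
    {{- dig "platform" "registryByNamespace" .Release.Namespace "registry.internal.example.com" .Values -}}
    {{- end -}}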

The classification work in step 3 made this pass tractable. You're not staring at 1,500 lines and asking "where does this go." You're walking the deferred list and slotting each entry into one of three buckets.

Step 5. Version the surface

Before you ship the new shape, version it. The five-decision surface you've designed is now a contract. Contracts that change without versioning break consumers in ways that are expensive to debug.

Pick a versioning scheme that fits the foundation. For Helm, it's apiVersion in your values schema. For a Crossplane XR, it's the Kubernetes API version (v1alpha1, v1beta1, v1). For a Backstage software template, it's a version field in the template metadata. For a vRA catalog item, it's a discipline you have to invent because vRA doesn't give it to you natively: bake the version into the item name.

Then write down the compatibility rules. Mine, on every project I've migrated: additive changes (new optional decision, new enum value) bump the minor version. Breaking changes (removed decision, renamed field, changed semantics) bump the major version. Defaults can change inside a minor (that's the whole point of having defaults) as long as the consumer can pin the standards version if they need stability.
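
A sketch of what the contract can look like on the Helm side, with illustrative names: the surface declares the version it targets, and the consumer pins the standards layer when it needs stability.

    # values.yaml -- the surface names the contract version it targets
    apiVersion: platform.example.com/v2   # major bump = breaking change to the surface
    standards:
      workloadClass: web-api
      environment: production

    # Chart.yaml -- the consumer pins the standards dependency for stability
    dependencies:
      - name: standards
        version: 2.3.x      # additive changes land in 2.4, breaking ones in 3.0
        repository: oci://registry.example.com/platform-charts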

Without this step, the next platform engineer to inherit your work has the same problem you did. With this step, they have a contract they can reason about.

Step 6. Ship without breaking existing consumers

The migration isn't done when the new surface exists. It's done when the old monster is gone. The bridge is a compatibility layer.

The pattern: keep the old monster surface live, but generate it from the new five-decision surface. A consumer that was happy submitting 1,500-line YAMLs keeps submitting 1,500-line YAMLs; a translator on the platform side reads the old shape, extracts the five decisions, and renders the new shape underneath. Existing consumers don't see the change. New consumers use the new surface directly.

Concretely, in Helm: an old values.yaml keeps its old keys, with a deprecation comment, and the chart's templates read from .Values.standards (the new shape) by way of a helper that falls back to the old keys if the new ones aren't set. In a vRA catalog item: keep the old item with all 200 properties, add a new item with five, and have the old item silently route to the same downstream pipeline as the new one. In an intake form: keep the 73-question form, add a 5-question form, and process both through the same back-end.
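
A sketch of that fallback helper, assuming the old chart exposed a top-level replicaCount and the new surface derives replicas from environment; the names and the mapping are illustrative:

    {{/* templates/_compat.tpl -- prefer the new surface, fall back to the old key */}}
    {{- define "compat.replicas" -}}
    {{- if hasKey .Values "replicaCount" -}}
    {{- /* old shape still in use: keep it working, flag it for deprecation */ -}}
    {{- .Values.replicaCount -}}
    {{- else if eq .Values.standards.environment "production" -}}
    3
    {{- else -}}
    1
    {{- end -}}
    {{- end -}}

The deployment template reads replicas: {{ include "compat.replicas" . }} either way; when the deprecation window closes, the if branch is the only thing you delete.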

You leave the compatibility layer up for at least two release cycles. You instrument it: every time the old shape is used, log it with the consumer ID. The deprecation timeline is evidence-driven, not calendar-driven. When the logs go quiet for a consumer, you ping them, confirm the migration, and remove their access to the old surface. When all consumers are migrated, the old surface comes down.

Here's the part that bit me when I skipped it: some team had a CI pipeline that had been hard-coding replicaCount: 1 against the old shape for two years, and I found out, on a Friday afternoon, that I'd just broken their deploy. The compatibility layer is the step that distinguishes a migration that ships from a migration that gets reverted in week three.

What you ship at the end

Six steps in, the monster is gone. What's there instead:

  • A five-to-ten field decision surface, versioned, with explicit owners.
  • A standards layer holding the deferred fields as templates, defaults, and computed values.
  • A compatibility translator from the old shape to the new.
  • A deprecation log telling you which consumers still need migration help.
  • A schema or contract that the next person to inherit this can read in an afternoon.

The 1,500 lines didn't go away. About 200 of them got deleted (the dead config nobody owned, flagged in step 1 and cut in step 2). About 1,200 got pushed into the standards layer and the templates, where the platform owns them. About 50 to 100 stayed visible, the real decisions, on the surface, where the user can reach them.

The cost of the migration is real. Two to four weeks of platform-engineer time, depending on how much consumer hand-holding the compatibility layer requires. The payoff is durable: every future change to the monster is a one-line change to a default in the standards layer, instead of a sweep across thirty consumer charts. (The shape, if you've spent any time around the OneFuse-era Property Toolkit work, is familiar: same discipline, different decade, much better tooling.)

The monster doesn't have to be the monster forever. You just have to be willing to count it, cluster it, cut it, and ship the bridge.

, Sid