Article 3: Double check, never delete

Destructive operations get a second look. Soft-delete by default. Back up before mutation. The bias is toward keeping the thing, and that bias gets stronger, not weaker, when an agent is the one running the command.

The third article in my coding constitution is the one I wrote after a deeply unpleasant Tuesday in 2017. I was running a one-line cleanup command on what I was sure was the staging database. It wasn't. The command did what I asked it to do. I spent the rest of the week explaining what I'd asked it to do.

The article reads:

Double check, never delete.

Destructive operations need a second look. Soft-delete by default. Back up before you mutate. The default posture toward an existing thing is keep it.

The rule is about default posture. Most engineering rules are about how to do the thing right. This one is about which way to lean when you don't have time to be sure.

What "destructive" actually means

The word does a lot of work in this rule, so it's worth pinning down. A destructive operation is anything the system, after the fact, cannot undo from its own state. An rm -rf. A DROP TABLE. A DELETE FROM without a WHERE. A terraform destroy. A kubectl delete namespace. A git push --force. A gcloud projects delete. A helm uninstall against a release whose persistent volumes have their reclaim policy set to Delete.

The category is wider than the word makes it sound. A git rebase -i is destructive: the reflog covers you for thirty days, but operationally you've rewritten history that other people may have already pulled. A mv across filesystems is destructive in the sense that an interrupted move can leave the data split between a partial destination and a partially removed source. A schema migration is destructive when it drops a column even if no rows are deleted, because you've removed the ability to read data that was previously readable.

The test I use: if I made this change and immediately changed my mind, can I undo it from the system's own state? If yes, it's reversible. If no (if the recovery needs backups, snapshots, reflogs, or a database restore), it's destructive, and the rule fires.

The two parts of the rule

The rule has two parts that are easy to collapse into one and shouldn't be.

Part one: double check before the destructive operation. A second look. Not the same look twice. A different look: read the resource you're about to destroy by name, confirm the count of things in scope, confirm the environment, confirm the cluster context. The second check takes a different shape from the first because the failure mode you're guarding against is the one where you were confident but wrong.

The shape I use, when I remember to use it well: before any destructive command, the previous command in my shell history is a get or a describe or a count against the same target. The destructive command is allowed to run only if the read command's output matches what I expected. If the read command shows me anything surprising (a different namespace, an unexpected count, a resource I didn't know existed) the destructive command does not run. I go back to planning.
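
A minimal sketch of that shape as a shell function, assuming kubectl and a namespace deletion as the target; the function name and the prompt wording are mine, not any real tool's:

    # confirm_delete: run the read first, show its output, then gate the
    # destroy behind typing the target's name back.
    # Usage: confirm_delete <namespace>
    confirm_delete() {
      local ns="$1"
      # The read step: same target, different shape from the delete.
      kubectl config current-context
      kubectl get namespace "$ns" -o wide || return 1
      kubectl get all -n "$ns"
      # The destroy runs only if a human confirms the exact name.
      local answer
      read -r -p "Type '$ns' to delete it (anything else aborts): " answer
      if [ "$answer" = "$ns" ]; then
        kubectl delete namespace "$ns"
      else
        echo "Aborted. Nothing deleted." >&2
        return 1
      fi
    }

If the read shows a different context, an unexpected workload, anything surprising, the answer is to abort and go back to planning, exactly as above.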

Part two: never delete; soft-delete. Where the system supports it, the deletion mechanism marks the thing as deleted but leaves it recoverable. A deleted_at timestamp, a tombstone, a snapshot, an archive bucket. Hard-delete only happens later, on a scheduled job, after a delay long enough that "I made a terrible mistake yesterday" is recoverable.
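
The database version of this is a deleted_at column plus a scheduled purge. Here is the same idea at the filesystem level, as a sketch; the trash path and the thirty-day window are my placeholders:

    # soft_rm: move things into a dated trash directory instead of deleting.
    soft_rm() {
      local trash="$HOME/.trash/$(date +%Y-%m-%d)"
      mkdir -p "$trash"
      mv -- "$@" "$trash"/
    }

    # Hard-delete happens later, on a schedule, once the regret window has
    # closed. A daily cron job can purge dated trash directories that have
    # sat untouched for 30+ days:
    #   find ~/.trash -mindepth 1 -maxdepth 1 -type d -mtime +30 -exec rm -rf {} +

"I made a terrible mistake yesterday" becomes a mv back out of the trash directory, not a restore from backup.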

The reason the two parts are distinct: the double-check protects the case where I'm wrong about what I'm about to delete. The soft-delete protects the case where I'm wrong about whether I should be deleting it at all. Both are common. Neither covers the other.

The bias toward keeping the thing

The rule has a posture buried in it. The default toward an existing thing is keep it. New things deserve scrutiny on the way in. Existing things deserve scrutiny on the way out, and the scrutiny is heavier.

This sounds like a tautology. It isn't. The opposite posture exists in plenty of codebases: "we'll clean it up later" cultures where the easy thing is to remove the noisy resource and the hard thing is to keep it. I've worked in those codebases. They are the ones where you discover six months later that the noisy resource was load-bearing and nobody documented why.

The bias toward keeping things has a cost. Old data piles up, dead code lingers, and resources you forgot about cost money. The cost is real and bounded: you can run cleanup sprints, write retention policies, put archive lifecycles on storage. The cost of the opposite bias is unbounded. You delete the wrong thing and you may not be able to get it back.

I'd rather pay the bounded cost. The rule is what makes sure I do.

Backups before mutation

A schema migration. A bulk update. A reformat of a config file the agent is about to walk through. The rule applies to mutations the same way it applies to deletions, because a mutation that's wrong and a deletion that's wrong have the same recovery story: you go to the backup.

The discipline is that there is a backup before the mutation runs. Not "we have nightly backups, we'll be fine" but a named backup, taken just before the mutation, recoverable in the same window the mutation runs in. A pg_dump tagged with the migration ID. An etcd snapshot tagged with the cluster operation. A cp -a of the file before the agent rewrites it. The backup is the rollback. If you don't have one, you can't roll back, and the rule has been violated whether or not the mutation succeeds.

For mutations against state I care about (production databases, the homelab's store-01 data path, the Helix cluster's persistent volumes), I treat the backup-before-mutation step as a non-skippable prereq. The mutation script literally doesn't run if the backup didn't succeed. This is the kind of rule that's easy to write down and inconvenient to live by, which is why it's worth automating.
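
A sketch of what that automation can look like, assuming a Postgres migration driven by a shell script; the database name, migration layout, and backup naming are all placeholders:

    #!/usr/bin/env bash
    set -euo pipefail  # any failure, including the backup's, stops the script

    migration_id="$1"
    backup="backup-${migration_id}-$(date +%Y%m%dT%H%M%S).dump"

    # The backup is the rollback. If this step fails, the mutation never runs.
    pg_dump --format=custom --file="$backup" mydb
    test -s "$backup"  # refuse to proceed on an empty dump

    # Only now does the mutation run.
    psql mydb -f "migrations/${migration_id}.sql"

The restore path is pg_restore against the named dump, and it's worth rehearsing that path before you need it.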

Want to go deeper on the agent side of this? The plan-then-confirm shape I use is the same one I wrote about in bounded autonomy: the agent gets rope, and the rope is shorter for destructive operations than for additive ones.

The agent dimension

This rule is the one I've had to harden the most for AI agents. Agents are competent and confident at running destructive commands. They'll happily execute a rm -rf against the path they think the user meant. They'll happily run a migration that drops the column they think is unused. They'll happily kubectl delete the namespace whose name matched the pattern.

The agent failure mode here isn't that the agent is malicious or careless. It's that the agent operates on the model of the world it has, and that model is reliably less complete than the one held by the engineer supervising it. The thing the agent thinks is unused is the thing somebody is depending on in a service the agent doesn't know exists. The path the agent thinks is scratch space is the path somebody is using as a backup target. The namespace the agent thinks is dev is the namespace whose name happened to match a regex that ran against the wrong cluster context.

The rule, applied to agents, has three parts:

The agent gets read-only by default for destructive operations. The agent can plan a deletion. It cannot execute one. The execution needs a human confirm step. The plan-then-confirm shape is the same shape as the double-check rule, just enforced by the tooling rather than by my willpower.

The agent runs the read before the destroy. When the agent does eventually get to execute a destructive operation, the previous step in its tool trace is a read against the same target. The read's output is in the conversation. The execution is allowed only if the read's output matches what was planned.

The agent can't disable soft-delete. Whatever soft-delete mechanism is in place stays in place. The agent doesn't get to decide that the cleanup is "obviously fine" and skip the tombstone. The agent gets the same defaults the engineer would get.
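
One way to enforce the read-before-destroy part in tooling, sketched as shell wrappers the agent's commands could be routed through. The names, the log format, and the five-minute window are hypothetical, not any agent framework's API:

    read_log=/tmp/agent-read-log

    # agent_read: perform a read against a target and record that it happened.
    agent_read() {
      kubectl get "$1" "$2" && echo "$1/$2 $(date +%s)" >> "$read_log"
    }

    # agent_delete: refuse unless the same target was read in the last 5 min.
    agent_delete() {
      local now last
      now=$(date +%s)
      last=$(awk -v t="$1/$2" '$1 == t { ts = $2 } END { print ts }' "$read_log" 2>/dev/null)
      if [ -z "${last:-}" ] || [ $((now - last)) -gt 300 ]; then
        echo "Refusing to delete $1/$2: no recent read of that target." >&2
        return 1
      fi
      kubectl delete "$1" "$2"
    }

The wrapper doesn't make the agent smarter. It makes the agent's tool trace carry the same evidence a careful engineer's shell history would.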

The pattern is bounded autonomy applied at the destructive-operation boundary. The agent has rope. The rope is shorter for destructive operations than for additive ones, on purpose. Bounded autonomy is its own topic and gets its own articles, but this rule is the place the bound matters most.

What this rule is not

The rule is not "never delete anything." Things do need to be deleted. Old data ages out, and dead resources cost money. The rule is about how, not whether: two checks, soft-delete first, backup before mutate. The deletion still happens; it just happens through a path that stays recoverable for as long as it can.

The rule is also not a substitute for backups in general. Backups are the floor. The rule is the layer on top of the floor that stops you from needing to use the floor in routine cases. If your only protection against a wrong delete is the backup, you'll discover the limits of your backup at the worst possible time. Don't.

Why this is article three

The articles in this series are ordered by which rule earns the most when ignored. Article one (no issue, no code) earns the most over the long horizon, because the missing intent compounds across years. Article four (fail fast, three strikes) earns the most over a single afternoon, because the flailing compounds within an hour. Article three earns the most in a single moment, because the wrong destructive operation can compound in a single keystroke.

Three is the rule that's saved me, and cost me when I ignored it, more sharply than any other in the set. Two checks. Soft-delete first. Backup before mutate. The bias is toward keeping the thing. That's the rule.

Sid