Your style is your IP: what AI ownership should actually look like

The legal fights over AI training data are still about the wrong layer. The interesting ownership question isn't whether your text was scraped. It's whether the captured shape of how you reason is yours.

[Image: Fountain pen on a vintage ledger with glowing ethereal handwriting reforming on the page]

The current legal conversation around AI ownership is mostly about training-data scraping. The New York Times case, the various class actions over books and code corpora used without permission, the European AI Act's transparency requirements: all of them are arguments about whether the raw text somebody published can be ingested into a training corpus, and what the rights and remedies are if it was.

That's an important argument. It's not the most interesting one, though, and the machinery for handling the more interesting one doesn't exist yet. The more interesting layer is what happens after the training data is in: the layer at which a captured shape of how someone reasons becomes a deployable artifact. That layer is where the actual value of personal-AI work will accrue, and it's the one the legal system has barely started thinking about.

The difference that matters

A useful distinction, which has held up well over the last two years: there are the facts a model learns, and there are the priors it learns. The facts are what get the attention in the lawsuits: did the model memorize this passage, can it regurgitate this article, was this codebase in the training corpus? Those are real questions, but they're answerable with a similarity check and they're correctable by retraining.

The priors are the harder thing. They're the way a model has learned to weigh evidence, to structure an argument, to choose words, to tell when it doesn't know. Those don't show up in a similarity check because they're not stored as text; they're stored as patterns in the weights, distributed across many parameters in ways that aren't easily attributed to any single source. A few years ago I tried to make this distinction concrete, back when most people were still arguing about whether AI even understood what it read. The distinction has since become operationally relevant in a way it wasn't then.

The reason it's now operational: adapters. A LoRA adapter trained on one person's archive doesn't substantially change the base model's facts; it changes the priors. You can take a generic model, train an adapter on a specific writer's corpus, and end up with a system that reasons in that writer's shape without necessarily knowing more facts. That's a very different kind of artifact from a fine-tune that crammed someone's books into the weights, and the legal system has no concept that maps cleanly onto it.
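
To make the artifact concrete: a minimal sketch of the adapter pattern, using the Hugging Face peft library. The base-model name is a placeholder and the training loop is elided; the point is that the trainable artifact is a small set of low-rank matrices layered over frozen base weights, so the base model's facts stay put while the priors shift.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

# Placeholder model name; any causal LM with standard attention projections works.
base = AutoModelForCausalLM.from_pretrained("some-base-model")
tokenizer = AutoTokenizer.from_pretrained("some-base-model")

# Low-rank adapter config: only these small matrices get trained.
# The base weights, and with them the model's "facts", stay frozen.
config = LoraConfig(
    r=8,                                  # rank of the update matrices
    lora_alpha=16,                        # scaling factor for the update
    target_modules=["q_proj", "v_proj"],  # attention projections; model-dependent
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically well under 1% of the total

# ...train on the writer's corpus with a standard causal-LM objective...

model.save_pretrained("writer-style-adapter")  # the deployable artifact
```

What ships is a few megabytes of matrices, separable from the base model and meaningless without it. That separability is exactly what makes it a new kind of legal object.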

What the law currently doesn't recognize

A few specific gaps worth being explicit about:

There's no concept of "adaptation rights" distinct from "training rights." The current framing treats use of training data as a binary question: was this text used to train, yes or no. But the legally interesting question for an adapter is "did this adapter capture how I reason in a way that produces output substantially in my voice." That's not a question about whether my text was in the corpus. It's a question about what the resulting artifact does.

There's no attribution mechanism for adapters that builds on a base model. When a derived adapter is published, there's no standard way to encode that the adapter was trained on a specific corpus owned by a specific person, and no enforceable mechanism for revenue or attribution to flow back. The marketplace I sketched a couple of years ago hasn't materialized partly because the underlying rights framework doesn't exist.

There's no expressive opt-out at the prior layer. A writer can, at least in theory, tell a publisher to remove a book from a training corpus, and then fight over what "removal" means. There's no equivalent for "I don't want my reasoning patterns to be the basis of an adapter that competes with me." The law doesn't recognize that as a thing one could even claim.

The platforms have no economic incentive to fix any of this. The closed-frontier shops want their model to be the place value accrues. The open-source ecosystem wants permissionless adaptation to remain permissionless because that's what makes the tooling work. Neither side has a reason to push for the legal scaffolding that would let individual contributors capture value when their reasoning patterns are turned into adapters.

Why this is actually the right frame

The "your text was used" framing keeps the legal conversation focused on inputs, where the harms are cumulative and abstract. The "your reasoning shape was captured" framing focuses it on outputs, where the harms are specific and demonstrable. A given person can sit with an adapter and decide whether the output reads as their voice, in their reasoning style, with their characteristic moves. That's an empirical question with an empirical answer, and the law works much better with empirical questions than with abstract ones.

The shift to the output frame also makes the economic case more legible. If an adapter captures how a writer reasons well enough that its output is substitutable for that writer's actual work, the writer has a real claim that something of value was extracted. That's a different argument from "my book was in the training corpus": it doesn't require proving training-data inclusion, it requires demonstrating output substitutability. The latter is much easier to demonstrate and much harder to dismiss.
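
As a crude proxy for what demonstrating substitutability could look like empirically (not a legal test; every string and the model name below are stand-ins), one could embed samples of the adapter's output alongside the writer's genuine prose and ask how close they sit, using the unadapted base model's output as the control:

```python
from sentence_transformers import SentenceTransformer
import numpy as np

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # one common choice, not the only one

# Hypothetical inputs a claimant would supply.
writer_passages = ["A passage the writer actually published.",
                   "Another genuine passage from the archive."]
adapter_outputs = ["Text the adapter produced on a comparable prompt."]
base_outputs = ["Text the unadapted base model produced on the same prompt."]

def centroid_similarity(a: list[str], b: list[str]) -> float:
    """Cosine similarity between the centroids of two embedded text sets."""
    ea, eb = encoder.encode(a), encoder.encode(b)
    ca, cb = ea.mean(axis=0), eb.mean(axis=0)
    return float(np.dot(ca, cb) / (np.linalg.norm(ca) * np.linalg.norm(cb)))

# If the adapter's output sits much closer to the writer than the base
# model's does, that gap is the beginning of an empirical argument.
print("adapter vs writer:", centroid_similarity(adapter_outputs, writer_passages))
print("base vs writer:", centroid_similarity(base_outputs, writer_passages))
```

A real claim would need far more than centroid cosine similarity (stylometric features, held-out prompts, human raters), but the structure of the evidence is the same: a measurable gap between what the adapter produces and what the base model produces, with the writer's corpus as the reference point.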

What a sensible regime would look like

Putting aside whether any of this is politically feasible in the current environment, the structure of a working regime probably has three pieces:

A registry of adaptation lineage. When an adapter is published, it carries machine-readable metadata about what corpus it was trained on, who the corpus was sourced from, and what consent was given. That lineage is verifiable through cryptographic commitments made at training time. The registry doesn't need to be government-run; it could be a Hugging Face-equivalent that the field adopts because the alternative is endless litigation. Some of the early provenance work in the open-source ML community is moving in this direction, slowly.
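
A minimal sketch of what such a lineage record could look like, with invented field names and a plain SHA-256 commitment over the corpus. The point is that a hash committed at training time lets a third party later check an adapter's claimed corpus without the registry ever holding the corpus itself:

```python
import hashlib
import json
from dataclasses import dataclass, asdict

def commit_corpus(file_paths: list[str]) -> str:
    """SHA-256 over the corpus contents in a fixed order: the training-time commitment."""
    h = hashlib.sha256()
    for path in sorted(file_paths):
        with open(path, "rb") as f:
            h.update(f.read())
    return h.hexdigest()

@dataclass
class AdapterLineage:
    adapter_id: str
    base_model: str
    corpus_owner: str
    corpus_commitment: str  # the hash committed at training time
    consent_terms: str      # a license identifier, once such licenses exist

# Everything below is hypothetical: ids, file names, terms.
record = AdapterLineage(
    adapter_id="writer-style-adapter-v1",
    base_model="some-base-model",
    corpus_owner="did:example:writer",
    corpus_commitment=commit_corpus(["essay1.txt", "essay2.txt"]),
    consent_terms="default-adapter-terms-v0",
)
print(json.dumps(asdict(record), indent=2))  # what the registry would store
```

Verification is then just the same hash recomputed by anyone with access to the corpus; a dispute reduces to whether the committed hash matches.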

Default attribution and royalty terms. Something analogous to what mechanical licenses do for music: a default rate and structure that applies unless the parties negotiate otherwise. Most adapter creators wouldn't bother negotiating; they'd ship under the default terms. Most adapter consumers would prefer the default terms over the friction of bespoke negotiation. The defaults become the equilibrium for most of the market.
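
Purely as illustration, with every field and number invented for the sketch, the default terms could be as simple as a machine-readable record attached to the adapter at publication:

```python
# Invented defaults, in the spirit of a mechanical license: a rate and
# structure that apply unless the parties negotiate something else.
DEFAULT_ADAPTER_TERMS = {
    "attribution_required": True,
    "royalty_basis": "per_inference_call",  # could equally be per-download or per-seat
    "royalty_rate": 0.02,                   # invented figure, not a proposal
    "revocable_by_corpus_owner": True,      # the expressive opt-out, after the fact
    "negotiable": True,                     # bespoke terms override the default
}
```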

A working enforcement mechanism for output substitutability claims. This is the hardest piece: courts have to develop a doctrine for "this adapter's output is substantially in this person's voice and reasoning style." The case law would build out from a few early high-profile cases, the same way music-similarity case law built out around early plagiarism suits. Slow, contested, gradually settling into something workable.

None of this is going to happen in the next year. Possibly not in the next five. But the eventual arrival of legal scaffolding for adaptation rights specifically, distinct from the training-data rights that absorb most of the current oxygen, is what would let the original Knowledge-as-a-Service framing turn into a real market structure rather than a sketch.

The legal conversation is currently converging on the wrong layer. Worth being explicit about that now, while the framing is still being fought over, because the framing determines what the eventual settlement actually accomplishes. Settling the training-data question without settling the adaptation question will leave the actual asset (the captured shape of how someone reasons) outside the protection regime entirely. That's a worse outcome than what we have now.