Could there be a marketplace for AI training data?

The model providers scraped the open internet and called it fair game. The next phase needs an actual market for the data, and the structural pieces of that market don't exist yet.


The model providers scraped the open internet for training data, paid no one, and called it fair use. That posture is being challenged in the courts (the Getty suit against Stability AI was filed in January 2023, the artists' class action against Midjourney and others is moving through discovery, the New York Times has been making increasingly unhappy noises), but it's the operating posture for now. The next phase, whether or not those suits succeed, has to involve some kind of actual market for the data. The structural pieces of that market don't exist yet, and it's worth walking through what they'd have to look like.

This sits a step further out than the knowledge-as-a-service thought experiment from a couple weeks back, which assumed the artifact existed and asked who would buy it. This is the same problem from the other side of the table: the supply side, where the people whose work goes into the models would need to participate in the value.

Why the current arrangement is unstable

Three things are converging that make "scrape and ignore the lawyers" a less viable long-term strategy.

The first is the legal pressure, which is real but slow. Courts will take years to settle the questions, and the answers will probably be jurisdiction-specific and case-specific. Don't expect a single ruling to clarify everything.

The second is the supply pressure. The high-quality, well-curated, English-language text on the open internet is finite. Models trained on bigger data tend to perform better, and the marginal new token of good training data is getting harder to find. Crawling Reddit comments and Common Crawl one more time isn't going to produce a 2x capability gain.

The third is the structural strangeness. The current arrangement says: if you write something on the open internet, you implicitly contribute to whatever models the major providers train this year, and you receive nothing, not money, not attribution, not the option to opt out. For a lot of contributors that's fine. For the small subset whose work is disproportionately valuable for training (technical writers, original researchers, working professionals who write detailed first-person accounts of complex problems), it's increasingly unfine.

A market is one way out of this. It's not the only way (opt-out registries, statutory licenses, and outright bans are all on the table) but a market is the answer that's most consistent with the way other knowledge industries work. The question is what it would take.

The pieces a real market would need

Walking through this from first principles, here's what would have to be true.

Provenance for every training token

You can't have a market for an asset you can't track. Today, when a frontier model is trained, the providers can describe the training corpus at a high level ("Common Crawl, books, Wikipedia, code") but cannot say "this specific token came from this specific document." There's no incentive for them to do that (it would expose them to specific lawsuits) and the technical infrastructure to do it at scale isn't standardized.

A market needs the equivalent of a chain-of-custody system. Every document that goes into training has a known source. The training pipeline records which documents contributed to which model checkpoints. Inference can (at least in principle) be traced back to the documents that influenced a given output.
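
A toy sketch of what such a chain-of-custody record might look like: content-address each source document and log it against the checkpoint that consumed it. Everything here (the manifest class, its field layout, the whitespace stand-in for a tokenizer) is invented for illustration; no standard of this kind exists yet.

```python
import hashlib
import json

def document_id(text: str) -> str:
    """Content-address a source document so it can be tracked across pipelines."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

class TrainingManifest:
    """Hypothetical record of which documents fed which model checkpoint."""

    def __init__(self, checkpoint: str):
        self.checkpoint = checkpoint
        self.documents = []  # list of (doc_id, source_url, token_count)

    def add(self, text: str, source_url: str) -> None:
        tokens = len(text.split())  # whitespace split as a stand-in tokenizer
        self.documents.append((document_id(text), source_url, tokens))

    def to_json(self) -> str:
        return json.dumps({"checkpoint": self.checkpoint,
                           "documents": self.documents})

manifest = TrainingManifest("model-v1-step-10000")
manifest.add("Some licensed article text here", "https://example.com/post")
```

The hard part isn't the bookkeeping, which is trivial; it's getting providers to run it and attributing inference-time outputs back through it.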

Some research is starting on this: influence functions, training data attribution, the work coming out of various model interpretability groups. None of it is at the maturity required to support a market. It's the precondition that has to land first, and it's the longest-pole item.

A licensing framework with the right granularity

Today's options for licensing a body of work are: traditional copyright (which doesn't cleanly cover "use as training data"), Creative Commons (which covers redistribution but says nothing specific about training), and bespoke contracts (which scale poorly).

A market would need something like a training license, a formally-defined right separate from reading, distributing, or quoting. With granularity on the use: "yes for non-commercial research models, no for commercial frontier models," "yes with revenue share, no without," "yes if attribution is preserved in metadata, no otherwise."
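
One way to picture that granularity is a machine-readable license record attached to each document, with an explicit field per use. Every name and field here is hypothetical; no such standard exists.

```python
from dataclasses import dataclass

@dataclass
class TrainingLicense:
    """Hypothetical machine-readable training license for one document."""
    noncommercial_research: bool = True   # "yes for non-commercial research models"
    commercial_frontier: bool = False     # "no for commercial frontier models"
    revenue_share_required: bool = True   # "yes with revenue share, no without"
    attribution_in_metadata: bool = True  # attribution must survive in metadata

def permits(lic: TrainingLicense, use: str, pays_share: bool) -> bool:
    """Check whether a given training use is allowed under the license."""
    if use == "research":
        return lic.noncommercial_research
    if use == "commercial":
        return lic.commercial_frontier and (
            pays_share or not lic.revenue_share_required
        )
    return False

lic = TrainingLicense(commercial_frontier=True)
# research use is allowed; commercial use is allowed only with a revenue share
```

The point of the sketch is that the license has to be evaluable by a training pipeline, not read by a lawyer, which is what makes it different from every existing text-licensing regime.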

This kind of framework has analogs. Music licensing has a public-performance license distinct from a synchronization license distinct from a mechanical license, and a complex collecting-society infrastructure to administer them. Text-and-data licensing for AI doesn't have any of that yet. It would need to be invented, agreed-on, and adopted across enough of the industry to matter.

Attribution and revenue mechanics

If training tokens are tracked and licensed, then payments can flow. The question is the unit of accounting. Per-token contribution? Per-document, weighted by influence? Per-author, with some kind of pool distribution?

The music industry analog is instructive but limited. ASCAP and BMI distribute licensing revenue among songwriters using formulas that everyone complains about. The accounting is messy. The math doesn't quite work. The system functions anyway because the alternative (direct contracting between every venue and every songwriter) would be worse.

A reasonable starting point for AI training revenue might be: providers report the revenue attributable to a model, a portion is allocated to a contributor pool, and the pool is distributed by some formula tied to training-time influence. The formula is going to be wrong in detail and approximately right on average. Like ASCAP's, it works because the alternative is no payments at all.
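
The pool-and-formula mechanism fits in a few lines. Assume, hypothetically, that providers report revenue, a fixed fraction goes to the pool, and each contributor holds an influence weight from training-time attribution; the weights are normalized, so only relative influence matters.

```python
def distribute_pool(model_revenue: float, pool_fraction: float,
                    influence: dict[str, float]) -> dict[str, float]:
    """Split a contributor pool pro rata by training-time influence weights.

    influence maps contributor -> nonnegative influence score (any scale).
    Returns contributor -> payout. Like ASCAP's formulas: approximately
    right on average, wrong in any individual case.
    """
    pool = model_revenue * pool_fraction
    total = sum(influence.values())
    if total == 0:
        return {c: 0.0 for c in influence}
    return {c: pool * w / total for c, w in influence.items()}

payouts = distribute_pool(
    model_revenue=1_000_000.0,
    pool_fraction=0.05,  # 5% of revenue to the contributor pool
    influence={"writer_a": 3.0, "writer_b": 1.0},
)
# writer_a gets three times writer_b's share of the $50,000 pool
```

Everything contentious lives in the inputs, not the arithmetic: what the pool fraction is, and where the influence scores come from.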

A clearinghouse

None of this scales without intermediaries. Individual writers cannot negotiate with individual model providers on their own behalf. Individual model providers cannot negotiate with millions of contributors. Some entity has to aggregate, standardize, and execute.

The natural candidates are: existing rights organizations (ASCAP-for-text), platform companies (Substack, Medium, GitHub negotiating on behalf of their writers), or new entities chartered specifically for this purpose. Each has problems. Existing rights organizations don't understand the technical layer. Platforms have conflicts of interest with their writers. New entities have to bootstrap from nothing.

Why none of this is in motion yet

The honest reason none of this exists is that the current arrangement is profitable for the model providers and there's no organized counterparty pressing for change. Individual writers complaining on Twitter is not organized counterparty pressure. The active lawsuits will create some pressure, but lawsuits are an inefficient market-formation mechanism: they produce settlements, not standards. (And, as the style-vs-knowledge piece covered, the legal questions don't even cleanly map to copyright as currently written; there's a missing layer for "training contribution" that isn't really any of the existing categories.)

The thing that would actually move this is a writers' or researchers' organization with enough collective weight to negotiate, paired with a technical framework for provenance reliable enough to enforce a license. Neither exists. Both could.

Where I think this goes

A few possibilities, none of which I'm confident in.

The market doesn't form, and the lawsuits succeed in fragmenting things by jurisdiction. Different rules in EU, US, and elsewhere. Model providers route training around hostile jurisdictions. Writers in friendly jurisdictions get paid; everyone else doesn't. Legal patchwork, no real market.

The market doesn't form, and a statutory license emerges. Governments (probably the EU first) set a fixed rate for training data use, collect from providers, distribute through some agency. Works for headline cases, leaves a lot of value on the table, slow to update.

A market does form, slowly, starting at the high end. Premium publishers (NYT, Wiley, Elsevier, the major news wires) negotiate licenses with the major model providers. Long-tail writers don't get included for years. The infrastructure for the long tail eventually trickles down, maybe by 2026 or 2027.

Something I can't see yet. The capability and the legal pressure are both new enough that the market structure that emerges might not look like any of the above. Maybe a crypto-attribution scheme works. Maybe the providers themselves start paying for training data as a competitive differentiator. Maybe the model architecture changes in a way that makes the question moot.

The thought experiment from a few weeks back assumed an infrastructure for personal AI artifacts that doesn't exist yet. This piece is the same problem from the other side: the supply side, the data side, the side where the people whose work goes into the models would need to participate in the value. Neither side of that infrastructure is built. They probably need to be built together. Worth seeing whether anyone tries.