The marketplace problem nobody is solving for AI training data

Three years after I sketched what a real training-data market would need, the structural pieces still aren't in place. The lawsuits stalled, the settlements happened, the tiny markets exist at the edges, and the actual marketplace at scale doesn't. Worth being honest about why.

Almost three years ago I wrote a piece sketching out what a real marketplace for AI training data would need to look like. Provenance for every training token. A licensing framework with the right granularity. Attribution and revenue mechanics. A clearinghouse to make any of it run at scale. The argument was that the scrape-and-ignore-the-lawyers posture was unstable and the next phase had to involve some kind of actual market for the data.

It's 2026 now. The next phase happened. The market didn't.

Worth being honest about what did happen, what didn't, and why the marketplace I sketched in 2023 is no closer to existing than it was when I wrote about it, even though all the pressure I described as making it inevitable has, if anything, intensified.

What I expected by now

Reading the 2023 piece back, my optimistic read was something like: by 2026 or 2027 the high end of the market would have formed (NYT, Wiley, Elsevier negotiating directly with the major labs), the long-tail infrastructure would be trickling down behind it, and at minimum the precondition pieces (provenance, licensing granularity, some kind of clearinghouse) would be visibly in motion. I hedged the call. I said "maybe by 2026 or 2027." It turns out I wasn't hedging hard enough.

What I expected to see by now:

  • Real training-data licenses negotiated between the top five or six labs and the top one or two hundred premium publishers, with public terms or at least public existence.
  • A first version of a provenance standard (even a flawed one) being adopted across enough of the industry to matter.
  • An ASCAP-for-text or some equivalent collecting-society infrastructure visibly being built, even if not yet operational.
  • Some kind of opt-in registry for individual writers that the major labs were obligated to consult.

What I actually see:

  • A handful of high-profile bilateral deals. OpenAI with the Financial Times, with News Corp, with Axel Springer; Google with Reddit; the various Stack Overflow and Shutterstock arrangements. These are real and they matter. They are not a market.
  • No provenance standard. None. The work that was promising in 2023 (influence functions, training data attribution) has matured as research and has gone nowhere as infrastructure.
  • No clearinghouse. The Authors Guild has done what it can. The various creators-rights coalitions have made the case. None of them have the technical layer or the leverage to compel adoption.
  • A few tiny markets at the edges. Adobe Stock's Firefly licensing program, Getty's training-data licensing, some experimental work from smaller players. Useful as proof that the structure is technically possible. Tiny in volume. Not a market either.

That gap is what I want to walk through. It's the gap between what the conditions seemed to require and what actually got built. The conditions were right, the pressure was real, and the marketplace still didn't form. That's worth pulling on.

What did happen

The thing that filled the vacuum where I expected a marketplace to form was a different thing: settlements.

The Anthropic settlement in late 2025 is the cleanest example. One and a half billion dollars to authors whose books were in the LibGen and Z-Library shadow libraries Anthropic admitted to having trained on. The largest publicly reported copyright settlement in history. Real money. Real precedent. Real signal that the courts can in fact land hard on the training-data question if the facts are clear enough.

It is also not a marketplace. It is a one-time payment for a past harm. It does not create a forward-looking license. It does not establish a rate. It does not produce infrastructure that the next lab can plug into when they train the next model. It clears one specific liability for one specific company for one specific corpus.

The other deals have the same shape, from the various OpenAI publisher agreements to the Microsoft arrangements to the smaller per-corpus negotiations. Each one is a private bilateral arrangement that resolves a specific dispute or pre-empts a specific lawsuit. None of them add up. None of them produce reusable terms. None of them build the standardized layer that a market actually needs.

The lawsuits that haven't settled have mostly stalled. The Getty case against Stability is years deep with no clear endgame. The artists' class actions against the image-generator companies are grinding. The New York Times case against OpenAI has produced a lot of motion practice and not much law. The legal pressure is real and it has produced settlements; what it has not produced is a coherent framework that everyone can build on.

So the actual 2026 picture: a handful of giants have paid a handful of giants. Everyone else is in the same posture they were in 2023, except now they have a slightly clearer sense of how much it might cost them if they get sued and lose. That clearer sense is being priced into model-training budgets as a contingency line rather than as a license fee. The economics that should have produced a market have produced an insurance reserve.

Why no marketplace formed

The thing I underestimated in 2023 was how stable the current arrangement actually is for the parties who would have to build the alternative.

For the labs, the current arrangement is: train on whatever you can get, settle with whoever sues you that you can't outlast in court, and price the settlements as a cost of doing business. This is more expensive than free training data, which is what they had before. It is dramatically cheaper than a real market would be, because a real market would price every piece of high-quality training data at something above zero, and the total is a much larger number than the settlement reserve. The labs have no incentive to build the thing that would make their input costs go up by an order of magnitude.
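To put rough numbers on that gap, here is a back-of-envelope sketch. Every figure in it is an assumption I'm choosing purely for illustration (how many works would be licensed, what each would cost per year, how long the horizon is); the only anchor is the rough scale of the settlement discussed above.

```python
# Back-of-envelope: cost of a real licensing market vs. a settlement reserve.
# Every number here is an illustrative assumption, not a reported figure.

licensed_works = 20_000_000   # assumed count of high-quality works a lab would license
avg_annual_fee = 150.0        # assumed average per-work fee, USD per year
horizon_years = 5             # assumed horizon over which training keeps happening

market_cost = licensed_works * avg_annual_fee * horizon_years   # recurring, forward-looking
settlement_reserve = 1.5e9    # one-time payment at the scale of the Anthropic settlement

print(f"Hypothetical market cost over {horizon_years} years: ${market_cost / 1e9:.0f}B")
print(f"One-time settlement-scale reserve: ${settlement_reserve / 1e9:.1f}B")
# 20M works x $150 x 5 years = $15B recurring, against $1.5B paid once for one
# corpus: roughly the order-of-magnitude gap described above.
```

The specific numbers don't matter; the shape does. A market is a recurring line item priced above zero for everyone, a settlement is a one-time payment to one plaintiff class.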

For the publishers, the bilateral deals are working out fine. The FT, News Corp, and Axel Springer are getting paid. They are not getting paid through a market that anyone else can use. They are getting paid through a private negotiation that uses their specific leverage as premium brands with credible legal teams. A real market would dilute that leverage by making everyone's content available on standardized terms, which is good for the long tail and bad for the people currently negotiating from a position of strength.

For the long-tail writers, the technical writers, the working professionals, the researchers, the substack-and-blog people whose collective contribution is enormous and whose individual leverage is zero, there is no organized counterparty. There has never been an organized counterparty. The Authors Guild and the various creators-rights coalitions are doing real work, but they don't have the membership, the political weight, or the technical infrastructure to force the marketplace into existence the way ASCAP forced the music-licensing infrastructure into existence over decades.

For the governments, the regulatory motion has been about transparency and opt-out rights, not about market structure. The EU AI Act requires disclosure of training data sources at a high level. Various US state-level proposals have moved on opt-out rights. None of them mandate a licensing framework. None of them create the standardized layer that a market needs. The political coalition for "make the labs pay everyone fairly" doesn't exist. The political coalition for "make the labs disclose what they used" almost does.

The honest read is: the marketplace I sketched in 2023 required several parties to coordinate on building infrastructure that none of them individually wanted to build, against the interests of the parties currently extracting the most value from the absence of that infrastructure. That is a coordination problem of the type that historically gets solved by either enormous regulatory force or enormous collective-action force or both. Neither has materialized. The settlements have been just enough pressure-release that neither feels urgent.

The Adobe and Getty story

Worth spending a paragraph on the small markets that do exist, because they are the proof of what is possible and the demonstration of why it's not happening at scale.

Adobe's Firefly model was trained on Adobe Stock content, and Adobe set up a contributor compensation program in which contributors are paid based on how much their content contributed to the training. Getty did something similar with its own model. These are real, working, micro-marketplaces. The pieces I sketched in 2023 (provenance, licensing granularity, attribution, revenue mechanics) all exist inside these specific bounded systems. They work because the system is closed. Adobe controls the inputs (Adobe Stock), the model (Firefly), and the outputs (Creative Cloud users). There is no coordination problem because there is one party.
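To show how mechanical the money side becomes once the system is closed, here is a minimal sketch of a pro-rata payout over attribution weights. It is not Adobe's or Getty's actual formula; the pool size and weights are invented inputs, and the attribution scores themselves are assumed to come from somewhere else (asset counts, usage data, or a training-data attribution method).

```python
# Minimal sketch of a closed-system contributor payout: a fixed pool split
# pro rata by attribution weight. Not any vendor's real formula; the pool
# and the weights are invented inputs for illustration.

def split_pool(pool_usd: float, attribution: dict[str, float]) -> dict[str, float]:
    """Divide a payout pool proportionally to per-contributor attribution weights."""
    total = sum(attribution.values())
    if total <= 0:
        return {contributor: 0.0 for contributor in attribution}
    return {contributor: pool_usd * weight / total
            for contributor, weight in attribution.items()}

# Hypothetical weights, e.g. how many of a contributor's assets ended up in
# the training set, or a score from a training-data attribution method.
weights = {"alice": 120.0, "bob": 30.0, "carol": 50.0}
print(split_pool(10_000.0, weights))
# {'alice': 6000.0, 'bob': 1500.0, 'carol': 2500.0}
```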

The lesson is twofold. First: it is technically possible. The infrastructure question is solved at the small scale. Second: it does not generalize. The closed-system version doesn't tell you anything about how the open-internet version would work, because the open-internet version requires coordination across thousands of contributors and dozens of labs with no central authority. Adobe and Getty solved a different problem (they vertically integrated around a specific model) and that solution is not a precedent for an actual market.

The other thing worth noting: the Adobe and Getty programs cover stock photography. They do not cover creative work outside those owned libraries, do not cover writing, do not cover the kind of long-tail textual content that is the bulk of frontier-model training data. The micro-marketplaces are forming in the easiest case (bounded, owned, structured) and not in the harder cases where the marketplace would actually matter.

Who would benefit from the marketplace that isn't being built

Working through the populations:

Long-tail writers and creators. The biggest beneficiaries in principle, the smallest beneficiaries in practice. A real market would pay the technical writer whose detailed first-person account of a complex problem ends up disproportionately influential in a model's behavior. Today that writer gets nothing and has no recourse. A real market would also pay the long tail of bloggers, substackers, forum posters, and amateur researchers whose combined contribution is most of what the models actually learned from. None of these people have any leverage right now.

Mid-size publishers. The bilateral-deal world works for the FT and the NYT. It does not work for the trade publication, the academic journal that isn't Elsevier, the niche magazine. A real market would give them a path to monetize their content without having to mount a credible legal threat first. Without a market, they have to either mount the threat (expensive, slow, uncertain) or accept the status quo (their content is in the training set and they are not paid).

The labs themselves, in the long run. This is the counterintuitive one. The labs benefit short-term from the absence of a market. They suffer long-term from the lack of a stable input-cost structure. Every model release is shadowed by potential litigation. Every training run is a potential liability. Every new corpus has to be evaluated for legal risk by people who are not lawyers. A real market would let them buy training data the way they buy compute, with a known price, a clear license, and no overhanging legal risk. The smart labs would prefer this. The smart labs are also, individually, not going to be the ones to build it, because the first lab to build it is the one whose training costs go up first.

Governments and regulators. A real market would solve a problem they currently have to address through clumsy regulation. Disclosure requirements, opt-out rights, transparency mandates: these are workarounds for the absence of a functional licensing layer. A market would replace them with a cleaner mechanism. The regulators would benefit from not having to do the workarounds.

Individual people whose cognition is the asset. This is the population I keep coming back to in the labor essay and in the knowledge-as-an-asset piece and in the early-2026 encoding-a-person follow-up. The line I keep wanting to draw is around individual cognition as IP, the way a person thinks, the processes they follow, the way-of-doing-things that makes them who they are. A real market would have to grapple with this category, which is harder than the publisher category because the unit is a person rather than a corpus, and harder than the creator category because the asset is process rather than work product. None of the existing market structures, micro or macro, contemplate this. The marketplace that doesn't exist would have to.

What would unlock it

A few things, none of them imminent.

A regulatory mandate with teeth. The EU AI Act could be extended to require not just disclosure but standardized licensing terms. The US could pass something equivalent to the Music Modernization Act for training data, setting up a statutory licensing regime with default rates and a collecting body. Either of these would force the infrastructure into existence. Neither is on the table in any serious form right now. Both are conceivable on a five-year horizon.

A provenance standard adopted under duress. If a major lab settles a case in a way that requires it to maintain training-data provenance going forward, and that requirement gets baked into a consent decree, then the standard exists by default. Other labs facing similar exposure would adopt it to match. This is the path that has the most chance of happening organically: it requires no proactive coordination, just one settlement structured the right way.
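For concreteness, here is a minimal sketch of what a per-corpus record under that kind of consent-decree obligation might contain. The schema and field names are mine, invented for illustration; no such standard exists, which is the point of this section.

```python
# Minimal sketch of a forward-looking training-data provenance record.
# The schema and field names are hypothetical; no such standard currently exists.

from dataclasses import dataclass
from datetime import date

@dataclass
class ProvenanceRecord:
    corpus_id: str             # stable identifier for the corpus or work
    source_url: str            # where the data was acquired
    acquired_on: date          # when it entered the training pipeline
    license_basis: str         # e.g. "bilateral-license", "public-domain", "unlicensed"
    rights_holder: str | None  # counterparty, if any
    content_hash: str          # hash of the exact bytes used, for later audit

records = [
    ProvenanceRecord(
        corpus_id="newswire-2026-03",
        source_url="https://example.com/archive",
        acquired_on=date(2026, 3, 1),
        license_basis="bilateral-license",
        rights_holder="Example Publisher",
        content_hash="sha256:0000",  # placeholder digest
    ),
]

# The audit question a consent decree would ask of this ledger: what share of
# the training set has a license basis other than "unlicensed"?
licensed = sum(r.license_basis != "unlicensed" for r in records)
print(f"{licensed}/{len(records)} corpora have a recorded license basis")
```

A record this thin is enough to answer the audit question. The reason it doesn't exist isn't technical difficulty; it's that nobody is obligated to keep it.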

A creators' coalition with real leverage. The Authors Guild, the various creators-rights groups, and a bunch of long-tail organizations would have to merge or align to a degree they haven't yet. Combined with a credible boycott mechanism (refusing to publish in any venue that doesn't honor an opt-out registry, for example), they could force the labs to negotiate. This requires organizing work that nobody is currently doing at the scale that would matter.

A lab that decides to differentiate on cleanness. One of the major labs decides that "trained only on licensed data" is a competitive advantage worth paying for. Builds the marketplace itself, in self-interest, to secure its supply. Anthropic was closest to this posture pre-settlement; the settlement has, paradoxically, reduced the urgency by quantifying the alternative cost. Would still be the fastest path if a lab picked it up. None of them are.

A technological shift that makes the question moot. The model architecture changes in a way that removes the need for the kind of bulk training data the current marketplace question is about. RLHF taken further, synthetic data taken further, smaller models with bigger context taken further. Possible. Speculative. Worth tracking but not worth betting on.

The piece that connects to the labor question

The thread that ties this back to the labor work I've been doing this past year is the individual-cognition question.

When I write about AI displacing jobs, the line I keep wanting to draw is around the difference between automating routine systems work (which is fine, which is what I've spent my career doing) and replicating the way an individual person thinks. The first is a reasonable target for AI. The second is something closer to a property right that doesn't have a name yet.

The marketplace conversation is the same conversation from a different angle. The training-data market that doesn't exist is the market that would, in principle, let an individual person say "yes, you can train on my body of work, here are the terms, here is what I expect to be compensated." It would also, in principle, let that person say no. Today there is no mechanism for either. The labs train on what they can get. The individual has neither a price nor a veto.

The pace problem I keep coming back to in the labor essays is downstream of this. Companies are cutting fast because the AI is convenient and the markets reward the cuts. The AI is convenient in part because it was trained on the accumulated work product of the people now being displaced, who were not asked, not paid, and not given a way to participate in the value created from their contribution. A real market would not stop the displacement (I've been clear about that) but it would make the underlying ethics less ugly. The work-product-as-asset and the cognition-as-asset framings both fall out of the same missing infrastructure.

This is the corner where the marketplace question stops being about commerce and starts being about something more like dignity. I don't want to overstate it. The market wouldn't fix the labor problem. It would fix one specific failure mode where the people whose work made the technology possible got nothing for it. That failure mode is not a small thing.

Where I think this goes

A few takes for the next two to three years that fall out of where things are now:

The settlement-economy continues, slowly, expensively. More billion-dollar settlements as more cases work through the courts. The labs price these as a cost of doing business. The lawyers do well. The infrastructure for a real market does not get built because the settlements release the pressure that would have forced it.

The bilateral deals expand to the next tier of publishers. What worked for the FT works for the next twenty premium publishers. The deals become slightly more standardized, slightly more comparable, but stay bilateral. The long tail stays excluded.

The micro-markets stay micro. Adobe, Getty, and a handful of smaller players keep running their bounded systems. They show the technical feasibility and they do not generalize. Over time more closed-system versions appear. None of them are an open market.

The provenance question gets answered by a consent decree, eventually. Some settlement, sometime in 2027 or 2028, requires forward-looking provenance. The standard gets written under duress. Adoption follows. The infrastructure layer for a market gets built in spite of nobody wanting to build it proactively.

The individual-cognition layer remains unaddressed. The marketplace conversation continues to be about content corpora and not about people. The harder question (what does it mean to license the way someone thinks) does not get a serious legal or technical answer in the next few years. Worth pushing on anyway, because the answer matters and the absence of it has costs.

A regulatory framework eventually shows up. The EU first, the US later, with the usual lag and the usual compromises. By 2028 or 2029 there is something resembling a statutory framework. It works imperfectly. It exists. The market that the 2023 piece described is something like operational, in the same way ASCAP is operational: clunky, contested, worth complaining about, better than nothing.

The honest summary

The marketplace I sketched in 2023 has not formed. The conditions I described as making it inevitable are still in place and have, if anything, intensified. The structural reason it hasn't formed is that the parties who would have to build it are individually better off without it, and the parties who would benefit from it have neither the leverage nor the organization to force it into existence. The settlements that have happened have released just enough pressure to make the absence of the marketplace tolerable for the labs.

I was wrong about the timing in 2023. Not wrong in the way that means the original argument doesn't hold (the pieces I sketched are still the right pieces) but wrong in thinking that the obvious need would translate into the obvious infrastructure on a three-to-four-year horizon. It's going to take longer. It's going to take a forcing function I can't yet identify. It might take a regulatory regime that doesn't yet exist, or a creators' coalition that hasn't yet formed, or a consent decree that hasn't yet been written.

The piece I most want people working on this to take seriously is the individual-cognition layer. The marketplace conversation is heading toward something for big publishers and ignoring the harder question of what a person's way of thinking is worth, who owns it, and how it gets valued when an AI system is trained on it. That layer is the one that matters most for the long-term shape of the human-and-AI economy I keep writing about. It is also the one that is most absent from every conversation about training data I've seen this year.

I'd rather overestimate how slow this is going to be than get caught off guard, again, by how slow it actually is. The 2023 piece was hopeful in a way the 2026 follow-up can't be. The marketplace is still the right answer. It is not the answer that's being built. Worth saying so plainly while there's still time for someone to start building it differently.