ChatGPT three months in: what changes when knowledge becomes a chat

Three months after launch, the most interesting thing about ChatGPT isn't the model; it's that millions of people changed how they ask questions overnight.

ChatGPT crossed a hundred million monthly users in January, two months after launch. That's a faster ramp than anything in consumer software history, including TikTok, including Instagram. The model behind it isn't new in any meaningful sense; GPT-3 has been around since 2020, and GPT-3.5 is an incremental tune on top of it. What's new is the interface, and the interface turned out to matter more than anyone expected.

Three months in, the model itself isn't the thing worth thinking about. The thing worth thinking about is that millions of people changed how they ask questions, almost overnight, and a lot of secondary effects are just starting to land.

The interface change is bigger than the model change

Before late November, getting an answer to a non-trivial question meant search. You typed keywords into Google, you got a list of links, you opened three or four, you skimmed, you synthesized. The synthesis step was on you. Search engines have been quietly removing pieces of that work for years (featured snippets, knowledge panels, rich answers) but the basic loop of triage-and-synthesize stayed the same.

ChatGPT removes the triage step. You ask the question in normal language, you get an answer in normal language. Whether the answer is correct is a separate question, and one that's coming up a lot. But the expectation shifted. People who use it for a week stop reaching for a search tab for a whole class of questions.

That category of question is roughly: "I sort of know what I'm doing here but I want a starting point." Recipes from what's in my fridge. A regex that does a thing I can describe but not write. A first draft of an email I don't want to write from scratch. The category isn't "questions where the answer needs to be authoritative." It's "questions where the answer needs to be a starting point."

For that category, the interface change is enormous. For the authoritative-answer category, the interface change is dangerous.

What the model gets wrong, and why it matters that it sounds confident

The hallucination problem is well-documented at this point. The model will state things that are false in the same tone it states things that are true. Cases that have made the rounds in the last three months: invented court rulings cited in a legal brief, fabricated academic papers cited in a literature review, plausible-sounding API endpoints that don't exist.

The interesting thing isn't that the model is sometimes wrong. Search results are sometimes wrong. The interesting thing is what the interface does to the wrong answer. Search shows you a list and asks you to triage. The model shows you an answer and asks you to trust. The trust calibration most people have for "an answer that sounds confident" was developed in a world where confident answers came from accountable sources. ChatGPT is not an accountable source.

I'm not sure where this goes. Some combination of better calibration in the model itself (a working "I don't know" response is harder than it sounds), better citation surfaces (Bing's preview is starting to show this, with footnotes for claims), and better user intuition over time. The third one happens slowly. The first two are research problems.

The Microsoft and Google response is a tell about where this goes

Microsoft launched a preview of Bing Chat on February 7th, eleven days ago. Reports of weird behavior have been circulating ever since (the model adopting an alter ego, declaring love for a journalist, refusing to accept what year it is). The interesting part isn't the weird behavior. The interesting part is that Microsoft shipped a half-baked product into a flagship surface anyway.

Google, meanwhile, declared a "code red" internally in December, brought back Brin and Page for emergency consultations, and demoed Bard last week. The demo had a factual error in it. Alphabet's stock dropped seven percent. They shipped anyway.

Neither company is shipping AI assistants because the product is ready. They're shipping because the interface category is now contested and being absent from it for another year felt worse than being half-formed in it now. That's the strongest signal so far that the conversational interface is being treated as a platform-level shift, not a product feature.

What's still missing

Three things you'd want, none of which exist right now in any production form:

An "I don't know" that works. Right now the failure mode is confident wrong answers. The fix isn't just better training data. It's a model that has some kind of internal calibration about its own confidence. People have been working on this for years. The fact that the most-used AI product in the world doesn't have a usable version of it tells you something about how hard the problem actually is.

Grounding on your own information. ChatGPT knows the public internet up to its training cutoff. It does not know your inbox, your code, your notes, your team's documentation. The product everyone wants is "the chat interface but on the stuff I actually have." That product requires retrieval, vector storage, and embeddings infrastructure, all of which exist as components but not as a mainstream user experience yet.
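
For a sense of what those components do, here is a minimal sketch of the retrieval loop, with a toy hashed bag-of-words embed() standing in for a real embedding model. Every name in it is illustrative; none of it is a shipping API:

```python
import numpy as np

def embed(text, dim=256):
    """Toy stand-in for a learned text-embedding model: a hashed
    bag of words. A real system calls a trained model here."""
    v = np.zeros(dim)
    for word in text.lower().split():
        v[hash(word) % dim] += 1.0
    return v / (np.linalg.norm(v) + 1e-9)

def build_index(docs):
    """Embed every document once. Rows are unit vectors, so a dot
    product against a query vector is cosine similarity."""
    return np.stack([embed(d) for d in docs])

def retrieve(query, index, docs, k=3):
    """Return the k documents closest to the query in embedding space."""
    scores = index @ embed(query)
    return [docs[i] for i in np.argsort(scores)[::-1][:k]]

def grounded_prompt(query, index, docs):
    """Stuff retrieved passages into the prompt so the model answers
    from your documents rather than from its training data."""
    context = "\n\n".join(retrieve(query, index, docs))
    return f"Answer using only this context:\n\n{context}\n\nQuestion: {query}"

docs = ["Deploys run from the main branch every Tuesday.",
        "The staging database is reset nightly at 02:00 UTC."]
index = build_index(docs)
print(grounded_prompt("When do deploys happen?", index, docs))
```

A production version adds document chunking, a real vector store, and access control, but the shape stays the same: embed the corpus once, embed each query, retrieve the nearest passages, and put them in the prompt.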

Models you can run somewhere other than OpenAI's servers. Right now every conversation you have with ChatGPT goes to one company's data center and is (at minimum) used to improve the product. Whether that's a problem depends entirely on what's in the conversation. For consumer use, mostly fine. For enterprise use, the friction is enormous. The market for "the ChatGPT interface, but on a model my company runs" is going to be very large and is currently not served by anyone.

What's worth watching

The number that matters next is how ChatGPT use evolves once the novelty wears off. The hundred-million-user number is partly real, partly tourist traffic. The question is what fraction of people who tried it still use it weekly six months from now, and what they're using it for. If the surviving use cases are mostly "draft the first version of something" (emails, code, recipes, plans), that tells you the product category is roughly "writing co-pilot." If the surviving use cases include a lot of "answer my question authoritatively," that tells you the calibration problem either got fixed or got ignored.

The other thing worth watching is what shows up on the open-weights side. There's been chatter about Meta releasing a research-only set of model weights in a range of sizes, with licensing permissive enough for academic work. If something like that lands and the open community can fine-tune on it, the "models you can run somewhere other than OpenAI" gap might close faster than anyone expects.

Three months in, none of this is settled yet. But it's the conclusion I keep coming back to: the interface shift is real and probably permanent. Everything else is still in motion.