After ChatGPT plugins: agents are coming, and most of us aren't ready
ChatGPT plugins shipped six weeks ago. Function calling shipped last week. The substrate for agents that take action on your behalf just landed, and the second-order problems are going to be ugly.
OpenAI shipped ChatGPT plugins in May (general availability, after a March beta). Last week they released function calling for gpt-3.5-turbo and gpt-4. AutoGPT and BabyAGI have been making the rounds on Twitter for two months. LangChain agents have a usable abstraction for tool use. The foundation for "language models that take actions on the user's behalf" is in production today, in pieces. It's the next category change after the chat interface itself, and the integration patterns are getting clearer by the week.
The capability is real and it's coming faster than the operational practices around it. The interesting work for the next year is going to be in the gap between "the agent can do this" and "you should let the agent do this without supervision." That gap is wide and most of the people working in it haven't realized it yet.
What just landed
Three pieces, related but distinct.
ChatGPT plugins: third-party services exposed as tools the ChatGPT model can decide to call during a conversation. The user authorizes a plugin once; from then on, the model can choose to invoke it when relevant. Wolfram Alpha, Zapier, Expedia, Instacart, OpenTable: the launch list spans research, automation, and consumer transactions.
Function calling for the API: gpt-3.5-turbo-0613 and gpt-4-0613, released last week, accept function definitions in the API call and return structured calls to those functions when the model decides one is appropriate. This is the same mechanism plugins use, exposed at the API level for developers to integrate into their own products.
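Concretely, the round trip looks roughly like this. A minimal sketch against the June API (openai Python library, 0.27-era interface); the get_order_status tool and its fields are made up for illustration:

```python
import json
import openai

# Hypothetical tool: look up an order. Name and fields are illustrative only.
functions = [{
    "name": "get_order_status",
    "description": "Look up the current status of an order",
    "parameters": {
        "type": "object",
        "properties": {
            "order_id": {"type": "string", "description": "The order identifier"},
        },
        "required": ["order_id"],
    },
}]

messages = [{"role": "user", "content": "Where is order 81732?"}]

# First call: the model decides whether a function call is appropriate.
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-0613",
    messages=messages,
    functions=functions,
    function_call="auto",
)
message = response["choices"][0]["message"]

if message.get("function_call"):
    name = message["function_call"]["name"]
    args = json.loads(message["function_call"]["arguments"])  # arguments arrive as a JSON string

    # Your code runs the tool; the model only proposed the call.
    result = {"order_id": args["order_id"], "status": "shipped"}  # stand-in for a real lookup

    # Second call: hand the result back so the model can answer in prose.
    messages.append(message)
    messages.append({"role": "function", "name": name, "content": json.dumps(result)})
    followup = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-0613", messages=messages, functions=functions
    )
    print(followup["choices"][0]["message"]["content"])
```

The model never executes anything itself; it proposes the call, and your code decides whether and how to run it.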
Agent frameworks: AutoGPT, BabyAGI, the LangChain agent abstractions. Open-source projects that wrap a base LLM in a loop: take a goal, decompose it, choose a tool, execute, observe the result, decide what to do next, repeat. None of them are production-grade. All of them work well enough to demo. Many of them work poorly enough in real use to give the whole category a bad name.
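Stripped of the branding, the loop underneath all of these is short. A sketch, not any particular project's implementation; the step cap and the tool registry are my assumptions, not features of AutoGPT or LangChain:

```python
import json
import openai

MAX_STEPS = 10  # hard cap so a stuck agent can't loop forever on an unbounded API bill

def run_agent(goal, functions, tools):
    """tools maps function names to plain Python callables."""
    messages = [{"role": "user", "content": goal}]
    for _ in range(MAX_STEPS):
        response = openai.ChatCompletion.create(
            model="gpt-4-0613",
            messages=messages,
            functions=functions,
            function_call="auto",
        )
        message = response["choices"][0]["message"]
        messages.append(message)

        # No function call means the model thinks it's done.
        if not message.get("function_call"):
            return message["content"]

        # Execute the chosen tool and feed the observation back in.
        name = message["function_call"]["name"]
        args = json.loads(message["function_call"]["arguments"])
        observation = tools[name](**args)
        messages.append({"role": "function", "name": name, "content": json.dumps(observation)})

    return "Stopped: hit the step limit before finishing the goal."
```

Everything interesting, and everything dangerous, lives inside the tools[name](**args) call, which is the part the frameworks mostly leave to you.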
Together, this is the first production-deployable foundation for "the language model is the orchestration layer, and external tools are the hands." That's a substantively different shape of system than "the language model writes text and a human takes the output and does something with it." The shift isn't subtle. The implications haven't propagated.
Where this works today
The use cases that already work, in roughly the order they're succeeding:
Information retrieval over external sources. "Look up X in Wolfram Alpha" or "search the web for Y": these are search tools wrapped in a chat interface. The model writes the query, the tool returns the result, the model summarizes. Low risk, high value, low operational complexity.
Structured generation against an API contract. Function calling is genuinely useful for "extract structured data from this unstructured input": the function definition is the schema, and the model fills in the fields. It doesn't really need to be an "agent"; it's just a better interface for one-shot extraction work (there's a sketch just below).
Workflow automation with a tight scope. "Look up this order in Shopify, summarize the customer history, draft a response email, but don't send it." The agent does the assembly work; a human does the send. Useful when the assembly is the slow part and the send is the consequential part.
These all have one thing in common: the agent's actions are either reversible, low-stakes, or gated by a human approval step before anything irreversible happens. That gating is what makes them safe to deploy.
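For the extraction case above, the pattern is a single call with the function forced: no loop, no tool execution. A sketch (the record_contact schema and the example text are invented):

```python
import json
import openai

# The function is never executed; its parameter schema is the output contract.
record_contact = {
    "name": "record_contact",
    "description": "Record the contact details mentioned in the text",
    "parameters": {
        "type": "object",
        "properties": {
            "name": {"type": "string"},
            "email": {"type": "string"},
            "company": {"type": "string"},
        },
        "required": ["name"],
    },
}

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-0613",
    messages=[{"role": "user", "content": "Reach out to Dana Hsu at dana@example.com, she runs data at Acme."}],
    functions=[record_contact],
    function_call={"name": "record_contact"},  # force the call: structure back every time, never prose
)
fields = json.loads(response["choices"][0]["message"]["function_call"]["arguments"])
print(fields)  # something like {"name": "Dana Hsu", "email": "dana@example.com", "company": "Acme"}
```

Forcing the function call means you always get arguments back instead of prose, which is most of the point.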
Where this doesn't work yet
The use cases that don't work yet, the ones I keep seeing demoed and then not surviving contact with reality:
Long-horizon autonomous task completion. AutoGPT-style demos where the agent is given a high-level goal ("research X and produce a report") and runs unattended for an hour. These produce outputs that look plausible from a distance and fall apart on inspection. The agent gets distracted, loops on the same dead end, hallucinates intermediate results, runs up an unbounded API bill. The tooling for "run an agent overnight and trust the output in the morning" is not there.
Anything with real money or real production systems on the line. I have not yet seen a credible production deployment of an agent with permission to spend money or modify production infrastructure. Demos exist. Production deployments do not. The reason is the next section.
The operational problems nobody is solving
A lot of the energy in this space is on capability. The capability is the easy part. The operational foundation around capability is where the actual work is, and most of the work hasn't started yet.
Authorization and least privilege
When an agent calls a tool on your behalf, what is it allowed to do? Right now the granularity is "you authorized the plugin, now the model can do anything that plugin can do." That's not a real authorization model. A useful authorization model would be "this agent can read from these resources, write to these resources, and trigger these specific actions, in these specific contexts." None of the current frameworks have that. Building it requires both a permissions model and a way to enforce it at runtime, against a non-deterministic caller.
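Nothing like this exists off the shelf today. The sketch below is what I mean by declaring grants per agent and checking them at runtime; the grant shape and the tool names are invented for illustration:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ToolGrant:
    tool: str       # which tool may be called
    mode: str       # "read" or "write"
    resource: str   # which resource it may touch

class AgentPolicy:
    def __init__(self, grants):
        self.grants = set(grants)

    def check(self, tool, mode, resource):
        # Runs before every single invocation, because the caller is non-deterministic
        # and yesterday's behavior says nothing about today's.
        if ToolGrant(tool, mode, resource) not in self.grants:
            raise PermissionError(f"{tool} ({mode} on {resource}) is not granted to this agent")

# A support agent that can read order data and write email drafts, nothing else.
policy = AgentPolicy([
    ToolGrant("shopify_orders", "read", "orders"),
    ToolGrant("gmail", "write", "drafts"),
])

policy.check("shopify_orders", "read", "orders")  # passes
try:
    policy.check("gmail", "write", "outbox")      # not granted: blocked before the call happens
except PermissionError as err:
    print(err)
```

The hard part isn't this check. It's deciding what the grants should be and keeping them current as the tool set grows.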
Audit and forensics
When the agent does something wrong (when it calls a tool with arguments that produce bad outcomes), what does the trail look like? Today it's mostly "the chat history, plus whatever logs the tool itself emits." That's not a forensics-grade trail. You can't reconstruct why the model chose those arguments. You can't tell whether the same input would produce the same call again. You can't, in many cases, even reliably reproduce the failure.
The model providers are starting to expose better tracing (function call logs, tool invocation history). It's nowhere near where it needs to be for production use in regulated environments. That gap is going to close, but not in 2023.
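In the meantime, the only option is to write the record yourself at the call site. A sketch of the minimum each invocation would need to capture; the field choices are mine, not any provider's format:

```python
import datetime
import hashlib
import json

def audit_record(session_id, model, messages, tool_name, arguments, result):
    """One row per tool invocation, written before the result is acted on."""
    # Fingerprint the full conversation state that produced this call, so you can
    # at least check later whether the same context was in play.
    prompt_fingerprint = hashlib.sha256(
        json.dumps(messages, sort_keys=True, default=str).encode()
    ).hexdigest()
    return {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "session_id": session_id,
        "model": model,                       # the exact snapshot, e.g. gpt-4-0613
        "prompt_sha256": prompt_fingerprint,
        "tool": tool_name,
        "arguments": arguments,               # exactly what the model asked for
        "result_summary": str(result)[:500],  # what came back, truncated
    }
```

Even this only tells you what happened, not why the model chose those arguments, so the reproducibility gap stays open.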
Tool sprawl and capability creep
Every new tool an agent has access to expands the surface area of what it might do. Five tools is manageable. Fifty tools is a configuration nightmare. Five hundred tools is unenforceable in practice: the model will make selection mistakes, the user will lose track of what the agent can do, and the failure modes will be increasingly hard to reason about.
There's no good answer to this yet. Some kind of capability scoping (agent-A has access to tools X, Y, Z; agent-B has a different set) is necessary but not sufficient. Some kind of intent-based authorization (the agent can call this tool only if the user's intent is plausibly served by it) would help but doesn't exist.
Blast radius and rollback
When an agent takes an action, can it be undone? In some tools, yes (delete a draft email, reverse a transfer that's still pending). In many tools, no (sent emails, completed transactions, modified production data). An agent system without a clear blast-radius model is one bad call away from a story.
The right pattern is probably something like: classify every tool by its reversibility, gate irreversible actions behind explicit user approval, and track the cumulative blast radius of an agent session. None of the current frameworks expose this kind of structure.
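A sketch of what that structure could look like, with the classification and the approval hook invented for illustration rather than drawn from any existing framework:

```python
from enum import Enum

class Reversibility(Enum):
    REVERSIBLE = "reversible"      # delete a draft, cancel a still-pending transfer
    IRREVERSIBLE = "irreversible"  # send an email, complete a transaction

TOOL_REVERSIBILITY = {
    "save_draft": Reversibility.REVERSIBLE,
    "send_email": Reversibility.IRREVERSIBLE,
    "issue_refund": Reversibility.IRREVERSIBLE,
}

class AgentSession:
    def __init__(self, approve):
        self.approve = approve            # callback that asks the human for explicit approval
        self.irreversible_actions = []    # the session's cumulative blast radius

    def execute(self, tool_name, args, run_tool):
        # Unknown tools are treated as irreversible: fail closed.
        kind = TOOL_REVERSIBILITY.get(tool_name, Reversibility.IRREVERSIBLE)
        if kind is Reversibility.IRREVERSIBLE:
            if not self.approve(tool_name, args):
                return {"status": "blocked", "reason": "user declined"}
            self.irreversible_actions.append((tool_name, args))
        return run_tool(tool_name, args)
```

Defaulting unknown tools to irreversible is the fail-closed choice; the harder product problem is making the approval prompt legible enough that people don't just click through it.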
Trust calibration over time
Even if all of the above is solved, there's a longer-term question: how does a user calibrate trust in an agent? An agent that's gotten ten things right is not necessarily an agent that will get the eleventh thing right. Calibrating user trust based on prior performance (without inducing complacency on the rare hard case) is a real human-factors problem.
We don't have great patterns for this. Self-driving cars are running into the same problem and haven't solved it either. There's probably some structural lesson there.
The shape of the next year
What I think happens next, with low confidence on timing.
The capability gets better, fast. Function calling is going to get more reliable. Tool selection is going to get smarter. Long-horizon planning is going to start working in narrow domains.
The operational foundation gets better, slower. Permission models, audit trails, blast-radius tooling: these get built by the platforms that make money from enterprise deployments, because the demand is concentrated there. Consumer-grade agent products will continue to ship without the foundation, because the consumer demand is for capability, not safety.
The first big public failure happens. Some agent does something embarrassing or expensive on someone's behalf. The story makes the rounds. People discover that "the model called the function with arguments that produced this outcome" is not a satisfying answer to "why did this happen." The conversation about agent governance shifts from "interesting future problem" to "blocking issue."
Some sub-industry of "agent observability" or "agent governance" emerges, as the natural follow-on to MLOps and as the next layer of the AI infrastructure stack. The first generation of those products will be inadequate. The second generation will be useful.
We are at the very early end of this. The plugin launch was six weeks ago. The capability is real and the foundation is in production. The operational practices are mostly missing. The places to put attention are the gaps, not the demos.
What I'm actually doing
For my own work, the rule I've landed on for now is: agents are useful for assembly, not for execution. They can pull together the inputs, draft the candidate output, organize the workflow. They can't be the thing that hits "send" or "deploy" or "transfer." The human stays in the loop on anything irreversible.
That's a temporary rule, not a principled one. It's where the operational foundation is today. As the foundation matures, the rule will change. But it'll change slower than the capability does, and that gap is going to be where most of the consequential decisions get made over the next year.