Spencer Saldana

The Multi-Agent Moment

July 18, 2024·Spencer Saldana

There's a particular kind of trade show booth in 2024 where someone is showing you a diagram. The diagram has five boxes. Each box is labeled with a role. "Research Agent." "Planning Agent." "Writer Agent." "Reviewer Agent." "Manager Agent." Arrows go up and down between them. The diagram is called a reference architecture for autonomous workflows. The presenter is enthusiastic. The demo always works.

You've seen this exact diagram approximately forty times this year.

CrewAI, AutoGen, MetaGPT, LangGraph, the long tail of personal blog posts about "how I built a startup with a hierarchy of agents." This is the multi-agent moment. Everyone is composing little agent societies and watching them produce something. Some of the demos are good. The boring question is whether any of it works.

I've been in pre-sales rooms for a lot of these in 2024. The honest answer right now is that the demos are great, the production deployments are vanishingly rare, and the ones that work look almost nothing like the diagram.

why the diagram feels right

The seduction is easy to explain. We already know hierarchies work, because we work in them. Companies are hierarchies. Departments coordinate. Managers route work. Specialists specialize. If you can prompt an LLM into being good at one thing (writing, planning, reviewing, summarizing), the natural next move is to compose several of them the way you would compose a team. The metaphor maps. The diagram practically draws itself.

It also feels right because the underlying model can almost get there. GPT-4o can plan. Claude 3 Opus can critique. The component capabilities are present. So if you wire them together the way you would wire a team, you should get something more capable than any of them alone. That's the argument. It's the argument that gets you funded. It's the argument that gets a Fortune 500 to sign off on a six-month autonomous workflow POC.

It's a wrong argument, or at least an incomplete one. But it's pretty.

what actually happens

The agent talks to the agent. The agent's output goes to the next agent. The next agent misinterprets it slightly. The next agent acts on the misinterpretation. The next agent reports up the chain that the work is done. The manager agent reads the report and asks for a revision. The revision starts from the misinterpretation. The misinterpretation compounds.

Each LLM call is a stochastic process. You can push it toward determinism at temperature zero (hosted APIs don't strictly guarantee identical outputs even there), but that isn't the version anyone is actually using, because temperature zero tends to produce flat, robotic output. Real agent stacks tend to run somewhere between 0.3 and 0.8. Every single hop introduces some probability of going slightly off the rails. Chain enough of them together and the chance that every hop stayed on the rails shrinks toward zero.
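To put rough numbers on the compounding, here is a minimal sketch. It assumes each hop stays on track independently with the same probability, which is a simplification (real errors correlate), and the 95% per-hop figure is an illustrative assumption, not a measurement.

```python
def chain_success_probability(per_hop_success: float, hops: int) -> float:
    """Chance that every hop in a sequential chain stays on track.

    Assumes hops fail independently with the same probability,
    which is a simplifying assumption for illustration only.
    """
    if not 0.0 <= per_hop_success <= 1.0:
        raise ValueError("per_hop_success must be between 0 and 1")
    if hops < 0:
        raise ValueError("hops must be non-negative")
    return per_hop_success ** hops


# Illustrative only: 0.95 is an assumed per-hop success rate.
for hops in (1, 5, 10, 20):
    print(hops, round(chain_success_probability(0.95, hops), 3))
```

Under those assumptions, a hop that is right 95% of the time gives you roughly 77% after five hops, 60% after ten, and 36% after twenty.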

You can fight this by adding more agents to check the work, which is what most of the frameworks do. Now you have a reviewer agent checking the writer agent. The reviewer agent is also a stochastic process. The reviewer either rubber-stamps everything (because the prompt told it to be helpful) or flags everything (because the prompt told it to be critical). You spend the next two weeks tuning the reviewer prompt. The reviewer is now great at agreeing with the writer. You ship the agreement. The output is still wrong.

The other thing that happens is latency. Every agent is an API call. Every API call is anywhere from 800ms to 8 seconds. A five-agent workflow with two rounds of back-and-forth is at least ten sequential calls, and at the slow end that is well over a minute of waiting. The user doesn't wait that long. The use case dies.
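The latency arithmetic is easy to sketch. This assumes one reading of the workflow, where each of the five agents is called once per round and the calls run strictly one after another; the function name is made up for illustration.

```python
def sequential_latency_bounds(
    calls: int, min_s: float = 0.8, max_s: float = 8.0
) -> tuple[float, float]:
    """Best and worst case wall-clock time for calls made one after another.

    The 0.8 s and 8 s defaults are the per-call range quoted in the text.
    """
    return calls * min_s, calls * max_s


agents = 5
rounds = 2
calls = agents * rounds  # assumption: each agent is called once per round

low, high = sequential_latency_bounds(calls)
print(f"{calls} sequential calls: {low:.0f}s to {high:.0f}s")
```

Any retry, tool call, or extra revision round adds more calls on top of that, so the real number only goes up from here.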

what does work

The successful agent deployments in 2024, in my experience, look almost nothing like the org chart diagram. They are:

A single agent. With good tools. In a tight loop. Stopping when it should.

Customer support agents that pull up the order, summarize the issue, propose a resolution, hand to a human if confidence drops below a threshold. Sales research agents that pull a company name, scrape three sources, write a 200-word brief, stop. Document extraction agents that look at a PDF, find the fields, return JSON, stop.

These are not orchestras. They are one player with a good instrument and a clear song. The "multi-agent" version of any of these is slower, more expensive, more error-prone, and worse.

When multi-agent does work, it's almost always for tasks with these properties: the sub-tasks are genuinely independent (no shared state), the sub-tasks are parallel (not sequential), the handoff is small (a single artifact), and the failure cases are easy to detect (you can verify the output cheaply). Map-reduce, basically. Multi-agent works when it's map-reduce wearing a costume.
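A minimal sketch of that shape, under loud assumptions: `summarize_chunk` is a hypothetical stand-in for one independent model call, and `looks_valid` is a deliberately crude check. The point is the structure (independent workers, one small artifact each, cheap verification, a plain reduce step), not the specifics.

```python
from concurrent.futures import ThreadPoolExecutor


def summarize_chunk(chunk: str) -> str:
    """Hypothetical stand-in for one independent model call.

    Here it just keeps the first sentence; a real worker would call a model.
    """
    return chunk.split(".")[0].strip()


def looks_valid(summary: str) -> bool:
    """Cheap verification: the artifact is non-empty and short."""
    return 0 < len(summary) <= 80


def map_reduce(chunks: list[str]) -> str:
    # Map: workers share no state and can run in parallel.
    with ThreadPoolExecutor(max_workers=4) as pool:
        summaries = list(pool.map(summarize_chunk, chunks))  # keeps input order
    # Verify: drop anything that fails the cheap check.
    verified = [s for s in summaries if looks_valid(s)]
    # Reduce: combine the small artifacts in one deterministic step.
    return " | ".join(verified)


print(map_reduce(["Order arrived late. Customer upset.", "Refund issued. Case closed.", ""]))
```

Nothing in here negotiates with anything else. Each worker hands back one artifact, and the combining step is ordinary code.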

Notice what's not on that list. "Agents collaborating creatively." "Agents iterating to convergence." "An agent that manages other agents." Those are the patterns that look good in demos and die in production.

the abstraction problem

I think the deeper issue is that "agent" turned out to be the wrong unit of abstraction for the thing people actually wanted. The actual unit is "tool-use turn." A model that takes a goal, decides what tool to call, calls it, looks at the output, decides what to do next. One model. One state. One loop.
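In code, that unit is small. The sketch below is an assumption-heavy illustration, not any framework's API: `Action`, `run_agent`, `lookup_order`, and `scripted_policy` are all hypothetical names, and the scripted policy stands in for a real model call so the loop can run offline.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Action:
    kind: str            # "tool" or "finish"
    name: str = ""       # tool name when kind == "tool"
    argument: str = ""   # tool input, or the final answer when kind == "finish"


Tools = dict[str, Callable[[str], str]]
Policy = Callable[[str, list[str]], Action]


def run_agent(goal: str, decide: Policy, tools: Tools, max_steps: int = 5) -> str:
    """One model, one state (the observation list), one loop."""
    observations: list[str] = []
    for _ in range(max_steps):
        action = decide(goal, observations)
        if action.kind == "finish":
            return action.argument  # the model decided it is done
        tool = tools.get(action.name)
        if tool is None:
            observations.append(f"error: unknown tool {action.name!r}")
            continue
        observations.append(tool(action.argument))
    # Hard stopping criterion: hand off instead of looping forever.
    return "escalate: step budget exhausted"


def lookup_order(order_id: str) -> str:
    """Hypothetical tool; a real one would query an order system."""
    return f"order {order_id}: shipped"


def scripted_policy(goal: str, observations: list[str]) -> Action:
    """Stand-in for the model: call one tool, then finish."""
    if not observations:
        return Action("tool", "lookup_order", "A123")
    return Action("finish", argument=f"Status found: {observations[-1]}")


result = run_agent("Where is order A123?", scripted_policy, {"lookup_order": lookup_order})
print(result)
```

The step budget is doing real work here. It is the stopping criterion, and the escalation string is where a human handoff would go.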

We dressed this up as a society because society is what we know. But the model doesn't need a society. It needs context, tools, and a clear stopping criterion. The frameworks that ship in 2025 will probably reflect this. Less CrewAI, more "an agent is a model plus a tool registry plus a loop." The diagram with the five boxes will look quaint, like an early-2000s J2EE architecture diagram. You'll know exactly when somebody drew it.

what to take from this

If you're building, the heuristic is to start with one agent. Add tools, not agents. Only add a second agent when you've actually shown that a single one cannot do the job. The bar for "add another agent" should be very high, because the costs compound.

If you're buying, the heuristic is to be suspicious of any vendor whose architecture diagram has more than two boxes. Make them demo the failure cases. Make them tell you the latency. Make them tell you the cost per session at production volume. Most of the multi-agent POCs that died in 2024 died because nobody asked those questions in 2023.

The multi-agent moment will pass. The good ideas in it (specialized capabilities, explicit handoffs, structured coordination) will get absorbed back into single-agent design. The diagram will get smaller. The work will get better. That's how the cycle usually goes.