Spencer Saldana

agents in the wild: a 2025 field report

September 10, 2025 · Spencer Saldana

I've spent most of 2025 helping organizations put agent systems into production. By "agent" I mean what most people now mean: an LLM in a loop, calling tools, with some kind of stopping criterion. By "production" I mean the boring kind, not the demo kind. Actual users. Actual money. Actual consequences.

The view from this seat in 2025 is different from what it was in 2024. The technology has improved. The hype curve has crested and started its descent. The serious deployments are starting to ship. Here's what I see.

what's actually shipping

The agent systems that successfully reached production in the first nine months of 2025, in roughly descending order of frequency:

Customer support tier one. Resolve the easy tickets autonomously, route the hard ones to a human with a summary attached. This is the workhorse of the agent era. Boring. Effective. Cost savings are real. The accuracy bar is "as good as the median tier-one human agent." That bar turns out to be lower than people expect, so the model crosses it easily. Klarna's 2024 deployment was the canonical example. By mid-2025 every mid-market support org has a version.

Internal search and Q&A over private corpora. Slack history, wiki content, Jira tickets, Confluence pages, the email someone sent in 2019 that nobody can find anymore. The use case is "ask in plain language, get an answer with citations." Adoption is high when the corpus is well-curated, low when it's a mess. Most enterprises have a mess. The deployments that work do six months of corpus hygiene before they go live.

Structured data extraction from unstructured documents. Invoices, contracts, claims forms, PDFs from vendors, emails with attachments. The model reads the doc, returns JSON. The downstream system processes the JSON. Very high ROI. The use case has been around since 2023 but really matured in 2024 and 2025 as accuracy got high enough to drop the human reviewer for most cases.
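Dropping the human reviewer "for most cases" still leaves the question of which cases go to a human. A minimal sketch of one way to decide, assuming a hypothetical three-field invoice schema (REQUIRED_FIELDS, ExtractionResult, and check_extraction are illustrative names, not from any particular deployment): parse the model's output, check it against the schema, and flag anything that fails.

```python
import json
from dataclasses import dataclass
from typing import Optional

# Hypothetical schema for illustration: field name -> expected Python type.
REQUIRED_FIELDS = {"vendor": str, "invoice_number": str, "total": float}


@dataclass
class ExtractionResult:
    data: Optional[dict]
    needs_review: bool
    reason: str


def check_extraction(raw: str) -> ExtractionResult:
    """Validate model output; anything that fails a check goes to a human."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return ExtractionResult(None, True, f"invalid JSON: {exc.msg}")
    if not isinstance(data, dict):
        return ExtractionResult(None, True, "top-level value is not an object")
    for name, expected in REQUIRED_FIELDS.items():
        if name not in data:
            return ExtractionResult(data, True, f"missing field: {name}")
        value = data[name]
        # JSON has one number type, so accept ints where a float is expected.
        if expected is float and isinstance(value, int) and not isinstance(value, bool):
            value = float(value)
            data[name] = value
        if not isinstance(value, expected):
            return ExtractionResult(data, True, f"wrong type for field: {name}")
    return ExtractionResult(data, False, "ok")
```

Schema checks only catch malformed output, not a plausible but wrong total. A real pipeline would likely add spot-check sampling on top of this.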

Sales research agents. Given a company name, produce a brief. Find the LinkedIn, find the recent news, find the funding history, summarize. The output goes into a CRM field or a daily digest. Adopted unevenly. Loved by individual reps. Distrusted by leadership because the output quality is hard to govern. The agents that win here are conservative. They give clearly-cited summaries with explicit "I don't know" sections.

Coding assistants that do whole tasks. Cursor, Cognition's Devin (which finally became real this year), a wave of internal versions. This is the category that's most clearly moving. The deployments that ship are scoped: bug fixes, test generation, dependency updates, small features inside a well-bounded codebase. The "build my app" demos are still mostly demos. The "fix the flaky test" reality is shipping.

what isn't shipping

The mirror image. What the discourse was certain about in early 2024 that has not actually arrived in production in 2025:

Multi-day autonomous workflows. The "set the agent loose for a week, come back to a finished project" pattern. Not happening. The reliability isn't there. The cost is high. The trust is low. When teams try, they end up reverting to either hourly check-ins (in which case it's not really autonomous) or to running it offline as an experiment.

Agents that write production code unsupervised. Code generation is great. Unsupervised code generation for production-bound code is rare. The economics don't pencil out yet. The cost of a bad commit is high. The cost of human review is fine. Most teams still review.

"AI employees." The idea that you could just spin up an autonomous worker, assign it work, and treat it like a teammate. It's a vibe. It's not a deployment. The orgs that tried in 2024 mostly walked it back. The agents that succeed are clearly framed as tools, not coworkers. People can use tools well. People are bad at managing fake humans.

Multi-agent orchestration. Talked about more in 2025. Shipped slightly more. Still mostly not shipping. The single-agent plus tools pattern continues to outperform.

the boring lessons

The lessons that turned out to matter most in 2025 are not the ones the discourse pays attention to.

Observability is the bottleneck. The orgs that ship and iterate well have a logging layer that captures every agent turn, every tool call, every input and output, every cost and latency. The orgs that don't have this can't improve their agents, because they can't see what went wrong. The first three months of any serious agent deployment should be observability. Most teams skip this and pay for it later.
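A minimal sketch of the smallest useful version of that logging layer, assuming tools are plain Python callables. TurnLog and call_tool are hypothetical names for illustration. Every tool call goes through one wrapper that records input, output, status, and latency, whether the call succeeds or raises.

```python
import json
import time
from typing import Any, Callable


class TurnLog:
    """In-memory log of tool calls; a real system would ship these records out."""

    def __init__(self) -> None:
        self.records: list[dict[str, Any]] = []

    def call_tool(self, run_id: str, tool_name: str,
                  tool: Callable[..., Any], **kwargs: Any) -> Any:
        started = time.perf_counter()
        record: dict[str, Any] = {"run_id": run_id, "tool": tool_name, "input": kwargs}
        try:
            output = tool(**kwargs)
            record["output"] = output
            record["status"] = "ok"
            return output
        except Exception as exc:
            record["status"] = "error"
            record["error"] = repr(exc)
            raise
        finally:
            # Runs on success and on failure, so no call goes unrecorded.
            record["latency_ms"] = round((time.perf_counter() - started) * 1000, 3)
            self.records.append(record)

    def dump_jsonl(self) -> str:
        """One JSON object per line, easy to grep or load into a warehouse."""
        return "\n".join(json.dumps(r, default=str) for r in self.records)
```

Cost is left out here because it depends on how your model provider reports token usage. It would be one more field on the same record.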

Eval surface area grows faster than anything else. An agent with one tool has a small eval problem. An agent with twenty tools has a combinatorial eval problem. The bottleneck on most agent deployments by mid-2025 isn't model quality or prompt design. It's "how do we know if a change made things better." The teams that have invested in eval infrastructure ship faster. The teams that haven't are stuck rerunning the same trial-and-error loop every time something regresses.
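The core of "did this change make things better" can be sketched in a few lines, assuming each agent version can be wrapped as a function from input string to output string and that exact match is a good enough grader (it often isn't; swap in your own). compare_versions is a hypothetical name. It runs both versions over the same fixed cases and reports what broke and what got fixed, not only the aggregate pass rate.

```python
from typing import Callable

Agent = Callable[[str], str]


def compare_versions(baseline: Agent, candidate: Agent,
                     cases: list[tuple[str, str]]) -> dict:
    """Run both versions on the same (input, expected) cases and diff the results."""
    regressions, fixes = [], []
    base_pass = cand_pass = 0
    for prompt, expected in cases:
        base_ok = baseline(prompt) == expected
        cand_ok = candidate(prompt) == expected
        base_pass += base_ok
        cand_pass += cand_ok
        if base_ok and not cand_ok:
            regressions.append(prompt)   # worked before, broken now
        elif cand_ok and not base_ok:
            fixes.append(prompt)         # broken before, works now
    total = len(cases)
    return {
        "baseline_pass_rate": base_pass / total if total else 0.0,
        "candidate_pass_rate": cand_pass / total if total else 0.0,
        "regressions": regressions,
        "fixes": fixes,
    }
```

Listing regressions separately matters because two versions can have the same pass rate while failing on different cases.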

Stopping is harder than thinking. The hardest part of agent design in 2025, in my experience, remains stopping. Knowing when the agent has enough information. Knowing when to hand off. Knowing when to give up. Models trained for reasoning made this slightly worse in some cases, because they want to think more, not less. The deployments that ship well have explicit stopping logic outside the model. The model proposes; the loop decides.
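"The model proposes; the loop decides" can be sketched as follows, assuming the model is wrapped as a propose function that returns either a tool call or a final answer. Proposal, run_loop, and the limits are hypothetical names and values for illustration. The model never gets to decide that it needs a ninth step or a third identical search.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Proposal:
    kind: str      # "tool" or "final"
    payload: str   # tool call description, or the final answer


def run_loop(propose: Callable[[list], Proposal],
             execute: Callable[[str], str],
             max_steps: int = 8,
             max_repeats: int = 2) -> tuple[str, str]:
    """Drive the agent; all stopping decisions live here, outside the model."""
    history: list[tuple[str, str]] = []
    seen: dict[str, int] = {}
    for _ in range(max_steps):
        proposal = propose(history)
        if proposal.kind == "final":
            return ("done", proposal.payload)
        # Repeating the same call is a sign the agent is stuck: hand off.
        seen[proposal.payload] = seen.get(proposal.payload, 0) + 1
        if seen[proposal.payload] > max_repeats:
            return ("handoff", f"repeated tool call: {proposal.payload}")
        history.append((proposal.payload, execute(proposal.payload)))
    # Step budget exhausted without a final answer: give up explicitly.
    return ("gave_up", f"step limit {max_steps} reached")
```

The three return statuses map onto the three stopping questions: enough information, hand off, give up. A production loop would probably add cost and wall-clock budgets alongside the step count.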

the interesting lessons

The ones that surprised me.

Tool design quality predicts agent quality more than model choice. I would not have believed this in 2024. An agent with gpt-4o-mini and beautifully designed tools outperforms an agent with the latest reasoning model and sloppy tools, in almost every production deployment I've watched. "Tool design quality" means clear names, narrow scopes, idempotent semantics, good error messages, structured outputs the model can reason over. The work is unglamorous. It pays.
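To make those properties concrete, here is a sketch of one well-behaved tool, assuming an in-memory order store standing in for a real backend. refund_order, the error codes, and the hint text are all hypothetical. It does one thing, is safe to retry with the same idempotency key, and returns structured errors the model can act on.

```python
from typing import Any

# Stand-in data store for illustration; a real tool would call a backend service.
_ORDERS: dict[str, dict[str, Any]] = {"A100": {"total_cents": 2500, "refunded": False}}
_APPLIED: dict[str, dict[str, Any]] = {}  # idempotency key -> stored result


def refund_order(order_id: str, idempotency_key: str) -> dict[str, Any]:
    """Refund one order in full. Retrying with the same key has no extra effect."""
    if idempotency_key in _APPLIED:
        return _APPLIED[idempotency_key]  # same answer, no second refund
    order = _ORDERS.get(order_id)
    if order is None:
        return {"ok": False, "error": "order_not_found",
                "hint": "Confirm the order ID with the customer before retrying."}
    if order["refunded"]:
        return {"ok": False, "error": "already_refunded",
                "hint": "No action needed; tell the customer the refund was issued."}
    order["refunded"] = True
    result = {"ok": True, "order_id": order_id, "refunded_cents": order["total_cents"]}
    _APPLIED[idempotency_key] = result
    return result
```

The hint field is the part that tends to get skipped. An error code tells the model what went wrong, and the hint tells it what to do next.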

Agents are more conservative than people expect. In 2024 the fear was that agents would do reckless things. In production, the more common failure mode is the opposite. The agent gives up too easily, asks for clarification when it shouldn't, or refuses to act on partial information. The prompts that ship in production tend to push the model toward more action, not less. The "be helpful, take initiative" instruction is doing real work.

Cost models matter more than model quality at scale. A support agent handling 50,000 tickets a day at $0.02 per ticket is $1,000 a day. The same agent on a model that's 5x more expensive is $5,000. The quality difference between the two is rarely 5x. Most production agents in 2025 run on a Pareto-frontier cheap model with carefully designed prompts, not on the flagship. This is going to keep being true.
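The arithmetic above, written out so you can plug in your own volume and per-ticket cost. daily_cost is a hypothetical helper, and the annualized gap assumes the same volume every day of the year, which real ticket queues won't match exactly.

```python
from decimal import Decimal


def daily_cost(tickets_per_day: int, cost_per_ticket: Decimal) -> Decimal:
    """Daily spend in dollars; Decimal avoids float rounding on money."""
    return tickets_per_day * cost_per_ticket


cheap = daily_cost(50_000, Decimal("0.02"))         # the $0.02-per-ticket model
flagship = daily_cost(50_000, Decimal("0.02") * 5)  # a model 5x more expensive
annual_gap = (flagship - cheap) * 365               # assumes constant daily volume

print(f"cheap: ${cheap}/day, flagship: ${flagship}/day, gap: ${annual_gap}/year")
```

The question to ask of the flagship is whether the quality difference is worth the annual gap, not whether it is better.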

The reasoning models have a specific use case. o1, o3, and their successors are good at agents that have to plan multi-step tool calls over complex state. They are not good at agents that have to respond fast. Most production agents have to respond fast. The reasoning models end up reserved for offline or batch use cases where their slowness is fine and their planning is genuinely better. Not the universal upgrade some people thought they'd be.

what's coming

Three things I'm watching for the rest of 2025 and into 2026.

Long-running agents become real. The infrastructure is starting to be there. MCP, durable execution, observability tooling. By the end of 2025 I think we'll see the first credible "let it run for a week" deployments, in narrow domains (financial close, large-scale research synthesis, periodic compliance audits). They won't be the general "AI employee." They'll be specific.

Computer use becomes useful. Anthropic shipped the first version in late 2024. It worked, but slowly and with high error rates. The 2025 versions are starting to be useful for narrow tasks: browser automation, legacy GUI integration, the long tail of "no API, please click here" problems. It will be the way agents bridge to systems that were never meant to be programmable.

Voice agents win the consumer side. The text chatbot is the wrong interface for most consumer interactions. The voice agent, with the latency now down under 500ms and the model quality high enough, is starting to feel right. The companies that get the voice product right (not the technology, the product) are going to take a lot of market in the next eighteen months.

the meta-lesson

A lot of what made 2024 confusing was that everyone was trying to figure out what agents were. In 2025, we know. They're loops. They use tools. They stop. The interesting question shifted from "what can they do" to "how do you make them reliably good at one specific thing."

That second question has a lot less hype and a lot more upside. The teams that internalized this in 2024 are shipping in 2025. The teams that didn't are still chasing the demo.