2026.06|6 min#ai#agents#evals#reliability

Agents Fail Quietly

A passing test and a working workflow are not the same thing. The gap between them is where production breaks.

Most AI agents look flawless right up until they don't.

The demo runs clean. The test suite is green. The final answer reads well. And then the same agent, pointed at real work, quietly produces something wrong.

Not wrong in an obvious way. Wrong in a way that passes every check you thought to write.

This is the part of building with agents that nobody puts in the launch video.

The Demo Was the Easy Part

Every agent demo is built on the same foundation: clean inputs, a cooperative user, a defined scenario, and an environment where the agent's strengths are on display and its failure modes are conveniently out of frame.

Production is the opposite. Inputs are messy. Users ask for things the agent was never shaped to do. The scenario drifts halfway through. And the failure modes you kept off-screen in the demo are now the main event.

The numbers make this concrete. A March 2026 reliability report analyzed roughly 4.5 million tests across more than 6,000 production agents across ten regions. The aggregate end-to-end success rate was 56.6 percent.

These were not toy projects. They were real systems handling customer service, document processing, internal tooling, and workflow automation. Just over half of the runs actually worked.

That gap, between the demo and the deployment, is not a model problem you can prompt your way out of.

It is a measurement problem.

A Green Test Is Not a Working Agent

Traditional software has a comforting property: if the inputs are the same, the outputs are the same. A passing test means the code did what you asked.

Agents break that assumption.

An agent can pass a unit test and still fail the workflow. It can select the wrong tool. It can pass a malformed argument. It can hand off to the wrong sub-agent. It can take an unsafe path and then arrive at a plausible final answer that looks completely fine.

Here is the trap. Most checks look at the final state. The agent's mistakes happen in the middle.

Picture a research agent. It correctly retrieves competitor information in step one. It misattributes a feature to the wrong company in step three. It builds the rest of its analysis on that mistake. The final summary is well written, confident, and wrong, and it passes a surface-level check because the format is right and the conclusion sounds reasonable.

The error did not show up at the end.

It propagated from the middle and never announced itself.

This is what I mean by failing quietly. The agent does not crash. It does not throw an error. It hands you a clean, finished, incorrect result.

The Math Is Against Long Workflows

The reason this gets worse with ambition is simple arithmetic.

Say your agent is 85 percent reliable at each individual step. That sounds strong. Most people would ship it.

Now chain ten steps together. The end-to-end success rate is not 85 percent. It is 0.85 to the tenth power, which is roughly 20 percent.

Eighty-five percent per step. Twenty percent overall.

And real production workflows are often longer than ten steps. Every tool call, every handoff, every retrieval is another place for a small error to slip in and compound. Strong step-level performance can still produce cascading, end-to-end failure.

This is why "the model got better" rarely fixes a flaky agent. A better model raises per-step reliability a few points. The compounding math eats most of the gain. The fix is fewer steps, tighter scope, and a way to catch errors the moment they happen rather than at the finish line.

You cannot do any of that without measuring the middle.

Benchmarks Will Lie to You Too

The natural instinct is to reach for a benchmark. Run the agent against a standard suite, get a score, ship if the score is high.

Be careful there as well.

In April 2026, UC Berkeley researchers showed that every major AI agent benchmark could be exploited to reach near-perfect scores without actually solving a single task. The benchmark measured something. It just was not the thing you cared about.

A high benchmark score and a reliable production agent are not the same claim. One is a number on a leaderboard. The other is a system that behaves on the inputs your users actually send.

The benchmark is a starting point, not a verdict.

Build Evals Like You Mean It

So what actually works? Less hope, more measurement.

The teams shipping reliable agents in 2026 treat evaluation as core infrastructure, not an afterthought. A few principles I keep coming back to.

Measure steps, not just outcomes. Check the tool calls, the arguments, and the handoffs, not only the final answer. A workflow that gets the right answer for the wrong reason will eventually get the wrong answer.

Build evals from real failures. Every production incident is a test case you did not have yet. When an agent fails quietly, capture the trace and turn it into a permanent check. Your eval set should grow every week.

Run evals in CI. Treat an agent regression like a code regression. The point of CI tooling for agents is to catch the drop before users do, not to explain it afterward.

Score reliability, not vibes. "It feels smarter" is not a metric. End-to-end success rate, step-level accuracy, and tool-selection precision are. If you cannot put a number on it, you cannot tell whether your last change helped or hurt.

None of this is glamorous. It is the agent equivalent of writing tests and reading logs. But it is the difference between an agent that demos well and an agent you can actually leave running.

Quiet Failure Is a Trust Problem

The deeper issue is what quiet failure does to trust.

A loud failure is almost a gift. The agent crashes, you see the error, you fix it, you move on. You know exactly where you stand.

A quiet failure erodes something harder to rebuild. The agent hands you a confident, finished, wrong result, and you only find out later, after you have acted on it. Do that a few times and the user stops trusting any output, even the correct ones. They start re-checking everything by hand, which defeats the entire purpose of the agent.

An agent you have to fully re-verify is not saving you work. It is adding a step.

The whole value of an autonomous workflow is that you can trust the parts you did not watch. That trust is not earned by a good demo. It is earned by a system that catches its own mistakes and tells you when it is unsure.

The Bottom Line

The flashy part of agent building is the capability. Watch it browse, write code, call APIs, chain tools, finish the job.

The part that decides whether it survives contact with real work is quieter. It is the measurement underneath: the evals, the traces, the step-level checks, the honest reliability numbers.

Agents do not usually fail loudly. They fail in the middle, with a clean face, in a way that passes the checks you remembered to write.

So write the checks you would rather not think about. Measure the steps, not just the endings. Treat every quiet failure as the next test case.

The agents that win will not be the ones that demo best.

They will be the ones you can prove are right.