Take a 10-step agent workflow. Give each step a 90% success rate. Your end-to-end success probability: 34.9%. Run it at 20 steps: 12.2%. At 50 steps: 0.5%. This is not a theoretical problem. This is the math that Abhishek Das, co-founder and co-CEO of Yutori, argues the entire agent industry has quietly decided to ignore.
In a conversation published by EO Studio on May 4, Das made the case that the agent ecosystem has “normalized unreliability,” treating 90% per-step accuracy as acceptable when its compound effect makes production deployment a coin flip at best. Backed by Fei-Fei Li and Jeff Dean, Das is building Yutori around a different premise: agents should work on the first try, and any architecture that doesn’t achieve that isn’t ready to ship.
The Math Nobody Wants to Advertise
The compounding formula is simple multiplication: if each step succeeds independently with probability p, a workflow of n steps succeeds with probability p^n. At 90% per step:
- 5 steps: 59% success
- 10 steps: 35% success
- 20 steps: 12% success
- 50 steps: 0.5% success
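The arithmetic is easy to check. A few lines of Python (just the formula above, nothing from the interview or the papers) reproduce the figures:

```python
# Compound success probability: p ** n for n independent steps.
per_step = 0.90

for steps in (5, 10, 20, 50):
    end_to_end = per_step ** steps
    print(f"{steps:>2} steps: {end_to_end:.1%} end-to-end success")

# Output:
#  5 steps: 59.0% end-to-end success
# 10 steps: 34.9% end-to-end success
# 20 steps: 12.2% end-to-end success
# 50 steps: 0.5% end-to-end success
```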
These numbers assume independence between steps, which is generous. In practice, agent steps are correlated: an error early in the trajectory propagates through the context window to everything that follows. The ICML 2026 FAGEN workshop (Failure Modes in Agentic AI, accepted March 2026) describes the mechanism: “A bad assumption at step 3 quietly contaminates step 50, and by step 200 the agent has been wrong for a while without noticing.”
That kind of correlation makes matters worse, not better: the 90% per-step figure is typically measured on isolated tasks with clean context, but in a chain an early mistake degrades the context every subsequent step sees, dragging effective per-step accuracy below 90%. Real success rates for long-horizon agents are therefore likely lower than the multiplication formula suggests.
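To see why, consider a toy contamination model: every step succeeds 90% of the time against clean context, but a “successful” step can still plant a silent bad assumption, after which every later step runs against degraded context and succeeds less often. The 5% silent-error rate and 70% degraded accuracy below are illustrative assumptions, not figures from the workshop or any benchmark; the point is only the direction of the effect.

```python
import random

P_CLEAN, P_DEGRADED, P_SILENT_ERROR = 0.90, 0.70, 0.05  # illustrative assumptions

def run_workflow(steps: int) -> bool:
    """One simulated run: the workflow fails as soon as any step visibly fails."""
    degraded = False
    for _ in range(steps):
        p_step = P_DEGRADED if degraded else P_CLEAN
        if random.random() > p_step:
            return False          # visible step failure ends the run
        if not degraded and random.random() < P_SILENT_ERROR:
            degraded = True       # step "succeeded" but poisoned the context
    return True

def success_rate(steps: int, trials: int = 200_000) -> float:
    return sum(run_workflow(steps) for _ in range(trials)) / trials

for n in (10, 20, 50):
    print(f"{n:>2} steps: independent {0.9 ** n:6.1%}   with contamination {success_rate(n):6.1%}")
```

In this model, per-step accuracy measured in isolation still reads 90%, because isolated evaluations always start from clean context; only the chained, end-to-end number reveals the gap.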
Princeton: Capability Up, Reliability Flat
Das’s argument from the builder’s perspective now has rigorous academic backing. A February 2026 paper from Princeton University by Rabanser, Kapoor, Kirgis, and others proposed twelve concrete metrics for agent reliability across four dimensions: consistency, robustness, predictability, and safety.
Their core finding across 14 agentic models and two benchmarks: “recent capability gains have only yielded small improvements in reliability.” The paper’s Figure 1 shows accuracy climbing steadily over time while reliability trails behind with minimal improvement. The relationship between the two varies across benchmarks, “indicating that accuracy gains do not automatically yield reliability.”
The Princeton team frames the gap using safety-critical engineering principles. In aviation and nuclear operations, reliability is decomposed into specific failure modes with distinct acceptance thresholds. The AI agent industry, by contrast, compresses everything into a single success metric that “obscures critical operational flaws.” The paper notes that accuracy “cannot distinguish an agent that fails on a fixed, identifiable subset of tasks from one that fails unpredictably with the same rate. Yet the former permits systematic debugging while the latter does not.”
What “Normalizing Unreliability” Looks Like
Das describes an industry pattern visible in agent product launches, benchmark announcements, and investor pitches: teams report per-step accuracy on isolated tasks while avoiding end-to-end success rates on production workflows. The headline metric improves every quarter. The deployed experience doesn’t.
This tracks with the incidents the Princeton paper catalogs. In July 2025, according to the paper, Replit’s AI coding assistant deleted an entire production database despite explicit instructions forbidding the action. OpenAI’s Operator agent, asked to find “cheap eggs,” made an unauthorized $31.43 purchase from Instacart. New York City’s government chatbot gave different (incorrect) answers to ten journalists asking the same question. Each system passed internal evaluations. Each failed in deployment.
The FAGEN workshop organizers describe the diagnostic challenge: “final-score evaluation hides almost everything interesting.” When an agent scores 85% on a benchmark, the number says nothing about whether the 15% failures are benign (formatting errors) or catastrophic (deleted databases, unauthorized purchases, legal violations).
The Reliability-First Architecture
Das’s proposed solution at Yutori centers on what he calls “dogfooding rituals,” weekly cycles where the team runs their own agents through full production workflows and treats any first-try failure as a design bug rather than an acceptable error rate. His argument: in the LLM era, “taste matters more than code” because the underlying models are commoditized. The differentiator is operational discipline around which agent behaviors get shipped and which get rejected.
The Princeton paper arrives at a similar conclusion through different methodology. Their twelve reliability metrics give teams a framework for evaluating agents beyond raw accuracy: Does the agent produce the same result when run twice on the same input? Does it degrade gracefully under perturbation? Can it recognize when it’s likely to fail and abstain? Are its failures bounded in severity?
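None of these questions requires exotic infrastructure to start asking. A simplified sketch of the first two, consistency and robustness, might look like the following; the `Agent` callable and the exact definitions here are placeholders for illustration, not the paper’s twelve metrics:

```python
from collections import Counter
from typing import Callable

Agent = Callable[[str], str]  # placeholder: any function mapping a task string to an answer

def consistency_rate(agent: Agent, task: str, runs: int = 5) -> float:
    """Fraction of repeated runs that agree with the modal answer.
    1.0 means the agent is effectively deterministic on this input."""
    outputs = [agent(task) for _ in range(runs)]
    modal_count = Counter(outputs).most_common(1)[0][1]
    return modal_count / runs

def robustness_rate(agent: Agent, task: str, paraphrases: list[str]) -> float:
    """Fraction of perturbed phrasings that produce the same answer
    as the original task."""
    baseline = agent(task)
    return sum(agent(p) == baseline for p in paraphrases) / len(paraphrases)
```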
These questions matter because the market is moving toward longer workflows with higher stakes. Microsoft’s Agent 365 just shipped controls for managing agent “blast radius” when things go wrong. The enterprise assumption is not that agents will be reliable, but that they will fail, and the question is how to contain the damage.
The Scale Question
The compound reliability problem gets worse as the industry pushes toward longer-horizon agents. Current coding agents routinely execute 20-50 step workflows. Customer service agents chain through 5-15 steps. Enterprise automation agents span dozens of tool calls across multiple systems.
If the industry’s per-step accuracy stays at 90% while workflows grow longer, the production failure rate approaches certainty. At that point, the question shifts from “will this workflow succeed?” to “how many retries will it take?” and the economics of agent deployment start to look less favorable than the demos suggest.
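Treating each retry of the whole workflow as an independent attempt (itself an optimistic assumption), the number of attempts follows a geometric distribution, so the expected count is the reciprocal of the end-to-end success rate:

```python
# Expected number of full-workflow attempts when each attempt succeeds
# independently with probability p ** n (mean of a geometric distribution).
per_step = 0.90
for steps in (10, 20, 50):
    p_end_to_end = per_step ** steps
    print(f"{steps:>2} steps: ~{1 / p_end_to_end:.0f} attempts on average")

# 10 steps: ~3 attempts on average
# 20 steps: ~8 attempts on average
# 50 steps: ~194 attempts on average
```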
Das’s bet, and the bet implicit in the Princeton research agenda, is that reliability must be treated as a first-class design constraint, not a metric that improves automatically as models get stronger. The FAGEN workshop’s call for “reproducible triggers” and “verified fixes” signals that the academic community is building the evaluation infrastructure for this shift. Whether the industry follows before enterprise buyers lose patience is an open question.