Stanford’s 2026 AI Index Report, released this month, contains two findings that enterprise leaders need to read side by side. AI agents now complete 66% of real computer tasks on the OSWorld benchmark, a 54-percentage-point jump from 12% a year ago. In the same report, a randomized controlled trial by METR found that experienced open-source developers completed coding tasks approximately 19% slower when given access to frontier AI tools compared to working without them.
The capability story and the productivity story are diverging. That gap has real implications for how companies plan their agent investments in 2026.
The OSWorld Jump: 12% to 66%
The OSWorld benchmark tests AI agents on actual computer tasks: navigating software interfaces, manipulating files, executing multi-step workflows across operating systems. According to the Stanford HAI report, agent success rates on these tasks climbed from 12% in March 2025 to 66.3% by March 2026, bringing agents within roughly 6 percentage points of measured human performance (about 72%) on the same benchmark.
This is not a narrow coding benchmark. OSWorld measures whether agents can do the kind of work that fills an average knowledge worker’s day: processing documents, managing databases, coordinating between applications. Two-thirds success on these tasks represents a threshold where agent deployment in structured, high-volume workflows becomes viable, according to analysis from BERI.
Other benchmarks in the report tell a similar story. On SWE-bench Verified, which tests agents on real software engineering tasks from open-source projects, model performance jumped from 60% to near 100% of the human baseline in a single year, per the Stanford HAI takeaways.
The METR Paradox: Capable Tools, Slower Work
Buried in the same report is a finding that contradicts the dominant vendor narrative on developer productivity. METR (Model Evaluation and Threat Research) ran a randomized controlled trial with experienced open-source contributors working on their own repositories, projects they already knew thoroughly. Half the tasks were completed with AI coding assistants, half without.
The developers predicted AI would make them 24% faster. After the study, they still believed AI had made them roughly 20% faster. The measured result: 19% slower, with a confidence interval between 2% and 39% slower. That interval sits entirely on the slower side of zero, so the result is not noise around no effect; the study measured a genuine slowdown.
As Ars Technica reported when the study was first published, the gap between perceived and actual productivity was 39 percentage points. The tools felt fast. The stopwatch disagreed.
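The 39 points follow directly from the two estimates sitting on opposite sides of zero: the gap between a perceived speedup and a measured slowdown is their sum.

```latex
% Perceived and measured effects lie on opposite sides of zero,
% so the perception gap is the distance between them:
\text{gap}
  = \underbrace{(+20\%)}_{\text{perceived speedup}}
  - \underbrace{(-19\%)}_{\text{measured slowdown}}
  = 39 \text{ percentage points}
```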
The slowdown appears to stem from overhead that developers don’t register as lost time: reviewing AI-generated code, re-prompting after incorrect outputs, debugging subtle errors they didn’t write, and the cognitive load of evaluating suggestions mid-flow. None of that feels like friction in the moment. It accumulates in the data.
Two Findings, One Planning Problem
The two findings are in tension, but they do not contradict each other. OSWorld measures what agents can do autonomously on structured tasks. The METR study measures what happens when experienced humans collaborate with AI tools on tasks those humans already know how to do. These are different deployment models.
Agents executing high-volume, repeatable workflows (claims processing, document extraction, compliance checking) operate in the OSWorld paradigm: give the agent a defined task, let it run, measure output. The 66% success rate applies here.
Developers using AI coding assistants operate in the METR paradigm: a human expert is already fast at the task, and the tool introduces both generation speed and validation overhead. For this group, the net effect may be negative until tools and workflows mature.
For enterprise planning, the report suggests a clear segmentation. Agent autonomy works where tasks are structured and high-volume. Human-AI collaboration works where the human is learning (junior developers) or the task is genuinely novel. For experienced practitioners on familiar codebases, the productivity case is weaker than vendor marketing suggests.
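One rough way to pressure-test that segmentation is to compare expected human minutes per task under the two paradigms. In the sketch below, only the 0.66 success rate (OSWorld) and the 1.19 slowdown multiplier (METR) come from the report; the task time, review time, and fallback model are illustrative placeholders, not measured values.

```python
# Back-of-the-envelope comparison of the two deployment paradigms.
# The 0.66 agent success rate (OSWorld) and the 1.19 human+AI slowdown
# multiplier (METR) come from the Stanford report; every other number
# is an illustrative assumption, not a measured value.

AGENT_SUCCESS_RATE = 0.66      # OSWorld, March 2026
AI_ASSIST_MULTIPLIER = 1.19    # METR: experienced devs were ~19% slower

def autonomous_agent_minutes(human_minutes: float,
                             review_minutes: float) -> float:
    """Expected human minutes per task when an agent runs it end to end.

    Successful runs still get a quick human review; failed runs fall
    back to a human doing the task from scratch, plus the review time.
    """
    success = AGENT_SUCCESS_RATE * review_minutes
    failure = (1 - AGENT_SUCCESS_RATE) * (review_minutes + human_minutes)
    return success + failure

def ai_assisted_minutes(human_minutes: float) -> float:
    """Expected minutes when an experienced human works with AI assist."""
    return human_minutes * AI_ASSIST_MULTIPLIER

if __name__ == "__main__":
    # Hypothetical structured task: 30 min by hand, 5 min to review.
    task, review = 30.0, 5.0
    print(f"human alone:      {task:.1f} min")
    print(f"agent + fallback: {autonomous_agent_minutes(task, review):.1f} min")
    print(f"human + AI tool:  {ai_assisted_minutes(task):.1f} min")
```

Under these toy numbers, the autonomous configuration cuts human time on a structured task roughly in half, while the assistant configuration adds about a fifth: the report's segmentation in miniature. A real plan would substitute measured review and fallback costs for the placeholders.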
The Adoption Numbers
The broader Stanford report shows organizational AI adoption hitting 88% in 2026, up from 78% a year earlier. Four in five university students now use generative AI. Global private AI investment reached new highs.
The capability surge is not in question. The question is whether enterprises are deploying agents in the right configurations, measuring the right outcomes, and not confusing capability benchmarks with productivity gains. Stanford's report, taken as a whole, suggests many are making exactly that mistake: reading capability numbers as if they were productivity numbers.