Stanford University’s Institute for Human-Centered Artificial Intelligence (HAI) published its ninth annual AI Index Report this week, and the agent performance data is the headline: success rates for AI agents completing real computer tasks on OSWorld jumped from approximately 12% to 66% in a single year. Terminal-Bench agents improved from 20% to 77.3%. On SWE-bench Verified, the coding benchmark, model performance rose from 60% to near 100% of the human baseline over the same period, according to the report.
These are not vendor claims. The Stanford AI Index is the most widely cited independent measurement of AI progress across the industry.
What the Benchmarks Actually Test
OSWorld measures AI agents’ ability to complete real tasks across actual operating system interfaces: desktop GUIs, web browsers, file systems. In 2024, agents could successfully complete about one in eight of these tasks; in 2026, they complete roughly two in three. As AI Advisory Boards noted, the report introduces the phrase “jagged frontier”: the same model that wins at olympiad-level mathematics reads an analog clock correctly just 50.1% of the time.
Terminal-Bench tests real-world terminal tasks, where agents went from 20% to 77.3% success. Agents handling cybersecurity issues solved problems 93% of the time, up from 15% in 2024, according to the Stanford report.
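To make the headline percentages concrete, here is a minimal sketch of how a harness for an agent benchmark of this kind turns per-task pass/fail outcomes into a success rate. The task names and results below are hypothetical placeholders, not actual OSWorld or Terminal-Bench data.

```python
# Minimal sketch: aggregating per-task pass/fail outcomes into a success rate.
# Task IDs and outcomes are hypothetical, not real benchmark results.
from dataclasses import dataclass


@dataclass
class TaskResult:
    task_id: str   # e.g. "merge-csv-files" (hypothetical task name)
    passed: bool   # did the agent's final state satisfy the task's checker?


def success_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks the agent completed successfully."""
    if not results:
        return 0.0
    return sum(r.passed for r in results) / len(results)


# Hypothetical run: 8 GUI/terminal tasks, 5 passed -> 62.5% success.
run = [
    TaskResult("change-wallpaper", True),
    TaskResult("merge-csv-files", True),
    TaskResult("install-extension", False),
    TaskResult("export-slides-to-pdf", True),
    TaskResult("read-analog-clock", False),
    TaskResult("compress-directory", True),
    TaskResult("rename-photos-by-date", True),
    TaskResult("configure-vpn", False),
]
print(f"success rate: {success_rate(run):.1%}")  # -> success rate: 62.5%
```

The benchmark numbers in the report are simply this kind of tally at larger scale: in 2024 roughly one check in eight came back true, in 2026 roughly two in three.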
The Coding Threshold
The SWE-bench Verified coding benchmark result may be the most consequential for builders. Going from 60% to near 100% of the human baseline in one year means AI agents are now effectively at human-level performance on standardized software engineering tasks. The Decoder confirmed the figure, noting that “on the SWE-bench Verified coding benchmark, performance jumped from 60 to nearly 100 percent of the human baseline in a single year.”
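For readers parsing what “percent of the human baseline” means in practice, the short sketch below normalizes a model’s resolved-issue rate against a human reference score. The figures are illustrative assumptions, not the AI Index’s underlying data.

```python
# Illustrative only: expressing a benchmark score as a percentage of a human baseline.
# All numbers below are placeholder assumptions, not AI Index figures.
def percent_of_human_baseline(model_score: float, human_score: float) -> float:
    """Return the model's score as a fraction of the human reference score."""
    return model_score / human_score


human_resolved = 0.70   # hypothetical human baseline: 70% of issues resolved
model_2024 = 0.42       # hypothetical 2024 model score  -> 60% of baseline
model_2025 = 0.69       # hypothetical current score     -> ~99% of baseline

print(f"{percent_of_human_baseline(model_2024, human_resolved):.0%} of human baseline")  # 60%
print(f"{percent_of_human_baseline(model_2025, human_resolved):.0%} of human baseline")  # 99%
```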
Beyond coding, Google’s Gemini Deep Think won a gold medal at the International Mathematical Olympiad, as reported by IEEE Spectrum.
Adoption at 88%, But Entry-Level Jobs Are Shrinking
Organizational AI adoption reached 88% globally. Generative AI reached 53% of the population within three years, a faster adoption curve than either the PC or the internet.
The productivity data is real: gains of 14% to 26% in customer support and software development, and up to 72% in marketing teams, according to The Decoder.
But the report documents a corresponding contraction: U.S. software developers aged 22 to 25 saw employment fall nearly 20% since 2024, even as headcount for older developers continued to grow. AI is creating measurable value and displacing early-career workers simultaneously.
The U.S.-China Gap Has Closed
The performance gap between leading U.S. and Chinese AI models has effectively closed. Since early 2025, models from both countries have traded the top spot. As of March 2026, Anthropic’s leading model holds a 2.7% edge. The U.S. still leads in investment ($285.9 billion in private AI investment in 2025, 23 times China’s $12.4 billion) and company formation (1,953 newly funded AI companies). China leads in publication volume, citations, and industrial robotics, per IEEE Spectrum.
Safety Research Lags Capability
The report flags that responsible AI reporting “remains inconsistent” among frontier model developers. Documented AI incidents rose to 362, up from 233 in 2024. Research found that improving one responsible AI dimension, such as safety, can degrade another, such as accuracy. The U.S. ranks last among surveyed countries in public trust in government AI regulation, at 31%.
The Production Readiness Question
The Stanford data provides empirical grounding for a narrative that has been building across the industry this week. The question is no longer whether AI agents can perform reliably enough to deploy. The OSWorld data answers that for most task categories. The question has shifted to which organizational contexts, governance frameworks, and pricing models are ready for autonomous agent deployment at scale.