Stanford University’s Institute for Human-Centered Artificial Intelligence released its ninth annual AI Index Report on April 13, 2026. The 400-page assessment, compiled by researchers led by computer scientist Yolanda Gil at the University of Southern California, tracks the state of AI across performance benchmarks, investment, adoption, environmental cost, and regulation. For anyone building or deploying autonomous agents, one number cuts through the noise: the best AI agents still score roughly half as well as human specialists with PhDs on complex multistep workflows.
“Agents are wonderful, but we are still far from a place where we understand how to use them effectively,” Gil told Nature.
That finding arrives alongside $581 billion in AI investment in 2025, more than double the prior year’s total. The industry is deploying agents at scale while the data says those agents fail at expert-level tasks roughly half the time.
The Agent Performance Gap
The report’s agent findings draw on multiple benchmarks. On PaperArena, which tests LLM-based agents on scientific research workflows (formulating reasoning plans, interacting with papers, invoking tools), even the best agent achieved just 39% accuracy, according to Nature. OSWorld, which benchmarks autonomous computer use, and SWE-Bench Verified, which tests autonomous coding, have seen the steepest improvements of any benchmark category, per IEEE Spectrum. SWE-Bench Verified scores jumped from roughly 60% in 2024 to near 100% in 2025.
The divergence matters. Agents that write and debug code are approaching human performance. Agents that handle complex, multistep reasoning across ambiguous real-world tasks are not. The report describes this as “jagged intelligence”: AI excels in data-rich domains where training data maps closely to the task, and fails where tasks require judgment, physical world understanding, or multi-domain reasoning.
Robots succeed in just 12% of household tasks, according to MIT Technology Review. Claude Opus 4.6, which scores among the best models on Humanity’s Last Exam (over 50% accuracy on questions designed by subject-matter experts to represent the hardest problems in their fields), reads analog clocks correctly just 8.9% of the time on ClockBench, as IEEE Spectrum reports.
“I am stunned that this technology continues to improve, and it’s just not plateauing in any way,” Gil told MIT Technology Review. The benchmarks keep falling, and the gaps in agent capability keep getting exposed.
The Broken Yardstick
The report is blunt about the measurement problem: benchmarks are failing to keep pace with the models they are meant to measure. Models blow past test ceilings so quickly that scores become meaningless within months. A popular math benchmark has a 42% error rate in its own questions, according to MIT Technology Review. Models trained on benchmark test data learn to score well without gaining genuine capability.
Ray Perrault, co-director of the AI Index steering committee, put it plainly to IEEE Spectrum: “Knowing that a benchmark for legal reasoning has 75 percent accuracy tells us little about how well it would fit in a law practice’s activities.”
For complex, interactive technologies like AI agents and robots, benchmarks barely exist yet, according to MIT Technology Review. The industry is deploying agents commercially while the tools to measure whether they work remain in early drafts.
Companies are also selectively reporting. “A lot of companies are not releasing how their models do in certain benchmarks, particularly the responsible-AI benchmarks,” Gil told MIT Technology Review. “The absence of how your model is doing on a benchmark maybe says something.”
China Closes the Gap
The US-China AI performance gap has effectively disappeared. On the Arena ranking platform, which uses community-driven head-to-head comparisons, US and Chinese models now regularly trade places at the top, as SiliconAngle reports. As of March 2026, Anthropic leads, trailed closely by xAI, Google, and OpenAI. Chinese models from DeepSeek and Alibaba lag by razor-thin margins, according to MIT Technology Review.
The countries’ advantages are asymmetric. The US leads in capital, data center infrastructure (5,427 data centers, more than 10 times as many as any other country), and top-line model performance. China leads in AI research publications, patents, and robotics deployment: 295,000 industrial robots installed in 2024 versus 34,200 in the US, according to IEEE Spectrum.
The race is no longer a two-horse contest. South Korea has emerged as the global leader in AI “innovation density,” filing more patents per capita than any other country, per SiliconAngle. Forty-four nations now have state-backed supercomputing clusters. But the report warns of a widening digital divide: South American and Middle Eastern nations lag significantly, and countries that cannot shape AI development are unlikely to see its economic benefits.
Transparency Collapse
More than 90% of all notable AI models now come from private companies, up from under 50% in 2015, according to IEEE Spectrum. The shift from academic to corporate AI development has come with a sharp decline in openness.
OpenAI, Anthropic, and Google have all stopped disclosing training code, parameter counts, and dataset sizes, according to SiliconAngle and MIT Technology Review. Eighty of the 95 most notable models released in 2025 did not disclose training data details.
“We don’t know a lot of things about predicting model behaviors,” Gil told MIT Technology Review. The lack of transparency makes it harder for independent researchers to study how to make models safer.
Meanwhile, the AI industry has tripled its share of witnesses at US congressional hearings since 2017, while the presence of neutral academics has plummeted, per SiliconAngle. The companies building the technology are increasingly the ones explaining it to lawmakers.
The Environmental Bill
Global AI data centers now draw 29.6 gigawatts of power, enough to run the entire state of New York at peak demand, according to MIT Technology Review and SiliconAngle. Annual water use from GPT-4o inference alone may exceed the drinking water needs of 12 million people.
Training costs are escalating. xAI’s Grok 4 generated an estimated 72,000 tons of CO2-equivalent emissions during training, per IEEE Spectrum. That compares to 5,184 tons for GPT-4 and 8,930 tons for Meta’s Llama 3.1 405B. Epoch AI independently estimates Grok 4’s emissions at approximately 140,000 tons, Perrault told IEEE Spectrum, though he cautioned that these figures rely on inferred inputs.
Inference efficiency varies wildly. DeepSeek’s V3 models consume approximately 23 watt-hours per medium-length prompt response, while Claude 4 Opus consumes about 5 watt-hours, according to IEEE Spectrum. For anyone running agent fleets at scale, model selection has direct energy cost implications that multiply with every autonomous loop.
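To make that multiplier concrete, here is a back-of-the-envelope sketch in Python. Only the per-response watt-hour figures come from the reporting above; the model labels, fleet size, and calls per run are illustrative assumptions, not numbers from the report.

```python
# Back-of-the-envelope energy estimate for an agent fleet.
# The per-response watt-hour values echo the figures reported above;
# everything else (fleet size, calls per run) is an illustrative assumption.

WH_PER_RESPONSE = {
    "deepseek-v3": 23.0,   # Wh per medium-length response, as reported
    "claude-4-opus": 5.0,  # Wh per medium-length response, as reported
}

def fleet_energy_kwh(model: str, calls_per_run: int, runs_per_day: int, days: int = 365) -> float:
    """Total energy in kWh for an agent fleet over a given number of days."""
    wh = WH_PER_RESPONSE[model] * calls_per_run * runs_per_day * days
    return wh / 1000.0

if __name__ == "__main__":
    # Hypothetical fleet: 10,000 agent runs per day, 40 model calls per run.
    for model in WH_PER_RESPONSE:
        kwh = fleet_energy_kwh(model, calls_per_run=40, runs_per_day=10_000)
        print(f"{model}: {kwh:,.0f} kWh per year")
```

Even the more efficient per-response figure puts this hypothetical fleet in the hundreds of thousands of kilowatt-hours per year; the exact totals matter less than the fact that every extra call in an autonomous loop scales them linearly.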
The Money and the Jobs
AI investment hit $581 billion in 2025, more than double the $253 billion in 2024 and surpassing the previous record of $360 billion set in 2021, per IEEE Spectrum. Unlike 2021, which was driven by mergers and acquisitions, 2025’s total was led by private investment into AI companies. Most of that money, over $344 billion, flowed into the United States.
World AI compute capacity has grown 3.3 times annually since 2022, a 30-fold increase since 2021, according to IEEE Spectrum. Nvidia GPUs account for over 60% of total global AI compute capacity.
Adoption is outpacing every prior technology wave. Fifty-three percent of the world’s population now uses generative AI regularly, a faster adoption curve than the personal computer, the internet, or smartphones, per SiliconAngle. An estimated 88% of organizations use AI. GitHub hosts 5.58 million AI-related projects as of 2025, a five-fold increase since 2020, according to IEEE Spectrum.
The job impact data is early but directional. Employment for software developers aged 22 to 25 has fallen nearly 20% since 2022, according to a Stanford economics study cited by MIT Technology Review. A third of organizations surveyed by McKinsey expect AI to shrink their workforce in the coming year, particularly in service operations, supply chains, and software engineering. AI is boosting productivity by 14% in customer service and 26% in software development, but those gains are not seen in tasks requiring more judgment.
The public knows this. While 73% of AI experts think AI will positively impact jobs, just 23% of the American public agrees, according to MIT Technology Review. Only 31% of Americans trust their government to regulate AI appropriately, the second-lowest score among surveyed nations, ahead of only China at 27%.
The Calibration Problem for Agent Builders
The 2026 AI Index arrives at a particular moment for the agent ecosystem. Enterprise deployments are accelerating. HubSpot reported 8,000 customers activating Breeze agents last week. Oracle launched 22 autonomous agents for enterprise workflows. EY rolled out multi-agent systems for 130,000 auditors. Investment in agent-adjacent companies set records in Q1 2026.
The Stanford data provides a calibration check for that deployment wave. Agents that handle well-defined, code-centric tasks are approaching human performance, as SWE-Bench Verified scores near 100% confirm. Agents handling complex, multistep, ambiguous workflows in the real world remain at roughly 50% of PhD-level human performance, as PaperArena’s 39% confirms.
The gap between those two numbers is the gap between what most enterprise agent deployments promise and what the technology currently delivers. The report does not say agents are useless. It says the best agents, powered by the best models, still fail at expert-level multistep work roughly half the time.
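One way to build intuition for that “roughly half the time” figure is a simple compounding model: when a workflow takes many steps and every step must succeed, per-step reliability that looks high still yields coin-flip odds end to end. The sketch below assumes independent steps, a simplification of this article’s own, not something the report models.

```python
# Illustration only: how per-step reliability compounds across a multistep workflow,
# assuming steps are independent (a simplification, not the report's methodology).

def end_to_end_success(per_step: float, steps: int) -> float:
    """Probability the whole workflow succeeds if every step must succeed."""
    return per_step ** steps

def required_per_step(target: float, steps: int) -> float:
    """Per-step reliability needed to hit a target end-to-end success rate."""
    return target ** (1.0 / steps)

if __name__ == "__main__":
    # An agent that nails 93% of individual steps completes only about
    # half of a 10-step workflow end to end.
    print(f"{end_to_end_success(0.93, 10):.2f}")   # ~0.48
    # Finishing 9 out of 10 ten-step workflows requires ~99% per-step reliability.
    print(f"{required_per_step(0.90, 10):.3f}")    # ~0.990
```

The inverse calculation is the one builders need: lifting end-to-end success from a coin flip to 90% on a ten-step workflow means pushing every individual step to roughly 99%.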
For builders, the takeaway is architectural: systems designed around agents that fail half the time require different guardrails, fallback patterns, and human-in-the-loop designs than systems built on the assumption that agents will “just work.” The companies deploying agents fastest are not necessarily the ones deploying them best.
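What that can look like in practice: the sketch below shows one common guardrail pattern, run the agent, verify the output independently, retry a bounded number of times, then escalate to a human. Every name and signature here is hypothetical; this is not the API of any particular agent framework.

```python
# Minimal sketch of a verify-retry-escalate guardrail for unreliable agents.
# All names and signatures are hypothetical, not tied to any real framework.

from dataclasses import dataclass
from typing import Callable

@dataclass
class AgentResult:
    output: str
    escalated_to_human: bool

def run_with_fallback(
    agent: Callable[[str], str],          # the agent call (hypothetical signature)
    verify: Callable[[str, str], bool],   # independent check of task vs. output
    task: str,
    max_attempts: int = 3,
) -> AgentResult:
    """Retry until verification passes; escalate to a person after repeated failure."""
    for _ in range(max_attempts):
        output = agent(task)
        if verify(task, output):
            return AgentResult(output=output, escalated_to_human=False)
    # Verification never passed: hand the task to a human instead of shipping it.
    return AgentResult(output="", escalated_to_human=True)
```

The load-bearing piece is the verifier: without an independent check on the agent’s output, nothing ever triggers the retry or the escalation, and the system quietly ships its failures.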
Gil’s summary lands cleanly: “We are still far from a place where we understand how to use them effectively.” The $581 billion flowing into the industry suggests the market disagrees. The Stanford data suggests the market should check its benchmarks.