Andrej Karpathy, an OpenAI co-founder, has issued what amounts to a three-step intervention for the AI agent industry. In a recent sharing session reported by BigGo Finance, Karpathy argued that the current wave of agent investment is repeating a mistake OpenAI made a decade ago.
In 2016, OpenAI attempted to build agents that could book flights and order food using reinforcement learning. The project failed completely and cost the company five years before it pivoted to language models. “The only hammer the team had was reinforcement learning,” Karpathy said, according to BigGo Finance. “The correct move at that point should have been to completely forget about AI agents and focus all energy on building language models.”
Karpathy’s Three-Step Logic
His framework breaks down into three claims. First, stop forcing agents to do everything and focus instead on understanding the underlying model. Second, recognize that demos are easy but products take a decade. Third, agents are not the product; the foundation model is the core. When the foundation is solid, agents emerge naturally.
The argument finds support in data released the same week. OpenAI’s own GeneBench-Pro benchmark, reported by BigGo Finance, showed that GPT-5.6 Sol achieved only a 28.7% end-to-end pass rate across 129 real-world scientific research workflows. Claude Opus 4.8 managed 16.0%. OpenAI called the gap between noticing data anomalies and actually acting on them the “notice-act gap,” a term that neatly captures the shortfall Karpathy described.
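To put those percentages in concrete terms, the short calculation below converts each reported pass rate into an approximate count of workflows. It is a rough sketch that assumes each of the 129 workflows was scored once as a simple pass or fail; the report does not describe the scoring method, so the counts are illustrative only. The function name `approx_passed` is a hypothetical helper, not part of any benchmark tooling.

```python
def approx_passed(pass_rate_percent: float, total_workflows: int) -> int:
    """Approximate number of workflows passed.

    Assumption: one pass/fail attempt per workflow, so the pass rate
    maps directly onto a whole number of workflows.
    """
    return round(pass_rate_percent / 100 * total_workflows)


TOTAL_WORKFLOWS = 129  # figure reported for GeneBench-Pro
reported_rates = {"GPT-5.6 Sol": 28.7, "Claude Opus 4.8": 16.0}

for model, rate in reported_rates.items():
    passed = approx_passed(rate, TOTAL_WORKFLOWS)
    failed = TOTAL_WORKFLOWS - passed
    print(f"{model}: roughly {passed} of {TOTAL_WORKFLOWS} passed, {failed} did not")
```

Under that assumption, even the leading model completes roughly 37 of the 129 workflows end to end and fails the other 92 or so.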
Labs Building Around the Shortfall
Each major lab is handling the shortfall differently, BigGo Finance reported. Anthropic released Claude Science, a research workbench that explicitly does not rely on new models. Instead, it orchestrates existing capabilities through workflows, connecting to more than 60 scientific databases and using the Model Context Protocol (MCP) to call external vertical models for specific computations. Claude handles task decomposition and interpretation but outsources the heavy specialized work.
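The division of labor described here, in which a general model plans and interprets while specialized tools do the heavy computation, can be sketched in a few lines. This is a minimal illustration of the orchestration pattern, not Anthropic's implementation: every name in it (`Step`, `TOOLS`, `decompose`, `run`, and the two tool functions) is hypothetical, the plan is hard-coded where a real system would ask the model, and the tools are local stand-ins for what would be remote calls over MCP.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Step:
    tool: str      # name of the specialized tool to call
    payload: str   # input handed to that tool


# Hypothetical stand-ins for external services. In a real system these
# would be remote calls to databases or vertical models, not local functions.
def database_lookup(query: str) -> str:
    return f"records({query})"


def specialized_compute(data: str) -> str:
    return f"computed({data})"


TOOLS: Dict[str, Callable[[str], str]] = {
    "database_lookup": database_lookup,
    "specialized_compute": specialized_compute,
}


def decompose(task: str) -> List[Step]:
    """Placeholder for the general model's task decomposition (hard-coded)."""
    return [
        Step("database_lookup", task),
        Step("specialized_compute", task),
    ]


def run(task: str) -> List[str]:
    """Dispatch each planned step to its tool and collect the raw results."""
    results = []
    for step in decompose(task):
        tool = TOOLS.get(step.tool)
        if tool is None:
            raise KeyError(f"no tool registered for {step.tool!r}")
        results.append(tool(step.payload))
    return results


print(run("gene expression anomaly"))
```

The point of the pattern is that the orchestrator never performs the specialized work itself; it only decides which tool to call and what to do with the result, which is why this approach does not depend on a new model.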
OpenAI is using GeneBench-Pro to define the standard, then pushing GPT-Rosalind, a model fine-tuned for biological reasoning, to chase high scores on its own benchmark. Google DeepMind holds a structural advantage: AlphaFold and AlphaGenome are proprietary assets integrated into Gemini for Science, giving Google native access to scientific tools that competitors must treat as external APIs, per BigGo Finance.
Independent Developers at the Forefront
Karpathy’s most disruptive claim targets the competitive dynamics. “When a paper on a new type of agent is published, teams at major labs also find it eye-opening, because they haven’t been secretly developing in that specific branch for five years,” he said, per BigGo Finance. “This means the giants must compete on equal footing with all grassroots entrepreneurs and hackers in this space.”
Pharmaceutical giant Novo Nordisk appears simultaneously on Anthropic’s Claude Science client list and OpenAI’s list of early Rosalind partners, BigGo Finance reported. The same client trialing multiple vendors in parallel suggests no single lab has locked in the market.
The Autonomous Driving Parallel
For builders running agent systems today, Karpathy’s comparison to autonomous driving carries weight. Self-driving has already demonstrated a decade-long gap between demo and product. Agent systems fit the same pattern. The current generation of foundation models can notice problems but cannot reliably act on them across complex multi-step workflows. Until that gap closes, agent frameworks are engineering workarounds for model limitations, not durable products.
Karpathy’s closing message to developers was both a warning and an encouragement: “You are at the very forefront of this transformative technology.”