Anthropic Says Claude Writes 80% of Its Production Code and Warns Recursive Self-Improvement Is Approaching

Anthropic published previously unreported internal data on June 4 showing that Claude now authors more than 80% of all code merged into the company’s production codebase. The figure sat in the low single digits before Claude Code launched in research preview in February 2025. In a companion BBC Newsnight interview, co-founder Jack Clark warned that AI is nearing a point where it could develop without human input and called for regulatory “brake pedal” mechanisms that do not yet exist.

The report, published through the newly formed Anthropic Institute, frames these numbers as evidence that recursive self-improvement, the scenario in which an AI system autonomously designs and builds its own successor, “could come sooner than most institutions are prepared for.”

The Numbers Inside Anthropic

The headline figure, 80% of merged code authored by Claude, is striking on its own. The trajectory behind it is more revealing.

Anthropic’s internal data shows lines of code merged per engineer per day held steady across the company’s first four years (2021 to 2024). The line began climbing in 2025 when Claude shifted from suggesting code snippets to executing code directly. It steepened again in 2026 as models began working autonomously over longer time horizons. By Q2 2026, the typical Anthropic engineer was merging 8x as much code per day as in 2024, according to the VentureBeat analysis of the same data.

Anthropic acknowledges that lines of code is an imperfect metric. “8x lines of code/engineer/day in the second quarter of 2026 is almost certainly an overstatement of the true productivity gain,” the report states. But the company also cites a March 2026 poll of 130 employees across research teams: the median respondent estimated roughly 4x output with Mythos Preview compared to working without AI models, according to the Anthropic Institute report.

The quality of Claude’s output is converging with human engineers as well. On the most open-ended engineering tasks, those with no clear specification, Claude’s success rate hit 76% in May 2026, a 50 percentage point increase in six months. Anthropic reports that many staff now believe Claude-written code “was still worse in quality than human-written code at Anthropic in late 2025, and is roughly at parity today.” The company expects AI-written code to be “strictly better” within the year.

One case study from the report: in April 2026, Claude shipped over 800 fixes that reduced a class of API errors by a factor of one thousand. The overseeing engineer estimated a human would have taken four years to complete the same work.

Research Capability Is Accelerating Too

Code output is only half the picture. The Anthropic Institute report provides data on Claude’s growing role in research experimentation, a capability closer to the core of what recursive self-improvement would require.

Every time Anthropic releases a new model, it runs a standard benchmark: hand Claude some code that trains a small AI model and ask it to optimize for speed while passing correctness checks. In May 2025, Claude Opus 4 averaged a roughly 3x speedup. By April 2026, Claude Mythos Preview achieved 52x, according to Anthropic’s published model cards. For comparison, a skilled human researcher needs four to eight hours to reach 4x on the same task.

More significantly, in April 2026 Anthropic published a demonstration of Claude running an open-ended AI safety research project end to end. Agents were given the problem of whether a weaker model can reliably supervise a stronger one. They proposed hypotheses, ran experiments, shared findings with parallel agents, and iterated. Two human researchers recovered roughly 23% of the gap between floor and ceiling performance over about a week. The agents recovered 97% over 800 cumulative hours at roughly $18,000 in compute. Humans chose the problem and scoring rubric. The agents designed every experiment.

Anthropic also reports a measure of research judgment. Examining 129 real Claude Code sessions where researchers made suboptimal directional decisions, the company tested whether various Claude models would have chosen a better next step. Opus 4.5 (November 2025) beat the human choice 51% of the time. Mythos Preview (April 2026) reached 64%.

The External Benchmark Picture

The internal data sits alongside public benchmark trajectories that tell a consistent story. METR, the independent AI evaluation organization, measures the length of tasks that frontier models can reliably complete autonomously. That duration has been doubling roughly every four months, up from a prior trend of doubling every seven months, as reported by Interesting Engineering.

The progression: Claude Opus 3 handled four-minute tasks in March 2024. Claude Sonnet 3.7 managed roughly 90-minute tasks a year later. Claude Opus 4.6 handled 12-hour tasks by early 2026. METR found that Claude Mythos Preview could work for “at least” 16 hours and was “at the upper end of what METR can measure without new tasks,” according to the Anthropic Institute report.

If the four-month doubling trend holds, models capable of multi-day autonomous tasks could arrive later this year. Multi-week autonomous tasks could be within range by 2027.

SWE-bench, the standard real-world software engineering evaluation, went from models scoring in the low single digits to saturation in two years. CORE-Bench, which tests whether AI can reproduce published scientific research, went from 20% success in 2024 to near-saturation within fifteen months.

Clark’s Regulatory Warning

Co-founder Jack Clark’s BBC Newsnight interview put the data in policy terms. “Right now, it’s like the AI industry has a gas pedal, but it doesn’t have a brake pedal,” Clark said. “You want the option to be able to take your foot off the gas and put your foot on the brake.”

Clark told the BBC that getting to 100% AI-authored code is “possible within two years” and “would have huge implications.” He drew a parallel to the early oil industry: “Society’s response was to come up with a sensible policy and regulatory framework that gave people confidence in oil and the benefits that oil could provide to the world, and meant that you didn’t have to worry about the personalities of the people leading the companies. That’s clearly where we end up here.”

Clark also flagged labor disruption. “I am worried for my kids if we as a society don’t have a serious conversation about what the implications of AI’s continued advances mean,” he told BBC Newsnight. On creative work specifically, Clark noted: “There are open questions about whether AI systems can be truly creative… there is not really evidence for that yet.”

Anthropic plans to “engage lawmakers about recursive self-improvement in the coming months,” according to Axios. OpenAI has also raised concerns: in a December 2025 blog post, the company described recursive self-improvement as “a potentially dangerous phenomenon if researchers don’t share information about it,” Axios reported.

Three Futures, One Bottleneck

The Anthropic Institute report maps three scenarios for what comes next.

In the first, the trend stalls. Current capabilities diffuse across the economy but models stop getting meaningfully better at research judgment. Even frozen at today’s level, 100-person companies could do the work of 1,000-person organizations, the report argues. Anthropic explicitly says it does not believe this scenario is likely.

In the second, AI development becomes “substantially automated” but humans continue to set research directions. Organizations see compounding efficiency gains. This is where the report says “the evidence we’ve laid out suggests we’re likely heading.”

In the third, AI systems achieve full recursive self-improvement and begin building their own successors. “The pace of progress in AI development becomes determined entirely by the availability of compute,” with humans shifting to “oversight, validation, and verification of an expanding virtual lab run by AI systems.”

VentureBeat’s analysis highlights a practical constraint Anthropic has already hit: Amdahl’s law. Flooding the system with AI-generated code immediately turned human code review into a critical bottleneck. The company deployed automated Claude reviewers in its CI/CD pipeline to address this, and found through retrospective analysis that automated review would have caught roughly a third of bugs behind past claude.ai production incidents.

The Remaining Gap

The gap between current reality and recursive self-improvement is narrow but significant. The Anthropic Institute report is precise about where it lies: “research taste and judgment, including choosing which problems matter, which results to trust, and when an approach is a dead end.”

Claude can execute engineering tasks from underspecified goals. It can optimize code at superhuman levels. It can run open-ended experiments and, 64% of the time, suggest better research directions than human researchers on novel problems. What it cannot yet do is decide which problems are worth working on.

Anthropic’s report is candid that this gap may close: “Research taste might be just another AI capability that AI systems fail at for a time, then get good at.” The company points to other qualitative capabilities, like theory of mind and linguistic riddle-solving, that followed the same pattern.

The Anthropic Institute states it will “conduct research and take actions to help build the systems that a credible slowdown or pause would require.” That framing is notable: the company acknowledges it would “likely be a good thing” to slow development for societal preparation, but argues unilateral slowdowns risk letting “the least cautious actors catch up technologically.”

One anonymous Anthropic employee quoted in the report captured the tension: “On days where everything works well, I can’t help but think nothing I do matters, everything is automated and better and faster than I ever will be. But then there are days where everything breaks and I don’t understand why and I realize I have no idea what I’ve been up to anymore.”

For agent builders and platform operators, the data from this report forces a concrete question: if the company building your foundation model is already 80% automated in its own engineering, and that percentage is rising, what does your team’s relationship with these systems look like in twelve months?

Anthropic Says Claude Writes 80% of Its Production Code and Warns Recursive Self-Improvement Is Approaching

The Numbers Inside Anthropic

Research Capability Is Accelerating Too

The External Benchmark Picture

Clark’s Regulatory Warning

Three Futures, One Bottleneck

The Remaining Gap

Get our morning briefing in your inbox

Keep Reading

Microsoft Scout Built on OpenClaw Signals Agents Are Becoming Enterprise Infrastructure

OpenAI, Anthropic, and Vercel Ship Agent Workflow Infrastructure Within Hours of Each Other

DeepSeek's $7.4B Round and OpenAI's $34B Burn Rate Reveal Two Competing Models for Funding the AI Arms Race