The UK AI Safety Institute published new data on May 14 showing that frontier AI models are gaining autonomous cyber capability faster than its own benchmarks can track. The 80%-reliability “cyber time horizon,” which measures the length of cybersecurity tasks models can complete autonomously compared to human experts, has been doubling every 4.7 months since reasoning models emerged in late 2024.

That 4.7-month doubling time is itself an acceleration. In November 2025, AISI estimated the doubling time at 8 months. The rate nearly halved in three months.
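The difference between an 8-month and a 4.7-month doubling time compounds quickly. A minimal sketch of the extrapolation arithmetic (the 1-hour starting horizon is a hypothetical placeholder, not an AISI figure):

```python
def horizon_after(months: float, start_horizon_hours: float,
                  doubling_months: float) -> float:
    """Extrapolate an exponential time-horizon trend forward by `months`."""
    return start_horizon_hours * 2 ** (months / doubling_months)

# Hypothetical starting point: a 1-hour autonomous task horizon.
start = 1.0

# One year of progress at the earlier 8-month doubling estimate:
old_trend = horizon_after(12, start, 8.0)   # 2**(12/8)  ≈ 2.8x

# One year at the revised 4.7-month doubling estimate:
new_trend = horizon_after(12, start, 4.7)   # 2**(12/4.7) ≈ 5.9x
```

Under the revised rate, a year of progress roughly doubles what the old rate would have predicted.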

Claude Mythos Preview and GPT-5.5 Break the Trend

Two models have since outperformed even the accelerated curve. Claude Mythos Preview and GPT-5.5 “significantly outperformed this trend,” according to AISI’s analysis. Both models achieved near-100% success rates on the longest tasks in AISI’s narrow cyber suite under a 2.5 million token limit, pushing them to the measurement ceiling.

“It’s unclear whether this represents an isolated break from existing rates of progress or part of a new, faster trend,” AISI wrote.

Claude Mythos Preview became the first model to complete both of AISI’s “cyber range” simulations, which test sustained multi-step intrusion operations against small undefended enterprise networks after initial access has been gained. The model solved “The Last Ones,” a 32-step simulated corporate network attack, in 6 out of 10 attempts. It completed “Cooling Tower,” a 7-step industrial control system attack that no model had previously solved, in 3 out of 10 attempts. GPT-5.5 completed “The Last Ones” in 3 out of 10 attempts, according to Help Net Security.

The Evaluation Gap

The more significant finding is structural. AISI’s narrow cyber task suite can no longer differentiate between the most capable models. With the 2.5 million token cap in place, Mythos Preview and GPT-5.5 hit near-perfect scores on the hardest tasks. Without the cap, success rates climb high enough that time horizons “become impossible to calculate,” AISI noted.

The token cap exists to keep results comparable across models over time, but it deliberately understates what the models can actually do. In cyber range experiments, AISI uses up to 100 million tokens and found that “performance would likely still improve beyond that budget, especially for recent models.”

This creates a measurement blind spot. The benchmark tasks designed to track capability acceleration are now too easy for the models they are meant to measure.

Corroborating Data from METR

AISI’s findings align with independent research from METR, a nonprofit tracking AI performance on software engineering tasks. METR’s data shows software engineering capability doubling roughly every 4.2 months since late 2024, close to AISI’s 4.7-month cyber estimate.

The convergence matters because cyber operations and software engineering share overlapping skill sets. Consistent doubling times across independent benchmarks from two separate organizations suggest the acceleration is real, not an artifact of any single evaluation methodology.
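To compare the two estimates on a common scale, each doubling time can be converted into an annualized growth factor. This is a back-of-envelope sketch, not a calculation published by AISI or METR:

```python
def annual_multiplier(doubling_months: float) -> float:
    """Annualized capability growth implied by a given doubling time."""
    return 2 ** (12 / doubling_months)

aisi_rate = annual_multiplier(4.7)   # ≈ 5.9x per year (cyber time horizon)
metr_rate = annual_multiplier(4.2)   # ≈ 7.2x per year (software engineering)
```

Both estimates imply capability growing several-fold per year, which is the sense in which the independent trend lines converge.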

What Governance Teams Need to Watch

AISI was careful to note limitations: “No single benchmark result should be read as a precise measure of AI capability,” and the trend line “is not a future prediction, nor a fixed law.” The longest-task estimates rely on only six tasks with durations of eight hours or more.

But the operational implication is concrete. Organizations deploying AI agents in security workflows, whether for vulnerability scanning, incident response, or red-team simulation, may be working with models whose autonomous capability exceeds what current evaluation frameworks can measure. The gap between what these models can do and what testing suites can verify is widening with each generation.

For practical guidance, AISI pointed to the UK’s National Cyber Security Centre, which has urged organizations to prepare for an accelerating vulnerability-and-patch cycle driven by AI-discovered flaws.