Anthropic released Claude Opus 4.7 on April 16, 2026. It is the company’s most capable generally available model, and the first GA frontier model explicitly architected around production agent primitives: task budgets for controlling token spend in autonomous loops, a new xhigh effort level for granular cost-performance tuning, and autonomous self-verification where the model devises its own checks before reporting a task complete.

NCT covered the pre-announcement last week, when reports of the impending release triggered stock declines at Figma and Wix. The actual release confirms the competitive framing but reveals something more consequential for builders: Anthropic is not just shipping a better model. It is shipping the operational infrastructure that makes autonomous agents financially viable in production.

What Opus 4.7 Actually Ships

Three features distinguish this release from a standard model upgrade.

Task budgets for agentic loops are now in public beta on the Claude API. Developers set a hard ceiling on token spend for autonomous agent sessions. When a debugging agent enters a recursive investigation loop or a code review agent keeps expanding its scope, the task budget enforces a financial and computational boundary. According to Anthropic’s announcement, this is designed specifically for “long-running” autonomous workflows where costs can spiral without developer awareness. VentureBeat reports that internal data shows max effort yields the highest scores (approaching 75% on coding tasks), but the budget mechanism lets teams constrain spend without abandoning agentic workflows entirely.
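Anthropic has not published the exact API shape of task budgets, but the accounting they enforce is easy to sketch client-side. The class and method names below are illustrative, not Anthropic's API: a hard token ceiling that an agent loop checks after every model call, stopping the session once cumulative spend crosses it.

```python
# Hypothetical client-side sketch of the task-budget idea: a hard token
# ceiling consulted after every model call in an agent loop. All names
# here are illustrative, not Anthropic's actual API surface.

class BudgetExceeded(Exception):
    """Raised when cumulative token spend crosses the configured ceiling."""


class TaskBudget:
    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens  # hard ceiling for the whole session
        self.spent = 0

    def charge(self, input_tokens: int, output_tokens: int) -> None:
        """Record one model call's usage; halt the loop if over budget."""
        self.spent += input_tokens + output_tokens
        if self.spent > self.max_tokens:
            raise BudgetExceeded(
                f"spent {self.spent} tokens, budget was {self.max_tokens}"
            )

    @property
    def remaining(self) -> int:
        return max(self.max_tokens - self.spent, 0)


# Example: a debugging agent capped at 50k tokens for the session.
budget = TaskBudget(max_tokens=50_000)
budget.charge(input_tokens=12_000, output_tokens=4_000)
print(budget.remaining)  # 34000
```

The point of the server-side feature is that enforcement holds even when the agent loop itself misbehaves; this sketch only shows the accounting a developer would otherwise have to maintain by hand.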

The xhigh effort level sits between high and max on the API’s effort parameter. This is a pricing lever disguised as a capability toggle. Max effort produces the best results but consumes the most tokens. High effort is cheaper but cuts corners on complex multi-step reasoning. xhigh occupies the middle ground that production deployments actually need: strong enough for serious agentic work, constrained enough to not bankrupt a startup running agents at scale. Per VentureBeat, internal benchmarks show xhigh provides “a compelling sweet spot between performance and token expenditure.”
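In request terms, the effort level is just one more field on the call. The model ID below is from Anthropic's announcement (claude-opus-4-7); the exact field name and accepted values are assumptions extrapolated from the levels the coverage names (low, medium, high, xhigh, max), so treat this as a shape sketch rather than a reference.

```python
# Illustrative request construction for the effort parameter. The model
# ID comes from Anthropic's announcement; the "effort" field name and
# the level list are assumptions based on levels named in the article.

EFFORT_LEVELS = ["low", "medium", "high", "xhigh", "max"]

def build_request(prompt: str, effort: str = "xhigh") -> dict:
    """Assemble kwargs for a hypothetical messages-style API call."""
    if effort not in EFFORT_LEVELS:
        raise ValueError(f"unknown effort level: {effort}")
    return {
        "model": "claude-opus-4-7",
        "max_tokens": 4096,
        "effort": effort,  # cost/quality dial: max is best but priciest
        "messages": [{"role": "user", "content": prompt}],
    }

req = build_request("Refactor the auth module to remove the global session.")
print(req["effort"])  # xhigh
```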

Autonomous self-verification is the behavioral change that matters most for agent reliability. Opus 4.7 independently devises verification steps before reporting a task as complete. Anthropic’s announcement describes users “being able to hand off their hardest coding work, the kind that previously needed close supervision, to Opus 4.7 with confidence.” VentureBeat’s testing documents a concrete example: in internal tests, the model built a Rust-based text-to-speech engine from scratch, then independently fed its own generated audio through a separate speech recognizer to verify the output against a Python reference. The model checked its own work without being asked to.

The Benchmark Picture

Opus 4.7 leads on several critical benchmarks, but the competitive landscape is closer than headlines suggest.

On the GDPVal-AA knowledge work evaluation, Opus 4.7 scores an Elo of 1753, ahead of OpenAI’s GPT-5.4 (1674) and Google’s Gemini 3.1 Pro (1314), according to VentureBeat. On SWE-bench Pro (agentic coding), it resolves 64.3% of tasks versus 53.4% for Opus 4.6. On GPQA Diamond (graduate-level reasoning), it reaches 94.2%. On arXiv visual reasoning with tools, it scores 91.0% compared to Opus 4.6’s 84.7%.

The gaps are real but narrow. VentureBeat notes that “on directly comparable benchmarks, Opus 4.7 only leads GPT-5.4 by 7-4.” GPT-5.4 still wins on agentic search (89.3% vs. 79.3%), multilingual Q&A, and raw terminal-based coding. This is not a clean sweep. It is a specialization trade-off: Opus 4.7 optimizes for the “reliability and long-horizon autonomy required by the burgeoning agentic economy,” as VentureBeat frames it, rather than trying to dominate every category.

The vision upgrade contributes to the agentic coding gains. Opus 4.7 processes images at up to 2,576 pixels on the longest edge, roughly 3.75 megapixels, a threefold increase over previous versions. VentureBeat reports that on XBOW visual-acuity tests, the model jumped from 54.5% to 98.5%. For agents that navigate dense UIs, read technical diagrams, or process screenshots in computer-use workflows, the “blurry vision” ceiling is gone.

Literal Instruction Following: A Feature and a Migration Cost

VentureBeat flags a behavioral change that will catch some production deployments off guard: Opus 4.7 follows instructions literally. Where older models might interpret ambiguous prompts loosely, Opus 4.7 executes the exact text provided. Anthropic warns that “legacy prompt libraries may require re-tuning to avoid unexpected results caused by the model’s strict adherence to the letter of the request.”

For agent builders, this is a double-edged feature. Strict instruction following is exactly what you want in an autonomous agent that executes multi-step workflows without human oversight. Loose interpretation creates unpredictable agent behavior. But every team running Opus 4.6 in production will need to audit their prompt templates before upgrading. The migration cost is non-trivial for complex agent systems with hundreds of prompt configurations.
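What an audit pass might look for: phrasing that relied on the model's loose interpretation. The "vague markers" list below is a hypothetical starting point invented for illustration, not guidance from Anthropic; real audits would be tuned to each team's prompt library.

```python
# A deliberately crude audit over prompt templates: flag phrasing that
# depended on loose interpretation, which a strictly literal model will
# no longer paper over. The marker list is a hypothetical example.

VAGUE_MARKERS = ["etc.", "as appropriate", "and so on", "if needed", "handle edge cases"]

def audit_prompt(template: str) -> list[str]:
    """Return the vague markers found in one prompt template."""
    lowered = template.lower()
    return [m for m in VAGUE_MARKERS if m in lowered]

legacy = "Fix the failing tests, clean up imports, handle edge cases as appropriate."
print(audit_prompt(legacy))  # ['as appropriate', 'handle edge cases']
```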

Hex’s early testing offers a specific data point: “It correctly reports when data is missing instead of providing plausible-but-incorrect fallbacks, and it resists dissonant-data traps that even Opus 4.6 falls for.” The model refuses to guess. For financial applications, compliance workflows, and any domain where confident-but-wrong outputs create liability, this behavioral shift is the single most important upgrade.

The Tokenizer Change

Opus 4.7 ships with a new tokenizer that improves text-processing efficiency, but VentureBeat reports it can inflate the token count of certain inputs by a factor of up to 1.35x. Teams running cost-sensitive agent workloads need to benchmark their actual prompt payloads against the new tokenizer before switching. A 35% increase in input token count at $5 per million input tokens shifts the economics of high-volume agent deployments.
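The worst case is easy to quantify. The $5 per million input tokens price and the 1.35x inflation figure are from the coverage above; the daily volume in this back-of-envelope check is an invented example.

```python
# Back-of-envelope input-cost check for the tokenizer change. The $5/M
# input price and the 1.35x worst case are from the article; the
# 200M-tokens/day workload is a made-up example volume.

PRICE_PER_M_INPUT = 5.00          # USD per million input tokens
daily_input_tokens = 200_000_000  # hypothetical agent-fleet volume

def daily_cost(tokens: int, inflation: float = 1.0) -> float:
    return tokens * inflation / 1_000_000 * PRICE_PER_M_INPUT

before = daily_cost(daily_input_tokens)        # old-tokenizer baseline
worst = daily_cost(daily_input_tokens, 1.35)   # reported worst case
print(before, worst)  # 1000.0 1350.0
```

At this volume the worst case adds $350 a day, roughly $10,500 a month, which is why benchmarking real payloads before switching matters.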

GitHub Copilot Integration: The Enterprise Distribution Channel

GitHub’s changelog confirms Opus 4.7 is rolling out across GitHub Copilot for Pro+, Business, and Enterprise users, available in VS Code, Visual Studio, JetBrains, Xcode, Eclipse, Copilot CLI, GitHub Copilot Cloud Agent, github.com, and GitHub Mobile. GitHub is consolidating: over the coming weeks, Opus 4.7 will replace both Opus 4.5 and Opus 4.6 in the model picker for Copilot Pro+.

GitHub’s specific framing matters: “stronger multi-step task performance and more reliable agentic execution.” This is the language of agent reliability, not general intelligence. GitHub is targeting the long-running coding agent workflows where Copilot’s agent mode is increasingly deployed: background agents, complex refactors, and CI/CD-triggered autonomous tasks. The model launches with a 7.5x premium request multiplier as promotional pricing until April 30.

For enterprise developers, this is the most consequential distribution detail. Most enterprise teams will encounter Opus 4.7 through Copilot, not through the Claude API directly. GitHub’s consolidation to a single Opus model (replacing both 4.5 and 4.6) means every Copilot Enterprise deployment will be running the agentic-first model by default within weeks.

The Mythos Shadow

CNBC reports that Anthropic does not plan to make Claude Mythos Preview generally available. The company “has said its goal is to learn how it could eventually deploy Mythos-class models at scale.” Opus 4.7 ships with “safeguards that automatically detect and block requests that indicate prohibited or high-risk cybersecurity uses,” per Anthropic. Security professionals who want to use Opus 4.7 for legitimate cybersecurity work (vulnerability research, pen testing, red-teaming) can apply to Anthropic’s new Cyber Verification Program.

This creates a deliberate capability gradient. Mythos remains restricted to enterprise security partners under Project Glasswing. Opus 4.7 sits at the boundary of what Anthropic believes is safe for broad deployment, and “long-horizon agentic work” falls inside that boundary. The implication: Anthropic has assessed that autonomous agent deployment at scale is a lower-risk capability than unrestricted cybersecurity use, and is pricing and releasing accordingly.

VentureBeat notes that during Opus 4.7’s training, Anthropic “experimented with efforts to differentially reduce” its cyber capabilities compared to Mythos. This is the first public confirmation that Anthropic is training models with selective capability suppression: making models stronger at agentic coding and long-horizon reasoning while deliberately limiting their cyber-offensive potential. Whether selective suppression holds under adversarial pressure is an open research question, but the training intent is novel and worth tracking.

Pricing and Availability

Opus 4.7 is available immediately on Claude.ai, the Claude API, Amazon Bedrock, Google Cloud Vertex AI, Microsoft Foundry, and GitHub Copilot. Pricing stays at $5 per million input tokens and $25 per million output tokens, matching Opus 4.6, confirmed by Anthropic and VentureBeat. The API model identifier is claude-opus-4-7.

Price parity with Opus 4.6 makes the upgrade decision straightforward for existing deployments, assuming teams account for the tokenizer change and literal instruction-following migration. For new deployments, the $5/$25 pricing with task budgets and xhigh effort creates a more predictable cost envelope for agentic workloads than any previous Opus release.

The Production Agent Infrastructure Shift

Early access testers provide concrete operational context. A financial technology platform serving millions of consumers told Anthropic the model “catches its own logical faults during the planning phase and accelerates execution, far beyond previous Claude models.” An unnamed company reported that “low-effort Opus 4.7 is roughly equivalent to medium-effort Opus 4.6,” suggesting teams can achieve their current performance levels at lower token cost by adjusting the effort parameter. On a 93-task coding benchmark, Opus 4.7 “lifted resolution by 13% over Opus 4.6, including four tasks neither Opus 4.6 nor Sonnet 4.6 could solve,” according to another early tester quoted in Anthropic’s announcement.

The pattern across these early reports is consistent: the model’s value is operational, not just intellectual. It does not just answer harder questions. It manages its own work, checks its own output, and operates within budgets. These are the properties that production agent deployments require and that previous GA models lacked.

Opus 4.7 marks the moment when “agentic” stops being a marketing adjective attached to model releases and becomes a set of specific, shipped production primitives: task budgets, effort granularity, self-verification, and literal instruction adherence. The question is no longer whether frontier models can power autonomous agents. It is whether the operational infrastructure around those models (the cost controls, the verification loops, the instruction precision) is mature enough for teams to deploy agents they do not constantly supervise. Anthropic’s answer with Opus 4.7 is that the infrastructure is ready. The next months of production deployments will test whether that confidence is earned.