Anthropic released Claude Opus 4.7 on April 16, making it generally available across the Claude API, Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry. The model improves on Opus 4.6 in agentic coding, long-running autonomous task handling, and vision, while introducing automated safeguards that detect and block prohibited cybersecurity requests.
Pricing remains unchanged at $5 per million input tokens and $25 per million output tokens.
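At those rates, per-call cost is straightforward to estimate. A minimal sketch (the function name and example token counts are illustrative, not part of any API):

```python
# Published Opus 4.7 rates, in dollars per million tokens
INPUT_RATE = 5.00
OUTPUT_RATE = 25.00

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the dollar cost of a single API call at Opus 4.7 pricing."""
    return (input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE) / 1_000_000
```

For example, a call with 10,000 input tokens and 2,000 output tokens works out to $0.05 + $0.05, or $0.10.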
Agentic Coding and Self-Verification
Opus 4.7’s most significant gains are in complex software engineering tasks that previously required close human supervision. On SWE-bench Pro, the model resolved 64.3% of tasks compared to 53.4% for Opus 4.6, according to Anthropic. On CursorBench, it cleared 70% versus 58% for its predecessor, per VentureBeat.
The model now devises its own verification steps before reporting a task as complete. In internal testing, Anthropic observed Opus 4.7 building a Rust-based text-to-speech engine and then independently feeding its generated audio through a separate speech recognizer to verify the output against a Python reference, VentureBeat reported.
On the GDPVal-AA knowledge work evaluation, Opus 4.7 scored an Elo of 1753, compared to 1674 for OpenAI’s GPT-5.4 and 1314 for Google’s Gemini 3.1 Pro, according to Anthropic. The overall lead is narrow, though: of the directly comparable benchmarks, Opus 4.7 wins seven to GPT-5.4’s four, per VentureBeat. GPT-5.4 still outperforms on agentic search (89.3% vs. 79.3%), multilingual Q&A, and raw terminal-based coding.
High-Resolution Vision
Opus 4.7 processes images up to 2,576 pixels on the longest edge, roughly 3.75 megapixels and a threefold increase over previous models. On XBOW visual-acuity tests, the model jumped from 54.5% to 98.5%, according to VentureBeat. On arXiv visual reasoning with tools, it scored 91.0%, up from 84.7% on Opus 4.6, per Anthropic.
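Clients holding images larger than that limit may want to downscale before upload to avoid server-side resizing. A minimal sketch of the aspect-preserving math, assuming the 2,576-pixel longest-edge figure above (the helper name is illustrative, not part of the API):

```python
MAX_LONG_EDGE = 2576  # Opus 4.7's reported longest-edge limit

def fit_to_long_edge(width: int, height: int, max_edge: int = MAX_LONG_EDGE) -> tuple[int, int]:
    """Scale (width, height) down so the longer side is at most max_edge,
    preserving aspect ratio. Images already within the limit are unchanged."""
    long_edge = max(width, height)
    if long_edge <= max_edge:
        return width, height
    scale = max_edge / long_edge
    return round(width * scale), round(height * scale)
```

A 4,000 × 3,000 photo, for instance, comes back as 2,576 × 1,932, while anything already under the limit passes through untouched.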
Cybersecurity Safeguards and Mythos
Anthropic publicly stated that Opus 4.7 is less broadly capable than Claude Mythos Preview, the restricted model currently deployed through Project Glasswing with Amazon, Apple, Cisco, CrowdStrike, Google, and others for zero-day vulnerability discovery.
Opus 4.7 ships with safeguards that automatically detect and block requests indicating prohibited or high-risk cybersecurity uses. Anthropic described these as the first deployment of cyber safeguards it will later apply to Mythos-class models for broader release, according to the company’s announcement. Security professionals performing legitimate penetration testing and red-teaming can apply for Anthropic’s new Cyber Verification Program.
Cost Controls for Agentic Workloads
Anthropic introduced a new “xhigh” effort parameter positioned between high and max, giving developers more granular control over reasoning depth and token consumption. The Claude API also launched “task budgets” in public beta, allowing developers to set hard ceilings on token spend for autonomous agents, VentureBeat reported.
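The task-budget beta enforces the ceiling server-side, but the underlying idea can be sketched client-side for agent loops that track their own usage. A hedged illustration, not Anthropic's implementation; all class and method names here are hypothetical:

```python
class TokenBudgetExceeded(RuntimeError):
    """Raised when an agent's cumulative token spend crosses its ceiling."""

class TaskBudget:
    """Client-side sketch of a hard token ceiling for an autonomous agent."""

    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.spent = 0

    def charge(self, input_tokens: int, output_tokens: int) -> None:
        """Record one model call's usage; raise once the ceiling is crossed,
        so the agent loop stops instead of spending indefinitely."""
        self.spent += input_tokens + output_tokens
        if self.spent > self.max_tokens:
            raise TokenBudgetExceeded(
                f"spent {self.spent} of {self.max_tokens} budgeted tokens"
            )

budget = TaskBudget(max_tokens=50_000)
budget.charge(12_000, 4_000)  # 16,000 tokens spent, still under the ceiling
```

An agent loop would call `charge` after each model response and treat the exception as a signal to halt and report partial progress.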
One operational note for teams upgrading: Opus 4.7 follows instructions more literally than its predecessor. Anthropic warns that legacy prompt libraries may require re-tuning to avoid unexpected results from the model’s strict adherence to instructions.
Early Access Feedback
Multiple enterprise testers reported measurable improvements. Notion reported a 14% gain over Opus 4.6 while using fewer tokens and making a third as many tool errors. Hex said Opus 4.7 “correctly reports when data is missing instead of providing plausible-but-incorrect fallbacks.” Replit called it “an easy upgrade decision,” noting the same quality at lower cost. Devin described the model as enabling “a class of deep investigation work we couldn’t reliably run before,” per Anthropic.
The Benchmark Convergence Problem
The competitive picture reveals how tight the frontier model race has become. Three models from three companies are separated by single-digit percentage points on most benchmarks, with each holding specific domain leads. For teams selecting models for agentic deployments, the decision increasingly depends on which specific capabilities matter most for their use case rather than any single model’s overall ranking.