Microsoft Research released Webwright, an open-source web agent framework that replaces persistent browser sessions with a terminal workspace. The agent writes Playwright scripts, runs bash commands, and stores all code, logs, and screenshots as rerunnable artifacts. On the Odysseys long-horizon browsing benchmark, Webwright powered by GPT-5.4 scored 60.1%, compared to 33.5% for base GPT-5.4 and 44.5% for Opus 4.6, the previous state of the art.
How Webwright Works
Most web agents today predict one browser action at a time: click here, type there, scroll down. The model receives a screenshot or DOM text and outputs the next step. Webwright inverts this. According to Microsoft Research, the framework “separates the agent from the browser, and treats the browser as something the agent can launch, inspect, and discard while developing a program.”
The entire framework spans roughly 1,000 lines of code across three modules: a Runner (~150 lines), a Model Endpoint (~550 lines), and a terminal Environment (~300 lines). There is no multi-agent orchestration or planning hierarchy. A single agent loop sends context to the model, receives a thinking block and a shell command, executes the command, and feeds the output back into context.
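The single-loop design described above can be sketched in a few lines. This is a minimal illustration, not Webwright's actual code: the `run_agent` function, the `ModelFn` interface, and the `scripted_model` stand-in are all hypothetical names, and a real model endpoint would replace the scripted one.

```python
import subprocess
from typing import Callable, List, Optional, Tuple

# Hypothetical model interface (an assumption, not Webwright's API):
# takes the context so far and returns (thinking, shell_command).
# A command of None means the model has nothing more to run.
ModelFn = Callable[[List[str]], Tuple[str, Optional[str]]]


def run_agent(model: ModelFn, task: str, max_steps: int = 100) -> List[str]:
    """Single agent loop: ask the model, run its command, feed output back."""
    context = [f"TASK: {task}"]
    for _ in range(max_steps):
        thinking, command = model(context)
        context.append(f"THINKING: {thinking}")
        if command is None:
            break
        result = subprocess.run(
            command, shell=True, capture_output=True, text=True, timeout=60
        )
        context.append(f"$ {command}")
        context.append(f"OUTPUT: {result.stdout}{result.stderr}".rstrip())
    return context


def scripted_model(context: List[str]) -> Tuple[str, Optional[str]]:
    """Stand-in for a real model endpoint: runs one command, then stops."""
    if any(line.startswith("OUTPUT:") for line in context):
        return "The command ran, so I am done.", None
    return "I will print a greeting.", "echo hello"


if __name__ == "__main__":
    for line in run_agent(scripted_model, "say hello"):
        print(line)
```

Everything the agent knows lives in one growing context list, which is why a design like this needs no orchestration layer.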
Because the agent writes code rather than predicting pixel coordinates, it can chain multiple web interactions into a single step. Filling out a form, selecting a date range, or navigating across pages becomes a compact Playwright script rather than a sequence of individual click predictions, as MarkTechPost reported.
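To make the contrast concrete, here is a sketch of what one such chained step might look like with Playwright's Python sync API. The demo page, the selectors, and the function names `book_range` and `run_demo` are invented for illustration and do not come from Webwright or any real site. The Playwright calls themselves (`fill`, `click`, `screenshot`, `inner_text`, `set_content`) are standard Page methods.

```python
# Hypothetical local page standing in for a real booking form.
DEMO_HTML = """
<input id="city" placeholder="City">
<input id="start" type="date">
<input id="end" type="date">
<button id="go" onclick="
  document.getElementById('out').textContent =
    document.getElementById('city').value + ' ' +
    document.getElementById('start').value + '..' +
    document.getElementById('end').value">Search</button>
<div id="out"></div>
"""


def book_range(page, city: str, start: str, end: str, shot_path: str) -> str:
    """Fill a form, pick a date range, submit, and capture the result.

    `page` is a Playwright sync-API Page; one call replaces what would
    otherwise be several separate click and type predictions.
    """
    page.fill("#city", city)
    page.fill("#start", start)        # date inputs take ISO yyyy-mm-dd strings
    page.fill("#end", end)
    page.click("#go")
    page.screenshot(path=shot_path)   # keep a visual artifact of the outcome
    return page.inner_text("#out")    # text result for the log


def run_demo() -> str:
    """Live run; needs `pip install playwright` and `playwright install chromium`."""
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.set_content(DEMO_HTML)
        text = book_range(page, "Oslo", "2026-06-01", "2026-06-05", "result.png")
        browser.close()
    return text
```

Calling `run_demo()` launches a headless browser, runs the whole sequence, and discards the browser afterward, leaving only the script and its screenshot behind.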
Benchmark Numbers
On Online-Mind2Web, which covers 300 tasks across 136 websites, GPT-5.4 with Webwright reached 86.67% overall accuracy with a 100-step budget. Claude Opus 4.7 scored slightly lower overall, at 84.7%, but did better on the hard tasks at the same 100-step budget: 80.5% versus 76.6% for GPT-5.4.
The more striking result came from Odysseys, a benchmark designed for long-horizon browsing tasks averaging 272 words of instructions per task. Before Webwright, Opus 4.6 held the top score at 44.5% (April 2026 leaderboard). Webwright with GPT-5.4 hit 60.1%, a 26.6-point absolute improvement over the base model and a 35% relative gain over the previous best.
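The two gains quoted above measure different things, and the arithmetic is easy to check from the reported scores. The helper names below are illustrative only.

```python
def absolute_gain(new: float, old: float) -> float:
    """Difference in percentage points."""
    return round(new - old, 1)


def relative_gain(new: float, old: float) -> float:
    """Improvement as a percentage of the old score."""
    return round(100 * (new - old) / old, 1)


webwright, base_gpt, previous_best = 60.1, 33.5, 44.5

print(absolute_gain(webwright, base_gpt))       # 26.6 points over base GPT-5.4
print(relative_gain(webwright, previous_best))  # 35.1 percent over Opus 4.6
```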
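Cost matters too. According to the Microsoft Research write-up, GPT-5.4 averages $2.37 per task, while Claude Opus 4.7 costs $6.09 per task despite completing tasks in fewer steps (a mean of 21.9 steps versus 26.3). Token pricing accounts for part of the gap: GPT-5.4 runs $2.50 per million input tokens versus $5.00 for Opus 4.7, and $15.00 versus $25.00 per million output tokens. Because those price ratios (2x and roughly 1.7x) are smaller than the roughly 2.6x gap in per-task cost, the figures imply that, at these list prices, Opus 4.7 also consumed more tokens per task.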
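A quick back-of-the-envelope check, using only the figures reported above, shows how the per-task and per-step costs compare with the list-price ratios. The variable names are illustrative, and the per-step numbers are derived here rather than reported by Microsoft Research.

```python
gpt_cost, gpt_steps = 2.37, 26.3      # dollars per task, mean steps
opus_cost, opus_steps = 6.09, 21.9

cost_ratio = opus_cost / gpt_cost            # per-task cost gap
gpt_per_step = gpt_cost / gpt_steps          # dollars per step
opus_per_step = opus_cost / opus_steps
input_price_ratio = 5.00 / 2.50              # per million input tokens
output_price_ratio = 25.00 / 15.00           # per million output tokens

print(f"per-task cost ratio: {cost_ratio:.2f}x")
print(f"per-step cost: ${gpt_per_step:.3f} vs ${opus_per_step:.3f}")
print(f"price ratios: {input_price_ratio:.2f}x input, {output_price_ratio:.2f}x output")
```

The per-task gap (about 2.6x) is larger than either price ratio, so at these list prices the difference cannot come from pricing alone.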
Solving Premature Completion
One engineering problem the team tackled: agents claiming they finished when they had not. With open-ended bash access, models frequently declare success prematurely. Webwright addresses this with a self-reflection gate. The agent must generate a configuration, run a final verification script in a fresh folder with logs and screenshots, and pass its own judgment before the framework accepts a “done” signal. If the self-check fails, the framework drops the completion flag and the agent retries, per WinBuzzer’s analysis.
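A gate of this kind might look like the sketch below. It is an assumption-laden illustration rather than Webwright's implementation: `accept_done`, the `judge` callback, and the `verify.log` filename are hypothetical, and the configuration step and screenshots are left out for brevity.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Callable, List, Tuple


def accept_done(
    verify_cmd: List[str], judge: Callable[[str], bool]
) -> Tuple[bool, Path]:
    """Accept a "done" signal only if verification and self-judgment both pass.

    Returns (accepted, folder) so the log artifact can be inspected later.
    """
    fresh = Path(tempfile.mkdtemp(prefix="verify_"))   # fresh folder per check
    result = subprocess.run(
        verify_cmd, cwd=fresh, capture_output=True, text=True
    )
    log = result.stdout + result.stderr
    (fresh / "verify.log").write_text(log)             # keep a log artifact
    if result.returncode != 0:
        return False, fresh                            # drop the completion flag
    return judge(log), fresh                           # the agent's own verdict


if __name__ == "__main__":
    # Hypothetical verification script: prints a value the judge then checks.
    cmd = [sys.executable, "-c", "print('rows_found=3')"]
    accepted, folder = accept_done(cmd, lambda log: "rows_found=3" in log)
    print(accepted, folder)
```

When the function returns `False`, the calling loop would treat the task as unfinished and let the agent try again.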
To manage context length, the framework compacts the history into a single summary every 20 steps, which keeps long trajectories from exceeding model limits.
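The compaction rule can be illustrated with a short sketch. The function name `maybe_compact` and the `summarize` callback are hypothetical, and in practice the summary would come from the model. Whether Webwright keeps anything besides the summary is not stated, so this version keeps only the summary.

```python
from typing import Callable, List


def maybe_compact(
    history: List[str],
    step: int,
    summarize: Callable[[List[str]], str],
    every: int = 20,
) -> List[str]:
    """Replace the history with one summary entry every `every` steps."""
    if step > 0 and step % every == 0:
        return [summarize(history)]
    return history


if __name__ == "__main__":
    history: List[str] = []
    for step in range(1, 46):
        history.append(f"step {step}")
        history = maybe_compact(
            history, step, lambda h: f"SUMMARY of {len(h)} entries"
        )
    print(len(history), history[0])
```

After 45 steps the history holds six entries (one summary plus steps 41 to 45) instead of 45, so it never grows beyond 21 entries before the next compaction.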
Cross-Platform Plugin Support
The GitHub repository shows plugin manifest additions dated May 6 for Claude Code, Codex, and OpenClaw. This positions Webwright as infrastructure that slots into existing agent harnesses rather than competing with them. The resulting scripts can be packaged as reusable CLIs with arguments, turning one-off browser automation into shareable, version-controlled tooling.
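As a rough idea of what packaging a script as a CLI could involve, the sketch below wraps a browser task in `argparse`. The program name, the flags, and their defaults are invented for illustration, and the Playwright steps are replaced with a print statement.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Expose the values that were hard-coded in a one-off script as flags."""
    parser = argparse.ArgumentParser(
        prog="find-flights",  # hypothetical tool name
        description="Rerun a saved browser automation with new inputs.",
    )
    parser.add_argument("--origin", default="SEA", help="departure airport code")
    parser.add_argument("--dest", default="JFK", help="arrival airport code")
    parser.add_argument("--date", default="2026-06-01", help="ISO date")
    parser.add_argument("--headed", action="store_true", help="show the browser")
    return parser


def main(argv: Optional[List[str]] = None) -> int:
    args = build_parser().parse_args(argv)
    # The Playwright steps the agent wrote would run here, driven by `args`.
    print(
        f"Would search {args.origin} -> {args.dest} on {args.date} "
        f"(headed={args.headed})"
    )
    return 0


if __name__ == "__main__":
    main()
```

Once the inputs are arguments rather than constants, the same script can be checked into version control and rerun for a different route or date without asking the agent again.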
The team also tested smaller models. Qwen3.5-9B, running with Webwright’s crafted tools, achieved competitive performance on the hard split of Online-Mind2Web, suggesting the terminal-native approach benefits models across the capability spectrum, not just frontier systems.