Anthropic published a three-agent harness architecture on April 4, 2026, designed to keep autonomous coding agents productive across multi-hour sessions. The system splits work among a planner, a generator, and an evaluator, using context resets instead of compaction to prevent drift and premature task termination.
The Problem: Single Agents Break Down on Long Tasks
The design, authored by Anthropic Labs engineer Prithvi Rajasekaran, addresses two persistent failure modes in long-running agent coding sessions, according to Anthropic’s engineering blog.
First, agents try to do too much at once. As the context window fills, coherence degrades. Even with compaction (summarizing earlier conversation in place), models exhibit what Anthropic calls “context anxiety,” where they start wrapping up work prematurely as they approach perceived context limits. Rajasekaran found that Claude Sonnet 4.5 exhibited this behavior strongly enough that compaction alone wasn’t sufficient.
Second, agents are bad at evaluating their own output. When asked to assess work they’ve produced, they consistently overrate it. For subjective tasks like frontend design, there’s no binary test to override the self-praise. For objective tasks, the problem persists in subtler ways.
Three Agents, Hard Boundaries
The solution separates responsibilities across three agents with hard context resets between them, as detailed in Anthropic’s engineering post:
Planner: Decomposes a product spec into discrete, tractable chunks. Each chunk becomes a self-contained task with enough context for a fresh agent to execute it cold.
Generator: Builds the code. Operates within a single context window per task, then hands off structured artifacts (JSON feature specs, commit-by-commit progress, a claude-progress.txt file) to the next session. This approach builds on Anthropic’s earlier work on initializer-agent patterns where each new session starts by reading the environment state rather than inheriting conversation history.
Evaluator: Grades the output using predefined criteria. For frontend work, the evaluator navigates live pages using Playwright MCP, interacts with the interface, and provides specific critiques. The key finding: tuning a standalone evaluator to be skeptical is far more tractable than making a generator critical of its own work.
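The generator’s handoff mechanics can be sketched in a few lines. The file names below mirror the artifacts the post mentions (JSON feature specs and a claude-progress.txt note), but the schema and function names are hypothetical illustrations, not Anthropic’s actual format:

```python
import json
from pathlib import Path

def write_handoff(workdir: Path, features: list[dict], progress_note: str) -> None:
    """Persist structured state so a fresh session can resume cold.
    Schema is illustrative, not Anthropic's actual format."""
    (workdir / "features.json").write_text(json.dumps(features, indent=2))
    (workdir / "claude-progress.txt").write_text(progress_note)

def read_handoff(workdir: Path) -> tuple[list[dict], str]:
    """A new session starts by reading environment state rather than
    inheriting conversation history."""
    features = json.loads((workdir / "features.json").read_text())
    progress = (workdir / "claude-progress.txt").read_text()
    return features, progress
```

The point of the pattern is that everything a fresh context needs lives on disk, so nothing depends on the previous window’s conversation surviving.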
The GAN-inspired feedback loop runs five to fifteen iterations per session, sometimes taking up to four hours. Each cycle produces progressively refined output as the evaluator forces the generator to address specific failures.
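That generate-then-critique cycle reduces to a simple control loop. A minimal sketch, with stub callables standing in for the actual model calls (all names and the pass threshold are hypothetical):

```python
from typing import Callable

def refinement_loop(
    generate: Callable[[str, str], str],           # (task, critique) -> artifact
    evaluate: Callable[[str], tuple[float, str]],  # artifact -> (score, critique)
    task: str,
    pass_threshold: float = 0.9,
    max_iterations: int = 15,  # the post reports 5-15 cycles per session
) -> tuple[str, int]:
    """Run the GAN-style cycle until the evaluator passes the artifact
    or the iteration budget runs out."""
    critique = ""
    for i in range(1, max_iterations + 1):
        artifact = generate(task, critique)
        score, critique = evaluate(artifact)
        if score >= pass_threshold:
            return artifact, i
    return artifact, max_iterations
```

Feeding the evaluator’s critique back into the next generation call is what forces the generator to address specific failures rather than re-praising its own output.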
Frontend Design as the Test Case
Rajasekaran started with frontend design, where self-evaluation failures were most visible. Without intervention, Claude gravitates toward “safe, predictable layouts that are technically functional but visually unremarkable,” according to the engineering post.
The team defined four grading criteria given to both generator and evaluator: design quality (coherent visual identity), originality (evidence of deliberate creative choices over template defaults), craft (typography, spacing, color harmony), and functionality (usability independent of aesthetics). Design quality and originality were weighted higher.
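A rubric like that amounts to a weighted composite score. The weights below are purely illustrative: the post says design quality and originality were weighted higher but does not publish the actual values:

```python
# Illustrative weights only; the post does not publish the real values,
# only that design quality and originality were weighted higher.
RUBRIC = {
    "design_quality": 0.3,
    "originality": 0.3,
    "craft": 0.2,
    "functionality": 0.2,
}

def weighted_score(grades: dict[str, float]) -> float:
    """Combine per-criterion grades (0-1) into one weighted score."""
    assert set(grades) == set(RUBRIC), "grade every criterion"
    return sum(RUBRIC[k] * grades[k] for k in RUBRIC)
```

Giving the same rubric to both generator and evaluator means the feedback loop converges on shared criteria rather than the evaluator’s private taste.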
What It Means for Agent Builders
The architecture is instructive beyond Anthropic’s own stack. InfoQ’s coverage noted that industry practitioners highlighted the framework’s structured approach to a problem every multi-agent team faces: how to maintain progress across context window boundaries without losing coherence.
Three takeaways for builders running autonomous agents:
Context resets beat compaction for long sessions. Clearing the window entirely and handing off structured state artifacts costs more tokens per transition, but it avoids the cumulative degradation that compacted summaries carry forward from one window to the next.
Separate evaluation from generation. The same model that wrote the code will praise the code. An independent evaluator with calibrated scoring criteria and few-shot examples produces actionable feedback the generator can iterate against.
Decomposition is the planner’s job, not the generator’s. When the coding agent tries to plan and build simultaneously, it overcommits and runs out of context mid-feature. Separating the planning step lets each generation session focus on a single tractable chunk.
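The planner’s output in this scheme is a set of self-contained task payloads. A sketch of that decomposition step, where the chunking heuristic (one chunk per spec section) and all names are hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class TaskChunk:
    """A self-contained unit of work: enough context for a fresh
    generator session to execute it cold."""
    title: str
    spec_excerpt: str
    context: dict = field(default_factory=dict)

def plan(spec_sections: dict[str, str], shared_context: dict) -> list[TaskChunk]:
    """Hypothetical planner step: one chunk per spec section, each
    carrying a copy of the shared context it needs rather than relying
    on the planner's conversation history."""
    return [
        TaskChunk(title=name, spec_excerpt=text, context=dict(shared_context))
        for name, text in spec_sections.items()
    ]
```

Because each chunk embeds its own context, a generator session can pick one up without seeing the spec or the planner’s reasoning, which is what keeps any single window from overcommitting.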