Anthropic has published one of the most revealing engineering write-ups in the current AI agent wave. It is not a model launch or a flashy benchmark post, but a detailed explanation of how the company designed a multi-agent “harness” to help Claude build better frontends and more complete full-stack apps over long autonomous sessions. The post, published on March 24, 2026 and written by Labs team member Prithvi Rajasekaran, argues that model quality alone is no longer the whole story. The environment around the model—how work is split, how context is managed, and how outputs are judged—can change results dramatically.

At the heart of the article is a simple idea: strong models still fail in predictable ways on long tasks. Anthropic says two problems kept surfacing in its earlier experiments. The first was loss of coherence over time, especially as long sessions filled the context window. The second was poor self-evaluation: when asked to judge their own work, agents tended to be overly positive, even when the output was obviously mediocre to a human reviewer. Anthropic’s answer was to stop relying on a single agent doing everything and instead separate planning, generation, and evaluation into distinct roles.

Why one good model is often not enough

Anthropic says it had already explored long-running coding harnesses before, using an initializer agent to break a product spec into tasks and a coding agent to implement them step by step. That earlier work had already shown that harness design matters a lot for agentic coding. In this new post, the company pushes the idea further by showing that the next bottleneck is not just whether the model can code, but whether it can stay coherent and critical over many hours of work.

One of the more interesting details is Anthropic’s description of “context anxiety.” In the post, the company says some models begin wrapping up work prematurely as they approach what they believe is their context limit. In earlier experiments, Anthropic found that context resets—starting a fresh agent with a structured handoff—helped more than simple compaction, because compaction preserves continuity but does not always remove that tendency to close early. Later, with Opus 4.6, Anthropic says this behavior was reduced enough that it could simplify the harness and rely more on continuous sessions with automatic compaction.
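The reset-with-handoff idea can be sketched in a few lines. This is an illustrative sketch, not Anthropic's implementation: the token budget, the 80% threshold, and the handoff fields are all assumptions made for the example.

```python
# Hypothetical context-reset logic: instead of compacting the current
# session, a fresh agent is started with a structured handoff before the
# window fills. All constants and field names here are illustrative.

TOKEN_BUDGET = 180_000   # assumed context limit for this sketch
RESET_THRESHOLD = 0.8    # reset once 80% of the budget is consumed

def should_reset(tokens_used: int) -> bool:
    """Trigger a reset before the model starts 'wrapping up' early."""
    return tokens_used >= TOKEN_BUDGET * RESET_THRESHOLD

def build_handoff(completed: list[str], in_progress: str,
                  next_steps: list[str]) -> dict:
    """Structured state the next agent reads on startup, so it can
    continue the work without inheriting the old context window."""
    return {
        "completed_tasks": completed,
        "current_task": in_progress,
        "next_steps": next_steps,
    }

# usage: hand off mid-build rather than letting the session degrade
if should_reset(150_000):
    handoff = build_handoff(["sprite editor"], "level editor", ["test mode"])
```

The design choice worth noting is that the handoff is explicit and structured, whereas compaction summarizes in place; per the post, the explicit reset was what removed the tendency to close work early.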

The second failure mode was just as important. Anthropic says agents are generally bad at grading their own work, especially for subjective tasks like frontend design. To solve that, it created a generator-evaluator loop inspired by GAN-style thinking: one agent creates, another critiques. In the frontend experiment, the evaluator graded designs across four criteria—design quality, originality, craft, and functionality—and Anthropic intentionally weighted design quality and originality more heavily, because Claude already did reasonably well on technical competence by default.
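A weighted rubric like the one described can be reduced to a simple scoring function. The four criteria come from the post, but the numeric weights and the 1-to-10 scale below are assumptions made for this sketch.

```python
# Illustrative evaluator rubric: design quality and originality are
# weighted more heavily than craft and functionality, mirroring the
# post's emphasis. The exact weights are assumed, not Anthropic's.

WEIGHTS = {
    "design_quality": 0.35,
    "originality": 0.35,
    "craft": 0.15,
    "functionality": 0.15,
}

def overall_score(scores: dict[str, float]) -> float:
    """Combine per-criterion scores (assumed 1-10) into one grade."""
    return sum(WEIGHTS[c] * scores[c] for c in WEIGHTS)

# A technically solid but generic design scores poorly overall,
# which is the pressure that pushes the generator away from "safe" output.
generic = overall_score({"design_quality": 6, "originality": 4,
                         "craft": 9, "functionality": 9})
```

The point of the weighting is behavioral: if technical competence alone could carry the grade, the generator would have no incentive to take design risks.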

From “AI slop” to a three-agent architecture

Anthropic first tested this pattern on frontend design, where self-evaluation problems were easiest to see. According to the post, Claude often defaulted to safe, generic layouts unless pushed away from them. The evaluator used Playwright MCP to interact with the live page directly, inspect it, take screenshots, and produce detailed critiques. Anthropic says full runs could stretch to four hours, with five to fifteen design iterations per generation. In one museum-themed example, the model eventually pivoted from a polished but conventional dark landing page to a much more distinctive 3D gallery-style interface with doorway-based navigation between rooms.

That success led to a more ambitious setup for long-running full-stack development. Anthropic built a three-agent system: a planner that turns a short user prompt into a full product spec, a generator that implements the app, and an evaluator that tests and grades the result. Before each sprint in the first version of the harness, the generator and evaluator negotiated a sprint contract defining what “done” meant for that feature. Communication between agents was handled through files, with one agent writing artifacts and another reading and responding to them.
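The file-based communication pattern is straightforward to sketch. Assuming a shared working directory, one agent serializes an artifact and the other reads and responds to it; the file name `sprint_contract.json` and the contract fields below are illustrative, not taken from Anthropic's harness.

```python
# Minimal sketch of file-mediated agent communication: agents never talk
# directly, they exchange JSON artifacts in a shared directory.
import json
from pathlib import Path

WORKDIR = Path("harness_artifacts")
WORKDIR.mkdir(exist_ok=True)

def write_artifact(name: str, payload: dict) -> Path:
    """One agent writes an artifact for another to pick up."""
    path = WORKDIR / name
    path.write_text(json.dumps(payload, indent=2))
    return path

def read_artifact(name: str) -> dict:
    return json.loads((WORKDIR / name).read_text())

# e.g. generator and evaluator negotiating what "done" means for a sprint
write_artifact("sprint_contract.json", {
    "feature": "sprite editor",
    "done_criteria": ["draw pixels", "save sprite", "undo"],
})
contract = read_artifact("sprint_contract.json")
```

Files as the transport have a practical advantage for long runs: the artifacts survive context resets, so a freshly started agent can reconstruct the state of the negotiation from disk.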

This architecture maps closely to how Anthropic now describes the Claude Agent SDK. In Anthropic’s own documentation, the SDK is presented as a way to use Claude Code as a programmable library, with the same tools, agent loop, and context management available in Python and TypeScript. Anthropic also notes that the Claude Code SDK has been renamed to the Claude Agent SDK, which helps explain why this harness work sits somewhere between internal experiments and a broader product direction.

The retro game maker experiment made the difference obvious

The clearest example in the article is a test prompt asking Claude to build a 2D retro game maker with a level editor, sprite editor, entity behaviors, and a playable test mode. Anthropic ran the same task through two systems: a solo agent and the full harness. The gap was striking. The solo run took 20 minutes and cost $9. The full harness ran for 6 hours and cost $200. Anthropic says the solo version looked reasonable at first but quickly fell apart under use, with a rigid workflow and a broken game runtime. The harness version, by contrast, expanded the original one-line prompt into a much richer product spec spread across ten sprints and delivered a more polished, deeper, and actually playable application.

The post also makes clear that the evaluator was doing real work, not just ceremonial QA. Anthropic shows examples of bugs it caught, including a rectangle fill tool that failed to fill a dragged region properly, an entity deletion flow broken by state logic in the level editor, and a FastAPI route ordering issue that caused a “reorder” endpoint to be parsed as an integer frame ID. Anthropic says it had to tune the evaluator repeatedly because, out of the box, Claude was not a very good QA agent either.
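The route ordering bug is worth unpacking, because it is a classic pitfall: in frameworks that match routes in registration order, a parameterized path like `/frames/{frame_id}` registered before a literal path like `/frames/reorder` will capture the literal path as a frame ID. The sketch below demonstrates the mechanism with a toy pure-Python router rather than FastAPI itself; the route names are assumptions based on the bug described in the post.

```python
# Toy first-match router illustrating the ordering pitfall the
# evaluator caught. Routes are matched in registration order.
import re

routes: list[tuple[str, str]] = []  # (compiled pattern, handler name)

def add_route(path: str, handler: str) -> None:
    # Turn "{param}" segments into named wildcard groups.
    pattern = re.sub(r"\{(\w+)\}", r"(?P<\1>[^/]+)", path)
    routes.append((f"^{pattern}$", handler))

def dispatch(path: str) -> str:
    for pattern, handler in routes:
        if re.match(pattern, path):
            return handler
    return "404"

# Buggy order: the dynamic route shadows the literal one, so the
# "reorder" endpoint is swallowed as if it were a frame ID.
add_route("/frames/{frame_id}", "get_frame")
add_route("/frames/reorder", "reorder_frames")
assert dispatch("/frames/reorder") == "get_frame"

# Fix: register literal routes before parameterized ones.
routes.clear()
add_route("/frames/reorder", "reorder_frames")
add_route("/frames/{frame_id}", "get_frame")
assert dispatch("/frames/reorder") == "reorder_frames"
```

In real FastAPI the symptom differs slightly (a typed `int` path parameter rejects "reorder" with a validation error rather than silently matching), but the fix is the same: declare literal routes before parameterized ones.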

Opus 4.6 changed the harness again

One of the most useful lessons in the piece is that a harness should not be treated as permanent. When Anthropic moved from Opus 4.5 to Opus 4.6, it simplified the architecture because the newer model could stay coherent longer and needed less scaffolding. It removed the sprint construct entirely, kept the planner and evaluator, and moved the evaluator to end-of-round review instead of sprint-by-sprint grading. Anthropic then used the revised harness to build a browser-based DAW with the Web Audio API. That run lasted 3 hours and 50 minutes and cost $124.70 in token usage, with the main build phase alone running coherently for more than two hours.

Even then, the evaluator still found meaningful gaps. Anthropic says it flagged missing interactive depth in core DAW features, such as clips that could not be dragged on the timeline, missing instrument control panels, and effect views that were present only as numeric sliders instead of more graphical editors. In later rounds it also caught stubbed-out audio recording and missing clip resize and split interactions. The point Anthropic draws from this is that evaluator overhead is not always worth paying—but when a task sits just beyond what the base model can do reliably on its own, that extra critical layer still adds substantial value.

The bigger lesson for AI engineering

What makes this post important is not just the specific planner-generator-evaluator recipe. It is the broader engineering principle behind it. Anthropic argues that every harness component encodes an assumption about what the model cannot yet do well by itself, and those assumptions need to be revisited every time the model changes. That view aligns closely with the company’s earlier “Building effective agents” guidance, which recommends starting with simple, composable patterns and only increasing complexity when needed.

That is probably the most useful takeaway for developers and teams building agents today. The model matters, but so does the surrounding system. A planner can expand weak instructions into something actionable. A generator can focus on execution. An evaluator can act as an external critic rather than a self-congratulatory narrator. And when a stronger model arrives, some of that scaffolding may become unnecessary while other parts become more valuable. Anthropic’s post does not provide a complete production blueprint, but it does offer something just as useful: evidence that the next frontier in agent performance may come as much from harness design as from the next model release.

FAQ

What does Anthropic mean by a “harness” for long-running app development?
In this context, a harness is the surrounding orchestration layer for the model: how tasks are split, how context is passed, which agents do planning or QA, and how tools and artifacts are used to keep work coherent over long autonomous sessions. Anthropic explicitly frames harness design as a major factor in agent performance.

Why did Anthropic split work between planner, generator, and evaluator agents?
Because the company found that long tasks often fail in two ways: models lose coherence over time, and they are too generous when judging their own output. The planner expands a short prompt into a fuller product spec, the generator builds the app, and the evaluator tests and critiques it from the outside.

What is “context anxiety” in Anthropic’s article?
Anthropic uses that phrase to describe a tendency in some models to start wrapping up too early when they believe they are nearing the limits of their context window. In earlier harnesses, Anthropic says context resets helped address that better than compaction alone.

How does this relate to the Claude Agent SDK?
Anthropic’s Agent SDK is the programmable interface that exposes the same tools, agent loop, and context management used in Claude Code. That makes it a natural fit for building multi-agent systems like the one described in the harness post. Anthropic’s docs also note that the Claude Code SDK has been renamed to the Claude Agent SDK.

via: Anthropic
