Anthropic Harness Design For Long-Running Application Development


Summary

This source describes how Anthropic evolved a long-running coding harness from simple task decomposition into a planner-generator-evaluator architecture. The strongest gains came from explicit evaluation, structured handoff artifacts, and periodic simplification of the harness as model capabilities improve.

Source

Raw file: raw/anthropic/Harness design for long-running application development.md
Translated raw file: raw/anthropic/Harness design for long-running application development.zh.md
Original URL: https://www.anthropic.com/engineering/harness-design-long-running-apps
Author: Prithvi Rajasekaran
Ingest date: 2026-04-08

Key Contributions

Recasts multi-agent coding harnesses in generator-evaluator terms, with a planner added to expand underspecified prompts into product specs.
Argues that self-evaluation is weak by default and that a separate skeptical evaluator is easier to tune than a self-critical generator.
Distinguishes context resets from compaction: resets solve context anxiety more cleanly, but add orchestration cost.
Shows that scaffolding should be treated as temporary, and that its load-bearing assumptions should be re-tested as models improve.
Makes verification concrete through evaluator tooling, sprint contracts, and thresholded grading criteria.
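
The planner, generator, and evaluator roles and the thresholded grading listed above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the harness from the source: `run_harness`, `Verdict`, the criterion names, the threshold values, and the three stub functions are all hypothetical, and the stubs stand in for real model calls.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical grading criteria and minimum scores (not taken from the source).
THRESHOLDS: Dict[str, float] = {"functionality": 0.8, "design": 0.7}


@dataclass
class Verdict:
    """Evaluator output: per-criterion scores plus feedback for the generator."""
    scores: Dict[str, float]
    feedback: str

    def failed_criteria(self) -> List[str]:
        # A criterion fails if its score is missing or below its threshold.
        return [name for name, minimum in THRESHOLDS.items()
                if self.scores.get(name, 0.0) < minimum]


def run_harness(prompt: str,
                plan: Callable[[str], str],
                generate: Callable[[str, str], str],
                evaluate: Callable[[str, str], Verdict],
                max_rounds: int = 3) -> Tuple[str, int]:
    """Expand the prompt into a spec, then iterate generate -> evaluate."""
    spec = plan(prompt)
    feedback = ""
    for round_number in range(1, max_rounds + 1):
        build = generate(spec, feedback)
        verdict = evaluate(spec, build)
        if not verdict.failed_criteria():
            return build, round_number
        # Pass the separate evaluator's notes back to the generator.
        feedback = verdict.feedback
    raise RuntimeError(f"no build passed all thresholds in {max_rounds} rounds")


# Stub roles standing in for model calls, so the sketch runs on its own.
def stub_plan(prompt: str) -> str:
    return f"SPEC: {prompt}"


def stub_generate(spec: str, feedback: str) -> str:
    return f"{spec} | revised" if feedback else f"{spec} | draft"


def stub_evaluate(spec: str, build: str) -> Verdict:
    if build.endswith("revised"):
        return Verdict({"functionality": 0.9, "design": 0.8}, "")
    return Verdict({"functionality": 0.9, "design": 0.5}, "design below threshold")
```

Calling `run_harness("todo app", stub_plan, stub_generate, stub_evaluate)` fails the hypothetical design threshold on the first round and passes on the second. Keeping the evaluator as a separate callable mirrors the claim that a skeptical evaluator is easier to tune than a self-critical generator.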

Strongest Claims

Planner, generator, and evaluator roles create better long-running coding outcomes than a solo agent on tasks near the model's capability boundary.
Structured artifacts and explicit handoffs matter because long-running work loses coherence over time.
Evaluators are not universally required; they are worth the cost only when the task sits beyond what the current model handles reliably on its own.
Harnesses should become simpler when newer models absorb responsibilities that the scaffold previously had to supply.
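
One way to picture a structured handoff artifact is as a small record that a finishing session writes and a fresh session reads. The sketch below is illustrative only: the `HandoffArtifact` fields and the `opening_prompt` helper are hypothetical and are not the artifact format used in the source.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class HandoffArtifact:
    """Hypothetical state that one session leaves behind for the next."""
    spec: str
    completed: List[str] = field(default_factory=list)
    remaining: List[str] = field(default_factory=list)
    notes: str = ""

    def to_json(self) -> str:
        # Serialize to text so the artifact can be stored between sessions.
        return json.dumps(asdict(self), indent=2)

    @classmethod
    def from_json(cls, text: str) -> "HandoffArtifact":
        return cls(**json.loads(text))


def opening_prompt(artifact: HandoffArtifact) -> str:
    """Build the first message of a fresh session from the artifact alone."""
    next_task = artifact.remaining[0] if artifact.remaining else "nothing left"
    return "\n".join([
        f"Spec: {artifact.spec}",
        f"Done: {', '.join(artifact.completed) or 'none'}",
        f"Next: {next_task}",
        f"Notes: {artifact.notes or 'none'}",
    ])
```

Because the new session starts from the artifact rather than from the previous conversation, coherence depends on what was written down explicitly. That is the motivation for keeping the record structured rather than free-form.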

Practical Implications For This Vault

Source ingestion should preserve structured artifacts when a source describes an operating method rather than just a concept.
Topic pages about agent systems should capture not only architecture but also when each layer stops being worth its complexity.
Lint passes should look for sources that imply missing canonical topics, not just broken links.
