Anthropic Harness Design For Long-Running Application Development
Summary
This source describes how Anthropic evolved a long-running coding harness from simple decomposition into a planner-generator-evaluator architecture. The strongest gains came from explicit evaluation, structured handoff artifacts, and periodic simplification as model capabilities improve.
Source
- Raw file: raw/anthropic/Harness design for long-running application development.md
- Translated raw file: raw/anthropic/Harness design for long-running application development.zh.md
- Original URL: https://www.anthropic.com/engineering/harness-design-long-running-apps
- Author: Prithvi Rajasekaran
- Ingest date: 2026-04-08
Key Contributions
- Recasts multi-agent coding harnesses in generator-evaluator terms, with a planner added to expand underspecified prompts into product specs.
- Argues that self-evaluation is weak by default and that a separate skeptical evaluator is easier to tune than a self-critical generator.
- Distinguishes context resets from compaction: resets solve context anxiety more cleanly, but add orchestration cost.
- Shows that scaffolding should be treated as temporary, and that its load-bearing assumptions should be re-tested as models improve.
- Makes verification concrete through evaluator tooling, sprint contracts, and thresholded grading criteria.
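The generator-evaluator loop with sprint contracts and thresholded grading can be sketched roughly as follows. This is a minimal illustration, not the harness described in the source: `Criterion`, `SprintContract`, `run_sprint`, the `max_rounds` parameter, the 0-to-1 score scale, and both stub functions are hypothetical names and choices introduced here, and the stubs stand in for real model calls.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Criterion:
    name: str
    threshold: float  # assumed scale: minimum passing score in [0, 1]


@dataclass(frozen=True)
class SprintContract:
    goal: str
    criteria: tuple[Criterion, ...]


def failing_criteria(contract: SprintContract, scores: dict[str, float]) -> list[str]:
    """Return names of criteria whose score is below threshold (missing counts as 0)."""
    return [c.name for c in contract.criteria if scores.get(c.name, 0.0) < c.threshold]


def run_sprint(
    contract: SprintContract,
    generate: Callable[[SprintContract, Optional[frozenset], list[str]], frozenset],
    evaluate: Callable[[SprintContract, frozenset], dict[str, float]],
    max_rounds: int = 3,
):
    """Alternate generation and separate evaluation until every threshold is met."""
    artifact, feedback = None, []
    for round_no in range(1, max_rounds + 1):
        artifact = generate(contract, artifact, feedback)
        feedback = failing_criteria(contract, evaluate(contract, artifact))
        if not feedback:
            return artifact, round_no, True
    return artifact, max_rounds, False


def stub_generate(contract, previous, feedback):
    # Hypothetical generator: the first draft covers only the first criterion,
    # and each revision addresses whatever the evaluator flagged.
    if previous is None:
        return frozenset({contract.criteria[0].name})
    return frozenset(previous | set(feedback))


def stub_evaluate(contract, artifact):
    # Hypothetical evaluator: full marks for addressed criteria, zero otherwise.
    return {c.name: 1.0 if c.name in artifact else 0.0 for c in contract.criteria}


contract = SprintContract(
    goal="Login page",
    criteria=(Criterion("functionality", 0.8), Criterion("design", 0.6)),
)
artifact, rounds, passed = run_sprint(contract, stub_generate, stub_evaluate)
```

Keeping the evaluator as a separate callable mirrors the note's point that a skeptical evaluator is tuned independently of the generator; the thresholds live in the contract, so both sides agree on what "done" means before the sprint starts.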
Strongest Claims
- Planner, generator, and evaluator roles create better long-running coding outcomes than a solo agent on tasks near the model's capability boundary.
- Structured artifacts and explicit handoffs matter because long-running work loses coherence over time.
- Evaluators are not universally required; they are worth the cost only when the task sits beyond what the current model handles reliably on its own.
- Harnesses should become simpler when newer models absorb responsibilities that the scaffold previously had to supply.
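The difference between a context reset and compaction, and the role a structured handoff artifact plays in the former, can be illustrated with a small sketch. Everything here is an assumption made for illustration: the `Handoff` fields, the function names, and the representation of a context as a list of strings are hypothetical, and a real harness would summarize with a model rather than a placeholder.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Handoff:
    """Hypothetical structured artifact passed from one agent session to the next."""
    spec_summary: str
    completed: list[str] = field(default_factory=list)
    next_steps: list[str] = field(default_factory=list)


def reset_context(handoff: Handoff) -> list[str]:
    """Context reset: drop all history; a fresh session starts from the artifact alone."""
    return ["HANDOFF " + json.dumps(asdict(handoff), sort_keys=True)]


def compact_context(history: list[str], keep_last: int = 2) -> list[str]:
    """Compaction: the same session continues, with older turns collapsed in place."""
    if len(history) <= keep_last:
        return list(history)
    cut = len(history) - keep_last
    older, recent = history[:cut], history[cut:]
    # Placeholder summary; a real harness would ask a model to summarize `older`.
    return [f"SUMMARY of {len(older)} earlier turns"] + recent


history = ["t1", "t2", "t3", "t4"]
compacted = compact_context(history)
fresh = reset_context(Handoff("Build a notes app", ["auth"], ["editor"]))
```

The reset path depends entirely on the handoff being complete, which is where the orchestration cost comes from; the compaction path needs no artifact but carries forward whatever the summary happens to preserve.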
Practical Implications For This Vault
- Source ingestion should preserve structured artifacts when a source describes an operating method rather than just a concept.
- Topic pages about agent systems should capture not only architecture but also when each layer stops being worth its complexity.
- Lint passes should look for sources that imply missing canonical topics, not just broken links.