How to Build an Agent Harness: A Practical Guide from Teams Who Actually Did It

A practical, implementation-level guide to building agent harnesses, synthesized from the hard-won lessons of five production teams: Anthropic (Claude Code), OpenAI (Codex), Manus (now Meta), LangChain (DeepAgents), and Stripe (Minions).

The Problem: Your Agent Works in Demos, Dies in Production

You've built an AI agent. It works great in your demo. Then you ship it to real users, and it:

  • Gets lost after 15 tool calls and forgets what it was doing

  • Loops on the same failed approach for 20 minutes

  • Declares the task complete when it's half-done

  • Burns through $50 in tokens on a $2 task because it dumped your entire codebase into context

  • Produces architecturally incoherent code that compiles but makes no sense

These aren't model intelligence problems. They're harness engineering problems. The model is a CPU. Your agent needs an operating system.

This guide synthesizes the implementation patterns and hard-won lessons from five teams that have shipped production agent harnesses: Anthropic (Claude Code), OpenAI (Codex), Manus (now Meta), LangChain (DeepAgents), and Stripe (Minions). Together, these teams represent the state of the art in 2026, and their failures are more instructive than their successes.



What You're Actually Building

An agent harness is not a framework. Frameworks (LangChain, CrewAI, OpenAI Agents SDK) give you building blocks. A harness is the complete, batteries-included runtime that wraps a model and makes it operational.

The formula: Agent = Model + Harness

The model reasons. The harness does everything else.

My Take: The reason this distinction matters is economic. LangChain proved it empirically: their DeepAgents jumped from rank 30 to rank 5 on Terminal Bench 2.0 (52.8% to 66.5%) by changing only the harness; the model stayed the same. Manus proved it architecturally: they rewrote their harness five times in six months with the same underlying models, and each rewrite made the agent dramatically better. If you're spending time fine-tuning models when your harness is weak, you're optimizing the wrong variable.


The Four Production Architecture Patterns

Every production harness I've studied falls into one of four patterns. Understanding them is the first design decision you'll make.

Pattern 1: Single-Threaded Master Loop (Claude Code)

One flat message list. No complex threading. No competing agent personas. Subagents are depth-limited to 1 (they cannot spawn their own subagents).

When to use: Most tasks. Claude Code proves that a simple, single-threaded loop with disciplined tools and planning delivers controllable autonomy. Start here.

Pattern 2: Middleware Stack (LangChain DeepAgents)

Each middleware implements hooks: before_agent, wrap_model_call, before_tool_call, after_tool_call. Works like Express.js middleware but for the agent loop. Capabilities are composable: add or remove middleware to change harness behavior.

When to use: When you need fine-grained control over the agent loop and want to compose capabilities modularly. Good for benchmarking (swap middleware, measure impact).
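A minimal sketch of the hook pattern in Python (the Middleware and Agent classes here are illustrative, not DeepAgents' actual API; only the hook names come from the text above):

```python
# Sketch of the middleware-stack pattern: each middleware can observe or
# rewrite tool calls and wrap the model invocation itself.
class Middleware:
    def before_agent(self, state): pass
    def before_tool_call(self, call): return call
    def after_tool_call(self, call, result): return result
    def wrap_model_call(self, invoke, messages):
        return invoke(messages)  # default: pass through unchanged

class LoggingMiddleware(Middleware):
    def before_tool_call(self, call):
        print(f"tool: {call['name']}")  # observe without modifying
        return call

class Agent:
    def __init__(self, middlewares):
        self.middlewares = middlewares

    def call_model(self, invoke, messages):
        # Compose wrap_model_call like an onion: last middleware innermost.
        for mw in reversed(self.middlewares):
            invoke = (lambda inv, m: lambda msgs: m.wrap_model_call(inv, msgs))(invoke, mw)
        return invoke(messages)
```

Swapping a middleware in or out changes harness behavior without touching the core loop, which is exactly what makes this architecture good for benchmarking.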

Pattern 3: Protocol-First App Server (OpenAI Codex)

Three primitives: Thread (durable session), Turn (one unit of work), Item (atomic I/O with lifecycle). Same harness powers every surface through a bidirectional JSON-RPC protocol.

When to use: When you need to expose the harness through multiple surfaces (CLI, IDE, web). The protocol decouples agent logic from UI.

Pattern 4: Initializer + Worker Agents (Anthropic Long-Running)

When to use: Tasks spanning hours or days that exceed a single context window. The key insight: agents must quickly understand work state when starting with a fresh context window.

My Take: If you're building a general-purpose harness, start with Pattern 1 (master loop) and add Pattern 2 (middleware) for extensibility. Pattern 3 is for when you need multi-surface deployment. Pattern 4 is for when your tasks outlive a single context window. Most teams don't need Patterns 3 or 4 initially.


Building Stage by Stage: Day 1 to Production

Stage 1: The Agent Loop (Day 1)

Every harness starts with the same 20-line core:
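Something like this, in Python (client.chat and run_tool are hypothetical stand-ins for your model API and tool executor):

```python
# Minimal day-1 agent loop: call the model, execute its tool calls,
# feed results back, repeat until the model stops calling tools.
def agent_loop(client, run_tool, system_prompt, user_task, max_turns=50):
    messages = [{"role": "system", "content": system_prompt},
                {"role": "user", "content": user_task}]
    for _ in range(max_turns):
        reply = client.chat(messages)        # model output: text + tool calls
        messages.append(reply)
        if not reply.get("tool_calls"):      # no tool calls => loop ends
            return reply["content"]
        for call in reply["tool_calls"]:     # execute tools, append results
            result = run_tool(call["name"], call["args"])
            messages.append({"role": "tool", "tool_call_id": call["id"],
                             "content": str(result)})
    return "max turns exceeded"
```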

A "turn" is one round trip: model produces output with tool calls, harness executes tools, results feed back. The loop ends when the model produces output with no tool calls.

This is your entire harness on day 1. Do not add anything else until you see this loop fail.

Stage 2: Tool Definitions (Day 1-2)

Start with the minimum set. Seriously β€” Vercel found that removing 80% of their tools improved agent performance.

| Category | Tools | Purpose |
|---|---|---|
| Discovery | read_file, list_dir, glob, grep | Let the model explore |
| Modification | write_file, edit_file | Let the model change things |
| Execution | shell/bash | Let the model run commands |
| Planning | write_todos | Let the model track multi-step work |

Key principle: Do not build massive control flows. Provide robust atomic tools. Let the model make the plan.

Stage 3: System Prompt Engineering (Day 2-3)

Structure your system prompt in layers: identity, tool guidance, rules, then dynamic context.

Keep it under 60 lines. HumanLayer's research found that LLM-generated agent files cost 20% more tokens and hurt performance. Human-written, concise instructions win.

Stage 4: Permission & Safety Layer (Day 3-5)

Classify operations by risk:

| Risk Level | Examples | Behavior |
|---|---|---|
| Safe | read_file, list_dir, grep | Auto-approve |
| Moderate | write_file, edit_file | Require confirmation or match whitelist |
| Dangerous | shell commands, network access, git push | Require explicit approval |
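A sketch of this classification in Python (the tool sets and allowlist entries are illustrative, not a vetted policy):

```python
# Risk classification matching the table above: safe tools auto-approve,
# moderate tools need confirmation unless allowlisted, everything else
# (shell, network, git push, unknown tools) needs explicit approval.
SAFE = {"read_file", "list_dir", "grep", "glob"}
MODERATE = {"write_file", "edit_file"}
ALLOWLIST = {("write_file", "todo.md")}  # pre-approved (tool, target) pairs

def permission_for(tool, target=None):
    if tool in SAFE:
        return "auto-approve"
    if tool in MODERATE:
        return "auto-approve" if (tool, target) in ALLOWLIST else "confirm"
    return "explicit-approval"
```

Note the default: an unrecognized tool falls into the most restrictive bucket, which is the safe failure mode.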

Stage 5: Context Management (Day 5-10)

This is where most harnesses succeed or fail. Implement:

  • Token counting per message

  • Automatic compaction at ~85-92% context capacity

  • File-based overflow for large tool outputs

  • Cache-aware message construction (append-only design)
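The compaction trigger can be sketched like this (the 4-characters-per-token count is a crude stand-in; use your provider's tokenizer in practice, and note the threshold and keep_recent values are illustrative):

```python
# Compaction sketch: count tokens per message, and once usage crosses the
# threshold, replace the middle of the conversation with a summary stub
# while keeping the system prompt and the most recent turns intact.
def count_tokens(msg):
    return max(1, len(msg["content"]) // 4)  # rough approximation

def maybe_compact(messages, limit=200_000, threshold=0.85, keep_recent=10):
    used = sum(count_tokens(m) for m in messages)
    if used < threshold * limit:
        return messages
    head, tail = messages[:1], messages[-keep_recent:]  # system + recent turns
    summary = {"role": "user",
               "content": f"[compacted {len(messages) - 1 - keep_recent} messages]"}
    return head + [summary] + tail
```

In a real harness the stub would be a model-written summary (or a pointer to a file holding the full transcript), not a placeholder string.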

Stage 6: Persistence & State (Day 10-15)

  • Progress files (todo.md, progress.txt) that survive across sessions

  • Git integration for checkpointing

  • Session state serialization for crash recovery

Stage 7: Observability & Evaluation (Day 15-20)

  • Structured logging of every tool call and result

  • Token usage and cost tracking

  • LLM-as-judge evaluation pipeline

  • Trace-based debugging (link tool calls → reasoning → outcomes)

My Take: The ordering matters. I've seen teams jump to Stage 5 (context management) or Stage 7 (observability) before they have a working Stage 1 loop. The biggest insight from Manus's five rewrites: start simple, observe failures, add infrastructure in response to real problems. Pre-designing ideal configurations before real failures occur is the #1 anti-pattern.


Context Engineering: The Hard Problem

If there's one section of this guide to internalize, it's this one. Context engineering (what enters the model's context window, in what order, and what gets evicted) is the single highest-leverage investment in your harness.

Anthropic's context engineering principle: "Find the smallest set of high-signal tokens that maximize the likelihood of your desired outcome."

Lesson 1: KV-Cache Hit Rate Is Your Most Important Metric

From Manus (their single most important production metric):

| Token Type | Cost (Claude Sonnet) | Difference |
|---|---|---|
| Cached input | $0.30/MTok | Baseline |
| Uncached input | $3.00/MTok | 10x more expensive |

With a 100:1 input-to-output token ratio, cache efficiency dominates cost. The implementation rules:

  1. Append-only context design: never modify previous messages

  2. Stable prompt prefixes: even a single-token change invalidates downstream cache

  3. Deterministic JSON serialization with stable key ordering

  4. Session-sticky routing across distributed workers
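Rule 3 in miniature: Python dicts preserve insertion order, so two semantically identical tool results can serialize differently and silently break the prefix cache. Sorting keys fixes that:

```python
import json

# Cache-stable serialization: identical content must always produce the
# identical byte sequence, regardless of dict construction order.
def serialize(obj):
    return json.dumps(obj, sort_keys=True, separators=(",", ":"))

# Same content, different insertion order:
a = serialize({"path": "main.py", "status": "ok"})
b = serialize({"status": "ok", "path": "main.py"})
```

Without sort_keys, a and b would differ, and every token downstream of the mismatch would miss the cache.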

Lesson 2: Mask Tools, Never Remove Them

Dynamically removing tools mid-iteration invalidates the KV-cache and confuses the model about previously referenced tools. Instead, mask token logits during decoding: this constrains available actions without modifying tool definitions.

Manus uses a naming convention to enable this: all browser tools start with browser_, shell tools with shell_. Masking by prefix is cheap and cache-friendly.
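A sketch of prefix masking (the state names and policy here are hypothetical; in a real harness the resulting allowed set is applied as a mask over tool-name logits during decoding):

```python
# Tool definitions stay in context untouched (cache-safe); only the set
# of *callable* tools changes per state, selected by name prefix.
TOOLS = ["browser_open", "browser_click", "shell_exec", "file_read"]

def allowed_tools(state):
    # Hypothetical policy: while awaiting user input, permit only
    # browser tools; otherwise everything is callable.
    if state == "awaiting_user":
        return [t for t in TOOLS if t.startswith("browser_")]
    return TOOLS
```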

Lesson 3: File System as Unlimited External Memory

The file system is "unlimited in size, persistent by nature, and directly operable by the agent." Use it for:

  • Plans and progress tracking (todo.md, progress.txt)

  • Intermediate results too large for context

  • Structured memory the agent can read back later

Compression should be reversible: drop webpage content if the URL is preserved. Omit document text if the file path remains.
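A reversible-compression sketch along those lines (the 500-character cutoff and field names are arbitrary choices for illustration):

```python
# Reversible compression: drop bulky content but keep the handle
# (URL or file path) so the agent can re-fetch it later if needed.
def compress_observation(obs, max_len=500):
    if "url" in obs and len(obs.get("content", "")) > max_len:
        return {"url": obs["url"], "content": "[dropped; re-fetch via url]"}
    if "path" in obs and len(obs.get("content", "")) > max_len:
        return {"path": obs["path"], "content": "[dropped; re-read via path]"}
    return obs  # small observations pass through untouched
```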

Lesson 4: Task Recitation Prevents "Lost in the Middle"

Agents working on ~50+ tool call tasks lose track of their objectives. Manus's fix: the agent continuously rewrites a todo.md file, pushing the global plan into the model's recent attention span. This is deliberately redundant: the recitation forces the objective into the attention window where it has the most influence.
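A minimal recitation step might look like this (here the rewritten todo.md content is simply appended as the newest observation; in Manus it happens via an actual write_file tool call):

```python
# Recitation sketch: re-render the full plan with completion marks and
# append it as the most recent message, pulling the global objective
# back into the model's recent attention span every turn.
def recite(messages, todo_items, done):
    lines = [("[x] " if t in done else "[ ] ") + t for t in todo_items]
    todo_md = "## Plan\n" + "\n".join(lines)
    messages.append({"role": "user", "content": todo_md})
    return messages
```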

Lesson 5: Errors Are Signal, Not Noise

Leave failed actions in context. When the model sees a failed action and the resulting stack trace, it implicitly updates its beliefs and avoids repeating the same mistake. Removing errors removes the learning signal.

My Take: Context engineering is genuinely the "dark art" of harness engineering. Manus's $2B acquisition by Meta was essentially an acquisition of their context engineering expertise: same models as everyone else, dramatically better results. The counterintuitive insight: adding more context often makes agents worse. The art is knowing what to exclude.


The 10 Mistakes Everyone Makes

Synthesized from all five teams' published post-mortems:

1. Over-Engineering Before Understanding Failure Modes

Build the minimum viable harness. Observe how your agent actually fails. Add infrastructure in response to real problems, not anticipated ones. Manus, LangChain, and OpenAI all started simpler than they ended.

2. Context Flooding

Dumping all docs, Slack history, and database contents into context is like dumping your entire hard drive into RAM. Sometimes less context produces better results. Be surgical.

3. Building Static Harnesses

Models improve rapidly. Manus rewrote 5 times. LangChain re-architected 4 times. Vercel removed 80% of tools and got better results. Build harnesses that are easy to rip out and replace. Build to delete.

4. Complex Multi-Agent Orchestration Too Early

Start with a single-threaded loop. Claude Code proves this works for most tasks. Add multi-agent patterns only when you have evidence a single agent cannot handle the task.

5. Testing Only Happy Paths

Test: context overflow, tool failures, ambiguous instructions, looping behavior, partial completion, and what happens at the 50th tool call.

6. Oversized Monolithic Prompts

OpenAI's "one big AGENTS.md" approach failed. A single oversized prompt hides errors and makes debugging harder. Layer your prompts: identity, tools, rules, dynamic context.

7. Human-Only Knowledge

From the agent's perspective, anything it cannot access in-context does not exist. Knowledge in Google Docs, Slack threads, or people's heads is invisible. The repository must be the single source of truth.

8. Removing Errors From Context

Error messages and stack traces are learning signal. Removing errors removes the implicit feedback loop that helps the model avoid repeating mistakes.

9. No Verification Gate

Models declare tasks complete without proper validation. LangChain's PreCompletionChecklistMiddleware forces a verification pass before exit; this single hook was a major factor in their 13.7-point benchmark improvement.
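A sketch of such a gate (this mirrors the idea, not LangChain's actual middleware; the check names are illustrative):

```python
# Pre-completion gate: intercept the turn where the model stops calling
# tools (its "I'm done" signal) and bounce it back if any check fails.
def gate_completion(reply, checks):
    if reply.get("tool_calls"):            # still working, not a completion
        return reply
    failed = [name for name, ok in checks.items() if not ok()]
    if failed:
        return {"role": "user",
                "content": "Not done yet. Failing checks: " + ", ".join(failed)}
    return reply                           # all checks passed, allow exit
```

The checks dict would typically run the test suite, the linter, and a diff review before the harness accepts completion.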

10. Ignoring KV-Cache Economics

With a 100:1 input-to-output ratio, cache efficiency dominates cost. Modifying earlier messages invalidates the cache for everything downstream. Append-only design delivers 10x cost savings.


Lessons from 5 Production Harnesses

OpenAI Codex: "The Scaffolding Is the Product"

3 engineers built ~1 million lines of code in 5 months. What they learned:

  • Strict layered architecture with rigid dependencies. Agents produce architecturally incoherent code without enforced boundaries. The Codex codebase enforces: Types → Config → Repo → Service → Runtime → UI. Structural tests catch violations.

  • Custom linter messages that double as remediation instructions. When the agent violates an architectural constraint, the error message tells it how to fix the problem. The tooling teaches the agent while it works.

  • The discipline is in the scaffolding, not in individual lines of code. The harness, the linters, the tests, and the architectural constraints: that's what makes the code coherent.

Anthropic Claude Code: "One Feature at a Time"

Two failure patterns for long-running agents:

  1. Over-ambition: Agents attempt to "one-shot the app," running out of context mid-implementation.

  2. Premature completion: Agents declare projects done after seeing partial progress.

The fix: a two-agent architecture (initializer + coding agent), one feature at a time, and browser automation (Puppeteer MCP) for end-to-end testing. The session initialization ritual is critical: pwd → read git logs → review feature list → run init.sh → verify existing functionality → begin new work.

Manus: "Remove Complexity, Don't Add It"

Five rewrites in six months, each driven by a key insight:

| Rewrite | Key Lesson |
|---|---|
| 1 | KV-cache hit rate is the most important metric |
| 2 | Mask tools via logits, don't remove them from context |
| 3 | Use the file system as externalized memory |
| 4 | Constantly rewrite todo.md to manipulate attention |
| 5 | Preserve errors so the model learns from them |

The meta-lesson: each rewrite removed complexity. Complex tool definitions replaced by general shell execution. "Management agents" replaced by simple structured handoffs. As models get stronger, build less scaffolding and get out of the model's way.

LangChain DeepAgents: "The Harness Is the Benchmark"

Gained 13.7 points on Terminal Bench 2.0 by only changing the harness. The four-stage workflow that worked:

Plan → Build → Verify → Fix

The "Reasoning Sandwich" allocates maximum compute at planning and verification (the bookends), medium compute during implementation (the middle). This concentrates effort where it matters most.

Critical finding: harnesses are not model-portable. Claude Opus scored 59.6% with a Codex-optimized harness (vs. 66.5% for Codex). Every model needs its own harness tuning.

Stripe Minions: "The Walls Matter More Than the Model"

1,300+ merged PRs per week. The architecture alternates deterministic nodes with agentic nodes.

Key rule: cap retry attempts at 2. If the LLM can't fix it in two tries, a third won't help. Escalate to a human.
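The retry cap in miniature (function names are illustrative, not Stripe's code):

```python
# Retry cap: give the LLM two attempts at a fix, then hand off to a
# human rather than burning a third round on the same failure.
def fix_with_cap(attempt_fix, escalate, max_retries=2):
    for i in range(max_retries):
        if attempt_fix(i):        # attempt_fix returns True on success
            return "fixed"
    return escalate()             # two strikes: a human takes over
```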

The real moat: Stripe's decade of developer platform investment: 3 million tests, 500+ MCP tools, standardized environments. "If your developer platform is already excellent, agents can leverage it."

My Take: The convergent insight from all five teams is the same: constraints increase reliability. Every team that succeeded did so by restricting what agents could do, not by giving them more freedom. OpenAI enforced rigid architecture. Stripe made linting mandatory. Manus removed tools. Vercel stripped to essentials. The paradox of agent harness engineering: the more you constrain the agent, the more capable it becomes.


Measuring Harness Quality: The Metrics That Matter

Primary Metrics

| Metric | What It Measures | Source | Target |
|---|---|---|---|
| KV-Cache Hit Rate | Context engineering efficiency | Manus | >80% |
| Task Completion Rate | End-to-end reliability | All teams | Track by task type |
| Benchmark Score Delta | Points gained by harness changes, model held constant | LangChain (+13.7) | Increasing |
| Verification Completion Rate | % of tasks with self-verification before submission | LangChain | >95% |
| Doom-Loop Frequency | How often agents get stuck repeating broken approaches | LangChain | <5% |
| Retry-to-Resolution Ratio | CI rounds needed per task | Stripe (caps at 2) | <2 |
| Input-to-Output Token Ratio | Context usage efficiency | Manus (100:1) | Track trend |

The Harness-Only Benchmark

This is the most powerful evaluation method: hold the model constant, vary the harness, measure the delta.

LangChain used this approach to prove their harness changes were the cause of their benchmark improvements, not model drift. It's the scientific method applied to harness engineering.

Three Levels of Testing

Level 1: Tool Unit Tests. Given input X, does tool Y produce output Z?

Level 2: Scenario Tests. Given a task and initial state, does the agent reach the correct end state?

Level 3: LLM-as-Judge. For open-ended tasks, a separate model scores relevancy, correctness, completeness, and efficiency.


When to Iterate vs. When to Rewrite

Real Rewrite Cadences

| Team | Rewrites | Timeframe | What Triggered Them |
|---|---|---|---|
| Manus | 5 | 6 months | Each model capability jump obsoleted hand-coded logic |
| LangChain | 3-4 | 1 year | Evaluation failures revealed architectural limits |
| Vercel | Major simplification | – | Removing 80% of tools improved results |

Iterate When:

  • New failure pattern appears → add a guardrail

  • Agent repeats same mistake → add a linter rule with remediation message

  • Context window fills too fast → improve compaction, add sub-agents

  • Slow feedback loops → optimize typecheck/build speed to seconds

  • KV-cache hit rate dropping → stabilize prompt prefix

Rewrite When:

  • A new model release makes your control flow obsolete

  • Your harness is model-specific and you need to switch models

  • You can't remove "smart" logic without breaking everything

  • Retry logic is more complex than task logic

  • Systemic underperformance despite guardrail additions


The Minimum Viable Harness

You can build a working harness in 2-4 hours with 200-500 lines of code.

The maturity progression:

| Level | Scope | Timeline | Key Addition |
|---|---|---|---|
| Level 1 | Solo developer | 2-4 hours | Agent loop + basic tools |
| Level 2 | Single project | 1-2 days | Context management + persistence |
| Level 3 | Team use | 1-2 weeks | Permissions + observability + middleware |
| Level 4 | Production | 4-12 weeks | Multi-surface API + evaluation + safety |

Production harnesses are 5,000-20,000 lines. But you start at Level 1 and add infrastructure only when the agent actually fails.


Conclusion: Build to Delete

The most important principle in harness engineering is the most counterintuitive: build to delete.

Every piece of hand-coded logic is a liability when the next model ships. Manus rewrote 5 times. LangChain rewrote 4 times. What required complex pipelines in 2024 works via simple prompts in 2026. The harness you build today will be partially obsolete within months.

Design for that. Keep components modular. Keep the core loop simple. Add complexity only in response to observed failures, not anticipated ones. And measure everything β€” because the harness that wins isn't the most sophisticated one, it's the one that makes the model most effective at the tasks your users actually care about.

The formula remains: Agent = Model + Harness. You can't control the model. You can control the harness. That's where the leverage is.

