The Agent Harness: The Infrastructure Layer That Makes AI Agents Actually Work
Metadata
Created: 2026-02-21
Status: Published
Tags: AI Agents, Agent Harness, LLM Infrastructure, Enterprise AI, Anthropic, OpenAI
TL;DR
Everyone is building AI agents. Few are building what actually makes them work. The agent harness, the infrastructure layer that wraps around a model to give it memory, tools, safety, and persistence, is what separates a clever demo from a reliable production system. In 2025, we built agents. In 2026, we're building the harnesses that make them last.
The Core Problem: Models Are Brilliant, but Amnesiac
Imagine hiring the world's most capable consultant. She can write code, analyze contracts, browse the web, and reason through complex problems at superhuman speed. There's one catch: every morning she wakes up with no memory of the day before. She doesn't know your project, your codebase, your preferences, or what she finished last Tuesday.
That's an LLM without a harness.
A frontier model like Claude or GPT-4o can reason, plan, and execute sophisticated tasks within a single context window. But enterprise work doesn't fit in a context window. It spans days, involves multiple systems, requires human approval at critical junctures, and demands an audit trail when something goes wrong. The model cannot solve these problems on its own. The infrastructure around the model must solve them.
That infrastructure is the agent harness.
What Is an Agent Harness?
An agent harness is the software layer that wraps around an AI model to manage its lifecycle, context, tools, memory, and safety constraints: everything the model needs to function reliably in the world, but cannot provide for itself.
The definition from Salesforce, which has built one of the largest commercial agent platforms, is crisp:
"An AI agent harness is the operational software layer that manages an AI's tools, memory, and safety to ensure reliable, autonomous task execution."
Parallel.ai's technical breakdown adds a crucial clarification:
"The harness is not the 'brain' that does the thinking. It is the environment that provides the brain with the tools, memories, and safety limits it needs to function."
The brain and the harness are distinct. You can swap the brain (upgrade to a newer model) without touching the harness. You can scale the harness (add more guardrails, more memory backends, more tools) without retraining the model. This separation is deliberate and powerful.
The Three-Layer Stack
Before going deeper, it helps to understand where the harness sits in the full agent stack:
LangChain's team drew this distinction explicitly: LangChain is a framework, LangGraph is a runtime, and DeepAgents is a harness. Each level adds a different kind of abstraction. The harness is the outermost layer, the one that touches the real world.
Why Agents Fail Without a Harness
Three failure modes kill agents in production:
1. Context Rot
Every model has a fixed context window. In long-running tasks, that window fills with tool outputs, conversation turns, and intermediate results. As the window approaches its limit, the agent begins to "forget" the original goal. It ignores instructions from the start of the session. It re-does work it already completed. It declares victory when the task is half-finished.
Anthropic's engineering team documented this precisely: agents often "run out of context in the middle of implementation, leaving the next session to start with features half-implemented and undocumented." Even with context compaction (summarizing older messages to free window space), the summary doesn't always carry forward the right details.
2. Agent Collision
Multi-agent systems create a coordination problem. When several agents work in parallel toward a shared goal, each makes locally rational decisions that can be globally destructive. One agent writes a file another is reading. Two agents implement the same feature. An agent declares a task done before a dependency finishes. Without a harness managing shared state and locking, agents collide.
3. Black-Box Decisions
An agent without observability is a liability in any regulated environment. When it fails, and it will fail, you need to know what it decided, what tools it called, what data it read, and why it took the action it took. A raw LLM loop gives you none of this. The harness must record every step, every tool call, every decision, before the agent touches any production system.
The Six Core Components of an Agent Harness
Component 1: Context Management
Context management is the harness's job of controlling what the model sees β and what it doesn't.
The challenge: Models nominally have large context windows (128K to 1M+ tokens), but research consistently shows they effectively use only the first and last 8K to 50K tokens. Critical instructions buried in the middle get ignored.
What the harness does:
Prioritization: Keeps the task goal and current state near the beginning of the context; compresses or evicts stale tool outputs
Compaction: When the window fills, the harness summarizes the session, preserving architectural decisions, unresolved bugs, and open questions while discarding repetitive tool outputs
Artifact externalization: Instead of keeping everything in-context, the harness writes structured artifacts to disk (feature lists, progress files, git state) that the next session can read on startup
Anthropic's implementation in Claude Code: The Claude Agent SDK passes the full message history to the model to generate a compressed summary when approaching the context limit. The summary explicitly preserves what matters: architectural decisions, unresolved bugs, in-progress implementations. It discards what doesn't: redundant tool confirmations, exploratory dead ends.
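The compaction pattern above can be sketched in a few lines. This is a toy illustration, not the Claude Agent SDK's actual API: the word-count "tokenizer" and the `summarize` placeholder (which a real harness would delegate to the model itself) are assumptions for demonstration.

```python
def count_tokens(messages):
    # Crude stand-in for a real tokenizer: roughly one token per word.
    return sum(len(m["content"].split()) for m in messages)

def summarize(messages):
    # Placeholder summarizer: a real harness would ask the model to write
    # this summary, preserving decisions and open questions.
    firsts = [m["content"].split(".")[0].strip() for m in messages]
    return {"role": "system",
            "content": "Summary of earlier work: " + "; ".join(firsts)}

def compact(messages, budget=50, keep_recent=2):
    """Once the token budget is exceeded, replace all but the most
    recent messages with a single summary entry."""
    if count_tokens(messages) <= budget:
        return messages
    older, recent = messages[:-keep_recent], messages[-keep_recent:]
    return [summarize(older)] + recent
```

The key property is that recent turns survive verbatim while older ones collapse into one entry, so the window never overflows mid-task.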
Component 2: Tool Orchestration
Tools are how agents act on the world. The harness manages the entire tool call lifecycle: detection, validation, execution, and result injection.
The lifecycle:
Model generates text containing a tool call (a structured JSON block)
Harness intercepts the output, detects the tool call pattern
Harness pauses model text generation
Harness executes the tool in an appropriate sandbox (browser, shell, database, API)
Harness injects the result back into the model's context
Model resumes reasoning over the live result
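The six-step lifecycle above can be sketched as a loop. Everything here is illustrative: the model is faked as a scripted function, and the JSON tool-call format and the `add` tool are assumptions, not any vendor's actual wire protocol.

```python
import json

# Hypothetical tool registry; a real harness would sandbox each of these.
TOOLS = {"add": lambda args: args["a"] + args["b"]}

def extract_tool_call(text):
    # Step 2: detect a structured JSON block of the form
    # {"tool": ..., "args": {...}} in the model's output.
    try:
        call = json.loads(text)
    except json.JSONDecodeError:
        return None
    return call if isinstance(call, dict) and "tool" in call else None

def run_loop(model, prompt, max_steps=5):
    context = [prompt]
    for _ in range(max_steps):
        output = model(context)                     # step 1: model generates
        call = extract_tool_call(output)
        if call is None:
            return output                           # plain text: final answer
        result = TOOLS[call["tool"]](call["args"])  # steps 3-4: pause + execute
        context.append(f"tool_result: {result}")    # step 5: inject result
    raise RuntimeError("max steps exceeded")        # step 6 repeats until done
```

A scripted stand-in for the model makes the loop testable: it emits one tool call, then reasons over the injected result.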
What sophisticated harnesses add on top:
Tool access control: Only certain agents can call certain tools (the filing agent can't delete the database)
Rate limiting: Prevents an agent from hammering an external API in a loop
Sandboxing: Code execution happens in an isolated container; filesystem writes are scoped to a working directory
Retry logic: On transient failures, the harness retries with exponential backoff before surfacing the error to the model
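The retry policy in the last bullet is simple to make concrete. A minimal sketch, assuming transient failures surface as `ConnectionError` and using short delays for illustration:

```python
import time

def call_with_retry(tool, args, retries=3, base_delay=0.01):
    """Retry a tool on transient failure with exponential backoff,
    surfacing the error to the model only after the last attempt."""
    for attempt in range(retries):
        try:
            return tool(**args)
        except ConnectionError:
            if attempt == retries - 1:
                raise                              # give up: let the model see it
            time.sleep(base_delay * 2 ** attempt)  # 1x, 2x, 4x... backoff
```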
Anthropic's implementation: The computer use framework gives Claude access to specialized tools (computer for GUI interaction via screenshots, bash for shell commands, text_editor for file manipulation), each with its own execution environment and sandboxing layer. Claude Code's harness (now the Claude Agent SDK) manages these tool calls through a recursive loop: claude → tool call → harness executes → result → claude → ...
Component 3: Memory Architecture
Memory is how agents accumulate knowledge across time. The harness implements multiple memory tiers with different latencies, capacities, and persistence guarantees.
The four memory tiers:
Working Memory: current task state and recent messages; lifespan: current session; store: in-context window
Episodic Memory: past task summaries and decisions made; lifespan: days to weeks; store: vector DB, key-value store
Semantic Memory: domain knowledge, user preferences, codebases; lifespan: long-term; store: vector DB + retrieval
Procedural Memory: how to do recurring tasks (skills); lifespan: permanent; store: prompt templates, skill libraries
LangChain DeepAgents' implementation: The harness provides a configurable virtual filesystem as its primary memory backend. File system operations (read, write, search) become the memory interface. Any backend (local disk, S3, a database) can plug in behind the same interface. Memory, code execution, and context management all share this filesystem abstraction.
Letta's (MemGPT) implementation: Uses a three-tier architecture: in-context core_memory blocks for the agent's current self-model and task state; archival_memory for long-term vector search; and conversation_memory with automatic pagination when the message buffer grows too large.
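The filesystem-as-memory idea can be sketched as one interface with swappable backends. This is an illustration of the pattern, not DeepAgents' or Letta's actual API: the class and method names are assumptions, and an in-memory dict stands in for local disk or S3.

```python
class DictBackend:
    """In-memory backend; a disk or S3 backend would expose the same methods."""
    def __init__(self):
        self.files = {}
    def write(self, path, content):
        self.files[path] = content
    def read(self, path):
        return self.files[path]
    def search(self, term):
        # Naive substring search; a semantic tier would use vector retrieval.
        return [p for p, c in self.files.items() if term in c]

class AgentMemory:
    """The agent talks only to this interface, so backends can be
    swapped without changing agent code."""
    def __init__(self, backend):
        self.backend = backend
    def remember(self, path, note):
        self.backend.write(path, note)
    def recall(self, term):
        return self.backend.search(term)
```

Because the agent only ever sees read/write/search, moving from local disk to S3 is a one-line change in harness configuration, not an agent rewrite.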
Component 4: Human-in-the-Loop (HITL)
Agents that act autonomously on production systems must pause at high-stakes decisions. The harness implements the checkpoint system that makes this possible.
The mechanism:
Every tool call passes through a policy engine before execution
The policy engine classifies the action by risk level (read-only vs. write vs. destructive)
High-risk actions trigger a pause: execution halts, a notification fires, a human reviews
The human approves, rejects, or modifies the proposed action
The harness resumes (or aborts) based on the decision
What this looks like in practice:
An agent that can read any file but must ask before writing to production configs
An agent that can query any database but must approve before running UPDATE or DELETE
An agent that can browse the web but must confirm before submitting any form
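The checkpoint mechanism described above can be sketched as a small policy gate in front of every tool call. The risk rules and the approval callback are assumptions for illustration; a production harness would back this with a real approval queue.

```python
# Hypothetical risk classification: which tool names count as high-risk.
HIGH_RISK = {"write_file", "run_sql_update", "submit_form"}

def classify(action):
    """Classify an action by risk level before it executes."""
    return "high" if action["tool"] in HIGH_RISK else "low"

def execute_with_hitl(action, run_tool, ask_human):
    """Gate every tool call: high-risk actions pause for human review;
    the harness resumes or aborts based on the decision."""
    if classify(action) == "high":
        if not ask_human(action):          # pause: human approves or rejects
            return {"status": "rejected"}
    return {"status": "done", "result": run_tool(action)}
```

Note that the gate sits in the harness, not the prompt: the model never gets the chance to execute a high-risk tool unreviewed.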
Anthropic's Claude Code harness: Implements HITL through permission scopes. The harness checks each bash command and file operation against a permission set (read-only mode, sandbox mode, full access mode). In enterprise deployments, this maps to a formal approval queue. The agent cannot proceed until the human approves or rejects.
LangGraph's implementation: Built-in interrupt patterns that pause graph execution at any node. The state machine checkpoints its state to durable storage before pausing, so if the human takes 30 minutes to respond, the graph resumes exactly where it left off, with no context lost.
Component 5: Lifecycle Management
An agent harness manages the full lifecycle of an agent: initialization, execution, session handoffs, recovery from failure, and graceful termination.
The challenges the harness solves:
Cold start: On first launch, set up the working environment (git repo, feature list, progress files)
Warm resume: On subsequent launches, reload prior state from persistent artifacts
Session handoff: When a session ends (context full, timeout, user exit), write a structured handoff document for the next session
Crash recovery: If the agent dies mid-task, the harness can recover the last checkpoint and resume
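The four lifecycle challenges above all reduce to one discipline: persist state outside the session. A minimal sketch, assuming a single `checkpoint.json` file per workspace (the file layout and field names are illustrative, not any vendor's format):

```python
import json
import os

def launch(workdir):
    """Cold start or warm resume, decided by whether a checkpoint exists."""
    path = os.path.join(workdir, "checkpoint.json")
    if os.path.exists(path):              # warm resume: reload prior state
        with open(path) as f:
            return json.load(f)
    state = {"session": 1, "completed": [], "next_step": "setup"}
    with open(path, "w") as f:            # cold start: initialize workspace
        json.dump(state, f)
    return state

def handoff(workdir, state, notes):
    """Session handoff: write a structured artifact for the next session."""
    state["session"] += 1
    state["handoff_notes"] = notes
    with open(os.path.join(workdir, "checkpoint.json"), "w") as f:
        json.dump(state, f)
```

Crash recovery falls out for free: whatever was last checkpointed is what the next launch sees.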
Anthropic's two-agent harness for long-running tasks:
The engineering blog post describes a two-agent pattern that solves the session handoff problem: an initializer agent sets up the environment and writes a structured feature list on day one, and a coding agent then works through that list one session at a time.
The session handoff is not just a summary: it's a structured artifact. The coding agent leaves explicit notes about what it tried, what it left unfinished, and what the next session should do first. This dramatically reduces the "forgot what I was doing" failure mode.
Component 6: Observability and Evaluation
An agent harness records every step so you can understand, debug, and improve agent behavior.
What the harness captures:
Every message in the conversation loop
Every tool call: name, arguments, result, latency, error
Every decision point: what options the agent considered, which it chose
Every state transition: what changed in memory, filesystem, external systems
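The capture list above implies one crucial ordering rule: log before you act, so a crash still leaves a complete trace. A minimal write-ahead tracer sketch; the record fields are assumptions, not any specific framework's schema:

```python
import time

class Tracer:
    """Records every tool call before execution, then attaches the
    result (or error) to the same record afterward."""
    def __init__(self):
        self.events = []

    def traced_call(self, name, fn, **args):
        record = {"tool": name, "args": args, "ts": time.time()}
        self.events.append(record)      # logged *before* the action executes
        try:
            record["result"] = fn(**args)
        except Exception as e:
            record["error"] = repr(e)   # failures leave a trace too
            raise
        return record["result"]
```

Because the record is appended before `fn` runs, even a hard crash inside the tool leaves evidence of what was attempted.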
What evaluation does with this data:
Regression testing: Did a model update break a workflow that previously passed?
Correctness scoring: Did the agent reach the right answer, or just a confident-sounding wrong one?
Efficiency metrics: How many tool calls did it take? How much context did it consume?
Safety auditing: Did the agent ever attempt an action it wasn't authorized to take?
OpenAI's eval harness: The OpenAI Evals framework defines evals as: prompt → captured run (trace + artifacts) → checks → score. The harness records the full trace of an agent run. A separate grader (which can itself be an LLM) then scores each trace against a rubric. Scores accumulate into a dashboard that tracks model and harness quality over time.
Anthropic's Bloom: An open-source agentic framework for behavioral evaluations. Given a target behavior to evaluate (e.g., "does the agent refuse to execute unauthorized SQL?"), Bloom automatically generates a battery of test scenarios, runs the agent against them, and quantifies how often the behavior occurs and how severe it is when it does.
Real Companies, Real Harnesses
Anthropic: Claude Code / Claude Agent SDK
Claude Code is, in essence, an agent harness. Anthropic's engineering team describes it directly: "Claude Code is a flexible agent harness." The core primitives β context compaction, tool execution, HITL approval, session persistence β were extracted from Claude Code and published as the Claude Agent SDK, enabling any team to build long-running agents with the same infrastructure.
Key design decisions:
Filesystem as IPC: Agent coordination happens through files, not sockets. Simple, debuggable, backend-agnostic.
Git as state: Committed code is clean state. A crash mid-commit is recoverable because git is transactional.
Structured handoffs: Progress is written to explicit files (not just context summaries) that survive session boundaries.
OpenAI: Harness Engineering with Codex
OpenAI's internal team built an entire product through a practice they call "harness engineering" and published a detailed breakdown of what it means to build large-scale software with Codex agents instead of human engineers.
Their harness has three distinctive components:
Context Engineering: A continuously maintained knowledge base embedded in the codebase itself. Agents have access to observability data, browser navigation, and architectural documentation as live context β not static snapshots.
Architectural Constraints: Enforced not just by LLM judgment but by deterministic linters and structural tests. An agent cannot merge code that violates architectural rules; the harness catches violations mechanically before they propagate.
Garbage Collection Agents: Periodic agents that scan the codebase for documentation drift, architectural violations, and stale constraints. They fight entropy. Without them, a million-line AI-generated codebase would degrade into chaos.
The result: a team of Codex agents wrote approximately one million lines of production code across 1,500 pull requests in five months, with zero manually written code.
LangChain: DeepAgents (Open-Source Harness)
LangChain published DeepAgents as an explicit, opinionated agent harness. Its architecture makes the harness abstraction concrete:
Key capabilities:
Planning tool: The agent breaks work into a plan before executing, giving the harness structure to track progress
Filesystem backend: Memory, code artifacts, and context all share one storage interface; swap from local disk to S3 without changing agent code
Subagent spawning: The harness manages ephemeral sub-agents for isolated parallel tasks; each gets its own context, its own filesystem scope, its own HITL policy
Salesforce: Agentforce
Salesforce's Agentforce is the enterprise-grade harness designed for CRM-native agents. Its harness abstractions are built around enterprise concerns that pure ML companies often overlook:
Data 360: A unified data layer with PII redaction on ingestion, so agents never see raw customer data in their context
Intelligent Context: Unstructured CRM data (emails, notes, call logs) is automatically structured before entering agent context
Governance controls: Role-based access control on which agents can call which tools (a support agent cannot access billing systems)
Agentforce 3 observability: Per-agent audit trails that satisfy enterprise compliance requirements
Applying the Agent Harness to Enterprise Systems
The research is clear: context engineering, tool management, memory, and observability are table stakes for any production agent system. Here's what each component means when you build one yourself.
Design Principle 1: Externalize All State
Never rely on context as your only state store. Context is ephemeral; it disappears when the session ends. Every piece of state that must survive a session boundary goes into an external store:
Work-in-progress → git commits (atomic, reversible)
Task tracking → a JSON file or database row (readable by the next session)
Decisions made → a structured log (auditable, debuggable)
Agent preferences → a vector database (retrievable by semantic search)
The Anthropic harness pattern is instructive: the initializer agent writes a feature list on day one. Every coding session reads that list, finds the next unchecked item, and marks it done when finished. The harness never has to trust the model's memory.
Design Principle 2: Enforce Safety at the Infrastructure Level
Don't rely on the model to refuse dangerous actions. The model will be convinced, manipulated, or simply confused eventually. Safety must be deterministic, enforced by the harness before the tool executes.
The key insight from OpenAI's harness: architectural constraints are enforced by deterministic linters, not LLM judgment. The LLM can't violate a database schema. The linter catches it mechanically.
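A deterministic check of this kind is mechanical by design. The sketch below is an illustration of the idea, not OpenAI's actual linters: a rule that rejects destructive SQL lacking a WHERE clause, applied before the query ever reaches a database.

```python
import re

def check_sql(query):
    """Deterministic guard: reject UPDATE/DELETE statements that have
    no WHERE clause, regardless of how the model justifies them."""
    q = query.strip().lower()
    destructive = re.match(r"^(update|delete)\b", q) is not None
    if destructive and " where " not in f" {q} ":
        raise PermissionError("destructive statement without WHERE clause")
    return query
```

The point is the failure mode: the model can be persuaded, but a regex cannot. The check either passes or it doesn't.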
Design Principle 3: Build for Observability First
An agent that you can't debug is worse than no agent at all: it fails opaquely and takes your trust with it. Build your trace model before you build anything else:
Every step is logged before the action executes. On failure, you have a complete trace. On success, you have the data to score the run and catch regressions when the model updates.
Design Principle 4: Plan Before You Act
The biggest predictor of agent success in complex tasks is whether the agent plans before it executes. The harness should require a plan:
Force a planning step: The harness injects a system prompt that requires the agent to write a plan to a file before taking any external action
Make the plan machine-readable: JSON, not prose. The harness can parse it, track progress against it, and detect when the agent diverges
Use the plan for recovery: If the agent crashes mid-task, the harness loads the plan, identifies the last completed step, and resumes from there
This is exactly what Anthropic's initializer agent does: it writes 200+ requirements to a JSON file before a single line of code is written. Every subsequent session reads that file and knows exactly where it stands.
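A machine-readable plan with crash recovery can be sketched as follows. The JSON schema here is an assumption for illustration, not Anthropic's actual feature-list format:

```python
import json

def make_plan(steps):
    """Build a machine-readable plan the harness can parse and track."""
    return {"steps": [{"id": i, "desc": s, "done": False}
                      for i, s in enumerate(steps)]}

def next_step(plan):
    """On start (or after a crash), find the first incomplete step."""
    for step in plan["steps"]:
        if not step["done"]:
            return step
    return None                     # plan complete

def mark_done(plan, step_id):
    plan["steps"][step_id]["done"] = True
```

Because the plan is JSON rather than prose, it survives a round-trip through disk, and a resumed session needs no memory of the previous one to know where to pick up.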
Design Principle 5: Layer Your Memory
Don't reach for a vector database as your first answer to memory. Build a layered system:
Retrieve from L4 (procedural memory) on task start; summarize L1 (working memory) into L2 (episodic memory) on session end; write outputs to L3 (semantic memory) continuously. This gives you the right information at the right time without blowing the context window.
The Shift That Changed Everything
In 2025, every team asked: "Which model should we use?"
In 2026, the smarter teams are asking: "How should we harness it?"
Aakash Gupta's framing is direct: "2025 was agents. 2026 is agent harnesses." The model has become a commodity. The harness is the moat.
Philipp Schmid from Google DeepMind echoes this: "Great harnesses manage human approvals, filesystem access, tool orchestration, sub-agents, prompts, and lifecycle. The harness determines whether agents succeed or fail."
OpenAI's harness engineering experiment is the clearest proof. They didn't build better models to generate a million lines of production code. They built a better harness: better context engineering, better architectural enforcement, better garbage collection. The model was already capable. The infrastructure made it reliable.
The same truth holds for enterprise teams. You don't need a better model to build a reliable agent. You need:
Context management that keeps the right information in the window
Tool orchestration that executes safely and recovers from failure
Layered memory that persists knowledge across sessions
Human-in-the-loop controls that enforce policy deterministically
Lifecycle management that handles multi-session work gracefully
Observability that gives you a complete audit trail
Build these six things, and your agents will work. Skip them, and no model, however powerful, will save you.