The Agent Harness: Everything Except the Model

There is a term consolidating across the AI engineering community right now, and it names something that has existed far longer than the word for it. The term is harness. It means: everything in an AI agent that is not the model itself. LangChain's Vivek Trivedy put it most cleanly: "If you're not the model, you're the harness." Anthropic's Claude Code documentation calls its SDK "the agent harness that powers Claude Code." Birgitta Böckeler at Thoughtworks wrote an entire framework for harness engineering on Martin Fowler's blog. Anthropic was already using the word in late 2025. By early 2026, after OpenAI published their own harness engineering guide, it had become the industry's default term. The concept it describes has been accumulating for two years.

Model size, training data, benchmark scores — all of these matter, but it turns out that the surrounding system often determines real-world agent performance more than the model inside it. LangChain demonstrated this directly: same model, same weights, different harness — their coding agent jumped from outside the top 30 to the top 5 on TerminalBench 2.0. According to LangChain's analysis, a separate research project hit a 76.4% pass rate by having an LLM optimise the harness infrastructure itself, surpassing hand-designed systems.

The model is the brain. The harness is the body. And it turns out the body matters enormously.

What a harness actually does

A raw LLM takes in text and outputs text. That is genuinely all it does. It cannot remember what happened five minutes ago. It cannot run a command. It cannot read a file. It cannot check whether its own output is correct. Every one of those capabilities — memory, tool execution, file access, self-verification — is a harness feature.

Beren Millidge mapped the analogy systematically in a 2023 essay: the LLM is the CPU, the context window is RAM (fast but limited), external databases are disk (large but slow), tool integrations are device drivers. The harness, in this framing, is the operating system.

The simplest possible harness is a chat interface. You type something, the model responds, the harness appends both messages to a list, and next time it sends the whole list back. That while loop — read input, call model, append output, repeat — is a harness. Everyone who has used ChatGPT has used one.
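That loop is small enough to write out. A minimal sketch in Python, where `call_model` is a stub standing in for a real completion API — all names here are illustrative:

```python
history = []  # the entire "memory" of this harness

def call_model(messages):
    # Stub: a real harness would call a completion API here.
    return f"You said: {messages[-1]['content']}"

def turn(user_input):
    # The whole harness: append the user message, send the full
    # history to the model, append its reply, and return it.
    history.append({"role": "user", "content": user_input})
    reply = call_model(history)
    history.append({"role": "assistant", "content": reply})
    return reply
```

Every turn resends the whole list. That is why the model appears to remember the conversation even though it does not.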

The interesting harnesses are the ones that go further. A production agent harness typically handles prompt construction (assembling the right context for the model at each turn), tool execution (intercepting the model's tool-call requests, executing them, and feeding results back), memory management (compacting, summarising, or persisting information across context windows), verification (running tests, checking outputs, retrying on failure), and state persistence (ensuring the project has memory even when the model does not). The model proposes. The harness disposes.

Why harnesses exist

Harnesses emerged because LLMs hit practical limits the moment anyone tried to use them for real work.

The first limit is memory. A standard LLM starts each session with no knowledge of previous interactions. Anthropic's engineering team described the problem precisely: imagine staffing a software project with engineers who work in shifts, where each new engineer arrives with complete amnesia about the previous shift. That is what a raw LLM does across context windows. Harnesses solve this by maintaining progress files, git histories, and structured logs that persist between sessions — giving the project memory even when the model has none.

The second limit is action. LLMs produce text. Tasks require actions — running code, querying databases, browsing the web, writing files. The harness bridges this by monitoring the model's output for tool-call commands, executing them in a sandboxed environment, and injecting the results back into the model's context. The harness gives the model hands.
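That bridge can be sketched as a dispatcher: the harness owns a registry of real functions, and the model only ever names them. The tool names and message shape below are illustrative, not any particular vendor's API:

```python
import json

# Illustrative registry: the harness, not the model, owns these functions.
TOOLS = {
    "add": lambda a, b: a + b,
    "read_file": lambda path: open(path).read(),
}

def execute_tool_call(call):
    # Run one model-requested tool call and wrap the result (or the
    # error) as a message to inject back into the model's context.
    func = TOOLS.get(call["name"])
    if func is None:
        return {"role": "tool", "content": f"error: unknown tool {call['name']}"}
    try:
        return {"role": "tool", "content": json.dumps(func(**call["arguments"]))}
    except Exception as exc:
        # Failures go back to the model as text, so it can correct itself.
        return {"role": "tool", "content": f"error: {exc}"}
```

Note that errors are returned, not raised: the model sees the failure and gets a chance to try something else.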

The third limit is discipline. Without external structure, agents attempt to solve everything at once, declare victory prematurely, or produce outputs that look plausible but fall apart on inspection. Anthropic discovered this when they tried to have Claude build a full web application: the model attempted to one-shot the entire app, ran out of context mid-implementation, and left the next session to start with half-built, undocumented code. The fix was structural — an initialiser agent that creates a feature list and a progress file, followed by coding agents instructed to work on one feature at a time and commit after each one. That fix is a harness.
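The shape of that fix is easy to sketch: a feature list on disk that every fresh session reads before doing anything. The file name and schema here are hypothetical; the pattern is what matters:

```python
import json
from pathlib import Path

def init_project(features, root="."):
    # Initialiser agent's output: every feature, with a completion flag.
    Path(root, "features.json").write_text(
        json.dumps([{"name": f, "done": False} for f in features], indent=2))

def next_feature(root="."):
    # A coding agent's first move each session: read state, pick ONE feature.
    items = json.loads(Path(root, "features.json").read_text())
    return next((f["name"] for f in items if not f["done"]), None)

def mark_done(name, root="."):
    # After implementing and committing, record progress so the next
    # (amnesiac) session knows where to resume.
    path = Path(root, "features.json")
    items = json.loads(path.read_text())
    for f in items:
        if f["name"] == name:
            f["done"] = True
    path.write_text(json.dumps(items, indent=2))
```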

The fourth limit is context decay. Even within a single session, model performance degrades as the context window fills with irrelevant or contradictory information — a phenomenon now called context rot. Harnesses manage this through compaction (summarising older exchanges) and selective context injection (giving the model only what it needs for the current step, not everything that has ever happened).
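Compaction can be sketched as a budget check: once the history grows past a limit, fold the oldest messages into a single summary and keep the recent ones verbatim. Here the summary is a placeholder string; a production harness would ask the model to write it:

```python
def compact(history, budget=8, keep_recent=4):
    # Nothing to do while the history fits the budget.
    if len(history) <= budget:
        return history
    old, recent = history[:-keep_recent], history[-keep_recent:]
    # Placeholder: a real harness would have the model summarise `old`.
    summary = {"role": "system",
               "content": f"[summary of {len(old)} earlier messages]"}
    return [summary] + recent
```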

None of these are model problems. They are all system problems. The harness exists because the gap between "can generate text" and "can do useful work" is a systems engineering problem, not a machine learning one.

Two meanings, one lineage

The word "harness" appears in two distinct but related contexts in AI engineering, and the distinction matters.

The first is the agent harness — the runtime infrastructure around an LLM-powered agent that enables it to plan, act, remember, and self-correct. This is what Anthropic, LangChain, and the broader agent infrastructure community are building. Claude Code, DeepAgents, Codex — these are agent harnesses.

The second is the evaluation harness — a framework for running models against standardised tasks so results are reproducible and comparable. The canonical example is EleutherAI's LM Evaluation Harness, which has become the default benchmarking tool for open-source language models. When a paper reports scores on MMLU or HellaSwag, it is almost certainly running through EleutherAI's harness.

The lineage is the same. Both are software systems that wrap a model, manage its inputs and outputs, and impose structure on what would otherwise be unconstrained text generation. The agent harness imposes structure to make the model useful. The evaluation harness imposes structure to make the model measurable. Different purposes, identical architectural principle: the model does not operate alone.

Who built this

The concept did not arrive from a single source. It converged.

Anthropic published "Effective Harnesses for Long-Running Agents" in November 2025, introducing the initialiser-plus-coding-agent pattern and the Claude Agent SDK as a general-purpose harness. The term exploded into common usage in February 2026 when OpenAI published a harness engineering guide by Ryan Lopopolo, describing how their team built roughly a million lines of production code with Codex agents — zero lines written by humans. Mitchell Hashimoto, creator of Terraform, distilled the core insight into a formula that practitioners adopted immediately: Agent = Model + Harness. LangChain formalised that equation in a series of posts through early 2026, culminating in the DeepAgents project — an open-source, batteries-included harness explicitly modelled on what made Claude Code effective. Birgitta Böckeler at Thoughtworks published a harness engineering framework on martinfowler.com in April 2026, introducing the feedforward/feedback distinction (guides that steer the agent before it acts, sensors that check after) and the computational/inferential axis (deterministic checks versus LLM-as-judge).

The evaluation harness lineage is older. EleutherAI's LM Evaluation Harness has been the standard benchmarking tool since 2023, and the concept of test harnesses in software engineering predates LLMs by decades. What changed is that the agent community adopted the word and extended it from testing to production runtime infrastructure.

There is no single inventor. There is a convergence point: the moment enough people tried to build production agents and discovered that the model was the easy part.

Examples in the wild

Claude Code is among the most prominent agent harnesses in production today. It wraps Anthropic's models with a terminal-based execution loop, file system access, git integration, context compaction, and a system prompt that constitutes a substantial portion of the agent's effective behaviour. When Anthropic says the Claude Agent SDK is "the agent harness that powers Claude Code," they are making a precise claim: the SDK is not the model. It is everything else.

LangChain's DeepAgents is an open-source harness built on the LangGraph runtime. It ships with a planning tool, filesystem access, sub-agent spawning, middleware hooks, and context management — what the team calls "a general-purpose version of Claude Code." The middleware architecture is particularly interesting: hooks that fire before and after each model call, enabling loop detection, context injection, and pre-completion checklists without modifying the core agent loop.

The Ralph Loop is a harness pattern, not a product. Created by developer Geoffrey Huntley, it wraps any coding agent in a bash loop: run the agent, let it finish, start it again with fresh context. The source of truth is the filesystem — the PRD, the progress file, the git history — not the agent's internal state. A YC hackathon team used it to generate 1,100+ commits across six repositories overnight for about $10.50 per hour per agent. The pattern is simple, but the principle is important: the harness controls the agent's lifecycle, not the other way around.
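The pattern is a few lines in any language. A Python rendering, where `agent_cmd` is whatever coding agent CLI you run — the command shown in the usage comment is a placeholder, and the `iterations` cap is added here only so the sketch can terminate:

```python
import subprocess

def ralph_loop(agent_cmd, iterations=None):
    # The harness owns the lifecycle: run the agent to completion, then
    # start it again with completely fresh context. All state that
    # matters lives in the filesystem and git history, not in the agent.
    n = 0
    while iterations is None or n < iterations:
        subprocess.run(list(agent_cmd), check=False)
        n += 1
    return n

# e.g. ralph_loop(["run-agent", "--prd", "PRD.md"])  # placeholder command
```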

clive takes a different architectural position. Instead of wrapping tools in schemas and APIs, it gives the agent a terminal and a keyboard. The harness is tmux — a multiplexed terminal emulator that provides isolation between tool panes, persistence across disconnections, and atomic screen capture. The agent reads the screen, decides what to type, sends keystrokes, observes the result, repeats. The terminal is the harness. Every CLI tool that exists becomes agent-accessible without an MCP server, a REST wrapper, or a schema definition.
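The observe-act cycle maps onto two real tmux subcommands: `send-keys` to type and `capture-pane -p` to read the screen. A sketch, with the session name illustrative and command construction split out so it can be inspected without a live tmux server:

```python
import subprocess

def send_keys_cmd(text, session="agent"):
    # Keystrokes followed by Enter, exactly as a human would type them.
    return ["tmux", "send-keys", "-t", session, text, "Enter"]

def capture_cmd(session="agent"):
    # `capture-pane -p` prints the pane's current screen to stdout.
    return ["tmux", "capture-pane", "-t", session, "-p"]

def type_and_read(text, session="agent", run=subprocess.run):
    # One agent step: act (type), then observe (capture the screen).
    run(send_keys_cmd(text, session), check=True)
    out = run(capture_cmd(session), capture_output=True, text=True, check=True)
    return out.stdout
```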

Each of these makes a different bet about what the harness should manage and what the model should handle. The convergence is on the principle: the model alone is not the agent.

How to build one

The minimal viable harness is surprisingly small. It is a while loop.

while True:
    # Assemble the prompt: system instructions, task, tools, recent history
    context = assemble_context(history, tools, task)
    response = call_model(context)
    if response.has_tool_calls():
        # Execute the requested tools; results re-enter context next turn
        results = execute_tools(response.tool_calls)
        history.append(response)
        history.extend(results)
    else:
        return response.text

That is the core of every agent harness. Anthropic's approach has been described as a "dumb loop" — all intelligence lives in the model, the harness just manages turns. The complexity is not in the loop — it is in everything the loop manages.

From that core, you layer in the features that the task demands.

Context management is first. The model has a fixed context window. Your harness decides what goes in it. At minimum, you need the system prompt, the current task, and the most recent tool results. As sessions get longer, you need compaction — summarising older exchanges to free space for new ones. Claude Code uses git commits as checkpoints and progress files as structured scratchpads. DeepAgents uses a filesystem backend. The Ralph Loop solves it by killing the agent and starting fresh, letting the filesystem carry state.
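The "at minimum" version is just a selection policy. A sketch — the budget here is a message count for simplicity, where a real harness would budget in tokens:

```python
def assemble_context(system_prompt, task, history, max_messages=10):
    # Always include the system prompt and the task; then fill the rest
    # of the window with only the most recent exchanges.
    return ([{"role": "system", "content": system_prompt},
             {"role": "user", "content": task}]
            + history[-max_messages:])
```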

Tool execution is second. As described above, the harness intercepts the model's tool-call requests, runs them in a sandboxed environment, and injects the results back into the model's context.

Verification is third, and most often skipped. A production harness does not accept the model's first output blindly. It runs tests, checks formatting, validates against acceptance criteria, and feeds failures back for retry. Böckeler's framework distinguishes computational sensors (linters, test suites, type checkers — deterministic and fast) from inferential sensors (LLM-as-judge, review agents — slower but semantically richer). The combination is powerful: run the fast checks on every change, run the expensive checks before integration.
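A computational sensor in its simplest form is a retry loop around a deterministic check. A sketch, with the test command injectable and `generate` standing in for whatever produces the agent's output:

```python
import subprocess, sys

def verify_and_retry(generate, max_attempts=3,
                     test_cmd=(sys.executable, "-m", "pytest", "-q")):
    # Do not accept the first output: run the checks, and if they fail,
    # hand the failure text back to the agent as context for a retry.
    feedback = None
    for _ in range(max_attempts):
        generate(feedback)  # the agent edits files, guided by feedback
        result = subprocess.run(list(test_cmd), capture_output=True, text=True)
        if result.returncode == 0:
            return True
        feedback = result.stdout + result.stderr
    return False
```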

Memory and persistence is fourth. If the task spans multiple sessions, the harness needs to persist state that survives context window boundaries. The options range from simple (a progress.txt file that the agent reads at the start of each session) to sophisticated (vector stores, structured databases, cross-session memory systems). The key insight from Anthropic's long-running agent work: give the project memory, not the model. Files, commits, and logs persist. The model's internal state does not.

Steering is ongoing. Böckeler calls this the human's job: iterating on the harness. When an issue happens multiple times, improve the feedforward controls (rules, instructions, reference docs) and feedback sensors (linters, review agents, structural tests) to prevent it from recurring. The harness is not a one-time build — it is a system that improves over time as you discover what the model gets wrong.

The deeper point

The harness concept is the realisation that model intelligence and system capability are different things, and that the gap between them is where most of the engineering value lives.

This is not a temporary state of affairs that disappears when models get smarter. LangChain's framing is instructive: as models improve, some of what lives in the harness today — planning, self-verification, long-horizon coherence — will get absorbed into the model. But that does not eliminate the harness. It shifts what the harness does. Just as prompt engineering did not disappear when models got better at following instructions, harness engineering will not disappear when models get better at managing their own context. The interface between model and world always requires a system to manage it.

The formula is worth internalising: Agent = Model + Harness. When someone says "I built an agent," they mean they built a harness and pointed it at a model. The model is available to everyone. The harness is where the differentiation lives.
