The Harness

What is an agent harness, and why it decides whether your agents work

Last month I watched a perfectly capable Claude Opus botch a refactor I could have finished in an afternoon. The model wasn’t the problem. I had given it no memory of the codebase’s invariants, no tool that could run the existing test suite, a single bloated system prompt, and free rein to edit files it didn’t understand yet.

The model isn’t what decided whether this worked. The harness around it did.

The short answer

An agent harness is the deterministic infrastructure around an AI model. Not the model itself. It’s the code that routes work, caches context, loads skills, picks which model runs where, and defines the tools the agent can reach for.

When your agent behaves reliably, it’s usually the harness doing the work. When it doesn’t, the harness is usually what’s missing.

A line from IndyDevDan sums it up: “Without the agent harness, there are no agents, no agentic coding, and no agentic engineering.” It sounds dramatic until you sit with it. The raw model is a brain in a jar. The harness is everything that turns it into something that can get work done.

What’s actually in a harness

If you strip a production agent down to its parts, you find roughly the same inventory every time: a context loader that decides what the agent sees, a model router, a prompt and skill loader, a tool registry with permission checks, and a session store that persists state between runs.

None of these is glamorous. All of them are load-bearing. Nate B Jones summarises it cleanly: building agents is 80% plumbing, 20% AI.

The Core Four: what the harness actually gives you control over

IndyDevDan’s framing is the tightest way I’ve found to think about this. An agent harness gives you four axes of control over every agent you run:

  1. Context. What information this agent can see. Which files it has access to, what’s in its system prompt, what memory it carries across turns, what domain constraints it knows about.
  2. Model. Which LLM runs this agent. Not a global setting. A per-agent choice.
  3. Prompt. The system prompt, the skills it has loaded, the mental models you’ve given it.
  4. Tools. What actions this agent can take. Read a file, run a test, fetch from an API, spawn a subagent, post to Slack.
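The Core Four become concrete the moment you write them down as per-agent configuration rather than a global default. Here is a minimal sketch; the names (`AgentConfig`, `run_tests`, the model strings) are illustrative, not from any real framework:

```python
from dataclasses import dataclass, field

@dataclass
class AgentConfig:
    name: str
    context_paths: list[str]  # Context: exactly which files this agent may see
    model: str                # Model: a per-agent choice, not a global setting
    system_prompt: str        # Prompt: the instructions and mental models it carries
    tools: list[str] = field(default_factory=list)  # Tools: the actions it may take

# A scoped refactoring agent: five files, a cheap model, three tools.
refactor_agent = AgentConfig(
    name="refactor",
    context_paths=["src/billing/invoice.py", "tests/test_invoice.py"],
    model="claude-haiku",  # a cheaper model is enough for scoped edits
    system_prompt="Preserve the invariants documented in tests. Run tests after each edit.",
    tools=["read_file", "edit_file", "run_tests"],
)
```

The point is not the dataclass; it is that every axis is an explicit field someone chose, so “left on defaults” becomes visible in code review.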

Owning the harness means owning all four, deliberately, for each agent you run. Most “my agent isn’t working” situations I’ve seen come down to one of these axes being left on defaults: the agent had access to the whole repo when it should have had five files, or it was running Opus when Haiku would have been fine, or it had a tool it should never have been given.

When you hear someone talk about “harness engineering,” this is the discipline they mean. Treat agent configuration as engineering, not conversation.

Brain, hands, session: a way to decompose a production harness

Anthropic’s engineering blog, in their managed-agents writeup, lays out a decomposition that has stuck with me because it cleanly separates the concerns most people conflate: the brain (the model and its reasoning), the hands (the tools it acts through), and the session (the state that persists across turns).

The insight buried in this is subtle and important: a harness encodes assumptions about what the model can’t do on its own. Those assumptions go stale as models get better. If your brain, hands, and session are clean boundaries, you can swap implementations behind them as the model improves. If they’re tangled, every model upgrade becomes a rewrite.

This decomposition also explains why “agent frameworks” written in 2023 often feel archaeological now. They encoded 2023 model assumptions into the harness. Better harnesses are the ones whose boundaries held as the models changed.
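One way to keep those boundaries clean is to put each concern behind its own interface, so an implementation can be swapped as models improve without touching the others. A minimal sketch, with all names (`Brain`, `Hands`, `Session`, `run_turn`) invented for illustration:

```python
from typing import Protocol

class Brain(Protocol):    # reasoning: the model call
    def decide(self, state: dict) -> str: ...

class Hands(Protocol):    # action: tool execution
    def act(self, action: str) -> str: ...

class Session(Protocol):  # state: what persists across turns
    def record(self, event: str) -> None: ...
    def history(self) -> list[str]: ...

def run_turn(brain: Brain, hands: Hands, session: Session, state: dict) -> str:
    action = brain.decide(state)              # brain proposes an action
    result = hands.act(action)                # hands execute it
    session.record(f"{action} -> {result}")   # session remembers what happened
    return result
```

A model upgrade then means replacing the `Brain` implementation; the hands and session, and everything wired to them, stay put.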

Why most people skip this and why you shouldn’t

There’s a pattern I see often, in my own work and in other people’s. Someone builds a thing with an LLM. It sort of works. Then it stops working reliably. The response is to try a better model, or tweak the prompt, or add more context to the prompt until it barely fits.

What’s usually missing is harness discipline. The agent was never given a clear context boundary, a fit-for-purpose tool set, or a deliberate model choice. Every run was a fresh conversation with a fresh general-purpose agent. That approach works for quick prototyping. It does not work for anything you need to run twice the same way.

You can spot the shift the first time you stop thinking in terms of “the prompt” and start thinking in terms of “the harness around this agent.” The questions change: What exactly can this agent see? Which model should run it, and why? Which tools does it actually need? What survives between runs?

Those are harness questions. The answers feel more like engineering decisions than chat decisions, because they are.

What this means for how you build

Three practical consequences I’ve internalised:

1. Invest in the boundaries before the fancy stuff. Session logs, tool registries, permission checks, and a clean model-picking interface are worth more than another prompt tweak. They compound. The prompt tweak evaporates the next time the model changes.

2. Make the harness explicit. If you can’t draw your agents and tools on a whiteboard without ambiguity, neither can the people who’ll read your code in six months. A diagram with five boxes and a few arrows is often the single most valuable deliverable in an AI-assisted project.

3. Assume the harness outlives the model. Write it that way. Stable interfaces, swappable implementations. Treat model choice as a config knob, not an architectural decision.
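To make the first two points concrete: a tool registry with a per-agent allowlist turns “which tools can this agent reach” into enforced code rather than a convention buried in a prompt. A hypothetical sketch (the `ToolRegistry` name and API are invented for illustration):

```python
class ToolRegistry:
    """Registry of callable tools with per-agent permission checks."""

    def __init__(self):
        self._tools = {}

    def register(self, name, fn):
        self._tools[name] = fn

    def call(self, agent_allowlist, name, *args):
        # The boundary lives here, not in the prompt: an agent without
        # permission cannot reach the tool no matter what the model asks for.
        if name not in agent_allowlist:
            raise PermissionError(f"agent may not call {name!r}")
        return self._tools[name](*args)

registry = ToolRegistry()
registry.register("read_file", lambda path: f"<contents of {path}>")
registry.register("delete_file", lambda path: f"deleted {path}")

scoped = {"read_file"}  # this agent gets read access only
print(registry.call(scoped, "read_file", "notes.md"))
# registry.call(scoped, "delete_file", "notes.md")  # would raise PermissionError
```

The registry also happens to be a natural place for the audit trail: every tool call flows through one choke point you can log.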

Why this matters right now

Claude Code itself, the product, is fundamentally an agent harness with a model running on top of it. The leak analysis of its internals read like a greatest-hits album of harness primitives: tool registry, permissions, session persistence, workflow state, token budgets, streaming events, compaction, audit trails. Nate B Jones’ breakdown of the leak identified twelve production-grade primitives across three tiers, and the pattern is telling: the “day-one non-negotiables” are all harness infrastructure. None of them are model choices.

The reason it matters right now, in 2026, is that the models are getting commoditised quickly. Opus 4.7, GPT 5, Gemini 3, and a long tail of smaller models are all genuinely good at the work. What separates a system that ships from a system that limps is not the model you picked. It’s the harness you built around it.

Where to go from here

This is the founding article for this site. The rest of the content here will be specific patterns inside the harness: how to design subagents that actually help rather than thrash, how to choose skills versus tools versus hooks, when to pay for an Opus call and when Haiku is enough, how to keep a session log that survives refactors.

If you want to keep reading today, the three sources this article leans on most are IndyDevDan’s Core Four framing, Nate B Jones’ breakdown of the Claude Code leak, and Anthropic’s managed-agents writeup on their engineering blog.

The pitch for this site is simple. Patterns from real work. Built in production, across greenfield and inherited systems. Not demos, not theory, not content for content’s sake.

Start from the harness, not the model.
