How to design subagents that help rather than thrash

A few weeks ago I dispatched a code-review subagent over a feature branch. Fifteen minutes later it returned a tidy summary: three nits, no blockers. We shipped. Monday morning the bug report came in. The reviewer had walked the diff cleanly, missed a race condition in the concurrency path, and written “looks good” with the confidence of something that had checked.

The subagent wasn’t broken. I had given it the wrong tools, the wrong model, and a description that routed it toward nits rather than correctness. The failure was the harness around the agent. Confident nonsense from a well-meaning delegate is what this article is about.

The short answer

A subagent is a separate Claude instance with its own context window, system prompt, and tool list, invoked from the main agent via the Agent tool. It sees only the prompt you pass it plus whatever inheritance you opted into (CLAUDE.md via settingSources, the tools you listed, the skills you named). It does not see the parent’s conversation history, tool results, or system prompt. It returns only its final message.

You design one subagent per role. The levers are the four the founding pillar named: context, model, prompt, tools. Output scales with how deliberately you pick them.

When a subagent earns its spot

Delegate when the work would flood your main conversation with content you won’t reference again. Anthropic’s threshold is worth quoting: the task touches 10 or more files, or contains 3 or more independent pieces of work that don’t depend on each other. Below that bar, inline is faster, cheaper, and usually fine.

The vibe-check version, borrowed from Builder.io: “noisy, bounded, easy to summarize.” Noisy means the work would swamp the parent. Bounded means the subagent can finish. Easy to summarize means the final message is compact and load-bearing.

Do not delegate work tightly coupled to the parent’s decision path, work that is already small, or work where the parent has to see every intermediate step. A subagent for “add a comment to this function” is pure overhead. A subagent for “read these 30 files and tell me where the retry logic lives” earns its session instantly.

The four controls you’re choosing per subagent

The four harness controls from the founding pillar apply per role, chosen again per subagent.

Context. What this subagent sees. A repo-explorer gets read access to the whole tree. A release-notes writer sees only the commit log and the last CHANGELOG. Narrow surface, tighter output.
Model. Which model runs this subagent. Haiku for fan-out. Sonnet as the default. Opus for work you’d hate to rerun. The parent’s model is not the ceiling.
Prompt. The system prompt plus any skills you list. Shapes the mental model the subagent brings to its slice. Bad role descriptions become bad results.
Tools. The subset of tools the subagent can reach for. Default narrow, widen on evidence. Analysis subagents get Read, Grep, Glob. Test runners get Bash, Read, Grep. Full-access should be rare and justified.

When a subagent thrashes, it is almost always one of these four left on defaults.

Writing a description Claude will actually route to

The description field is the router. Anthropic’s subagent docs restate this more than any other claim, and it is the thing people tune last.

The rule: describe the situation that should trigger delegation, not the agent’s identity.

Bad: description: "Security expert."

Good: description: "Reviews code for security vulnerabilities before commits, especially on auth, payment, and session-handling paths. Use proactively."

The bad version is a noun. Noun descriptions force Claude to guess when they apply, and Claude guesses wrong. The good version names trigger, scope, and urgency, and earns proactive routing through the “Use proactively” convention.

If a subagent isn’t firing when you expect, check the description first. If invoking by name works but auto-routing doesn’t, the agent works and the description doesn’t.

Scoping tools and model per subagent

Two design choices, one section, because they fail the same way: parent defaults produce subagents more capable than they need to be and less predictable than you want.

Tools. Anthropic’s SDK reference exposes both tools and disallowedTools on AgentDefinition. Use the narrow list for anything that touches the filesystem or the shell. Three patterns I reach for:

Read-only analysis. Read, Grep, Glob. Fits repo explorers, architecture surveyors, doc extractors. Cannot write, run, or fetch.
Test runner. Bash, Read, Grep. Executes tests, reads output, greps logs. Cannot edit source.
Full-access. Inherit from parent. Reserve for subagents that genuinely need to edit code in isolation.

Model. Set per subagent via the model field:

Haiku for cheap fan-out. File triage, log scanning, callsite discovery. High volume, low risk.
Sonnet as the sensible default. Most roles live here.
Opus for reviews and plans you’d hate to rerun. Code review at a checkpoint, architecture sketching, post-mortems. Worth the cost when retrying is expensive.

AgentDefinition exposes twelve fields (description, prompt, tools, disallowedTools, model, skills, memory, mcpServers, maxTurns, background, effort, permissionMode). Most of the time you are tuning four. The rest are there when you need them.

Parallel or sequential, pick the wiring, not the agents

Subagent work falls into two wiring shapes.

Fan-out, fan-in (parallel). The parent spawns multiple subagents at once, each on an independent slice, collects their final messages, stitches them. Good fit: “find references to this deprecated API across these four services.” Four parallel subagents, one aggregation step.

Pipeline (sequential). The parent runs subagent A, reads its output, shapes the prompt for subagent B, runs B. Good fit: repo-explorer finds files, reviewer reads only those files, test runner runs only the relevant tests. Each subagent gets a narrower prompt than a generalist would.

Subagents do not see each other’s context in either shape. If step B needs something step A produced, the parent shuttles it across. If you need subagents to converse in real time, you are past what subagents can do.

When you’ve outgrown subagents, the agent-teams jump

The tell that you need agent teams, not deeper subagent nesting, is a coordination requirement. Teammates need to react to each other’s partial output, share a task list, or run a QA loop where one finds problems and another fixes them.

	Subagents	Agent teams
Communication	Parent only	Peer-to-peer
State	Parent’s single session	Shared task list, separate sessions
Shape	Fan-out or pipeline	Coordinated, iterative
Scale	3 to 5 roles	Larger rosters, long-running

Most coding work does not need agent teams. Subagents cover fan-out, pipelines, isolated reviewers, specialist scouts. The moment you catch yourself wanting a subagent to “ask the other subagent” something, that is the jump.

The Anthropic docs are explicit that subagents do not spawn other subagents. Don’t include Agent in a subagent’s tools list. Nested orchestration wants agent teams.

Failure modes you should be watching for

Subagents make some failure modes structurally unlikely and others more likely. Four matter here.

Context degradation (helped). The one subagents fix by design. The main conversation stays clean because intermediate reads and tool calls live inside the subagent. If your subagent is dumping 20k tokens of scratch reasoning back into the parent, you’ve negated the reason to use one.

Cascading failure (watch carefully). The subagent returns a wrong result. The parent treats the final message as ground truth. The next decision compounds the wrongness. Mitigation: add a verification step between subagent output and any decision that depends on it.

Silent failure (the hard one). The final message looks right. It reads right, summarises convincingly, uses the vocabulary of the thing it was supposed to do. It is wrong on the function, not the form. Detection requires something outside self-reporting: an external test, a ground-truth check, a second subagent reviewing the first against a known answer. Reading the summary and agreeing with it is not a check.

Tool-selection errors (avoidable at design time). The subagent reaches for a wrong tool, or worse, does something destructive because the wrong tool was in scope. Usually a design-time failure: the tool list was too wide.

The trade: subagents swap one easy-to-see failure mode (parent context bloat) for two harder-to-see ones. Worth it, but design for it.

How many subagents is too many

The practical ceiling is whatever keeps description-based routing reliable. For most coding workflows that is three to five specialised subagents plus the built-in general-purpose. Past that, the router starts picking wrong, not from a hard limit but because descriptions start competing for the same trigger conditions.

If you feel the pull toward a twelfth subagent, ask two questions. Can two of them collapse into one broader role? Is the work shape forcing you up to agent teams? The answer is almost always one of those.

What this looks like in practice

A three-subagent workflow I’ve been running, as a concrete shape:

# repo-explorer
description: "Surveys the codebase for files matching a pattern. Returns a ranked list of candidates with one-line rationales. Read-only."
tools: [Read, Grep, Glob]
model: claude-haiku-4-5
prompt: "You survey codebases. Given a query, return a ranked list of files with one-line rationales. No edits. No opinions on what to do next."

# test-runner
description: "Runs the test suite or a subset, returns pass/fail plus relevant output. Does not edit source files."
tools: [Bash, Read, Grep]
model: claude-sonnet-4-6
prompt: "You run tests and report results. You do not modify source. You do not fix failures. You surface what failed and where."

# code-reviewer
description: "Reviews code changes for correctness, concurrency issues, and security before commits. Use proactively on feature branches."
tools: [Read, Grep, Glob]
model: claude-opus-4-7
prompt: "You review code for correctness first, then security, then style. You return blockers separately from nits."

Three subagents, three Core Four choices. Haiku for cheap grep. Sonnet for tests. Opus for the review where being wrong is expensive. Three well-scoped subagents beat twelve half-scoped ones, because the router keeps working.

Frequently asked questions

Q1. When should I use a subagent instead of letting the main agent handle the task? Use a subagent when the side task would flood your main conversation with content you will not reference again. The concrete threshold: the work touches 10+ files, contains 3+ independent pieces you do not need to interleave, or needs an isolated perspective. Below that bar, inline is cheaper and faster.

Q2. What’s the difference between a subagent and an agent team? Subagents return results to the main agent only; agent teams share a task list and message each other directly. Subagents live inside one Claude Code session; agent teams coordinate across separate sessions. Choose subagents for sequential pipelines or parallel fan-out. Choose agent teams when teammates must react to each other.

Q3. How should I write the description field so Claude routes correctly? Describe the exact situation that should trigger delegation, not the agent’s identity. “Reviews code for security vulnerabilities before commits, especially on auth or payment paths” routes better than “security expert.” Add “use proactively” if Claude should reach for the agent without being asked. Vague descriptions produce missed or wrong delegations.

Q4. Can a subagent spawn another subagent? No. This is a hard constraint. The Anthropic SDK documentation explicitly says not to include Agent in a subagent’s tools array. If your work needs nested orchestration, you want agent teams, not deeper subagent nesting. This boundary is a feature. It keeps the delegation graph legible.

Q5. How many subagents should I define? Practical ceiling is 3 to 5 specialized subagents plus the built-in general-purpose. Beyond that, Claude’s description-based router degrades because similar-sounding agents compete for the same tasks. If you feel the need for more, ask whether two of your agents collapse into one, or whether the work shape demands agent teams instead.

Q6. Should a subagent use a different model than the main agent? Often yes. Haiku handles cheap fan-out work like file triage and log scanning; Sonnet is the sensible default for most roles; Opus earns its cost on reviews or plans you’d hate to rerun. Set it per-agent via the model field on AgentDefinition. The parent’s model is not the ceiling.

Q7. How do I debug a subagent that isn’t being invoked? Check the description field first. It is the router. If the description is vague or overlaps another agent’s, Claude will skip it or pick the wrong one. Invoke explicitly by name (“Use the code-reviewer agent to…”) to confirm the agent itself works. If explicit invocation succeeds, the bug is the description.

Where to go from here

Three places to go next inside this cluster:

Tool design: skills versus tools versus hooks. Choosing skills, tools, and hooks for a subagent’s tool list is its own design problem.
Model choice per subagent (coming soon). Picking Opus, Sonnet, or Haiku per role, with the latency and cost tradeoffs made explicit.
Evals that catch subagent silent failures (coming soon). Writing evals that catch the silent-failure mode this article flagged, before they ship.

The founding pillar on the agent harness frames the four controls this article extends per subagent.