I recently asked an AI assistant to review a blog post and “use three subagents”. Instead of running one big review, it split the request into three independent subagents: one focused on technical accuracy, one on writing quality, and one on search optimization. Each agent received its own tailored instructions and worked in parallel, returning three separate reports that covered ground a single pass would have missed.
And all I had asked for was three subagents. Nothing else.
This pattern – one orchestrator spawning multiple subagents – might look like unnecessary complexity, but the reasons it works go beyond “more agents means more coverage.” There are structural advantages baked into how Large Language Models (LLMs) operate that make it genuinely effective. If you’ve been wondering why these tools bother splitting work up instead of just pushing a single agent harder, this post is for you.
Each subagent starts with a clean slate
When you have a long conversation with an AI assistant, every message you’ve exchanged becomes part of the context – the text the model has to process before generating a response. That context has a hard limit called the context window, and I’ve written about why that constraint matters for focused agents. The short version: as the context grows, the model’s ability to track relevant details degrades. Important information gets diluted by older messages, abandoned approaches, and historical “thinking” details.
Subagents sidestep this problem entirely. When an orchestrator agent spawns a subagent, it doesn’t copy the full conversation history. Instead, it writes a fresh, purpose-built prompt containing only what the subagent needs to do its job. A 30-turn conversation about debugging, planning, and iterating on a design gets distilled into a single focused instruction like “review this function for security vulnerabilities, including injection risks and improper authentication patterns.”
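To make that concrete, here’s a minimal sketch of the handoff in Python. Everything in it is hypothetical – `build_subagent_prompt` and the example facts stand in for whatever your orchestration framework actually does:

```python
# A sketch of the "fresh context" handoff. The subagent never receives the
# 30-turn history; it gets only the purpose-built prompt assembled here.

def build_subagent_prompt(task: str, relevant_facts: list[str]) -> str:
    """Distill a long session into the minimum the subagent needs."""
    facts = "\n".join(f"- {fact}" for fact in relevant_facts)
    return (
        f"{task}\n\n"
        "Relevant context (everything else from the session is omitted):\n"
        f"{facts}"
    )

# The orchestrator chooses what to carry over -- a lossy compression step.
prompt = build_subagent_prompt(
    task=(
        "Review this function for security vulnerabilities, including "
        "injection risks and improper authentication patterns."
    ),
    relevant_facts=[
        "The codebase targets Python 3.12.",         # hypothetical detail
        "User input arrives as raw query strings.",  # hypothetical detail
    ],
)
print(prompt)  # short and focused, instead of the full conversation
```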
The subagent never sees the dead ends you explored, the questions you asked along the way, or the unrelated tasks you handled earlier in the session. It doesn’t get biased by earlier reasoning. Its context is clean, focused, and short – which means more of the model’s attention budget (how much it can effectively focus on relevant details) goes toward the actual task. In return, the agent that requested the subagent may receive only a short summary to fold into its own context, rather than every thought that produced it.
There’s a tradeoff here. The orchestrator has to decide what to include and what to leave out, and that’s a lossy compression step (some details are dropped). If it drops a detail that turns out to matter – a constraint you mentioned casually ten messages ago or a naming convention the team uses – the subagent won’t know about it. But in practice, this focused context produces noticeably sharper results than asking the same model to juggle everything at once.
Non-determinism gives you diversity for free
Here’s something that might be counterintuitive: if you send the exact same prompt to three separate subagent instances, you’ll get three different answers. Not slightly different – meaningfully different. They’ll notice different things, emphasize different concerns, and structure their responses in distinct ways.
This happens because LLMs are fundamentally non-deterministic. In plain language, that means the same prompt can produce different outputs across different runs. When a model generates text, it doesn’t always pick the single most likely next word. Instead, it samples from a probability distribution shaped by parameters like temperature, Top-K, and Top-P. Each time the model makes a sampling decision, small variations cascade through the rest of the response, so two runs of the exact same prompt diverge. It’s like rolling dice – the same process, but a different outcome each time.
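Here’s a toy Python sketch of that sampling step. The five tokens and their scores are invented for illustration – real models sample over vocabularies of tens of thousands of tokens:

```python
# A toy illustration of why identical prompts diverge: sampling, not argmax.
import numpy as np

rng = np.random.default_rng()
tokens = ["secure", "fast", "readable", "simple", "tested"]
logits = np.array([2.0, 1.6, 1.5, 0.9, 0.4])  # made-up scores for the next token

def sample_next(logits, temperature=0.8, top_k=3):
    scaled = logits / temperature        # temperature reshapes the distribution
    keep = np.argsort(scaled)[-top_k:]   # Top-K: only the k likeliest survive
    probs = np.exp(scaled[keep]) / np.exp(scaled[keep]).sum()
    return tokens[rng.choice(keep, p=probs)]

# Three "runs" of the same prompt can pick different next words, and each
# early choice cascades into a different rest-of-response.
print([sample_next(logits) for _ in range(3)])  # e.g. ['fast', 'secure', 'secure']
```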
Multi-agent systems exploit this property deliberately. Instead of generating one answer and hoping it’s comprehensive, you generate three and compare. One security reviewer might focus on input validation while another zeroes in on authentication flows. One writing critic catches a tone problem that the other two miss. You’re effectively getting a panel of reviewers instead of a single opinion – and it usually costs little extra in prompt engineering, because the diversity emerges naturally from how the model works.
The important caveat is that non-determinism isn’t the same as independent reasoning. All three agents share the same training data and the same learned biases. If the model has a systematic blind spot (a type of mistake it tends to miss) – say, it consistently overlooks a particular class of race condition – all three subagents are likely to miss it too. Diversity of output doesn’t guarantee diversity of understanding. But it does increase coverage in a way that a single pass cannot.
Of course, that means there can also be an advantage to running the subagents on different models. Since models are trained differently and respond in different ways, the agents will bring even more variety to how they approach the issue. This technique is especially powerful when reviewing generated code.
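Here’s a minimal sketch of that fan-out – the panel of reviewers from a couple of paragraphs back, with the different-models variation included. The `complete` function and the model names are hypothetical placeholders for your provider’s API:

```python
# Fan out the same review prompt to three subagents, optionally on
# different models, and collect the independent reports.
from concurrent.futures import ThreadPoolExecutor

def complete(prompt: str, model: str) -> str:
    # Placeholder: swap in a real API call to your provider here.
    return f"[{model}] review of: {prompt[:40]}..."

REVIEW_PROMPT = "Review this function for security vulnerabilities."
MODELS = ["model-a", "model-b", "model-c"]  # hypothetical model names

with ThreadPoolExecutor(max_workers=len(MODELS)) as pool:
    reviews = list(pool.map(lambda m: complete(REVIEW_PROMPT, m), MODELS))

# Even with one model listed three times, non-determinism alone yields
# meaningfully different reviews; distinct models add training-level diversity.
for review in reviews:
    print(review)
```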
Orchestrators naturally sharpen the task
When you ask an AI assistant to “review my blog post,” that’s a deliberately broad request. You probably mean something like “check for technical errors, awkward phrasing, missing context, and anything that would embarrass me if I published it.” But you didn’t say all of that, and a single agent trying to cover every dimension at once will inevitably spread its attention thin – which means fewer (and less useful) findings.
When an orchestrator decomposes that request into subagent tasks, something interesting happens: the act of writing instructions forces it to be specific. It can’t just pass along “review the post” – it has to decide what each subagent is responsible for. It then builds each subagent with a specific set of instructions that are more precise and more targeted than what you originally asked for.
Here’s a real example from my own experience. I asked an AI assistant to “create three subagents to each review my latest post.” The orchestrator produced three distinct instruction sets with unique criteria to consider:
- The technical accuracy agent received a very detailed checklist: verifying feature availability, confirming specific product and code behaviors, evaluating sample code, checking that the syntax is accurate, and weighing best practices.
- The writing quality agent was told to check heading capitalization against APA sentence-case rules, verify em-dash formatting, count words for reading time, and flag any list formatting violations. It even received a list of grammar and tone best practices.
- The SEO agent was instructed to validate my front matter against constraints, confirm categories and tags were appropriate, and verify any links actually exist and are relevant.
None of that specificity existed in my original request. The orchestrator inferred it from the content of the post and the structure of the repository. Each subagent ended up doing a more thorough job on its slice than a single general-purpose reviewer would have done across the whole surface.
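If you picture the orchestrator’s decomposition as data, it might look something like the sketch below. The structure is hypothetical, and the instruction strings paraphrase what my three subagents actually received:

```python
# One broad request decomposed into three narrow, specific instruction sets.
subagents = {
    "technical-accuracy": (
        "Verify feature availability, confirm product and code behaviors, "
        "evaluate sample code, check syntax, and weigh best practices."
    ),
    "writing-quality": (
        "Check headings against APA sentence case, verify em-dash "
        "formatting, count words for reading time, and flag list "
        "formatting violations."
    ),
    "seo": (
        "Validate front matter against constraints, confirm categories and "
        "tags, and verify that links exist and are relevant."
    ),
}

for name, instructions in subagents.items():
    # spawn_subagent(name, instructions)  # hypothetical framework call
    print(f"{name}: {instructions}")
```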
The flip side is that the orchestrator’s decomposition can reveal its blind spots. If it doesn’t think to create a subagent for accessibility concerns or for checking code samples against current API versions, that dimension simply doesn’t get covered. The narrowing effect makes each subagent sharper, but it can also leave gaps in the overall review.
Adversarial agents remove the self-review bias
There’s one more pattern worth mentioning, and it addresses a well-known weakness in LLM-generated content: self-review bias. When you ask an AI agent to generate something and then ask that same agent to critique its own output, the results tend to be generous. The model has an inherent tendency to be charitable about text it just produced – it’s more likely to say “this looks good” than to find the subtle problems a fresh pair of eyes would catch. Part of the problem is that the model can still see the reasoning behind its own choices, so it’s naturally inclined to accept that logic as correct.
Adversarial subagents solve this by separating generation from critique. You spin up a second agent whose entire job is to find problems. Its instructions explicitly say things like “assume the plan and implementation are wrong,” “find everything wrong,” or “act as if you’re reviewing code from an untrusted contributor.” Because this agent has no prior investment in the output – it didn’t write it, it has no context about the decisions that went into it – it approaches the review without the bias that comes from ownership.
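A minimal sketch of that separation, again with a hypothetical `complete` function standing in for a real completion call: the critic receives the draft plus an adversarial framing, and nothing else:

```python
# Separate generation from critique: the critic never sees the generator's
# reasoning, so it has no investment in defending the output.

def complete(prompt: str) -> str:
    return "stub response"  # placeholder for a real API call

CRITIC_FRAMING = (
    "You are reviewing code from an untrusted contributor. Assume the plan "
    "and the implementation are wrong until proven otherwise, and report "
    "everything you find."
)

draft = complete("Write a function that validates session tokens.")
critique = complete(f"{CRITIC_FRAMING}\n\nCode under review:\n{draft}")

# Note what the critic does NOT receive: the generator's rationale, the
# conversation history, or any framing that invites justifying prior choices.
```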
This matters most for tasks where quality depends on honest feedback: code reviews, security audits, architectural decisions, and technical writing. A single agent asked to “write this function and then check it for bugs” will reliably miss issues that a separate adversarial agent catches, because the critic’s framing shifts its attention toward flaws rather than toward justifying prior choices.
One word of caution: adversarial framing shifts bias rather than eliminating it. An agent told to find problems will try to discover problems, including marginal ones that have limited value and waste tokens. Calibrating the instructions – “identify high-confidence security vulnerabilities” rather than “find everything wrong” – gives the critic a more actionable target.
Putting it together
The advantages of multiple subagents stack on top of each other. Each subagent starts with a clean, focused context instead of inheriting the full weight of a long conversation. Non-determinism means agents with identical instructions will produce meaningfully different results, giving you coverage that a single pass can’t match. Even better, the orchestrator’s decomposition process naturally produces more precise, targeted instructions than you wrote in your original request. And when you frame agents as adversarial critics, you counter the self-review bias that undermines single-agent quality checks.
The practical takeaway is straightforward. When a task has multiple distinct quality dimensions – correctness, style, security, performance, architecture – decompose it into focused subagents rather than asking one agent or subagent to examine everything at once. You’ll get sharper feedback, more diverse perspectives, and a review process that is less biased.
