The hidden complexity tax: The problems with generalist multi-agent frameworks
Building one AI agent is easy. Running dozens together reliably in production is where most teams hit a wall. This piece covers the architectural problems that emerge when you move from prototype to production: context fragmentation, hallucination propagation, audit complexity, access control failures, inadequate governance, and scaling bottlenecks.
- Context fragmentation: Agents operate in isolation, making decisions on incomplete information
- Hallucination propagation: Fabricated data spreads across agents and becomes ground truth
- Audit complexity: Tracing decisions across agent interactions becomes exponentially harder
- Access control failures: Preventing data leaks across concurrent agent sessions
- Inadequate governance: Missing the chain of custody needed for regulated environments
- Scaling bottlenecks: Unpredictable workloads that break traditional scaling strategies
These aren't hypotheticals. They're the technical debt most multi-agent frameworks pass on to you.
The context crisis: When agents can't see what matters
Models need context to function. Multi-agent systems fragment it across execution boundaries.
Split a clinical task across three agents (labs review, medication checks, appointment coordination), and each operates in isolation. Agent A flags elevated creatinine. Agent B approves a nephrotoxic drug because kidney issues aren't in its context window. Agent C schedules a routine follow-up. The output looks coherent but is clinically wrong.
The failure modes:
- Cascading misinterpretation. Agents make locally reasonable decisions based on incomplete information. The errors compound downstream.
- Implicit decision conflicts. Agent A assumes aggressive intervention. Agent B assumes conservative management. Their outputs conflict in subtle ways that slip past review.
- Lost nuance in delegation. Orchestrators can't communicate every relevant detail, guideline, or protocol. Agents duplicate work, leave gaps, or miss critical information.
Context fragmentation isn't a tuning problem. It's an architectural limitation of decomposing tasks across isolated agents that execute on their own context.
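To make the failure concrete, here's a minimal sketch (PatientRecord, review_labs, and check_medication are illustrative names, not a real framework API): each agent receives only its slice of the record, so the medication check never sees the creatinine flag.

```python
# Sketch: context fragmentation across task-specific agents.
from dataclasses import dataclass

@dataclass
class PatientRecord:
    patient_id: str
    labs: dict          # e.g. {"creatinine_mg_dl": 2.1}
    medications: list   # current prescriptions

def review_labs(labs: dict) -> list:
    """Agent A sees only labs. It correctly flags elevated creatinine."""
    return ["elevated_creatinine"] if labs.get("creatinine_mg_dl", 0) > 1.3 else []

def check_medication(medications: list, proposed_drug: str) -> bool:
    """Agent B sees only the medication list. With no lab context,
    a nephrotoxic drug passes its local checks."""
    return proposed_drug not in medications  # no duplicate, so "approve"

record = PatientRecord("p-001", {"creatinine_mg_dl": 2.1}, ["lisinopril"])

# The orchestrator decomposes the task and hands each agent a fragment.
lab_flags = review_labs(record.labs)                          # ['elevated_creatinine']
approved = check_medication(record.medications, "ibuprofen")  # True

# Both decisions are locally reasonable. Globally, the system approved
# an NSAID for a patient with kidney impairment: Agent A's flag never
# entered Agent B's context.
print(lab_flags, approved)
```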
The hallucination amplifier: When non-determinism meets scale
LLMs hallucinate. Karpathy calls them “dream machines”. Yann LeCun describes how errors in a single LLM can compound exponentially when the model relies on its own outputs over multiple steps.
In multi-agent systems, hallucinations don't just occur. They propagate.
Imagine the following scenario. An agent generates a plausible patient ID instead of retrieving the correct one. Downstream agents consume that data. They have no mechanism to detect the fabrication. The hallucinated ID becomes ground truth for subsequent actions.
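A toy version of the propagation path (the function names are hypothetical): nothing between the generating agent and the tool call ever checks that the identifier exists.

```python
# Sketch: a fabricated identifier becomes ground truth downstream.
KNOWN_PATIENTS = {"p-001", "p-002"}

def agent_a_plan(query: str) -> dict:
    # The model was asked to retrieve a patient ID but generates a
    # plausible-looking one instead (the hallucination).
    return {"patient_id": "p-417", "action": "schedule_follow_up"}

def schedule_follow_up(patient_id: str) -> str:
    # The tool acts on whatever identifier it receives.
    return f"follow-up booked for {patient_id}"

def agent_b_execute(task: dict) -> str:
    # The downstream agent trusts its input. No validation sits between
    # agents, so the fabricated ID flows straight into a tool call.
    return schedule_follow_up(task["patient_id"])

task = agent_a_plan("book a follow-up for the patient discussed above")
print(agent_b_execute(task))  # books a visit for a patient that doesn't exist
assert task["patient_id"] not in KNOWN_PATIENTS  # the check nobody ran
```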
Why hallucination gets worse at scale:
- Shared models spread errors. One agent's hallucination becomes input for another. Errors cascade through the system.
- Testing complexity explodes. You need to validate not just each agent's behavior but all possible interaction patterns.
- Edge cases trigger cascades. The system appears to work in testing, then fails in production when an unusual input triggers a chain reaction.
Strong data governance helps, but the root problem is architectural. Multi-agent systems amplify the consequences of non-deterministic outputs.
The audit trail that disappears
Single-agent failures are straightforward to debug. Review the trace, find the error, fix it.
Multi-agent failures require forensics. Which agent made the decision? What information was available? How did agent interactions influence the outcome? The complexity grows combinatorially with agent count and interaction paths.
What you need to answer in regulated environments:
- Who approved the action?
- When was it approved?
- What information was available at that moment?
- Under what permissions?
- Which agents were involved in the decision?
- How did their interactions influence the outcome?
System logs don't answer these questions. You need instrumentation that captures the full decision chain across all agents, for all users, in all contexts.
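One way to make those questions answerable is to emit a structured provenance record for every agent decision. A minimal sketch, with illustrative field names:

```python
# Sketch: a per-decision provenance record. The point is that each
# audit question maps to a captured field.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class DecisionRecord:
    decision_id: str
    agent_id: str             # which agent made the decision
    approved_by: str          # who approved the action (human or policy)
    approved_at: datetime     # when it was approved
    context_snapshot: dict    # what information was available at that moment
    permissions: list         # under what permissions it executed
    upstream_decisions: list = field(default_factory=list)
    # upstream_decisions links records into a chain, so you can replay
    # how agent interactions influenced the outcome.

record = DecisionRecord(
    decision_id="d-102",
    agent_id="medication-agent",
    approved_by="dr.smith",
    approved_at=datetime.now(timezone.utc),
    context_snapshot={"labs_seen": ["creatinine"], "guideline": "renal-dosing-v2"},
    permissions=["rx:write"],
    upstream_decisions=["d-101"],  # the lab-review decision that fed this one
)
```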
Without proper architecture, you can't trace how decisions emerged from agent interactions. This isn't just an engineering problem. It's a compliance requirement.
The access control nightmare
A powerful agent serves multiple concurrent users. How do you prevent it from accessing User 2's data while processing User 1's request? The naive approach: trust the LLM to stay in bounds. This fails. An agent that hallucinates a patient ID doesn't know it crossed a boundary. It executes the tool call with the plausible-looking identifier it generated.
Multi-agent systems compound the problem:
- Each agent needs its own access control context
- Each inter-agent communication needs validation
- Each tool call needs parameter verification
- Hallucinated identifiers bypass intent-based security
You can't rely on stochastic model outputs to enforce deterministic security boundaries.
The solution requires programmatic guardrails (see the sketch after this list):
- Hard-coded validation that strips hallucinated IDs
- Explicit permission checks at every boundary
- Authenticated identifier injection through deterministic code (not model-generated parameters)
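A minimal sketch of these guardrails, assuming a per-session ToolContext established at authentication time (ToolContext and execute_tool are illustrative names):

```python
# Sketch: deterministic guardrails around a tool call.
FORBIDDEN_PARAMS = {"patient_id", "user_id"}  # identifiers the model must never supply

class ToolContext:
    """Authenticated, per-session state established outside the model."""
    def __init__(self, user_id: str, patient_id: str, scopes: set):
        self.user_id = user_id
        self.patient_id = patient_id
        self.scopes = scopes

def execute_tool(ctx: ToolContext, tool_name: str, required_scope: str,
                 model_params: dict) -> dict:
    # 1. Explicit permission check at the boundary, not left to the LLM.
    if required_scope not in ctx.scopes:
        raise PermissionError(f"{ctx.user_id} lacks scope {required_scope}")

    # 2. Strip any identifier the model generated. A hallucinated but
    #    plausible-looking patient ID dies here, before the tool runs.
    safe_params = {k: v for k, v in model_params.items()
                   if k not in FORBIDDEN_PARAMS}

    # 3. Inject the authenticated identifier from deterministic code.
    safe_params["patient_id"] = ctx.patient_id
    return {"tool": tool_name, "params": safe_params}

ctx = ToolContext("dr.smith", "p-001", {"labs:read"})
call = execute_tool(ctx, "fetch_labs", "labs:read",
                    {"patient_id": "p-417", "panel": "renal"})  # model-supplied
print(call)  # hallucinated p-417 stripped; authenticated p-001 injected
```

The key design choice: the security boundary lives in code that executes the same way every time, and the model's output only ever fills non-sensitive parameters.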
Implementing this across a mesh of interacting agents requires infrastructure most teams underestimate.
The human-in-the-loop that isn't
Most frameworks offer simple approval mechanisms. Before executing a sensitive action, pause and ask for confirmation.
This satisfies the feature requirement but misses the point. When an agent requests approval to order medication, the human needs context: why this drug, what alternatives were considered, what data informed the decision, which agent made the recommendation, what its confidence level was, who granted permissions.
What's missing from simple approval dialogs:
- Decision rationale across multiple agents
- Alternatives that were considered and rejected
- Data sources that informed the recommendation
- Confidence levels and uncertainty
- Which specific agent made each contribution
A checkbox doesn't provide clinical decision support. In multi-agent systems where recommendations emerge from interactions between specialized agents, reconstructing decision rationale gets harder.
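A sketch of what a context-rich approval request could carry instead, with hypothetical field names:

```python
# Sketch: an approval request that carries decision context instead of
# a bare yes/no.
from dataclasses import dataclass

@dataclass
class AgentContribution:
    agent_id: str
    role: str           # what this agent contributed to the recommendation
    confidence: float   # model-reported or calibrated confidence

@dataclass
class ApprovalRequest:
    action: str                  # the sensitive action awaiting approval
    rationale: str               # why this recommendation
    alternatives_rejected: list  # what was considered and why not
    data_sources: list           # what data informed the decision
    contributions: list          # which agent made each contribution
    permissions_granted_by: str  # who granted the permissions in play

request = ApprovalRequest(
    action="order acetaminophen 500mg",
    rationale="NSAIDs contraindicated: lab review flagged elevated creatinine",
    alternatives_rejected=["ibuprofen (nephrotoxic)", "naproxen (nephrotoxic)"],
    data_sources=["labs/2024-05-01", "med-list/current"],
    contributions=[
        AgentContribution("lab-agent", "flagged renal impairment", 0.92),
        AgentContribution("med-agent", "proposed alternative analgesic", 0.81),
    ],
    permissions_granted_by="dr.smith",
)
```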
The governance challenge: maintaining a complete, auditable chain of custody for every data access, decision, and action across all agents, users, and contexts. Most frameworks treat this as an afterthought. In healthcare, it's baseline.
The scalability illusion
Multi-agent architectures promise elegant scaling. Need more capacity? Spin up more agents.
The reality: you're scaling model inference, agent orchestration, inter-agent communication, context synchronization, and coordination overhead. More agents mean higher computational demands for communication and coordination.
The scaling challenges:
- Three-layer bottlenecks. Production systems must independently scale LLM inference, MCP tool servers, and orchestration backends. Each has different resource profiles. Tight coupling creates cascading failures where a slow tool server blocks agent execution, exhausts connection pools, stalls the system.
- Unpredictable workload patterns. One query might spawn three agent calls. Another might spawn thirty. Auto-scaling designed for predictable queue depths fails when your workload is a complex graph of interdependent agent tasks.
- Concurrency explosion. Multiple users running multi-agent workflows means N parallel workflows, each spawning M agents, each making P tool calls. The combinatorial explosion (N×M×P) overwhelms systems that handle traditional API traffic without issue.
Agents also struggle to judge appropriate effort for different tasks. Resource consumption becomes unpredictable. Traditional scaling strategies break down when your workload is non-deterministic by design.
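One common mitigation is bulkhead-style isolation: give inference, tool execution, and orchestration separate concurrency budgets so a slow layer can't exhaust shared resources. A minimal asyncio sketch (Python 3.10+) with illustrative limits:

```python
# Sketch: per-layer concurrency bulkheads with asyncio. Limits and names
# are illustrative; each pool would scale independently in production.
import asyncio

# Separate budgets per layer: a stalled tool server saturates only its
# own semaphore instead of the system-wide connection pool.
INFERENCE_SLOTS = asyncio.Semaphore(8)
TOOL_SLOTS = asyncio.Semaphore(32)
ORCHESTRATION_SLOTS = asyncio.Semaphore(100)

async def call_llm(prompt: str) -> str:
    async with INFERENCE_SLOTS:
        await asyncio.sleep(0.05)   # stand-in for model inference
        return f"plan for: {prompt}"

async def call_tool(name: str) -> str:
    async with TOOL_SLOTS:
        await asyncio.sleep(0.01)   # stand-in for an MCP tool call
        return f"{name}: ok"

async def run_workflow(query: str) -> list:
    async with ORCHESTRATION_SLOTS:
        plan = await call_llm(query)
        # One query may fan out to 3 tool calls or 30; the bulkheads
        # bound the blast radius either way.
        return await asyncio.gather(*(call_tool(f"tool-{i}") for i in range(3)))

async def main():
    # N workflows x M agents x P tool calls, bounded per layer.
    results = await asyncio.gather(*(run_workflow(f"q{i}") for i in range(20)))
    print(len(results), "workflows completed")

asyncio.run(main())
```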
What this means for production systems
Multi-agent frameworks come with an architecture tax: complexity costs paid in engineering time, infrastructure investment, and ongoing maintenance. The gap between prototype and production remains wide.
The key questions for evaluation:
- How does the framework handle context? Does it maintain continuous context or fragment it across agents? Can you trace decision lineage across agent interactions?
- What guarantees exist around data access? Are permissions enforced programmatically or through LLM obedience? How are identifiers validated?
- How is governance implemented? Does the system provide parameter-level provenance? Can you audit who approved what action, with what information, in what context?
- What does scaling actually look like? What's the coordination overhead? How do inference costs scale with agent count? What happens under variable load?
These architectural decisions determine whether your multi-agent system becomes a production asset or a maintenance burden.
The path forward
Teams running agents in production at scale rely on foundational capabilities that solve these problems:
- Context management that maintains decision continuity across agent interactions
- Programmatic guardrails that enforce access control through deterministic code, preventing hallucinated IDs from reaching tool execution
- Parameter-level provenance that captures who approved what action, with what information, in what context
- Independent scaling across inference, tooling, and orchestration layers to handle unpredictable multi-agent workloads
- Testing primitives designed for validating emergent multi-agent behavior
This infrastructure represents significant engineering investment. The decision for healthtech teams: spend that investment solving fundamental infrastructure problems, or spend it building the clinical workflows and features that differentiate your product.
The agent infrastructure platforms that succeed will solve these problems comprehensively, letting application teams focus on clinical innovation rather than reimplementing orchestration, governance, and observability for each new agent-powered feature.