The hidden complexity tax: The problems with generalist multi-agent frameworks
Building one AI agent is easy. Running dozens together reliably in production is where most teams hit a wall. This piece covers the architectural problems that emerge when you move from prototype to production: context fragmentation, hallucination propagation, audit complexity, access control failures, inadequate governance, and scaling bottlenecks.
- Context fragmentation: Agents operate in isolation, making decisions on incomplete information
- Hallucination propagation: Fabricated data spreads across agents and becomes ground truth
- Audit complexity: Tracing decisions across agent interactions becomes exponentially harder
- Access control failures: Preventing data leaks across concurrent agent sessions
- Inadequate governance: Missing the chain of custody needed for regulated environments
- Scaling bottlenecks: Unpredictable workloads that break traditional scaling strategies
These aren't hypotheticals. They're the technical debt most multi-agent frameworks pass on to you.
The context crisis: When agents can't see what matters
Models need context to function. Multi-agent systems fragment it across execution boundaries.
Split a clinical task across three agents (labs review, medication checks, appointment coordination), and each operates in isolation. Agent A flags elevated creatinine. Agent B approves a nephrotoxic drug because kidney issues aren't in its context window. Agent C schedules a routine follow-up. The output looks coherent but is clinically wrong.
The failure modes:
- Cascading misinterpretation. Agents make locally reasonable decisions based on incomplete information. The errors compound downstream.
- Implicit decision conflicts. Agent A assumes aggressive intervention. Agent B assumes conservative management. Their outputs conflict in subtle ways that slip past review.
- Lost nuance in delegation. Orchestrators can't communicate every relevant detail, guideline, or protocol. Agents duplicate work, leave gaps, or miss critical information.
Context fragmentation isn't a tuning problem. It's an architectural limitation of decomposing tasks across isolated agents that execute on their own context.
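To make the failure concrete, here's a minimal sketch (PatientRecord, review_labs, and check_medication are illustrative names, not a real framework API): each agent receives only its slice of the record, so the medication check never sees the creatinine flag.

```python
# Sketch: context fragmentation across task-specific agents.
from dataclasses import dataclass

@dataclass
class PatientRecord:
    patient_id: str
    labs: dict          # e.g. {"creatinine_mg_dl": 2.1}
    medications: list   # current prescriptions

def review_labs(labs: dict) -> list:
    """Agent A sees only labs. It correctly flags elevated creatinine."""
    return ["elevated_creatinine"] if labs.get("creatinine_mg_dl", 0) > 1.3 else []

def check_medication(medications: list, proposed_drug: str) -> bool:
    """Agent B sees only the medication list. With no lab context,
    a nephrotoxic drug passes its local checks."""
    return proposed_drug not in medications  # no duplicate, so "approve"

record = PatientRecord("p-001", {"creatinine_mg_dl": 2.1}, ["lisinopril"])

# The orchestrator decomposes the task and hands each agent a fragment.
lab_flags = review_labs(record.labs)                          # ['elevated_creatinine']
approved = check_medication(record.medications, "ibuprofen")  # True

# Both decisions are locally reasonable. Globally, the system approved
# an NSAID for a patient with kidney impairment: Agent A's flag never
# entered Agent B's context.
print(lab_flags, approved)
```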
The hallucination amplifier: When non-determinism meets scale
LLMs hallucinate. Karpathy calls them “dream machines”. Yann LeCun describes how errors in a single LLM can compound exponentially when the model relies on its own outputs over multiple steps.
In multi-agent systems, hallucinations don't just occur. They propagate.
Imagine the following scenario. An agent generates a plausible patient ID instead of retrieving the correct one. Downstream agents consume that data. They have no mechanism to detect the fabrication. The hallucinated ID becomes ground truth for subsequent actions.
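A toy version of the propagation path (the function names are hypothetical): nothing between the generating agent and the tool call ever checks that the identifier exists.

```python
# Sketch: a fabricated identifier becomes ground truth downstream.
KNOWN_PATIENTS = {"p-001", "p-002"}

def agent_a_plan(query: str) -> dict:
    # The model was asked to retrieve a patient ID but generates a
    # plausible-looking one instead (the hallucination).
    return {"patient_id": "p-417", "action": "schedule_follow_up"}

def schedule_follow_up(patient_id: str) -> str:
    # The tool acts on whatever identifier it receives.
    return f"follow-up booked for {patient_id}"

def agent_b_execute(task: dict) -> str:
    # The downstream agent trusts its input. No validation sits between
    # agents, so the fabricated ID flows straight into a tool call.
    return schedule_follow_up(task["patient_id"])

task = agent_a_plan("book a follow-up for the patient discussed above")
print(agent_b_execute(task))  # books a visit for a patient that doesn't exist
assert task["patient_id"] not in KNOWN_PATIENTS  # the check nobody ran
```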
Why hallucination gets worse at scale:
- Shared models spread errors. One agent's hallucination becomes input for another. Errors cascade through the system.
- Testing complexity explodes. You need to validate not just each agent's behavior but all possible interaction patterns.
- Edge cases trigger cascades. The system appears to work in testing, then fails in production when an unusual input triggers a chain reaction.
Strong data governance helps, but the root problem is architectural. Multi-agent systems amplify the consequences of non-deterministic outputs.
The audit trail that disappears
Single-agent failures are straightforward to debug. Review the trace, find the error, fix it.
Multi-agent failures require forensics. Which agent made the decision? What information was available? How did agent interactions influence the outcome? The complexity grows combinatorially with agent count and interaction paths.
What you need to answer in regulated environments:
- Who approved the action?
- When was it approved?
- What information was available at that moment?
- Under what permissions?
- Which agents were involved in the decision?
- How did their interactions influence the outcome?
System logs don't answer these questions. You need instrumentation that captures the full decision chain across all agents, for all users, in all contexts.
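One way to make those questions answerable is to emit a structured provenance record for every agent decision. A minimal sketch, with illustrative field names:

```python
# Sketch: a per-decision provenance record. The point is that each
# audit question maps to a captured field.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class DecisionRecord:
    decision_id: str
    agent_id: str             # which agent made the decision
    approved_by: str          # who approved the action (human or policy)
    approved_at: datetime     # when it was approved
    context_snapshot: dict    # what information was available at that moment
    permissions: list         # under what permissions it executed
    upstream_decisions: list = field(default_factory=list)
    # upstream_decisions links records into a chain, so you can replay
    # how agent interactions influenced the outcome.

record = DecisionRecord(
    decision_id="d-102",
    agent_id="medication-agent",
    approved_by="dr.smith",
    approved_at=datetime.now(timezone.utc),
    context_snapshot={"labs_seen": ["creatinine"], "guideline": "renal-dosing-v2"},
    permissions=["rx:write"],
    upstream_decisions=["d-101"],  # the lab-review decision that fed this one
)
```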
Without proper architecture, you can't trace how decisions emerged from agent interactions. This isn't just an engineering problem. It's a compliance requirement.
The access control nightmare
A powerful agent serves multiple concurrent users. How do you prevent it from accessing User 2's data while processing User 1's request? The naive approach: trust the LLM to stay in bounds. This fails. An agent that hallucinates a patient ID doesn't know it crossed a boundary. It executes the tool call with the plausible-looking identifier it generated.
Multi-agent systems compound the problem:
- Each agent needs its own access control context
- Each inter-agent communication needs validation
- Each tool call needs parameter verification
- Hallucinated identifiers bypass intent-based security
You can't rely on stochastic model outputs to enforce deterministic security boundaries.
The solution requires programmatic guardrails (see the sketch after this list):
- Hard-coded validation that strips hallucinated IDs
- Explicit permission checks at every boundary
- Authenticated identifier injection through deterministic code (not model-generated parameters)
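A minimal sketch of these guardrails, assuming a per-session ToolContext established at authentication time (ToolContext and execute_tool are illustrative names):

```python
# Sketch: deterministic guardrails around a tool call.
FORBIDDEN_PARAMS = {"patient_id", "user_id"}  # identifiers the model must never supply

class ToolContext:
    """Authenticated, per-session state established outside the model."""
    def __init__(self, user_id: str, patient_id: str, scopes: set):
        self.user_id = user_id
        self.patient_id = patient_id
        self.scopes = scopes

def execute_tool(ctx: ToolContext, tool_name: str, required_scope: str,
                 model_params: dict) -> dict:
    # 1. Explicit permission check at the boundary, not left to the LLM.
    if required_scope not in ctx.scopes:
        raise PermissionError(f"{ctx.user_id} lacks scope {required_scope}")

    # 2. Strip any identifier the model generated. A hallucinated but
    #    plausible-looking patient ID dies here, before the tool runs.
    safe_params = {k: v for k, v in model_params.items()
                   if k not in FORBIDDEN_PARAMS}

    # 3. Inject the authenticated identifier from deterministic code.
    safe_params["patient_id"] = ctx.patient_id
    return {"tool": tool_name, "params": safe_params}

ctx = ToolContext("dr.smith", "p-001", {"labs:read"})
call = execute_tool(ctx, "fetch_labs", "labs:read",
                    {"patient_id": "p-417", "panel": "renal"})  # model-supplied
print(call)  # hallucinated p-417 stripped; authenticated p-001 injected
```

The key design choice: the security boundary lives in code that executes the same way every time, and the model's output only ever fills non-sensitive parameters.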
Implementing this across a mesh of interacting agents requires infrastructure most teams underestimate.
The human-in-the-loop that isn't
Most frameworks offer simple approval mechanisms. Before executing a sensitive action, pause and ask for confirmation.
This satisfies the feature requirement but misses the point. When an agent requests approval to order medication, the human needs context: why this drug, what alternatives were considered, what data informed the decision, which agent made the recommendation, what its confidence level was, who granted permissions.
What's missing from simple approval dialogs:
- Decision rationale across multiple agents
- Alternatives that were considered and rejected
- Data sources that informed the recommendation
- Confidence levels and uncertainty
- Which specific agent made each contribution
A checkbox doesn't provide clinical decision support. In multi-agent systems where recommendations emerge from interactions between specialized agents, reconstructing decision rationale gets harder.
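A sketch of what a context-rich approval request could carry instead, with hypothetical field names:

```python
# Sketch: an approval request that carries decision context instead of
# a bare yes/no.
from dataclasses import dataclass

@dataclass
class AgentContribution:
    agent_id: str
    role: str           # what this agent contributed to the recommendation
    confidence: float   # model-reported or calibrated confidence

@dataclass
class ApprovalRequest:
    action: str                  # the sensitive action awaiting approval
    rationale: str               # why this recommendation
    alternatives_rejected: list  # what was considered and why not
    data_sources: list           # what data informed the decision
    contributions: list          # which agent made each contribution
    permissions_granted_by: str  # who granted the permissions in play

request = ApprovalRequest(
    action="order acetaminophen 500mg",
    rationale="NSAIDs contraindicated: lab review flagged elevated creatinine",
    alternatives_rejected=["ibuprofen (nephrotoxic)", "naproxen (nephrotoxic)"],
    data_sources=["labs/2024-05-01", "med-list/current"],
    contributions=[
        AgentContribution("lab-agent", "flagged renal impairment", 0.92),
        AgentContribution("med-agent", "proposed alternative analgesic", 0.81),
    ],
    permissions_granted_by="dr.smith",
)
```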
The governance challenge: maintaining a complete, auditable chain of custody for every data access, decision, and action across all agents, users, and contexts. Most frameworks treat this as an afterthought. In healthcare, it's baseline.
The scalability illusion
Multi-agent architectures promise elegant scaling. Need more capacity? Spin up more agents.
The reality: you're scaling model inference, agent orchestration, inter-agent communication, context synchronization, and coordination overhead. More agents mean higher computational demands for communication and coordination.
The scaling challenges:
- Three-layer bottlenecks. Production systems must independently scale LLM inference, MCP tool servers, and orchestration backends. Each has different resource profiles. Tight coupling creates cascading failures where a slow tool server blocks agent execution, exhausts connection pools, stalls the system.
- Unpredictable workload patterns. One query might spawn three agent calls. Another might spawn thirty. Auto-scaling designed for predictable queue depths fails when your workload is a complex graph of interdependent agent tasks.
- Concurrency explosion. Multiple users running multi-agent workflows means N parallel workflows, each spawning M agents, each making P tool calls. The combinatorial explosion (N×M×P) overwhelms systems that handle traditional API traffic without issue.
Agents also struggle to judge appropriate effort for different tasks. Resource consumption becomes unpredictable. Traditional scaling strategies break down when your workload is non-deterministic by design.
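One common mitigation is bulkhead-style isolation: give inference, tool execution, and orchestration separate concurrency budgets so a slow layer can't exhaust shared resources. A minimal asyncio sketch (Python 3.10+) with illustrative limits:

```python
# Sketch: per-layer concurrency bulkheads with asyncio. Limits and names
# are illustrative; each pool would scale independently in production.
import asyncio

# Separate budgets per layer: a stalled tool server saturates only its
# own semaphore instead of the system-wide connection pool.
INFERENCE_SLOTS = asyncio.Semaphore(8)
TOOL_SLOTS = asyncio.Semaphore(32)
ORCHESTRATION_SLOTS = asyncio.Semaphore(100)

async def call_llm(prompt: str) -> str:
    async with INFERENCE_SLOTS:
        await asyncio.sleep(0.05)   # stand-in for model inference
        return f"plan for: {prompt}"

async def call_tool(name: str) -> str:
    async with TOOL_SLOTS:
        await asyncio.sleep(0.01)   # stand-in for an MCP tool call
        return f"{name}: ok"

async def run_workflow(query: str) -> list:
    async with ORCHESTRATION_SLOTS:
        plan = await call_llm(query)
        # One query may fan out to 3 tool calls or 30; the bulkheads
        # bound the blast radius either way.
        return await asyncio.gather(*(call_tool(f"tool-{i}") for i in range(3)))

async def main():
    # N workflows x M agents x P tool calls, bounded per layer.
    results = await asyncio.gather(*(run_workflow(f"q{i}") for i in range(20)))
    print(len(results), "workflows completed")

asyncio.run(main())
```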
What this means for production systems
Multi-agent frameworks come with an architecture tax: complexity costs paid in engineering time, infrastructure investment, and ongoing maintenance. The gap between prototype and production remains wide.
The key questions for evaluation:
- How does the framework handle context? Does it maintain continuous context or fragment it across agents? Can you trace decision lineage across agent interactions?
- What guarantees exist around data access? Are permissions enforced programmatically or through LLM obedience? How are identifiers validated?
- How is governance implemented? Does the system provide parameter-level provenance? Can you audit who approved what action, with what information, in what context?
- What does scaling actually look like? What's the coordination overhead? How do inference costs scale with agent count? What happens under variable load?
These architectural decisions determine whether your multi-agent system becomes a production asset or a maintenance burden.
The path forward
Teams running agents in production at scale rely on foundational capabilities that solve these problems:
- Context management that maintains decision continuity across agent interactions
- Programmatic guardrails that enforce access control through deterministic code, preventing hallucinated IDs from reaching tool execution
- Parameter-level provenance that captures who approved what action, with what information, in what context
- Independent scaling across inference, tooling, and orchestration layers to handle unpredictable multi-agent workloads
- Testing primitives designed for validating emergent multi-agent behavior
This infrastructure represents significant engineering investment. The decision for healthtech teams: spend that investment solving fundamental infrastructure problems, or spend it building the clinical workflows and features that differentiate your product.
The agent infrastructure platforms that succeed will solve these problems comprehensively, letting application teams focus on clinical innovation rather than reimplementing orchestration, governance, and observability for each new agent-powered feature.