In most cases, having multiple AI agents collaborate leads to a result that is no better, and often worse, than what the single most competent agent could achieve alone. The only observed exception is when success depends on generating a wide variety of ideas, as agents are good at sharing and adopting different approaches.

Related Insights

Pairing two AI agents to collaborate often fails. Because they share the same underlying model, they tend to agree excessively, reinforcing each other's bad ideas. This creates a feedback loop that fills their context windows with biased agreement, leaving them resistant to correction and prone to adopting increasingly extreme positions.

Multi-agent systems work well for easily parallelizable, "read-only" tasks like research, where sub-agents gather context independently. They are much trickier for "write" tasks like coding, where conflicting decisions between agents create integration problems.
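
A minimal sketch of that "read-only" fan-out pattern, assuming a hypothetical research() helper standing in for a real model or search call:

```python
import asyncio

async def research(topic: str) -> str:
    # Hypothetical read-only sub-agent: in a real system this would query a
    # model or search API for one topic and return its findings without
    # touching any shared state.
    await asyncio.sleep(0)          # stand-in for the network / model call
    return f"findings about {topic}"

async def parallel_research(topics: list[str]) -> str:
    # Read-only tasks fan out cleanly: each sub-agent works on its own topic,
    # and only the parent merges ("writes") the combined result.
    findings = await asyncio.gather(*(research(t) for t in topics))
    return "\n\n".join(findings)

# For "write" tasks like coding, the same fan-out breaks down: sub-agents edit
# overlapping state, and their independent decisions collide at merge time.
if __name__ == "__main__":
    print(asyncio.run(parallel_research(["agent topologies", "coordination overhead"])))
```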

In an attempt to scale autonomous coding, Cursor discovered that giving multiple AI agents equal status without hierarchy led to failure. The agents avoided difficult tasks, made only minor changes, and failed to take responsibility for major problems, causing the project to churn without meaningful progress.

Contrary to the expectation that more agents increase productivity, a Stanford study found that two AI agents collaborating on a coding task performed 50% worse than a single agent. This "curse of coordination" intensified as more agents were added, highlighting the significant overhead in multi-agent systems.

Researchers found that even extensive prompt optimization could not close the "synergy gap" in multi-agent teams. The real leverage for improving collaborative performance lies in designing the communication architecture: which agent talks to which, and in what sequence.
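
The study does not publish an API for this, but the idea can be sketched: treat the communication architecture as data, an ordered list of directed edges fixing who talks to whom and in what order, and compare topologies while the prompts stay constant. All names below are illustrative.

```python
# Hypothetical sketch: the communication architecture expressed as data, an
# ordered list of directed edges (speaker -> listener). Prompts stay fixed;
# only the wiring and the sequence of exchanges change between runs.
from dataclasses import dataclass, field

@dataclass
class Agent:
    name: str
    inbox: list[str] = field(default_factory=list)

    def respond(self, task: str) -> str:
        # Stand-in for a model call that conditions on the task plus whatever
        # messages this agent has received so far.
        return f"{self.name} on '{task}' given {len(self.inbox)} prior messages"

def run_topology(task: str, agents: dict[str, Agent], edges: list[tuple[str, str]]) -> str:
    # Edges are processed in order, so the same agents can be wired as a
    # chain, a star, or a round-robin debate just by changing this list.
    for speaker, listener in edges:
        agents[listener].inbox.append(agents[speaker].respond(task))
    final_agent = edges[-1][1]      # the last listener produces the answer
    return agents[final_agent].respond(task)

agents = {name: Agent(name) for name in ("planner", "coder", "reviewer")}
chain = [("planner", "coder"), ("coder", "reviewer")]
print(run_topology("fix the failing test", agents, chain))
```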

The study's finding that adding AI agents diminishes productivity provides a modern validation of Brooks's Law. The overhead required for coordination among agents negated any potential speed benefit from parallelizing the work, showing that simply adding more "developers" can be counterproductive.
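
The usual arithmetic behind Brooks's Law (not a figure from the study) makes the overhead concrete: pairwise communication channels grow quadratically with the number of agents, while the work itself parallelizes at best linearly.

```latex
% Pairwise communication channels among n agents:
\text{channels}(n) = \binom{n}{2} = \frac{n(n-1)}{2}
% e.g. 2 agents share 1 channel, 4 share 6, 8 share 28.
```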

Separating AI agents into distinct roles (e.g., a technical expert and a customer-facing communicator) mirrors real-world team specializations. This allows for tailored configurations, like different 'temperature' settings for creativity versus accuracy, improving overall performance and preventing role confusion.
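
A minimal configuration sketch of this kind of role split, assuming an OpenAI-style chat-completion request format; the role names, prompts, and temperature values are illustrative rather than taken from any particular study.

```python
# Two role-specialized agent configurations: a low-temperature technical expert
# for accuracy, and a higher-temperature communicator for friendlier,
# customer-facing wording. All values here are illustrative.
ROLES = {
    "technical_expert": {
        "system_prompt": "You are a precise technical expert. Give exact steps.",
        "temperature": 0.1,   # favor accuracy and determinism
    },
    "customer_communicator": {
        "system_prompt": "You rewrite technical answers in plain, friendly language.",
        "temperature": 0.8,   # allow more varied, creative phrasing
    },
}

def build_request(role: str, user_message: str) -> dict:
    """Assemble a chat-completion style request for the given role.
    Keeping roles in separate configs avoids the role confusion that comes
    from asking a single prompt to be both expert and communicator."""
    cfg = ROLES[role]
    return {
        "messages": [
            {"role": "system", "content": cfg["system_prompt"]},
            {"role": "user", "content": user_message},
        ],
        "temperature": cfg["temperature"],
    }

# Typical flow: the expert drafts an answer, then the communicator rewrites it.
request = build_request("technical_expert", "Why does the deploy fail on step 3?")
```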

The performance gap between solo and cooperating AI agents was largest on medium-difficulty tasks. Easy tasks had enough slack to absorb the coordination overhead, while hard tasks failed regardless of collaboration. This suggests mid-level work, which requires a balance of technical execution and cooperation, is most vulnerable to the coordination tax.

An experiment showed that given a fixed compute budget, training a population of 16 agents produced a top performer that beat a single agent trained with the entire budget. This suggests that the co-evolution and diversity of strategies in a multi-agent setup can be more effective than raw computational power alone.
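
The experiment's training loop isn't reproduced here, but the budget arithmetic and the comparison it implies are easy to sketch; train_and_evaluate is a placeholder, and the step counts are illustrative.

```python
TOTAL_BUDGET = 160_000                               # total training steps (illustrative)
POPULATION_SIZE = 16
PER_AGENT_BUDGET = TOTAL_BUDGET // POPULATION_SIZE   # 10,000 steps per population member

def train_and_evaluate(agent_id: int, steps: int) -> float:
    """Placeholder: train one agent for `steps` steps (e.g. via self-play
    against the rest of the population) and return its evaluation score."""
    raise NotImplementedError

def best_of_population() -> float:
    # The comparison in the experiment: a single agent trained on the full
    # budget vs. the best member of a 16-agent population that split the same
    # budget but co-evolved a diverse set of strategies.
    scores = [train_and_evaluate(i, PER_AGENT_BUDGET) for i in range(POPULATION_SIZE)]
    return max(scores)
```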

Even when an AI agent is an expert on a task, its pre-trained politeness can cause it to defer to less-capable agents. This "averaging" effect prevents the expert from taking a leadership role and harms the team's overall output, a phenomenon observed in Stanford's multi-agent research.