The performance gap between solo and cooperating AI agents was largest on medium-difficulty tasks. Easy tasks had enough slack to absorb coordination overhead, while hard tasks failed regardless of collaboration. This suggests that mid-level work, which demands a balance of technical execution and cooperation, is the most vulnerable to the coordination tax.
Multi-agent systems work well for easily parallelizable, "read-only" tasks like research, where sub-agents gather context independently. They are much trickier for "write" tasks like coding, where conflicting decisions between agents create integration problems.
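A minimal sketch of that split, assuming a hypothetical setup where names like research_agent and coding_agent are illustrative rather than taken from any specific framework: read-only research fans out in parallel, while all write work funnels through a single agent so conflicting edits never need to be merged.

```python
# Hypothetical sketch: parallel "read-only" research sub-agents,
# serialized "write" (code-editing) work. Illustrative names only.
import asyncio

async def research_agent(question: str) -> str:
    """Read-only work: gather context; safe to run many copies concurrently."""
    await asyncio.sleep(0.1)  # stand-in for a model/tool call
    return f"notes on: {question}"

async def coding_agent(task: str, context: list[str]) -> str:
    """Write work: produces edits; kept serial to avoid integration conflicts."""
    await asyncio.sleep(0.1)  # stand-in for a model/tool call
    return f"patch for '{task}' using {len(context)} research notes"

async def run(task: str, questions: list[str]) -> str:
    # Parallel fan-out is cheap and conflict-free for read-only research...
    context = await asyncio.gather(*(research_agent(q) for q in questions))
    # ...but the write step stays single-threaded by design.
    return await coding_agent(task, list(context))

if __name__ == "__main__":
    patch = asyncio.run(run(
        "add retry logic to the HTTP client",
        ["current retry behavior", "library options", "existing tests"],
    ))
    print(patch)
```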
In an attempt to scale autonomous coding, Cursor discovered that giving multiple AI agents equal status without hierarchy led to failure. The agents avoided difficult tasks, made only minor changes, and failed to take responsibility for major problems, causing the project to churn without meaningful progress.
Engineering productivity with AI agents hits a "valley of death" at medium autonomy. The tools excel at highly responsive, quick tasks (low autonomy) and at fully delegated background jobs (high autonomy). The frustrating middle ground is where a task is "not enough to delegate and not fun to wait," creating a key UX challenge.
Contrary to the expectation that more agents increase productivity, a Stanford study found that two AI agents collaborating on a coding task performed 50% worse than a single agent. This "curse of coordination" intensified as more agents were added, highlighting the significant overhead in multi-agent systems.
The rare successes in the CooperBench experiment were not random. They occurred when the agents, unprompted, adopted three behaviors: dividing roles with mutual confirmation, defining work with extreme specificity (e.g., exact line numbers), and negotiating via concrete, closed options rather than open-ended questions.
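One way to picture those three behaviors is as a structured message schema a pair of agents could exchange; the sketch below is an assumption for illustration, not a format that CooperBench prescribes. Role claims require explicit peer confirmation, work items are pinned to exact files and line ranges, and proposals are phrased as closed choices.

```python
# Hypothetical message schema illustrating the three observed behaviors.
from dataclasses import dataclass, field

@dataclass
class RoleClaim:
    agent: str
    role: str                        # e.g. "implement parser", "write tests"
    confirmed_by: str | None = None  # the division only holds once the peer confirms

@dataclass
class WorkItem:
    owner: str
    path: str                        # exact file, not "the backend"
    lines: tuple[int, int]           # exact line range, e.g. (120, 164)
    description: str

@dataclass
class Proposal:
    question: str
    options: list[str] = field(default_factory=list)  # concrete, closed choices

# Example exchange: Agent A claims the tokenizer work, pins its edits to
# specific lines, and asks a closed question instead of "what do you think?"
claim = RoleClaim(agent="A", role="refactor tokenizer", confirmed_by="B")
work = WorkItem(owner="A", path="src/lexer.py", lines=(88, 142),
                description="extract token table into its own module")
ask = Proposal(question="Where should the shared constants live?",
               options=["src/constants.py", "keep them in src/lexer.py"])
```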
The study's finding that adding AI agents diminishes productivity provides a modern validation of Brooks's Law. The overhead required for coordination among agents negated any potential speed gains from parallelizing the work, showing that simply adding more "developers" can be counterproductive.
Stanford researchers found the largest category of AI coordination failure (42%) was "expectation failure"—one agent ignoring clearly communicated plans from another. This is distinct from "communication failure" (26%), showing that simply passing messages is insufficient; the receiving agent must internalize and act on the shared information.
To overcome the low productivity of flat, non-hierarchical agent teams, developers are adopting hierarchical models such as the "Ralph Wiggum loop": "planner" agents break problems down and create tasks, while "worker" agents focus solely on executing them, relieving coordination bottlenecks and enabling steady progress.
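A minimal sketch of the general planner/worker pattern, assuming stand-in functions in place of model calls; this illustrates the hierarchy, not the actual "Ralph Wiggum loop" implementation. The planner owns decomposition, and workers simply drain the task queue without negotiating or re-planning.

```python
# Hypothetical planner/worker hierarchy: one planner decomposes the goal,
# workers execute exactly what the queue hands them.
from queue import Queue

def planner(goal: str) -> list[str]:
    """Break the goal into small, independently executable tasks."""
    # Stand-in for a planning model call.
    return [f"{goal}: step {i}" for i in range(1, 4)]

def worker(task: str) -> str:
    """Execute exactly one task; no negotiation, no re-planning."""
    # Stand-in for a coding model call.
    return f"done: {task}"

def run(goal: str) -> list[str]:
    tasks: Queue[str] = Queue()
    for t in planner(goal):          # the planner owns decomposition...
        tasks.put(t)
    results = []
    while not tasks.empty():         # ...the workers just drain the queue.
        results.append(worker(tasks.get()))
    return results

if __name__ == "__main__":
    for line in run("add input validation"):
        print(line)
```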
The hosts distinguish between "spatial" coordination (who works where) and "semantic" coordination (what the final result should be). AIs succeeded at the former, reducing merge conflicts, but failed overall because they lacked a shared understanding of the desired outcome—a common pitfall for human teams as well.
In the Stanford study, AI agents spent up to 20% of their time communicating, yet this yielded no statistically significant improvement in success rates compared to having no communication at all. The messages were often vague and ill-timed, jamming channels without improving coordination.