The next step for agents is self-awareness: understanding the specifics of their "harness"—the tools, APIs, and constraints of their environment. This awareness is a prerequisite for more advanced behaviors like identifying knowledge gaps and eventually modifying their own system prompts.
The ease of creating PRs with AI agents shifts the developer bottleneck from code generation to code validation. The new challenge is not writing the code, but gaining the confidence to merge it, elevating the importance of review, testing, and CI/CD pipelines.
A common failure with AI agents is underspecified prompts leading to incorrect implementations (e.g., a checkbox instead of a toggle). Video demos provide immediate visual feedback, creating a shared artifact that makes these misalignments obvious without needing to run the code locally.
Cursor found that an agentic layer combining models from different providers produced output superior to relying on any single, unified model tier. This highlights the value of model diversity in agentic systems: different models possess distinct strengths.
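A minimal sketch of what such a mixed-provider layer could look like: plan with one model, draft with another, review with a third. The model names and the `call()` stub are placeholders for illustration, not Cursor's actual routing.

```python
# Hypothetical sketch of a multi-provider agent loop. The model names and
# call() stub are stand-ins, not a real provider API.
def call(model: str, prompt: str) -> str:
    # Stand-in for an API call to a given provider's model.
    return f"[{model}] {prompt}"

def mixed_agent(task: str) -> str:
    # Each stage routes to the model best suited for it.
    plan = call("planner-model", f"plan: {task}")
    draft = call("coder-model", f"implement: {plan}")
    return call("reviewer-model", f"review: {draft}")
```

The design choice being illustrated: the stages are decoupled behind a common interface, so each can be pointed at a different provider without changing the loop.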
For bug fixes, Cursor's agents can be instructed to first reproduce a bug and create a video of it happening. They then fix it and make a second video showing the same workflow succeeding. This TDD-like "red-green" video proof dramatically increases confidence in the fix.
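The red-green structure of that workflow can be sketched in miniature. This is an illustrative stand-in, assuming a toggle bug like the one described; the recorder here just captures observed behavior where Cursor's agents would capture video.

```python
# Hypothetical "red-green" proof sketch: record the bug reproducing
# before the fix, then the same workflow succeeding after it.

def buggy_toggle(state: bool) -> bool:
    return state  # bug: toggling does nothing

def fixed_toggle(state: bool) -> bool:
    return not state

def record_run(label: str, toggle, state: bool) -> dict:
    """Stand-in for capturing a video: records whether the toggle worked."""
    result = toggle(state)
    return {"label": label, "passed": result != state}

def red_green_proof() -> list[dict]:
    red = record_run("red: reproduce bug", buggy_toggle, state=False)
    green = record_run("green: verify fix", fixed_toggle, state=False)
    # The proof is only valid if the bug was first shown failing.
    assert not red["passed"] and green["passed"], "proof invalid"
    return [red, green]
```

The key property, as in TDD, is that the failing "red" run must exist before the passing "green" one; a green recording alone proves nothing about the bug.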
At Cursor, development is increasingly happening in Slack channels. Team members collectively kick off and redirect a cloud agent in a thread, turning development into a collaborative discussion. The IDE becomes a secondary tool, while communication platforms become the primary surface.
To combat the bottleneck of reviewing massive, AI-generated pull requests, Cursor's agents create video demos of the features they build. This provides a much more accessible entry point for human review than a giant diff, helping to quickly align on the direction.
Cursor discovered that agents need more than just code access. Providing a full VM environment—a "brain in a box" where they can see pixels, run code, and use dev tools like a human—was the step-change needed to tackle entire features, not just minor edits.
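A toy version of that "brain in a box" idea: give the agent a scratch filesystem plus the ability to execute commands, not just edit text. The `Sandbox` class below is illustrative only, assuming a local temp directory rather than Cursor's actual VM environment.

```python
# Hedged sketch of a minimal agent sandbox: a scratch filesystem the
# agent can write to, and a way to run commands inside it. Not Cursor's
# real VM API - just the shape of the capability.
import pathlib
import subprocess
import sys
import tempfile

class Sandbox:
    def __init__(self):
        self.root = pathlib.Path(tempfile.mkdtemp(prefix="agent-"))

    def write(self, rel_path: str, text: str) -> None:
        path = self.root / rel_path
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(text)

    def run(self, *cmd: str) -> str:
        out = subprocess.run(cmd, cwd=self.root,
                             capture_output=True, text=True)
        return out.stdout

box = Sandbox()
box.write("hello.py", "print('hello from the sandbox')")
print(box.run(sys.executable, "hello.py"))
```

The point of the abstraction is the step change the note describes: once the agent can run what it writes, it can verify its own work end to end instead of producing unvalidated edits.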
Comparing outputs from multiple models ("best of N") is often impractical due to the effort of reviewing huge code diffs. By having parallel agents generate short video demos, developers can quickly watch multiple versions and decide which approach is most promising.
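A best-of-N fan-out can be sketched with a thread pool. The `run_agent` stub and variant names are hypothetical; a demo filename stands in for the video artifact each agent would produce.

```python
# Minimal best-of-N sketch: run N hypothetical agent variants in
# parallel and collect one reviewable artifact from each.
from concurrent.futures import ThreadPoolExecutor

def run_agent(variant: str) -> dict:
    # Stand-in for kicking off a cloud agent and awaiting its demo.
    return {"variant": variant, "demo": f"demo-of-{variant}.mp4"}

def best_of_n(variants: list[str]) -> list[dict]:
    with ThreadPoolExecutor(max_workers=len(variants)) as pool:
        return list(pool.map(run_agent, variants))

demos = best_of_n(["model-a", "model-b", "model-c"])
```

The reviewer's job then shrinks from reading N large diffs to watching N short demos and discarding all but the most promising candidate.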
To encourage a shift in user behavior, Cursor "unshipped" a file editor from its web UI. By restricting the ability to make small manual tweaks, they push users to delegate changes to the AI agent, reinforcing the new pattern of high-level instruction over low-level "hand coding."
The focus in AI engineering is shifting from making a single agent faster (latency) to running many agents in parallel (throughput). This "wider pipe" approach gets more total work done but will stress-test existing infrastructure like CI/CD, which wasn't built for this volume.
Cursor's "cloud agent diagnosis" command allows a primary agent to spin up specialized sub-agents that use integrations like Datadog to explore logs and diagnose another agent's failure. This creates a multi-agent system where agents act as external debuggers for each other.
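The fan-out shape of that diagnosis flow can be sketched as follows. The log records, source names, and agent functions are all made up for illustration; Cursor's actual Datadog integration is not public.

```python
# Illustrative multi-agent diagnosis sketch: a primary agent spins up
# one sub-agent per telemetry source and merges their findings. The log
# data here is fabricated for the example.
FAKE_LOGS = [
    {"source": "ci", "level": "error", "msg": "test_login timed out"},
    {"source": "app", "level": "info", "msg": "server started"},
    {"source": "app", "level": "error", "msg": "DB connection refused"},
]

def log_subagent(source: str) -> list[str]:
    """Sub-agent specialized in one source (Datadog, CI logs, etc.)."""
    return [e["msg"] for e in FAKE_LOGS
            if e["source"] == source and e["level"] == "error"]

def diagnose(sources: list[str]) -> dict:
    """Primary agent: delegate to one sub-agent per source, merge results."""
    return {s: log_subagent(s) for s in sources}

report = diagnose(["ci", "app"])
```

Each sub-agent only needs credentials and context for its own integration, which is what lets them act as independent external debuggers for the failing agent.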
