In the AI era, features are fast to generate, which creates a risk of bloat. Braintrust's CEO suggests a "carving" metaphor: start with a large, AI-generated block of functionality and then meticulously remove complexity. Most user complaints are solved not by adding more, but by simplifying and removing what's confusing.
Instead of relying on their lead designer for manual "vibe checks," the Braintrust team translates his qualitative feedback into quantifiable evaluation criteria. This "captures" the expert in the system, allowing his high quality bar to be applied systematically and at scale across the entire product.
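One way to picture "capturing" an expert is a rubric in which each piece of qualitative feedback is paired with a check that can be run automatically. The sketch below is only an illustration, not Braintrust's actual system: the screen fields (`primary_actions`, `labels`, `load_ms`), the feedback strings, and the thresholds are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

# A UI screen described as plain data. The field names are hypothetical.
Screen = Dict[str, Any]


@dataclass(frozen=True)
class Criterion:
    feedback: str                    # the expert's qualitative note
    check: Callable[[Screen], bool]  # a quantifiable version of that note


# Assumed mapping from feedback to checks. Real thresholds would come
# from the expert, not from this sketch.
RUBRIC: List[Criterion] = [
    Criterion("Too many competing buttons",
              lambda s: s["primary_actions"] <= 1),
    Criterion("Labels are too wordy",
              lambda s: all(len(label.split()) <= 3 for label in s["labels"])),
    Criterion("Feels slow to load",
              lambda s: s["load_ms"] <= 300),
]


def score_screen(screen: Screen,
                 rubric: List[Criterion] = RUBRIC) -> Tuple[float, List[str]]:
    """Return the fraction of criteria passed and the feedback that still applies."""
    failed = [c.feedback for c in rubric if not c.check(screen)]
    return 1 - len(failed) / len(rubric), failed
```

Once the feedback lives in a rubric like this, the same bar can be applied to every screen on every change, rather than only to the ones the expert has time to review.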
Braintrust operates with a "no backlog" mindset, enabled by AI. The productivity gains from agents mean there's "no excuse" not to immediately address performance issues or UI paper cuts that customers report. This shifts the team's focus to continuous improvement rather than letting small issues accumulate.
Evals shift product development from defining the "how" to defining the "what". By creating quantifiable tests and success criteria, evals act like a modern PRD. This allows an AI model to creatively figure out the implementation while the team focuses on defining the desired outcome through concrete examples.
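As a rough illustration of an eval acting as a spec, the sketch below states the "what" as concrete input/output examples plus a pass threshold, and accepts any implementation that meets it. Everything here is assumed for the example: the slug-generation task, the names `Case`, `run_eval`, and `candidate_slugify`, and the exact-match scorer. It does not use or represent Braintrust's API.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Case:
    input: str
    expected: str


# The "what": desired behavior expressed as concrete examples (hypothetical task).
CASES: List[Case] = [
    Case("Hello World", "hello-world"),
    Case("  Evals as PRDs  ", "evals-as-prds"),
    Case("AI/ML: 2024", "ai-ml-2024"),
]


def exact_match(output: str, expected: str) -> float:
    """Score 1.0 for an exact match, 0.0 otherwise."""
    return 1.0 if output == expected else 0.0


def run_eval(task: Callable[[str], str],
             cases: List[Case],
             scorer: Callable[[str, str], float],
             threshold: float = 1.0) -> Tuple[float, bool]:
    """Run every case through the task and report (mean score, passed)."""
    scores = [scorer(task(c.input), c.expected) for c in cases]
    mean = sum(scores) / len(scores)
    return mean, mean >= threshold


def candidate_slugify(text: str) -> str:
    """The "how": one possible implementation, which an agent could replace freely."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")
```

Calling `run_eval(candidate_slugify, CASES, exact_match)` checks a candidate against the spec. The cases and the threshold are the part the team owns, while the body of `candidate_slugify` can be rewritten by a model as many times as needed.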
Braintrust CEO Ankur Goyal argues that AI agents can run far more exhaustive benchmarks and test more algorithms than even the best staff engineers could by hand. This eliminates the common practice of prioritizing a few key benchmarks and "bullshitting" the rest, leading to more robust and performant software.
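The difference between a few key benchmarks and an exhaustive run can be sketched as a full matrix: every candidate algorithm is timed on every workload, and every result is checked for correctness. The example below is a minimal sketch that uses sorting as a stand-in problem. The algorithms, workload names, and sizes are assumptions chosen for illustration, not the benchmarks discussed in the source.

```python
import random
import time
from typing import Callable, Dict, List, Tuple


def insertion_sort(xs: List[int]) -> List[int]:
    out = list(xs)  # copy so the input is never mutated
    for i in range(1, len(out)):
        key = out[i]
        j = i - 1
        while j >= 0 and out[j] > key:
            out[j + 1] = out[j]
            j -= 1
        out[j + 1] = key
    return out


def merge_sort(xs: List[int]) -> List[int]:
    if len(xs) <= 1:
        return list(xs)
    mid = len(xs) // 2
    left, right = merge_sort(xs[:mid]), merge_sort(xs[mid:])
    merged: List[int] = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


# Candidate algorithms and workloads (all hypothetical; n must be positive).
ALGORITHMS: Dict[str, Callable[[List[int]], List[int]]] = {
    "builtin": sorted,
    "insertion": insertion_sort,
    "merge": merge_sort,
}
WORKLOADS: Dict[str, Callable[[int, random.Random], List[int]]] = {
    "random": lambda n, rng: [rng.randrange(n) for _ in range(n)],
    "sorted": lambda n, rng: list(range(n)),
    "reversed": lambda n, rng: list(range(n, 0, -1)),
    "duplicates": lambda n, rng: [rng.randrange(4) for _ in range(n)],
}


def run_matrix(algorithms, workloads, n: int = 300, repeats: int = 3,
               seed: int = 0) -> Dict[Tuple[str, str], float]:
    """Time every algorithm on every workload; return best seconds per cell."""
    rng = random.Random(seed)  # fixed seed so the inputs are reproducible
    results: Dict[Tuple[str, str], float] = {}
    for wname, make in workloads.items():
        data = make(n, rng)
        expected = sorted(data)
        for aname, algo in algorithms.items():
            best = float("inf")
            for _ in range(repeats):
                start = time.perf_counter()
                out = algo(data)
                best = min(best, time.perf_counter() - start)
                if list(out) != expected:
                    raise ValueError(f"{aname} is incorrect on workload {wname}")
            results[(aname, wname)] = best
    return results
```

Covering a new algorithm or workload only takes one more dictionary entry, so the matrix can keep growing without anyone having to choose which cells to skip. That kind of mechanical, repetitive expansion is what the source suggests handing to an agent.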
Engineers should define an "agent line": the threshold of tasks an AI agent can handle. By continuously re-evaluating what fits "below the agent line" and delegating it, senior engineers can free up significant time for more strategic, high-level work and creative problem-solving.
When an AI agent performs poorly, the most effective solution isn't clever prompt engineering. The Braintrust CEO's strategy is to "close the session" and rewrite the evaluation script from scratch. This forces clarity on the definition of success, since an unclear definition is often the root cause of the agent's failure.
To handle increased code output from AI agents, engineering teams must shift platform efforts to strengthening their CI/CD pipeline. Braintrust pauses feature work to improve CI, viewing it as earning the right to move faster. A robust CI system is the foundation for AI-driven development.
Braintrust CEO Ankur Goyal uses AI coding agents to solve deep technical challenges such as optimizing database queries. The agents exhaustively test different solutions from the database literature, a task too tedious and time-consuming for human engineers, which demonstrates AI's value on complex, high-risk problems.
