Get your free personalized podcast brief

We scan new podcasts and send you the top 5 insights daily.

While traditional Test-Driven Development (TDD) can be cumbersome, an "eval-driven" approach is perfect for AI. The workflow: 1) write a failing evaluation to reproduce a conversational bug, 2) prompt an AI to fix the underlying code/prompt, and 3) confirm the fix by re-running the eval to ensure it passes.

Related Insights

While evals involve testing, their purpose isn't just to report bugs (information), like traditional QA. For an AI PM, evals are a core tool to actively shape and improve the product's behavior and performance (transformation) by iteratively refining prompts, models, and orchestration layers.

Treating AI evaluation like a final exam is a mistake. For critical enterprise systems, evaluations should be embedded at every step of an agent's workflow (e.g., after planning, before action). This is akin to unit testing in classic software development and is essential for building trustworthy, production-ready agents.

Building reliable AI agents requires a developer mindset shift. The most critical task is not writing the agent's code but creating robust evaluations ('evals') that define and verify the desired business outcome. This makes a test-driven development approach non-negotiable for enterprise AI.

Unlike traditional software with deterministic outputs, generative AI systems require a new paradigm. Chip Huyen calls this "evaluation-driven development," where the focus shifts from writing fixed tests to building robust systems and guidelines for evaluating ambiguous, generative outputs.

Notion treats its entire evaluation process as a coding agent problem. The system is designed for an agent to download a dataset, run an eval, identify a failure, debug the issue, and implement a fix, all within an automated loop. This turns quality assurance into a meta-problem for agents to solve.

Building a functional AI agent is just the starting point. The real work lies in developing a set of evaluations ("evals") to test if the agent consistently behaves as expected. Without quantifying failures and successes against a standard, you're just guessing, not iteratively improving the agent's performance.

The prompts for your "LLM as a judge" evals function as a new form of PRD. They explicitly define the desired behavior, edge cases, and quality standards for your AI agent. Unlike static PRDs, these are living documents, derived from real user data and are constantly, automatically testing if the product meets its requirements.

Evals shift product development from defining the 'how' to defining the 'what'. By creating quantifiable tests and success criteria, evals act like a modern PRD. This allows an AI model to creatively figure out the implementation while the team focuses on defining the desired outcome through concrete examples.

When an AI agent performs poorly, the most effective solution isn't clever prompt engineering. Braintrust's CEO's strategy is to "close the session" and rewrite the evaluation script from scratch. This forces clarity on the definition of success, which is often the root cause of the agent's failure.

Traditional evals fall short for sophisticated agents. A more effective method is a built-in evaluation loop where one agent is tasked with grading the output of another. This allows for continuous, automated quality assessment, especially when done in separate context windows to avoid bias.