Evals shift product development from defining the 'how' to defining the 'what'. By creating quantifiable tests and success criteria, evals act like a modern PRD. This allows an AI model to creatively figure out the implementation while the team focuses on defining the desired outcome through concrete examples.
Evals involve testing, but unlike traditional QA, their purpose isn't just to report bugs (information). For an AI PM, evals are a core tool for actively shaping and improving the product's behavior and performance (transformation) by iteratively refining prompts, models, and orchestration layers.
Before building an AI agent, product managers must first create an evaluation set and scorecard. This 'eval-driven development' approach is critical for measuring whether training is improving the model and aligning its progress with the product vision. Without it, you cannot objectively demonstrate progress.
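To make the eval set and scorecard concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `EvalCase` and `scorecard` names, the three support questions, the substring-match grading rule, and the two stand-in agents. A real scorecard would likely use richer grading than a substring check.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    prompt: str
    expected: str  # text a correct answer must contain (hypothetical grading rule)


# Hypothetical eval set, written before any agent exists.
EVAL_SET = [
    EvalCase("What is the refund window?", "30 days"),
    EvalCase("Which plan includes SSO?", "Enterprise"),
    EvalCase("How do I reset my password?", "Settings"),
]


def scorecard(agent: Callable[[str], str], cases: list[EvalCase]) -> dict:
    """Run every case through the agent and summarize the pass rate."""
    results = [case.expected.lower() in agent(case.prompt).lower() for case in cases]
    passed = sum(results)
    return {"passed": passed, "total": len(cases), "pass_rate": passed / len(cases)}


def agent_v1(prompt: str) -> str:
    # Stand-in for an early agent that gives the same answer to every question.
    return "Refunds are accepted within 30 days."


def agent_v2(prompt: str) -> str:
    # Stand-in for an improved agent with an answer per question.
    answers = {
        "What is the refund window?": "Refunds are accepted within 30 days.",
        "Which plan includes SSO?": "SSO is part of the Enterprise plan.",
        "How do I reset my password?": "Open Settings, then choose Reset password.",
    }
    return answers.get(prompt, "I don't know.")


# Same eval set, two versions: the scorecard shows whether v2 is progress.
print(scorecard(agent_v1, EVAL_SET))
print(scorecard(agent_v2, EVAL_SET))
```

Because both versions are graded against the same fixed set, the difference in pass rate is the kind of objective evidence of progress the insight describes.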
Building non-deterministic AI products fundamentally changes the PM role. Instead of creating detailed, rigid specifications, the PM's primary task becomes defining and codifying "what good looks like." This is done by repeatedly grading AI outputs to train evaluation systems and guide the model's behavior.
Evals transform product specs from ambiguous documents into testable, measurable criteria. This gives product managers more leverage and provides clear targets for engineers, improving alignment and the quality of the final product.
At companies like OpenAI, the "currency of progress" with research teams is evals (evaluations). To get researchers excited about solving a specific problem, a PM must be able to frame it as a measurable eval with a clear rubric, test scenarios, and a target state.
Building reliable AI agents requires a developer mindset shift. The most critical task is not writing the agent's code but creating robust evaluations ('evals') that define and verify the desired business outcome. This makes a test-driven development approach non-negotiable for enterprise AI.
The primary bottleneck in improving AI is no longer data or compute, but the creation of 'evals': tests that measure a model's capabilities. These evals act as product requirements documents (PRDs) for researchers, defining what success looks like and guiding the training process.
Instead of traditional product requirements documents, AI PMs should define success through a set of specific evaluation metrics. Engineers then work to improve the system's performance against these evals in a "hill climbing" process, making the evals the functional specification for the product.
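The "hill climbing" loop can be sketched roughly as follows: propose a change, rerun the evals, and keep the change only if the score goes up. This is a simplified illustration, not a description of any team's actual process. The `run_evals` function is a hypothetical stand-in for a real eval suite, and the config keys (`few_shot_examples`, `retrieval`, `temperature`) and their score effects are invented for the example.

```python
from typing import Any, Callable

Config = dict[str, Any]


def run_evals(config: Config) -> int:
    """Hypothetical stand-in for an eval suite: evals passed out of 100."""
    passed = 50
    if config.get("few_shot_examples", 0) >= 3:
        passed += 20
    if config.get("retrieval", False):
        passed += 20
    if config.get("temperature", 0.0) > 0.8:
        passed -= 30
    return passed


def hill_climb(
    baseline: Config, changes: list[Callable[[Config], Config]]
) -> tuple[Config, list[int]]:
    """Apply each proposed change to the current best config; keep it only if the eval score improves."""
    best, best_score = baseline, run_evals(baseline)
    history = [best_score]
    for change in changes:
        candidate = change(best)
        score = run_evals(candidate)
        if score > best_score:
            best, best_score = candidate, score
        history.append(best_score)  # best score after considering this change
    return best, history


# Proposed changes, in the order an engineer might try them (all hypothetical).
changes = [
    lambda c: {**c, "few_shot_examples": 3},  # add few-shot examples
    lambda c: {**c, "temperature": 1.0},      # raise sampling temperature
    lambda c: {**c, "retrieval": True},       # enable retrieval
]

best, history = hill_climb({"temperature": 0.2}, changes)
print(history)  # [50, 70, 70, 90]
print(best)
```

In this run the temperature change is rejected because it lowers the score, which is the sense in which the evals, rather than a written document, decide what ships.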
The prompts for your "LLM as a judge" evals function as a new form of PRD. They explicitly define the desired behavior, edge cases, and quality standards for your AI agent. Unlike static PRDs, these are living documents: they are derived from real user data, and they constantly and automatically test whether the product meets its requirements.
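As an illustration of a judge prompt doubling as a requirements document, here is a minimal Python sketch. The requirements in `JUDGE_PROMPT`, the PASS/FAIL output format, and the `judge` helper are all assumptions made for this example. The model call is injected as a plain function (`call_llm`) so that no particular vendor API is implied, and `fake_llm` is a stub used only to make the example runnable.

```python
from typing import Callable

# The judge prompt reads like a small PRD: each numbered line is a requirement.
JUDGE_PROMPT = """You are grading a customer-support agent's reply.

Requirements (hypothetical):
1. The reply answers the user's question directly.
2. The reply never promises a refund outside the 30-day window.
3. If the question is out of scope, the reply offers a human handoff.

User message:
{user_message}

Agent reply:
{agent_reply}

Answer with exactly one word on the first line: PASS or FAIL.
Then give a one-sentence reason."""


def judge(user_message: str, agent_reply: str, call_llm: Callable[[str], str]) -> bool:
    """Fill in the judge prompt, call the model, and parse the first-line verdict."""
    prompt = JUDGE_PROMPT.format(user_message=user_message, agent_reply=agent_reply)
    lines = call_llm(prompt).strip().splitlines()
    verdict = lines[0].strip().upper() if lines else ""
    if verdict not in {"PASS", "FAIL"}:
        raise ValueError(f"Unparseable judge verdict: {verdict!r}")
    return verdict == "PASS"


def fake_llm(prompt: str) -> str:
    # Stub standing in for a real model call: fails replies that mention 60 days.
    if "60 days" in prompt:
        return "FAIL\nThe reply promises a refund outside the 30-day window."
    return "PASS\nThe reply meets the listed requirements."


print(judge("Can I get a refund?", "Yes, within 30 days of purchase.", fake_llm))
print(judge("Can I get a refund?", "Sure, any time within 60 days.", fake_llm))
```

Changing a requirement then means editing the prompt text, and every subsequent eval run checks the product against the updated wording.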
Instead of writing detailed Product Requirement Documents (PRDs), use a brief prompt with an AI tool like Vercel's v0. The generated prototype immediately reveals gaps and unstated assumptions in your thinking, allowing you to refine requirements based on the AI's 'misinterpretations' before creating a clearer final spec.