The prompts for your "LLM as a judge" evals function as a new form of PRD. They explicitly define the desired behavior, edge cases, and quality standards for your AI agent. Unlike static PRDs, these are living documents, derived from real user data and are constantly, automatically testing if the product meets its requirements.

Related Insights

Don't treat evals as a mere checklist. Instead, use them as a creative tool to discover opportunities. A well-designed eval can reveal that a product is underperforming for a specific user segment, pointing directly to areas for high-impact improvement that a simple "vibe check" would miss.

AI prototyping doesn't replace the PRD; it transforms its purpose. Instead of being a static document, the PRD's rich context and user stories become the ideal 'master prompt' to feed into an AI tool, ensuring the initial design is grounded in strategic requirements.

Treating AI evaluation like a final exam is a mistake. For critical enterprise systems, evaluations should be embedded at every step of an agent's workflow (e.g., after planning, before action). This is akin to unit testing in classic software development and is essential for building trustworthy, production-ready agents.

Instead of manually refining a complex prompt, create a process where an AI agent evaluates its own output. By providing a framework for self-critique, including quantitative scores and qualitative reasoning, the AI can iteratively enhance its own system instructions and achieve a much stronger result.

The primary bottleneck in improving AI is no longer data or compute, but the creation of 'evals'—tests that measure a model's capabilities. These evals act as product requirement documents (PRDs) for researchers, defining what success looks like and guiding the training process.

Do not blindly trust an LLM's evaluation scores. The biggest mistake is showing stakeholders metrics that don't match their perception of product quality. To build trust, first hand-label a sample of data with binary outcomes (good/bad), then compare the LLM judge's scores against these human labels to ensure agreement before deploying the eval.

When creating an "LLM as a judge" to automate evaluations, resist the urge to use a 1-5 rating scale. This creates ambiguity (what does a 3.2 vs 3.7 mean?). Instead, force the judge to make a binary "pass" or "fail" decision. It's a more painful but ultimately more tractable and actionable way to measure quality.

You don't need to create an automated "LLM as a judge" for every potential failure. Many issues discovered during error analysis can be fixed with a simple prompt adjustment. Reserve the effort of building robust, automated evals for the 4-7 most persistent and critical failure modes that prompt changes alone cannot solve.

Instead of just asking an AI to write a PRD, first provide it with a "Socratic questioning" template. The LLM will then act as a thinking partner, asking challenging, open-ended questions about the problem and solution. This upfront thinking process results in a significantly more robust final document.

To prevent AI coding assistants from hallucinating, developer Terry Lynn uses a two-step process. First, an AI generates a Product Requirements Document (PRD). Then, a separate AI "reviewer" rates the PRD's clarity out of 10, identifying gaps before any code is written, ensuring a higher rate of successful execution.