We scan new podcasts and send you the top 5 insights daily.
When developers are their own users (e.g., building coding tools), intuition is a reliable guide. However, in specialized domains like healthcare, where developers lack subject matter expertise, structured evals are essential to bridge the knowledge gap.
Don't treat evals as a mere checklist. Instead, use them as a creative tool to discover opportunities. A well-designed eval can reveal that a product is underperforming for a specific user segment, pointing directly to areas for high-impact improvement that a simple "vibe check" would miss.
Even roles far from the customer, like engineering, make countless micro-decisions. Without an intuitive understanding of customer pull—what they're trying to achieve and why they're blocked—these decisions will likely miss the mark, even when just following a requirements document.
AI models and frameworks change constantly. A deep understanding of user needs, encoded into a robust evaluation suite, is a lasting asset. This allows you to continuously iterate and improve quality, regardless of which new model or agent framework becomes popular.
Evals transform product specs from ambiguous documents into testable, measurable criteria. This gives product managers more leverage and provides clear targets for engineers, improving alignment and the quality of the final product.
Simple, function-level evals are a "local optimization." Blitzy evaluates system changes by tasking them with completing large, real-world projects (e.g., modifying Apache Spark) and assessing the percentage of completion. This requires human "taste" to judge the gap between functional correctness and true user intent.
AI evaluation shouldn't be confined to engineering silos. Subject matter experts (SMEs) and business users hold the critical domain knowledge to assess what's "good." Providing them with GUI-based tools, like an "eval studio," is crucial for continuous improvement and building trustworthy enterprise AI.
Product managers may lack the expertise to create comprehensive evals from scratch. A better approach is to generate initial outputs with a base model, have subject matter experts review them, and use their direct feedback to define what constitutes a failure. It's easier for experts to spot mistakes than to predict them.
The common mistake in building AI evals is jumping straight to writing automated tests. The correct first step is a manual process called "error analysis" or "open coding," where a product expert reviews real user interaction logs to understand what's actually going wrong. This grounds your entire evaluation process in reality.
Shift the AI development process by starting with workshops for the people who will live with the system, not just those who pay for it. The primary goal is to translate their stories and needs into tangible checks for fairness and feedback before focusing on technical metrics like accuracy and speed.
The prompts for your "LLM as a judge" evals function as a new form of PRD. They explicitly define the desired behavior, edge cases, and quality standards for your AI agent. Unlike static PRDs, these are living documents, derived from real user data and are constantly, automatically testing if the product meets its requirements.