Formal Evals Are Crucial When There Is 'Distance' Between Your Team and the End-User

Related Insights

AI Evals Should Be Used Strategically to Uncover Opportunities, Not Just for Quality Control

Don't treat evals as a mere checklist. Instead, use them as a creative tool to discover opportunities. A well-designed eval can reveal that a product is underperforming for a specific user segment, pointing directly to areas for high-impact improvement that a simple "vibe check" would miss.

AI Engineering 101 with Chip Huyen (Nvidia, Stanford, Netflix)

Lenny's Podcast: Product | Career | Growth·8 months ago
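One way to picture the segment-level finding described in this insight is to break an eval's pass rate down by user segment rather than reporting a single aggregate. The sketch below is illustrative only: the segment names, the sample results, and the `pass_rate_by_segment` helper are all hypothetical and do not come from the episode.

```python
from collections import defaultdict


def pass_rate_by_segment(results):
    """Map each segment to its pass rate, given (segment, passed) pairs."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for segment, passed in results:
        totals[segment] += 1
        if passed:
            passes[segment] += 1
    return {segment: passes[segment] / totals[segment] for segment in totals}


# Hypothetical eval results: (user segment, did the output pass the check?)
results = [
    ("enterprise", True), ("enterprise", True),
    ("enterprise", True), ("enterprise", False),
    ("student", True), ("student", False),
    ("student", False), ("student", False),
]

rates = pass_rate_by_segment(results)
overall = sum(passed for _, passed in results) / len(results)
weakest = min(rates, key=rates.get)  # segment with the lowest pass rate

print(f"overall pass rate: {overall:.0%}")
for segment, rate in sorted(rates.items()):
    print(f"  {segment}: {rate:.0%}")
print(f"weakest segment: {weakest}")
```

With this made-up data the aggregate pass rate looks middling, while the per-segment view shows that one segment accounts for most of the failures, which is the kind of opportunity a single overall score would hide.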

Engineers Need a "Ground Theory" on Customer Demand to Avoid Arbitrary Decisions

Even roles far from the customer, like engineering, make countless micro-decisions. Without an intuitive understanding of customer pull—what they're trying to achieve and why they're blocked—these decisions will likely miss the mark, even when just following a requirements document.

The PULL framework (finding PMF)

The Physics of Startups with Rob Snyder·7 months ago

Invest in Evals as Your Durable Moat, Not in Transient LLM or Agent Architectures

AI models and frameworks change constantly. A deep understanding of user needs, encoded into a robust evaluation suite, is a lasting asset. This allows you to continuously iterate and improve quality, regardless of which new model or agent framework becomes popular.

Evals are the new PRD. Here is the playbook with the CEO of the leader in the space (Ankur Goyal, Founder and CEO, Braintrust)

The Growth Podcast·3 months ago

Replace Qualitative PRDs with Quantifiable 'Evals' to Guide AI Product Development

Evals transform product specs from ambiguous documents into testable, measurable criteria. This gives product managers more leverage and provides clear targets for engineers, improving alignment and the quality of the final product.

Evals are the new PRD. Here is the playbook with the CEO of the leader in the space (Ankur Goyal, Founder and CEO, Braintrust)

The Growth Podcast·3 months ago

Evaluate AI Systems on Large-Scale Projects to Assess True Capability, Not Micro-Benchmarks

Simple, function-level evals are a "local optimization." Blitzy evaluates system changes by tasking them with completing large, real-world projects (e.g., modifying Apache Spark) and assessing the percentage of completion. This requires human "taste" to judge the gap between functional correctness and true user intent.

Infinite Code Context: AI Coding at Enterprise Scale w/ Blitzy CEO Brian Elliott & CTO Sid Pardeshi

"The Cognitive Revolution" | AI Builders, Researchers, and Live Player Analysis·5 months ago

Empower Business Experts with GUI-Based Tools to Evaluate AI Systems

AI evaluation shouldn't be confined to engineering silos. Subject matter experts (SMEs) and business users hold the critical domain knowledge to assess what's "good." Providing them with GUI-based tools, like an "eval studio," is crucial for continuous improvement and building trustworthy enterprise AI.

AI Agents for PMs in 69 Minutes — Masterclass with IBM VP

Product Growth Podcast·10 months ago

Use Domain Experts to Define Failure Criteria After Reviewing Initial AI Outputs

Product managers may lack the expertise to create comprehensive evals from scratch. A better approach is to generate initial outputs with a base model, have subject matter experts review them, and use their direct feedback to define what constitutes a failure. It's easier for experts to spot mistakes than to predict them.

AI Evals Explained Simply by Ankit Shula

The Growth Podcast·4 months ago

Stop Writing Tests First; Effective AI Evals Begin with Manual Error Analysis of User Logs

The common mistake in building AI evals is jumping straight to writing automated tests. The correct first step is a manual process called "error analysis" or "open coding," where a product expert reviews real user interaction logs to understand what's actually going wrong. This grounds your entire evaluation process in reality.

Why AI evals are the hottest new skill for product builders | Hamel Husain & Shreya Shankar (creators of the #1 eval course)

Lenny's Podcast: Product | Career | Growth·9 months ago
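After the manual review this insight describes, a natural next step is to count how often each failure label appears so the most common problems are addressed first. The following is a minimal sketch under stated assumptions: the log IDs, the failure labels, and the `rank_failure_modes` helper are hypothetical, and the episode does not prescribe this particular code.

```python
from collections import Counter


def rank_failure_modes(annotations):
    """Count failure labels across reviewed logs, most frequent first."""
    counts = Counter(
        label for labels in annotations.values() for label in labels
    )
    return counts.most_common()


# Hypothetical output of a manual review: log ID -> failure labels the
# reviewer assigned while reading the interaction (empty list = no issue).
annotations = {
    "log-001": ["ignored user constraint"],
    "log-002": [],
    "log-003": ["ignored user constraint", "wrong tone"],
    "log-004": ["hallucinated detail"],
    "log-005": ["ignored user constraint"],
}

ranked = rank_failure_modes(annotations)
for label, count in ranked:
    print(f"{count:>2}  {label}")
```

The ranked list suggests which automated evals are worth writing first; labels that show up only once may not justify a dedicated test yet.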

Begin Human-Centered AI Design by Converting End-User Stories into Concrete System Checks

Shift the AI development process by starting with workshops for the people who will live with the system, not just those who pay for it. The primary goal is to translate their stories and needs into tangible checks for fairness and feedback before focusing on technical metrics like accuracy and speed.

E204: Human-Centered AI: Designing Intelligence That Aligns With Us

AI For Pharma Growth·4 months ago

AI Evals Are the New Product Requirements Docs (PRDs), Codifying Desired Behavior

The prompts for your "LLM as a judge" evals function as a new form of PRD. They explicitly define the desired behavior, edge cases, and quality standards for your AI agent. Unlike static PRDs, they are living documents: derived from real user data, they continuously and automatically test whether the product meets its requirements.

Why AI evals are the hottest new skill for product builders | Hamel Husain & Shreya Shankar (creators of the #1 eval course)

Lenny's Podcast: Product | Career | Growth·9 months ago
