Instead of traditional product requirements documents, AI PMs should define success through a set of specific evaluation metrics. Engineers then work to improve the system's performance against these evals in a "hill climbing" process, making the evals the functional specification for the product.
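A minimal sketch of what "evals as the spec" could look like in practice. The metric names, methods, and thresholds below are hypothetical placeholders, not taken from the source:

```python
# Hypothetical example: an eval suite standing in for a PRD.
# Each entry names a metric, how it is measured, and the target the
# product must hit; engineers "hill climb" against these numbers.
EVAL_SPEC = [
    {"metric": "answer_accuracy",  "method": "llm_judge",  "target": 0.90},
    {"metric": "citation_present", "method": "code_check", "target": 0.99},
    {"metric": "p95_latency_s",    "method": "code_check", "target": 3.0, "lower_is_better": True},
]

def meets_spec(results: dict[str, float]) -> bool:
    """Return True only if every metric in the spec hits its target."""
    for item in EVAL_SPEC:
        value = results[item["metric"]]
        if item.get("lower_is_better"):
            if value > item["target"]:
                return False
        elif value < item["target"]:
            return False
    return True
```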
While evals involve testing, their purpose isn't just to report bugs, as in traditional QA (information). For an AI PM, evals are a core tool to actively shape and improve the product's behavior and performance (transformation) by iteratively refining prompts, models, and orchestration layers.
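A sketch of that transformation loop, assuming a hypothetical `run_evals()` helper that scores a prompt variant against the suite and returns a number in [0, 1]:

```python
# Sketch: evals drive prompt iteration, not just bug reporting.
def hill_climb(prompt_variants: list[str], run_evals) -> tuple[str, float]:
    """Try each candidate prompt and keep the one that scores best on the evals."""
    best_prompt, best_score = None, -1.0
    for prompt in prompt_variants:
        score = run_evals(prompt)      # aggregate eval score for this variant
        if score > best_score:         # keep the best performer so far
            best_prompt, best_score = prompt, score
    return best_prompt, best_score
```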
Complex documents like evaluation strategies are rarely read beforehand. To ensure alignment, adopt the Amazon practice of dedicating the first 15-20 minutes of a kickoff meeting to silent, focused reading. This forces engagement and leads to a more informed and productive discussion.
PMs often default to the most powerful, expensive models. However, a comprehensive eval suite can demonstrate that a significantly cheaper or smaller model achieves the required quality for a specific task, drastically reducing operational costs. The evals provide the confidence to make this trade-off.
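One way this comparison could be wired up. The model names and the `score_case()` scoring helper are illustrative assumptions:

```python
# Hypothetical comparison: run the same eval cases against a large and a
# small model, then check whether the cheaper one clears the quality bar.
def compare_models(eval_cases, score_case, quality_bar=0.90):
    """score_case(model, case) -> 1.0 (pass) or 0.0 (fail) is an assumed helper."""
    results = {}
    for model in ("large-flagship-model", "small-cheap-model"):  # placeholder names
        scores = [score_case(model, case) for case in eval_cases]
        results[model] = sum(scores) / len(scores)
    cheap_is_enough = results["small-cheap-model"] >= quality_bar
    return results, cheap_is_enough
```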
Instead of manually crafting complex evaluation prompts, a more effective workflow is for a human to define the high-level criteria and red flags, then feed that guidance into a powerful LLM to generate the final, detailed, and robust prompt for the evaluation system; AI is often better at prompt construction than humans.
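A sketch of the two-step workflow. The criteria, red flags, and `call_llm()` stub are placeholders for your own content and model client:

```python
def call_llm(prompt: str) -> str:
    """Placeholder: swap in your actual model client here."""
    return "<generated judge prompt would appear here>"

# Human-authored guidance: what matters, and what fails automatically.
CRITERIA = ["Answers the user's question directly", "Cites a source for factual claims"]
RED_FLAGS = ["Invents product features that do not exist", "Condescending tone"]

# Meta-prompt: ask a strong model to write the detailed judge prompt.
META_PROMPT = f"""You are writing an evaluation prompt for an LLM judge.
Success criteria:
{chr(10).join('- ' + c for c in CRITERIA)}
Red flags (automatic failure):
{chr(10).join('- ' + r for r in RED_FLAGS)}
Write a detailed judging prompt that scores a response 1-5 and explains the score."""

judge_prompt = call_llm(META_PROMPT)  # the PM reviews the generated prompt before using it
```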
A one-size-fits-all evaluation method is inefficient. Use simple code for deterministic checks like word count. Leverage an LLM-as-a-judge for subjective qualities like tone. Reserve costly human evaluation for ambiguous cases flagged by the LLM or for validating new features.
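A sketch of that tiered routing, assuming a hypothetical `judge_tone()` LLM-as-a-judge call that returns a score and a confidence value:

```python
# Tiered evaluation: cheap deterministic checks first, an LLM judge for
# subjective qualities, humans only for cases the judge is unsure about.
def evaluate(response: str, judge_tone) -> dict:
    result = {"word_count_ok": 50 <= len(response.split()) <= 300}  # deterministic code check
    tone_score, confidence = judge_tone(response)                   # subjective: LLM-as-a-judge
    result["tone_score"] = tone_score
    result["needs_human_review"] = confidence < 0.7                 # ambiguous: escalate to a human
    return result
```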
Metrics like BLEU and ROUGE compare word overlap, not meaning. An LLM's output like "the cat is on the bed" can be factually wrong relative to the ground truth "the cat is on the mat" yet still score high, because nearly every word overlaps. This highlights the need for more sophisticated, meaning-aware evaluations like LLM-as-a-judge.
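A toy illustration of the overlap problem. This is a simplified unigram-precision score, not the full BLEU or ROUGE formula, but it shows why the wrong answer still scores high:

```python
# Five of the six candidate words match the reference, so a pure
# overlap score rates the factually wrong answer ~0.83.
def unigram_precision(candidate: str, reference: str) -> float:
    cand, ref = candidate.split(), reference.split()
    matches = sum(1 for w in cand if w in ref)
    return matches / len(cand)

print(unigram_precision("the cat is on the bed", "the cat is on the mat"))  # 0.833...
```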
Product managers may lack the expertise to create comprehensive evals from scratch. A better approach is to generate initial outputs with a base model, have subject matter experts review them, and use their direct feedback to define what constitutes a failure. It's easier for experts to spot mistakes than to predict them.
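One way to capture that expert feedback so it feeds back into the evals. The annotation fields and failure-mode labels below are illustrative:

```python
# Sketch: experts annotate base-model outputs, and each flagged mistake
# becomes a named failure mode the eval suite checks for going forward.
from dataclasses import dataclass

@dataclass
class SMEAnnotation:
    input_text: str
    model_output: str
    is_failure: bool
    failure_mode: str = ""   # e.g. "hallucinated dosage", "missed disclaimer"

def failure_modes(annotations: list[SMEAnnotation]) -> dict[str, int]:
    """Count how often each expert-identified failure mode appears."""
    counts: dict[str, int] = {}
    for a in annotations:
        if a.is_failure:
            counts[a.failure_mode] = counts.get(a.failure_mode, 0) + 1
    return counts
```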
Don't just rely on explicit feedback like thumbs up/down; soft signals are powerful evaluation inputs. A user repeatedly re-generating an answer, quickly abandoning a session, or escalating to human support is a strong indicator that your AI is failing, even if they never say so explicitly.
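A sketch of mining those soft signals from session logs. The event names are assumptions about what an analytics pipeline might record:

```python
# Each session is an ordered list of event names from your product logs.
def soft_signal_report(sessions: list[list[str]]) -> dict[str, float]:
    n = len(sessions)
    return {
        "regeneration_rate":  sum("regenerate_clicked" in s for s in sessions) / n,
        "quick_abandon_rate": sum(s[-1] == "abandoned_under_30s" for s in sessions) / n,
        "escalation_rate":    sum("escalated_to_human" in s for s in sessions) / n,
    }
```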
