Prompts are written in English and encapsulate the AI's core logic and personality. It is a mistake to treat them as code firewalled within the engineering team. Product managers, as domain experts, should have direct access to edit and experiment with prompts through user-friendly admin interfaces.
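As a minimal sketch of what this can look like, the prompt lives in a store outside the codebase (here a hypothetical JSON file; the path, file format, and field names are assumptions) that an admin interface can read and write, so a PM's edit never requires a code change:

```python
import json
from pathlib import Path

# Hypothetical prompt store: a JSON file the admin UI writes to, so prompt
# edits don't require touching application code or redeploying.
PROMPT_STORE = Path("prompts/support_agent.json")

def load_prompt(name: str) -> str:
    """Return the current text of the named prompt, e.g. "system"."""
    prompts = json.loads(PROMPT_STORE.read_text(encoding="utf-8"))
    return prompts[name]

system_prompt = load_prompt("system")
```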
Systematically review production traces ("open coding"), categorize the observed errors ("axial coding"), and then count them. This simple process transforms subjective "vibe checks" and messy logs into a prioritized, data-backed roadmap for improving your AI application, giving PMs a superpower.
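A minimal sketch of the counting step, assuming the open-coding notes have already been mapped to a fixed error taxonomy (the category labels here are made up):

```python
from collections import Counter

# One entry per reviewed trace: the axial code(s) assigned to it.
coded_traces = [
    ["hallucinated_policy"],
    ["wrong_tone", "missed_handoff"],
    ["hallucinated_policy"],
    [],  # trace reviewed, no error found
]

# Counting the categories turns the review into a prioritized roadmap.
error_counts = Counter(code for trace in coded_traces for code in trace)
for category, count in error_counts.most_common():
    print(f"{category}: {count}")
```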
Assigning error analysis to engineers or external teams is a huge pitfall. The process of reviewing traces and identifying failures is where product taste, domain expertise, and unique user understanding are embedded into the AI. It is a core product management function, not a technical task to be delegated.
Not every identified error requires building a formal evaluation. Some issues, like a simple formatting error, can be fixed directly in the prompt or code without an accompanying eval. Reserve the effort of building robust evals for systemic, complex problems that you anticipate needing to iterate on over time.
You don't need a sophisticated and expensive AI observability platform to start doing evals. The most critical first step is logging traces. This can be done simply by writing to a CSV, JSON, or text file. The key is to begin taking notes on your traces, not to implement the perfect tool.
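A minimal sketch of this kind of lightweight logging, appending one JSON object per request to a local JSONL file (the field names are illustrative):

```python
import json
import time

def log_trace(user_input: str, model_output: str, path: str = "traces.jsonl") -> None:
    """Append one trace per line to a local JSONL file."""
    record = {
        "timestamp": time.time(),
        "user_input": user_input,
        "model_output": model_output,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```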
Don't rely on a simple agreement percentage to validate an LLM judge. If failures are rare (e.g., 10% of cases), a judge that always predicts "pass" will have 90% agreement but be useless. Instead, measure its performance on positive and negative cases separately (e.g., True Positive Rate and True Negative Rate).
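A minimal sketch of the separate-rate calculation, where "positive" means a human labeled the trace a failure:

```python
def judge_rates(human_fail: list[bool], judge_fail: list[bool]) -> tuple[float, float]:
    """Return (TPR, TNR) for the judge against human labels."""
    tp = sum(h and j for h, j in zip(human_fail, judge_fail))
    tn = sum(not h and not j for h, j in zip(human_fail, judge_fail))
    positives = sum(human_fail)
    negatives = len(human_fail) - positives
    tpr = tp / positives if positives else float("nan")
    tnr = tn / negatives if negatives else float("nan")
    return tpr, tnr

# A judge that never flags a failure scores TPR = 0.0 here, even though its
# raw agreement with humans would be 90% when only 10% of cases fail.
```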
If your application isn't live and you lack real user data, you can still perform evals. The best methods are dogfooding the product yourself and recruiting friends to use it. If that's not possible, use an LLM to simulate user interactions at scale. This generates the necessary traces to begin the crucial error analysis process before launch.
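A minimal sketch of simulated users, assuming placeholder helpers: `call_llm` stands in for your model client and `run_app` for your application's entry point, and the personas are illustrative:

```python
personas = [
    "a frustrated customer whose order arrived damaged",
    "a non-native English speaker asking how refunds work",
]

def call_llm(prompt: str) -> str:
    """Stand-in for your LLM client (e.g., a chat-completions request)."""
    raise NotImplementedError

def run_app(user_msg: str) -> str:
    """Stand-in for your AI application's response to a user message."""
    raise NotImplementedError

def simulate_session(persona: str, turns: int = 3) -> list[dict]:
    """Have the LLM play the user for a few turns and record the trace."""
    trace = []
    user_msg = call_llm(f"You are {persona}. Write your opening message to support.")
    for _ in range(turns):
        app_reply = run_app(user_msg)
        trace.append({"user": user_msg, "assistant": app_reply})
        user_msg = call_llm(
            f"You are {persona}. The agent replied: {app_reply!r}. Respond in character."
        )
    return trace
```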
When using an LLM to evaluate another AI's output, instruct it to return a binary score (e.g., True/False, Pass/Fail) instead of a numeric scale such as 1–5. Binary outputs are easier to align with human preferences and map directly to the binary decisions (e.g., ship or fix) that product teams ultimately make.
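A minimal sketch of a binary judge; the criterion wording is illustrative and `call_llm` again stands in for whatever model client you use:

```python
JUDGE_PROMPT = """You are evaluating a customer-support reply.
Criterion: the reply answers the user's question without inventing policy.

User message: {user}
Assistant reply: {assistant}

Answer with exactly one word: PASS or FAIL."""

def call_llm(prompt: str) -> str:
    """Stand-in for your LLM client call."""
    raise NotImplementedError

def judge(user: str, assistant: str) -> bool:
    """Return True for PASS, False for FAIL."""
    verdict = call_llm(JUDGE_PROMPT.format(user=user, assistant=assistant))
    return verdict.strip().upper().startswith("PASS")
```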
AI tools like ChatGPT can analyze traces for basic correctness but miss subtle product experience failures. A product manager's contextual knowledge is essential to identify issues like improper formatting for a specific channel (e.g., markdown in SMS) or failures in user experience that an LLM would deem acceptable.
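Once a PM has named such a channel-specific failure, it can often be caught with a cheap programmatic check alongside the LLM judge. A minimal sketch for the markdown-in-SMS example (the marker patterns are illustrative, not exhaustive):

```python
import re

# Bold, heading, and link syntax that an SMS channel cannot render.
MARKDOWN_MARKERS = re.compile(r"(\*\*|__|^#{1,6}\s|\[.+?\]\(.+?\))", re.MULTILINE)

def sms_formatting_issue(message: str) -> bool:
    """Flag an SMS-bound reply that contains markdown formatting."""
    return bool(MARKDOWN_MARKERS.search(message))
```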
