Improve AI Quality by Manually Reviewing 100 User Chats Before Building Automated Systems

Related Insights

To Debug AI Agents, Identify and Log Only the First Error in an Interaction Chain

AI interactions often involve multiple steps (e.g., user prompt, tool calls, retrieval). When an error occurs, the entire chain can fail. The most efficient debugging heuristic is to analyze the sequence and stop at the very first mistake. Focusing on this "most upstream problem" addresses the root cause, as downstream failures are merely symptoms.

Evals, error analysis, and better prompts: A systematic approach to improving your AI products | Hamel Husain (ML engineer)

How I AI·9 months ago

Create Specific AI Evals Based on Top Error Categories, Not Generic Metrics like "Helpfulness"

Generic evaluation metrics like "helpfulness" or "conciseness" are vague and untrustworthy. A better approach is to first perform manual error analysis to find recurring problems (e.g., "tour scheduling failures"). Then, build specific, targeted evaluations (evals) that directly measure the frequency of these concrete issues, making metrics meaningful.

Evals, error analysis, and better prompts: A systematic approach to improving your AI products | Hamel Husain (ML engineer)

How I AI·9 months ago

Build Human-in-the-Loop Systems to Ship Imperfect AI Products Faster

Instead of waiting for AI models to be perfect, design your application from the start to allow for human correction. This pragmatic approach acknowledges AI's inherent uncertainty and allows you to deliver value sooner by leveraging human oversight to handle edge cases.

47: From Math Teacher to AI Founder (with Joe Sessions)

AI Product Leader·8 months ago

Use Humans for Context-Rich Eval Notes, Then Use LLMs to Cluster Those Notes into Themes

Don't ask an LLM to perform initial error analysis; it lacks the product context to spot subtle failures. Instead, have a human expert write detailed, freeform notes ("open codes"). Then, leverage an LLM's strength in synthesis to automatically categorize those hundreds of human-written notes into actionable failure themes ("axial codes").

Why AI evals are the hottest new skill for product builders | Hamel Husain & Shreya Shankar (creators of the #1 eval course)

Lenny's Podcast: Product | Career | Growth·9 months ago

Stop Writing Tests First; Effective AI Evals Begin with Manual Error Analysis of User Logs

The common mistake in building AI evals is jumping straight to writing automated tests. The correct first step is a manual process called "error analysis" or "open coding," where a product expert reviews real user interaction logs to understand what's actually going wrong. This grounds your entire evaluation process in reality.

Why AI evals are the hottest new skill for product builders | Hamel Husain & Shreya Shankar (creators of the #1 eval course)

Lenny's Podcast: Product | Career | Growth·9 months ago

AI Product Teams Must Analyze Raw, Messy User Inputs, Not Just Clean Test Prompts

Developers often test AI systems with well-formed, correctly spelled questions. However, real users submit vague, typo-ridden, and ambiguous prompts. Directly analyzing these raw logs is the most crucial first step to understanding how your product fails in the real world and where to focus quality improvements.

Evals, error analysis, and better prompts: A systematic approach to improving your AI products | Hamel Husain (ML engineer)

How I AI·9 months ago

Most AI Products Only Need 4 to 7 Core Automated Evals

You don't need to create an automated "LLM as a judge" for every potential failure. Many issues discovered during error analysis can be fixed with a simple prompt adjustment. Reserve the effort of building robust, automated evals for the 4-7 most persistent and critical failure modes that prompt changes alone cannot solve.

Why AI evals are the hottest new skill for product builders | Hamel Husain & Shreya Shankar (creators of the #1 eval course)

Lenny's Podcast: Product | Career | Growth·9 months ago

High-Signal Fine-Tuning Data Comes From the Difficult Examples Where Your AI Fails

Fine-tuning an AI model is most effective when you use high-signal data. The best source for this is the set of difficult examples where your system consistently fails. The processes of error analysis and evaluation naturally curate this valuable dataset, making fine-tuning a logical and powerful next step after prompt engineering.

Evals, error analysis, and better prompts: A systematic approach to improving your AI products | Hamel Husain (ML engineer)

How I AI·9 months ago

Prevent Recurring AI Model Errors by Creating Custom 'Rules' After 2-3 Mistakes

When an AI model makes the same undesirable output two or three times, treat it as a signal. Create a custom rule or prompt instruction that explicitly codifies the desired behavior. This trains the AI to avoid that specific mistake in the future, improving consistency over time.

The beginner's guide to coding with Cursor | Lee Robinson (Head of AI education)

How I AI·9 months ago

Build Custom Internal Tools to Make Reviewing AI Product Data Frictionless

Reviewing user interaction data is the highest ROI activity for improving an AI product. Instead of relying solely on third-party observability tools, high-performing teams build simple, custom internal applications. These tools are tailored to their specific data and workflow, removing all friction from the process of looking at and annotating traces.

Why AI evals are the hottest new skill for product builders | Hamel Husain & Shreya Shankar (creators of the #1 eval course)

Lenny's Podcast: Product | Career | Growth·9 months ago

Get your free personalized podcast brief

Related Insights