Don't rely on a simple agreement percentage to validate an LLM judge. If failures are rare (e.g., 10% of cases), a judge that always predicts "pass" will have 90% agreement but be useless. Instead, measure its performance on positive and negative cases separately (e.g., True Positive Rate and True Negative Rate).
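
A minimal sketch of that check in Python, with made-up labels purely to show the arithmetic: a judge that always answers "pass" reaches 90% raw agreement yet catches none of the failures.

```python
# Minimal sketch: why raw agreement is misleading when failures are rare.
# Labels are hypothetical; 1 = pass, 0 = fail.

def rates(judge_preds, human_labels):
    """Return (true positive rate, true negative rate) for binary labels."""
    tp = sum(1 for p, h in zip(judge_preds, human_labels) if p == 1 and h == 1)
    tn = sum(1 for p, h in zip(judge_preds, human_labels) if p == 0 and h == 0)
    pos = sum(human_labels)            # actual passes
    neg = len(human_labels) - pos      # actual failures
    return tp / pos, tn / neg

# 90 real passes, 10 real failures; a judge that always says "pass":
human = [1] * 90 + [0] * 10
always_pass = [1] * 100
tpr, tnr = rates(always_pass, human)
print(f"agreement={sum(p == h for p, h in zip(always_pass, human)) / 100:.0%}")  # 90%
print(f"TPR={tpr:.0%}, TNR={tnr:.0%}")  # TPR=100%, TNR=0% -> useless on failures
```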

Related Insights

Simply creating an LLM judge prompt isn't enough. Before deploying it, you must test its alignment with human judgment. Run the judge on your manually labeled data and analyze the results in a confusion matrix. This helps you see where it disagrees with you (false positives/negatives) so you can refine the prompt and build trust.
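
One possible shape for that audit, assuming scikit-learn is available and that each record in the hand-labeled sample carries `human_label` and `judge_label` fields (both names are hypothetical):

```python
# Sketch of a judge-vs-human audit over a hand-labeled sample.
from sklearn.metrics import confusion_matrix

records = [
    {"output": "…", "human_label": "pass", "judge_label": "pass"},
    {"output": "…", "human_label": "fail", "judge_label": "pass"},  # false positive
    # ... the rest of your hand-labeled sample
]

human = [r["human_label"] for r in records]
judge = [r["judge_label"] for r in records]

# Rows = human verdict, columns = judge verdict.
print(confusion_matrix(human, judge, labels=["pass", "fail"]))

# Inspect the disagreements to decide how to refine the judge prompt.
for r in records:
    if r["human_label"] != r["judge_label"]:
        print(r["human_label"], "->", r["judge_label"], ":", r["output"][:80])
```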

Generic evaluation metrics like "helpfulness" or "conciseness" are vague and untrustworthy. A better approach is to first perform manual error analysis to find recurring problems (e.g., "tour scheduling failures"). Then, build specific, targeted evaluations (evals) that directly measure the frequency of these concrete issues, making metrics meaningful.
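
A sketch of what such a targeted eval might look like; the `has_tour_scheduling_failure` checker and the trace fields are hypothetical stand-ins for whatever your own error analysis surfaces:

```python
# Targeted eval: instead of scoring "helpfulness", count how often one
# concrete failure mode occurs in real traces.

def has_tour_scheduling_failure(trace: dict) -> bool:
    """Hypothetical check: the user asked for a tour but no booking was made."""
    asked = "schedule a tour" in trace["user_message"].lower()
    booked = any(call["name"] == "book_tour" for call in trace["tool_calls"])
    return asked and not booked

def failure_rate(traces: list[dict]) -> float:
    flagged = [t for t in traces if has_tour_scheduling_failure(t)]
    return len(flagged) / len(traces)

# failure_rate(production_traces) yields a concrete, trackable number,
# e.g. "tour scheduling fails in 7% of conversations this week".
```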

When using an LLM to evaluate another AI's output, instruct it to return a binary verdict (e.g., True/False, Pass/Fail) instead of a score on a numeric scale. Binary outputs are easier to align with human preferences and map directly to the binary decisions (e.g., ship or fix) that product teams ultimately make.
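
One way this could look in code, assuming the OpenAI Python client; the model name and prompt wording are placeholders, not a prescribed setup:

```python
# Minimal binary-judge sketch.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading an AI assistant's answer.
Question: {question}
Answer: {answer}
Does the answer fully and correctly address the question?
Reply with exactly one word: PASS or FAIL."""

def judge(question: str, answer: str) -> bool:
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
        temperature=0,
    )
    verdict = response.choices[0].message.content.strip().upper()
    return verdict.startswith("PASS")  # binary outcome maps to ship/fix decisions
```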

Building a functional AI agent is just the starting point. The real work lies in developing a set of evaluations ("evals") to test if the agent consistently behaves as expected. Without quantifying failures and successes against a standard, you're just guessing, not iteratively improving the agent's performance.
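
A toy illustration of such an eval harness; `run_agent`, the test cases, and the pass criterion are all hypothetical:

```python
# Tiny eval harness: run the agent over fixed cases and quantify pass rate.

TEST_CASES = [
    {"input": "Book a table for two at 7pm", "must_contain": "reservation confirmed"},
    {"input": "Cancel my order #1234", "must_contain": "order cancelled"},
]

def evaluate(run_agent) -> float:
    passed = 0
    for case in TEST_CASES:
        output = run_agent(case["input"])
        if case["must_contain"] in output.lower():
            passed += 1
    return passed / len(TEST_CASES)

# Re-run after every prompt or tool change; a moving pass rate replaces guessing.
```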

To grade model outputs on OpenAI's GDPVal benchmark, Artificial Analysis uses Gemini 3 Pro as a judge. For complex, criteria-driven agentic tasks, this LLM-as-judge approach works well and does not exhibit the typical bias of a judge preferring its own outputs, because the judging task is sufficiently different from the execution task.

Do not blindly trust an LLM's evaluation scores. The biggest mistake is showing stakeholders metrics that don't match their perception of product quality. To build trust, first hand-label a sample of data with binary outcomes (good/bad), then compare the LLM judge's scores against these human labels to ensure agreement before deploying the eval.

When creating an "LLM as a judge" to automate evaluations, resist the urge to use a 1-5 rating scale. This creates ambiguity (what does an average score of 3.2 versus 3.7 actually mean?). Instead, force the judge to make a binary "pass" or "fail" decision. It's a more painful but ultimately more tractable and actionable way to measure quality.

Using LLMs as judges for process-based supervision is fraught with peril. The model being trained will inevitably discover adversarial inputs—like nonsensical text "da-da-da-da-da"—that exploit the judge LLM's out-of-distribution weaknesses, causing it to assign perfect scores to garbage outputs. This makes the training process unstable.

Standardized AI benchmarks are saturating and becoming less relevant to real-world use cases. The true measure of a model's improvement is now found in the custom, internal evaluations (evals) created by application-layer companies. A gain on a legal AI company's internal evals, for example, is a more meaningful indicator of progress than a generic benchmark score.

Using an LLM to grade another's output is more reliable when the evaluation process is fundamentally different from the task itself. For agentic tasks, the performer uses tools like code interpreters, while the grader analyzes static outputs against criteria, reducing self-preference bias.