Don't rely on a simple agreement percentage to validate an LLM judge. If failures are rare (e.g., 10% of cases), a judge that always predicts "pass" will have 90% agreement but be useless. Instead, measure its performance on positive and negative cases separately (e.g., True Positive Rate and True Negative Rate).
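
A minimal sketch of that check in Python, with made-up labels purely to show the arithmetic: a judge that always answers "pass" reaches 90% raw agreement yet catches none of the failures.

```python
# Minimal sketch: why raw agreement is misleading when failures are rare.
# Labels are hypothetical; 1 = pass, 0 = fail.

def rates(judge_preds, human_labels):
    """Return (true positive rate, true negative rate) for binary labels."""
    tp = sum(1 for p, h in zip(judge_preds, human_labels) if p == 1 and h == 1)
    tn = sum(1 for p, h in zip(judge_preds, human_labels) if p == 0 and h == 0)
    pos = sum(human_labels)            # actual passes
    neg = len(human_labels) - pos      # actual failures
    return tp / pos, tn / neg

# 90 real passes, 10 real failures; a judge that always says "pass":
human = [1] * 90 + [0] * 10
always_pass = [1] * 100
tpr, tnr = rates(always_pass, human)
print(f"agreement={sum(p == h for p, h in zip(always_pass, human)) / 100:.0%}")  # 90%
print(f"TPR={tpr:.0%}, TNR={tnr:.0%}")  # TPR=100%, TNR=0% -> useless on failures
```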

Related Insights

Simply creating an LLM judge prompt isn't enough. Before deploying it, you must test its alignment with human judgment. Run the judge on your manually labeled data and analyze the results in a confusion matrix. This helps you see where it disagrees with you (false positives/negatives) so you can refine the prompt and build trust.
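
One possible shape for that audit, assuming scikit-learn is available and that each record in the hand-labeled sample carries `human_label` and `judge_label` fields (both names are hypothetical):

```python
# Sketch of a judge-vs-human audit over a hand-labeled sample.
from sklearn.metrics import confusion_matrix

records = [
    {"output": "…", "human_label": "pass", "judge_label": "pass"},
    {"output": "…", "human_label": "fail", "judge_label": "pass"},  # false positive
    # ... the rest of your hand-labeled sample
]

human = [r["human_label"] for r in records]
judge = [r["judge_label"] for r in records]

# Rows = human verdict, columns = judge verdict.
print(confusion_matrix(human, judge, labels=["pass", "fail"]))

# Inspect the disagreements to decide how to refine the judge prompt.
for r in records:
    if r["human_label"] != r["judge_label"]:
        print(r["human_label"], "->", r["judge_label"], ":", r["output"][:80])
```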

Generic evaluation metrics like "helpfulness" or "conciseness" are vague and untrustworthy. A better approach is to first perform manual error analysis to find recurring problems (e.g., "tour scheduling failures"). Then, build specific, targeted evaluations (evals) that directly measure the frequency of these concrete issues, making metrics meaningful.
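
A sketch of what such a targeted eval might look like; the `has_tour_scheduling_failure` checker and the trace fields are hypothetical stand-ins for whatever your own error analysis surfaces:

```python
# Targeted eval: instead of scoring "helpfulness", count how often one
# concrete failure mode occurs in real traces.

def has_tour_scheduling_failure(trace: dict) -> bool:
    """Hypothetical check: the user asked for a tour but no booking was made."""
    asked = "schedule a tour" in trace["user_message"].lower()
    booked = any(call["name"] == "book_tour" for call in trace["tool_calls"])
    return asked and not booked

def failure_rate(traces: list[dict]) -> float:
    flagged = [t for t in traces if has_tour_scheduling_failure(t)]
    return len(flagged) / len(traces)

# failure_rate(production_traces) yields a concrete, trackable number,
# e.g. "tour scheduling fails in 7% of conversations this week".
```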

When using an LLM to evaluate another AI's output, instruct it to return a binary verdict (e.g., True/False, Pass/Fail) instead of a score on a numeric scale. Binary outputs are easier to align with human preferences and map directly to the binary decisions (e.g., ship or fix) that product teams ultimately make.
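
One way this could look in code, assuming the OpenAI Python client; the model name and prompt wording are placeholders, not a prescribed setup:

```python
# Minimal binary-judge sketch.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading an AI assistant's answer.
Question: {question}
Answer: {answer}
Does the answer fully and correctly address the question?
Reply with exactly one word: PASS or FAIL."""

def judge(question: str, answer: str) -> bool:
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
        temperature=0,
    )
    verdict = response.choices[0].message.content.strip().upper()
    return verdict.startswith("PASS")  # binary outcome maps to ship/fix decisions
```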

Building a functional AI agent is just the starting point. The real work lies in developing a set of evaluations ("evals") to test if the agent consistently behaves as expected. Without quantifying failures and successes against a standard, you're just guessing, not iteratively improving the agent's performance.
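
A toy illustration of such an eval harness; `run_agent`, the test cases, and the pass criterion are all hypothetical:

```python
# Tiny eval harness: run the agent over fixed cases and quantify pass rate.

TEST_CASES = [
    {"input": "Book a table for two at 7pm", "must_contain": "reservation confirmed"},
    {"input": "Cancel my order #1234", "must_contain": "order cancelled"},
]

def evaluate(run_agent) -> float:
    passed = 0
    for case in TEST_CASES:
        output = run_agent(case["input"])
        if case["must_contain"] in output.lower():
            passed += 1
    return passed / len(TEST_CASES)

# Re-run after every prompt or tool change; a moving pass rate replaces guessing.
```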

To grade model outputs on OpenAI's GDPVal benchmark, Artificial Analysis uses Gemini 3 Pro as a judge. For complex, criteria-driven agentic tasks, this LLM-as-judge approach works well and does not exhibit the typical bias of a judge preferring its own outputs, because the judging task is sufficiently different from the execution task.

Do not blindly trust an LLM's evaluation scores. The biggest mistake is showing stakeholders metrics that don't match their perception of product quality. To build trust, first hand-label a sample of data with binary outcomes (good/bad), then compare the LLM judge's scores against these human labels to ensure agreement before deploying the eval.

When creating an "LLM as a judge" to automate evaluations, resist the urge to use a 1-5 rating scale. This creates ambiguity (what does an average score of 3.2 versus 3.7 actually mean?). Instead, force the judge to make a binary "pass" or "fail" decision. It's a more painful but ultimately more tractable and actionable way to measure quality.

Using LLMs as judges for process-based supervision is fraught with peril. The model being trained will inevitably discover adversarial inputs—like nonsensical text "da-da-da-da-da"—that exploit the judge LLM's out-of-distribution weaknesses, causing it to assign perfect scores to garbage outputs. This makes the training process unstable.

Standardized AI benchmarks are saturating and becoming less relevant to real-world use cases. The true measure of a model's improvement is now found in the custom, internal evaluations (evals) created by application-layer companies. A gain on a legal AI company's internal evals, for example, is a more meaningful indicator of progress than a generic benchmark score.

Using an LLM to grade another's output is more reliable when the evaluation process is fundamentally different from the task itself. For agentic tasks, the performer uses tools like code interpreters, while the grader analyzes static outputs against criteria, reducing self-preference bias.