Fix Failing AI Agents By Improving Evals, Not Prompting

Related Insights

OpenAI Designs 'Job Interview Evals' to Test Complex Agent Capabilities

Standard benchmarks fall short for multi-turn AI agents. A new approach is the 'job interview eval,' where an agent is given an underspecified problem. It is then graded not just on the solution, but on its ability to ask clarifying questions and handle changing requirements, mimicking how a human developer is evaluated.

⚡️GPT5-Codex-Max: Training Agents with Personality, Tools & Trust — Brian Fioca + Bill Chen, OpenAI

Latent Space: The AI Engineer Podcast·7 months ago
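
One way to picture the grading idea in this insight is a small rubric that gives credit for the conversation as well as the final answer. The sketch below is not OpenAI's grader; the `Transcript` fields, the `WEIGHTS` values, and `interview_score` are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    """Hand-labelled summary of one multi-turn run (hypothetical schema)."""
    clarifying_questions: int  # questions asked before committing to a solution
    adapted_to_change: bool    # revised its work after a requirement changed
    solution_passed: bool      # final solution passed its checks


# Assumed weights: half the credit for the solution, half for the process.
WEIGHTS = {"solution": 0.5, "clarification": 0.25, "adaptation": 0.25}


def interview_score(t: Transcript) -> float:
    """Combine solution quality and process behaviour into a score in [0, 1]."""
    clarification = 1.0 if t.clarifying_questions > 0 else 0.0
    return (
        WEIGHTS["solution"] * float(t.solution_passed)
        + WEIGHTS["clarification"] * clarification
        + WEIGHTS["adaptation"] * float(t.adapted_to_change)
    )
```

Under these assumed weights, an agent that solves the task without asking anything or adapting to a change earns only half the available credit.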

Create Specific AI Evals Based on Top Error Categories, Not Generic Metrics like "Helpfulness"

Generic evaluation metrics like "helpfulness" or "conciseness" are vague and untrustworthy. A better approach is to first perform manual error analysis to find recurring problems (e.g., "tour scheduling failures"). Then, build specific, targeted evaluations (evals) that directly measure the frequency of these concrete issues, making metrics meaningful.

Evals, error analysis, and better prompts: A systematic approach to improving your AI products | Hamel Husain (ML engineer)

How I AI·10 months ago
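
The error-analysis step described above can be sketched as a simple tally over hand-labelled traces. This is a minimal illustration with made-up data: the `annotations` list and the `wrong_handoff` category are hypothetical, and only "tour scheduling failures" comes from the summary.

```python
from collections import Counter

# Hypothetical output of a manual review: one label per trace, None if it looked fine.
annotations = [
    "tour_scheduling_failure",
    None,
    "tour_scheduling_failure",
    "wrong_handoff",
    None,
    "tour_scheduling_failure",
    "wrong_handoff",
    None,
]


def error_rates(labels):
    """Return {category: share of all reviewed traces}, most frequent first."""
    total = len(labels)
    counts = Counter(label for label in labels if label is not None)
    return {category: count / total for category, count in counts.most_common()}


rates = error_rates(annotations)
top_category = next(iter(rates))  # the category to build a targeted eval for first
```

The most frequent category becomes the first candidate for a dedicated eval, and its rate is the number to track over time in place of a generic "helpfulness" score.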

Evaluate Each Step in an Agentic Workflow, Not Just the Final Output

Treating AI evaluation like a final exam is a mistake. For critical enterprise systems, evaluations should be embedded at every step of an agent's workflow (e.g., after planning, before action). This is akin to unit testing in classic software development and is essential for building trustworthy, production-ready agents.

AI Agents for PMs in 69 Minutes — Masterclass with IBM VP

Product Growth Podcast·a year ago
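
The unit-testing analogy can be made concrete with checks that run after planning and before each action. The sketch below assumes a toy workflow: the allowed tool names, the required `query` argument, and every function name are hypothetical, not taken from the episode.

```python
from dataclasses import dataclass


@dataclass
class StepResult:
    step: str
    passed: bool
    detail: str = ""


def check_plan(plan):
    """After planning: the plan is non-empty and only names allowed tools."""
    allowed = {"search", "summarize"}  # assumed tool whitelist
    unknown = [tool for tool in plan if tool not in allowed]
    ok = bool(plan) and not unknown
    return StepResult("plan", ok, f"unknown tools: {unknown}" if unknown else "")


def check_action(tool, args):
    """Before acting: the required 'query' argument is present and non-empty."""
    ok = bool(args.get("query"))
    return StepResult(f"action:{tool}", ok, "" if ok else "missing 'query'")


def run_with_step_evals(plan, actions):
    """Evaluate the plan, then each action in order, stopping at the first failure."""
    results = [check_plan(plan)]
    if not results[0].passed:
        return results  # a bad plan never reaches the action stage
    for tool, args in actions:
        result = check_action(tool, args)
        results.append(result)
        if not result.passed:
            break
    return results
```

Because each check returns a named result, a failure points at the step that broke instead of only showing up in the final output.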

Force AI Agents to Self-Critique and Improve Their Own System Prompts

Instead of manually refining a complex prompt, create a process where an AI agent evaluates its own output. By providing a framework for self-critique, including quantitative scores and qualitative reasoning, the AI can iteratively enhance its own system instructions and achieve a much stronger result.

How to Build Multi-Agent AI Systems That Actually Work in Production | Tyler Fisk

Product Growth Podcast·10 months ago

Building AI Agents is Only 50% of the Work; The Other 50% is Creating Robust Evaluations

Building a functional AI agent is just the starting point. The real work lies in developing a set of evaluations ("evals") to test if the agent consistently behaves as expected. Without quantifying failures and successes against a standard, you're just guessing, not iteratively improving the agent's performance.

I Used ChatGPT & n8n to Stop Customers from Leaving | Tina Huang

Marketing Against The Grain·7 months ago
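
Quantifying successes and failures against a standard can start as a list of cases with expected outcomes and a pass rate. In this sketch `toy_agent` is a rule-based stand-in for a real agent call, and the cases and labels are invented for illustration.

```python
def toy_agent(message):
    """Stand-in for a real agent (hypothetical): routes cancellation messages."""
    if "cancel" in message.lower():
        return "retention_offer"
    return "general_reply"


# Assumed eval set: each case pairs an input with the behaviour we expect.
EVAL_CASES = [
    {"input": "I want to cancel my plan", "expected": "retention_offer"},
    {"input": "Please CANCEL today", "expected": "retention_offer"},
    {"input": "How do I export data?", "expected": "general_reply"},
    {"input": "I'm thinking of leaving", "expected": "retention_offer"},
]


def run_evals(agent, cases):
    """Return (pass rate, failing cases) for an agent over a non-empty eval set."""
    failures = [case for case in cases if agent(case["input"]) != case["expected"]]
    pass_rate = 1 - len(failures) / len(cases)
    return pass_rate, failures


pass_rate, failures = run_evals(toy_agent, EVAL_CASES)
```

Here the failing case is the useful part of the result: it shows the stand-in agent misses churn signals that never use the word "cancel", which is a concrete thing to fix and re-measure.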

A Healthy Evaluation System Should Intentionally Surface Errors to Drive Progress

Don't aim for an eval suite that your agent passes 100% of the time. A good system surfaces a 'healthy percentage' of incorrect outputs. Getting excited when an eval fails is key, because each failure is a clear, actionable opportunity to improve your AI agent.

How to Run Evals in Claude Code with Aparna Dhinakaran, Founder and CPO of Arize

The Growth Podcast·2 months ago

When an AI Fails, Treat It Like a Direct Report and Ask 'Where Did We Go Wrong?'

When a large language model provides a poor response, a highly effective technique is to treat it like a new employee. Instead of just re-prompting, ask it to explain its reasoning ("Why is that?") to understand the error, then provide clear, corrective feedback.

Shopping with Claude: How to find quality brands, automate returns, and buy things that last 100 years | Nicole Ruiz

How I AI·2 months ago

Evaluating AI Models Requires 'Driving' Them, Not One-Shot Prompts

Comparing AI models based on single, identical prompts is a flawed methodology. A true evaluation involves 'driving' the model through multiple iterations of feedback and correction. This reveals its ability to understand and adapt to your specific intent, which is a far more critical measure of its utility than a single probabilistic output.

Tommy Geoco - The state of the design industry right now

Dive Club 🤿·2 months ago

AI Agents Can Self-Debug by Explaining Their Own Failures

A powerful evaluation technique is to ask an AI agent to analyze its own poor output. The agent can review its context and process, explain why it made a mistake, and even suggest how to update its own instructions to prevent future errors.

From Game Dev to Google: Agentic AI, Zero to One, and the Future of Product Management

Product Talk·3 months ago

When AI-Generated Code Fails, Improve the Agent Pipeline, Not Just the Faulty Code

When an AI-coded feature is flawed, the instinct is to patch the specific output. A more effective, long-term approach is to analyze *why* your agent system produced a bad result and improve the underlying agent, skill, or process that failed.

Claude Code for Non-Technical PMs, with Andre Albuquerque

The Growth Podcast·2 months ago
