Don't treat your test dataset as static. Monitor online eval scores in production. When you see poor performance, filter for those failing examples and add them to your offline dataset. This ensures your testing evolves with real-world usage patterns.
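One way to picture this loop is the minimal sketch below. It assumes each production interaction is logged with an online eval score between 0 and 1; the `ProductionRecord` schema, the `harvest_failures` helper, and the 0.5 threshold are hypothetical names and values chosen for illustration, not part of any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProductionRecord:
    """One logged production interaction (hypothetical schema)."""
    input_text: str
    output_text: str
    eval_score: float  # assumed to lie in [0, 1], higher is better


def harvest_failures(records, offline_dataset, threshold=0.5):
    """Append low-scoring production inputs to the offline dataset.

    Inputs already present in the dataset are skipped so repeated
    failures on the same input do not inflate the test set.
    Returns the number of examples added.
    """
    seen = {example["input"] for example in offline_dataset}
    added = 0
    for record in records:
        if record.eval_score < threshold and record.input_text not in seen:
            offline_dataset.append({
                "input": record.input_text,
                "bad_output": record.output_text,
                "source": "production",
            })
            seen.add(record.input_text)
            added += 1
    return added
```

Running this on a schedule (for example, once per day over the previous day's logs) would keep the offline dataset growing with the failures real users actually trigger. The right threshold depends on how the online scorer is calibrated.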

Related Insights

Generic evaluation metrics like "helpfulness" or "conciseness" are vague and untrustworthy. A better approach is to first perform manual error analysis to find recurring problems (e.g., "tour scheduling failures"). Then, build specific, targeted evaluations (evals) that directly measure the frequency of these concrete issues, making metrics meaningful.
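As a rough illustration of measuring the frequency of concrete issues, the sketch below assumes that manual error analysis has already tagged each trace with zero or more failure-mode labels. The function name `failure_mode_rates` and the label strings are hypothetical.

```python
from collections import Counter


def failure_mode_rates(labeled_traces):
    """Return the fraction of traces that exhibit each failure mode.

    labeled_traces: one collection of failure-mode labels per trace;
    an empty collection means the trace had no observed failure.
    """
    total = len(labeled_traces)
    if total == 0:
        return {}
    # set(labels) ensures a label counts at most once per trace.
    counts = Counter(label for labels in labeled_traces for label in set(labels))
    return {label: count / total for label, count in counts.items()}
```

A call such as `failure_mode_rates([{"tour_scheduling_failure"}, set()])` reports that the scheduling failure appears in half of the reviewed traces, which is a more actionable number than a generic "helpfulness" score.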

Teams often mistakenly debate between using offline evals or online production monitoring. This is a false choice. Evals are crucial for testing against known failure modes before deployment. Production monitoring is essential for discovering new, unexpected failure patterns from real user interactions. Both are required for a robust feedback loop.

The core of an effective AI data flywheel is a process that captures human corrections not as simple fixes, but as perfectly formatted training examples. This structured data, containing the original input, the AI's error, and the human's ground truth, becomes a portable, fine-tuning-ready asset that directly improves the next model iteration.
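A minimal sketch of such a record follows, assuming one JSON object per line (JSONL) as the storage format. The `Correction` class and its field names are illustrative; the schema a given fine-tuning service expects will differ and would need a mapping step.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Correction:
    """One captured human correction (hypothetical schema)."""
    original_input: str  # what the user or upstream system sent
    model_output: str    # the AI's erroneous output
    ground_truth: str    # the human's corrected answer


def to_jsonl(corrections):
    """Serialize corrections as JSONL, one record per line."""
    return "\n".join(
        json.dumps(asdict(c), ensure_ascii=False) for c in corrections
    )
```

Keeping the erroneous output alongside the ground truth preserves the option of using the same records for supervised fine-tuning, preference-style training, or regression tests.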

The critical challenge in AI development isn't just improving a model's raw accuracy but building a system that reliably learns from its mistakes. The gap between an 85% accurate prototype and a 99% production-ready system is bridged by an infrastructure that systematically captures and recycles errors into high-quality training data.

If all your evals pass, you don't know the current limits of your system. Evals that consistently fail act as a benchmark of what the system cannot yet do. When a new foundation model is released, rerunning these tests immediately reveals whether it has overcome previous limitations.
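The rerun step could look something like the sketch below. It assumes the consistently failing cases are stored with an `id`, an `input`, and an `expected` value, that a model can be wrapped as a plain callable, and that `check` is whatever pass/fail comparison the eval uses. All of these names are hypothetical.

```python
def newly_passing(failing_cases, model, check):
    """Return the ids of previously failing cases that `model` now passes.

    failing_cases: dicts with "id", "input", and "expected" keys.
    model: callable mapping an input to an output.
    check: callable (output, expected) -> bool.
    """
    return [
        case["id"]
        for case in failing_cases
        if check(model(case["input"]), case["expected"])
    ]
```

An empty result means the new model has not moved the known limits; a non-empty one points to the specific capabilities worth re-examining.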

Developers often test AI systems with well-formed, correctly spelled questions. However, real users submit vague, typo-ridden, and ambiguous prompts. Reading the raw production logs of these prompts is the most crucial first step to understanding how your product fails in the real world and where to focus quality improvements.

Fine-tuning an AI model is most effective when you use high-signal data. The best source for this is the set of difficult examples where your system consistently fails. The processes of error analysis and evaluation naturally curate this valuable dataset, making fine-tuning a logical and powerful next step after prompt engineering.

Despite mature backtesting frameworks, Intercom repeatedly sees promising offline results fail in production. The "messiness of real human interaction" is unpredictable, making at-scale A/B tests essential for validating AI performance improvements, even for changes as small as a tenth of a percentage point.

Effective teams discuss production examples and eval scores in daily stand-ups. This ritual helps them identify novel failure patterns from real usage, add them to test datasets, and then prioritize daily work to improve performance on those specific issues.

Evals are useful for catching regressions, much like unit tests, but directly optimizing for an eval benchmark is misleading. Evals are, by definition, a lagging proxy for the real-world user experience. Over-optimizing for a metric can lead to gaming it and degrading the actual product.