Despite mature backtesting frameworks, Intercom repeatedly sees promising offline results fail in production. The "messiness of real human interaction" is unpredictable, making at-scale A/B tests essential for validating AI performance improvements, even for changes as small as a tenth of a percentage point.
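
As a rough illustration of why offline evals cannot settle differences that small, a standard two-proportion power calculation shows the traffic an A/B test needs to detect a 0.1 percentage-point lift. This is a generic sketch, not Intercom's methodology; the 45% baseline resolution rate is an assumed figure.

```python
from math import ceil
from statistics import NormalDist

def samples_per_arm(p_base: float, lift: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate per-arm sample size for a two-sided, two-proportion z-test."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # significance threshold
    z_beta = z.inv_cdf(power)            # desired statistical power
    p1, p2 = p_base, p_base + lift
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
                 + z_beta * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return ceil(numerator / lift ** 2)

# Assumed 45% baseline resolution rate, testing for a 0.1 pp (0.001) lift.
print(samples_per_arm(0.45, 0.001))  # roughly 3-4 million conversations per arm
```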

Related Insights

Don't treat evals as a mere checklist. Instead, use them as a creative tool to discover opportunities. A well-designed eval can reveal that a product is underperforming for a specific user segment, pointing directly to areas for high-impact improvement that a simple "vibe check" would miss.
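
One way to make an eval surface those opportunities is to break results out by user segment instead of reporting a single aggregate score. A minimal sketch; the segments and pass/fail records are invented for illustration.

```python
from collections import defaultdict

# Hypothetical eval results: (user_segment, passed) pairs from one eval run.
results = [
    ("enterprise", True), ("enterprise", True), ("enterprise", False),
    ("self_serve", True), ("self_serve", False), ("self_serve", False),
    ("non_english", False), ("non_english", False), ("non_english", True),
]

by_segment = defaultdict(lambda: [0, 0])  # segment -> [passed, total]
for segment, passed in results:
    by_segment[segment][0] += int(passed)
    by_segment[segment][1] += 1

# Worst-performing segments first: these are the high-impact opportunities
# a single aggregate score (or a vibe check) would hide.
for segment, (passed, total) in sorted(by_segment.items(), key=lambda kv: kv[1][0] / kv[1][1]):
    print(f"{segment:12s} pass rate: {passed / total:.0%} ({passed}/{total})")
```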

To ensure AI reliability, Salesforce builds environments that mimic enterprise CRM workflows, not game worlds. They use synthetic data and introduce corner cases like background noise, accents, or conflicting user requests to find and fix agent failure points before deployment, closing the "reality gap."
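
A hedged sketch of the general idea of perturbing synthetic scenarios with corner cases before an agent sees production traffic. The scenario fields and perturbations are illustrative, not Salesforce's actual harness.

```python
import random

# Hypothetical synthetic CRM scenarios.
BASE_SCENARIOS = [
    {"task": "update_opportunity_stage", "utterance": "Move the Acme deal to Negotiation."},
    {"task": "log_call", "utterance": "Log my call with Jane Doe about renewal pricing."},
]

def add_corner_case(scenario: dict) -> dict:
    """Return a copy of the scenario with one stress factor applied."""
    perturbations = [
        lambda s: {**s, "audio_condition": "background_noise"},
        lambda s: {**s, "audio_condition": "heavy_accent"},
        # A conflicting instruction appended to the original request.
        lambda s: {**s, "utterance": s["utterance"] + " Actually, cancel that and close it as lost."},
    ]
    return random.choice(perturbations)(scenario)

test_suite = [add_corner_case(s) for s in BASE_SCENARIOS for _ in range(3)]
for case in test_suite:
    print(case)  # run each case through the agent and record where it fails
```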

The vast majority of Intercom Fin's resolution rate increase came from optimizing retrieval, re-ranking, and prompting. GPT-4 was already intelligent enough for the task; the real gains were unlocked by improving the surrounding architecture, not waiting for better foundation models.
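
A minimal sketch of the retrieve, re-rank, then prompt shape of such a pipeline; the component functions are stand-ins passed in as parameters, since Intercom's actual implementation is not described here.

```python
from typing import Callable

def answer(question: str,
           retrieve: Callable[[str, int], list[str]],
           rerank: Callable[[str, list[str]], list[str]],
           llm: Callable[[str], str]) -> str:
    # 1. Cast a wide net: pull many candidate passages from the knowledge base.
    candidates = retrieve(question, 50)
    # 2. Re-rank with a stronger (slower) relevance model and keep the best few.
    top_passages = rerank(question, candidates)[:5]
    # 3. Prompt the model with only the curated context.
    prompt = (
        "Answer the customer using ONLY the context below. "
        "If the context is insufficient, say you don't know.\n\n"
        "Context:\n" + "\n---\n".join(top_passages) +
        f"\n\nCustomer question: {question}\nAnswer:"
    )
    return llm(prompt)

# Most of the gains described above come from tuning steps 1-3,
# not from swapping the model behind `llm`.
```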

AI agents can continuously experiment with variables like subject lines, send times, and offers for each individual user. This level of granular, ongoing A/B testing is impossible to manage manually, unlocking significant performance lifts that compound over time.
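
Per-user continuous experimentation of this kind is essentially a bandit loop. A small Thompson-sampling sketch for picking a subject line on each send; the variant names and reward signal are illustrative.

```python
import random
from collections import defaultdict

VARIANTS = ["short_subject", "question_subject", "discount_subject"]

# Per-user Beta posteriors over open probability: user -> variant -> [alpha, beta].
stats = defaultdict(lambda: {v: [1, 1] for v in VARIANTS})

def pick_variant(user_id: str) -> str:
    """Thompson sampling: draw from each variant's posterior, send the best draw."""
    draws = {v: random.betavariate(a, b) for v, (a, b) in stats[user_id].items()}
    return max(draws, key=draws.get)

def record_outcome(user_id: str, variant: str, opened: bool) -> None:
    a, b = stats[user_id][variant]
    stats[user_id][variant] = [a + int(opened), b + int(not opened)]

# One simulated send/feedback cycle for a single user; in production this
# loop runs continuously, so the per-user experiment never stops.
v = pick_variant("user_42")
record_outcome("user_42", v, opened=random.random() < 0.3)
```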

To ensure product quality, Fixer pitted its AI against 10 of its own human executive assistants on the same tasks. They refused to launch features until the AI could consistently outperform the humans on accuracy, using their service business as a direct training and validation engine.
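
A sketch of the basic launch gate implied here: run the AI and the human assistants on the same task set and only ship when the AI's accuracy clears every human's. The task data and the outperformance margin are assumptions.

```python
def accuracy(answers: dict[str, str], gold: dict[str, str]) -> float:
    correct = sum(answers[task_id] == expected for task_id, expected in gold.items())
    return correct / len(gold)

def launch_gate(ai_answers, human_answer_sets, gold, margin: float = 0.0) -> bool:
    """Ship only if the AI beats every human assistant's accuracy by `margin`."""
    ai_acc = accuracy(ai_answers, gold)
    return all(ai_acc > accuracy(h, gold) + margin for h in human_answer_sets)

# Hypothetical data: 3 shared tasks, the AI vs. two of the ten assistants shown.
gold = {"t1": "A", "t2": "B", "t3": "C"}
ai = {"t1": "A", "t2": "B", "t3": "C"}
humans = [{"t1": "A", "t2": "B", "t3": "D"}, {"t1": "A", "t2": "X", "t3": "C"}]
print(launch_gate(ai, humans, gold))  # True -> the feature is allowed to launch
```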

The common mistake in building AI evals is jumping straight to writing automated tests. The correct first step is a manual process called "error analysis" or "open coding," where a product expert reviews real user interaction logs to understand what's actually going wrong. This grounds your entire evaluation process in reality.
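
In practice the first pass can be as simple as sampling real traces and giving a product expert an empty notes column to fill in by hand, before any schema or automated judge exists. A minimal sketch; the log file and its columns are assumed.

```python
import csv
import random

def sample_for_open_coding(log_path: str, n: int = 50, out_path: str = "open_coding.csv") -> None:
    """Draw a random sample of traces and write a sheet with a blank 'notes'
    column for a product expert to open-code by hand."""
    with open(log_path, encoding="utf-8") as f:
        traces = list(csv.DictReader(f))  # assumed columns: trace_id, user_message, ai_response
    sample = random.sample(traces, min(n, len(traces)))
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["trace_id", "user_message", "ai_response", "notes"])
        writer.writeheader()
        for t in sample:
            row = {k: t.get(k, "") for k in ("trace_id", "user_message", "ai_response")}
            writer.writerow({**row, "notes": ""})  # the expert's free-text error notes go here

# sample_for_open_coding("interaction_logs.csv")  # hypothetical log file name
```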

Do not blindly trust an LLM's evaluation scores. The biggest mistake is showing stakeholders metrics that don't match their perception of product quality. To build trust, first hand-label a sample of data with binary outcomes (good/bad), then compare the LLM judge's scores against these human labels to ensure agreement before deploying the eval.
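
Before deploying an LLM judge, you can measure how often it agrees with the hand labels on that sample. A small sketch using raw agreement and Cohen's kappa; the labels are illustrative.

```python
def agreement_and_kappa(human: list[bool], judge: list[bool]) -> tuple[float, float]:
    """Raw agreement and Cohen's kappa between binary human labels and judge verdicts."""
    n = len(human)
    agree = sum(h == j for h, j in zip(human, judge)) / n
    p_h = sum(human) / n          # P(human labels "good")
    p_j = sum(judge) / n          # P(judge labels "good")
    p_chance = p_h * p_j + (1 - p_h) * (1 - p_j)
    kappa = (agree - p_chance) / (1 - p_chance) if p_chance < 1 else 1.0
    return agree, kappa

# Hypothetical hand-labelled sample vs. the LLM judge's verdicts on the same traces.
human_labels = [True, True, False, True, False, False, True, True]
judge_labels = [True, True, False, False, False, True, True, True]
print(agreement_and_kappa(human_labels, judge_labels))
# Deploy the judge only once agreement and kappa are high enough that its scores
# will match stakeholders' perception of quality.
```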

Developers often test AI systems with well-formed, correctly spelled questions. However, real users submit vague, typo-ridden, and ambiguous prompts. Directly analyzing these raw logs is the most crucial first step to understanding how your product fails in the real world and where to focus quality improvements.
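
One lightweight way to see that gap is to compare simple text statistics of your test prompts against raw production queries. The profiles and heuristics below are crude illustrations, not a recommended metric set.

```python
def query_stats(queries: list[str]) -> dict:
    n = len(queries)
    return {
        "avg_words": round(sum(len(q.split()) for q in queries) / n, 1),
        "pct_with_question_mark": sum("?" in q for q in queries) / n,
        "pct_all_lowercase": sum(q == q.lower() for q in queries) / n,
    }

# Hypothetical developer test prompts vs. raw production queries.
dev_prompts = ["How do I update my billing address?", "Can I export my data to CSV?"]
production = ["billing addres change", "export??", "it dont let me download my stuff"]

print("dev: ", query_stats(dev_prompts))
print("prod:", query_stats(production))
# The gap between the two profiles is a first, concrete measure of how unlike real
# usage your test questions are; the raw production queries themselves are what
# should be reviewed next.
```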

Open and click rates are ineffective for measuring AI-driven, two-way conversations. Instead, leaders should adopt new KPIs: outcome metrics (e.g., meetings booked), conversational quality (tracking an agent's 'I don't know' rate to measure trust), and, ultimately, customer lifetime value.
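
A sketch of computing two of those KPIs from conversation records: meetings booked per conversation and the agent's "I don't know" rate per turn. The record fields and the string match are assumptions.

```python
# Hypothetical conversation records.
conversations = [
    {"id": 1, "meeting_booked": True,  "agent_turns": ["Here are two open slots.", "Booked for Tuesday."]},
    {"id": 2, "meeting_booked": False, "agent_turns": ["I don't know the answer to that, let me check."]},
    {"id": 3, "meeting_booked": False, "agent_turns": ["Our Pro plan includes that feature."]},
]

def outcome_rate(convs) -> float:
    """Outcome KPI: share of conversations that ended in a booked meeting."""
    return sum(c["meeting_booked"] for c in convs) / len(convs)

def i_dont_know_rate(convs) -> float:
    """Conversational-quality KPI: share of agent turns that admit uncertainty."""
    turns = [t for c in convs for t in c["agent_turns"]]
    return sum("i don't know" in t.lower() for t in turns) / len(turns)

print(f"meetings booked per conversation: {outcome_rate(conversations):.0%}")
print(f"'I don't know' rate per agent turn: {i_dont_know_rate(conversations):.0%}")
```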

To set realistic success metrics for new AI tools, Descript used its most popular pre-AI feature, "remove filler words," as the baseline. They compared adoption and retention of new AI features against this known winner, providing a clear, internal benchmark for what "good" looks like instead of guessing at targets.
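
A sketch of that benchmarking move: express each new AI feature's adoption and retention as a ratio of the known-winner baseline feature's numbers. The figures are invented, not Descript's data.

```python
# Hypothetical feature metrics: 30-day adoption and week-4 retention.
baseline = {"name": "remove_filler_words", "adoption": 0.42, "retention_w4": 0.61}
new_features = [
    {"name": "ai_feature_a", "adoption": 0.18, "retention_w4": 0.55},
    {"name": "ai_feature_b", "adoption": 0.35, "retention_w4": 0.64},
]

for f in new_features:
    adoption_ratio = f["adoption"] / baseline["adoption"]
    retention_ratio = f["retention_w4"] / baseline["retention_w4"]
    print(f"{f['name']}: adoption at {adoption_ratio:.0%} of baseline, "
          f"retention at {retention_ratio:.0%} of baseline")
# "Good" is defined relative to the best pre-AI feature rather than a guessed target.
```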
