To maintain a quality bar across diverse products, use a simple framework. "Bad" errors are critical and irrecoverable (e.g., a crash), while "Sad" errors are recoverable annoyances (e.g., UI flicker). Each team defines what constitutes Bad vs. Sad for their area, enabling a high-level, comparable view of product health.
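As a rough illustration of how per-team definitions can still roll up into one comparable view, here is a minimal Python sketch. The team names, error types, and the `TEAM_RULES` mapping are hypothetical and only stand in for whatever each team decides counts as Bad or Sad.

```python
from collections import Counter
from enum import Enum


class Severity(Enum):
    BAD = "bad"  # critical and irrecoverable, e.g. a crash
    SAD = "sad"  # recoverable annoyance, e.g. UI flicker


# Hypothetical: each team owns its own mapping from error type to severity.
TEAM_RULES = {
    "editor": {"crash": Severity.BAD, "ui_flicker": Severity.SAD},
    "sync": {"data_loss": Severity.BAD, "slow_retry": Severity.SAD},
}


def summarize(events):
    """Count Bad and Sad errors per team from (team, error_type) pairs."""
    summary = {team: Counter() for team in TEAM_RULES}
    for team, error_type in events:
        severity = TEAM_RULES[team][error_type]
        summary[team][severity.value] += 1
    return summary


# Made-up sample events for illustration only.
events = [
    ("editor", "crash"),
    ("editor", "ui_flicker"),
    ("editor", "ui_flicker"),
    ("sync", "slow_retry"),
]
report = summarize(events)
for team, counts in report.items():
    print(team, "bad:", counts["bad"], "sad:", counts["sad"])
```

Because every team reports the same two buckets, the counts can be compared across products even though the underlying error types differ.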

Related Insights

Systematically review production traces ("open coding"), group the observed errors into categories ("axial coding"), and then count how often each category occurs. This simple process turns subjective "vibe checks" and messy logs into a prioritized, data-backed roadmap for improving your AI application, which gives PMs outsized leverage.
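The review, categorize, and count steps can be sketched in a few lines of Python. The trace IDs, notes, and category names below are invented for illustration; in practice a human reviewer writes the notes and assigns the categories.

```python
from collections import Counter

# Open coding: free-form notes written while reading each trace (hypothetical data).
open_codes = {
    "trace-1": "agent proposed a tour time in the past",
    "trace-2": "reply ignored the user's budget",
    "trace-3": "tour booked on a day the office is closed",
    "trace-4": "no issue",
}

# Axial coding: the reviewer assigns each note to an error category.
# None marks a trace with no error.
axial_codes = {
    "trace-1": "tour scheduling failure",
    "trace-2": "ignored user constraint",
    "trace-3": "tour scheduling failure",
    "trace-4": None,
}

# Counting: rank categories by how often they occur.
counts = Counter(c for c in axial_codes.values() if c is not None)
ranked = counts.most_common()

for category, n in ranked:
    print(f"{category}: {n}")
```

The ranked list is the prioritized roadmap: the most frequent category is the first candidate for a fix.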

Generic evaluation metrics like "helpfulness" or "conciseness" are vague and untrustworthy. A better approach is to first perform manual error analysis to find recurring problems (e.g., "tour scheduling failures"). Then build specific, targeted evaluations (evals) that directly measure the frequency of these concrete issues, so that each metric maps to a problem you have actually observed.

AI product quality depends heavily on infrastructure reliability, and inference infrastructure is less stable than traditional cloud services: inference providers frequently drop requests. Jared Palmer's team at Vercel monitored key metrics like 'error-free sessions' in near real time. This intense, data-driven approach is crucial for building a reliable agentic product.
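The source does not say how 'error-free sessions' is computed. One plausible reading, sketched below in Python, is the share of sessions in which no request failed. The function name, the session log shape, and the "ok"/"error" labels are all assumptions.

```python
def error_free_session_rate(sessions):
    """Return the fraction of sessions in which every request succeeded.

    `sessions` maps a session id to a list of request outcomes,
    each either "ok" or "error" (an assumed log format).
    Returns None when there are no sessions to measure.
    """
    if not sessions:
        return None
    clean = sum(
        1
        for outcomes in sessions.values()
        if all(outcome == "ok" for outcome in outcomes)
    )
    return clean / len(sessions)


# Made-up sample: one of four sessions hit a dropped request.
sessions = {
    "s1": ["ok", "ok"],
    "s2": ["ok", "error", "ok"],
    "s3": ["ok"],
    "s4": ["ok", "ok", "ok"],
}
rate = error_free_session_rate(sessions)
print(f"error-free sessions: {rate:.0%}")
```

A session-level metric is stricter than a request-level success rate: a single dropped request marks the whole session as not error-free, which is closer to what a user experiences.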

Unlike traditional PMs who manage deterministic products (a button click always does the same thing), AI PMs manage probabilistic systems where the same input can yield different outputs. The core skill becomes defining acceptable error rates and designing for inconsistent results.

The most impactful quality metrics are not internal measures like bug counts but those directly linked to customer and business outcomes. QA professionals increase their influence by framing their findings in terms of business impact, financial exposure, and customer risk.

To ensure model robustness, OpenAI uses a "worst at N" evaluation metric. They sample a model's output multiple times (e.g., 20 times) on a given problem and measure the performance of the single worst response. This focuses development on eliminating low-quality outliers and ensuring a high floor for safety and consistency, rather than just optimizing for average performance.
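A minimal Python sketch of the idea follows. It is not OpenAI's implementation: `worst_at_n`, the stub generator, and the length-based scorer are hypothetical stand-ins for a real model call and a real grader.

```python
from itertools import cycle
from statistics import mean


def sample_scores(generate, score, problem, n=20):
    """Sample n outputs for one problem and score each of them."""
    return [score(problem, generate(problem)) for _ in range(n)]


def worst_at_n(generate, score, problem, n=20):
    """Return the score of the single worst response among n samples."""
    return min(sample_scores(generate, score, problem, n))


# Hypothetical stand-ins: a "model" that sometimes returns an empty answer,
# and a toy scorer that uses answer length as the quality score.
_fake_outputs = cycle(["good answer", "ok", ""])


def fake_generate(problem):
    return next(_fake_outputs)


def fake_score(problem, output):
    return len(output)


scores = sample_scores(fake_generate, fake_score, "demo problem", n=20)
average = mean(scores)
worst = min(scores)
print("average:", average, "worst:", worst)
```

In this toy run the average looks acceptable while the worst sample scores zero, which is exactly the kind of outlier an average-only metric would hide.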

To manage non-deterministic AI products, Shopify created an internal tool where PMs grade AI-generated outputs. This creates a "ground truth" dataset of what "good" looks like, which is then used to fine-tune a separate LLM that acts as an automated quality judge for new features and updates.

You don't need to create an automated "LLM as a judge" for every potential failure. Many issues discovered during error analysis can be fixed with a simple prompt adjustment. Reserve the effort of building robust, automated evals for the 4-7 most persistent and critical failure modes that prompt changes alone cannot solve.

To create a shared language for quality, Wealthsimple developed a hierarchy: 1) functionality, 2) reliability, 3) performance, and finally, 4) an excellent experience. This framework helps teams make trade-off decisions and align on what to prioritize first.

Anthropic has flipped the traditional development process. Instead of debating quality at the mock or discussion stage, they push teams to build a working version first. Quality decisions are then made based on hands-on usage of the live product, which provides much richer and more accurate feedback.