For complex cases like "friendly fraud," traditional ground truth labels are often missing. Stripe uses an LLM to act as a judge, evaluating the quality of AI-generated labels for suspicious payments. This creates a proxy for ground truth, enabling faster model iteration.
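As a rough illustration, such a judge can be a single prompt and model call per label; the prompt wording, model name, and `judge_label` helper below are assumptions for this sketch, not Stripe's actual implementation.

```python
# Hypothetical sketch: an LLM judge that grades an AI-generated fraud label.
from openai import OpenAI

client = OpenAI()  # assumes an API key is configured in the environment

JUDGE_PROMPT = """You are reviewing a fraud label assigned to a payment.
Transaction summary: {summary}
Assigned label: {label}
Reasoning given: {reasoning}

Is the label well supported by the evidence? Answer with exactly one word: PASS or FAIL."""

def judge_label(summary: str, label: str, reasoning: str) -> bool:
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumed judge model; substitute whatever you use
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            summary=summary, label=label, reasoning=reasoning)}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().upper().startswith("PASS")
```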
Simply creating an LLM judge prompt isn't enough. Before deploying it, you must test its alignment with human judgment. Run the judge on your manually labeled data and analyze the results in a confusion matrix. This helps you see where it disagrees with you (false positives/negatives) so you can refine the prompt and build trust.
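A minimal way to run that check, assuming you already have the human labels and the judge's verdicts as parallel 0/1 lists (the numbers here are placeholders):

```python
# Sketch: compare the LLM judge against hand labels before trusting it.
from sklearn.metrics import confusion_matrix

human_labels = [1, 0, 0, 1, 1, 0, 1, 0]   # your manual fraud / not-fraud labels
judge_labels = [1, 0, 1, 1, 0, 0, 1, 0]   # what the LLM judge said

tn, fp, fn, tp = confusion_matrix(human_labels, judge_labels).ravel()
print(f"false positives (judge flagged, you didn't): {fp}")
print(f"false negatives (judge missed, you flagged): {fn}")
```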
Binary decisions are brittle. For payments that are neither clearly safe nor clearly fraudulent, Stripe uses a "soft block." This triggers a 3D Secure (3DS) authentication step, allowing legitimate users to proceed while stopping fraudsters, resolving the ambiguity without losing revenue.
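In code, the idea is a three-way routing decision instead of a binary allow/block; the thresholds below are illustrative, not Stripe's actual values.

```python
# Sketch of a "soft block" path between allow and hard decline.
def route_payment(risk_score: float) -> str:
    if risk_score < 0.30:
        return "allow"        # clearly safe: no friction
    if risk_score < 0.80:
        return "soft_block"   # ambiguous: challenge with 3D Secure (3DS)
    return "block"            # clearly fraudulent: hard decline
```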
Stripe avoids costly system rebuilds by treating its new payments foundation model as a modular component. Its powerful embeddings are simply added as new features to many existing ML classifiers, instantly boosting their performance with minimal engineering effort.
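Conceptually this is just column concatenation: the embeddings become extra features for whatever classifier already exists. A sketch with assumed shapes, random placeholder data, and an arbitrary scikit-learn model:

```python
# Sketch: bolt foundation-model embeddings onto an existing feature matrix.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

X_tabular = np.random.rand(1000, 40)     # existing hand-engineered features
X_embed = np.random.rand(1000, 128)      # per-transaction embeddings from the foundation model
y = np.random.randint(0, 2, size=1000)   # fraud / not-fraud labels

X = np.hstack([X_tabular, X_embed])      # embeddings are just more columns
clf = GradientBoostingClassifier().fit(X, y)
```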
Stripe's AI model treats payments as a distinct data type, not just text. It analyzes transaction sequences across buyers, cards, devices, and merchants to uncover complex fraud patterns invisible to humans, boosting card-testing detection from 59% to 97%.
To combat poor quality on Amazon Mechanical Turk, the ImageNet team secretly included pre-labeled images within worker task flows. By checking performance on these "gold standard" examples, they could implicitly monitor accuracy and filter out unreliable contributors, ensuring high-quality data at scale.
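A sketch of that quality-control loop, with an assumed data layout and an arbitrary 90% accuracy cutoff:

```python
# Sketch: score each worker on the hidden gold-standard items, keep reliable ones.
from collections import defaultdict

# (worker_id, image_id, worker_label); `gold` holds the known-correct answers
annotations = [("w1", "img1", "cat"), ("w1", "img2", "dog"), ("w2", "img1", "dog")]
gold = {"img1": "cat", "img2": "dog"}

hits, totals = defaultdict(int), defaultdict(int)
for worker, image, label in annotations:
    if image in gold:                 # only the secretly injected items count
        totals[worker] += 1
        hits[worker] += int(label == gold[image])

trusted = {w for w in totals if hits[w] / totals[w] >= 0.9}
```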
Do not blindly trust an LLM's evaluation scores. The biggest mistake is showing stakeholders metrics that don't match their perception of product quality. To build trust, first hand-label a sample of data with binary outcomes (good/bad), then compare the LLM judge's scores against these human labels to ensure agreement before deploying the eval.
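One simple way to quantify that agreement before shipping the eval; the labels are placeholders, and Cohen's kappa is just one reasonable choice of metric:

```python
# Sketch: raw agreement plus chance-corrected agreement between human and judge.
from sklearn.metrics import cohen_kappa_score

human = [1, 1, 0, 0, 1, 0, 1, 0]   # your good/bad labels
judge = [1, 1, 0, 1, 1, 0, 1, 0]   # the LLM judge's verdicts

raw_agreement = sum(h == j for h, j in zip(human, judge)) / len(human)
kappa = cohen_kappa_score(human, judge)   # corrects for agreement by chance
print(f"agreement={raw_agreement:.2f}, kappa={kappa:.2f}")
```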
When creating an "LLM as a judge" to automate evaluations, resist the urge to use a 1-5 rating scale. It creates ambiguity (what does an average score of 3.2 versus 3.7 actually mean?). Instead, force the judge to make a binary "pass" or "fail" decision. It's a more painful but ultimately more tractable and actionable way to measure quality.
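A sketch of the difference, with a strict parser that refuses anything but a binary verdict (the prompt wording is illustrative):

```python
# Sketch: binary judge output is easy to validate and act on; a 1-5 scale is not.
AMBIGUOUS_PROMPT = "Rate this response from 1 to 5 for quality."                # averages like 3.2 vs 3.7 are hard to act on
BINARY_PROMPT = "Does this response meet the bar? Answer exactly PASS or FAIL."

def parse_verdict(raw: str) -> bool:
    verdict = raw.strip().upper()
    if verdict not in {"PASS", "FAIL"}:
        raise ValueError(f"Judge returned a non-binary verdict: {raw!r}")
    return verdict == "PASS"
```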
By creating dense embeddings for every transaction, Stripe's model identifies subtle patterns of card testing (e.g., tiny, repetitive charges) hidden within high-volume merchants' traffic. These attacks are invisible to traditional ML but appear as distinct clusters to the foundation model, boosting detection for large users from 59% to 97%.
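One way to picture this: cluster the per-transaction embeddings and flag unusually large, tight clusters of near-identical charges. The embedding matrix, DBSCAN parameters, and cluster-size cutoff below are illustrative assumptions, not Stripe's method.

```python
# Sketch: card-testing bursts show up as dense clusters in embedding space.
import numpy as np
from sklearn.cluster import DBSCAN

embeddings = np.random.rand(5000, 128)   # one embedding per transaction for a merchant
labels = DBSCAN(eps=0.5, min_samples=25).fit_predict(embeddings)

# Clusters (label != -1) that are unusually large are candidates for review.
suspect_clusters = {c for c in set(labels) if c != -1 and (labels == c).sum() > 100}
```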
The prompts for your "LLM as a judge" evals function as a new form of PRD. They explicitly define the desired behavior, edge cases, and quality standards for your AI agent. Unlike static PRDs, they are living documents: derived from real user data and constantly, automatically testing whether the product meets its requirements.
Purely model-based or rule-based systems each have flaws; Stripe combines them for better results. For instance, a transaction with a CVC mismatch (a rule) is blocked only if its model-generated risk score is also elevated, preventing the rejection of good customers who make simple mistakes.
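A toy version of that combination, with an assumed field name and threshold:

```python
# Sketch: the rule fires only when the model agrees that risk is elevated.
def should_block(cvc_mismatch: bool, risk_score: float) -> bool:
    # A CVC mismatch alone is often just a typo; require a high model score
    # before turning the rule into a hard decline.
    return cvc_mismatch and risk_score > 0.7
```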