For complex cases like "friendly fraud," traditional ground truth labels are often missing. Stripe uses an LLM to act as a judge, evaluating the quality of AI-generated labels for suspicious payments. This creates a proxy for ground truth, enabling faster model iteration.
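As a rough illustration, such a judge can be a single prompt and model call per label; the prompt wording, model name, and `judge_label` helper below are assumptions for this sketch, not Stripe's actual implementation.

```python
# Hypothetical sketch: an LLM judge that grades an AI-generated fraud label.
from openai import OpenAI

client = OpenAI()  # assumes an API key is configured in the environment

JUDGE_PROMPT = """You are reviewing a fraud label assigned to a payment.
Transaction summary: {summary}
Assigned label: {label}
Reasoning given: {reasoning}

Is the label well supported by the evidence? Answer with exactly one word: PASS or FAIL."""

def judge_label(summary: str, label: str, reasoning: str) -> bool:
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumed judge model; substitute whatever you use
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            summary=summary, label=label, reasoning=reasoning)}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().upper().startswith("PASS")
```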
Simply creating an LLM judge prompt isn't enough. Before deploying it, you must test its alignment with human judgment. Run the judge on your manually labeled data and analyze the results in a confusion matrix. This helps you see where it disagrees with you (false positives/negatives) so you can refine the prompt and build trust.
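A minimal way to run that check, assuming you already have the human labels and the judge's verdicts as parallel 0/1 lists (the numbers here are placeholders):

```python
# Sketch: compare the LLM judge against hand labels before trusting it.
from sklearn.metrics import confusion_matrix

human_labels = [1, 0, 0, 1, 1, 0, 1, 0]   # your manual fraud / not-fraud labels
judge_labels = [1, 0, 1, 1, 0, 0, 1, 0]   # what the LLM judge said

tn, fp, fn, tp = confusion_matrix(human_labels, judge_labels).ravel()
print(f"false positives (judge flagged, you didn't): {fp}")
print(f"false negatives (judge missed, you flagged): {fn}")
```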
Binary decisions are brittle. For payments that are neither clearly safe nor clearly fraudulent, Stripe uses a "soft block." This triggers a 3D Secure (3DS) authentication step, allowing legitimate users to proceed while stopping fraudsters, resolving the ambiguity without losing revenue.
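In code, the idea is a three-way routing decision instead of a binary allow/block; the thresholds below are illustrative, not Stripe's actual values.

```python
# Sketch of a "soft block" path between allow and hard decline.
def route_payment(risk_score: float) -> str:
    if risk_score < 0.30:
        return "allow"        # clearly safe: no friction
    if risk_score < 0.80:
        return "soft_block"   # ambiguous: challenge with 3D Secure (3DS)
    return "block"            # clearly fraudulent: hard decline
```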
Stripe avoids costly system rebuilds by treating its new payments foundation model as a modular component. Its powerful embeddings are simply added as new features to many existing ML classifiers, instantly boosting their performance with minimal engineering effort.
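Conceptually this is just column concatenation: the embeddings become extra features for whatever classifier already exists. A sketch with assumed shapes, random placeholder data, and an arbitrary scikit-learn model:

```python
# Sketch: bolt foundation-model embeddings onto an existing feature matrix.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

X_tabular = np.random.rand(1000, 40)     # existing hand-engineered features
X_embed = np.random.rand(1000, 128)      # per-transaction embeddings from the foundation model
y = np.random.randint(0, 2, size=1000)   # fraud / not-fraud labels

X = np.hstack([X_tabular, X_embed])      # embeddings are just more columns
clf = GradientBoostingClassifier().fit(X, y)
```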
Stripe's AI model treats payments as a distinct data type, not just text. It analyzes transaction sequences across buyers, cards, devices, and merchants to uncover complex fraud patterns invisible to humans, boosting card-testing detection from 59% to 97%.
To combat poor quality on Amazon Mechanical Turk, the ImageNet team secretly included pre-labeled images within worker task flows. By checking performance on these "gold standard" examples, they could implicitly monitor accuracy and filter out unreliable contributors, ensuring high-quality data at scale.
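A sketch of that quality-control loop, with an assumed data layout and an arbitrary 90% accuracy cutoff:

```python
# Sketch: score each worker on the hidden gold-standard items, keep reliable ones.
from collections import defaultdict

# (worker_id, image_id, worker_label); `gold` holds the known-correct answers
annotations = [("w1", "img1", "cat"), ("w1", "img2", "dog"), ("w2", "img1", "dog")]
gold = {"img1": "cat", "img2": "dog"}

hits, totals = defaultdict(int), defaultdict(int)
for worker, image, label in annotations:
    if image in gold:                 # only the secretly injected items count
        totals[worker] += 1
        hits[worker] += int(label == gold[image])

trusted = {w for w in totals if hits[w] / totals[w] >= 0.9}
```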
Do not blindly trust an LLM's evaluation scores. The biggest mistake is showing stakeholders metrics that don't match their perception of product quality. To build trust, first hand-label a sample of data with binary outcomes (good/bad), then compare the LLM judge's scores against these human labels to ensure agreement before deploying the eval.
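One simple way to quantify that agreement before shipping the eval; the labels are placeholders, and Cohen's kappa is just one reasonable choice of metric:

```python
# Sketch: raw agreement plus chance-corrected agreement between human and judge.
from sklearn.metrics import cohen_kappa_score

human = [1, 1, 0, 0, 1, 0, 1, 0]   # your good/bad labels
judge = [1, 1, 0, 1, 1, 0, 1, 0]   # the LLM judge's verdicts

raw_agreement = sum(h == j for h, j in zip(human, judge)) / len(human)
kappa = cohen_kappa_score(human, judge)   # corrects for agreement by chance
print(f"agreement={raw_agreement:.2f}, kappa={kappa:.2f}")
```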
When creating an "LLM as a judge" to automate evaluations, resist the urge to use a 1-5 rating scale. It creates ambiguity (what does an average score of 3.2 versus 3.7 actually mean?). Instead, force the judge to make a binary "pass" or "fail" decision. It's a more painful but ultimately more tractable and actionable way to measure quality.
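A sketch of the difference, with a strict parser that refuses anything but a binary verdict (the prompt wording is illustrative):

```python
# Sketch: binary judge output is easy to validate and act on; a 1-5 scale is not.
AMBIGUOUS_PROMPT = "Rate this response from 1 to 5 for quality."                # averages like 3.2 vs 3.7 are hard to act on
BINARY_PROMPT = "Does this response meet the bar? Answer exactly PASS or FAIL."

def parse_verdict(raw: str) -> bool:
    verdict = raw.strip().upper()
    if verdict not in {"PASS", "FAIL"}:
        raise ValueError(f"Judge returned a non-binary verdict: {raw!r}")
    return verdict == "PASS"
```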
By creating dense embeddings for every transaction, Stripe's model identifies subtle patterns of card testing (e.g., tiny, repetitive charges) hidden within high-volume merchants' traffic. These attacks are invisible to traditional ML but appear as distinct clusters to the foundation model, boosting detection for large users from 59% to 97%.
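One way to picture this: cluster the per-transaction embeddings and flag unusually large, tight clusters of near-identical charges. The embedding matrix, DBSCAN parameters, and cluster-size cutoff below are illustrative assumptions, not Stripe's method.

```python
# Sketch: card-testing bursts show up as dense clusters in embedding space.
import numpy as np
from sklearn.cluster import DBSCAN

embeddings = np.random.rand(5000, 128)   # one embedding per transaction for a merchant
labels = DBSCAN(eps=0.5, min_samples=25).fit_predict(embeddings)

# Clusters (label != -1) that are unusually large are candidates for review.
suspect_clusters = {c for c in set(labels) if c != -1 and (labels == c).sum() > 100}
```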
The prompts for your "LLM as a judge" evals function as a new form of PRD. They explicitly define the desired behavior, edge cases, and quality standards for your AI agent. Unlike static PRDs, they are living documents: derived from real user data and constantly, automatically testing whether the product meets its requirements.
Purely model-based or rule-based systems each have flaws; Stripe combines them for better results. For instance, a transaction with a CVC mismatch (a rule) is blocked only if its model-generated risk score is also elevated, preventing the rejection of good customers who make simple mistakes.
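A toy version of that combination, with an assumed field name and threshold:

```python
# Sketch: the rule fires only when the model agrees that risk is elevated.
def should_block(cvc_mismatch: bool, risk_score: float) -> bool:
    # A CVC mismatch alone is often just a typo; require a high model score
    # before turning the rule into a hard decline.
    return cvc_mismatch and risk_score > 0.7
```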