To build its legal benchmark without violating client confidentiality, Harvey used AI agents to generate realistic synthetic documents. This agent-led first draft was then refined by human legal experts, creating a scalable pipeline for high-quality, proprietary data in a data-scarce industry.
When building a PII detector for e-commerce giant Rakuten, Goodfire AI had to train on synthetic data due to privacy rules. This forced them to solve the difficult "synthetic to real" transfer problem to ensure performance on actual customer data, a common enterprise hurdle.
In data-scarce verticals like law, Harvey AI overcomes the lack of public training data by using coding models to create synthetic documents. The output is realistic enough that even lawyers can't tell the synthetic documents from real ones, which unlocks the ability to post-train specialized models.
Harvey created and open-sourced "Legal Agent Bench" to measure AI agent performance on legal tasks. This establishes them as a thought leader, rallies the community to improve on their vertical's problems, and creates a moat by defining the standard of performance for the entire industry.
Insurers lack the historical loss data required to price novel AI risks. The solution is to use red teaming and systematic evaluations to create a large pool of "synthetic data" on how an AI product behaves and fails. This data on failure frequency and severity can be directly plugged into traditional actuarial models.
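As a rough illustration of how evaluation results could feed a frequency-severity calculation, here is a minimal sketch. The `EvalResult` record, its `severity_usd` field, and the dollar figures are hypothetical and do not come from the podcast; a real actuarial model would fit distributions and add loadings rather than multiply two averages.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    """One red-team or evaluation trial (hypothetical record shape)."""
    failed: bool
    severity_usd: float  # assumed loss estimate if this failure hit production


def frequency_severity(results):
    """Return (failure frequency, mean severity, expected loss per interaction)."""
    if not results:
        raise ValueError("need at least one evaluation result")
    failures = [r for r in results if r.failed]
    frequency = len(failures) / len(results)
    mean_severity = (
        sum(r.severity_usd for r in failures) / len(failures) if failures else 0.0
    )
    return frequency, mean_severity, frequency * mean_severity


def pure_premium(results, annual_interactions):
    """Expected annual loss: per-interaction expected loss times exposure."""
    _, _, expected_loss = frequency_severity(results)
    return expected_loss * annual_interactions


# Illustrative numbers only: 1,000 trials, 20 failures at an assumed $500 each.
trials = [EvalResult(True, 500.0)] * 20 + [EvalResult(False, 0.0)] * 980
freq, sev, loss = frequency_severity(trials)
premium = pure_premium(trials, annual_interactions=10_000)
```

The point of the sketch is only that evaluation runs yield the same two inputs, frequency and severity, that historical claims data would otherwise provide.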
To test complex AI prompts for tasks like customer persona generation without exposing sensitive company data, first ask the AI to create realistic, synthetic data (e.g., fake sales call notes). This allows you to safely develop and refine prompts before applying them to real, proprietary information, overcoming data privacy hurdles in experimentation.
The future of valuable AI lies not in models trained on the abundant public internet, but in those built on scarce, proprietary data. For fields like robotics and biology, this data doesn't exist to be scraped; it must be actively created, making the data generation process itself the key competitive moat.
Instead of using sensitive company information, you can prompt an AI model to create realistic, fake data for your business. This allows you to experiment with powerful data visualization and analysis workflows without any privacy or security risks.
To test their idea, Harvey's founders used GPT-3 to answer questions from the r/legaladvice subreddit. They sent the AI-generated responses to lawyers for review without revealing the source. When 86% were approved without edits, they knew they had a viable product.
Expect 2026 to be the breakout year for synthetic data. Companies in highly regulated sectors like healthcare and finance are realizing it offers a compliant, low-risk way to test and train AI models without compromising sensitive customer information, enabling innovation in marketing, research, and customer experience (CX).
Harvey is building agentic AI for law by modeling it on the human workflow where a senior partner delegates a high-level task to a junior associate. The associate (or AI agent) then breaks it down, researches, drafts, and seeks feedback, with the entire client matter serving as the reinforcement learning environment.