In data-scarce verticals like law, Harvey AI overcomes the lack of public training data by using coding models to create synthetic documents. The pipeline is effective enough that even lawyers cannot tell the synthetic documents from real ones, which makes it possible to post-train specialized models.
When building a PII detector for e-commerce giant Rakuten, Goodfire AI had to train on synthetic data due to privacy rules. This forced them to solve the difficult "synthetic to real" transfer problem to ensure performance on actual customer data, a common enterprise hurdle.
To ensure accuracy in its legal AI, LexisNexis unexpectedly hired a large number of lawyers, not just data scientists. These legal experts are crucial for reviewing AI output, identifying errors, and training the models, highlighting the essential role of human domain expertise in specialized AI.
To break the data bottleneck in AI protein engineering, companies now generate massive synthetic datasets. By creating novel "synthetic epitopes" and measuring their binding, they can produce thousands of validated positive and negative training examples in a single experiment, massively accelerating model development.
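The step from binding measurements to training examples can be illustrated with a minimal sketch. It assumes each synthetic epitope yields one normalized binding readout and that a fixed cutoff separates binders from non-binders; the `Measurement` type, the `label_examples` function, the 0.5 threshold, and the sample values are all hypothetical and not taken from the source.

```python
from typing import Iterable, NamedTuple


class Measurement(NamedTuple):
    epitope_id: str
    binding_signal: float  # hypothetical normalized assay readout in [0, 1]


def label_examples(
    measurements: Iterable[Measurement], threshold: float = 0.5
) -> tuple[list[str], list[str]]:
    """Split measured epitopes into positive and negative training examples.

    The threshold is an illustrative assumption; a real assay would need
    its own calibrated cutoff.
    """
    positives: list[str] = []
    negatives: list[str] = []
    for m in measurements:
        if m.binding_signal >= threshold:
            positives.append(m.epitope_id)
        else:
            negatives.append(m.epitope_id)
    return positives, negatives


# Made-up readings, for illustration only.
batch = [
    Measurement("ep-001", 0.91),
    Measurement("ep-002", 0.12),
    Measurement("ep-003", 0.50),
]
positives, negatives = label_examples(batch)
print(positives, negatives)
```

The point of the sketch is that both lists come out of the same experiment, so the negatives are measured rather than assumed.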
Synthetic data serves as an efficient first step for training specialized AI, particularly when a larger model teaches a smaller one. However, it is insufficient on its own. The final, crucial stage always requires expensive "human signal" (feedback from subject matter experts) to reach the required level of performance.
Advanced model training is not just about scraping the web. It is a multi-stage process that starts with massive web data, is refined with human-written examples (supervised fine-tuning, SFT) and human ratings, and is then scaled using reinforcement learning on data generated by the model itself. This synthetic data loop is now a critical component.
Unlike coding with its verifiable unit tests, complex legal work lacks a binary success metric. Harvey addresses this reinforcement learning challenge by treating senior partner feedback and edits as the "reward function," mirroring how quality is judged in the real world. The ultimate verification is long-term success, like a merger avoiding future litigation.
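One way to picture edits acting as a reward is to score a draft by how much of it survives review. The sketch below is an illustration under stated assumptions, not Harvey's actual method: the `edit_reward` function is hypothetical, and word-level similarity from Python's standard `difflib` stands in for whatever signal a real system would derive from partner feedback.

```python
from difflib import SequenceMatcher


def edit_reward(draft: str, edited: str) -> float:
    """Return a reward in [0, 1]; 1.0 means the reviewer changed nothing.

    Assumption: fewer edits imply a better draft. Real feedback is richer
    than this (comments, rejections, long-term outcomes).
    """
    # Compare word sequences so the score reflects how many words survive.
    return SequenceMatcher(None, draft.split(), edited.split()).ratio()


untouched = edit_reward("the party shall pay", "the party shall pay")
one_word_changed = edit_reward("the party shall pay", "the party must pay")
rewritten = edit_reward("a b c d", "w x y z")
print(untouched, one_word_changed, rewritten)
```

A similarity score like this is only a proxy: it rewards drafts that need little editing, but it cannot capture the long-term outcomes the text describes as the ultimate verification.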
For tools like Harvey AI, the primary technical challenge is connecting all the context a lawyer's task requires (emails, private documents, case law) before even considering model customization. The data plumbing is paramount and precedes personalization.
Harvey intentionally avoids self-serve and focuses on the most complex enterprise legal work first. The strategy is to build a business around problems so difficult they will outlast the next decade of foundational model advancements, preventing commoditization.
Harvey is building agentic AI for law by modeling it on the human workflow where a senior partner delegates a high-level task to a junior associate. The associate (or AI agent) then breaks it down, researches, drafts, and seeks feedback, with the entire client matter serving as the reinforcement learning environment.
The CEO contrasts general-purpose AI with their "courtroom-grade" solution, built on a proprietary, authoritative data set of 160 billion documents. This ensures outputs are grounded in actual case law and verifiable, addressing the core weaknesses of consumer models for professional use.