Unlike coding with its verifiable unit tests, complex legal work lacks a binary success metric. Harvey addresses this reinforcement learning challenge by treating senior partner feedback and edits as the "reward function," mirroring how quality is judged in the real world. The ultimate verification is long-term success, like a merger avoiding future litigation.
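As a rough illustration (not Harvey's actual pipeline), here is one way partner edits could be reduced to a scalar reward: score how much of the agent's draft survives the partner's revision.

```python
# Illustrative only: turn a partner's edits into a scalar reward by measuring
# how much of the agent's draft survives the revision. The function name and
# design are hypothetical, not Harvey's implementation.
from difflib import SequenceMatcher

def edit_based_reward(agent_draft: str, partner_final: str) -> float:
    """Reward in [0, 1]: 1.0 means the partner kept the draft verbatim;
    values near 0 mean the draft was largely rewritten."""
    return SequenceMatcher(None, agent_draft, partner_final).ratio()

# A lightly edited clause keeps most of its reward.
print(edit_based_reward(
    "The Seller shall indemnify the Buyer for all claims.",
    "The Seller shall indemnify the Buyer against all third-party claims.",
))
```

In practice the signal would be sparser and noisier than this, but it shows the shape of a reward derived from how experts already review work.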
Treating AI evaluation like a final exam is a mistake. For critical enterprise systems, evaluations should be embedded at every step of an agent's workflow (e.g., after planning, before action). This is akin to unit testing in classic software development and is essential for building trustworthy, production-ready agents.
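A minimal sketch of the idea, with plan_step and act_step as hypothetical stand-ins for the agent's real model calls: cheap checks run after planning and gate the action step, the same way a unit test gates a merge.

```python
def plan_step(task: str) -> str:
    # Stand-in for the agent's planning call.
    return f"1. Research {task}\n2. Draft memo\n3. Cite sources"

def act_step(plan: str) -> str:
    # Stand-in for the agent's drafting/action call.
    return f"Draft memo following plan:\n{plan}"

def check_plan(plan: str) -> dict[str, bool]:
    # Cheap "unit tests" run after planning, before any action is taken.
    return {
        "non_empty": bool(plan.strip()),
        "starts_with_numbered_step": plan.lstrip().startswith("1."),
    }

def run_agent(task: str) -> str:
    plan = plan_step(task)
    checks = check_plan(plan)
    if not all(checks.values()):
        # Fail fast at the step level instead of discovering problems at the end.
        raise RuntimeError(f"Plan failed step-level evals: {checks}")
    return act_step(plan)

print(run_agent("change-of-control clauses"))
```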
Many AI projects fail to reach production because of reliability issues. The vision for continual learning is to deploy agents that are "good enough," then use RL to correct their behavior based on real-world errors, much like training a human. This solves the last-mile reliability problem and could unlock a vast market.
The frontier of AI training is moving beyond humans ranking model outputs (RLHF). Now, highly skilled experts create detailed success criteria (like rubrics or unit tests), which an AI then uses to provide feedback to the main model at scale, a process called RLAIF.
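A hedged sketch of the pattern: the expert writes the rubric once, and an AI judge applies it to every output; ai_judge below is a stub for whatever judge model is actually used, and all names are illustrative.

```python
RUBRIC = [
    "Cites the governing statute or case law",
    "States the conclusion before the analysis",
    "Flags open factual questions for the client",
]

def build_judge_prompt(rubric: list[str], answer: str) -> str:
    criteria = "\n".join(f"- {c}" for c in rubric)
    return (
        "Grade the answer pass/fail against each criterion.\n"
        f"Criteria:\n{criteria}\n\nAnswer:\n{answer}\n"
    )

def ai_judge(prompt: str) -> list[bool]:
    # Stub: a real judge would send the prompt to a strong LLM and parse one
    # pass/fail verdict per criterion.
    return [True, True, False]

def rubric_reward(answer: str) -> float:
    verdicts = ai_judge(build_judge_prompt(RUBRIC, answer))
    return sum(verdicts) / len(RUBRIC)  # fraction of criteria passed

print(rubric_reward("The claim is time-barred; see the two-year statute..."))  # ≈ 0.67 with the stub judge
```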
Reinforcement Learning from Human Feedback (RLHF) is a popular term, but it's just one method. The core concept is reinforcing desired model behavior using various signals. These can include AI feedback (RLAIF), where another AI judges the output, or verifiable rewards, like checking if a model's answer to a math problem is correct.
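A verifiable reward needs no judge at all. As a sketch (illustrative names, not any particular lab's code), a math answer can simply be checked against the known result:

```python
def verifiable_math_reward(model_answer: str, expected: float) -> float:
    """Return 1.0 if the model's final answer matches the known result, else 0.0."""
    try:
        return 1.0 if abs(float(model_answer.strip()) - expected) < 1e-9 else 0.0
    except ValueError:
        return 0.0  # unparsable answers earn no reward

print(verifiable_math_reward("42", 6 * 7))      # 1.0
print(verifiable_math_reward("forty-two", 42))  # 0.0
```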
Harvey's initial product was a tool for individual lawyers. The company found greater value by shifting focus to the productivity of entire legal teams and firms, tackling enterprise-level challenges like workflow orchestration, governance, and secure collaboration, which go far beyond simple model intelligence.
When creating an "LLM as a judge" to automate evaluations, resist the urge to use a 1-5 rating scale. This creates ambiguity (what does a 3.2 vs 3.7 mean?). Instead, force the judge to make a binary "pass" or "fail" decision. It's a more painful but ultimately more tractable and actionable way to measure quality.
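A sketch of the binary setup, with judge_model as a hypothetical stand-in for the LLM call: each example gets a strict pass or fail, and quality rolls up into a pass rate rather than an average of fuzzy scores.

```python
def judge_model(prompt: str) -> str:
    # Stand-in for the real LLM call; the prompt instructs it to answer
    # strictly "PASS" or "FAIL".
    return "PASS"

def binary_verdict(question: str, answer: str) -> bool:
    prompt = (
        "Does the answer fully and correctly address the question? "
        "Reply with exactly PASS or FAIL.\n"
        f"Question: {question}\nAnswer: {answer}"
    )
    return judge_model(prompt).strip().upper() == "PASS"

def pass_rate(examples: list[tuple[str, str]]) -> float:
    verdicts = [binary_verdict(q, a) for q, a in examples]
    return sum(verdicts) / len(verdicts)

print(pass_rate([("What is 2+2?", "4"), ("Define tort.", "A civil wrong.")]))  # 1.0 with the stub judge
```

A pass rate like "85% of examples pass" is something a team can act on; an average score of 3.4 is not.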
Mercor's $500M revenue in 17 months highlights a shift in AI training. The focus is moving from low-paid data labelers to a marketplace of elite experts like doctors and lawyers providing high-quality, nuanced data. This creates a new, lucrative gig economy for top-tier professionals.
As reinforcement learning (RL) techniques mature, the core challenge shifts from the algorithm to the problem definition. The competitive moat for AI companies will be their ability to create high-fidelity environments and benchmarks that accurately represent complex, real-world tasks, effectively teaching the AI what matters.
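To make the problem-definition point concrete, here is a toy, gym-style environment sketch (hypothetical class and task); the reward and termination logic is where the domain expertise lives, while the learning algorithm is assumed off the shelf.

```python
class ContractReviewEnv:
    """Toy environment: the agent must flag the risky clauses in a contract."""

    def __init__(self, clauses: list[str], risky: set[int]):
        self.clauses = clauses            # one clause per index
        self.risky = risky                # ground-truth risky clause indices
        self.flagged: set[int] = set()

    def reset(self) -> list[str]:
        self.flagged.clear()
        return self.clauses

    def step(self, action: int) -> tuple[float, bool]:
        # Reward design is the hard, domain-specific part the text calls the moat.
        reward = 1.0 if action in self.risky and action not in self.flagged else -0.2
        self.flagged.add(action)
        done = self.risky <= self.flagged
        return reward, done

env = ContractReviewEnv(["Clause A", "Clause B", "Clause C"], risky={1})
env.reset()
print(env.step(1))  # (1.0, True): the risky clause was found
```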
OpenPipe's "Ruler" library leverages a key insight: GRPO only needs relative rankings, not absolute scores. By having an LLM judge stack-rank a group of agent runs, one can generate effective rewards. This approach works phenomenally well, even with weaker judge models, effectively solving the reward assignment problem.
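The insight is easy to sketch in code (this is the underlying idea, not the Ruler library's actual API): because GRPO standardizes rewards within each group of rollouts, a judge's stack-ranking can be mapped to scores and normalized into advantages directly.

```python
import statistics

def ranks_to_advantages(ranking: list[int]) -> list[float]:
    """ranking[i] is the judge's rank for rollout i (1 = best).
    Returns group-normalized advantages as used in GRPO-style updates."""
    n = len(ranking)
    scores = [float(n - r) for r in ranking]    # best rank -> highest score
    mean = statistics.mean(scores)
    std = statistics.pstdev(scores) or 1.0      # avoid division by zero
    return [(s - mean) / std for s in scores]

# Judge says rollout 2 was best and rollout 0 worst among four runs of the agent.
print(ranks_to_advantages([4, 2, 1, 3]))
```

Because only the ordering within the group matters, the judge never has to produce a calibrated absolute score, which is exactly where weaker models tend to fail.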
Harvey is building agentic AI for law by modeling it on the human workflow where a senior partner delegates a high-level task to a junior associate. The associate (or AI agent) then breaks it down, researches, drafts, and seeks feedback, with the entire client matter serving as the reinforcement learning environment.
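A simplified sketch of that delegate-draft-feedback loop (function names are hypothetical; in practice the partner review is a human or a judge model):

```python
def draft_step(task: str, feedback: str | None) -> str:
    # Stand-in for the associate/agent's drafting call.
    note = f" (revised per: {feedback})" if feedback else ""
    return f"Draft memo on {task}{note}"

def partner_review(draft: str) -> tuple[bool, str]:
    # Stand-in for the senior partner's review; returns (approved, feedback).
    approved = "revised" in draft
    return approved, "Tighten the indemnification analysis."

def run_matter(task: str, max_rounds: int = 3) -> str:
    draft = draft_step(task, None)
    for _ in range(max_rounds):
        approved, feedback = partner_review(draft)
        if approved:
            return draft                      # the accepted work product
        draft = draft_step(task, feedback)    # associate revises and resubmits
    return draft

print(run_matter("change-of-control provisions"))
```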