Designing the reward function for an RL pricing model isn't just a technical task; it's a political one. It forces different departments (sales, operations, finance) to agree on a single definition of "good," thereby exposing and resolving hidden disagreements about strategic priorities like margin stability versus demand fulfillment.
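To make that negotiation concrete, the agreement usually has to be written down as a weighted objective. Below is a minimal, hypothetical sketch: the feature names and weights are placeholders, and the weights are precisely the numbers the departments would have to fight over.

```python
# Hypothetical composite reward for an RL pricing agent; names and weights are
# illustrative. The weights are the "political" part: each term encodes one
# department's definition of "good".
def pricing_reward(margin, margin_target, units_sold, demand_forecast,
                   w_margin=0.6, w_fulfillment=0.4):
    # Finance: penalize deviation from the target margin (stability).
    margin_stability = -abs(margin - margin_target)
    # Sales/operations: reward meeting forecast demand, capped at full credit.
    fulfillment = min(units_sold / max(demand_forecast, 1), 1.0)
    return w_margin * margin_stability + w_fulfillment * fulfillment
```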
To prevent engineers from gaming output-based pay, 10X assigns a "Technical Strategist" to each project. The engineer is paid for output, but the strategist is incentivized by client retention and account growth (NRR), creating a healthy tension that ensures high-quality work is delivered.
A major organizational red flag is when the people who decide on pricing are different from those who decide feature priorities. This disconnect indicates a broken strategy loop where value creation and value capture are managed in separate, unaligned silos.
AI startups should choose their pricing model based on a 2x2 matrix of autonomy (human-in-the-loop vs. fully automated) and attribution (how clearly the product's value can be measured). Low autonomy with murky attribution points to seat-based pricing, while high levels of both unlock outcome-based models.
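Read as a lookup, the matrix might look like the sketch below. Only the two corner cases come from the framework as stated; the mixed cells are assumptions, shown here as usage-based pricing.

```python
# Illustrative lookup over the autonomy x attribution matrix. Only the two
# corner cells come from the framework as stated; the mixed cells are assumptions.
PRICING_MATRIX = {
    ("human_in_the_loop", "unclear"): "seat-based",
    ("human_in_the_loop", "clear"):   "usage-based",   # assumption
    ("fully_automated",   "unclear"): "usage-based",   # assumption
    ("fully_automated",   "clear"):   "outcome-based",
}

def suggest_pricing_model(autonomy: str, attribution: str) -> str:
    return PRICING_MATRIX[(autonomy, attribution)]
```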
Reinforcement Learning from Human Feedback (RLHF) is a popular term, but it's just one method. The core concept is reinforcing desired model behavior using various signals. These can include AI feedback (RLAIF), where another AI judges the output, or verifiable rewards, like checking whether a model's answer to a math problem is correct.
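A verifiable reward is the simplest case to picture: the checker is just code. A minimal sketch, assuming the model is prompted to finish with a line like "Answer: 42":

```python
import re

def verifiable_math_reward(model_output: str, ground_truth: str) -> float:
    """Return 1.0 if the model's final numeric answer matches the ground truth, else 0.0."""
    match = re.search(r"Answer:\s*(-?\d+(?:\.\d+)?)", model_output)
    if match is None:
        return 0.0
    return 1.0 if float(match.group(1)) == float(ground_truth) else 0.0
```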
Unlike coding with its verifiable unit tests, complex legal work lacks a binary success metric. Harvey addresses this reinforcement learning challenge by treating senior partner feedback and edits as the "reward function," mirroring how quality is judged in the real world. The ultimate verification is long-term success, like a merger avoiding future litigation.
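One way to picture edits-as-reward (a sketch, not Harvey's actual pipeline): score a draft by how much of it survives the partner's markup, so a draft that needs little rewriting earns more than one that gets torn apart.

```python
from difflib import SequenceMatcher

def edit_based_reward(model_draft: str, partner_edited: str) -> float:
    # Similarity in [0, 1]: 1.0 means the partner changed nothing,
    # while heavy rewrites push the score toward 0.
    return SequenceMatcher(None, model_draft, partner_edited).ratio()
```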
The theoretical need for an RL model to 'explore' new strategies is perceived by organizations as unpredictable, high-risk volatility. To gain trust, exploration cannot be a hidden technical function. It must be reframed and managed as a controlled, bounded, and explainable business decision with clear guardrails and manageable consequences.
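In practice that often means wrapping whatever the policy proposes in business-approved limits. A hypothetical guardrail layer for a pricing agent (the price band, step size, and exploration rate are parameters the business signs off on, not values from the source):

```python
import random

def propose_price(current_price, greedy_price, price_floor, price_ceiling,
                  max_step_pct=0.05, epsilon=0.10):
    if random.random() < epsilon:
        # Explore, but only as a bounded perturbation of today's price.
        candidate = current_price * (1 + random.uniform(-max_step_pct, max_step_pct))
    else:
        # Exploit the policy's greedy recommendation.
        candidate = greedy_price
    # Guardrail: never leave the approved price band.
    return min(max(candidate, price_floor), price_ceiling)
```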
When determining what data an RL model should consider, resist including every available feature. Instead, observe how experienced human decision-makers reason about the problem. Their simplified mental models reveal the core signals that truly drive outcomes, leading to more stable, faster-learning, and more interpretable AI systems.
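For example (feature names hypothetical), the state an experienced pricing manager actually reasons about is usually a handful of signals, not every column in the warehouse:

```python
# Hypothetical expert-derived state for a pricing policy: the few signals a
# seasoned decision-maker actually watches, instead of hundreds of raw columns.
EXPERT_FEATURES = [
    "inventory_days_remaining",
    "competitor_price_gap",
    "demand_trend_7d",
    "days_to_season_end",
]

def build_state(row: dict) -> list:
    # A compact state keeps the policy smaller, faster to train, and easier
    # to explain back to the people whose reasoning it mirrors.
    return [float(row[name]) for name in EXPERT_FEATURES]
```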
As reinforcement learning (RL) techniques mature, the core challenge shifts from the algorithm to the problem definition. The competitive moat for AI companies will be their ability to create high-fidelity environments and benchmarks that accurately represent complex, real-world tasks, effectively teaching the AI what matters.
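In code, that moat looks less like a new algorithm and more like an environment definition. A toy skeleton follows; the task, observations, and scoring rule are all placeholders for the domain knowledge that is the actual hard part.

```python
class InvoiceDisputeEnv:
    """Toy task environment in the reset/step style; every detail is a placeholder."""

    def reset(self):
        self.turns_left = 10
        return {"dispute_open": True, "turns_left": self.turns_left}

    def step(self, action: str):
        self.turns_left -= 1
        resolved = (action == "offer_credit")      # placeholder success rule
        reward = 1.0 if resolved else 0.0          # placeholder benchmark score
        done = resolved or self.turns_left == 0
        obs = {"dispute_open": not resolved, "turns_left": self.turns_left}
        return obs, reward, done
```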
OpenPipe's 'Ruler' library leverages a key insight: GRPO only needs relative rankings, not absolute scores. By having an LLM judge stack-rank a group of agent runs, one can generate effective rewards. This approach works phenomenally well, even with weaker judge models, effectively solving the reward assignment problem.
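The underlying trick (sketched here in plain Python, not Ruler's actual API) is to map the judge's stack ranking onto relative scores; GRPO then normalizes those within the group, so only the ordering matters, not any absolute calibration of the judge.

```python
def rewards_from_ranking(ranking: list) -> list:
    """ranking[i] is the index of the i-th best run; map ranks to scores in [0, 1]."""
    k = len(ranking)
    if k == 1:
        return [1.0]
    rewards = [0.0] * k
    for position, run_index in enumerate(ranking):
        rewards[run_index] = (k - 1 - position) / (k - 1)
    return rewards

# Example: the judge ranks run 2 best, then run 0, then run 1.
print(rewards_from_ranking([2, 0, 1]))  # -> [0.5, 0.0, 1.0]
```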
In the age of AI, software is shifting from a tool that assists humans to an agent that completes tasks. The pricing model should reflect this. Instead of a subscription for access (a license), charge for the value created when the AI successfully achieves a business outcome.