An AI model trained on public legal documents performed well. However, when applied to real customer contracts (shared with the customers' consent), its accuracy dropped by 15 percentage points. This reveals the significant performance gap between clean, public training data and complex, private enterprise data.

Related Insights

While public benchmarks show general model improvement, they are almost orthogonal to enterprise adoption. Enterprises don't care about general capabilities; they need near-perfect precision on highly specific, internal workflows. This requires extensive fine-tuning and validation, not chasing leaderboard scores.

There's a significant gap between AI performance in simulated benchmarks and in the real world. Despite scoring highly on evaluations, AIs in real deployments make "silly mistakes that no human would ever dream of doing," suggesting that current benchmarks don't capture the messiness and unpredictability of reality.

Public internet data has been largely exhausted for training AI models. The real competitive advantage, and the source of next-generation specialized AI, will be the vast, untapped reservoirs of proprietary data locked inside corporations, such as R&D data from pharmaceutical or semiconductor companies.

A key competitive advantage for AI companies lies in capturing proprietary outcomes data by owning a customer's end-to-end workflow. This data, such as which legal cases are won or lost, is not publicly available. It creates a powerful feedback loop where the AI gets smarter at predicting valuable outcomes, a moat that general models cannot replicate.
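
A rough sketch of what such an outcomes feedback loop could look like in practice; the record types, field names, and join step below are illustrative assumptions, not a description of any particular vendor's system.

```python
from dataclasses import dataclass

# Illustrative only: join a model's in-workflow predictions with the
# real-world outcomes recorded later, producing labeled training pairs
# that a general-purpose model never sees.

@dataclass
class Prediction:
    matter_id: str       # internal case/matter identifier (hypothetical)
    features: dict       # whatever the model saw at prediction time
    predicted_win: bool

@dataclass
class Outcome:
    matter_id: str
    won: bool            # recorded later, once the case resolves

def build_training_pairs(predictions, outcomes):
    """Join predictions with eventual outcomes to create supervised labels."""
    resolved = {o.matter_id: o.won for o in outcomes}
    return [
        (p.features, resolved[p.matter_id])
        for p in predictions
        if p.matter_id in resolved
    ]

# Each retraining cycle, the proprietary (features, outcome) pairs grow,
# which is the feedback loop the insight describes.
```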

The most significant gap in AI research is its focus on academic evaluations instead of tasks customers value, like medical diagnosis or legal drafting. The solution is to use real-world experts to define benchmarks that measure performance on economically relevant work.

The researchers' failure case analysis is highlighted as a key contribution. Understanding why the model fails, whether because of ambiguous data or unusual inputs, provides a realistic sense of where it can be applied and a clear roadmap for improvement, which is more useful to practitioners than high scores alone.
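
As a rough illustration of this kind of failure-case analysis, the sketch below tags each error with a cause and tallies them; the cause categories and the `error_cases` entries are invented for the example.

```python
from collections import Counter

# Hypothetical error log: each entry is (example_id, cause), where the cause
# was assigned by a human reviewer during the failure analysis.
error_cases = [
    ("doc-014", "ambiguous_clause"),
    ("doc-027", "unusual_formatting"),
    ("doc-031", "ambiguous_clause"),
    ("doc-042", "out_of_scope_jurisdiction"),
]

def summarize_failures(cases):
    """Count failures per cause so the team can scope and prioritize fixes."""
    counts = Counter(cause for _, cause in cases)
    total = len(cases)
    return {cause: (n, n / total) for cause, n in counts.most_common()}

for cause, (n, share) in summarize_failures(error_cases).items():
    print(f"{cause}: {n} cases ({share:.0%})")
```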

Off-the-shelf AI models can only go so far. The true bottleneck for enterprise adoption is "digitizing judgment": capturing the unique, context-specific expertise of a company's own employees. The same document can mean something entirely different from one company to the next, which is why internal labeling is required.
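
One hedged way to picture "digitizing judgment": the same document type maps to different internal labels at different companies, so the labels have to come from each company's own experts. The companies, document types, and labels below are invented for illustration.

```python
# Invented example: a "master service agreement" means different things to
# different businesses, so each company supplies its own labeling guide.
COMPANY_LABEL_GUIDES = {
    "acme_manufacturing": {
        "master_service_agreement": "supplier_risk_review",
        "change_order": "capex_approval_required",
    },
    "globex_software": {
        "master_service_agreement": "standard_saas_terms",
        "change_order": "low_priority",
    },
}

def label_document(company: str, doc_type: str) -> str:
    """Return the company-specific label an internal expert would assign."""
    guide = COMPANY_LABEL_GUIDES.get(company, {})
    return guide.get(doc_type, "needs_expert_review")

# The same input yields different training labels per company:
print(label_document("acme_manufacturing", "master_service_agreement"))
print(label_document("globex_software", "master_service_agreement"))
```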

Despite strong benchmark scores, top Chinese AI models (from ZAI, Kimi, DeepSeek) are "nowhere close" to US models like Claude or Gemini on complex, real-world vision tasks, such as accurately reading a messy scanned document. This suggests benchmarks don't capture a significant real-world performance gap.

Standardized AI benchmarks are saturated and becoming less relevant for real-world use cases. The true measure of a model's improvement is now found in custom, internal evaluations (evals) created by application-layer companies. Progress on a legal AI tool's own evals, for example, is a more meaningful indicator than a generic test score.
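
A minimal sketch of the kind of custom internal eval this points to: a handful of domain-specific test cases with expert answers, scored against a model. The test items, the exact-match scoring rule, and the `ask_model` callable are placeholders, not any real product's eval suite.

```python
# Placeholder internal eval for a hypothetical legal AI tool. In practice the
# cases would come from real workflows with expert-written gold answers.
EVAL_CASES = [
    {"prompt": "What is the notice period in clause 12?", "gold": "30 days"},
    {"prompt": "Which party bears indemnification costs?", "gold": "the vendor"},
]

def run_internal_eval(ask_model, cases=EVAL_CASES) -> float:
    """Score a model on domain-specific cases; substring match stands in
    for whatever rubric the application team actually uses."""
    correct = sum(
        1 for case in cases
        if case["gold"].lower() in ask_model(case["prompt"]).lower()
    )
    return correct / len(cases)

# Usage: compare two model versions on the same internal eval rather than
# on a public leaderboard.
# score_v1 = run_internal_eval(model_v1.ask)
# score_v2 = run_internal_eval(model_v2.ask)
```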

The CEO contrasts general-purpose AI with the company's "courtroom-grade" solution, built on a proprietary, authoritative data set of 160 billion documents. This grounding ensures outputs are based on actual case law and are verifiable, addressing the core weaknesses of consumer models for professional use.