OpenAI's new GDPval benchmark evaluates models on complex, real-world knowledge work tasks, not abstract IQ-style tests. This pivot signals that the true measure of AI progress is now its ability to perform economically valuable work, making performance metrics directly comparable to professional output.
Early AI training relied on simple preference-labeling tasks, such as choosing the better of two model responses. Now, training frontier models requires PhDs and top professionals to perform complex, hours-long tasks like building entire websites or explaining nuanced cancer topics. The demand is for deep, specialized expertise, not just generalist labor.
A consortium including leaders from Google and DeepMind has defined AGI as matching the cognitive versatility of a "well-educated adult" across 10 domains. This new framework moves beyond abstract debate, showing a concrete 30-point leap in AGI score from GPT-4 (27%) to a projected GPT-5 (57%).
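For intuition, here is a minimal sketch of how such a composite score works if each of the 10 domains is weighted equally and scored from 0 to 100%; the domain names and numbers below are illustrative placeholders, not figures from the framework's paper.

```python
# Illustrative sketch (not the consortium's official scoring code): with 10
# equally weighted cognitive domains, the headline AGI score is simply the
# mean of the per-domain proficiencies.
from statistics import mean

# Hypothetical per-domain scores for a single model (placeholders).
domain_scores = {
    "reading_writing": 80,
    "mathematics": 70,
    "reasoning": 65,
    "working_memory": 40,
    "long_term_memory_storage": 10,
    "long_term_memory_retrieval": 60,
    "visual_processing": 50,
    "auditory_processing": 55,
    "processing_speed": 60,
    "general_knowledge": 85,
}

agi_score = mean(domain_scores.values())
print(f"Aggregate AGI score: {agi_score:.0f}%")
```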
Formal AI competency frameworks are still emerging. In their place, innovative companies are assessing employee AI skills with concrete, activity-based targets like "build three custom GPTs for your role" or completing specific certifications, directly linking these achievements to performance reviews.
A benchmark testing AI agents against paid freelance jobs found that even the best-performing agents could autonomously complete only about 2.5% of the work. This is a crucial reality check: while AI excels at discrete tasks, full job automation by general-purpose agents remains far from reality.
OpenAI's new GDPval framework evaluates AI on real-world knowledge work. It found frontier models produce work rated equal to or better than that of human experts nearly 50% of the time, while being roughly 100 times faster and cheaper. This provides a direct measure of impending economic transformation.
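As a rough illustration of how a GDPval-style headline number can be tallied, the sketch below computes a win-or-tie rate from blinded grader verdicts. The verdict labels and values are assumptions for illustration, not OpenAI's actual grading schema or data.

```python
# Minimal sketch: blinded expert graders compare each model deliverable against
# the human professional's deliverable; the headline metric is the share of
# tasks where the model wins or ties.
from collections import Counter

# Hypothetical grader verdicts for one model across six tasks (placeholders).
verdicts = ["model_wins", "tie", "human_wins", "model_wins", "human_wins", "tie"]

counts = Counter(verdicts)
win_or_tie_rate = (counts["model_wins"] + counts["tie"]) / len(verdicts)
print(f"Win-or-tie rate vs. human experts: {win_or_tie_rate:.0%}")
```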
Current AI models resemble a student who grinds 10,000 hours on a narrow task: superhuman on the benchmark, but lacking the broad, adaptable intelligence of someone with less specialized practice and stronger general reasoning. This explains the gap between eval scores and real-world utility.
The primary bottleneck in improving AI is no longer data or compute but the creation of "evals": tests that measure a model's capabilities. These evals act as product requirement documents (PRDs) for researchers, defining what success looks like and guiding the training process.
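To make the eval-as-PRD idea concrete, here is a minimal sketch of a single eval case expressed as a data structure. The field names and the oncology example are illustrative assumptions, not drawn from any particular eval framework.

```python
# A minimal sketch of an eval case treated like a PRD: each case states the task,
# the acceptance criteria, and how the output gets graded.
from dataclasses import dataclass

@dataclass
class EvalCase:
    task_id: str
    prompt: str                      # the work the model is asked to do
    acceptance_criteria: list[str]   # what "success" means, in plain language
    grader: str = "human_expert"     # who or what scores the output

suite = [
    EvalCase(
        task_id="oncology-explainer-01",
        prompt="Explain treatment options for stage II breast cancer to a patient.",
        acceptance_criteria=[
            "Covers surgery, radiation, and systemic therapy",
            "No factual errors flagged by a board-certified oncologist",
            "Readable at a lay-audience level",
        ],
    ),
]
```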
As reinforcement learning (RL) techniques mature, the core challenge shifts from the algorithm to the problem definition. The competitive moat for AI companies will be their ability to create high-fidelity environments and benchmarks that accurately represent complex, real-world tasks, effectively teaching the AI what matters.
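A bare-bones sketch of a standard environment interface makes the point: the scaffolding is trivial, and the value lies in making reset, step, and especially the reward faithful to a real task. The class below follows the common Gym-style convention; the task logic and reward check are placeholders, not a real training environment.

```python
# Skeleton of an RL environment for a knowledge-work task. The hard part is not
# this interface but the fidelity of the observations and the reward signal.
class KnowledgeWorkEnv:
    def reset(self):
        """Start a new task episode and return the initial observation."""
        self.done = False
        return {"task": "Draft a reply to a customer refund request."}

    def step(self, action: str):
        """Apply the agent's action (e.g. a drafted reply) and score it."""
        # In a real environment this reward would come from rubric checks,
        # expert review, or downstream outcomes; that fidelity is the moat.
        reward = 1.0 if "refund" in action.lower() else 0.0
        self.done = True
        return {"task": "done"}, reward, self.done, {}
```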
Standardized AI benchmarks are saturating and becoming less relevant to real-world use cases. The true measure of a model's improvement now lies in the custom, internal evaluations (evals) built by application-layer companies; gains on a legal AI tool's internal evals, for example, are a more meaningful indicator than a generic test score.
Instead of waiting for external reports, companies should develop their own AI model evaluations. By defining key tasks for specific roles and testing new models against them with standard prompts, businesses can create a relevant, internal benchmark.
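One possible starting point, assuming an OpenAI-compatible chat API: define a handful of role-specific tasks with simple pass criteria and run each candidate model through them. The model names, prompts, and keyword checks below are placeholders to be swapped for a team's real tasks and graders.

```python
# Minimal sketch of an internal benchmark harness (illustrative, not a standard tool).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

TASKS = [
    {"prompt": "Summarize this contract clause for a sales rep: ...",
     "must_include": ["termination", "notice period"]},
    {"prompt": "Draft a SQL query to find churned customers in the last 90 days.",
     "must_include": ["SELECT", "90"]},
]

def run_eval(model: str) -> float:
    """Return the fraction of tasks whose output contains every required keyword."""
    passed = 0
    for task in TASKS:
        reply = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": task["prompt"]}],
        ).choices[0].message.content
        if all(kw.lower() in reply.lower() for kw in task["must_include"]):
            passed += 1
    return passed / len(TASKS)

for model in ["gpt-4o", "gpt-4o-mini"]:  # swap in whichever models you compare
    print(model, f"{run_eval(model):.0%}")
```

Keyword checks are deliberately crude; once the harness exists, teams typically graduate to rubric-based or expert grading while keeping the same fixed task set for model-to-model comparison.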