By ranking engineers on AI token consumption, Meta is experiencing Goodhart's Law: "When a measure becomes a target, it ceases to be a good measure." Employees reportedly build bots to needlessly burn tokens for status, demonstrating how gamifying a proxy metric can backfire and become disconnected from actual business impact.

Related Insights

The proliferation of AI leaderboards incentivizes companies to optimize models for specific benchmarks. This creates a risk of "acing the SATs": models excel on the tests without necessarily making progress on real-world problems. Optimizing to game the metrics can diverge from creating genuine user value.

Current AI benchmarks have become targets for competition, an example of Goodhart's Law. Models are optimized to top leaderboards rather than to develop the general capabilities the benchmarks were designed to measure, creating a false sense of progress and leaving benchmark scores a poor predictor of real-world performance.

When a useful metric like "average handling time" becomes a performance target, employees game the system. Reps may hang up on customers to meet quotas, destroying the metric's ability to reflect actual customer satisfaction.

Focusing on individual performance metrics can be counterproductive. As the "super chicken" experiment showed, top individual performers often succeed by suppressing others, which undermines collaboration and harms long-term group output; a collaborative team can be up to 160% more productive than a group of siloed high-achievers.

A trend called "tokenmaxxing" is emerging in Silicon Valley, where companies like Meta use leaderboards to track employee AI token usage. This reflects a corporate bet that higher token consumption correlates with increased productivity, turning AI usage into a new, albeit gameable, performance metric for engineers.

According to Goodhart's Law, when a measure becomes a target, it ceases to be a good measure. If you incentivize employees on AI-driven metrics like 'emails sent,' they will optimize for the number, not quality, corrupting the data and giving false signals of productivity.

Gamification backfires when it rewards unintended actions: Visual Studio's badge system, for example, inadvertently incentivized developers to write curse words in code comments. This shows the need to understand the second-order effects of any incentive system before implementing it.

Alan Chang argues that incentivizing metrics can have negative second-order effects. For example, a recruiter whose bonus is tied to 'hires per month' may be motivated to convince hiring managers to lower the talent bar just to hit the target, which is detrimental to the company's long-term goals.

Labs are incentivized to climb leaderboards like LM Arena, which reward flashy, engaging, but often inaccurate responses. Optimizing for "dopamine instead of truth" produces models that read like tabloids rather than ones that advance humanity by solving hard problems.

At companies like Meta, a new practice called "token maxing" is being used as a productivity measure: engineers compete on leaderboards to consume the most AI tokens. Though promoted by leaders from Nvidia and Meta, the metric is criticized as easily gamed and not necessarily reflective of true productivity.