When complex entities like universities are judged by simplified rankings (e.g., U.S. News), they learn to manipulate the specific inputs to the ranking formula. This optimizes their score without necessarily making them better institutions, substituting the appearance of improvement for the genuine article.

Related Insights

The proliferation of AI leaderboards incentivizes companies to optimize models for specific benchmarks. This creates a risk of "acing the SATs," where models excel on tests without making corresponding progress on real-world problems. Optimizing to game metrics can diverge from creating genuine user value.

Public leaderboards like LM Arena are becoming unreliable proxies for model performance, as teams implicitly or explicitly overfit by optimizing for specific test sets. The superior strategy is to focus on internal, proprietary evaluation metrics and use public benchmarks only as a final, confirmatory check, not as a primary development target.

Elite universities with massive endowments and shrinking acceptance rates are betraying their public service mission. By failing to expand enrollment, they function more like exclusive 'hedge funds offering classes' that manufacture scarcity to protect their brand prestige, rather than educational institutions aiming to maximize societal impact.

According to Goodhart's Law, when a measure becomes a target, it ceases to be a good measure. If you incentivize employees on AI-driven metrics like 'emails sent,' they will optimize for the number, not quality, corrupting the data and giving false signals of productivity.

Don't trust academic benchmarks. Labs often "hill climb" or game them for marketing purposes, which doesn't translate to real-world capability. Furthermore, many of these benchmarks contain incorrect answers and messy data, making them an unreliable measure of true AI advancement.

Even as average scores on a consistent exam have dropped by 10 points over 20 years, the share of A grades at Harvard has risen from 25% to 60%. This trend suggests a significant devaluation of academic credentials: grades no longer accurately reflect student mastery.

Good Star Labs found GPT-5's performance in their Diplomacy game skyrocketed with optimized prompts, moving it from the bottom to the top. This shows a model's inherent capability can be masked or revealed by its prompt, making "best model" a context-dependent title rather than an absolute one.

Labs are incentivized to climb leaderboards like LM Arena, which reward flashy, engaging, but often inaccurate responses. This focus on "dopamine instead of truth" creates models optimized for tabloids, not for advancing humanity by solving hard problems.

When complex situations are reduced to a single metric, strategy shifts from achieving the original goal to maximizing the metric itself. During the Vietnam War, using "body counts" as a proxy for success led to military decisions designed to inflate reported enemy casualties rather than to win the war.

Instead of trying to climb the traditional university rankings ladder—a game viewed as unwinnable and misguided—ASU President Michael Crow opted out. He created a new competitive framework for ASU focused on scale, speed, innovation, and societal impact, effectively inventing a different game to play.