To ensure AI labs don't provide specially optimized private endpoints for evaluation, the firm creates anonymous accounts to test the same public models everyone else uses. This "mystery shopper" policy maintains the integrity and independence of its results.

Related Insights

The proliferation of AI leaderboards incentivizes companies to optimize models for specific benchmarks. This creates a risk of models "acing the SATs": excelling on tests without necessarily making progress on real-world problems. Optimizing to game these metrics can diverge from creating genuine user value.

Companies with valuable proprietary data should not license it away. A better strategy for guiding foundation model development is to keep the data private and instead release public benchmarks and evaluations based on it. This incentivizes LLM providers to train their models on the specific tasks you care about, improving their performance for your product.

The company provides public benchmarks for free to build trust. It monetizes by selling private benchmarking services and subscription-based enterprise reports, ensuring AI labs cannot pay for better public scores and thus maintaining objectivity.

Public leaderboards like LM Arena are becoming unreliable proxies for model performance. Teams implicitly or explicitly game them by optimizing for specific test sets. The superior strategy is to focus on internal, proprietary evaluation metrics and to use public benchmarks only as a final, confirmatory check, not as a primary development target.

Traditional benchmarks often reward guessing. Artificial Analysis's "Omniscience Index" changes the incentive by subtracting points for wrong answers but not for "I don't know" responses. This encourages models to demonstrate calibration instead of fabricating facts.
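As a rough illustration of this scoring idea, here is a minimal sketch in Python. It assumes equal +1/-1 weights for correct and incorrect answers, 0 for "I don't know" abstentions, and a -100..100 normalization; these specifics are assumptions for the example, not Artificial Analysis's published formula.

```python
def omniscience_style_score(results):
    """Score graded answers under a penalize-wrong, don't-penalize-abstain rule.

    Each entry in `results` is one of "correct", "incorrect", or "abstain"
    (where "abstain" covers explicit "I don't know" responses).
    The +1/-1/0 weights and the -100..100 normalization are illustrative
    assumptions, not Artificial Analysis's published formula.
    """
    weights = {"correct": 1, "incorrect": -1, "abstain": 0}
    if not results:
        return 0.0
    raw = sum(weights[r] for r in results)
    # Normalize so scores are comparable across benchmarks of different sizes.
    return 100.0 * raw / len(results)


# A model that guesses on everything vs. one that abstains when unsure:
guesser    = ["correct"] * 60 + ["incorrect"] * 40
calibrated = ["correct"] * 60 + ["abstain"] * 40
print(omniscience_style_score(guesser))     # 20.0
print(omniscience_style_score(calibrated))  # 60.0
```

Under a rule like this, confident fabrication scores strictly worse than admitting uncertainty, which is the calibration incentive the index is designed to create.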

LM Arena, known for its public AI model rankings, generates revenue by selling custom, private evaluation services to the same AI companies it ranks. This data helps labs improve their models before public release, but raises concerns about a "pay-to-play" dynamic that could influence public leaderboard performance.

To maintain independence and trust, the company keeps its public benchmarks free and does not let payments influence them. It generates revenue by selling detailed reports and insight subscriptions to enterprises and by conducting private, custom benchmarking for AI companies, separating its public good from its commercial offerings.

Don't trust academic benchmarks. Labs often "hill climb" or game them for marketing purposes, which doesn't translate to real-world capability. Furthermore, many of these benchmarks contain incorrect answers and messy data, making them an unreliable measure of true AI advancement.

To ensure they're testing publicly available models, Artificial Analysis creates anonymous accounts to run benchmarks without the provider's knowledge. Labs agree to this policy because it guarantees fairness and prevents their competitors from receiving special treatment or manipulating results, creating a stable, trusted equilibrium.

To maintain trust, Arena's public leaderboard is treated as a "charity." Model providers cannot pay to be listed, to influence their scores, or to be removed. This commitment to unbiased evaluation is a core principle that differentiates it from pay-to-play analyst firms.