The company provides public benchmarks for free to build trust. It monetizes by selling private benchmarking services and subscription-based enterprise reports, so AI labs cannot pay for better public scores, which preserves objectivity.
Traditional benchmarks often reward guessing. Artificial Analysis's "Omniscience Index" changes the incentive by subtracting points for wrong answers but not for "I don't know" responses. This encourages models to demonstrate calibration instead of fabricating facts.
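As a minimal sketch of this kind of scoring rule (the +1/-1/0 weights below are an assumption for illustration, not Artificial Analysis's published formula), the asymmetry is easy to express in code:

```python
def penalized_score(labels):
    """Score graded answers under a penalty-for-wrong scheme.

    Each label is 'correct', 'incorrect', or 'abstain' (the model said
    "I don't know"). Correct answers earn +1, wrong answers cost -1, and
    abstentions cost nothing, so blind guessing has negative expected
    value once accuracy on the guessed items falls below 50%.
    """
    weights = {"correct": 1, "incorrect": -1, "abstain": 0}
    return sum(weights[label] for label in labels) / len(labels)

# Admitting uncertainty beats fabricating: 6 right + 3 abstentions + 1 wrong
# scores 0.5, while 6 right + 4 confident fabrications scores only 0.2.
print(penalized_score(["correct"] * 6 + ["abstain"] * 3 + ["incorrect"]))
print(penalized_score(["correct"] * 6 + ["incorrect"] * 4))
```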
The company originated not as a grand vision, but as a practical tool the founders built for themselves while developing a legal AI assistant. They needed a way to benchmark LLMs for their own use case, and the project grew from there into a full-fledged company.
Once a benchmark becomes a standard, research efforts naturally shift to optimizing for that specific metric. This can lead to models that excel on the test but don't necessarily improve in general, real-world capabilities—a classic example of Goodhart's Law in AI.
To ensure AI labs don't provide specially optimized private endpoints for evaluation, the firm creates anonymous accounts to test the same public models everyone else uses. This "mystery shopper" policy maintains the integrity and independence of their results.
Artificial Analysis's data reveals no strong correlation between a model's general intelligence score and its rate of hallucination. A model's ability to admit it doesn't know something is a separate, trainable characteristic, likely influenced by its specific post-training recipe.
Artificial Analysis found that a model given just a few core tools (context management, web search, code execution) performed better on complex tasks than the integrated agentic systems built into major web chatbots. This suggests leaner, focused toolsets can be more effective.
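A lean harness of that kind fits in a few dozen lines. The sketch below is illustrative only: the tool set, the JSON contract, and the `call_model` callback are assumptions, not Artificial Analysis's actual agent code.

```python
import json
import subprocess

NOTES: list[str] = []  # crude context management: an append-only scratchpad

def web_search(query: str) -> str:
    # Stub; a real harness would call a search API here.
    return f"(stub) top results for: {query}"

def run_python(code: str) -> str:
    # Code execution in a subprocess; a real harness would sandbox this.
    result = subprocess.run(["python", "-c", code],
                            capture_output=True, text=True, timeout=30)
    return result.stdout or result.stderr

def save_note(note: str) -> str:
    NOTES.append(note)
    return f"saved ({len(NOTES)} notes)"

TOOLS = {"web_search": web_search, "run_python": run_python, "save_note": save_note}

def agent_loop(task: str, call_model, max_steps: int = 10) -> str:
    """Feed tool results back to the model until it emits a final answer.

    `call_model(prompt) -> str` is assumed to return JSON, either
    {"tool": <name>, "input": <str>} or {"answer": <str>}.
    """
    transcript = [f"Task: {task}"]
    for _ in range(max_steps):
        step = json.loads(call_model("\n".join(transcript)))
        if "answer" in step:
            return step["answer"]
        result = TOOLS[step["tool"]](step["input"])
        transcript.append(f"{step['tool']} -> {result}")
    return "(no answer within the step budget)"
```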
The binary distinction between "reasoning" and "non-reasoning" models is becoming obsolete. The more critical metric is now "token efficiency"—a model's ability to use more tokens only when a task's difficulty requires it. This dynamic token usage is a key differentiator for cost and performance.
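One way to put a number on that behavior (an illustrative definition, not an official Artificial Analysis metric) is to compare average output-token spend on easy versus hard items:

```python
from statistics import mean

def token_usage_profile(records):
    """Summarize how output-token spend scales with task difficulty.

    `records` is a list of (difficulty, output_tokens) pairs with
    difficulty in {"easy", "hard"}. A token-efficient model keeps the
    easy-task average low and scales up only when the task demands it.
    """
    easy = [t for d, t in records if d == "easy"]
    hard = [t for d, t in records if d == "hard"]
    return {"easy_avg": mean(easy), "hard_avg": mean(hard),
            "hard_to_easy_ratio": mean(hard) / mean(easy)}

# A model that always "thinks" burns ~4,000 tokens regardless of difficulty
# (ratio near 1); a dynamic model spends ~300 on easy items, ~3,500 on hard ones.
print(token_usage_profile([("easy", 4000), ("hard", 4200), ("easy", 3900), ("hard", 4100)]))
print(token_usage_profile([("easy", 300), ("hard", 3500), ("easy", 280), ("hard", 3600)]))
```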
Benchmark data shows that an MoE model's performance correlates more strongly with its total parameter count than with its active parameter count. With models like Kimi K2 running at just 3% active parameters, this suggests there is still significant room to increase sparsity and efficiency.
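The sparsity figure is simple arithmetic; the parameter counts below are Kimi K2's publicly reported sizes, used here only for illustration.

```python
def active_fraction(total_params: float, active_params: float) -> float:
    """Fraction of an MoE model's parameters activated per token."""
    return active_params / total_params

# Kimi K2 is reported at ~1T total parameters with ~32B active per token,
# i.e. roughly 3% of the network participates in any single forward pass.
print(f"{active_fraction(1_000e9, 32e9):.1%}")  # -> 3.2%
```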
While the cost for GPT-4 level intelligence has dropped over 100x, total enterprise AI spend is rising. This is driven by multipliers: using larger frontier models for harder tasks, reasoning-heavy workflows that consume more tokens, and complex, multi-turn agentic systems.
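A rough worked example (every figure below is hypothetical, chosen only to show the multiplier effect) illustrates how per-token price cuts get swamped:

```python
# Hypothetical figures purely to illustrate how multipliers outpace price drops.
old_price_per_mtok = 30.00   # $/1M output tokens at GPT-4-launch-era pricing
new_price_per_mtok = 0.30    # ~100x cheaper for comparable intelligence today
single_shot_tokens = 2_000   # one-off chat completion

reasoning_multiplier = 10    # reasoning traces inflate tokens per call
agentic_turns = 8            # multi-turn agent loops multiply calls per task
frontier_premium = 5         # harder tasks get routed to pricier frontier models

old_cost = single_shot_tokens / 1e6 * old_price_per_mtok
new_cost = (single_shot_tokens * reasoning_multiplier * agentic_turns) / 1e6 \
           * (new_price_per_mtok * frontier_premium)

print(f"old cost per task: ${old_cost:.3f}")  # $0.060
print(f"new cost per task: ${new_cost:.3f}")  # $0.240, despite the 100x price drop
```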
Using an LLM to grade another's output is more reliable when the evaluation process is fundamentally different from the task itself. For agentic tasks, the performer uses tools like code interpreters, while the grader analyzes static outputs against criteria, reducing self-preference bias.
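A minimal sketch of that separation, assuming a generic `call_llm(prompt) -> str` helper (a hypothetical callback, not any specific provider's API): the performer runs tools and produces a transcript, while the grader only reads that static transcript against fixed criteria.

```python
import json

def grade_transcript(call_llm, transcript: str, criteria: list[str]) -> dict:
    """Grade a finished agent transcript against explicit criteria.

    The grader never executes tools itself; it only inspects static output,
    which keeps its process different from the performer's and reduces
    self-preference bias. `call_llm` is assumed to return a JSON string
    mapping each criterion to true or false.
    """
    rubric = "\n".join(f"- {c}" for c in criteria)
    prompt = (
        "You are grading another model's work. Judge only the transcript below.\n"
        f"Criteria:\n{rubric}\n\n"
        f"Transcript:\n{transcript}\n\n"
        "Reply with a JSON object mapping each criterion to true or false."
    )
    verdict = json.loads(call_llm(prompt))
    return {"verdict": verdict,
            "score": sum(bool(v) for v in verdict.values()) / len(criteria)}
```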
The "Omniscience" accuracy benchmark, which measures pure factual knowledge, tracks a model's total parameter count more closely than any other metric does. This suggests embedded knowledge is a direct function of model size, distinct from reasoning abilities developed via training techniques.
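That relationship can be checked with a simple correlation against log-scaled parameter counts; the rows below are placeholder values included only to show the shape of the check, not Artificial Analysis's actual results.

```python
import math
from statistics import correlation  # Python 3.10+

# Placeholder (model, total parameters in billions, factual-accuracy %) rows;
# not real benchmark data.
rows = [("model-a", 70, 22.0), ("model-b", 235, 31.0),
        ("model-c", 405, 38.0), ("model-d", 1000, 47.0)]

log_params = [math.log10(total) for _, total, _ in rows]
accuracy = [acc for _, _, acc in rows]

# Pearson correlation between log(total parameters) and factual accuracy.
print(f"r = {correlation(log_params, accuracy):.2f}")
```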
