Benchmarking revealed no strong correlation between a model's general intelligence and its tendency to hallucinate. This suggests that a model's "honesty" is a distinct characteristic shaped by its post-training recipe, not just a byproduct of having more knowledge.
They provide extensive free benchmarks to build credibility and community trust. Monetization comes from enterprise subscriptions for deeper insights and private, custom benchmarking for AI companies, ensuring the public data remains independent.
As benchmarks become standard, AI labs optimize models to excel at them, leading to score inflation without necessarily improving generalized intelligence. The solution isn't a single perfect test, but continuously creating new evals that measure capabilities relevant to real-world user needs.
Traditional benchmarks reward models for attempting every question, encouraging educated guesses. The Omniscience Index changes this by deducting points for wrong answers but not for "I don't know" responses. This creates an incentive for labs to train models that are less prone to factual hallucination.
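A minimal sketch of this kind of abstention-aware scoring, assuming a simple +1 / -1 / 0 weighting (the exact formula behind the Omniscience Index may differ):

```python
def abstention_aware_score(correct: int, incorrect: int, abstained: int) -> float:
    """+1 per correct answer, -1 per wrong answer, 0 for "I don't know",
    normalized by the number of questions (range -100 to 100).
    Illustrative weighting only."""
    total = correct + incorrect + abstained
    return 100 * (correct - incorrect) / total

# Guessing on everything vs. abstaining when unsure:
print(abstention_aware_score(correct=60, incorrect=40, abstained=0))   # 20.0
print(abstention_aware_score(correct=55, incorrect=5,  abstained=40))  # 50.0
```

Under this scheme, a model that guesses on shaky questions scores worse than one that admits uncertainty, which is exactly the incentive described above.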
While building a legal AI tool, the founders discovered that optimizing each component was a complex benchmarking challenge involving trade-offs between accuracy, speed, and cost. They built an internal tool that quickly gained public traction as the number of models exploded.
To ensure they are testing the same models available to the public, they register anonymous accounts to run evals. This prevents labs from providing specially tuned private endpoints that perform better than their publicly available APIs, thereby maintaining the integrity of their independent analysis.
An open-source harness with just basic tools like web search and a code interpreter enabled models to score higher on the GDPval benchmark than they did through the labs' own integrated chatbot interfaces. This implies that for highly capable models, a less restrictive framework allows for better performance.
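A minimal sketch of such a harness loop; `model_step`, `web_search`, and `run_python` are hypothetical stand-ins for the model API and tool backends, not the actual open-source harness:

```python
def run_task(task: str, model_step, web_search, run_python, max_turns: int = 20):
    """Drive a model through a task with two basic tools until it returns a final answer.
    A real harness would also handle errors, token budgets, and structured tool schemas."""
    transcript = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        action = model_step(transcript)          # returns a dict: a tool call or a final answer
        if action["type"] == "final":
            return action["content"]
        if action["type"] == "web_search":
            result = web_search(action["query"])
        elif action["type"] == "code":
            result = run_python(action["code"])
        else:
            result = f"unknown tool: {action['type']}"
        transcript.append({"role": "tool", "content": str(result)})
    return None  # task not completed within the turn budget
```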
AI labs often use different, optimized prompting strategies when reporting performance, making direct comparisons impossible. For example, Google used an unpublished 32-shot chain-of-thought method for Gemini 1.0 to boost its MMLU score. This highlights the need for neutral third-party evaluation.
Artificial Analysis found that accuracy on its knowledge-based Omniscience benchmark tracks closely with an LLM's total parameter count. By plotting open-weight models on this curve, they can reasonably estimate the size of closed models, suggesting leading frontier models are in the 5-10 trillion parameter range.
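A rough sketch of how such an estimate can be made: fit accuracy against log-parameter-count for open-weight models, then invert the fit for a closed model's score. All model names and numbers below are made-up placeholders, not Artificial Analysis data:

```python
import numpy as np

# Illustrative (made-up) points: (total params in billions, benchmark accuracy %)
open_models = {
    "open-8B":   (8,    28.0),
    "open-70B":  (70,   41.0),
    "open-400B": (400,  52.0),
    "open-1T":   (1000, 58.0),
}

x = np.log10([params for params, _ in open_models.values()])
y = np.array([acc for _, acc in open_models.values()])

# Fit accuracy ≈ m * log10(total params) + b on the open-weight models
m, b = np.polyfit(x, y, 1)

# Invert the fit to estimate a closed model's size from its observed accuracy
closed_accuracy = 68.0
est_params_b = 10 ** ((closed_accuracy - b) / m)
print(f"Estimated total parameters: ~{est_params_b / 1000:.1f}T")  # ~5T with these made-up points
```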
When evaluating AI agents, the total cost of task completion is what matters. A model with a higher per-token cost can be more economical if it resolves a user's query in fewer turns than a cheaper, less capable model. This makes "number of turns" a primary efficiency metric.
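A toy comparison illustrating the point; the prices and token counts are hypothetical:

```python
def cost_per_task(price_per_m_tokens: float, tokens_per_turn: int, turns: int) -> float:
    """Total cost to resolve one task = price * tokens consumed across all turns."""
    return price_per_m_tokens * tokens_per_turn * turns / 1_000_000

# A pricier but more capable model that resolves the query in a couple of turns...
frontier = cost_per_task(price_per_m_tokens=10.0, tokens_per_turn=4_000, turns=2)
# ...versus a cheaper model that needs many more turns to get there.
budget   = cost_per_task(price_per_m_tokens=1.0,  tokens_per_turn=4_000, turns=25)

print(f"frontier: ${frontier:.2f}, budget: ${budget:.2f}")  # frontier: $0.08, budget: $0.10
```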
Performance on knowledge-intensive benchmarks correlates strongly with an MoE model's total parameter count, not its active parameter count. With leading models like Kimi K2 reportedly using only ~3% active parameters, this suggests there is significant room to increase sparsity and efficiency without degrading factual recall.
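Using the commonly reported figures for Kimi K2 (roughly 1T total parameters with about 32B active per token), the arithmetic looks like this:

```python
total_params_b  = 1_000   # ~1T total parameters (reported)
active_params_b = 32      # ~32B active per token (reported)

# Factual recall tracks the total count; per-token compute tracks the active count.
print(f"Active fraction: {active_params_b / total_params_b:.1%}")  # Active fraction: 3.2%
```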
To score models on OpenAI's GDPval benchmark, Artificial Analysis uses Gemini 3 Pro as a judge. For complex, criteria-driven agentic tasks, this LLM-as-judge approach works well and does not exhibit the typical bias toward a model's own outputs, because the judging task is sufficiently different from the execution task.
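A minimal sketch of a criteria-driven judge call; the rubric keys and the `call_judge_model` helper are illustrative stand-ins, not Artificial Analysis's actual grading setup:

```python
import json

def judge_deliverable(brief: str, deliverable: str, call_judge_model) -> dict:
    """Grade one agentic-task output against explicit criteria with an LLM judge.
    `call_judge_model` is a placeholder for whatever chat API backs the judge
    (e.g. a Gemini endpoint returning plain text)."""
    prompt = (
        "You are grading a work deliverable against the task brief below.\n"
        "Score each criterion from 0-10 and reply as JSON with keys:\n"
        "instruction_following, accuracy, formatting, notes.\n\n"
        f"Task brief:\n{brief}\n\n"
        f"Candidate deliverable:\n{deliverable}\n"
    )
    return json.loads(call_judge_model(prompt))
```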
While the cost to achieve a fixed capability level (e.g., GPT-4 at launch) has dropped over 100x, overall enterprise spending is increasing. This paradox is explained by powerful multipliers: demand for frontier models, longer reasoning chains, and multi-step agentic workflows that consume exponentially more tokens.
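A back-of-the-envelope illustration of how per-token price deflation can be overwhelmed by volume multipliers; all of the factors below are hypothetical round numbers:

```python
# Price per token for a fixed capability level falls dramatically...
price_deflation = 1 / 100        # >100x cheaper than GPT-4-at-launch pricing

# ...but several multipliers push token consumption the other way.
frontier_premium     = 5         # paying up for the frontier model, not the fixed-capability one
reasoning_multiplier = 10        # longer reasoning chains inflate output tokens
agentic_steps        = 20        # multi-step workflows make many model calls per task

spend_multiplier = price_deflation * frontier_premium * reasoning_multiplier * agentic_steps
print(f"Net change in spend per task: {spend_multiplier:.1f}x")  # 10.0x
```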
