Open-Source AI Fails on Deep Questions Due to Shallow Training Data

Related Insights

AI Models Excel on Benchmarks But Fail in Reality Due to 'Teaching to the Test'

AI models show impressive performance on evaluation benchmarks but underwhelm in real-world applications. This gap exists because researchers, focused on evals, create reinforcement learning (RL) environments that mirror test tasks. The result is narrow intelligence that doesn't generalize, and the practice itself amounts to human-driven reward hacking.

Dwarkesh and Ilya Sutskever on What Comes After Scaling

The a16z Show·7 months ago

Training Data Contamination in LLMs Appears as Insightful Reasoning, Not Just Regurgitation

Contamination in coding benchmarks is subtle. Instead of just spitting out a known solution, models like GPT-5.2 use implicit knowledge from their training data (e.g., popular codebases) to reason about unstated requirements. This makes it hard to distinguish true capability from memorization, as the model's 'chain of thought' appears logical while relying on leaked information.

⚡️SWE-Bench-Dead: The End of SWE-Bench Verified — Mia Glaese & Olivia Watkins, OpenAI Frontier Evals & Human Data

Latent Space: The AI Engineer Podcast·4 months ago

LLMs Fail at Common Sense Because They Are Trained on the 'Maybe Sphere' of Debatable Text

Large Language Models struggle with obvious, real-world facts because their training data (text) over-represents uncertain topics open to debate—the 'maybe sphere.' Bedrock, common-sense knowledge is rarely written down, leaving a significant gap in the AI's world model and creating a need for human oversight on obvious matters.

David Shor and Byrne Hobart on the Politics of a White-Collar Wipeout

Odd Lots·3 months ago

SWE-Bench Coding Benchmark 'Died' from Training Data Contamination, Not Just Saturation

The SWE-bench benchmark is now obsolete primarily because its open-source problems were absorbed into models' training data. This allowed models to 'cheat' by memorizing solutions rather than demonstrating true reasoning, leading to artificially high and meaningless scores.

[LIVE] Anthropic Distillation & How Models Cheat (SWE-Bench Dead) | Nathan Lambert & Sebastian Raschka

Latent Space: The AI Engineer Podcast·4 months ago
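One rough way to probe for the kind of memorization described above is to measure verbatim n-gram overlap between a benchmark item and a sample of training text. The sketch below is only an illustration under stated assumptions: the function names `ngrams` and `overlap_ratio`, the whitespace tokenization, and the 5-gram default are hypothetical choices, not a method attributed to the episode. It catches literal copies only and would miss contamination that surfaces as plausible-looking reasoning.

```python
from typing import Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase, whitespace-tokenized n-grams in `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_text: str, corpus_text: str, n: int = 5) -> float:
    """Fraction of the benchmark item's n-grams that also occur in the corpus sample.

    A value near 1.0 suggests the item appears almost verbatim in the corpus.
    A value near 0.0 does NOT prove the item is clean; paraphrases go undetected.
    """
    bench = ngrams(benchmark_text, n)
    if not bench:
        return 0.0  # item shorter than n tokens: nothing to compare
    return len(bench & ngrams(corpus_text, n)) / len(bench)


# Hypothetical benchmark item and corpus snippets, invented for illustration.
item = "fix the off by one error in the pagination helper"
leaked = "issue 12 fix the off by one error in the pagination helper and add tests"
unrelated = "refactor the logging module to use structured output everywhere in the app"

leaked_score = overlap_ratio(item, leaked)
unrelated_score = overlap_ratio(item, unrelated)
```

A high score flags an item for manual review; a low score is inconclusive, which is consistent with the point that contamination can be hard to distinguish from real capability.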

AI Models Are Over-Specialized 'Competitive Programmers'

Current AI models resemble a student who grinds 10,000 hours on a narrow task. They achieve superhuman performance on benchmarks but lack the broad, adaptable intelligence of someone with less specific training but better general reasoning. This explains the gap between eval scores and real-world utility.

Ilya Sutskever – The age of scaling is over

Dwarkesh Podcast·7 months ago

Scientific AI's Biggest Hurdle Is the Vast, Undocumented Knowledge Within Labs

The internet is an insufficient training ground for scientific AI because most crucial information—including failed experiments, negative data, and nuanced procedural details—is never published. This undocumented knowledge, what scientists call "good hands," represents a major data bottleneck for building truly intelligent scientific models.

Molly Gibson: Superintelligence and the Future of Drug Development

Behind the Breakthroughs·3 months ago

Large LLM Context Windows Don't Guarantee Recall; Models Often Fail "Needle in the Haystack" Tests

A large context window alone does not guarantee recall. Models may fail to "see" or recall specific facts embedded deep within the context, a failure exposed by "needle in the haystack" evaluations. The ability to reason effectively across the entire window is a separate, critical factor.

959: Building Agents 101: Design Patterns, Evals and Optimization (with Sinan Ozdemir)

Super Data Science: ML & AI Podcast with Jon Krohn·5 months ago
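The "needle in the haystack" evaluation mentioned above can be sketched in a few lines: hide one fact at varying depths inside filler text, ask for it, and record whether the answer comes back. Everything here is a minimal sketch under assumptions: `build_haystack`, `run_needle_test`, and the filler sentence are hypothetical, and `toy_truncating_model` is a stand-in that only reads the tail of its prompt, not a real model or API.

```python
from typing import Callable, Dict, List

FILLER = "The quick brown fox jumps over the lazy dog. "


def build_haystack(needle: str, n_filler: int, depth: float) -> str:
    """Insert `needle` among filler sentences at a relative depth (0.0 = start, 1.0 = end)."""
    sentences = [FILLER] * n_filler
    sentences.insert(round(depth * n_filler), needle + " ")
    return "".join(sentences)


def run_needle_test(
    model: Callable[[str], str],
    needle: str,
    answer: str,
    n_filler: int,
    depths: List[float],
) -> Dict[float, bool]:
    """For each depth, report whether the model's reply contains the expected answer."""
    results = {}
    for depth in depths:
        prompt = build_haystack(needle, n_filler, depth) + "\nQuestion: What is the secret code?"
        results[depth] = answer in model(prompt)
    return results


def toy_truncating_model(prompt: str, visible_chars: int = 2000) -> str:
    """Hypothetical stand-in for a model: it only 'sees' the last `visible_chars` characters."""
    visible = prompt[-visible_chars:]
    marker = "The secret code is "
    idx = visible.find(marker)
    if idx == -1:
        return "I don't know."
    start = idx + len(marker)
    return visible[start:start + 4]


results = run_needle_test(
    toy_truncating_model,
    needle="The secret code is 7421.",
    answer="7421",
    n_filler=200,
    depths=[0.0, 0.5, 1.0],
)
```

With this toy model, the needle is recalled only when it sits near the end of the prompt. Swapping in a call to a real model for `toy_truncating_model` would turn the same loop into a recall-by-depth measurement.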

AI Models Are Over-trained 'Competitive Programmers' Who Lack Real-World Judgment

AI models excel at specific tasks (like evals) because they are trained exhaustively on narrow datasets, akin to a student practicing 10,000 hours for a coding competition. While they become experts in that domain, they fail to develop the broader judgment and generalization skills needed for real-world success.

Dwarkesh and Ilya Sutskever on What Comes After Scaling

The a16z Show·7 months ago

Poor Generalization is the Fundamental Flaw Holding Back Current AI Models

The central challenge for current AI is not merely sample efficiency but a more profound failure to generalize. Models generalize 'dramatically worse than people,' which is the root cause of their brittleness, inability to learn from nuanced instruction, and unreliability compared to human intelligence. Solving this is the key to the next paradigm.

Dwarkesh and Ilya Sutskever on What Comes After Scaling

The a16z Show·7 months ago

AI Fails at Simple Tasks Like "Buy Shoes" Due to Missing "Reasoning Chains"

Seemingly simple user requests require a complex sequence of reasoning, tool use, and contextual understanding that is absent from internet training data. AI must be explicitly taught the implicit logic of how a human assistant would research preferences, evaluate options, and use various tools.

Turing CEO Jonathan Siddharth - The $30 Trillion Knowledge Work Market, Training Frontier AI Models and Building Stage Five Culture

"World of DaaS"·5 months ago
