Current LLM agents are effective at executing and optimizing experiments within a defined research track, like hyperparameter tuning. However, they lack the crucial scientific skill of 'lateral thinking'—recognizing when a research path is a dead end and strategically pivoting to a fundamentally new approach.
Frontier labs like OpenAI are now focused on building autonomous AI agents capable of conducting research and running experiments. This "auto researcher" is seen as the "final boss battle" because it would accelerate AI development itself.
The LLM boom was a 'shortcut' that mined intelligence from existing human data, and that approach has limits. To achieve novel breakthroughs beyond that corpus, the field is re-integrating the original DeepMind philosophy of agents learning through interaction (such as reinforcement learning) to generate truly new knowledge.
While powerful, Shopify's auto-research tool has limitations. It excels at performing tasks that are "obvious" but tedious for humans, like finding derivative datasets or suboptimal code. However, it's not yet capable of generating completely out-of-the-box solutions that require deep, multi-day thinking.
AI agents have become proficient at following a pre-defined strategy to execute tasks. The next major frontier, and a significant bottleneck, is the ability to explore open-ended environments and generate novel strategies independently. This is the core capability that benchmarks like ARC AGI v3 are designed to test.
Generating truly novel and valid scientific hypotheses requires a specialized, multi-stage AI process. This involves using a reasoning model for idea generation, a literature-grounded model for validation, and a third system for checking originality against existing research. This layered approach overcomes the limitations of a single, general-purpose LLM.
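A minimal sketch of how such a layered pipeline might be wired together. The stage functions here (generate_candidates, is_scientifically_valid, is_novel) are hypothetical stubs standing in for a reasoning model, a literature-grounded validator, and an originality check; the structure, not the stubs, is the point:

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    text: str

def generate_candidates(topic: str, n: int = 5) -> list[Hypothesis]:
    """Stage 1: a reasoning model proposes candidate hypotheses (stubbed)."""
    return [Hypothesis(text=f"candidate {i} about {topic}") for i in range(n)]

def is_scientifically_valid(h: Hypothesis) -> bool:
    """Stage 2: a literature-grounded model checks plausibility (stubbed)."""
    return True

def is_novel(h: Hypothesis) -> bool:
    """Stage 3: an originality check against existing research (stubbed)."""
    return True

def propose_hypotheses(topic: str) -> list[Hypothesis]:
    """Chain the three stages: generate, validate, then filter for novelty."""
    candidates = generate_candidates(topic)
    validated = [h for h in candidates if is_scientifically_valid(h)]
    return [h for h in validated if is_novel(h)]
```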
In high-stakes fields like pharma, AI's ability to generate more ideas (e.g., drug targets) is less valuable than its ability to aid in decision-making. Physical constraints on experimentation mean you can't test everything. The real need is for tools that help humans evaluate, prioritize, and gain conviction on a few key bets.
Unlike humans, who have an intuitive sense of when to stop searching, agents can get stuck in expensive, fruitless loops trying to find information that may not exist. Teaching models the judgment to abandon a task is a new and vital frontier for reliable agentic AI.
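One way to approximate that judgment today is to make the give-up criteria explicit. Below is a minimal sketch, assuming a hypothetical fetch_page callable; the step budget, wall-clock budget, and patience counter stand in for the learned stopping behavior models currently lack:

```python
import time

def bounded_search(query, fetch_page, max_steps=10, max_seconds=30.0, patience=3):
    """Search with explicit give-up criteria: a step budget, a wall-clock
    budget, and a patience counter that abandons the task after several
    consecutive steps yield nothing new."""
    start = time.monotonic()
    seen, fruitless = set(), 0
    for step in range(max_steps):
        if time.monotonic() - start > max_seconds:
            return None                    # wall-clock budget exhausted
        result = fetch_page(query, step)   # hypothetical retrieval call
        if result is None or result in seen:
            fruitless += 1
            if fruitless >= patience:
                return None                # conclude the info likely doesn't exist
            continue
        seen.add(result)
        fruitless = 0
        if query in result:                # crude success test for the sketch
            return result
    return None                            # step budget exhausted
```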
A major frontier for AI in science is developing 'taste'—the human ability to discern not just if a research question is solvable, but if it is genuinely interesting and impactful. Models currently struggle to differentiate an exciting result from a boring one.
After two decades of experience and careful hand-tuning of a model, Karpathy was surprised when his automated research agent, running overnight, discovered superior hyperparameter configurations he had missed. This shows AI's power to surpass deep human expertise on objective optimization tasks.
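For intuition, this is roughly what such an unattended sweep does. A minimal random-search sketch, where train_and_score is an illustrative stand-in for a real training run (the smooth scoring surface is invented for the example):

```python
import math
import random

def train_and_score(lr: float, weight_decay: float) -> float:
    """Stand-in for a real training run; this illustrative surface
    peaks near lr=1e-3 and weight_decay=1e-4."""
    return -((math.log10(lr) + 3) ** 2 + (math.log10(weight_decay) + 4) ** 2)

def overnight_sweep(trials: int = 200, seed: int = 0):
    """Random search over log-uniform ranges: cheap to run unattended and
    able to visit corners of the space a hand-tuner never tries."""
    rng = random.Random(seed)
    best_score, best_config = float("-inf"), None
    for _ in range(trials):
        lr = 10 ** rng.uniform(-5, -1)
        wd = 10 ** rng.uniform(-6, -2)
        score = train_and_score(lr, wd)
        if score > best_score:
            best_score, best_config = score, {"lr": lr, "weight_decay": wd}
    return best_config, best_score

print(overnight_sweep())
```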
Current LLMs fail at science because they lack the ability to iterate. True scientific inquiry is a loop: form a hypothesis, conduct an experiment, analyze the result (even when it refutes the hypothesis), and refine. AI needs this same iterative capability with the real world to make genuine discoveries.
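A minimal sketch of that loop's structure, with hypothetical run_experiment and refine callables standing in for the real-world experimentation current models lack:

```python
def research_loop(hypothesis, run_experiment, refine, max_iterations=10):
    """The scientific loop: hypothesize, experiment, analyze, refine.
    run_experiment returns an observation dict; refine updates the
    hypothesis in light of it, even when the result is a refutation."""
    history = []
    for _ in range(max_iterations):
        observation = run_experiment(hypothesis)  # hypothetical real-world step
        history.append((hypothesis, observation))
        if observation.get("confirmed"):
            return hypothesis, history
        # A negative result is still information: refine, don't discard.
        hypothesis = refine(hypothesis, observation)
    return hypothesis, history
```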