Vision-Language-Action models (VLAs) have not yet produced a 'ChatGPT moment' for robotics. Consequently, investor enthusiasm and capital are increasingly flowing towards the alternative 'World Model' approach, which learns physics from video, even though it has yet to demonstrate superior tangible results.

Related Insights

Insiders in top robotics labs are witnessing fundamental breakthroughs. These “signs of life,” while rudimentary now, are clear precursors to a rapid transition from research to widely adopted products, much like AI before ChatGPT’s public release.

While language models understand the world through text, Demis Hassabis argues they lack an intuitive grasp of physics and spatial dynamics. He sees 'world models'—simulations that understand cause and effect in the physical world—as the critical technology needed to advance AI from digital tasks to effective robotics.

Startups and major labs are focusing on "world models," which simulate physical reality and its cause-and-effect dynamics. These models are seen as the necessary step beyond text-based LLMs for creating agents that can truly understand and interact with the physical world, and as a key milestone on the path to AGI.

A key trend to watch is the rise of Vision-Language-Action (VLA) models, which are critical for robotics. These models take an instruction (language), understand a scene (vision), and then manipulate the environment (action). This represents a new paradigm that combines "read" and "write" access to the physical world, and it often requires edge-ready compute.

New AI lab Odyssey is not building a direct robot controller. Instead, its 'foundation world model' acts as a general-purpose 'physics engine' for AI, learning the rules of reality from data. This foundational layer can then be licensed and used by other companies to build their specific action-oriented robot models.

Large Language Models are limited because they lack an understanding of the physical world. The next evolution is 'World Models'—AI trained on real-world sensory data to understand physics, space, and context. This is the foundational technology required to unlock physical AI like advanced robotics.

Physical Intelligence demonstrated an emergent capability where its robotics model, after reaching a certain performance threshold, significantly improved by training on egocentric human video. This solves a major bottleneck by leveraging vast, existing video datasets instead of expensive, limited teleoperated data.

The robotics field has a scalable recipe for AI-driven manipulation (like GPT), but hasn't yet scaled it into a polished, mass-market consumer product (like ChatGPT). The current phase focuses on scaling data and refining systems, not just fundamental algorithm discovery, to bridge this gap.

Neurobotics posits that true physical AI requires more than just vision-language models; it needs a "nervous system" and reflexes. They advocate for training robots in physical "gyms" to collect embodied data, arguing that complex physical tasks cannot be learned solely by watching videos.

Dr. Fei-Fei Li, a leading AI scientist, believes world models are deeply underappreciated. The reason isn't a lack of vision but the sheer novelty and technical difficulty of the field. As the "next frontier of AI," it hasn't had time to mature or be understood by the broader market in the way that LLMs have.