Once a foundation model solves the core "intelligence" problem, the barrier to entry for creating novel robotic applications and form factors will drop dramatically. This will enable a "Cambrian explosion" of hardware creativity, because builders will no longer need to solve AI from scratch for each new idea.
A flashy robot demo typically uses a highly controlled, pristine environment tailored to one task. True progress lies in a robot performing a mundane task reliably in any novel situation—a feat of generalization that is much harder to showcase visually and less exciting to a layperson.
The two greatest AI achievements are generative AI (mimicking human knowledge) and deep reinforcement learning (discovering superhuman strategies). The grand challenge, and the future of AI, is to fuse these two threads into a single system that can both leverage existing knowledge and innovate beyond it.
Instead of loading robots with costly sensors for touch or force, powerful learning models can infer physical properties from simple cameras. A wrist camera can act as a "touch sensor in disguise" by observing local deformations, dramatically lowering hardware costs and complexity for scalable robotics.
According to Moravec's paradox, tasks that are deeply ingrained in human evolution, especially nuanced physical and social interaction with other people (like childcare or elder care), will be the final frontier for robotics. These intuitive, high-stakes tasks are far more complex than structured industrial challenges.
For unpredictable situations where a robot has no prior training data (e.g., a "gas leak" sign), multimodal LLMs can provide the necessary world knowledge to reason and act appropriately. This solves the long-standing robotics problem of how to handle the long tail of real-world scenarios.
The Physical Intelligence thesis is that a foundation model learning from diverse data can achieve a "physical understanding" of the world, making it easier to adapt to new tasks than building single-purpose robots from scratch. Generality leverages broader data, which is ultimately a more scalable approach.
Robots have become so capable at low-level physical tasks that the primary bottleneck has shifted to "mid-level reasoning": interpreting a scene and choosing the correct next action. This shift is a major breakthrough, because improvement can now come from high-level language-based coaching, not just from more physical demonstration data.
Neurological studies show that the human brain maps a tool's tip as if it were part of the hand. This implies that a powerful physical intelligence should not be tied to a specific body (e.g., a humanoid) but should be a general "brain" capable of controlling any embodiment, from a bulldozer to a multi-fingered hand.
A core controversy in robotics is whether to follow AI's "bitter lesson": that general methods which scale with data and compute outperform systems built on hand-coded knowledge. Many roboticists still argue for programming in physics for reliability, resisting a purely end-to-end learning approach that relies solely on data.
![Sergey Levine - Building LLMs for the Physical World - [Invest Like the Best, EP.465]](https://megaphone.imgix.net/podcasts/fdcd4328-2c77-11f1-a72b-977309fd08f1/image/b1bb4368d6e13a4a804924681ffe3ab1.jpg?ixlib=rails-4.3.1&max-w=3000&max-h=3000&fit=crop&auto=format,compress)