Surgeons perform intricate tasks without tactile feedback, relying on visual cues of tissue deformation. This suggests robotics could achieve complex manipulation by advancing visual interpretation of physical interactions, bypassing the immense difficulty of creating and integrating artificial touch sensors.
Ken Goldberg's company, Ambi Robotics, successfully uses simple suction cups for logistics. He argues that the industry's focus on human-like hands is misplaced: simpler grippers are more practical and reliable, and are already performing immensely complex tasks today.
Drawing a parallel to the Cambrian Explosion, where vision evolved alongside nervous systems, Fei-Fei Li argues that perception's primary purpose is to enable action and interaction. This principle suggests that for AI to advance, particularly in robotics, computer vision must be developed as the foundation for embodied intelligence, not just for classification.
Leading roboticist Ken Goldberg clarifies that while legged robots show immense progress in navigation, fine motor skills for tasks like tying shoelaces are far beyond current capabilities. This is due to challenges in sensing and handling deformable, unpredictable objects in the real world.
Physical Intelligence demonstrated an emergent capability where its robotics model, after reaching a certain performance threshold, significantly improved by training on egocentric human video. This solves a major bottleneck by leveraging vast, existing video datasets instead of expensive, limited teleoperated data.
While autonomous driving is complex, roboticist Ken Goldberg argues it's an easier problem than dexterous manipulation. Driving fundamentally involves avoiding contact with objects, whereas manipulation requires precisely controlled contact and interaction with them, a much harder challenge.
Ken Goldberg quantifies the challenge: the text data used to train LLMs would take a human 100,000 years to read. Equivalent data for robot manipulation (vision-to-control signals) doesn't exist online and must be generated from scratch, explaining the slower progress in physical AI.
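A back-of-envelope check makes the figure plausible. The corpus size and reading speed below are illustrative assumptions, not numbers from the source:

```python
# Rough sanity check: how long would a human take to read an LLM-scale text corpus?
# Both constants are assumed for illustration, not taken from Goldberg's talk.
CORPUS_WORDS = 13e12       # ~13 trillion words, roughly frontier-scale training text
WORDS_PER_MINUTE = 250     # typical adult reading speed

minutes = CORPUS_WORDS / WORDS_PER_MINUTE
years = minutes / (60 * 24 * 365)   # minutes -> years of nonstop reading
print(f"{years:,.0f} years")        # on the order of 100,000 years
```

Under these assumptions the total comes out near 100,000 years of uninterrupted reading, which is the scale of data that simply has no analogue for vision-to-control signals.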
Instead of simulating photorealistic worlds, robotics firm Flexion trains its models on simplified, abstract representations. For example, it uses perception models like Segment Anything to 'paint' a door red and its handle green. By training on this simplified abstraction, the robot learns the core task (opening doors) in a way that generalizes across all real-world doors, bypassing the need for perfect simulation.
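The painted-abstraction idea can be sketched in a few lines. In practice the masks would come from a perception model such as Segment Anything; the toy rectangular masks, shapes, and colors below are illustrative assumptions:

```python
import numpy as np

# Sketch: replace raw pixels with flat colors keyed to task-relevant segments,
# so the policy sees the same abstract 'door' regardless of real-world appearance.
H, W = 64, 64
door_mask = np.zeros((H, W), dtype=bool)
door_mask[8:56, 16:48] = True        # toy rectangle standing in for the door segment
handle_mask = np.zeros((H, W), dtype=bool)
handle_mask[30:34, 40:46] = True     # small patch standing in for the handle segment

def paint_abstraction(door, handle, shape):
    """Build the simplified observation the policy would train on."""
    obs = np.zeros((*shape, 3), dtype=np.uint8)  # black background
    obs[door] = (255, 0, 0)                      # door painted red
    obs[handle] = (0, 255, 0)                    # handle painted green (on top of door)
    return obs

obs = paint_abstraction(door_mask, handle_mask, (H, W))
print(obs.shape)  # (64, 64, 3): same image size, but only three flat colors remain
```

Because every door in every environment reduces to the same red-rectangle-with-green-handle observation, the policy's input distribution stays narrow even as the real-world visual variety is unbounded.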
World Labs co-founder Fei-Fei Li posits that spatial intelligence—the ability to reason and interact in 3D space—is a distinct and complementary form of intelligence to language. This capability is essential for tasks like robotic manipulation and scientific discovery that cannot be reduced to linguistic descriptions.
Self-driving cars, a 20-year journey so far, are relatively simple robots: metal boxes on 2D surfaces designed *not* to touch things. General-purpose robots operate in complex 3D environments with the primary goal of *touching* and manipulating objects. This highlights the immense, often underestimated, physical and algorithmic challenges facing robotics.
Classical robots required expensive, rigid, and precise hardware because they were blind. Modern AI perception acts as 'eyes', allowing robots to correct for inaccuracies in real-time. This enables the use of cheaper, compliant, and inherently safer mechanical components, fundamentally changing hardware design philosophy.
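The mechanism behind this shift is closed-loop visual servoing: because the error is re-observed every cycle, hardware imprecision gets absorbed by feedback rather than engineered away. A minimal one-dimensional sketch, with all gains and error figures assumed for illustration:

```python
def visual_servo(start, target, gain=0.5, actuator_scale=0.8, steps=30):
    """One-axis visual servo loop: observe position, command a proportional
    correction, let imprecise hardware execute it, repeat.
    actuator_scale models a cheap actuator that realizes only 80% of each
    commanded motion; the visual feedback absorbs that error over the loop."""
    pos = start
    for _ in range(steps):
        error = target - pos               # measured by the camera each cycle
        pos += actuator_scale * gain * error
    return pos

print(visual_servo(start=0.0, target=10.0))  # converges very close to 10.0
```

A blind open-loop robot would need `actuator_scale` to be nearly exact; here the per-step error shrinks geometrically regardless, which is why compliant, low-cost mechanisms become viable once perception closes the loop.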