By training on a trillion action tokens from video game controller and keyboard inputs, General Intuition is creating AIs that can operate any system with a similar interface. This novel approach allows their models to control robots and industrial machines as if they were playing a video game.

Related Insights

To build generalist robots, the most effective approach is pre-training foundation models on internet-scale video datasets, not just simulation or tele-operated data. This vast, diverse data provides a deep, implicit understanding of physics and object interaction that is impossible to replicate in controlled environments, enabling true generalization.

The Physical Intelligence thesis is that a foundation model learning from diverse data can achieve a "physical understanding" of the world, making it easier to adapt to new tasks than building single-purpose robots from scratch. A general model can draw on far broader data than a specialized one, which ultimately makes the approach more scalable.

A bot that plays Minecraft by generating text prompts for the GPT-4 API has become a best-in-class robotic planning system. This novel approach suggests that specialized, standalone planning systems for robots could be replaced by interacting with a general-purpose LLM.

General Intuition (GI) is not trying to solve robotics in general. Its strategy is to focus on robots whose actions can be mapped to a game controller. This constraint dramatically simplifies the problem: foundation models trained on gaming data become directly applicable, and the burden on robotics companies shifts from expensive pre-training to more manageable fine-tuning.

Large Language Models are limited because they lack an understanding of the physical world. The next evolution is 'World Models'—AI trained on real-world sensory data to understand physics, space, and context. This is the foundational technology required to unlock physical AI like advanced robotics.

GI's founder argues game footage is a superior data source for spatial reasoning compared to real-world videos. Gaming directly links visual perception to hand-eye motor control ("simulating optical dynamics with your hand"), avoiding the information loss inherent in interpreting passive video, which requires solving for pose estimation and inverse dynamics.

To protect user privacy, GI's system translates raw keyboard inputs (e.g., 'W' key) into their corresponding in-game actions (e.g., 'move forward'). This privacy-by-design approach has a key ML benefit: it removes noisy, user-specific key bindings and provides a standardized, canonical action space for training more generalizable agents.
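The key-to-action translation described above can be illustrated with a minimal sketch. This is not GI's implementation; the function `canonicalize`, the binding tables, and action names other than "move forward" are hypothetical and chosen only for illustration. It shows how two players with different key bindings end up producing the same canonical action stream, while keys with no bound action are dropped.

```python
from typing import Dict, List, Tuple

# Hypothetical types: a raw event is (timestamp_seconds, key_name),
# and a binding table maps an upper-cased key name to a canonical action.
RawEvent = Tuple[float, str]
ActionEvent = Tuple[float, str]
Bindings = Dict[str, str]


def canonicalize(events: List[RawEvent], bindings: Bindings) -> List[ActionEvent]:
    """Translate raw key events into canonical in-game actions.

    Keys with no bound action (for example, letters typed into chat) are
    dropped, so only game actions are kept in the output.
    """
    actions: List[ActionEvent] = []
    for timestamp, key in events:
        action = bindings.get(key.upper())
        if action is not None:
            actions.append((timestamp, action))
    return actions


# Two hypothetical players with different personal key bindings.
wasd_player: Bindings = {"W": "move_forward", "S": "move_backward", "SPACE": "jump"}
esdf_player: Bindings = {"E": "move_forward", "D": "move_backward", "SPACE": "jump"}

stream_a = canonicalize([(0.0, "w"), (0.5, "space"), (0.7, "t")], wasd_player)
stream_b = canonicalize([(0.0, "e"), (0.5, "space"), (0.7, "t")], esdf_player)

# Both players yield the same canonical stream despite different raw keys.
print(stream_a == stream_b)
```

Because the model only ever sees the canonical names, user-specific bindings do not leak into the training data, and unbound keystrokes are discarded before anything is stored.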

The computer serves as a universal actuator for human work across diverse environments. This makes screen recordings an existing, large-scale dataset perfectly suited for pre-training base models for agency. This approach aims to create a foundational model for action by replicating human input (keystrokes, mouse moves) and output.

Instead of using traditional, rule-based simulators, Comma AI trains its driving agent inside a learned "world model." This generative model creates photorealistic, diverse driving scenarios and, crucially, responds accurately to the agent's simulated actions—a key requirement for effective robotics training.

General Intuition's core bet is that the transfer from simulated to physical worlds is unlocked by a shared action interface. Since many real-world robots, such as drones and robotic arms, are already operated with game controllers, an agent trained in diverse gaming environments only needs to adapt to a new visual world, not an entirely new action space.
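A rough sketch of what a shared action interface could look like follows. All names here (`ControllerAction`, `to_game_input`, `to_drone_command`, and the speed and yaw limits) are assumptions made for illustration, not anything described in the source. The point is only that the same controller-shaped action can be consumed by a game and by a robot, so the policy's output format does not change between the two.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ControllerAction:
    """Hypothetical two-stick controller state; each axis lies in [-1, 1]."""
    left_x: float
    left_y: float
    right_x: float
    right_y: float


def to_game_input(action: ControllerAction) -> Dict[str, float]:
    """Interpret the action the way a first-person game might."""
    return {
        "strafe": action.left_x,
        "forward": action.left_y,
        "look_yaw": action.right_x,
        "look_pitch": action.right_y,
    }


def to_drone_command(
    action: ControllerAction,
    max_speed_mps: float = 5.0,      # assumed limit, for illustration only
    max_yaw_rate_dps: float = 90.0,  # assumed limit, for illustration only
) -> Dict[str, float]:
    """Scale the same stick values into velocity and yaw-rate commands."""
    return {
        "forward_mps": action.left_y * max_speed_mps,
        "lateral_mps": action.left_x * max_speed_mps,
        "yaw_dps": action.right_x * max_yaw_rate_dps,
    }


# One policy output, two consumers: only the downstream adapter differs.
action = ControllerAction(left_x=0.0, left_y=1.0, right_x=0.5, right_y=0.0)
game_input = to_game_input(action)
drone_command = to_drone_command(action)
print(game_input)
print(drone_command)
```

In this framing, moving from a game to a robot changes the observations the agent receives and the adapter behind the controller, but not the action space the agent was trained to produce.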

General Intuition Uses Gaming Data to Create AIs That 'Play the World'