We scan new podcasts and send you the top 5 insights daily.
A pure 'pixels in, actions out' model is insufficient for full autonomy: while easy to start with, it is extremely inefficient to simulate and validate for safety-critical edge cases. Waymo therefore augments its end-to-end learning with structured, intermediate representations (like objects and road concepts), which provide crucial knobs for scalable simulation, safety validation, and defining reward functions.
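The idea above can be sketched as a multi-head network: one shared backbone, with a trajectory head (the 'actions out' path) plus an auxiliary head that exposes structured intermediates a simulator or validator can inspect. This is a toy numpy illustration, not Waymo's architecture; all shapes and names are invented.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes for illustration: 128-dim scene features in,
# a 10-waypoint trajectory and 5 object-presence logits out.
FEAT, WAYPOINTS, OBJECTS = 128, 10, 5

class DriverWithIntermediates:
    """Toy multi-head model: one shared backbone, two outputs.

    The trajectory head is the end-to-end 'actions out' path; the
    object head surfaces a structured intermediate (detected objects)
    that simulation and safety validation can hook into directly.
    """
    def __init__(self):
        self.backbone = rng.normal(size=(FEAT, 64)) * 0.1
        self.traj_head = rng.normal(size=(64, WAYPOINTS * 2)) * 0.1
        self.obj_head = rng.normal(size=(64, OBJECTS)) * 0.1

    def forward(self, scene_feat):
        h = np.tanh(scene_feat @ self.backbone)
        trajectory = (h @ self.traj_head).reshape(WAYPOINTS, 2)  # (x, y) per step
        object_logits = h @ self.obj_head  # e.g. pedestrian, cyclist, ...
        return trajectory, object_logits

model = DriverWithIntermediates()
traj, objs = model.forward(rng.normal(size=FEAT))
```

The point of the second head is not the driving output itself but the handle it gives you: a reward function or safety check can be written against `object_logits` without reverse-engineering the pixels-to-actions path.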
The move from Waymo's 4th to 5th generation driver was a discontinuous jump. Waymo abandoned smaller, specialized ML models for a single AI backbone trained on a massive, nationwide dataset. This generalizable stack, rather than city-specific tuning, enabled its recent rapid scaling across the US.
Waymo demonstrated that a standard Vision Language Model (VLM) can be fine-tuned to output driving trajectories instead of text. While unsafe for public roads, it drives 'pretty darn well' in normal conditions, showing the surprising generalizability of foundational vision-language understanding.
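Mechanically, "output driving trajectories instead of text" can be as simple as decoding waypoints from the model's token stream. The format below is hypothetical, purely to show the decoding step; the source does not specify Waymo's actual output encoding.

```python
# Hypothetical output format: the fine-tuned VLM emits waypoints as
# plain text, "x,y; x,y; ...", instead of a natural-language answer.
def decode_trajectory(vlm_output: str) -> list[tuple[float, float]]:
    """Parse a semicolon-separated waypoint string into (x, y) pairs."""
    waypoints = []
    for pair in vlm_output.strip().split(";"):
        x, y = (float(v) for v in pair.split(","))
        waypoints.append((x, y))
    return waypoints

path = decode_trajectory("0.0,0.0; 1.2,0.1; 2.5,0.3")
```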
Waymo’s system starts with a large, off-board foundation model that understands the physical world. This is specialized into three high-capacity 'teacher' models: the Driver, the Simulator, and the Critic. The teachers then distill their knowledge into smaller, efficient 'student' models that run in real time on the vehicle, balancing massive off-board compute with on-device constraints.
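The teacher-to-student step is a form of knowledge distillation: the small model is trained on the big model's outputs rather than on raw labels. A minimal numpy sketch, assuming linear models and plain least-squares regression purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy distillation: a large 'teacher' linear map is compressed into a
# smaller 'student' by regressing on the teacher's outputs.
teacher_W = rng.normal(size=(32, 4))               # off-board, high capacity
student_W = np.zeros((8, 4))                       # in-car, low capacity
project = rng.normal(size=(32, 8)) / np.sqrt(32)   # feature compression

X = rng.normal(size=(256, 32))
targets = X @ teacher_W        # teacher predictions act as labels
Xs = X @ project               # student sees compressed features

for _ in range(500):           # plain gradient descent on squared error
    grad = Xs.T @ (Xs @ student_W - targets) / len(X)
    student_W -= 0.05 * grad

err = np.mean((Xs @ student_W - targets) ** 2)
```

The student cannot match the teacher exactly (it has fewer parameters), but it gets close enough to run under real-time, on-device constraints, which is the trade-off the insight describes.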
Rivian's CEO explains that early autonomous systems, which were based on rigid rules-based "planners," have been superseded by end-to-end AI. This new approach uses a large "foundation model for driving" that can improve continuously with more data, breaking through the performance plateau of the older method.
The AI's ability to handle novel situations isn't just an emergent property of scale. Wayve actively trains "world models," which are internal generative simulators. This enables the AI to reason about what might happen next, leading to sophisticated behaviors like nudging into intersections or slowing in fog.
Instead of simulating photorealistic worlds, robotics firm Flexion trains its models on simplified, abstract representations. For example, it uses perception models like Segment Anything to 'paint' a door red and its handle green. By training on this simplified abstraction, the robot learns the core task (opening doors) in a way that generalizes across all real-world doors, bypassing the need for perfect simulation.
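The 'painting' step amounts to replacing photorealistic pixels with a flat, task-relevant color per segment. A toy numpy version, with a hand-made mask standing in for the output of a perception model like Segment Anything (the 4x4 grid and color choices are illustrative only):

```python
import numpy as np

# Flat colors for the abstraction: door panel red, handle green,
# everything else gray.
GRAY, RED, GREEN = (128, 128, 128), (255, 0, 0), (0, 255, 0)

def paint_abstraction(mask: np.ndarray) -> np.ndarray:
    """mask: HxW ints (0=background, 1=door, 2=handle) -> HxWx3 image."""
    palette = np.array([GRAY, RED, GREEN], dtype=np.uint8)
    return palette[mask]   # numpy integer indexing maps each id to a color

mask = np.zeros((4, 4), dtype=int)
mask[1:4, 1:3] = 1   # door panel
mask[2, 2] = 2       # handle
abstract_img = paint_abstraction(mask)
```

Every real door, whatever its texture or lighting, collapses to the same red-panel/green-handle image, which is exactly why a policy trained on the abstraction transfers across doors.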
Instead of using traditional, rule-based simulators, Comma AI trains its driving agent inside a learned "world model." This generative model creates photorealistic, diverse driving scenarios and, crucially, responds accurately to the agent's simulated actions—a key requirement for effective robotics training.
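The key property named above (the simulator responds to the agent's actions) is what closes the training loop. A toy closed-loop rollout, with a linear function standing in for the learned world model; nothing here reflects Comma's actual models:

```python
import numpy as np

rng = np.random.default_rng(2)

def world_model(obs, action):
    """Stand-in for a learned generative simulator: the next
    observation depends on the agent's action, not just on time."""
    return 0.9 * obs + 0.5 * action + rng.normal(scale=0.01, size=obs.shape)

def agent_policy(obs):
    return -0.4 * obs   # toy policy: steer the state toward zero

obs = np.ones(3)
for _ in range(50):     # the entire rollout happens inside the world model
    obs = world_model(obs, agent_policy(obs))
```

If the simulator ignored `action`, the agent could never learn that its choices matter; action-conditioned dynamics are what make a learned world model usable for training rather than just for generating video.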
Comma AI's architecture is "end-to-end," meaning its model takes raw video and directly outputs driving commands like acceleration and steering angle. This avoids the traditional, more brittle pipeline of separately detecting lanes, traffic lights, and other objects as intermediate steps before planning a path.
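In code, "end-to-end" means the interface is literally frames in, control scalars out, with no lane or object detection stage in between. A minimal sketch with invented shapes (not Comma's real model):

```python
import numpy as np

rng = np.random.default_rng(3)

H, W, C = 32, 64, 3   # illustrative frame size

class EndToEndPolicy:
    """Raw frame -> (acceleration, steering), no intermediate
    detection pipeline."""
    def __init__(self):
        self.W1 = rng.normal(size=(H * W * C, 16)) * 0.01
        self.W2 = rng.normal(size=(16, 2)) * 0.1   # -> (accel, steer)

    def act(self, frame: np.ndarray) -> tuple[float, float]:
        h = np.maximum(frame.reshape(-1) @ self.W1, 0.0)  # ReLU features
        accel, steer = h @ self.W2
        return float(accel), float(steer)

policy = EndToEndPolicy()
accel, steer = policy.act(rng.random((H, W, C)))
```

The contrast with the brittle pipeline is in the signature: there is no `detect_lanes()` or `classify_lights()` call whose failure could silently poison the planner downstream.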