Sora doesn't process pixels or frames individually. Instead, it uses "space-time tokens": small cuboids of video data that combine spatial and temporal information. This voxel-like representation is the fundamental unit, and global attention across these tokens is what lets the model capture properties like object permanence.
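As a rough illustration (not Sora's actual implementation, and with made-up patch sizes), carving a video tensor into such cuboids and flattening each one into a token might look like this:

```python
import numpy as np

def spacetime_patches(video, t=4, p=16):
    """Split a video of shape (T, H, W, C) into cuboids of shape (t, p, p, C),
    then flatten each cuboid into one token. Sizes are illustrative only."""
    T, H, W, C = video.shape
    assert T % t == 0 and H % p == 0 and W % p == 0
    patches = video.reshape(T // t, t, H // p, p, W // p, p, C)
    patches = patches.transpose(0, 2, 4, 1, 3, 5, 6)  # group each cuboid's dims together
    return patches.reshape(-1, t * p * p * C)         # one row per space-time token

video = np.random.rand(16, 64, 64, 3)    # 16 frames of 64x64 RGB
print(spacetime_patches(video).shape)    # (64, 3072): 4x4x4 cuboids, each flattened
```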
For AI, high-quality animation is a more impressive feat than photorealism because the training data is extremely scarce: thousands of hours of animation versus millions of hours of live-action video. Sora 2's success here suggests a fundamental improvement in learning efficiency, not just a brute-force data advantage.
Unlike video models that generate frame-by-frame, Marble natively outputs Gaussian splats—tiny, semi-transparent particles. This data structure enables real-time rendering, interactive editing, and precise camera control on client devices like mobile phones, a fundamental architectural advantage for interactive 3D experiences.
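For a sense of what that representation looks like, here is a minimal sketch of a single splat following the common 3D Gaussian Splatting convention (Marble's internal format may differ):

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class GaussianSplat:
    """One semi-transparent 3D Gaussian (common 3DGS layout; Marble's may differ)."""
    mean: np.ndarray      # (3,) center position in world space
    scale: np.ndarray     # (3,) per-axis extent of the ellipsoid
    rotation: np.ndarray  # (4,) orientation as a unit quaternion
    opacity: float        # alpha in [0, 1]
    color: np.ndarray     # (3,) RGB; real systems often store spherical harmonics

# A scene is just a flat collection of millions of splats. A GPU rasterizer can
# sort and alpha-blend them every frame, which is why rendering is real-time and
# editing (move, delete, recolor a splat) is direct, even on a phone.
scene = [GaussianSplat(np.zeros(3), np.full(3, 0.1),
                       np.array([1.0, 0, 0, 0]), 0.8, np.array([0.9, 0.2, 0.2]))]
```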
Sora 2's most significant advancement is not its visual quality, but its ability to understand and simulate physics. The model accurately portrays how water splashes or vehicles kick up snow, demonstrating a grasp of cause and effect crucial for true world-building.
Large language models are insufficient for tasks requiring real-world interaction and spatial understanding, like robotics or disaster response. World models provide the missing piece by generating interactive 3D environments that an agent can explore and reason about. They represent a foundational shift from language-based AI to a more holistic, spatially intelligent AI.
The Sora team views video as having lower "intelligence per bit" compared to text. However, the total volume of available video data is vastly larger and less tapped. This suggests that, unlike LLMs facing a data crunch, video models can scale with more data for a very long time.
A common misconception is that Transformers are sequential models like RNNs. Fundamentally, they are permutation-equivariant and operate on sets of tokens; sequence information is injected artificially via positional embeddings. That makes the architecture inherently flexible for non-sequential data such as 3D scenes, graphs, and other unordered collections, a point many practitioners overlook.
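A quick way to see this is a toy single-head attention with random weights and no positional embeddings: permuting the input tokens merely permutes the output rows.

```python
import numpy as np

def self_attention(X, seed=0):
    """Single-head self-attention, random weights, no positional embeddings."""
    d = X.shape[-1]
    rng = np.random.default_rng(seed)
    Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)                 # row-wise softmax
    return w @ V

X = np.random.default_rng(1).standard_normal((5, 8))   # 5 tokens, 8 dims, no inherent order
perm = np.random.default_rng(2).permutation(5)
print(np.allclose(self_attention(X)[perm], self_attention(X[perm])))  # True
```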
Traditional video models generate an entire clip at once, so nothing can be shown until the whole clip is finished. Descartes' Mirage model is autoregressive, predicting only the next frame from the input stream and previously generated frames. This LLM-like approach is what enables its real-time, low-latency performance.
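Conceptually (this is a generic autoregressive loop, not Mirage's actual API), the pattern looks like:

```python
import numpy as np

def generate_stream(predict_next_frame, input_stream, context_len=8):
    """Generic autoregressive video loop: each output frame is predicted from the
    live input plus prior outputs and emitted immediately, instead of waiting
    for a whole clip. Illustrative only."""
    history = []
    for frame in input_stream:                       # frames arrive one at a time
        context = (history + [frame])[-context_len:]
        next_frame = predict_next_frame(context)     # one model call per frame
        history.append(next_frame)
        yield next_frame                             # available right away -> low latency

fake_model = lambda ctx: np.mean(ctx, axis=0)        # toy stand-in for the model
stream = (np.random.rand(64, 64, 3) for _ in range(5))
for out in generate_stream(fake_model, stream):
    print(out.shape)                                 # (64, 64, 3) after each input frame
```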
Current multimodal models shoehorn visual data into the 1D token sequences designed for text. True spatial intelligence is different: it requires a native 3D/4D representation to understand a world governed by physics, not just human-generated language. This is a foundational architectural shift, not an extension of LLMs.