OpenAI intentionally releases powerful technologies like Sora in stages, viewing it as the "GPT-3.5 moment for video." This approach avoids "dropping bombshells" and allows society to gradually understand, adapt to, and establish norms for the technology's long-term impact.
The long-term vision for the Sora app extends beyond entertainment. The "Cameo" feature is the first, low-bandwidth step toward creating detailed user profiles. The goal is an "alternate reality" where digital clones can interact, perform knowledge work, and run simulations.
Learning from Instagram's evolution towards passive consumption, the Sora team intentionally designs its social feed to inspire creation, not just scrolling. This fundamentally changes the platform's incentives and is proving successful, with high rates of daily active creation and posting.
The OpenAI team believes generative video won't just create traditional feature films more easily. It will give rise to entirely new mediums and creator classes, much like the film camera created cinema, a medium distinct from the recorded stage plays it was first used for.
A key advancement in Sora 2 is its failure mode. When a generated agent fails at a task (e.g., a basketball player missing a shot), the model simulates a physically plausible consequence (the ball bouncing off the rim) rather than forcing an unrealistic success. This points to a deeper, more robust internal world model.
The Sora team views video as having lower "intelligence per bit" compared to text. However, the total volume of available video data is vastly larger and less tapped. This suggests that, unlike LLMs facing a data crunch, video models can scale with more data for a very long time.
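A back-of-envelope comparison makes the per-bit density gap concrete. The sketch below is purely illustrative; the bitrate, token size, and corpus figures are assumptions chosen for round numbers, not values from the discussion.

```python
# Rough illustration of "intelligence per bit": raw bytes in one hour of video
# vs. one novel-length text. All constants are illustrative assumptions.

VIDEO_BITRATE_BPS = 5_000_000      # assume 1080p video at ~5 Mbit/s
SECONDS_PER_HOUR = 3600
BYTES_PER_TEXT_TOKEN = 4           # assume ~4 bytes per token of English text
TOKENS_PER_NOVEL = 150_000         # assume a ~100k-word novel is ~150k tokens

video_bytes_per_hour = VIDEO_BITRATE_BPS * SECONDS_PER_HOUR / 8
novel_bytes = BYTES_PER_TEXT_TOKEN * TOKENS_PER_NOVEL

print(f"One hour of video: {video_bytes_per_hour / 1e9:.2f} GB")
print(f"One novel of text: {novel_bytes / 1e6:.2f} MB")
print(f"Ratio: ~{video_bytes_per_hour / novel_bytes:,.0f}x more bytes per hour of video")
```

Under these assumed numbers, an hour of video carries roughly 3,750 times more bytes than a novel while conveying far less abstract content per byte, which is exactly why the untapped volume of video matters more than its per-bit density.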
The key to Sora's social app wasn't just generating beautiful videos, but allowing users to inject themselves and friends via "cameos." This non-obvious feature transformed the experience from a tech demo into a human-centric social platform, achieving immediate internal product-market fit.
Sora doesn't process pixels or frames individually. Instead, it operates on "space-time tokens": small cuboids of video data that combine spatial and temporal information. These voxel-like patches are the model's fundamental unit, and global attention across them is what lets it learn properties like object permanence.
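As a rough illustration of the idea (not Sora's actual implementation, whose patch sizes and architecture are unpublished), the sketch below partitions a video tensor into space-time cuboids and flattens each one into a token vector; the patch dimensions are assumed values.

```python
import numpy as np

def video_to_spacetime_tokens(video, t_patch=4, h_patch=16, w_patch=16):
    """Partition a video into space-time cuboids ("tokens").

    video: array of shape (T, H, W, C) -- frames, height, width, channels.
    Patch sizes are illustrative assumptions, not Sora's real values.
    Returns an array of shape (num_tokens, t_patch * h_patch * w_patch * C),
    where each row is one flattened cuboid spanning space *and* time.
    """
    T, H, W, C = video.shape
    assert T % t_patch == 0 and H % h_patch == 0 and W % w_patch == 0

    # Carve the video into a grid of cuboids, group the grid axes first,
    # then flatten each cuboid into a single token vector.
    tokens = (
        video.reshape(T // t_patch, t_patch,
                      H // h_patch, h_patch,
                      W // w_patch, w_patch, C)
             .transpose(0, 2, 4, 1, 3, 5, 6)
             .reshape(-1, t_patch * h_patch * w_patch * C)
    )
    return tokens

# Example: 16 frames of 128x128 RGB video -> 4 * 8 * 8 = 256 tokens.
video = np.random.rand(16, 128, 128, 3)
tokens = video_to_spacetime_tokens(video)
print(tokens.shape)  # (256, 3072)
```

In a transformer, each token vector would then be linearly projected and attended to jointly with all the others, which is how information can propagate across both space and time rather than frame by frame.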
