With the release of OpenAI's new video generation model, Sora 2, a surprising inversion has occurred. The generated video is so realistic that the accompanying AI-generated audio is now the more noticeable and identifiable artificial component, signaling a new frontier in multimedia synthesis.
OpenAI frames the current Sora model as analogous to GPT-3.5: a promising but flawed early version. This signals they know how to build the 'GPT-4 equivalent' for video and expect the pace of improvement to be even faster than it was for large language models.
AI generating high-quality animation is more impressive than photorealism because of the extreme scarcity of training data (thousands of hours of animation vs. millions of hours of real-world video). Sora 2's success here suggests a fundamental improvement in learning efficiency, not just a brute-force data advantage.
Sora 2's most significant advancement is not its visual quality, but its ability to understand and simulate physics. The model accurately portrays how water splashes or vehicles kick up snow, demonstrating a grasp of cause and effect crucial for true world-building.
AI video tools like Sora optimize for high production value, but popular internet content often succeeds due to its message and authenticity, not its polish. The assumption that better visuals create better engagement is a risky product bet, as it iterates on an axis that users may not value.
Ben Thompson argues that ChatGPT succeeded because the creator was also the consumer, receiving immediate, personalized value. In contrast, AI video is created for an audience. He questions whether Sora's easily made content is compelling enough for anyone other than the creator to watch, which poses a major consumption hurdle.
While today's focus is on text-based LLMs, the true, defensible AI battleground will be in complex modalities like video. Generating video requires multiple interacting models and unique architectures, creating far greater potential for differentiation and a wider competitive moat than text-based interfaces, which will become commoditized.
The Sora team views video as having lower "intelligence per bit" compared to text. However, the total volume of available video data is vastly larger and less tapped. This suggests that, unlike LLMs, which face a looming data crunch, video models can keep scaling with more data for a very long time.
For a generative video model like OpenAI's Sora 2 to achieve viral adoption, it needs a universally appealing, simple-to-execute prompt, much like the "Studio Ghibli moment" that ChatGPT's image generation enjoyed. A feature like "upload your profile picture and turn it into a video" would engage a mass audience far more effectively than just showcasing raw technical capabilities.
The OpenAI team believes generative video won't just make traditional feature films easier to produce. It will give rise to entirely new mediums and creator classes, much as the film camera created cinema, a medium distinct from the recorded stage plays it was first used for.
The hosts' visceral reactions to Sora (it made their "skin crawl" and left them feeling "unsafe") suggest the uncanny valley is a psychological hurdle. Overcoming this negative, almost primal response to AI-generated humans may be a bigger challenge for adoption than achieving perfect photorealism.