Why Video Agent models are next — Ethan He, xAI Grok Imagine

Latent Space: The AI Engineer Podcast · Jun 1, 2026

xAI's Ethan He on building video models: from data pipelines to real-time world models, and why video agents driven by LLMs are the future.

Training Video Models Requires First Building a Foundational Image Model

Video models are bootstrapped from image models because the denser, cheaper language-to-image data provides a stronger foundation for understanding human intent, a prerequisite for complex video generation.

Generative Video Models Depend Entirely on Synthetic Text-Video Pairs for Training

Raw internet videos lack direct textual descriptions. To train a video model, teams must first create synthetic datasets by using VLMs or human labelers to generate detailed captions that precisely describe the visual content.
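
The captioning step described above can be sketched as a small pipeline. This is a minimal illustration, not xAI's pipeline: `caption_fn` is a hypothetical stand-in for a VLM call or a human labeling queue, and the `min_words` filter and file paths are assumptions added for the example.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class TextVideoPair:
    video_path: str
    caption: str


def build_pairs(
    video_paths: Iterable[str],
    caption_fn: Callable[[str], str],  # hypothetical: a VLM or human labeler
    min_words: int = 5,                # assumed quality threshold
) -> List[TextVideoPair]:
    """Caption each video and keep only captions with enough detail."""
    pairs = []
    for path in video_paths:
        caption = caption_fn(path).strip()
        if len(caption.split()) < min_words:
            continue  # drop captions too short to describe the content
        pairs.append(TextVideoPair(path, caption))
    return pairs


def fake_captioner(path: str) -> str:
    # Placeholder captions so the sketch runs without a real model.
    canned = {
        "clips/a.mp4": "A brown dog runs across a sunny beach toward the camera.",
        "clips/b.mp4": "Blurry.",
    }
    return canned.get(path, "")


pairs = build_pairs(["clips/a.mp4", "clips/b.mp4"], fake_captioner)
```

Only the first clip survives the filter here, because the second caption is too short to describe the visual content in any detail.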

Daily Iteration Speed is More Critical to Model Training Than Novel Algorithms

The primary driver of success in large-scale model training is the ability to conduct numerous experiments daily. A robust infrastructure that minimizes cycle time for testing hypotheses provides a greater advantage than focusing solely on developing new algorithms.

Fixing Small Data Pipeline Bugs Yields Greater Model Gains Than New Algorithms

Contrary to popular belief, many significant boosts in AI model quality don't originate from novel algorithms. Instead, they come from the less glamorous work of identifying and fixing subtle bugs within the data and model training pipelines.

xAI Shipped Its First Multimodal Model From Scratch in Just Three Months

A small team at xAI went from no infrastructure, data, or model to a fully released multimodal product (Grok Imagine 0.9) in only three months. This speed was enabled by strong foundational infrastructure, high talent density, and minimal communication overhead.

Future User Interfaces Will Be Rendered Directly from User Intent via Diffusion Models

Instead of AI writing code that then gets rendered, future interfaces will be generated directly by diffusion models. This "intention-to-pixel" paradigm allows for hyper-personalized, real-time UIs, effectively making the diffusion model the new front-end.

Real-Time Video Models Must Sacrifice Compression Efficiency for Interactivity

While compressing video across the temporal dimension offers higher efficiency, it inherently introduces latency. For real-time, interactive applications like "world models," a less efficient frame-by-frame compression approach is necessary to enable immediate responsiveness.
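
The trade-off above can be made concrete with a little arithmetic. In this sketch, a latent that covers several frames cannot respond to input until the whole chunk has elapsed, so chunk duration is a rough lower bound on input lag. The frame rate (24 fps) and temporal compression factor (4) are illustrative assumptions, not figures from the episode, and model compute time is ignored.

```python
def chunk_duration_ms(frames_per_latent: int, fps: float) -> float:
    """Wall-clock time covered by one latent, in milliseconds."""
    if frames_per_latent < 1 or fps <= 0:
        raise ValueError("frames_per_latent must be >= 1 and fps must be > 0")
    return 1000.0 * frames_per_latent / fps


def latents_per_second(frames_per_latent: int, fps: float) -> float:
    """How many latents the model must produce per second of video."""
    return fps / frames_per_latent


FPS = 24.0  # assumed frame rate

# Temporal compression: fewer latents to generate, but coarser response time.
temporal_lag = chunk_duration_ms(frames_per_latent=4, fps=FPS)
temporal_load = latents_per_second(frames_per_latent=4, fps=FPS)

# Frame-by-frame: four times the latents, but input can land on the next frame.
framewise_lag = chunk_duration_ms(frames_per_latent=1, fps=FPS)
framewise_load = latents_per_second(frames_per_latent=1, fps=FPS)
```

Under these assumptions, temporal compression generates 6 latents per second with about 167 ms of unavoidable lag, while frame-by-frame generation needs 24 latents per second but cuts that lag to about 42 ms.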

A True "World Model" Requires Real-Time, Interactive, and Long-Horizon Video

A "world model" transcends simple video generation. It is defined by three key capabilities: real-time responsiveness to user input (e.g., mouse clicks), long-horizon consistency over minutes or hours, and interactivity via multiple modalities like keyboard and voice.

Petabyte-Scale Data Storage is a Major Hidden Cost in Video Model Training

While GPU costs for video model training are well-known, data storage represents a massive, often underestimated expense. A billion-video dataset, along with its compressed features, can require tens of petabytes, leading to storage and egress costs of millions per year.
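
A back-of-the-envelope estimate shows how the storage figure above can reach millions per year. Every input below is an assumption chosen for illustration (average video size, feature overhead, and price per GB-month are not from the episode), and egress charges are left out entirely.

```python
def dataset_size_pb(num_videos: float, avg_mb_per_video: float,
                    feature_overhead: float) -> float:
    """Total size in petabytes (decimal units: 1 PB = 1e9 MB)."""
    raw_mb = num_videos * avg_mb_per_video
    return raw_mb * (1.0 + feature_overhead) / 1e9


def yearly_storage_cost_usd(size_pb: float, usd_per_gb_month: float) -> float:
    """Storage-only cost for one year (1 PB = 1e6 GB); egress not included."""
    return size_pb * 1e6 * usd_per_gb_month * 12


NUM_VIDEOS = 1e9         # "a billion-video dataset" from the text
AVG_MB_PER_VIDEO = 20.0  # assumed average clip size
FEATURE_OVERHEAD = 0.5   # assumed: compressed features add 50% on top of raw video
USD_PER_GB_MONTH = 0.02  # assumed object-storage price

size_pb = dataset_size_pb(NUM_VIDEOS, AVG_MB_PER_VIDEO, FEATURE_OVERHEAD)
cost_usd = yearly_storage_cost_usd(size_pb, USD_PER_GB_MONTH)
```

With these inputs the dataset comes to roughly 30 PB and about $7.2 million per year in storage alone, which is in line with the "tens of petabytes" and "millions per year" cited above. Changing any assumption scales the result linearly.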

Video Generation Quality Hinges on Language Models, Not the Video Model Itself

The perceived intelligence of video generation models is often an illusion. The heavy lifting is done by a large language model that rewrites simple user prompts into highly detailed scenes. The video diffusion model itself is less intelligent, simply executing these detailed instructions literally.
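
The division of labor described above can be sketched as a two-stage call. This is an illustration under stated assumptions: `llm` and `video_model` are hypothetical callables rather than real APIs, and the rewrite template is invented for the example.

```python
from typing import Callable

# Hypothetical instruction given to the language model.
REWRITE_TEMPLATE = (
    "Expand the user's request into a detailed scene description. "
    "Specify subject, setting, lighting, camera motion, and action.\n"
    "User request: {prompt}"
)


def generate_video(
    user_prompt: str,
    llm: Callable[[str], str],            # hypothetical LLM call
    video_model: Callable[[str], bytes],  # hypothetical diffusion model call
) -> bytes:
    """Rewrite a short prompt with the LLM, then render the rewritten prompt."""
    detailed_prompt = llm(REWRITE_TEMPLATE.format(prompt=user_prompt))
    # The video model only ever sees the detailed prompt, never the original.
    return video_model(detailed_prompt)


received = []  # records what the video model was actually asked to render


def stub_llm(instruction: str) -> str:
    return ("A tabby cat leaps onto a wooden table in warm evening light, "
            "slow dolly-in, shallow depth of field.")


def stub_video_model(prompt: str) -> bytes:
    received.append(prompt)
    return b"<video bytes>"


video = generate_video("cat jumps on table", stub_llm, stub_video_model)
```

In this arrangement the diffusion model never sees the user's four-word request; it renders whatever detailed scene the language model wrote.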

The Future of Video Creation Lies with AI Agents That Iteratively Use Tools

The next leap in video generation won't come from monolithic models but from AI agents. These LLM-driven agents will use a suite of tools (diffusion models, video-processing tools like FFmpeg, and image editors) to iteratively create and refine complex, long-form videos.
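
A tool-using loop of this kind can be sketched in a few lines. This is a hedged illustration rather than any shipped agent: the planner is a scripted stand-in for an LLM, `generate_clip` is a hypothetical diffusion tool, and the file names are invented. The FFmpeg command uses the concat demuxer and is only constructed here, not executed.

```python
from typing import Callable, Dict, List


def ffmpeg_concat_cmd(list_file: str, output: str) -> List[str]:
    """Build (but do not run) an FFmpeg concat-demuxer command."""
    return ["ffmpeg", "-f", "concat", "-safe", "0",
            "-i", list_file, "-c", "copy", output]


def run_agent(goal: str,
              plan_fn: Callable[[str, list], dict],
              tools: Dict[str, Callable[..., object]],
              max_steps: int = 10) -> list:
    """Ask the planner for one action at a time until it says 'done'."""
    history = []
    for _ in range(max_steps):
        action = plan_fn(goal, history)
        if action["tool"] == "done":
            break
        result = tools[action["tool"]](**action["args"])
        history.append({"action": action, "result": result})
    return history


def scripted_planner(goal: str, history: list) -> dict:
    # Stand-in for an LLM: a real planner would inspect the history and decide.
    script = [
        {"tool": "generate_clip", "args": {"prompt": goal + ", shot 1"}},
        {"tool": "generate_clip", "args": {"prompt": goal + ", shot 2"}},
        {"tool": "concat", "args": {"list_file": "shots.txt",
                                    "output": "final.mp4"}},
        {"tool": "done", "args": {}},
    ]
    return script[len(history)]


tools = {
    "generate_clip": lambda prompt: f"clip<{prompt}>",  # hypothetical tool
    "concat": ffmpeg_concat_cmd,
}

history = run_agent("a cat exploring a garden", scripted_planner, tools)
```

Because the planner receives the full history on every step, a real LLM planner could inspect earlier results and regenerate a weak shot before stitching, which is where the iterative refinement would come from.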

Powerful Coding Models Shift the AI Research Bottleneck Back to Compute

Previously, implementing a new algorithm could take weeks, leaving compute idle. With advanced coding assistants, ideas can be prototyped in hours, making the availability of compute resources to run experiments the primary limiting factor for progress again.
