While GPU costs for video model training are well known, data storage is a massive and often underestimated expense. A billion-video dataset, along with its compressed features, can require tens of petabytes, leading to storage and egress costs of millions per year.
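To make the scale concrete, here is a rough back-of-envelope sketch in Python. Every input is a hypothetical placeholder rather than a figure from the podcast: the average video size, the overhead for compressed features, and the per-gigabyte monthly price are illustrative assumptions only, and egress charges are left out entirely.

```python
def estimate_annual_storage_cost(num_videos, avg_video_mb, feature_overhead, usd_per_gb_month):
    """Return (total petabytes, yearly storage cost in USD) for a video dataset.

    All parameters are illustrative assumptions, not measured values:
      avg_video_mb      - assumed average size of one stored video, in MB
      feature_overhead  - assumed extra fraction for compressed features (0.1 = 10%)
      usd_per_gb_month  - assumed object-storage price per GB per month
    Egress costs are not modeled here.
    """
    total_gb = num_videos * avg_video_mb / 1000 * (1 + feature_overhead)
    total_pb = total_gb / 1_000_000  # decimal units: 1 PB = 1,000,000 GB
    yearly_cost = total_gb * usd_per_gb_month * 12
    return total_pb, yearly_cost


if __name__ == "__main__":
    # Hypothetical inputs: 1 billion videos, 20 MB each, 10% feature overhead,
    # $0.02 per GB-month.
    pb, cost = estimate_annual_storage_cost(1_000_000_000, 20, 0.1, 0.02)
    print(f"{pb:.1f} PB, about ${cost:,.0f} per year")
```

Under these assumed inputs the dataset lands in the tens of petabytes and the yearly storage bill in the millions, which matches the order of magnitude described above. Changing any input shifts the result proportionally.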

Related Insights

TurboPuffer achieved its massive cost savings by building on slow S3 storage. While this increased write latency by 1000x—unacceptable for transactional systems—it was a perfectly acceptable trade-off for search and AI workloads, which prioritize fast reads over fast writes.

The proliferation of sensors, especially cameras, will generate massive amounts of video data. This data must be uploaded to cloud AI models for processing, making robust upstream bandwidth—not just downstream—the critical new infrastructure bottleneck and a significant opportunity for telecom companies.

An advanced user reveals their largest new expense from building AI agents isn't tokens, but database and storage costs. AI makes vast amounts of previously inert data useful, creating a surge in demand for storage solutions, which is where the real economic leverage lies.

The computational requirements for generative media scale dramatically across modalities. If a 200-token LLM prompt costs 1 unit of compute, a single image costs 100x that, and a 5-second video costs another 100x on top of that—a 10,000x total increase. 4K video adds another 10x multiplier.
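The multipliers above compound, which a short Python sketch makes explicit. The step names and the `cumulative_compute` helper are illustrative, and the sketch assumes the 4K multiplier stacks on top of the 5-second video figure, which the summary implies but does not state outright.

```python
# Each step multiplies the compute of the step before it.
# Baseline: a 200-token LLM prompt costs 1 unit of compute.
STEPS = [
    ("200-token LLM prompt", 1),
    ("single image", 100),
    ("5-second video", 100),
    ("4K video", 10),  # assumed to stack on the 5-second video cost
]


def cumulative_compute(steps):
    """Return a dict mapping each step name to its compute relative to the baseline."""
    total = 1
    relative = {}
    for name, multiplier in steps:
        total *= multiplier
        relative[name] = total
    return relative


if __name__ == "__main__":
    for name, units in cumulative_compute(STEPS).items():
        print(f"{name}: {units:,}x")
```

Read this way, a 4K clip sits at 100,000 times the compute of the baseline text prompt.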

The Sora team views video as having lower "intelligence per bit" compared to text. However, the total volume of available video data is vastly larger and less tapped. This suggests that, unlike LLMs facing a data crunch, video models can scale with more data for a very long time.

While OpenFold trains on public datasets, the pre-processing and distillation to make the data usable requires massive compute resources. This "data prep" phase can cost over $15 million, creating a significant, non-obvious barrier to entry for academic labs and startups wanting to build foundational models.

Data is becoming more expensive not from scarcity, but because the work has evolved. Simple labeling is over. Costs are now driven by the need for pricey domain experts for specialized data preparation and creative teams to build complex, synthetic environments for training agents.

The infrastructure demands of AI have driven a dramatic increase in data center scale. Two years ago, a 1-megawatt facility was considered a good size. Today, a large AI data center is a 1-gigawatt facility, a 1,000-fold increase. This rapid escalation underscores the immense capital investment required to power AI.

The next wave of data growth will be driven by countless sensors (like cameras) sending video upstream for AI processing. This requires a fundamental shift to symmetrical networks, like fiber, that have robust upstream capacity.

Counterintuitively, the capital expenditure for building AI data centers can be significantly higher than for manufacturing complex physical hardware like rockets and satellites. SpaceX's xAI division spent 50% more on CapEx than its rocket and satellite divisions combined, highlighting the immense cost of AI infrastructure at scale.