We scan new podcasts and send you the top 5 insights daily.
VC Josh Wolfe argues the AI narrative will shift from data center dominance to on-device inference. Citing Apple research on running LLMs on flash memory, he predicts a coming glut in data center capacity and a scarcity of on-device memory, favoring players like Micron and Samsung.
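A rough sanity check on that scarcity claim, assuming a hypothetical 7B-parameter model and 8 GB of phone DRAM (illustrative numbers, not from the episode): even quantized weights crowd out the rest of the system, which is why the Apple research streams weights from flash instead of keeping them all resident in DRAM.

```python
# Back-of-envelope check on the on-device memory wall.
# Model size and phone DRAM are illustrative assumptions.

PARAMS = 7e9         # hypothetical 7B-parameter model
PHONE_DRAM_GB = 8    # typical smartphone DRAM

for bits in (16, 8, 4):
    weight_gb = PARAMS * bits / 8 / 1e9
    verdict = "fits in" if weight_gb < PHONE_DRAM_GB else "exceeds"
    print(f"{bits}-bit weights: {weight_gb:.1f} GB -> {verdict} {PHONE_DRAM_GB} GB DRAM")
```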
The AI inference process involves two distinct phases: "prefill" (reading the prompt, which is compute-bound) and "decode" (writing the response, which is memory-bound). NVIDIA GPUs excel at prefill, while companies like Groq optimize for decode. The Groq-NVIDIA deal signals a future of specialized, complementary hardware rather than one-size-fits-all chips.
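The compute-bound/memory-bound split falls out of a simple arithmetic-intensity estimate. A minimal sketch, assuming a hypothetical 70B-parameter fp16 model and illustrative hardware figures: prefill amortizes one read of the weights over every prompt token, while decode re-reads them for each generated token.

```python
# Arithmetic-intensity sketch for the two phases of inference.
# Model size, precision, and hardware figures are all assumptions.

PARAMS = 70e9            # parameters in a hypothetical 70B model
BYTES_PER_PARAM = 2      # fp16 weights
PROMPT_TOKENS = 2048     # assumed prompt length for prefill

def flops_per_byte(tokens_per_weight_read: int) -> float:
    """~2 FLOPs per parameter per token, over one full read of the weights."""
    flops = 2 * PARAMS * tokens_per_weight_read
    bytes_moved = PARAMS * BYTES_PER_PARAM
    return flops / bytes_moved

# Prefill: all prompt tokens share a single pass over the weights.
print(f"prefill: ~{flops_per_byte(PROMPT_TOKENS):,.0f} FLOPs/byte")
# Decode: each new token forces another full read of the weights.
print(f"decode:  ~{flops_per_byte(1):,.0f} FLOPs/byte")
# A GPU with ~1000 TFLOPs and ~3 TB/s of HBM is balanced near:
print(f"balance point: ~{1000e12 / 3e12:.0f} FLOPs/byte")
```

Prefill lands far above the balance point (compute-bound) and decode far below it (memory-bound), which is the gap decode-optimized hardware targets.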
While the focus is on massive supercomputers for training next-gen models, the real supply-chain constraint will be "inference" chips: the GPUs needed to run models for billions of users. As adoption goes mainstream, demand for everyday AI use will far outstrip the supply of available hardware.
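Napkin math makes the scale concrete. Every input below is an invented assumption (user count, tokens per user, per-GPU throughput), not a figure from the episode.

```python
# Fleet-size napkin math. Every input below is an invented assumption.

USERS = 1e9                   # daily active users
TOKENS_PER_USER_DAY = 10_000  # tokens generated per user per day
GPU_TOKENS_PER_SEC = 500      # sustained decode throughput of one GPU

daily_tokens = USERS * TOKENS_PER_USER_DAY
gpus_needed = daily_tokens / GPU_TOKENS_PER_SEC / 86_400  # seconds per day

print(f"{daily_tokens:.0e} tokens/day -> ~{gpus_needed:,.0f} GPUs at 100% utilization")
```

Real fleets run well below 100% utilization, so the actual requirement is a multiple of this.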
The growth of AI is constrained not by chip design but by inputs like energy and High Bandwidth Memory (HBM). This shifts power to component suppliers and energy providers, allowing them to gain leverage, demand equity, and influence the entire AI ecosystem, much like a central bank controls the money supply.
The next wave of AI silicon may pivot from today's compute-heavy architectures to memory-centric ones optimized for inference. This fundamental shift would allow high-performance chips to be produced on older, more accessible 7-14nm manufacturing nodes, disrupting the current dependency on cutting-edge fabs.
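The claim that older nodes suffice follows from the decode math above: a memory-bound chip only needs enough compute to keep pace with its memory system. A sketch, with assumed bandwidth and intensity figures:

```python
# Why decode silicon may not need a leading-edge node: the chip only has
# to compute as fast as its memory can feed it. Figures are assumptions.

HBM_BANDWIDTH = 3e12     # bytes/s of memory bandwidth (assumption)
DECODE_INTENSITY = 2     # FLOPs per byte of fp16 weights at batch size 1

required_tflops = HBM_BANDWIDTH * DECODE_INTENSITY / 1e12
print(f"compute needed to saturate HBM during decode: ~{required_tflops:.0f} TFLOPs")
# A single-digit-TFLOPs budget is achievable on 7-14nm; the binding
# constraint on such a chip is bandwidth, not transistor density.
```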
OpenAI is buying 3-4 times more memory than it needs for short-term operations. While this could be aggressive future-proofing, a less charitable view suggests a strategic move to corner the DRAM supply, artificially inflating costs and killing the nascent on-device AI market before it can compete.
While NVIDIA's GPUs have been the primary AI constraint, the bottleneck is now moving to other essential subsystems. Memory, networking interconnects, and power management are emerging as the next critical choke points, signaling a new wave of investment opportunities in the hardware stack beyond core compute.
The intense power demands of AI inference will push data centers to adopt the "heterogeneous compute" model from mobile phones. Instead of a single GPU architecture, data centers will use disaggregated, specialized chips for different tasks to maximize power efficiency, creating a post-GPU era.
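A toy illustration of what disaggregated serving could look like, with invented device names and efficiency numbers: route the compute-bound phase and the memory-bound phase to different silicon, the way a phone SoC splits work across CPU, GPU, and NPU.

```python
# Toy sketch of heterogeneous, disaggregated serving: route each phase
# of a request to the silicon best suited for it. Device names and
# efficiency numbers are invented for illustration.

from dataclasses import dataclass

@dataclass
class Accelerator:
    name: str
    flops_per_watt: float  # compute efficiency (illustrative)
    gbps_per_watt: float   # memory-bandwidth efficiency (illustrative)

DEVICES = [
    Accelerator("prefill-asic", flops_per_watt=100.0, gbps_per_watt=1.0),
    Accelerator("decode-asic",  flops_per_watt=10.0,  gbps_per_watt=20.0),
]

def route(phase: str) -> Accelerator:
    """Compute-bound prefill and memory-bound decode go to different chips."""
    attr = "flops_per_watt" if phase == "prefill" else "gbps_per_watt"
    return max(DEVICES, key=lambda d: getattr(d, attr))

for phase in ("prefill", "decode"):
    print(f"{phase} -> {route(phase).name}")
```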
Despite record profits driven by AI demand for High-Bandwidth Memory, chip makers are maintaining a "conservative investment approach" and not rapidly expanding capacity. This strategic restraint keeps prices for critical components high, maximizing their profitability and effectively controlling the pace of the entire AI hardware industry.
The intense demand for memory chips for AI is causing a shortage so severe that NVIDIA is delaying a new gaming GPU for the first time in 30 years. This demonstrates a major inflection point where the AI industry's hardware needs are creating significant, tangible ripple effects on adjacent, multi-billion dollar consumer markets.
The narrative of endless demand for NVIDIA's high-end GPUs is flawed. It will be cracked by two forces: the shift of AI inference to on-device flash memory, reducing cloud reliance, and Google's ability to give away its increasingly powerful Gemini AI for free, undercutting the revenue models that fuel GPU demand.