Etched's "low voltage inference" technology runs the chip at a much lower voltage, which improves power efficiency. The power saved lets the company pack more compute (FLOPS) onto the chip without overheating, an approach aimed at the thermal limits that constrain current GPU performance for AI inference.

Related Insights

The performance gains from Nvidia's Hopper to Blackwell GPUs come from increased size and power, not efficiency. This signals a potential scaling limit, creating an opportunity for radically new hardware primitives and neural network architectures beyond today's matrix-multiplication-centric models.

Separating inference into "prefill" (compute-bound) and "decode" (memory-bandwidth-bound) stages is a game-changer for hardware longevity. It allows older GPUs to remain in service for prefill tasks, extending their useful economic life from 3-4 years to 10-15 years, a boon for data centers and their financiers.
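One way to see why the two stages stress hardware differently is arithmetic intensity: the FLOPs performed per byte moved. The sketch below estimates it for a single weight matrix multiply. The function name, the 4096-wide layer, the 2048-token prompt, and the 2-byte values are illustrative assumptions, not figures from the episode, and the model ignores attention, the KV cache, and batching.

```python
def matmul_intensity(tokens, d_in, d_out, bytes_per_value=2):
    """Estimate FLOPs per byte for a (tokens x d_in) @ (d_in x d_out) matmul.

    Assumes the weights and activations are each moved once and stored
    in bytes_per_value bytes (2 bytes, e.g. 16-bit, is an assumption).
    """
    flops = 2 * tokens * d_in * d_out  # one multiply and one add per term
    values_moved = d_in * d_out + tokens * d_in + tokens * d_out
    return flops / (bytes_per_value * values_moved)


# Prefill pushes the whole prompt through the weights in one pass.
prefill = matmul_intensity(tokens=2048, d_in=4096, d_out=4096)

# Decode pushes one new token per sequence through the same weights.
decode = matmul_intensity(tokens=1, d_in=4096, d_out=4096)

print(f"prefill: {prefill:.1f} FLOPs/byte, decode: {decode:.2f} FLOPs/byte")
```

Under these assumptions the prompt pass reuses each weight across thousands of tokens, while the single-token pass reads every weight for roughly one FLOP per byte, so it is limited by memory bandwidth rather than compute.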

For two decades, silicon chips have been thermally constrained to a power density of about 1 watt per square millimeter. New R&D efforts are finally overcoming this barrier, which could lead to smaller, more powerful chips, despite significant thermal and electrical engineering challenges.

When power (watts) is the primary constraint for data centers, the total cost of compute becomes secondary. The crucial metric is performance-per-watt. This gives a massive pricing advantage to the most efficient chipmakers, as customers will pay a steep premium for hardware that maximizes output from their limited power budget.
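The arithmetic behind this can be sketched in a few lines. All numbers below are hypothetical (a 1 MW budget and two made-up chips), chosen only to show how a fixed power budget turns performance-per-watt into total output.

```python
def site_throughput(power_budget_w, chip_power_w, chip_tokens_per_s):
    """Return (chips that fit in the power budget, total tokens per second).

    Ignores cooling, networking, and host overhead, which is a
    simplifying assumption.
    """
    n_chips = power_budget_w // chip_power_w
    return n_chips, n_chips * chip_tokens_per_s


BUDGET_W = 1_000_000  # hypothetical 1 MW data center budget

# Chip A is faster per chip; chip B does more work per watt.
chips_a, total_a = site_throughput(BUDGET_W, chip_power_w=1000, chip_tokens_per_s=5000)
chips_b, total_b = site_throughput(BUDGET_W, chip_power_w=500, chip_tokens_per_s=4000)

print(chips_a, total_a)
print(chips_b, total_b)
```

In this made-up case chip B is slower per chip, yet the site produces more tokens per second with it because twice as many chips fit under the same power cap.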

The GPU architecture is economically optimized for slow AI inference, offering a very low cost per token. However, this efficiency plummets when speed is required, as the cost and power per token rise steeply, creating a market for alternative architectures in high-speed applications.

Model architecture decisions directly impact inference performance. AI company Zyphra pre-selects target hardware and then chooses model parameters, such as a hidden dimension divisible by a large power of two, to align with how GPUs split up workloads, maximizing efficiency from day one.
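A quick check of this kind of alignment might look like the sketch below. The helper names, the 128-wide tile, and the 8-way split are assumptions made for illustration; they are not Zyphra's actual tooling or settings.

```python
def power_of_two_factor(n):
    """Return the largest power of two that divides the positive integer n."""
    return n & -n  # isolates the lowest set bit


def splits_evenly(hidden_dim, tensor_parallel, tile=128):
    """True if hidden_dim divides into whole tiles on every device.

    tile=128 is a hypothetical kernel tile width, not a vendor constant.
    """
    return hidden_dim % (tensor_parallel * tile) == 0


for dim in (4096, 5120, 5000):
    print(dim, power_of_two_factor(dim), splits_evenly(dim, tensor_parallel=8))
```

With these assumed settings, 4096 and 5120 shard cleanly across eight devices, while 5000 (divisible only by 8) would leave partial tiles.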

Adding more FLOPS to current AI chips yields little because of thermal throttling. Etched realized the solution is lowering voltage, which reduces power consumption quadratically. Inspired by Bitcoin miners, they created a new power delivery system enabling chips to run at under half the voltage of GPUs.
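The quadratic relationship comes from the standard CMOS switching-power estimate, P = a * C * V^2 * f. The sketch below applies it with made-up values; the capacitance, voltages, and frequency are illustrative assumptions, not Etched's figures, and the model ignores leakage power and the clock-speed reduction that usually accompanies a lower voltage.

```python
def dynamic_power(capacitance_f, voltage_v, frequency_hz, activity=1.0):
    """Switching power in watts from the textbook estimate P = a * C * V^2 * f."""
    return activity * capacitance_f * voltage_v ** 2 * frequency_hz


def power_ratio(v_new, v_old):
    """Fraction of switching power left after changing only the voltage."""
    return (v_new / v_old) ** 2


# Hypothetical example: halving the supply voltage at a fixed clock.
print(power_ratio(v_new=0.5, v_old=1.0))

# Hypothetical block: 1 nF of switched capacitance at 0.8 V and 1 GHz.
print(dynamic_power(1e-9, 0.8, 1e9))
```

Under this simplified model, running at half the voltage cuts switching power to one quarter, which is the headroom that could be spent on additional compute units within the same thermal envelope.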

Leveraging technology developed for satellites, Akash Systems places a thin layer of synthetic diamond—the world's most thermally conductive material—directly onto GPUs. This dramatically lowers temperatures, increases inference speed, and reduces data center energy costs without expensive liquid cooling systems.

Unlike general-purpose NVIDIA GPUs, Microsoft's custom Maia 200 chip focuses specifically on running existing AI models (inference). Microsoft claims this makes it cheaper for certain tasks, like its own Copilot tools, creating a cost-saving value proposition for potential customers like Anthropic.

Instead of focusing on on-chip memory bandwidth, Etched optimized for cluster-scale memory. They built a custom interconnect that cuts chip-to-chip latency by over 5x compared to GPUs. This allows the memory of the entire cluster to function as a single, low-latency pool, dramatically improving performance.

Chipmaker Etched Uses Low-Voltage Inference to Sidestep Thermal GPU Bottlenecks | RiffOn