Reiner Pope – Chip design from the bottom up

Dwarkesh Podcast · May 22, 2026

How do AI chips work? This deep dive explains chip design from logic gates to systolic arrays, revealing the constant trade-off between computation and data movement.

AI Chips' Core Operation is Multiply-Accumulate, Directly Mirroring Matrix Math

AI chips are built around the multiply-accumulate (MAC) operation, and the choice isn't arbitrary. A MAC maps directly onto the innermost loop of matrix multiplication (output += input1 * input2), which is the foundational computation for most neural networks.
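
As a minimal sketch of that mapping, the plain-Python matrix multiply below has a single MAC as the body of its innermost loop. The function name `matmul_mac` and the list-of-rows representation are illustrative choices, not something from the episode.

```python
def matmul_mac(a, b):
    """Multiply two matrices (lists of rows) using one MAC per inner step."""
    rows, inner, cols = len(a), len(b), len(b[0])
    out = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            for k in range(inner):
                # The multiply-accumulate: output += input1 * input2
                out[i][j] += a[i][k] * b[k][j]
    return out


print(matmul_mac([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # prints [[19, 22], [43, 50]]
```

Everything outside the innermost line is loop bookkeeping; the hardware question is how many of those MACs can run per cycle and how their operands get there.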

Multiplier Area on a Chip Scales Quadratically with Bit-Width, Explaining Low-Precision AI Gains

The physical area a multiplier circuit requires on a chip grows quadratically with bit-width: multiplying a p-bit number by a q-bit number takes area proportional to roughly p*q. Halving the width of both operands therefore cuts multiplier area by about four times rather than two. This non-linear scaling is the fundamental reason why lower-precision formats like FP4 and FP8 offer disproportionately large performance and efficiency gains for AI workloads.
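
A rough way to see the p*q scaling is to count partial-product bits in a simple array multiplier, where each bit of one operand is ANDed with each bit of the other. The sketch below assumes plain integer operands and uses that count as a stand-in for area; floating-point formats multiply only their mantissas, so real FP4 and FP8 figures will differ.

```python
def partial_product_bits(p, q):
    """Partial-product bits in a p-bit by q-bit array multiplier (one AND each)."""
    return p * q


baseline = partial_product_bits(16, 16)
for bits in (16, 8, 4):
    count = partial_product_bits(bits, bits)
    # Each halving of the width divides the count by four, not two.
    print(f"{bits:>2}-bit: {count:>3} partial-product bits, "
          f"{baseline // count}x smaller than 16-bit")
```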

On-Chip Data Movement From Registers Can Cost More Area Than The Actual Computation

The multiplexer (MUX) circuits required to select and move data from a register file to a logic unit can consume significantly more silicon area than the logic unit performing the actual calculation. This illustrates that data movement is a dominant cost, even at the micro-architectural level.
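
A back-of-the-envelope gate count makes the imbalance concrete. The sketch below assumes textbook cell sizes (a 2:1 mux as 4 gates, a full adder as 5 gates), a hypothetical 32-entry register file, and an 8-bit ripple-carry adder as the logic unit. None of these numbers come from the episode, and real standard-cell areas vary.

```python
MUX2_GATES = 4        # assumed: 2 AND + 1 OR + 1 NOT
FULL_ADDER_GATES = 5  # assumed: 2 XOR + 2 AND + 1 OR


def read_port_gates(num_registers, width):
    """Gates in one read port: a num_registers-to-1 mux tree for every bit."""
    mux2_per_bit = num_registers - 1  # a binary tree of 2:1 muxes
    return mux2_per_bit * MUX2_GATES * width


def adder_gates(width):
    """Gates in a ripple-carry adder: one full adder per bit."""
    return width * FULL_ADDER_GATES


regs, width = 32, 8
mux_total = 2 * read_port_gates(regs, width)  # two operands, two read ports
add_total = adder_gates(width)
print(f"operand selection: {mux_total} gates, adder: {add_total} gates")
print(f"ratio: {mux_total / add_total:.1f}x")
```

Under these assumptions, selecting the two operands costs far more gates than adding them, which is the pattern the paragraph above describes.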

Systolic Arrays Amortize Data Movement Costs by Storing Weight Matrices Locally

Systolic arrays (like NVIDIA's Tensor Cores) overcome the high cost of data movement by storing the large, reusable weight matrix directly within the compute fabric. This avoids repeatedly fetching weights from a distant register file, dramatically improving the ratio of computation to communication.
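
The effect on weight traffic can be sketched by counting fetches. The code below is not a cycle-accurate systolic array; it only contrasts a design that reads a weight from distant storage for every MAC with a weight-stationary one that loads each weight once into a local processing element (PE). Function names and the fetch-counting model are assumptions made for illustration.

```python
def matmul_refetch(x, w):
    """Read a weight from distant storage for every single MAC."""
    fetches = 0
    out = [[0] * len(w[0]) for _ in x]
    for b, row in enumerate(x):
        for k, xv in enumerate(row):
            for n in range(len(w[0])):
                fetches += 1  # one weight read per MAC
                out[b][n] += xv * w[k][n]
    return out, fetches


def matmul_weight_stationary(x, w):
    """Load each weight once into its PE, then stream every input past it."""
    pe = [list(r) for r in w]       # weights now live inside the array
    fetches = len(w) * len(w[0])    # one load per weight, total
    out = [[0] * len(w[0]) for _ in x]
    for b, row in enumerate(x):
        for k, xv in enumerate(row):
            for n in range(len(w[0])):
                out[b][n] += xv * pe[k][n]  # local read, no distant fetch
    return out, fetches


x = [[1, 2], [3, 4], [5, 6], [7, 8]]  # 4 input rows
w = [[1, 0, 2], [0, 1, 3]]            # 2 x 3 weight matrix
print(matmul_refetch(x, w)[1], matmul_weight_stationary(x, w)[1])  # prints 24 6
```

Both versions compute the same result; the weight-stationary fetch count stays fixed as more input rows are streamed through, while the refetching count grows with every row.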

AI Chips Prioritize Low-Bandwidth Weight Loading to Save Die Area

Since the weight matrix in a systolic array is reused many times, it doesn't need to be loaded quickly. Chip designers can use slow, low-bandwidth connections to "trickle feed" the weights, minimizing the required wiring and thus saving precious die area. This prioritizes area efficiency over initial load latency.
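
A small calculation shows how quickly the required weight bandwidth falls with reuse. The sketch assumes a hypothetical square array that takes one input vector per cycle and must have its next set of weights loaded by the time the current set has been reused `reuse_cycles` times. The array size and reuse counts are illustrative.

```python
def weight_words_per_cycle(array_dim, reuse_cycles):
    """Weight words per cycle needed to refill an array_dim x array_dim array
    during the reuse_cycles cycles in which the current weights are in use."""
    return array_dim * array_dim / reuse_cycles


def input_words_per_cycle(array_dim):
    """One input vector of length array_dim enters the array every cycle."""
    return array_dim


dim = 128
for reuse in (128, 1024, 4096):
    print(f"reuse={reuse:>4}: weights {weight_words_per_cycle(dim, reuse):6.1f} "
          f"words/cycle vs inputs {input_words_per_cycle(dim)} words/cycle")
```

With enough reuse, the weight path needs only a small fraction of the input path's width, which is what lets designers spend less wiring on it.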

Feedback Loops, Not Logic Depth, Ultimately Limit a Chip's Maximum Clock Speed

While you can insert registers (pipelining) to shorten simple logic paths and increase clock speed, you cannot easily do this with a feedback loop (e.g., an accumulator). The time it takes for a signal to traverse this recurring loop becomes the fundamental constraint that dictates the entire chip's maximum clock frequency.
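
A simplified timing model illustrates the floor that a loop sets. It assumes the clock period must cover the slowest register-to-register path, that a feed-forward path divides evenly across its pipeline stages, and that the feedback loop must complete within one cycle. Register setup and clock-to-output overheads are ignored, and the delay values are made up.

```python
def min_clock_period(forward_delay, stages, loop_delay):
    """Shortest usable clock period: the slower of one pipeline stage
    and one full trip around the feedback loop."""
    return max(forward_delay / stages, loop_delay)


forward_ns, loop_ns = 8.0, 1.0  # hypothetical delays in nanoseconds
for stages in (1, 2, 4, 8, 16):
    period = min_clock_period(forward_ns, stages, loop_ns)
    print(f"{stages:>2} stages: period {period} ns")
```

Past 8 stages the period stops improving: the forward path keeps getting shorter, but the accumulator loop still needs its full 1 ns.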

Aggressive Pipelining for Faster Clocks Sacrifices Silicon Area for Actual Logic

While adding pipeline registers can increase a chip's clock speed, the registers themselves consume significant silicon area. Over-pipelining can lead to a chip where most of the area is dedicated to registers, not useful logic, resulting in lower overall throughput despite the high clock frequency.
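
The trade-off can be sketched as throughput per unit of silicon area. In the toy model below, every pipeline stage adds a fixed register area and a fixed register delay; all four default parameters are invented for illustration and are not measurements of any real chip.

```python
def clock_period(stages, logic_delay=10.0, reg_delay=0.5):
    """Per-stage logic delay plus the register's own timing overhead."""
    return logic_delay / stages + reg_delay


def throughput_per_area(stages, logic_delay=10.0, reg_delay=0.5,
                        logic_area=100.0, reg_area=10.0):
    """Results per time per area for one pipelined unit."""
    area = logic_area + stages * reg_area
    return 1.0 / (clock_period(stages, logic_delay, reg_delay) * area)


candidates = [1, 2, 4, 8, 16, 32, 64]
best = max(candidates, key=throughput_per_area)
for s in candidates:
    print(f"{s:>2} stages: period {clock_period(s):.3f}, "
          f"throughput/area {throughput_per_area(s):.5f}")
print("best stage count:", best)  # prints best stage count: 16
```

In this model the 64-stage design has the fastest clock of all the candidates, yet it delivers less throughput per area than the 16-stage design because registers dominate its footprint.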

FPGAs are Inefficient Because They Emulate Simple Gates with Large Lookup Tables

An FPGA's inefficiency stems from its programmable nature. A simple 3-gate 'AND' circuit in a custom ASIC is implemented on an FPGA using a generic lookup table (LUT). This LUT, which is essentially a multiplexer, might require over 30 gates to build, creating a ~10x overhead in area and power.
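
To make the "LUT is essentially a multiplexer" point concrete, the sketch below evaluates a k-input LUT as a tree of 2:1 muxes that select one stored configuration bit, and counts the muxes in the tree. The 3-input AND and XOR configurations are hypothetical examples, and the count excludes the storage cells that hold the configuration bits.

```python
def lut_eval(config_bits, inputs):
    """Evaluate a LUT as a mux tree; return (output, number of 2:1 muxes).

    config_bits has 2**k entries; inputs[0] is the least significant select.
    """
    level = list(config_bits)
    muxes = 0
    for sel in inputs:
        # Each input drives one tree level, picking between adjacent pairs.
        level = [level[i + 1] if sel else level[i]
                 for i in range(0, len(level), 2)]
        muxes += len(level)
    return level[0], muxes


AND3 = [0, 0, 0, 0, 0, 0, 0, 1]  # output is 1 only at index 7 (all inputs 1)
XOR3 = [0, 1, 1, 0, 1, 0, 0, 1]  # same hardware, different configuration

print(lut_eval(AND3, (1, 1, 1)))  # prints (1, 7)
print(lut_eval(AND3, (1, 0, 1)))  # prints (0, 7)
print(lut_eval(XOR3, (1, 0, 0)))  # prints (1, 7)
```

The same seven muxes are present whether the LUT is configured as an AND or as something else entirely, and each 2:1 mux is itself several gates. That fixed cost, paid regardless of how simple the target function is, is the overhead the paragraph above describes.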

AI Accelerators Use Software-Managed Scratchpads for Deterministic Latency

Unlike CPUs, whose hardware-managed caches make memory latency unpredictable, AI accelerators like TPUs often use software-managed scratchpads. This gives the programmer explicit control over data placement, ensuring the deterministic memory access times that are critical for synchronizing large parallel computations.
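
The difference in predictability can be shown with a toy latency model. The sketch assumes a tiny direct-mapped cache with one word per line and made-up cycle counts for hits, misses, and scratchpad accesses; it is an illustration of the idea, not a model of any real CPU or TPU.

```python
HIT_CYCLES, MISS_CYCLES, SCRATCHPAD_CYCLES = 1, 100, 1  # assumed latencies


def cache_latencies(addresses, num_lines=4):
    """Direct-mapped cache: each access's cost depends on earlier accesses."""
    lines = [None] * num_lines
    latencies = []
    for addr in addresses:
        idx = addr % num_lines
        latencies.append(HIT_CYCLES if lines[idx] == addr else MISS_CYCLES)
        lines[idx] = addr  # fill or replace the line
    return latencies


def scratchpad_latencies(addresses):
    """Software placed the data beforehand, so every access costs the same."""
    return [SCRATCHPAD_CYCLES for _ in addresses]


trace = [0, 4, 0, 1, 1]  # addresses 0 and 4 collide in the same cache line
print(cache_latencies(trace))       # prints [100, 100, 100, 100, 1]
print(scratchpad_latencies(trace))  # prints [1, 1, 1, 1, 1]
```

The cache's timing depends on the access history, so two units running slightly different traces drift apart. With a scratchpad, the compiler or programmer can know each access's cost ahead of time and schedule around it.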

A GPU is Architecturally Like a Grid of Many Small TPUs

At a high level, a GPU's architecture consists of many replicated, smaller compute units (SMs), each with its own logic and memory. A TPU has a more centralized, coarse-grained design with a few very large, specialized units. One can think of a GPU as a collection of many tiny TPUs tiled across a chip.
