Unlike traditional computing workloads, where inputs are standardized, LLMs handle requests of widely varying lengths and produce outputs whose lengths cannot be known in advance. This unpredictability creates severe scheduling and memory-management challenges on GPUs, which were not designed for such chaotic, real-time workloads.
Contrary to the idea that infrastructure problems get commoditized, AI inference is growing more complex. This is driven by three factors: (1) increasing model scale (multi-trillion parameters), (2) greater diversity in model architectures and hardware, and (3) the shift to agentic systems that require managing long-lived, unpredictable state.
The critical open-source inference engine vLLM began in 2022, pre-ChatGPT, as a small side project. The goal was simply to optimize a slow demo for Meta's now-obscure OPT model, but the work uncovered deep, unsolved systems problems in autoregressive model inference that took years to tackle.
Traditional ML used "micro-batching," normalizing inputs to the same size so a whole batch could run as one tensor. LLMs break this model because input and output lengths vary per request. The core innovation is continuous batching: the engine processes one token step at a time across all active requests, admitting new requests and retiring finished ones between steps. This creates complex scheduling and memory challenges, which techniques like PagedAttention address by managing the KV cache in fixed-size blocks.
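To make the contrast with static batching concrete, here is a minimal, hypothetical sketch of continuous (iteration-level) batching in Python. It is not vLLM's actual scheduler; `Request`, `decode_one_token`, and `continuous_batching_loop` are illustrative stand-ins for the real engine's data structures and forward pass.

```python
# Toy sketch of continuous (iteration-level) batching, NOT vLLM's scheduler:
# requests join and leave the running batch between token steps, so a short
# request never waits for the longest request in its batch to finish.
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Request:
    prompt: list[int]                 # prompt token ids
    max_new_tokens: int
    generated: list[int] = field(default_factory=list)

    def is_finished(self, eos_id: int = 0) -> bool:
        return (len(self.generated) >= self.max_new_tokens
                or (self.generated and self.generated[-1] == eos_id))


def decode_one_token(request: Request) -> int:
    """Placeholder for one forward pass producing the next token (dummy id here)."""
    return len(request.generated) + 1


def continuous_batching_loop(waiting: deque, max_batch_size: int = 8) -> None:
    running: list[Request] = []
    while waiting or running:
        # Admit new requests whenever a slot frees up -- the key difference
        # from static batching, which waits for the entire batch to drain.
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())

        # One decoding step: generate exactly one token per active request.
        for req in running:
            req.generated.append(decode_one_token(req))

        # Retire finished requests immediately so their slots (and KV-cache
        # memory) can be reused on the very next step.
        running = [r for r in running if not r.is_finished()]


if __name__ == "__main__":
    queue = deque(Request(prompt=[1, 2, 3], max_new_tokens=n) for n in (4, 16, 2))
    continuous_batching_loop(queue)
```

Because admission and retirement happen between every token step, GPU capacity freed by a two-token reply is reused immediately instead of sitting idle until the longest request in the batch completes.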
vLLM thrives by creating a multi-sided ecosystem where stakeholders contribute for their own self-interest. Model providers contribute to ensure their models run well. Silicon providers (NVIDIA, AMD) contribute to support their hardware. This flywheel effect establishes the platform as a de facto standard, benefiting the entire ecosystem.
Agentic workflows involving tool use or human-in-the-loop steps break the simple request-response model. The system no longer knows when a "conversation" is truly over, creating an unsolved cache invalidation problem. State (like the KV cache) might need to be preserved for seconds, minutes, or hours, disrupting memory management patterns.
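The retention problem can be illustrated with a toy session cache. None of this is vLLM's API; `SessionKVCache`, its TTL parameter, and the LRU eviction policy are assumptions used only to show why the engine must guess how long to keep a conversation's KV state alive.

```python
# Minimal sketch (NOT vLLM's API) of the state-retention problem agentic
# workflows create: the engine cannot know whether a paused conversation will
# resume in seconds or hours, so cached KV state can only be evicted on a
# guess, e.g. a time-to-live plus least-recently-used pressure eviction.
import time
from collections import OrderedDict


class SessionKVCache:
    def __init__(self, capacity_bytes: int, ttl_seconds: float):
        self.capacity_bytes = capacity_bytes
        self.ttl_seconds = ttl_seconds
        self.used_bytes = 0
        # session_id -> (kv_blocks, size_bytes, last_access_time), LRU-ordered
        self._sessions: "OrderedDict[str, tuple]" = OrderedDict()

    def put(self, session_id: str, kv_blocks: object, size_bytes: int) -> None:
        self.evict_expired()
        # Under memory pressure, drop the least recently used session -- a
        # heuristic only, since a "stale" agent may still return with a tool result.
        while self.used_bytes + size_bytes > self.capacity_bytes and self._sessions:
            _, (_, freed, _) = self._sessions.popitem(last=False)
            self.used_bytes -= freed
        self._sessions[session_id] = (kv_blocks, size_bytes, time.monotonic())
        self.used_bytes += size_bytes

    def get(self, session_id: str):
        """Return cached KV state if the session resumed in time, else None."""
        entry = self._sessions.get(session_id)
        if entry is None:
            return None
        kv_blocks, size_bytes, _ = entry
        self._sessions.move_to_end(session_id)           # refresh LRU order
        self._sessions[session_id] = (kv_blocks, size_bytes, time.monotonic())
        return kv_blocks

    def evict_expired(self) -> None:
        now = time.monotonic()
        for sid in list(self._sessions):
            _, size_bytes, last_used = self._sessions[sid]
            if now - last_used > self.ttl_seconds:
                del self._sessions[sid]
                self.used_bytes -= size_bytes
```

Whatever TTL or eviction policy is chosen, it is a guess: too aggressive and a resuming agent must recompute its entire prefix; too lenient and idle conversations pin scarce GPU memory.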
The collective innovation pace of the vLLM open-source community is so rapid that even well-resourced internal corporate teams cannot keep up. Companies find that maintaining an internal fork or proprietary engine is unsustainable, making adoption of the open standard the only viable long-term strategy to stay on the cutting edge.
Maintaining production-grade open-source AI software is extremely expensive. vLLM's continuous integration (CI) bill exceeds $100k per month to ensure every commit is tested and reliable enough for deployment on potentially millions of GPUs. This highlights the significant, often-invisible financial overhead required to steward critical open-source infrastructure.
