© 2026 RiffOn. All rights reserved.


Inference engineering and the real-world deployment of LLMs, with Philip Kiely

Complex Systems with Patrick McKenzie (patio11) · Mar 12, 2026

Explore inference engineering, the engine driving real-world LLM deployment. Learn about agentic systems, open models, and future AI adoption.

Enterprise AI Will Become as Ubiquitous as 'If' Statements, Not a Separate Department

Large companies will adopt LLMs not as siloed products but as fundamental primitives integrated into every process, much like 'if' statements and 'for' loops are integral to all software. If a business process lacks AI integration by 2026, it will be considered a catastrophic failure.


Sophisticated AI Systems Will Use Cheap Models as Intelligent Routers

Advanced AI architectures will use small, fast, and cheap local models to act as intelligent routers. These models will first analyze a complex request, formulate a plan, and then delegate different sub-tasks to a fleet of more powerful or specialized models, optimizing for cost and performance.
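The router pattern described above can be sketched as follows. This is a minimal illustration, not a real implementation: the model names and the classification logic are hypothetical, and each model call is stubbed out where a production system would hit an inference endpoint.

```python
# Sketch of the router pattern: a cheap "router" model analyzes a request and
# splits it into sub-tasks, each delegated to a larger or specialized model.
# All model calls are stubbed; names like "large-coder" are hypothetical.

def cheap_router(request: str) -> list[dict]:
    """Stub for a small, fast local model that plans and splits a request."""
    plan = []
    if "summarize" in request:
        plan.append({"model": "small-summarizer", "task": "summarize"})
    if "code" in request:
        plan.append({"model": "large-coder", "task": "write code"})
    # Fall back to a single general-purpose model if no sub-tasks were found.
    return plan or [{"model": "general-large", "task": request}]

def dispatch(request: str) -> list[str]:
    results = []
    for step in cheap_router(request):
        # Stub for calling the delegated model with its assigned sub-task.
        results.append(f"{step['model']} handled: {step['task']}")
    return results

print(dispatch("summarize this doc and write code for the parser"))
```

The cost win comes from the router itself being cheap to run: the expensive models are only invoked for the sub-tasks that genuinely need them.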


Markdown Has Become the Accidental Lingua Franca for LLM Tooling

Markdown, originally designed for blogging, has emerged as the de facto standard for interaction between LLMs and tools. This happened not by design, but because it's human-readable, highly token-efficient compared to alternatives like HTML, and familiar to the early adopters who trained the models.
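The token-efficiency point can be made concrete by encoding the same small table both ways. Character count is used here as a crude proxy for token count (real tokenizers vary, but the ratio is similar in direction):

```python
# The same two-row table in HTML vs. Markdown. Markdown carries the same
# structure with far less markup overhead, which translates to fewer tokens.

html_table = (
    "<table><tr><th>Model</th><th>Cost</th></tr>"
    "<tr><td>small</td><td>$0.10</td></tr>"
    "<tr><td>large</td><td>$1.00</td></tr></table>"
)

markdown_table = (
    "| Model | Cost |\n"
    "|-------|------|\n"
    "| small | $0.10 |\n"
    "| large | $1.00 |"
)

print(len(html_table), len(markdown_table))
```

Every closing tag HTML requires is pure overhead from a model's perspective; Markdown's delimiters are shorter and double as human-readable layout.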


LLMs Don't Act; They Ask a Software 'Harness' to Act for Them

A common misconception is that LLMs can directly perform actions. In reality, a model can only output text. This text is a request to an external software system, called a 'harness,' which then interprets the request and executes the action (e.g., calling an API) on the model's behalf.
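A minimal sketch of the harness pattern: the "model output" below is hard-coded text (a JSON tool-call request), and the harness is the ordinary software that parses it and performs the real action. The tool name and weather function are invented for illustration.

```python
import json

# The model can only emit text. Here that text is a JSON request naming a
# tool; the harness interprets it and executes the action on the model's behalf.

def fake_model_output() -> str:
    # Stand-in for an LLM completion: a textual request, not an action.
    return json.dumps({"tool": "get_weather", "args": {"city": "Tokyo"}})

TOOLS = {
    # Real side-effecting functions live in the harness, not the model.
    "get_weather": lambda city: f"Weather in {city}: sunny (stubbed)",
}

def harness(model_text: str) -> str:
    request = json.loads(model_text)   # interpret the model's textual request
    tool = TOOLS[request["tool"]]      # look up the real function
    return tool(**request["args"])     # execute the action for the model

print(harness(fake_model_output()))
```

The separation matters for safety and debugging: every action the system takes passes through harness code you control and can log, validate, or refuse.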


AI Enables Economically Rational 'Disposable Code' for Single-Use Tasks

LLMs make it feasible to generate complex software intended to be executed only once. This 'disposable code' automates tasks previously too niche or time-consuming to justify manual software development, such as writing a custom script to alphabetize a book's appendix for a single use.
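The appendix example from the episode is exactly the kind of script an LLM can produce in seconds and you can discard after one run. A plausible sketch (the entries and the leading-"The" convention are illustrative):

```python
# Single-use script: alphabetize a book appendix once, then throw the code
# away. Entries are sorted case-insensitively, ignoring a leading "The ".

entries = [
    "The Mythical Man-Month",
    "compilers",
    "Inference engineering",
    "markdown",
]

def sort_key(entry: str) -> str:
    e = entry.lower()
    return e[4:] if e.startswith("the ") else e

for entry in sorted(entries, key=sort_key):
    print(entry)
```

Before LLMs, the ten minutes it takes to write and test this exceeded the value of running it once; when generation costs seconds, the economics flip.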


The AI Future Is Heterogeneous, Letting Companies 'Own Their Intelligence'

Contrary to fears of a monopoly, the AI market is heading toward a diverse ecosystem. The proliferation of open-weight models and specialized tooling allows companies to build and control their own differentiated AI systems rather than simply renting intelligence token-by-token from a handful of large labs.


Software Engineering Is Shifting From Writing Code to Directing AI Tools

The craft of software engineering is evolving away from precise code editing. Much like compilers abstracted away assembly language, modern AI coding tools are a new abstraction layer, turning engineers into directors who guide AI to write and edit code on their behalf.


Agentic AI Will Cause an Explosion in Inference Demand

The shift from simple chatbots (one user request, one API call) to agentic AI systems will decouple inference requests from direct user actions. A single user request could trigger hundreds or thousands of automated model calls, leading to an exponential increase in compute demand and cost.
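A back-of-envelope model of the fan-out makes the scale clear. All of the multipliers below (plan steps, sub-agents, retries) are hypothetical, chosen only to show how quickly the product grows:

```python
# Chatbot: one user request maps to one inference call. Agent: each request
# triggers a plan whose steps fan out to sub-agents, each of which may retry.

def chatbot_calls(user_requests: int) -> int:
    return user_requests  # one request, one model call

def agent_calls(user_requests: int, plan_steps: int = 10,
                subagents_per_step: int = 5, retries: int = 2) -> int:
    # Hypothetical multipliers: 10 steps x 5 sub-agents x 2 attempts each.
    return user_requests * plan_steps * subagents_per_step * retries

print(chatbot_calls(1))  # 1
print(agent_calls(1))    # 100
```

The key property is that the multipliers compound: a modest increase in agent depth or breadth multiplies total inference demand rather than adding to it.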


AI Observability Is Paradoxically Worsening Due to Advanced Optimizations

While 'chain of thought' provides some transparency, advanced inference techniques like speculative decoding are making AI systems less observable. These methods operate on abstract 'hidden states' rather than human-readable text, creating a new challenge for monitoring and debugging that requires specialized tooling.


Benchmark Saturation Signals a Shift From Seeking Intelligence to Cutting Costs

When multiple models can solve a task reliably ('benchmark saturation'), the strategic goal is no longer to find the most intelligent model. Instead, it becomes an optimization problem: select the smallest, cheapest, and fastest model that still meets the performance bar, creating a major competitive advantage in inference.
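Once several models clear the quality bar, the selection step above reduces to a simple constrained optimization. A sketch with hypothetical models and prices:

```python
# Hypothetical model catalog: accuracy on the task, $ per 1M tokens, latency.
models = [
    {"name": "frontier-xl", "accuracy": 0.97, "cost": 15.00, "latency_ms": 900},
    {"name": "mid-pro",     "accuracy": 0.95, "cost": 3.00,  "latency_ms": 400},
    {"name": "small-turbo", "accuracy": 0.93, "cost": 0.30,  "latency_ms": 120},
    {"name": "tiny-lite",   "accuracy": 0.81, "cost": 0.05,  "latency_ms": 60},
]

def cheapest_passing(models: list[dict], accuracy_bar: float) -> str:
    """Pick the cheapest model that still meets the accuracy bar."""
    passing = [m for m in models if m["accuracy"] >= accuracy_bar]
    return min(passing, key=lambda m: m["cost"])["name"]

print(cheapest_passing(models, 0.90))  # small-turbo
```

Note how the answer flips with the bar: at a 0.96 requirement only the frontier model qualifies, while at 0.90 the 50x-cheaper option wins, which is where inference optimization becomes a competitive edge.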


Author Self-Published 'Inference Engineering' to Beat Obsolete Publishing Timelines

The author self-published his technical book on AI inference because traditional publishers' 12-18 month timelines are unacceptably slow for such a fast-moving field. The decision was a strategic trade-off, prioritizing the low-latency delivery of timely knowledge over the high-throughput processes of established publishing houses.


Early Enterprise AI Adoption Mirrors the Web's 'HTML Forms' Era

Current AI adoption in large companies focuses on porting existing business processes into an AI substrate, similar to how early websites were just digital versions of paper forms. The true disruption will come from AI-native firms that build entirely new business models, much as DoorDash went far beyond putting a restaurant's paper order form online.
