Unlike traditional ML, where models are retrained repeatedly on a fixed dataset, each frontier LLM pre-training run consumes more compute than any run before it. That makes each run a one-shot endeavor: success hinges on using scaling laws to accurately predict final performance from smaller-scale experiments.
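As a rough illustration of what predicting from smaller-scale experiments can look like, the sketch below fits a pure power law, loss ≈ a * compute^b, to a few small runs and extrapolates to a larger budget. The data points are synthetic, the function names are hypothetical, and the pure power-law form is a simplifying assumption (published scaling-law fits often add an irreducible-loss term).

```python
import numpy as np

def fit_power_law(compute, loss):
    """Fit loss ~= a * compute**b by linear regression in log-log space."""
    slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
    return np.exp(intercept), slope

def predict_loss(a, b, compute):
    """Evaluate the fitted power law at a new compute budget."""
    return a * compute ** b

# Synthetic small-scale runs (illustrative values, not real measurements).
small_compute = np.array([1e18, 1e19, 1e20, 1e21])
small_loss = 50.0 * small_compute ** -0.05

a, b = fit_power_law(small_compute, small_loss)

# Extrapolate three orders of magnitude beyond the largest small run.
predicted = predict_loss(a, b, 1e24)
print(f"fitted exponent: {b:.3f}, predicted loss at 1e24 FLOPs: {predicted:.3f}")
```

In practice the extrapolation is only as good as the assumed functional form, which is why the fit is usually checked against held-out runs at intermediate scales before committing to the large run.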
A 10x increase in compute may only yield a one-tier improvement in model performance. This appears inefficient but can be the difference between a useless "6-year-old" intelligence and a highly valuable "16-year-old" intelligence, unlocking entirely new economic applications.
The "Bitter Lesson" is not just about using more compute, but leveraging it scalably. Current LLMs are inefficient because they only learn during a discrete training phase, not during deployment where most computation occurs. This reliance on a special, data-intensive training period is not a scalable use of computational resources.
The enormous compute budget for the original AlphaGo was not about finding the most efficient training method, but about proving a method could work at all. Once a breakthrough is made and the path is clear, subsequent efforts can focus on optimization and achieve similar results with far less compute.
AI model capabilities follow a predictable, non-linear scaling law: increasing training compute by 10x roughly doubles a model's capabilities. Because each such step multiplies capability rather than adding a fixed increment, this relationship is what will drive underappreciated and disruptive advancements across many industries.
The relationship between computing power and AI model capability is not linear. According to established 'scaling laws,' a tenfold increase in the compute used for training large language models (LLMs) results in roughly a doubling of the model's capabilities, highlighting the immense resources required for incremental progress.
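Taken at face value, the quoted rule of thumb (10x compute for 2x capability) is a power law with exponent log10(2), about 0.30. The sketch below only works through that arithmetic; "capability" is treated as an abstract, unitless score, and the function names are hypothetical.

```python
import math

# Exponent implied by the rule of thumb: 10x compute -> 2x capability.
ALPHA = math.log10(2)  # about 0.301

def capability_multiplier(compute_multiplier):
    """Capability gain implied by scaling training compute by this factor."""
    return compute_multiplier ** ALPHA

def compute_needed(capability_target):
    """Compute scale-up implied by a desired capability gain."""
    return capability_target ** (1 / ALPHA)

for factor in (10, 100, 1000):
    print(f"{factor}x compute -> about {capability_multiplier(factor):.2f}x capability")

print(f"8x capability -> about {compute_needed(8):.0f}x compute")
```

Under this reading, three successive doublings of capability require roughly a thousandfold increase in compute, which is the sense in which incremental progress demands immense resources.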
Instead of waiting days for a training checkpoint to evaluate an LLM's performance, use Monte Carlo simulations on its initial reward trajectories. This allows you to predict the model's final performance within the first hour and terminate failing experiments, saving significant time and compute.
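The source does not say how the Monte Carlo simulation is set up, so the following is only one possible reading. It bootstraps the per-step reward changes seen so far to simulate many continuations of the run, then estimates the chance of reaching a target reward. All names, thresholds, and the resampling scheme are assumptions; in particular, resampling early increments ignores the saturation that real learning curves show.

```python
import numpy as np

def simulate_final_rewards(early_rewards, total_steps, n_sims=2000, seed=0):
    """Bootstrap early per-step reward changes to simulate final rewards.

    Assumption (ours, not the source's): future per-step changes are drawn
    from the same distribution as the changes already observed.
    """
    rng = np.random.default_rng(seed)
    early_rewards = np.asarray(early_rewards, dtype=float)
    increments = np.diff(early_rewards)
    remaining = max(total_steps - len(early_rewards), 0)
    # Each row is one simulated continuation of the run.
    sampled = rng.choice(increments, size=(n_sims, remaining), replace=True)
    return early_rewards[-1] + sampled.sum(axis=1)

def should_terminate(early_rewards, total_steps, target, min_prob=0.05):
    """Flag a run whose simulated chance of reaching `target` is too low."""
    finals = simulate_final_rewards(early_rewards, total_steps)
    prob_success = float(np.mean(finals >= target))
    return prob_success < min_prob, prob_success

# Hypothetical early trajectories: one stalled run, one steadily improving run.
stalled = [0.1] * 10
improving = np.linspace(0.0, 0.09, 10)

print(should_terminate(stalled, total_steps=100, target=0.5))
print(should_terminate(improving, total_steps=100, target=0.5))
```

The `min_prob` threshold trades off wasted compute against the risk of killing a slow starter, so it would need tuning against runs whose outcomes are already known.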
The Chinchilla scaling law optimizes pre-training compute alone. However, production models must also account for inference costs. By training smaller models on much more data (~100x the Chinchilla optimum), labs create models that are cheaper to run for users, effectively amortizing the higher training cost over the model's lifetime.
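The amortization argument can be made concrete with back-of-the-envelope arithmetic. The sketch below uses common rules of thumb that are not from the source: training costs about 6 * N * D FLOPs, serving costs about 2 * N FLOPs per token, and the Chinchilla optimum is about 20 tokens per parameter. The two model sizes are hypothetical, and treating them as comparable in quality is an assumption made purely for illustration.

```python
def training_flops(params, tokens):
    """Rule-of-thumb training cost: about 6 FLOPs per parameter per token."""
    return 6 * params * tokens

def inference_flops(params, tokens_served):
    """Rule-of-thumb serving cost: about 2 FLOPs per parameter per token."""
    return 2 * params * tokens_served

def lifetime_flops(params, train_tokens, tokens_served):
    """Training plus serving compute over the model's deployed lifetime."""
    return training_flops(params, train_tokens) + inference_flops(params, tokens_served)

def break_even_tokens(small_params, small_tokens, large_params, large_tokens):
    """Served tokens at which the over-trained small model becomes cheaper overall."""
    extra_training = training_flops(small_params, small_tokens) - training_flops(large_params, large_tokens)
    saving_per_token = 2 * (large_params - small_params)
    return extra_training / saving_per_token

# Hypothetical pair, ASSUMED to reach comparable quality:
# a 7B model trained on 100x its ~20 tokens/param optimum,
# versus a 30B model trained at ~20 tokens/param.
small = (7e9, 100 * 20 * 7e9)
large = (30e9, 20 * 30e9)

tokens = break_even_tokens(*small, *large)
print(f"small model is cheaper overall after about {tokens:.2e} served tokens")
```

Past the break-even point every additional served token widens the gap in favor of the smaller model, which is the sense in which the extra training cost is amortized over deployment.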
To minimize the total cost for a certain level of performance, the compute budgets for a model's lifecycle stages should be balanced. A powerful heuristic is to equalize the costs: the compute spent on pre-training should roughly equal the compute for RL/fine-tuning, and also equal the total compute for user inference.
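A small sketch of how this heuristic could be checked, again assuming the rough approximations of 6 * N * D FLOPs for pre-training and 2 * N FLOPs per served token (neither is stated in the source). Under those approximations, inference compute matches pre-training compute once the model has served about three times as many tokens as it was trained on, regardless of model size. The function names are hypothetical.

```python
def balanced_inference_tokens(train_tokens):
    """Served tokens at which inference compute equals pre-training compute.

    With 6*N*D for training and 2*N per served token,
    2*N*T = 6*N*D gives T = 3*D, independent of model size N.
    """
    return 3 * train_tokens

def stage_imbalance(pretrain, post_train, inference):
    """Ratio of each stage's compute to the mean of the three stages."""
    mean = (pretrain + post_train + inference) / 3
    return {
        "pretrain": pretrain / mean,
        "post_train": post_train / mean,
        "inference": inference / mean,
    }

# Hypothetical lifecycle budgets in FLOPs; a ratio of 1.0 means balanced.
print(balanced_inference_tokens(1e12))
print(stage_imbalance(pretrain=6e23, post_train=1e23, inference=2e23))
```

Ratios far from 1.0 point to the stage where, by this heuristic, the budget is over- or under-allocated.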
The market often misinterprets AI progress as linear. However, a clear 'scaling law' dictates that a tenfold increase in the computing power used to train LLMs results in a twofold capability improvement. Because each such step multiplies capability rather than adding to it, future advancements will be far more disruptive and surprising than incremental projections suggest.
The rapid, step-change improvements in LLMs are likely slowing down. This is because models have already been trained on most of the available internet, and the compute budget required for each incremental improvement is increasing exponentially to an unsustainable degree. A new architectural breakthrough, not just more data and compute, is needed for the next leap.