Instead of just copying outputs for supervised fine-tuning, Chinese labs use frontier US models as automated evaluators in their reinforcement learning loops. This allows their own models to develop capabilities within their native distributions and potentially surpass the teacher model.
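A minimal sketch of what that loop can look like, assuming a hypothetical `policy` object with `generate()`/`update()` methods and a frontier judge reached through the OpenAI chat API; the judge model name and grading rubric are illustrative, not taken from the episode:

```python
# Sketch: a frontier US model as the automated evaluator inside an RL loop.
# Assumptions: `policy` is the lab's own model (hypothetical generate/update
# methods); the judge model name and rubric are illustrative.
from openai import OpenAI

client = OpenAI()

def judge_score(prompt: str, candidate: str) -> float:
    """Ask the frontier model to grade a candidate answer on a 0-10 rubric."""
    rubric = (
        "Score the answer from 0 to 10 for correctness and reasoning quality. "
        "Reply with the number only.\n\n"
        f"Question: {prompt}\nAnswer: {candidate}"
    )
    reply = client.chat.completions.create(
        model="gpt-4o",  # assumed judge model
        messages=[{"role": "user", "content": rubric}],
    )
    return float(reply.choices[0].message.content.strip()) / 10.0

def rl_step(policy, prompts, num_samples=4):
    """Sample from the lab's own model, score with the judge, reinforce."""
    for prompt in prompts:
        candidates = [policy.generate(prompt) for _ in range(num_samples)]
        rewards = [judge_score(prompt, c) for c in candidates]
        policy.update(prompt, candidates, rewards)  # e.g. a PPO/GRPO-style update
```

Because the judge only scores, the trained model keeps sampling from its own distribution, which is why it can end up stronger than the judge on the task being optimized.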
When a company distills knowledge from a competitor's AI, it's not just scraping pre-training data. It's a highly efficient process of extracting the model's intelligence, reasoning patterns, and skills. This is more akin to an apprentice directly interacting with and learning from a world-class expert than simply reading the same textbooks the expert used.
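A minimal sketch of that apprenticeship, assuming a hypothetical `teacher.ask()` wrapper around the competitor's API and a Hugging Face-style causal-LM `student`; none of these names come from the episode:

```python
# Sketch: distillation as supervised learning on the expert's full responses.
# `teacher.ask()` is a hypothetical API wrapper; `student` and `tokenizer`
# are assumed Hugging Face-style objects.
def build_distillation_set(teacher, prompts):
    """Collect the teacher's answers -- reasoning included -- as training targets."""
    return [{"prompt": p, "completion": teacher.ask(p)} for p in prompts]

def distill(student, tokenizer, examples, optimizer):
    """Fine-tune the student to reproduce the teacher's behavior."""
    student.train()
    for ex in examples:
        batch = tokenizer(ex["prompt"] + ex["completion"], return_tensors="pt")
        loss = student(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```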
Chinese AI models appear close to the frontier primarily because they are trained on the outputs of leading U.S. models. This creates a dependency loop: they can only catch up by using the latest from the West, ensuring they remain followers rather than innovators who can achieve a true breakthrough.
Reinforcement learning achieves superhuman results not by inventing alien concepts, but by surfacing and combining rare behaviors that are already possible within a model's vast pre-trained distribution. The goal of pre-training is to make this search for novel solutions more efficient and less random.
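A back-of-envelope illustration of the "surfacing" claim; the one-in-ten-thousand success rate below is an assumed number, not a figure from the episode:

```python
# Sketch: a behavior that is rare -- but present -- in the pre-trained
# distribution is findable by sampling, and RL then amplifies it.
p_rare = 1e-4              # assumed chance one sample hits the rare correct behavior
samples_per_prompt = 1024  # RL rollouts typically explore many samples per prompt

p_found = 1 - (1 - p_rare) ** samples_per_prompt
print(f"P(at least one hit in {samples_per_prompt} samples) = {p_found:.1%}")
# ~9.7%: rare but reachable, so RL can reinforce it into a common behavior.
# A behavior with probability ~0 under pre-training would never be surfaced.
```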
RL fine-tuning is less likely to cause catastrophic forgetting than SFT because it works within the model's existing pre-trained pathways, or "grooves." SFT, by contrast, makes much larger weight updates that can aggressively overwrite and destroy latent knowledge.
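One way to see the "grooves" claim is to put the two objectives side by side; the sketch below assumes hypothetical `logprob()` helpers and an illustrative KL coefficient:

```python
# Sketch: why on-policy RL tends to stay in the pre-trained "grooves".
# `model.logprob()` and `ref_model.logprob()` are hypothetical helpers.
def sft_loss(model, prompt, target):
    """SFT pushes toward an external target, however unlikely it is under the
    model's own distribution -- the source of large, destructive updates."""
    return -model.logprob(target, given=prompt)

def rl_loss(model, ref_model, prompt, sample, reward, kl_coef=0.1):
    """On-policy RL reinforces the model's *own* samples and adds a KL term
    that explicitly penalizes drifting away from the pre-trained reference."""
    kl = model.logprob(sample, given=prompt) - ref_model.logprob(sample, given=prompt)
    return -reward * model.logprob(sample, given=prompt) + kl_coef * kl
```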
The argument that LLMs are just "stochastic parrots" is outdated. Current frontier models are further trained with reinforcement learning, where the signal is not "did you predict the right token?" but "did you get the right answer?" That judgment rests on complex, often qualitative criteria, pushing models beyond simple statistical correlation.
China is gaining an efficiency edge in AI by using "distillation": training smaller, cheaper student models on the outputs of larger teacher models. This teacher-student shortcut is much faster and challenges the capital-intensive US strategy, highlighting how inefficient and "bloated" current Western foundational models are.
AI labs like Anthropic find that, within just a few months, mid-tier models trained with reinforcement learning can outperform their largest, most expensive models, accelerating the pace of capability improvements.
Basic supervised fine-tuning (SFT) only adjusts a model's style. The real unlock for enterprises is reinforcement fine-tuning (RFT), which leverages proprietary datasets to create state-of-the-art models for specific, high-value tasks, moving beyond mere 'tone improvements.'
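A minimal sketch of that recipe; the dataset file, the grader, and the `rft_train()` call are hypothetical illustrations, not any specific vendor's RFT API:

```python
# Sketch: reinforcement fine-tuning driven by proprietary data and a grader.
# Every name here (the file, grader, rft_train) is a hypothetical illustration.
import json

def load_cases(path="proprietary_decisions.jsonl"):
    """Proprietary examples: real inputs plus the outcome experts actually chose."""
    with open(path) as f:
        return [json.loads(line) for line in f]

def grader(model_output: str, expert_decision: str) -> float:
    """Reward the high-value task outcome itself, not tone or style.
    Real graders are often partial-credit or model-based, not exact match."""
    return 1.0 if model_output.strip() == expert_decision.strip() else 0.0

# cases = load_cases()
# tuned = rft_train(base_model, cases, grader)  # hypothetical RL loop driven by the grader
```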
Leading Chinese AI models like Kimi appear to be primarily trained on the outputs of U.S. models (a process called distillation) rather than being built from scratch. This suggests China's progress is constrained by its ability to scrape American APIs and fine-tune on their outputs, indicating the U.S. still holds a significant architectural and innovation advantage in foundational AI.
Companies building infrastructure to A/B test models or evaluate prompts have already built most of what's needed for reinforcement learning. The core mechanism of measuring performance against a goal is the same. The next logical step is to use that performance signal to update the model's weights.
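A sketch of that last step, assuming the team's existing scorer is callable as `run_eval(prompt, output)` and that the trainer is any policy-gradient loop that accepts a reward function (both names are stand-ins):

```python
# Sketch: the eval harness you already run offline becomes the RL reward.
# `run_eval` and `SomePolicyGradientTrainer` are assumed stand-ins, not real APIs.
def reward_fn(prompt: str, model_output: str) -> float:
    """Same measurement used for A/B tests and prompt evals, reused as reward."""
    return run_eval(prompt, model_output)  # e.g. pass rate, rubric score, win rate

# trainer = SomePolicyGradientTrainer(model, reward_fn=reward_fn)  # hypothetical
# trainer.train(prompts)  # the genuinely new step: let the score update the weights
```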