The performance gap between Chinese and American frontier AI models is not due to a lack of talent or different training techniques. It comes down primarily to access to massive-scale compute and the capital required to procure it.
The AI development cycle of experimentation and bottleneck-solving is already a form of recursive self-improvement. Kyle Corbitt argues this loop is currently constrained by human intelligence. Once AIs become better at directing this process, progress will accelerate rapidly.
Instead of just copying outputs for supervised fine-tuning, Chinese labs use frontier US models as automated evaluators in their reinforcement learning loops. This allows their own models to develop capabilities within their native distributions and potentially surpass the teacher model.
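A minimal sketch of this judge-as-reward pattern, assuming an OpenAI-compatible endpoint for the frontier "teacher" and a hypothetical `student.sample()` helper standing in for the local policy's sampler; the prompt, judge model name, and 0–10 scale are illustrative, not any lab's actual setup:

```python
# Sketch: use a frontier model as an automated grader inside an RL loop.
from openai import OpenAI

judge = OpenAI()  # frontier model used only to score, never to imitate

JUDGE_PROMPT = """Rate the answer below from 0 to 10 for correctness and clarity.
Question: {question}
Answer: {answer}
Reply with a single integer."""

def judge_reward(question: str, answer: str) -> float:
    """Ask the frontier model to grade a sample drawn from the student policy."""
    resp = judge.chat.completions.create(
        model="gpt-4o",  # placeholder judge model
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
        temperature=0.0,
    )
    try:
        return float(resp.choices[0].message.content.strip()) / 10.0
    except ValueError:
        return 0.0  # unparseable grades count as zero reward

# In the RL loop the judge scores the student's *own* samples, so the student
# improves within its native distribution instead of imitating teacher text:
# rewards = [judge_reward(q, student.sample(q)) for q in batch]
```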
Companies building RL environments are lucrative but likely poor long-term venture investments. Their core assets—the environments—quickly become "saturated" and depreciate as models master them, requiring constant creation of new, non-durable assets to maintain value.
The GRPO algorithm wasn't a huge theoretical leap over its predecessors. It became famous because DeepSeek did the significant engineering work to scale it and, crucially, released a high-performing model that proved the method's practical viability.
In narrow-domain RL, reward hacking is less of a threat than commonly feared. Models exploit reward loopholes so aggressively that the unwanted behavior becomes flagrant, which makes it easy to spot and patch through iterative rubric adjustments.
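A toy illustration of that iterative rubric loop, with made-up checks rather than any real lab's grading rubric: v1 rewards the mere presence of a boxed answer, the policy hacks it in an obvious way, and v2 closes the loophole:

```python
# Toy rubric reward showing why flagrant hacks are easy to patch.
import re

def rubric_reward_v1(answer: str) -> float:
    """v1: reward any answer that ends with a boxed number."""
    return 1.0 if re.search(r"\\boxed\{-?\d+\}", answer) else 0.0

# After a few RL steps the policy emits an empty rationale plus a boxed guess --
# an obvious, flagrant hack -- so the rubric gains two extra clauses.

def rubric_reward_v2(answer: str, expected: int) -> float:
    """v2: the boxed number must match the reference, preceded by a real rationale."""
    match = re.search(r"\\boxed\{(-?\d+)\}", answer)
    if match is None:
        return 0.0
    has_rationale = len(answer[: match.start()].split()) >= 20
    correct = int(match.group(1)) == expected
    return float(correct and has_rationale)
```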
RL fine-tuning is less likely to cause catastrophic forgetting than SFT because it works within the model's existing pre-trained pathways, or "grooves." SFT, by contrast, makes much larger weight updates that can aggressively overwrite and destroy latent knowledge.
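One way to picture the difference, as a hedged sketch: many RL fine-tuning recipes pair a small policy-gradient nudge on sampled tokens with a KL penalty against the frozen pre-trained reference, while SFT applies an unanchored cross-entropy pull toward the target text. The function names and `kl_coef` value below are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def rl_token_loss(policy_logits, ref_logits, chosen_ids, advantages, kl_coef=0.05):
    """policy_logits/ref_logits: (B, T, V); chosen_ids, advantages: (B, T)."""
    logp = F.log_softmax(policy_logits, dim=-1)
    ref_logp = F.log_softmax(ref_logits, dim=-1)
    chosen_logp = logp.gather(-1, chosen_ids.unsqueeze(-1)).squeeze(-1)
    chosen_ref_logp = ref_logp.gather(-1, chosen_ids.unsqueeze(-1)).squeeze(-1)
    # Policy-gradient term: nudge probabilities of the model's own samples...
    pg = -(advantages * chosen_logp)
    # ...while the KL term penalizes drifting away from the pre-trained reference.
    kl = chosen_logp - chosen_ref_logp  # per-token estimate of KL(policy || reference)
    return (pg + kl_coef * kl).mean()

def sft_token_loss(policy_logits, target_ids):
    # SFT: full cross-entropy toward external targets, with no anchor to the reference.
    return F.cross_entropy(policy_logits.view(-1, policy_logits.size(-1)),
                           target_ids.view(-1))
```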
Instead of pinpointing which specific action led to a good outcome, the GRPO algorithm solves the credit assignment problem with a simple heuristic: if an output scores well, every token in it gets upweighted, on the assumption that the rare tokens it contains were the ones responsible. This "unsatisfying" but practical approach works surprisingly well.
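A minimal sketch of that heuristic as it appears in GRPO-style training: each completion in a sampled group gets one advantage (its reward relative to the group), and that single scalar is broadcast to every token in the completion:

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (group_size,), one scalar reward per sampled completion."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def broadcast_to_tokens(advantages: torch.Tensor, token_mask: torch.Tensor) -> torch.Tensor:
    """token_mask: (group_size, seq_len) marking completion tokens.
    Every token inherits its sequence's whole-completion advantage."""
    return advantages.unsqueeze(-1) * token_mask

# Example: 4 completions sampled for one prompt, graded 0/1 by a verifier.
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])
print(grpo_advantages(rewards))  # high-scoring outputs get all their tokens upweighted
```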
The most compelling business reason for enterprises to adopt custom fine-tuning is the need for low latency. For real-time applications like voice bots, large frontier models are too slow. This practical constraint forces companies to use smaller, specialized open-source models.
Reinforcement learning achieves superhuman results not by inventing alien concepts, but by surfacing and combining rare behaviors that are already possible within a model's vast pre-trained distribution. The role of pre-training is to make this search for novel solutions more efficient and less random.
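A toy best-of-n sketch of what "surfacing" rare behaviors means; `policy.sample` and `verifier` are hypothetical stand-ins, and real RL training reinforces the winners rather than just filtering for them:

```python
def surface_rare_behaviors(policy, verifier, prompt, n_samples=64):
    """Search over the model's own distribution: nothing returned is outside
    what the pre-trained model can already produce, but RL makes the rare,
    high-reward samples far more likely on the next iteration."""
    samples = [policy.sample(prompt, temperature=1.0) for _ in range(n_samples)]
    scored = [(verifier(prompt, s), s) for s in samples]
    best = max(score for score, _ in scored)
    # The winners already existed in the distribution; RL upweights them.
    return [s for score, s in scored if score == best]
```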
Reinforcement learning with low-rank adaptation (LoRA) is efficient enough that you can "stuff" multiple, even unrelated, tasks into a single model without them interfering. A small LoRA adapter provides sufficient capacity for several tasks without saturating, avoiding performance degradation from cross-training.
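A sketch of attaching such an adapter with the Hugging Face peft library; the base model name, rank, and target modules below are illustrative assumptions, not tuned recommendations:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct")  # placeholder base model

lora_cfg = LoraConfig(
    r=16,                      # low rank: only a few million trainable parameters
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.0,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()

# The same adapter can then be trained on interleaved batches from several
# unrelated RL tasks; at this scale it typically has headroom for all of them.
```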
Frontier labs deliberately source reinforcement learning environments from many small vendors rather than one large one. This strategy provides a broader diversity of tasks and underlying assumptions, which helps prevent models from learning non-generalizable hacks from a single, homogenous source.
