Cursor and Fireworks intentionally use an asynchronous RL setup where the model used for generating experiences can be slightly behind the model being trained. This "staleness" is an accepted trade-off that keeps expensive GPUs constantly working, compensating for minor algorithmic inefficiencies with higher overall throughput.
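One common way to keep staleness "slight" is to tag each rollout with the version of the weights that generated it and let the trainer discard anything too far behind. The sketch below illustrates that idea only; the `Rollout` and `StalenessBoundedBuffer` names and the `max_staleness` parameter are hypothetical, and nothing here is taken from Cursor's or Fireworks' actual systems.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Rollout:
    policy_version: int  # version of the weights that generated this sample
    reward: float


class StalenessBoundedBuffer:
    """Queue of rollouts that drops samples older than `max_staleness` versions.

    Hypothetical illustration: generators keep producing with slightly old
    weights while the trainer keeps stepping, so neither side waits.
    """

    def __init__(self, max_staleness: int):
        self.max_staleness = max_staleness
        self._queue = deque()

    def put(self, rollout: Rollout) -> None:
        self._queue.append(rollout)

    def take_batch(self, trainer_version: int, batch_size: int) -> list:
        batch = []
        while self._queue and len(batch) < batch_size:
            rollout = self._queue.popleft()
            staleness = trainer_version - rollout.policy_version
            if staleness <= self.max_staleness:
                batch.append(rollout)
            # Otherwise the rollout is too stale and is silently dropped.
        return batch


buffer = StalenessBoundedBuffer(max_staleness=2)
for version in (0, 3, 4, 5):
    buffer.put(Rollout(policy_version=version, reward=1.0))

# The trainer is at version 5, so only rollouts from versions 3, 4, 5 survive.
batch = buffer.take_batch(trainer_version=5, batch_size=10)
kept_versions = [r.policy_version for r in batch]
```

Raising `max_staleness` wastes fewer rollouts and keeps GPUs busier, at the cost of training on data that is further off-policy.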

Related Insights

Algorithms like GRPO are powerful but require parallel rollouts in a reproducible environment. Building and maintaining these high-fidelity sandboxes, complete with realistic data and failure modes, is the hardest part of implementing RL today and a significant barrier for most companies.
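The need for parallel rollouts follows from how GRPO scores a response: each rollout's reward is compared against the other rollouts sampled for the same prompt. A minimal sketch of that group-relative advantage, assuming scalar rewards and a population standard deviation (the function name and `eps` value are illustrative choices):

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group of rollouts.

    All rewards must come from rollouts of the SAME prompt in the same
    environment state, which is why a reproducible sandbox matters.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four rollouts of one prompt: two passed the grader, two failed.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# If every rollout gets the same reward, the group carries no learning signal.
no_signal = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

If the environment is not reproducible, differences in reward reflect sandbox noise rather than differences between the responses, and the normalized advantages become misleading.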

Online RL with live user data is only effective if the model is already good enough for users to engage with it. Cursor uses extensive offline (simulated) RL to teach core reasoning and tool use, meeting a quality bar before deploying it for "real-time" tuning on actual user feedback.

Non-deterministic floating-point math creates tiny numerical differences between training and inference runs. In Mixture-of-Experts (MoE) models, these small deviations can cause different "experts" to be activated, amplifying the error and destabilizing RL. This requires special techniques like "router replay" to ensure consistency.
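The toy example below shows how a change in summation order alone can flip a top-1 routing decision, and how replaying the recorded choice removes the mismatch. It is a sketch under assumptions: the two-expert router, the logit values, and the `route` helper are all hypothetical, and "router replay" is interpreted here as reusing the expert indices chosen at inference time during the training forward pass.

```python
def top1_expert(logits):
    """Index of the largest logit; ties go to the lowest index."""
    return max(range(len(logits)), key=lambda i: logits[i])


def summed(values):
    """Plain left-to-right accumulation, as one kernel might do it."""
    total = 0.0
    for v in values:
        total += v
    return total


contributions = [0.1, 0.2, 0.3]

# Same math, different accumulation order, different last bit.
inference_logits = [0.6, summed(contributions)]            # 0.6000000000000001
training_logits = [0.6, summed(reversed(contributions))]   # 0.6

inference_choice = top1_expert(inference_logits)  # expert 1 wins by one ulp
training_choice = top1_expert(training_logits)    # tie, so expert 0 wins


def route(logits, replayed_choice=None):
    """Use the recorded expert if one is supplied, else recompute it."""
    if replayed_choice is not None:
        return replayed_choice
    return top1_expert(logits)


# With replay, training follows the path the sampler actually took.
replayed = route(training_logits, replayed_choice=inference_choice)
```

Without replay, the trainer would compute gradients through a different expert than the one that produced the sampled tokens, which is the amplification the paragraph above describes.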

Pre-training on internet text data is hitting a wall. The next major advancements will come from reinforcement learning (RL), where models learn by interacting with simulated environments (like games or fake e-commerce sites). This post-training phase is in its infancy but will soon consume the majority of compute.

RL is compute-intensive relative to the amount of signal it extracts, but that trade is also its core economic advantage. It lets labs spend cheap, abundant compute in place of expensive, scarce human expertise. RL effectively amplifies the value of small, high-quality human-generated datasets, which is crucial when expertise is the bottleneck.

AI labs like Anthropic find that mid-tier models can be trained with reinforcement learning to outperform their largest, most expensive models in just a few months, accelerating the pace of capability improvements.

Moonshot overcame the tendency of LLMs to default to sequential reasoning, a problem they call "serial collapse", by using Parallel Agent Reinforcement Learning (PARL). They forced an orchestrator model to learn parallelization by giving it time and compute budgets that were impossible to meet sequentially, so the only way to succeed was to delegate tasks to parallel workers.
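The budget pressure can be illustrated with simple scheduling arithmetic: if the budget sits below the sum of subtask durations but above what parallel workers can achieve, only a delegating orchestrator can finish in time. This is an illustration of the incentive, not Moonshot's actual reward or scheduler; the durations, worker count, and budget are made-up numbers.

```python
import heapq


def sequential_time(durations):
    """Time to run every subtask one after another."""
    return sum(durations)


def parallel_time(durations, workers):
    """Makespan when subtasks are handed greedily to the least-loaded worker."""
    loads = [0.0] * workers
    heapq.heapify(loads)
    for d in sorted(durations, reverse=True):  # longest subtasks first
        lightest = heapq.heappop(loads)
        heapq.heappush(loads, lightest + d)
    return max(loads)


subtasks = [4.0, 3.0, 3.0, 2.0]  # hypothetical subtask durations
budget = 8.0                     # hypothetical time budget

serial = sequential_time(subtasks)             # 12.0, over budget
delegated = parallel_time(subtasks, workers=2)  # 6.0, within budget

meets_budget_serially = serial <= budget
meets_budget_in_parallel = delegated <= budget
```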

Unlike pre-training's simpler data pipeline, RL involves many "moving parts" because each task can have a unique grading setup and infrastructure. This complexity, not just the algorithm itself, is the primary challenge for researchers managing live training runs at scale.

To enable long-horizon tasks, Cursor incorporates "self-summarization" directly into its RL loop. The model learns to compact its own history and restart its context window with the summary. This allows it to operate over millions of tokens despite a nominal 200k context limit.
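A rough sketch of the compaction loop: when the next step would overflow the context window, the history is replaced by a summary and the run continues. In the setup described above the model itself writes the summary and is trained on the result; here `summarize` is a hypothetical stub, and tokens are modeled as list items.

```python
def run_with_compaction(steps, context_limit, summarize):
    """Process steps in order, compacting the context before it overflows.

    Assumes each summary plus one step fits within `context_limit`.
    Returns the final context, the total tokens processed, and the number
    of compactions performed.
    """
    context = []
    total_seen = 0
    compactions = 0
    for step_tokens in steps:
        if len(context) + len(step_tokens) > context_limit:
            context = summarize(context)  # restart the window from a summary
            compactions += 1
        context.extend(step_tokens)
        total_seen += len(step_tokens)
    return context, total_seen, compactions


def stub_summarize(context):
    """Stand-in summarizer: a marker plus the two most recent tokens."""
    return ["<summary>"] + context[-2:]


# Ten steps of four tokens each, with a window of only ten tokens.
steps = [["tok"] * 4 for _ in range(10)]
final_context, total_seen, compactions = run_with_compaction(
    steps, context_limit=10, summarize=stub_summarize
)
```

The run processes four times more tokens than the window holds, which is the same effect, at toy scale, as operating over millions of tokens with a 200k limit.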

Pre-training requires constant, high-bandwidth weight synchronization, which makes it difficult to spread across data centers. Newer Reinforcement Learning (RL) methods, by contrast, spend most of their time on local forward passes that generate data and send back only small amounts of verified data, which makes distributed training more practical.

Asynchronous RL Sacrifices Algorithmic Purity for Massive GPU Utilization Gains | RiffOn