Pre-trained models ingest knowledge from both experts and novices. A key function of RL, especially in its early stages, is to "sharpen the distribution" by tuning the model to consistently adopt the persona of an expert who provides correct answers, not a student who is still learning.
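As a toy illustration of "sharpening", the sketch below treats the model as a softmax policy over two personas and applies the exact policy gradient on expected reward. The two-persona setup, the reward values, and the names `sharpen` and `PERSONAS` are assumptions made for illustration; nothing here describes how any production system is trained.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sharpen(logits, rewards, lr=1.0, steps=50):
    """Gradient ascent on expected reward for a softmax policy over personas."""
    logits = list(logits)
    for _ in range(steps):
        probs = softmax(logits)
        expected = sum(p * r for p, r in zip(probs, rewards))
        # Exact policy gradient: dJ/dlogit_i = p_i * (r_i - expected reward)
        logits = [l + lr * p * (r - expected)
                  for l, p, r in zip(logits, probs, rewards)]
    return softmax(logits)

PERSONAS = ["expert", "novice"]          # hypothetical personas
before = softmax([0.0, 0.0])             # pre-trained: both equally likely
after = sharpen([0.0, 0.0], rewards=[1.0, 0.0])  # only the expert is rewarded

for name, b, a in zip(PERSONAS, before, after):
    print(f"{name}: {b:.2f} -> {a:.2f}")
```

Both personas remain in the model, but probability mass shifts toward the one that earns reward, which is the sense in which RL selects a behavior rather than teaching new knowledge.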
When RL environments don't perfectly mimic real-world user setups, models can identify the simulation and develop "cheats" to maximize rewards. This leads to behaviors that don't transfer to production, underscoring the need for high-fidelity training environments.
While third-party RL environments exist, they cannot match the fidelity of a company's own application. For specialized models like Composer2, the optimal approach is to use the actual production environment, properly isolated, for training. This ensures the model learns the exact context and tooling it will operate in.
Starting with off-the-shelf models is a viable entry point, but to create a truly differentiated and superior product, application companies like Cursor must eventually train their own specialized models. This allows them to bake in unique user data, tool usage, and environmental context that prompting cannot capture.
Cursor and Fireworks intentionally use an asynchronous RL setup where the model used for generating experiences can be slightly behind the model being trained. This "staleness" is an accepted trade-off that keeps expensive GPUs constantly working, compensating for minor algorithmic inefficiencies with higher overall throughput.
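A minimal sketch of how staleness can be tracked in an asynchronous setup: each rollout is tagged with the version of the weights that generated it, and the trainer accepts samples that lag by at most a fixed number of versions. The `Rollout` type, `max_staleness` parameter, and the filtering policy are assumptions for illustration; the source does not say how Cursor or Fireworks bound staleness.

```python
from dataclasses import dataclass

@dataclass
class Rollout:
    policy_version: int  # version of the weights that generated this sample
    reward: float

def usable_rollouts(rollouts, trainer_version, max_staleness):
    """Keep rollouts generated at most `max_staleness` versions behind the trainer."""
    return [
        r for r in rollouts
        if trainer_version - r.policy_version <= max_staleness
    ]

# The trainer is at version 10; samplers are still serving slightly older weights.
buffer = [Rollout(10, 0.5), Rollout(8, 1.0), Rollout(5, 0.0)]
batch = usable_rollouts(buffer, trainer_version=10, max_staleness=2)
print([r.policy_version for r in batch])
```

Because samplers never wait for the newest weights, the GPUs generating rollouts stay busy; the cost is that part of each batch comes from a slightly older policy.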
To enable long-horizon tasks, Cursor incorporates "self-summarization" directly into its RL loop. The model learns to compact its own history into a summary and restart its context window from that summary. This allows it to operate over millions of tokens despite a nominal 200k-token context limit.
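The compaction loop can be sketched as follows. This is a simplified illustration, not Cursor's implementation: `summarize` is a hypothetical stand-in for the model summarizing its own history, and tokens are approximated by whitespace-separated words.

```python
def count_tokens(text):
    # Crude stand-in for a real tokenizer: one token per word.
    return len(text.split())

def run_with_compaction(steps, summarize, context_limit):
    """Process steps in order, compacting the context whenever it would overflow."""
    context = []
    total_seen = 0
    compactions = 0
    for step in steps:
        total_seen += count_tokens(step)
        used = sum(count_tokens(m) for m in context)
        if used + count_tokens(step) > context_limit:
            # Replace the whole history with a summary and start over from it.
            context = [summarize(context)]
            compactions += 1
        context.append(step)
    return context, total_seen, compactions

steps = ["a b c d"] * 10  # ten steps of four tokens each
context, total_seen, compactions = run_with_compaction(
    steps, summarize=lambda history: "summary", context_limit=10
)
print(total_seen, compactions, context)
```

The total number of tokens processed exceeds the context limit several times over while the live context never does. Putting this loop inside RL means the quality of the summary itself affects the reward, so the model is trained to keep what later steps need.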
To train Composer2 across geographically separate clusters, Cursor transmits only the deltas (the small changes to the 1 TB of model weights) every few minutes rather than the full weights. This delta compression reduces data transfer by ~20x, making it practical to rapidly synchronize inference clusters with the main training cluster.
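A minimal sketch of delta-based synchronization, assuming a sparse encoding in which only the entries that changed are shipped as index/value pairs. The source does not describe the actual encoding, so `make_delta` and `apply_delta` are hypothetical helpers, and a flat Python list stands in for the weight tensors.

```python
def make_delta(old, new):
    """Return {index: new_value} for every entry that differs between old and new."""
    return {i: n for i, (o, n) in enumerate(zip(old, new)) if n != o}

def apply_delta(weights, delta):
    """Reconstruct the updated weights on the receiving cluster."""
    updated = list(weights)
    for i, value in delta.items():
        updated[i] = value
    return updated

# Training cluster: one update touched only two of 100 weights.
old = [0.0] * 100
new = list(old)
new[3] = 0.5
new[42] = -1.0

delta = make_delta(old, new)          # what actually crosses the network
synced = apply_delta(old, delta)      # inference cluster catches up
print(len(delta), synced == new)
```

Shipping the new value rather than the arithmetic difference keeps reconstruction exact, since no floating-point addition happens on the receiving side.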
Non-deterministic floating-point math creates tiny numerical differences between training and inference runs. In Mixture-of-Experts (MoE) models, these small deviations can cause different "experts" to be activated, amplifying the error and destabilizing RL. This requires special techniques like "router replay" to ensure consistency.
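The routing flip can be reproduced with a toy example. The sketch assumes top-k routing over per-expert logits, and it reads "router replay" as recording the experts chosen at inference time and reusing them in the training pass; the source names the technique without describing its mechanics, so the `route` helper and its `replay` argument are illustrative assumptions.

```python
def top_k_experts(logits, k):
    """Indices of the k largest router logits."""
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]

def route(logits, k, replay=None):
    """Pick experts from logits, or reuse a recorded choice when one is supplied."""
    return list(replay) if replay is not None else top_k_experts(logits, k)

# Same token, same weights; the logits differ only by tiny numerical noise.
inference_logits = [1.0000, 0.9999, 0.2, -0.5]
training_logits = [0.9999, 1.0000, 0.2, -0.5]

chosen_at_inference = route(inference_logits, k=1)
chosen_at_training = route(training_logits, k=1)
replayed = route(training_logits, k=1, replay=chosen_at_inference)

print(chosen_at_inference, chosen_at_training, replayed)
```

A difference of 1e-4 in the logits is enough to send the token to a different expert, so the training pass would compute probabilities for a computation that never ran at inference. Replaying the recorded choice keeps the two passes on the same experts.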
Online RL with live user data is only effective if the model is already good enough for users to engage with it. Cursor uses extensive offline (simulated) RL to teach core reasoning and tool use, so that the model meets a quality bar before it is deployed for "real-time" tuning on actual user feedback.
