To manage the overwhelming pace of AI advancements, the Minimax team built an internal AI agent. The tool automatically tracks new articles, papers, and blog posts, then triages, summarizes, and analyzes them. This "internal researcher" filters the information firehose for the human team.
Instead of a single "think then act" cycle, Minimax trains its M2 model to repeatedly pause and rethink after receiving feedback from the environment. This iterative "interleaved thinking" approach improves robustness and performance on long-horizon tasks where tool responses or conditions are unpredictable.
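The loop described above can be sketched in a few lines. This is an illustrative toy, not MiniMax's actual API: the point is that the model produces a fresh reasoning step after every tool response, rather than committing to one upfront plan. All names (`run_agent`, `toy_model`, the message schema) are hypothetical.

```python
def run_agent(task, model, tools, max_steps=10):
    """Interleaved-thinking loop: think -> act -> observe -> re-think."""
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        # The model emits a new reasoning trace plus its next action,
        # conditioned on everything observed so far.
        step = model(history)  # -> {"thought": ..., "action": ..., "final": ...}
        history.append({"role": "assistant", "content": step})
        if step.get("final") is not None:
            return step["final"]  # task finished
        # Execute the chosen tool and feed its (possibly surprising) result
        # back, so the next thinking step can react to it.
        result = tools[step["action"]["name"]](**step["action"]["args"])
        history.append({"role": "tool", "content": result})
    return None  # gave up after max_steps

# Toy demo: a stub "model" that looks a value up, then finishes.
def toy_model(history):
    last = history[-1]
    if last["role"] == "tool":
        return {"thought": "got the value, done", "action": None,
                "final": last["content"]}
    return {"thought": "need the value first",
            "action": {"name": "lookup", "args": {"key": "x"}},
            "final": None}

answer = run_agent("what is x?", toy_model, {"lookup": lambda key: {"x": 42}[key]})
```

The key design point is that `model(history)` is called again after every tool result, so an unexpected response changes the plan instead of derailing it.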
Minimax discovered that robust AI agent generalization comes from systematically varying the model's entire operational environment—including system prompts, chat templates, and tool responses—not just from increasing the number of tools it's trained on. A dedicated perturbation pipeline injects this variation during training.
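A perturbation pipeline of this kind can be sketched as a sampler that draws a different environment configuration per training episode. Everything below is an assumption for illustration (the prompt strings, template names, and formats are invented), but it shows the mechanism: the same tool output gets rendered under randomly varied prompts, templates, and response formats so the agent can't overfit to one surface form.

```python
import json
import random

# Illustrative pools of environment variants (hypothetical contents).
SYSTEM_PROMPTS = [
    "You are a coding assistant.",
    "You are an autonomous software agent. Use tools when needed.",
]
CHAT_TEMPLATES = ["chatml", "plain"]
TOOL_FORMATS = ["json", "plain"]

def format_tool_response(payload, fmt):
    """Render the same tool output in a randomly assigned format."""
    if fmt == "json":
        return json.dumps(payload)
    return "\n".join(f"{k}: {v}" for k, v in payload.items())

def sample_environment(seed=None):
    """Draw one perturbed environment for a training episode."""
    rng = random.Random(seed)
    fmt = rng.choice(TOOL_FORMATS)
    return {
        "system_prompt": rng.choice(SYSTEM_PROMPTS),
        "chat_template": rng.choice(CHAT_TEMPLATES),
        "format_tool_response": lambda payload: format_tool_response(payload, fmt),
    }

env = sample_environment(seed=0)
rendered = env["format_tool_response"]({"exit_code": 0})
```

Seeding per episode keeps each rollout reproducible while still covering the full space of variants across the training run.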
A Minimax researcher explains that unlike academia, work at the industry's frontier involves problems so new that no literature exists. The job shifts from applying existing papers to deep, fundamental, first-principles thinking to find novel solutions for entirely unsolved challenges.
Minimax enhances its reinforcement learning process by treating its own expert developers as scalable reward models. These developers participate directly in the training cycle, identifying desirable behaviors and providing precise feedback on complex coding tasks, which creates a model tailored to professional workflows.
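One minimal way to picture "developers as reward models" is a reward function that blends automated signals (e.g., test pass rate) with an expert's rating when one is available. This is a hedged sketch under my own assumptions, not MiniMax's actual reward design; the function name, weighting scheme, and score scale are all invented for illustration.

```python
def combined_reward(tests_passed, tests_total, expert_score=None, expert_weight=0.5):
    """Blend an automated signal with an optional expert developer rating.

    tests_passed / tests_total: automated check results for the rollout.
    expert_score: developer's rating in [0, 1], or None if unlabeled.
    """
    auto = tests_passed / tests_total
    if expert_score is None:
        return auto  # fall back to the automated signal alone
    # Interpolate between automated and human judgment; the human term
    # captures qualities tests miss (style, maintainability, intent).
    return (1 - expert_weight) * auto + expert_weight * expert_score

r = combined_reward(8, 10, expert_score=1.0)  # tests mostly pass, expert approves
```

The design choice worth noting: expert labels are sparse and expensive, so the function degrades gracefully to the automated signal when no human feedback exists for a rollout.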
Minimax builds both foundation models and user-facing applications in-house. This structure enables research and engineering teams to work side-by-side, getting direct feedback from internal developers to rapidly identify and address model weaknesses, ensuring models meet real-world needs.
A researcher from Minimax describes the volatile nature of training large models, where a single day can swing dramatically between highs and lows. They joke about having "ICU in the morning and then KTV at night," reflecting how promising results can suddenly turn into critical bugs, and vice versa.
While debugging stalled model accuracy, Minimax's team found that running the LM head in FP32 precision during reinforcement learning was critical. Lower precision opened a gap between the theoretical algorithm and its practical implementation that kept the model from improving, underscoring how much low-level engineering details matter.
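The idea can be illustrated with a small NumPy sketch (an assumption for clarity, not MiniMax's code): even when the trunk runs in half precision, the final logit matmul and log-softmax are upcast to FP32, so the log-probabilities consumed by the RL loss aren't distorted by low-precision rounding. NumPy's `float16` stands in here for the bf16 typically used in training.

```python
import numpy as np

def lm_head_fp32(hidden, weight):
    """Compute log-probs with the LM head upcast to FP32.

    hidden: [batch, seq, d_model] half-precision activations.
    weight: [vocab, d_model] half-precision LM-head weights.
    """
    # Upcast BEFORE the matmul so accumulation happens in FP32.
    logits = hidden.astype(np.float32) @ weight.astype(np.float32).T
    # Numerically stable log-softmax, also in FP32.
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

# Half-precision activations/weights, as a stand-in for a bf16 trunk.
h = np.random.randn(2, 4, 8).astype(np.float16)
w = np.random.randn(16, 8).astype(np.float16)
logp = lm_head_fp32(h, w)
```

The subtle failure mode the anecdote points at: if sampling and the loss compute log-probs at different precisions, the policy gradient is estimated against slightly wrong probabilities, which can silently flatten learning curves.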
