Static benchmarks are easily gamed. Dynamic environments like the game Diplomacy force models to negotiate, strategize, and even lie, offering a richer, more realistic evaluation of capability than narrow benchmarks for reasoning or coding.
The proliferation of AI leaderboards incentivizes companies to optimize models for specific benchmarks. This creates a risk of "acing the SATs," where models excel on tests but don't necessarily make progress on solving real-world problems. Optimizing to game metrics can diverge from creating genuine user value.
Public leaderboards like LM Arena are becoming unreliable proxies for model performance. Teams implicitly or explicitly game them by optimizing for specific test sets. The superior strategy is to focus on internal, proprietary evaluation metrics and use public benchmarks only as a final, confirmatory check, not as a primary development target.
When Good Star Labs streamed their AI Diplomacy game on Twitch, it attracted 50,000 viewers from the gaming community. Watching AIs make mistakes, betray allies, and strategize made the technology more relatable and less intimidating, helping to bridge the gap between AI experts and the general public.
Beyond supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF), reinforcement learning (RL) in simulated environments is the next evolution. These "playgrounds" teach models to handle messy, multi-step, real-world tasks where current models often fail catastrophically.
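As a rough illustration of what such a playground looks like, here is a minimal, gym-style environment for a hypothetical multi-step support task; the task, actions, and reward are assumptions made for the sketch, not details from the discussion.

```python
# Minimal sketch of an RL "playground" for multi-step tasks.
# The task, actions, and reward are hypothetical illustrations, not a
# description of any lab's actual environment.
from dataclasses import dataclass

@dataclass
class SupportTicketEnv:
    """Toy environment: the agent must gather enough facts before resolving."""
    max_steps: int = 10
    facts_needed: int = 3
    facts_gathered: int = 0
    steps: int = 0
    done: bool = False

    def reset(self) -> str:
        self.facts_gathered, self.steps, self.done = 0, 0, False
        return "Customer reports intermittent login failures."

    def step(self, action: str) -> tuple[str, float, bool]:
        """Returns (observation, reward, done). Reward arrives only at the end,
        which is what makes multi-step credit assignment hard."""
        self.steps += 1
        reward = 0.0
        if action.startswith("ask:"):
            self.facts_gathered += 1
            obs = f"Customer answers question {self.facts_gathered}."
        elif action.startswith("resolve:"):
            self.done = True
            # Resolving too early (a common failure mode) earns nothing.
            reward = 1.0 if self.facts_gathered >= self.facts_needed else 0.0
            obs = "Ticket closed."
        else:
            obs = "Unrecognized action."
        if self.steps >= self.max_steps:
            self.done = True
        return obs, reward, self.done
```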
Don't trust academic benchmarks. Labs often "hill climb" or game them for marketing purposes, which doesn't translate to real-world capability. Furthermore, many of these benchmarks contain incorrect answers and messy data, making them an unreliable measure of true AI advancement.
Good Star Labs is not a consumer gaming company. Their business model focuses on B2B services for AI labs. They use games like Diplomacy to evaluate new models, generate unique training data to fix model weaknesses, and collect human feedback, creating a powerful improvement loop for AI companies.
Good Star Labs' next game will be a subjective, 'Cards Against Humanity'-style experience. This is a strategic move away from objective games like Diplomacy to specifically target and create training data for a key LLM weakness: humor. The goal is to build an environment that improves a difficult, subjective skill.
As reinforcement learning (RL) techniques mature, the core challenge shifts from the algorithm to the problem definition. The competitive moat for AI companies will be their ability to create high-fidelity environments and benchmarks that accurately represent complex, real-world tasks, effectively teaching the AI what matters.
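A hedged sketch of that shift: as the optimization machinery becomes a commodity, the design work concentrates in the reward or rubric that encodes what the task actually values. The negotiation rubric below and its weights are invented purely for illustration.

```python
# Sketch: the differentiating work is defining "what matters" for a task,
# not the RL algorithm itself. This rubric is purely illustrative.

def negotiation_reward(transcript: list[str], outcome: dict) -> float:
    """Scores one episode of a negotiation-style task against a task rubric."""
    score = 0.0
    if outcome.get("deal_reached"):
        score += 1.0                                      # the task's primary objective
    score += 0.2 * outcome.get("allies_retained", 0)      # longer-horizon relationship value
    score -= 0.5 * outcome.get("commitments_broken", 0)   # penalize short-sighted betrayals
    score -= 0.01 * len(transcript)                       # mild pressure toward efficiency
    return score

# The training loop itself can be an off-the-shelf trainer pointed at an
# environment that uses this reward; no specific library API is assumed here.
```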
Good Star Labs found GPT-5's performance in their Diplomacy game skyrocketed with optimized prompts, moving it from the bottom to the top. This shows a model's inherent capability can be masked or revealed by its prompt, making "best model" a context-dependent title rather than an absolute one.
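As a concrete (and entirely hypothetical) version of that experiment, a small harness can score the same model under a baseline and an optimized prompt scaffold; `play_game` below is a placeholder for a real game engine and model API, and both prompts are invented.

```python
# Hypothetical harness: the same model evaluated under two prompt scaffolds.
import random
from statistics import mean

BASELINE_PROMPT = "You are playing Diplomacy. Win the game."
OPTIMIZED_PROMPT = (
    "You are playing Diplomacy. Each turn: summarize the board, list each "
    "power's likely goals, propose three candidate moves, then commit to one "
    "and draft messages that support it."
)

def play_game(model: str, system_prompt: str, seed: int) -> float:
    """Stand-in for a real harness; replace with calls into the game engine
    and model API. Returns a placeholder score (18 supply centers wins)."""
    rng = random.Random(seed + len(system_prompt))
    return rng.uniform(0, 18)

def compare_prompts(model: str, n_games: int = 20) -> dict[str, float]:
    """Runs the same model under both scaffolds and reports mean scores."""
    return {
        name: mean(play_game(model, prompt, seed=i) for i in range(n_games))
        for name, prompt in [("baseline", BASELINE_PROMPT), ("optimized", OPTIMIZED_PROMPT)]
    }

if __name__ == "__main__":
    print(compare_prompts("some-model"))
```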
To build robust social intelligence, AIs cannot be trained solely on positive examples of cooperation. Like pre-training an LLM on all of language, social AIs must be trained on the full manifold of game-theoretic situations—cooperation, competition, team formation, betrayal. This builds a foundational, generalizable model of social theory of mind.