Static benchmarks are easily gamed. Dynamic environments like the game Diplomacy force models to negotiate, strategize, and even lie, offering a richer, more realistic evaluation of their capabilities than scores on isolated skills like reasoning or coding.
Developing LLM applications means tuning three open-ended variables: how information is represented, which tools the model can access, and the prompt itself. Because each has an effectively unlimited range of options, the process is less like engineering and more like an art, where intuition guides you to a local maximum rather than a single optimal solution.
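To make the "local maximum" point concrete, here is a minimal sketch of a greedy search over the three variables. Everything in it is an assumption made for illustration: the `AppConfig` type, the candidate options, and especially `toy_score`, which stands in for whatever evaluation a real application would run. It changes one variable at a time and stops when no single change helps, which is roughly how intuition-led iteration tends to proceed.

```python
from dataclasses import dataclass, replace
from typing import Callable, Iterator

@dataclass(frozen=True)
class AppConfig:
    """One point in the design space (hypothetical names throughout)."""
    representation: str
    tools: frozenset
    prompt: str

# Tiny, made-up option lists; the real spaces are effectively unbounded.
REPRESENTATIONS = ["raw_text", "json", "summary"]
TOOLSETS = [frozenset(), frozenset({"search"}), frozenset({"search", "calculator"})]
PROMPTS = ["terse", "step_by_step", "role_play"]

def neighbors(cfg: AppConfig) -> Iterator[AppConfig]:
    """Yield every config that differs from cfg in exactly one variable."""
    for r in REPRESENTATIONS:
        if r != cfg.representation:
            yield replace(cfg, representation=r)
    for t in TOOLSETS:
        if t != cfg.tools:
            yield replace(cfg, tools=t)
    for p in PROMPTS:
        if p != cfg.prompt:
            yield replace(cfg, prompt=p)

def toy_score(cfg: AppConfig) -> float:
    """Placeholder objective, not a real metric. The interaction term
    rewards a combination that single-variable changes never reach."""
    score = 0.0
    if cfg.representation == "json":
        score += 1
    if cfg.prompt == "step_by_step":
        score += 1
    if cfg.representation == "summary" and cfg.prompt == "role_play":
        score += 3
    return score

def hill_climb(start: AppConfig, score: Callable[[AppConfig], float]) -> AppConfig:
    """Greedy ascent: move to the best neighbor until none improves."""
    current = start
    while True:
        best = max(neighbors(current), key=score)
        if score(best) <= score(current):
            return current
        current = best

start = AppConfig("raw_text", frozenset(), "terse")
result = hill_climb(start, toy_score)
```

With this toy objective the search settles on the `json` and `step_by_step` combination, even though the `summary` and `role_play` pairing scores higher. Reaching it would require changing two variables at once, which is the sense in which iteration finds a local rather than a global optimum.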
When Good Star Labs streamed their AI Diplomacy game on Twitch, it attracted 50,000 viewers from the gaming community. Watching AIs make mistakes, betray allies, and strategize made the technology more relatable and less intimidating, helping to bridge the gap between AI experts and the general public.
Good Star Labs' next game will be a subjective, 'Cards Against Humanity'-style experience. This is a deliberate move away from objective games like Diplomacy, meant to target a key LLM weakness, humor, and to generate training data for it. The goal is to build an environment that improves a difficult, subjective skill.
A Rice PhD showed that training a vision model on a game like Snake, while prompting it to see the game as a math problem (a Cartesian grid), improved its math abilities more than training on math data directly. This highlights how abstract, game-based training can foster more generalizable reasoning.
Good Star Labs found GPT-5's performance in their Diplomacy game skyrocketed with optimized prompts, moving it from the bottom to the top. This shows a model's inherent capability can be masked or revealed by its prompt, making "best model" a context-dependent title rather than an absolute one.
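One way to see why "best model" depends on context is to rank the same models separately under each prompt condition. The sketch below does that with made-up placeholder scores and generic model names. These are not Good Star Labs' results, and the `scores` layout is an assumption chosen for illustration.

```python
# Hypothetical scores per prompt condition (placeholder numbers only,
# not measurements from any real evaluation).
scores = {
    "baseline_prompt": {"model_a": 0.62, "model_b": 0.55, "model_c": 0.31},
    "optimized_prompt": {"model_a": 0.64, "model_b": 0.58, "model_c": 0.71},
}

def rank(by_model: dict) -> list:
    """Return model names ordered from highest to lowest score."""
    return sorted(by_model, key=by_model.get, reverse=True)

def rankings(all_scores: dict) -> dict:
    """Compute a separate leaderboard for each prompt condition."""
    return {condition: rank(by_model) for condition, by_model in all_scores.items()}

leaderboards = rankings(scores)
for condition, order in leaderboards.items():
    print(condition, order)
```

In this toy data `model_c` is last under the baseline prompt and first under the optimized one, so a leaderboard that reports only one condition would give a misleading picture of either.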
Good Star Labs is not a consumer gaming company. Its business model focuses on B2B services for AI labs. They use games like Diplomacy to evaluate new models, generate unique training data to fix model weaknesses, and collect human feedback, creating a powerful improvement loop for AI companies.
In a paradigm shift like AI, an experienced hire's knowledge can become obsolete. It's often better to hire a hungry junior employee. Their lack of preconceived notions, combined with a high learning velocity powered by AI tools, allows them to surpass seasoned professionals who must unlearn outdated workflows.
