In a real-world vending machine test, Grok was less emotional and easier to steer towards its business objective. It resisted giving discounts and was more focused on profitability than Anthropic's Claude, though this came at the cost of being less entertaining and personable.

Related Insights

When OpenAI deprecated GPT-4o, users revolted not over performance but over losing a model with a preferred "personality." The backlash forced its reinstatement, revealing that emotional attachment and character are critical, previously underestimated factors in AI product adoption and retention, separate from state-of-the-art capabilities.

While OpenAI and Google position their AIs as neutral tools (ChatGPT, Gemini), Anthropic is building a distinct brand by personifying its model as "Claude." This throwback to named assistants like Siri and Alexa creates a more personal user relationship, which could be a key differentiator in the consumer AI market.

Beyond standard benchmarks, Anthropic fine-tunes its models based on their "eagerness." An AI can be "too eager," over-delivering and making unwanted changes, or "too lazy," requiring constant prodding. Finding the right balance is a critical, non-obvious aspect of creating a useful and steerable AI assistant.

Andon Labs chose a vending machine to test AI autonomy because simple retail allows for partial success, creating a "smooth curve" for measurement. Unlike a task like blogging, where success is rare and binary, retail generates useful data even from mediocre performance, enabling clearer progress tracking for AI capabilities.
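A minimal sketch of that measurement idea (the numbers and metric names are invented for illustration, not Andon Labs' actual scoring): a binary pass/fail metric collapses most runs to zero, hiding progress between failures, while a continuous metric like net profit still ranks mediocre runs against one another.

```python
# Hypothetical illustration of binary vs. continuous evaluation signals.
# The profit figures are made up; this is not Andon Labs' real data.

daily_profits = [-4.0, -1.5, 0.5, 2.0, 3.5]  # simulated agent runs, in dollars

# Binary metric ("did the agent turn a profit?"): early runs all score 0,
# so improvement from -4.0 to -1.5 is invisible.
binary_scores = [1 if p > 0 else 0 for p in daily_profits]

# Continuous metric (net profit): mediocre runs still differ from bad ones,
# giving the "smooth curve" that makes progress trackable.
continuous_scores = daily_profits

print(binary_scores)      # [0, 0, 1, 1, 1]
print(continuous_scores)  # [-4.0, -1.5, 0.5, 2.0, 3.5]
```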

While AI labs tout performance on standardized tests like math olympiads, these metrics often don't correlate with real-world usefulness or qualitative user experience. Users may prefer a model like Anthropic's Claude for its conversational style, a factor not measured by benchmarks.

OpenAI's GPT-5.1 update focuses heavily on making the model "warmer," more empathetic, and more conversational. This strategic emphasis on tone and personality signals that the competitive frontier for AI assistants is shifting from pure technical prowess to the quality of the user's emotional and conversational experience.

As models mature, their core differentiator will become their underlying personality and values, shaped by their creators' objective functions. One model might optimize for user productivity by being concise, while another optimizes for engagement by being verbose.

A key design difference separates leading chatbots. ChatGPT consistently ends responses with prompts for further interaction, an engagement-maximizing strategy. In contrast, Claude may challenge a user's line of questioning or even end a conversation if it deems it unproductive, reflecting an alternative optimization metric centered on user well-being.

A strong aversion to ChatGPT's overly complimentary and obsequious tone suggests a segment of users desires functional, neutral AI interaction. This highlights a need for customizable AI personas that cater to users who prefer a tool-like experience over a simulated, fawning personality.

A key advancement in Claude Sonnet 4.5 is its work style. Unlike past models, whose "grand ambitions" led them to meander, it pragmatically breaks large projects into small, manageable chunks. This methodical approach feels more like working with a human colleague, making it more reliable for complex tasks.