The host's personal "vibe check" rankings of AI models were the inverse of the scores from an automated, LLM-judged benchmark. This highlights the gap between quantitative metrics and subjective human taste, suggesting that relying solely on AI judges misses crucial aspects of quality and real-world usability.
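One way to put a number on "inverse" is a rank correlation between the two leaderboards. The sketch below is illustrative only: the model names and rank positions are hypothetical placeholders, not the host's actual results, and it assumes both rankings cover the same models with no ties.

```python
def spearman_rho(rank_a, rank_b):
    """Spearman rank correlation for two rankings of the same items.

    Assumes each ranking maps item -> rank position (1 = best) with no ties.
    Returns 1.0 for identical rankings and -1.0 for exactly reversed ones.
    """
    if set(rank_a) != set(rank_b):
        raise ValueError("both rankings must cover the same items")
    n = len(rank_a)
    if n < 2:
        raise ValueError("need at least two items to compare rankings")
    # Sum of squared differences between the two rank positions of each item.
    d_squared = sum((rank_a[item] - rank_b[item]) ** 2 for item in rank_a)
    return 1 - (6 * d_squared) / (n * (n ** 2 - 1))


# Hypothetical placeholder rankings (NOT data from the episode).
human_rank = {"model_a": 1, "model_b": 2, "model_c": 3, "model_d": 4}
judge_rank = {"model_a": 4, "model_b": 3, "model_c": 2, "model_d": 1}

rho = spearman_rho(human_rank, judge_rank)
print(f"Spearman rho between human and judge rankings: {rho:+.2f}")
```

A value near -1 would indicate the kind of full reversal described above, while a value near 0 would mean the two rankings are simply unrelated.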
Instead of relying on generic public benchmarks, the host used Claude Code to create a personalized evaluation suite tailored to his specific workflows. This meta-use of AI (building tools to test other AIs) allows for more relevant and repeatable model comparisons that reflect real-world use cases.
LLMs used to judge other models' output consistently score toward the middle of the scale, much like a human giving a generic "7 out of 10." These AI judges are not "spiky" enough: they fail to recognize the unique or exceptional qualities that a human evaluator with strong taste would identify.
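A simple check for this clustering is to compare how widely each rater's scores are spread. This is a minimal sketch under stated assumptions: the score lists are hypothetical placeholders on a 1 to 10 scale, and standard deviation and range are just two possible measures of "spikiness."

```python
from statistics import mean, pstdev


def score_spread(scores):
    """Summarize how widely a rater's scores are spread out."""
    return {
        "mean": mean(scores),
        "stdev": pstdev(scores),  # population standard deviation
        "range": max(scores) - min(scores),
    }


# Hypothetical placeholder scores (NOT data from the episode).
judge_scores = [7, 7, 6, 7, 8, 7]   # an LLM judge hugging the middle
human_scores = [3, 9, 5, 10, 2, 8]  # a human rater with strong opinions

judge_summary = score_spread(judge_scores)
human_summary = score_spread(human_scores)

print("judge:", judge_summary)
print("human:", human_summary)
```

A judge whose standard deviation is much smaller than the human's on the same outputs is compressing its scores, which makes it harder to tell strong outputs from merely acceptable ones.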
An "agentic bug tracking task" included in the benchmark proved to be a poor differentiator because all top frontier models performed well. This suggests that as models improve, standard coding challenges become table stakes, and more complex or novel benchmarks are needed to reveal meaningful performance differences.
The host demonstrated a power-user technique by instructing Claude Code to analyze his entire history of past sessions. Doing so lets the AI learn his work style and preferences and give more tailored, context-aware recommendations for new projects. In effect, the conversation history becomes a persistent knowledge base.
Despite being the focus of the review and positioned as a near-Opus level model, Sonnet 5 performed poorly in the host's final, human-weighted evaluation. The episode, intended to showcase the new model, ironically concluded with it at the bottom of the personal preference leaderboard, behind older models.
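The episode summary does not say how the human-weighted score was computed. As one plausible illustration, the sketch below blends a human score with an LLM-judge score using a linear weight. The `blended_score` function, the 0.7 weight, and all model names and scores are assumptions made for this example, not the host's method or results.

```python
def blended_score(human, judge, human_weight=0.7):
    """Linear blend of a human score and an LLM-judge score.

    human_weight is a hypothetical knob: 1.0 means only the human score
    counts, 0.0 means only the judge score counts.
    """
    if not 0 <= human_weight <= 1:
        raise ValueError("human_weight must be between 0 and 1")
    return human_weight * human + (1 - human_weight) * judge


# Hypothetical placeholder scores on a 1-10 scale (NOT data from the episode).
scores = {
    "model_a": {"human": 4.0, "judge": 9.0},  # judge favorite, human lukewarm
    "model_b": {"human": 9.0, "judge": 6.0},  # human favorite, judge lukewarm
}

# Rank models from best to worst by blended score.
leaderboard = sorted(
    scores,
    key=lambda model: blended_score(**scores[model]),
    reverse=True,
)
print(leaderboard)
```

With a high human weight, a model the judge rates well can still land at the bottom of the leaderboard, which is the pattern described above.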
