How I AI
Sonnet 5 review: I ran 64 generations to find out if it's worth it

How I AI · Jun 30, 2026

A deep dive into Anthropic's Sonnet 5 using a custom benchmark reveals surprising results, where human taste clashes with AI-judged metrics.

Human "Vibe Checks" Routinely Contradict Automated LLM Benchmark Scores

The host's personal "vibe check" rankings of AI models were the inverse of the scores from an automated, LLM-judged benchmark. This highlights the gap between quantitative metrics and subjective human taste, suggesting that relying solely on AI judges misses crucial aspects of quality and real-world usability.
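One way to put a number on "the inverse" is a rank correlation between the two leaderboards. The sketch below is a minimal illustration, not the host's method: the model names and scores are hypothetical placeholders, and the helper functions `ranks` and `spearman` are invented for this example. It assumes no tied scores.

```python
def ranks(scores):
    """Rank items by score, 1 = best (highest score). Assumes no ties."""
    order = sorted(scores, key=scores.get, reverse=True)
    return {name: i + 1 for i, name in enumerate(order)}


def spearman(a, b):
    """Spearman rank correlation between two score dicts with the same keys."""
    ra, rb = ranks(a), ranks(b)
    n = len(ra)
    d2 = sum((ra[k] - rb[k]) ** 2 for k in ra)
    return 1 - 6 * d2 / (n * (n * n - 1))


# Hypothetical placeholder data, NOT results from the episode.
judge_scores = {"model_a": 8.1, "model_b": 7.6, "model_c": 7.2, "model_d": 6.9}
human_scores = {"model_a": 2, "model_b": 5, "model_c": 7, "model_d": 9}

rho = spearman(judge_scores, human_scores)
print(rho)  # -1.0 means the two rankings are exact opposites
```

A value near +1 would mean the LLM judge and the human agree on ordering; a value near -1 matches the fully inverted outcome described above.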

Use Coding Assistants Like Claude Code to Build Custom AI Model Benchmarks

Instead of relying on generic public benchmarks, the host used Claude Code to create a personalized evaluation suite tailored to the host's own workflows. This meta-use of AI, building tools to test other AIs, allows for more relevant and repeatable model comparisons that reflect real-world use cases.
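As a rough idea of what a personalized evaluation suite can look like, here is a minimal sketch. Everything in it is an assumption for illustration: the `Task` type, the `run_suite` function, the stub models, and the single task are hypothetical and do not describe the suite built in the episode. Real model calls would replace the stub functions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Task:
    name: str
    prompt: str
    # Maps a model's raw output to a score in [0, 1].
    score: Callable[[str], float]


def run_suite(
    models: Dict[str, Callable[[str], str]], tasks: List[Task]
) -> Dict[str, Dict[str, float]]:
    """Run every task against every model and collect per-task scores."""
    results = {}
    for model_name, generate in models.items():
        results[model_name] = {t.name: t.score(generate(t.prompt)) for t in tasks}
    return results


# Stub "models" standing in for real API calls (hypothetical).
def echo_model(prompt: str) -> str:
    return prompt


def silent_model(prompt: str) -> str:
    return ""


tasks = [
    Task(
        name="mentions_changelog",
        prompt="Write a changelog entry for v2.",
        score=lambda out: 1.0 if "changelog" in out.lower() else 0.0,
    )
]

results = run_suite({"echo": echo_model, "silent": silent_model}, tasks)
print(results)
```

Keeping tasks as plain data and models as plain callables makes the same suite easy to rerun when a new model ships, which is what makes the comparison repeatable.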

LLMs Used as Evaluators Tend to Be Overly Generous and Lack Nuanced Taste

LLMs used to judge other models' output consistently score toward the middle of the scale, much as a human might give a generic "7 out of 10." These AI judges are not "spiky" enough: they fail to recognize the unique or exceptional qualities that a human evaluator with strong taste would identify.
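A simple check for this kind of score compression is to compare how much of the rating scale each evaluator actually uses. The sketch below assumes a 1 to 10 scale; the `spread_report` helper and both score lists are hypothetical placeholders, not data from the episode.

```python
from statistics import mean, pstdev


def spread_report(scores, scale_min=1, scale_max=10):
    """Summarize how widely an evaluator's scores are spread over the scale."""
    lo, hi = min(scores), max(scores)
    return {
        "mean": mean(scores),
        "stdev": pstdev(scores),
        # Fraction of the available scale covered by the observed scores.
        "range_used": (hi - lo) / (scale_max - scale_min),
    }


# Hypothetical placeholder ratings for the same eight outputs.
judge = [7, 7, 8, 7, 6, 7, 7, 8]   # clustered around "7 out of 10"
human = [3, 9, 5, 10, 2, 8, 4, 9]  # "spiky": strong likes and dislikes

judge_report = spread_report(judge)
human_report = spread_report(human)
print(judge_report)
print(human_report)
```

A judge whose `range_used` and `stdev` are much smaller than a human rater's on the same outputs is showing the middle-of-the-curve behavior described above.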

Standard Agentic Coding Tasks No Longer Differentiate Top AI Models

An "agentic bug tracking task" included in the benchmark proved to be a poor differentiator because all top frontier models performed well. This suggests that as models improve, standard coding challenges become table stakes, requiring more complex or novel benchmarks to reveal meaningful performance differences.
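One way to spot a task that no longer differentiates models is to look at the gap between the best and worst score on that task. The sketch below is illustrative only: the `discriminating_tasks` helper, the `min_gap` threshold, the task names, and all scores are assumptions, not figures from the episode.

```python
def discriminating_tasks(results, min_gap=0.1):
    """Return tasks whose best-to-worst score gap across models is at least min_gap.

    `results` maps model name -> {task name -> score in [0, 1]}.
    """
    task_names = next(iter(results.values())).keys()
    gaps = {}
    for task in task_names:
        values = [results[model][task] for model in results]
        gaps[task] = max(values) - min(values)
    return {task: gap for task, gap in gaps.items() if gap >= min_gap}


# Hypothetical placeholder scores.
scores = {
    "model_a": {"agentic_bug_task": 0.95, "writing_taste": 0.4},
    "model_b": {"agentic_bug_task": 0.97, "writing_taste": 0.7},
    "model_c": {"agentic_bug_task": 0.96, "writing_taste": 0.9},
}

useful = discriminating_tasks(scores)
print(useful)  # only tasks where models meaningfully differ remain
```

Tasks filtered out by this check are candidates for replacement with harder or more novel ones.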

Leverage Past AI Chat Sessions as Persistent Context for Future Work

The host demonstrated a power-user technique: instructing Claude Code to analyze the entire history of past sessions. This lets the AI learn the host's work style and preferences and provide more tailored, context-aware recommendations for new projects. In effect, the conversation history becomes a persistent knowledge base.

Anthropic's New Sonnet 5 Ranked Last in a Human-Weighted Evaluation

Despite being the focus of the review and positioned as a near-Opus-level model, Sonnet 5 performed poorly in the host's final, human-weighted evaluation. The episode, intended to showcase the new model, ironically ended with it at the bottom of the personal preference leaderboard, behind older models.
