Get your free personalized podcast brief

We scan new podcasts and send you the top 5 insights daily.

Current benchmarks focus on whether code passes tests. The future of AI evaluation must assess qualitative, human-centric aspects like 'design taste,' code maintainability, and alignment with a team's specific coding style. These are hard to measure automatically and signal a shift toward more complex, human-in-the-loop or LLM-judged evaluation frameworks.

Related Insights

Once AI coding agents reach a high performance level, objective benchmarks become less important than a developer's subjective experience. Like a warrior choosing a sword, the best tool is often the one that has the right "feel," writes code in a preferred style, and integrates seamlessly into a human workflow.

AI excels where success is quantifiable (e.g., code generation). Its greatest challenge lies in subjective domains like mental health or education. Progress requires a messy, societal conversation to define 'success,' not just a developer-built technical leaderboard.

Using AI to code doesn't mean sacrificing craftsmanship. It shifts the craftsman's role from writing every line to being a director with a strong vision. The key is measuring the AI's output against that vision and ensuring each piece fits the larger puzzle correctly, not just functionally.

OpenAI's evals team is looking beyond current benchmarks that test self-contained, hour-long tasks. They are calling for new evaluations that measure performance on problems that would take top engineers weeks or months to solve, such as creating entire products end-to-end. This signals a major increase in the complexity and ambition expected from future AI benchmarks.

Developers fall into the "agentic trap" by building complex, fully-automated AI coding systems. These systems fail to create good products because they lack human taste and the iterative feedback loop where a creator's vision evolves through interaction with the software being built.

Simple, function-level evals are a "local optimization." Blitzy evaluates system changes by tasking them with completing large, real-world projects (e.g., modifying Apache Spark) and assessing the percentage of completion. This requires human "taste" to judge the gap between functional correctness and true user intent.

While AI labs tout performance on standardized tests like math olympiads, these metrics often don't correlate with real-world usefulness or qualitative user experience. Users may prefer a model like Anthropic's Claude for its conversational style, a factor not measured by benchmarks.

A one-size-fits-all evaluation method is inefficient. Use simple code for deterministic checks like word count. Leverage an LLM-as-a-judge for subjective qualities like tone. Reserve costly human evaluation for ambiguous cases flagged by the LLM or for validating new features.

The best AI models are trained on data that reflects deep, subjective qualities—not just simple criteria. This "taste" is a key differentiator, influencing everything from code generation to creative writing, and is shaped by the values of the frontier lab.

As AI generates more code, the core engineering task evolves from writing to reviewing. Developers will spend significantly more time evaluating AI-generated code for correctness, style, and reliability, fundamentally changing daily workflows and skill requirements.

The Next Frontier for Coding AI is Measuring Subjective 'Design Taste,' Not Just Functionality | RiffOn