Creating AI that can reliably judge aesthetics is a frontier problem. Unlike tasks with clear right or wrong answers, aesthetics is subjective. Without a clear, objective benchmark, standard model-improvement techniques are hard to apply, which makes the problem a better fit for Reinforcement Learning from Human Feedback (RLHF).

Related Insights

AI excels where success is quantifiable (e.g., code generation). Its greatest challenge lies in subjective domains like mental health or education. Progress requires a messy, societal conversation to define 'success,' not just a developer-built technical leaderboard.

An attempt to teach AI 'scientific taste' using RLHF on hypotheses failed because human raters prioritized superficial qualities like tone and feasibility over a hypothesis's potential world-changing impact. This suggests a need for feedback tied to downstream outcomes, not just human preference.

Once models reach human-level performance via supervised learning, they hit a ceiling. The next step to achieve superhuman capabilities is moving to a Reinforcement Learning from Human Feedback (RLHF) paradigm, where humans provide preference rankings ("this is better") rather than creating ground-truth labels from scratch.
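
A minimal sketch of what that preference signal looks like in training, assuming a pairwise Bradley-Terry objective and a toy PyTorch reward model (the architecture, feature dimension, and random data are all illustrative, not from the source):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy reward model: maps an output's feature vector to a scalar
# "how good is this" score. Architecture and feature_dim are illustrative.
feature_dim = 16
reward_model = nn.Sequential(
    nn.Linear(feature_dim, 32),
    nn.ReLU(),
    nn.Linear(32, 1),
)

def preference_loss(chosen: torch.Tensor, rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry loss: negative log-probability that the human-preferred
    output scores higher than the rejected one."""
    margin = reward_model(chosen) - reward_model(rejected)
    return -F.logsigmoid(margin).mean()

# Random features stand in for embeddings of 8 "A is better than B" pairs.
chosen, rejected = torch.randn(8, feature_dim), torch.randn(8, feature_dim)

optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)
optimizer.zero_grad()
loss = preference_loss(chosen, rejected)
loss.backward()
optimizer.step()
print(f"pairwise preference loss: {loss.item():.3f}")
```

Note that the humans never assign absolute scores here; the model infers a consistent scalar ranking purely from which output was preferred.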

To teach AI subjective skills like poetry, a group of experts with some disagreement is better than one with full consensus. This approach captures diverse tastes and edge cases, which is more valuable for creating a robust model than achieving perfect agreement.
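
One hypothetical way to preserve that disagreement rather than average it away: keep the full distribution of expert ratings as a soft training target instead of collapsing to a single consensus label (the rating scale and poem IDs below are invented for illustration):

```python
from collections import Counter

# Hypothetical sketch: retain the spread of expert ratings as a soft label,
# so disagreement about taste survives into the training data.
ratings = {
    "poem_001": ["good", "good", "great", "ok"],
    "poem_002": ["great", "ok", "ok", "bad"],
}

def soft_label(votes: list[str]) -> dict[str, float]:
    """Turn raw expert votes into a probability distribution over labels."""
    counts = Counter(votes)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

for poem_id, votes in ratings.items():
    print(poem_id, soft_label(votes))
# poem_001 {'good': 0.5, 'great': 0.25, 'ok': 0.25}
```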

While AI labs tout performance on standardized tests like math olympiads, these metrics often don't correlate with real-world usefulness or qualitative user experience. Users may prefer a model like Anthropic's Claude for its conversational style, a factor not measured by benchmarks.

The best AI models are trained on data that reflects deep, subjective qualities—not just simple criteria. This "taste" is a key differentiator, influencing everything from code generation to creative writing, and is shaped by the values of the frontier lab.

Despite AI's ability to generate functional code, replicating the nuanced, subjective quality of a specific designer's "taste" remains extremely difficult. Felix Lee, after spending weeks attempting to codify his own taste into an AI model with little success, notes it's a significant unsolved challenge.

Current benchmarks focus on whether code passes tests. The future of AI evaluation must assess qualitative, human-centric aspects like 'design taste,' code maintainability, and alignment with a team's specific coding style. These are hard to measure automatically and signal a shift toward more complex, human-in-the-loop or LLM-judged evaluation frameworks.
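
A hedged sketch of what such an LLM-judge harness might look like: the scoring scale and prompt wording are assumptions, and `call_llm` is a stand-in for whatever completion API the evaluation framework wraps.

```python
# Sketch of an LLM-judge harness for qualities automated tests miss.
JUDGE_PROMPT = """You are reviewing a code change for qualities unit tests don't cover.
Rate each dimension from 1 (poor) to 5 (excellent), with a one-line justification:
- design taste
- maintainability
- alignment with the team's style guide below

Style guide:
{style_guide}

Code:
{code}
"""

def judge_code(code: str, style_guide: str, call_llm) -> str:
    """Ask a judge model to score subjective qualities of a code change."""
    return call_llm(JUDGE_PROMPT.format(style_guide=style_guide, code=code))

# Wiring check with a stub judge in place of a real model call:
print(judge_code("def add(a, b): return a + b",
                 "Prefer descriptive names over brevity.",
                 lambda prompt: "design taste: 3 -- terse but clear"))
```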

For tasks where a simple right/wrong answer doesn't exist, verification is a major challenge. The solution is creating detailed rubrics with thousands of criteria, often developed with AI help. This provides a granular reward signal that allows models to climb the learning curve even in highly subjective domains.
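
As a minimal sketch of how such a rubric becomes a reward signal: each criterion is a small checkable question, and the reward is the weighted fraction satisfied. The criteria and weights below are invented for illustration; the rubrics described in the source run to thousands of entries, often drafted with AI assistance.

```python
# Hypothetical rubric: (criterion, weight) pairs for a subjective domain.
rubric = [
    ("avoids cliché phrasing", 2.0),
    ("imagery is concrete", 1.5),
    ("consistent point of view", 1.0),
]

def rubric_reward(checks: dict[str, bool]) -> float:
    """Weighted fraction of rubric criteria the output satisfies, in [0, 1]."""
    total = sum(weight for _, weight in rubric)
    earned = sum(weight for name, weight in rubric if checks.get(name, False))
    return earned / total

print(rubric_reward({"avoids cliché phrasing": True,
                     "imagery is concrete": True,
                     "consistent point of view": False}))  # 3.5 / 4.5 ≈ 0.778
```

Because partial credit accumulates across many criteria, the reward changes smoothly as outputs improve, which is what lets a model climb the curve even when no single right answer exists.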

For subjective outputs like image aesthetics and face consistency, quantitative metrics are misleading. Google's team relies heavily on disciplined human evaluations, internal 'eyeballing,' and community testing to capture the subtle, emotional impact that benchmarks can't quantify.