Rather than optimizing solely for performance on standard industry benchmarks, Ideogram focuses on embedding a subjective quality of "taste" into its models. This requires human designers for evaluation, because the company believes current AI is poor at judging aesthetic nuances. Ideogram sees this approach as the source of its unique creative edge.

Related Insights

When every company has access to the same powerful AI tools, the competitive advantage is no longer budget or technology. The real differentiator becomes human taste, judgment, and the ability to apply a unique point of view to guide the AI, separating average, generic output from exceptional work.

Concepts like good taste or judgment aren't magical human traits but are a form of "embedded measurement" in our brains. This data, collected through unique, lived experiences (especially edge cases), is not yet digitized and thus remains a key differentiator from AI models trained on public data.

Creating AI that can reliably judge aesthetics is a frontier problem. Unlike tasks with clear right or wrong answers, aesthetics is subjective. Without a clear, objective benchmark, standard model improvement techniques are difficult to apply, so the problem is a better fit for Reinforcement Learning from Human Feedback (RLHF).

The best AI models are trained on data that reflects deep, subjective qualities—not just simple criteria. This "taste" is a key differentiator, influencing everything from code generation to creative writing, and is shaped by the values of the frontier lab.

Despite AI's ability to generate functional code, replicating the nuanced, subjective quality of a specific designer's "taste" remains extremely difficult. Felix Lee, after spending weeks attempting to codify his own taste into an AI model with little success, notes it's a significant unsolved challenge.

Current benchmarks focus on whether code passes tests. The future of AI evaluation must assess qualitative, human-centric aspects like 'design taste,' code maintainability, and alignment with a team's specific coding style. These qualities are hard to measure automatically, which signals a shift toward more complex, human-in-the-loop or LLM-judged evaluation frameworks.

True taste isn't just recognizing good design; it's the judgment of when to innovate versus when to adhere to established patterns. This discernment, the ability to zoom in and out, is a uniquely human skill that current AI models cannot replicate.

For subjective outputs like image aesthetics and face consistency, quantitative metrics are misleading. Google's team relies heavily on disciplined human evaluations, internal 'eyeballing,' and community testing to capture the subtle, emotional impact that benchmarks can't quantify.

For creative AI tools, quantitative benchmarks are insufficient. Descript relies on 'vibes' and the curated aesthetic judgment of trusted tastemakers to evaluate and select the best generative models, echoing Midjourney's strategy of having a 'thumb on the scale'.

AI models, trained on data divorced from our lived, biological experience, lack the innate aesthetic sense that almost all humans possess. This makes taste and aesthetic judgment a uniquely human and valuable contribution as AI handles more logical and computational tasks.