For subjective outputs like image aesthetics and face consistency, quantitative metrics are misleading. Google's team relies heavily on disciplined human evaluations, internal 'eyeballing,' and community testing to capture the subtle, emotional impact that benchmarks can't quantify.
AI excels where success is quantifiable (e.g., code generation). Its greatest challenge lies in subjective domains like mental health or education. Progress requires a messy, societal conversation to define 'success,' not just a developer-built technical leaderboard.
Users are dissatisfied with purely AI-generated creative output like interior design, calling it "slop." This creates an opportunity for platforms that blend AI's efficiency with a human's taste and curation, for which consumers are willing to pay a premium.
AI is engineered to eliminate errors, which is precisely its limitation. True human creativity stems from our "bugs"—our quirks, emotions, misinterpretations, and mistakes. This ability to be imperfect is what will continue to separate human ingenuity from artificial intelligence.
True creative mastery emerges from an unpredictable human process. AI can generate options quickly but bypasses this journey, losing the potential for inexplicable, last-minute genius that defines truly great work. It optimizes for speed at the cost of brilliance.
The breakthrough performance of Nano Banana wasn't just about massive datasets. The team emphasizes the importance of 'craft'—attention to detail, high-quality data curation, and numerous small design decisions. This human element of quality control is as crucial as model scale.
The 'aha' moment for Google's team came when the AI model accurately rendered their own faces. Judging consistency on unfamiliar faces is unreliable; the most stringent and meaningful evaluation comes from a person judging an AI-generated image of themselves.
Do not blindly trust an LLM's evaluation scores. The biggest mistake is showing stakeholders metrics that don't match their perception of product quality. To build trust, first hand-label a sample of data with binary outcomes (good/bad), then measure the LLM judge's agreement with these human labels before deploying the eval.
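A minimal sketch of that agreement check, assuming toy hand-labels and a hypothetical 0.5 cutoff for converting the judge's raw scores to binary calls; the data and threshold are illustrative, not from the source, while raw agreement and Cohen's kappa (via scikit-learn) are standard ways to quantify how well the judge tracks human labels.

```python
from sklearn.metrics import cohen_kappa_score, confusion_matrix

# Hypothetical hand-labeled sample: 1 = good, 0 = bad.
human_labels = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]

# Hypothetical raw scores from the LLM judge on the same ten examples.
judge_scores = [0.9, 0.2, 0.7, 0.4, 0.1, 0.8, 0.6, 0.3, 0.95, 0.85]

# Threshold the judge's scores onto the same binary scale
# (0.5 is an illustrative cutoff, not a recommendation).
judge_labels = [1 if s >= 0.5 else 0 for s in judge_scores]

# Raw agreement: fraction of examples where the judge matches the human.
agreement = sum(h == j for h, j in zip(human_labels, judge_labels)) / len(human_labels)

# Cohen's kappa corrects for agreement expected by chance, which matters
# when labels are imbalanced (e.g., mostly "good").
kappa = cohen_kappa_score(human_labels, judge_labels)

print(f"raw agreement = {agreement:.2f}, kappa = {kappa:.2f}")
print(confusion_matrix(human_labels, judge_labels))  # rows: human, cols: judge
```

If agreement is low, the confusion matrix shows which direction the judge errs (too lenient or too harsh), which is far more actionable than a single score before trusting the eval at scale.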
Quantifying the "goodness" of an AI-generated summary is analogous to measuring the impact of a peacebuilding initiative. Both require moving beyond simple quantitative data (clicks, meetings held) to define and measure complex, ineffable outcomes by focusing on the qualitative "so what."
The best AI models are trained on data that reflects deep, subjective qualities—not just simple criteria. This "taste" is a key differentiator, influencing everything from code generation to creative writing, and is shaped by the values of the frontier lab that trains the model.
AI tools can drastically increase the volume of initial creative explorations, moving from 3 directions to 10 or more. The designer's role then shifts from pure creation to expert curation, using their taste to edit AI outputs into winning concepts.