Tobi Lütke argues the true measure of AI's capability isn't passing conversational tests. He proposes a new benchmark: prompting an AI to start a business and successfully generate $1 million in revenue. This tests its ability to act in the real world: to market, prioritize, and find product-market fit.

Related Insights

The true, underhyped potential of AI isn't just making existing tasks more efficient. Tobi Lütke argues we should use first principles thinking: 'If AI had always been here, how would we have designed this job from scratch?' This approach moves beyond optimization to complete reinvention of roles and workflows.

Many leaders test AI with simple, surface-level experiments. But modern AI is so advanced that these small tests create a false sense of understanding. According to Braze CPO Kevin Wang, genuine value is only revealed when AI is applied to complex, multi-team business problems and real-world workloads.

The democratization of technology via AI shifts the entrepreneurial goalpost. Instead of focusing on creating a handful of billion-dollar "unicorns," the more impactful ambition is to empower millions of people to each build a million-dollar "donkey corn" business, truly broadening economic opportunity.

The debate over whether LLMs are truly "intelligent" is academic. The practical test for product builders is whether the tool produces valuable outputs that lead to better decisions, regardless of the underlying mechanism.

Just as standardized tests fail to capture a student's full potential, AI benchmarks often don't reflect real-world performance. The true value comes from the 'last mile' ingenuity of productization and workflow integration, not just raw model scores, which can be misleading.

Altman argues that as AI capabilities grow, abstract technical benchmarks become less relevant. He suggests the ultimate measure of an AI's effectiveness will be its direct economic contribution, jokingly proposing "GDP impact" as the next major metric to watch.

Sam Altman suggests that as AI models create enormous economic value, proxy metrics like task completion benchmarks will become obsolete. The most meaningful chart will be the model's direct impact on GDP. This signals a fundamental shift from the research phase of AI to an era of broad economic transformation.

Cutting through abstract definitions, Quora CEO Adam D'Angelo offers a practical benchmark for AGI: an AI that can perform any job a typical human can do remotely. This anchors the concept to tangible economic impact, providing a more useful milestone than philosophical debates on consciousness.

Traditional AI benchmarks are seen as increasingly incremental and less interesting. The new frontier for evaluating a model's true capability lies in applied, complex tasks that mimic real-world interaction, such as building in Minecraft (MC Bench) or managing a simulated business (VendingBench), which are more revealing of raw intelligence.

OpenAI's new GDPval benchmark evaluates models on complex, real-world knowledge work tasks, not abstract IQ tests. This pivot signifies that the true measure of AI progress is now its ability to perform economically valuable human jobs, making performance metrics directly comparable to professional output.