
A "vibe check" simply means using your own judgment as the scoring function to intuit whether an AI output is good. This aligns with the "do things that don't scale" startup principle and is a necessary first step before building more robust, scalable evaluation systems.

Related Insights

Don't treat evals as a mere checklist. Instead, use them as a creative tool to discover opportunities. A well-designed eval can reveal that a product is underperforming for a specific user segment, pointing directly to areas for high-impact improvement that a simple "vibe check" would miss.

The goal of testing multiple AI models isn't to crown a universal winner, but to build your own subjective "rule of thumb" for which model works best for the specific tasks you frequently perform. This personal topography is more valuable than any generic benchmark.

The "vibe coding" trend, where non-technical staff use AI to rapidly build prototypes, is a legitimate accelerator for innovation. However, it's not yet a substitute for professional engineers when building scalable, mission-critical systems that are ready for deployment.

Unlike traditional software development that starts with unit tests for quality assurance, AI product development often begins with "vibe testing." Developers test a broad hypothesis to see if the model's output *feels* right, prioritizing creative exploration over rigid, predefined test cases at the outset.

Building a functional AI agent is just the starting point. The real work lies in developing a set of evaluations ("evals") to test if the agent consistently behaves as expected. Without quantifying failures and successes against a standard, you're just guessing, not iteratively improving the agent's performance.
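To make "quantifying failures and successes against a standard" concrete, here is a minimal eval-harness sketch. The agent, cases, and check functions are all hypothetical stand-ins, not anything from the podcast: in practice the agent would call a model and the checks would encode your expected behaviors.

```python
# Minimal eval harness sketch (agent and cases are illustrative stand-ins).
def run_evals(agent, cases):
    """Run each case through the agent and tally pass/fail via its check function."""
    results = []
    for case in cases:
        output = agent(case["input"])
        results.append({"input": case["input"], "passed": case["check"](output)})
    passed = sum(r["passed"] for r in results)
    return {
        "pass_rate": passed / len(results),
        "failures": [r for r in results if not r["passed"]],
    }

# Toy "agent" that upper-cases its input, plus two deterministic checks.
toy_agent = lambda text: text.upper()
cases = [
    {"input": "hello", "check": lambda out: out == "HELLO"},
    {"input": "world", "check": lambda out: out.endswith("D")},
]
report = run_evals(toy_agent, cases)
```

Re-running the same cases after each prompt or model change turns "it feels better" into a pass rate you can track over time.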

Do not blindly trust an LLM's evaluation scores. The biggest mistake is showing stakeholders metrics that don't match their perception of product quality. To build trust, first hand-label a sample of data with binary outcomes (good/bad), then compare the LLM judge's scores against these human labels to ensure agreement before deploying the eval.
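The agreement check described above is just a match rate between two label lists. A sketch, with hand labels and judge scores hard-coded in place of a real labeling pass and a real LLM judge:

```python
# Sketch: measure agreement between an LLM judge's binary verdicts and human labels.
def agreement(human_labels, judge_scores):
    """Fraction of examples where the judge's verdict matches the human label."""
    assert len(human_labels) == len(judge_scores)
    matches = sum(h == j for h, j in zip(human_labels, judge_scores))
    return matches / len(human_labels)

# Illustrative labels; in practice these come from hand-labeling a real sample.
human = ["good", "bad", "good", "good", "bad"]
judge = ["good", "bad", "bad", "good", "bad"]
rate = agreement(human, judge)
```

If the rate is low, inspect the disagreements and revise the judge prompt before trusting its scores in front of stakeholders.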

Teams that claim to build AI on "vibes," like the Claude Code team, aren't ignoring evaluation. Their intense, expert-led dogfooding is a form of manual error analysis. Furthermore, their products are built on foundational models that have already undergone rigorous automated evaluations. The two approaches are part of the same quality spectrum, not opposites.

A one-size-fits-all evaluation method is inefficient. Use simple code for deterministic checks like word count. Leverage an LLM-as-a-judge for subjective qualities like tone. Reserve costly human evaluation for ambiguous cases flagged by the LLM or for validating new features.
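The tiered approach above can be sketched as a simple router: cheap deterministic checks run first, an LLM judge handles subjective calls, and anything the judge flags as ambiguous is escalated to a human. The function names and "unsure" convention are assumptions for illustration:

```python
# Sketch of tiered evaluation routing (names and verdict strings are hypothetical).
def evaluate(output, max_words=50, llm_judge=None):
    """Route an output: code check first, then LLM judge, escalating ambiguity."""
    # Tier 1: deterministic check, e.g. a word-count limit.
    if len(output.split()) > max_words:
        return {"verdict": "fail", "method": "code", "reason": "too long"}
    # Tier 2: LLM-as-a-judge for subjective qualities like tone.
    if llm_judge is not None:
        verdict = llm_judge(output)  # e.g. "pass", "fail", or "unsure"
        if verdict == "unsure":
            # Tier 3: reserve human review for cases the judge cannot settle.
            return {"verdict": "needs_human", "method": "human"}
        return {"verdict": verdict, "method": "llm_judge"}
    return {"verdict": "pass", "method": "code"}

# Stubbed judge for demonstration; a real one would call a model.
result = evaluate("Short and friendly reply.", llm_judge=lambda o: "pass")
```

The design point is cost ordering: each tier only sees what the cheaper tier before it could not resolve.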

You don't need to create an automated "LLM as a judge" for every potential failure. Many issues discovered during error analysis can be fixed with a simple prompt adjustment. Reserve the effort of building robust, automated evals for the 4-7 most persistent and critical failure modes that prompt changes alone cannot solve.

Instead of seeking a "magical system" for AI quality, the most effective starting point is a manual process called error analysis. This involves spending a few hours reading through ~100 random user interactions, taking simple notes on failures, and then categorizing those notes to identify the most common problems.
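The note-taking step is manual, but the categorize-and-count step is a one-liner. A sketch, with invented category names standing in for whatever patterns your own reading surfaces:

```python
from collections import Counter

# Sketch of the tallying step of error analysis (categories are made up).
def top_failure_modes(notes, n=3):
    """Count annotated failure categories and return the n most common."""
    return Counter(note["category"] for note in notes if note["category"]).most_common(n)

# Simple notes from reading interactions; None means no failure was observed.
notes = [
    {"id": 1, "category": "hallucinated_fact"},
    {"id": 2, "category": "ignored_instruction"},
    {"id": 3, "category": "hallucinated_fact"},
    {"id": 4, "category": None},
    {"id": 5, "category": "wrong_tone"},
]
common = top_failure_modes(notes)
```

The ranked list that comes out is exactly the shortlist of persistent failure modes worth promoting to automated evals.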

Treat Intuitive 'Vibe Checks' as a Valid, Non-Scalable Form of AI Evaluation | RiffOn