
AI models and frameworks change constantly. A deep understanding of user needs, encoded into a robust evaluation suite, is a lasting asset. This allows you to continuously iterate and improve quality, regardless of which new model or agent framework becomes popular.

Related Insights

While evals involve testing, their purpose isn't merely to report bugs, as traditional QA does (information). For an AI PM, evals are a core tool for actively shaping and improving the product's behavior and performance (transformation) by iteratively refining prompts, models, and orchestration layers.

Before building an AI agent, product managers must first create an evaluation set and scorecard. This 'eval-driven development' approach is critical for measuring whether training is improving the model and aligning its progress with the product vision. Without it, you cannot objectively demonstrate progress.
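The eval-set-and-scorecard idea above can be sketched in a few lines. Everything here is a hypothetical illustration: the case data, the pass/fail checks, and the toy agent are stand-ins for real user scenarios and a real model call.

```python
# A minimal sketch of eval-driven development: an eval set plus a scorecard.
# The cases, checks, and toy_agent below are hypothetical placeholders.

from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    prompt: str                   # input the agent receives
    check: Callable[[str], bool]  # pass/fail criterion for the output

def run_scorecard(agent: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Run every case through the agent and return the pass rate."""
    passed = sum(1 for c in cases if c.check(agent(c.prompt)))
    return passed / len(cases)

# Two cases encoding product requirements as objective checks.
cases = [
    EvalCase("Summarize: refund policy is 30 days.", lambda out: "30" in out),
    EvalCase("Translate 'hello' to French.", lambda out: "bonjour" in out.lower()),
]

toy_agent = lambda prompt: "bonjour, 30 days"  # stand-in for a real model call
print(f"pass rate: {run_scorecard(toy_agent, cases):.0%}")
```

Running the scorecard before and after each training or prompt change gives the objective progress measure the insight describes: the pass rate either moves toward the product vision or it doesn't.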

Simply offering the latest model is no longer a competitive advantage. True value is created in the system built around the model—the system prompts, tools, and overall scaffolding. This 'harness' is what optimizes a model's performance for specific tasks and delivers a superior user experience.

With AI commoditizing the tech stack, traditional technical moats are disappearing. The only sustainable differentiator at the application layer is having a unique insight into a problem and assembling a team that can out-iterate everyone else. Your long-term defensibility becomes customer love built through relentless execution.

Standardized benchmarks for AI models are largely irrelevant for business applications. Companies need to create their own evaluation systems tailored to their specific industry, workflows, and use cases to accurately assess which new model provides a tangible benefit and ROI.

In a world where AI implementation is becoming cheaper, the real competitive advantage isn't speed or features. It's the accumulated knowledge gained through the difficult, iterative process of building and learning. This "pain" of figuring out what truly works for a specific problem becomes a durable moat.

The primary bottleneck in improving AI is no longer data or compute, but the creation of 'evals'—tests that measure a model's capabilities. These evals act as product requirement documents (PRDs) for researchers, defining what success looks like and guiding the training process.

As AI makes building software features trivial, the sustainable competitive advantage shifts to data. A true data moat uses proprietary customer interaction data to train AI models, creating a feedback loop that continuously improves the product faster than competitors.

The founder of Stormy AI focuses on building a company that benefits from, rather than competes with, improving foundation models. He avoids over-optimizing for current model limitations, ensuring his business becomes stronger, not obsolete, with every new release like GPT-5. This strategy is key to building a durable AI company.

The prompts for your "LLM as a judge" evals function as a new form of PRD. They explicitly define the desired behavior, edge cases, and quality standards for your AI agent. Unlike static PRDs, these are living documents: derived from real user data, they constantly and automatically test whether the product meets its requirements.
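A minimal sketch of what such a judge prompt might look like, assuming a generic callable model. The requirement list, the `judge_model` interface, and the stub are all hypothetical illustrations; note how the requirements section of the prompt is literally the PRD.

```python
# Sketch: an "LLM as a judge" prompt whose requirements section acts as a PRD.
# The requirements, judge_model interface, and stub are hypothetical examples.

JUDGE_PROMPT = """You are grading an AI support agent's reply.

Requirements (this section IS the PRD):
1. Never promise a refund outside the 30-day window.
2. Always include a link to the help center.
3. Tone: concise and polite; no more than 3 sentences.

User message:
{user_message}

Agent reply:
{agent_reply}

Answer PASS or FAIL, then one sentence of justification."""

def judge(user_message: str, agent_reply: str, judge_model) -> bool:
    """Return True if the judge model grades the reply as PASS."""
    verdict = judge_model(JUDGE_PROMPT.format(
        user_message=user_message, agent_reply=agent_reply))
    return verdict.strip().upper().startswith("PASS")

# Stub judge for demonstration; in practice this is a real model call.
stub = lambda prompt: "PASS - reply stays within policy."
print(judge("Where is my refund?", "Please see our help center link.", stub))
```

Because the requirements live in the judge prompt, updating the PRD and updating the test are the same edit, which is what makes this a living document rather than a static spec.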

Invest in Evals as Your Durable Moat, Not in Transient LLM or Agent Architectures | RiffOn