Don't start building evaluations from a blank slate. Use an AI agent to analyze your production traces and automatically generate a baseline 'vibe eval.' This initial evaluation won't be perfect, but it provides a starting point for refinement and accelerates the improvement loop.

Related Insights

Systematically review production traces ("open coding"), categorize the observed errors ("axial coding"), and then count them. This simple process turns subjective "vibe checks" and messy logs into a prioritized, data-backed roadmap for improving your AI application, which gives PMs concrete evidence for deciding what to fix first.
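As a rough illustration of the counting step, the sketch below tallies reviewer-assigned categories and ranks them by frequency. The trace IDs, notes, category names, and the `prioritize` helper are all hypothetical; they stand in for whatever a reviewer writes down during open and axial coding.

```python
from collections import Counter

# Hypothetical review output: (trace_id, open-coding note, axial-coding category).
# In practice these come from a person reading real traces.
REVIEWED = [
    ("t1", "ignored the date the user asked for", "missed_constraint"),
    ("t2", "cited a policy that does not exist", "hallucinated_fact"),
    ("t3", "booked the wrong city", "missed_constraint"),
    ("t4", "reply was curt and unhelpful", "tone"),
]


def prioritize(reviewed):
    """Count traces per category and return (category, count, share), most frequent first."""
    counts = Counter(category for _, _, category in reviewed)
    total = sum(counts.values())
    return [(category, n, n / total) for category, n in counts.most_common()]


if __name__ == "__main__":
    for category, n, share in prioritize(REVIEWED):
        print(f"{category}: {n} traces ({share:.0%})")
```

The ranked list is the roadmap: the category at the top is the failure mode seen most often in the reviewed sample.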

To move beyond 'vibe-based' AI usage, create an automated weekly report that scores your performance on key dimensions like automation and learning. This provides objective feedback, grounds your sense of progress in data, and highlights specific areas for improvement.

If your application isn't live and you lack real user data, you can still perform evals. The best methods are dogfooding and recruiting friends. If that's not possible, use an LLM to simulate user interactions at scale. This generates the necessary traces to begin the crucial error analysis process before launch.

Before building an AI agent, product managers must first create an evaluation set and scorecard. This 'eval-driven development' approach is critical for measuring whether training is improving the model and aligning its progress with the product vision. Without it, you cannot objectively demonstrate progress.

Establish a powerful feedback loop where the AI agent analyzes your notes to find inefficiencies, proposes a solution as a new custom command, and then immediately writes the code for that command upon your approval. The system becomes self-improving, building its own upgrades.

Notion treats its entire evaluation process as a coding agent problem. The system is designed for an agent to download a dataset, run an eval, identify a failure, debug the issue, and implement a fix, all within an automated loop. This turns quality assurance into a meta-problem for agents to solve.

Building a functional AI agent is just the starting point. The real work lies in developing a set of evaluations ("evals") to test if the agent consistently behaves as expected. Without quantifying failures and successes against a standard, you're just guessing, not iteratively improving the agent's performance.
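One minimal way to quantify successes and failures against a standard is a pass rate over a fixed set of cases. The sketch below assumes the agent can be called as a plain function; `toy_agent`, `CASES`, and `run_evals` are hypothetical names used only for illustration, not part of any particular framework.

```python
def run_evals(agent, cases):
    """Run each case through the agent and return (pass_rate, per-case results)."""
    results = []
    for case in cases:
        output = agent(case["input"])
        results.append(
            {
                "input": case["input"],
                "output": output,
                "passed": bool(case["check"](output)),
            }
        )
    pass_rate = sum(r["passed"] for r in results) / len(results) if results else 0.0
    return pass_rate, results


def toy_agent(text):
    # Stand-in for a real agent call; it only normalizes the input.
    return text.strip().upper()


# Each case pairs an input with a check that encodes the expected behavior.
CASES = [
    {"input": " hello ", "check": lambda out: out == "HELLO"},
    {"input": "refund policy", "check": lambda out: "30 DAYS" in out},
]


if __name__ == "__main__":
    rate, results = run_evals(toy_agent, CASES)
    print(f"pass rate: {rate:.0%}")
    for r in results:
        print("PASS" if r["passed"] else "FAIL", repr(r["input"]))
```

Rerunning the same cases after each change to the agent shows whether the pass rate actually moved.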

The common mistake in building AI evals is jumping straight to writing automated tests. The correct first step is a manual process called "error analysis" or "open coding," where a product expert reviews real user interaction logs to understand what's actually going wrong. This grounds your entire evaluation process in reality.

You don't need a sophisticated and expensive AI observability platform to start doing evals. The most critical first step is logging traces. This can be done simply by writing to a CSV, JSON, or text file. The key is to begin taking notes on your traces, not to implement the perfect tool.
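A minimal sketch of file-based trace logging, using only the Python standard library, might look like the following. It assumes one JSON object per line (JSON Lines); the field names and the `log_trace` and `read_traces` helpers are hypothetical choices for illustration.

```python
import json
import time
from pathlib import Path


def log_trace(path, user_input, output, note=""):
    """Append one trace as a JSON line and return the record that was written."""
    record = {
        "ts": time.time(),
        "input": user_input,
        "output": output,
        "note": note,  # free-text reviewer note, filled in during error analysis
    }
    with Path(path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record


def read_traces(path):
    """Load every logged trace back into a list of dicts."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


if __name__ == "__main__":
    log_trace("traces.jsonl", "Where is my order?", "I could not find that order.")
    print(len(read_traces("traces.jsonl")), "traces logged")
```

Appending one line per interaction keeps the file easy to open in a text editor or spreadsheet, which is all that note-taking on traces requires.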

The modern product development cycle for AI is a tight, iterative loop executed within a coding agent. This involves creating the agent, tracing every step for observability, running evaluations (evals) to find weaknesses, and then improving the agent based on those findings.