If your application isn't live and you lack real user data, you can still perform evals. The best methods are dogfooding and recruiting friends. If that's not possible, use an LLM to simulate user interactions at scale. This generates the necessary traces to begin the crucial error analysis process before launch.

Related Insights

To ensure AI reliability, Salesforce builds environments that mimic enterprise CRM workflows, not game worlds. They use synthetic data and introduce corner cases like background noise, accents, or conflicting user requests to find and fix agent failure points before deployment, closing the "reality gap."

Don't ask an LLM to perform initial error analysis; it lacks the product context to spot subtle failures. Instead, have a human expert write detailed, freeform notes ("open codes"). Then, leverage an LLM's strength in synthesis to automatically categorize those hundreds of human-written notes into actionable failure themes ("axial codes").

To test complex AI prompts for tasks like customer persona generation without exposing sensitive company data, first ask the AI to create realistic, synthetic data (e.g., fake sales call notes). This allows you to safely develop and refine prompts before applying them to real, proprietary information, overcoming data privacy hurdles in experimentation.

Use Claude's "Artifacts" feature to generate interactive, LLM-powered application prototypes directly from a prompt. This allows product managers to test the feel and flow of a conversational AI, including latency and response length, without needing API keys or engineering support, bridging the gap between a static mock and a coded MVP.

Building a functional AI agent is just the starting point. The real work lies in developing a set of evaluations ("evals") to test if the agent consistently behaves as expected. Without quantifying failures and successes against a standard, you're just guessing, not iteratively improving the agent's performance.

The common mistake in building AI evals is jumping straight to writing automated tests. The correct first step is a manual process called "error analysis" or "open coding," where a product expert reviews real user interaction logs to understand what's actually going wrong. This grounds your entire evaluation process in reality.

You don't need to create an automated "LLM as a judge" for every potential failure. Many issues discovered during error analysis can be fixed with a simple prompt adjustment. Reserve the effort of building robust, automated evals for the 4-7 most persistent and critical failure modes that prompt changes alone cannot solve.

The prompts for your "LLM as a judge" evals function as a new form of PRD. They explicitly define the desired behavior, edge cases, and quality standards for your AI agent. Unlike static PRDs, these are living documents, derived from real user data and are constantly, automatically testing if the product meets its requirements.

You don't need a sophisticated and expensive AI observability platform to start doing evals. The most critical first step is logging traces. This can be done simply by writing to a CSV, JSON, or text file. The key is to begin taking notes on your traces, not to implement the perfect tool.

Instead of creating mock data from scratch, provide an LLM with your existing production data schema as a JSON file. You can then prompt the AI to augment this schema with new fields and realistic data needed to prototype a new feature, seamlessly extending your current data model.