Reviewing user interaction data is the highest-ROI activity for improving an AI product. Instead of relying solely on third-party observability tools, high-performing teams build simple, custom internal applications. These tools are tailored to their specific data and workflow, removing all friction from looking at and annotating traces.
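As a minimal sketch of what such an internal tool can look like, the Streamlit app below pages through traces and appends free-text annotations to a file. The file names and fields (`traces.jsonl` with `trace_id`, `user_input`, `response`; `annotations.jsonl` for notes) are assumptions, not a prescribed schema.

```python
# Minimal trace-review app: page through traces and append annotations.
# Assumptions: traces live in traces.jsonl with "trace_id", "user_input",
# and "response" fields; notes are appended to annotations.jsonl.
import json
import streamlit as st

@st.cache_data
def load_traces(path="traces.jsonl"):
    return [json.loads(line) for line in open(path)]

traces = load_traces()
if "i" not in st.session_state:
    st.session_state["i"] = 0
i = st.session_state["i"]
trace = traces[i]

st.title(f"Trace {i + 1} of {len(traces)}")
st.subheader("User input")
st.write(trace["user_input"])
st.subheader("Model response")
st.write(trace["response"])

note = st.text_area("Notes (what went wrong, if anything?)", key=f"note_{i}")
prev_col, next_col = st.columns(2)
if next_col.button("Save & next"):
    with open("annotations.jsonl", "a") as f:
        f.write(json.dumps({"trace_id": trace["trace_id"], "note": note}) + "\n")
    st.session_state["i"] = min(i + 1, len(traces) - 1)
    st.rerun()
if prev_col.button("Back"):
    st.session_state["i"] = max(i - 1, 0)
    st.rerun()
```

Run it with `streamlit run review_app.py`. The point is that a reviewer can read and annotate a trace in seconds, with no SQL queries or dashboard clicking in between.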
Many teams wrongly focus on the latest models and frameworks. True improvement comes from classic product development: talking to users, preparing better data, optimizing workflows, and writing better prompts.
Top product teams like those at OpenAI don't just monitor high-level KPIs. They maintain a fanatical obsession with understanding the 'why' behind every micro-trend. When a metric shifts even slightly, they dig relentlessly to uncover the underlying user behavior or market dynamic causing it.
Instead of presenting static charts, teams can now upload raw data into AI tools to generate interactive visualizations on the fly. This transforms review meetings from passive presentations into active analysis sessions where leaders can ask new questions and explore data in real time without needing a data analyst.
The high-leverage AI agent for PMs is not a generic PRD generator but a personalized reviewer. By training an agent on your manager's past document reviews, you can pre-empt their specific feedback, align your work with their priorities, and increase your credibility and efficiency.
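A minimal sketch of the idea, approximating "training" with in-context examples rather than fine-tuning: it assumes the manager's past comments are collected in `past_reviews.txt`, the draft sits in `draft.md`, and the model name is illustrative.

```python
# Personalized-reviewer sketch: feed the manager's past comments as context,
# then ask the model to review a new draft in the same style.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

past_reviews = open("past_reviews.txt").read()   # assumption: verbatim comments, one per line
draft = open("draft.md").read()                  # assumption: the document to be reviewed

system = (
    "You review product documents the way this specific manager does. "
    "Below are verbatim comments they left on past documents. Learn their "
    "priorities and tone, then review the new draft the same way.\n\n"
    f"PAST REVIEW COMMENTS:\n{past_reviews}"
)

resp = client.chat.completions.create(
    model="gpt-4o",  # assumption: any capable chat model works here
    messages=[
        {"role": "system", "content": system},
        {"role": "user", "content": f"Review this draft:\n\n{draft}"},
    ],
)
print(resp.choices[0].message.content)
```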
AI evaluation shouldn't be confined to engineering silos. Subject matter experts (SMEs) and business users hold the critical domain knowledge to assess what's "good." Providing them with GUI-based tools, like an "eval studio," is crucial for continuous improvement and building trustworthy enterprise AI.
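One way such an "eval studio" can start is as a small labeling app. The Gradio sketch below shows an SME a model output, records a pass/fail verdict plus a short reason, and appends it to a CSV; the file names and fields (`examples.jsonl` with `input`/`output`, `sme_labels.csv`) are assumptions about one possible setup.

```python
# Minimal SME labeling app: show input/output pairs, collect pass/fail + notes.
import csv
import json
import gradio as gr

EXAMPLES = [json.loads(line) for line in open("examples.jsonl")]

def show(idx):
    ex = EXAMPLES[idx]
    return ex["input"], ex["output"]

def save(idx, verdict, note):
    # Append the SME's judgment, then advance to the next example.
    with open("sme_labels.csv", "a", newline="") as f:
        csv.writer(f).writerow([idx, verdict, note])
    nxt = min(idx + 1, len(EXAMPLES) - 1)
    return (nxt, *show(nxt), "", f"Saved label for example {idx}")

with gr.Blocks() as demo:
    idx = gr.State(0)
    user_input = gr.Textbox(label="User input", interactive=False)
    model_output = gr.Textbox(label="Model output", interactive=False)
    verdict = gr.Radio(["pass", "fail"], label="Verdict")
    note = gr.Textbox(label="Why? (optional)")
    status = gr.Markdown()
    gr.Button("Save & next").click(
        save,
        inputs=[idx, verdict, note],
        outputs=[idx, user_input, model_output, note, status],
    )
    demo.load(show, inputs=[idx], outputs=[user_input, model_output])

demo.launch()
```

The labels SMEs produce here become the ground truth you later align automated judges against, which is what makes the resulting evals trustworthy.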
Using plain-English rule files in tools like Cursor, data teams can create reusable AI agents that automate the entire A/B test write-up process. The agent can fetch data from an experimentation platform, pull context from Notion, analyze results, and generate a standardized report automatically.
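What goes into such a rule file is mostly plain English. A sketch of one is below; the path (e.g. a file under `.cursor/rules/`), the platform names, and the report sections are assumptions about one team's setup, not Cursor-specific syntax.

```
When I ask for an A/B test write-up:

1. Ask for the experiment ID if it isn't in my message, then fetch exposure
   and conversion data for that experiment from our experimentation platform.
2. Pull the experiment brief and success metrics from the linked Notion page.
3. Compute lift and a confidence interval for the primary metric, and flag any
   guardrail metric that moved beyond the agreed threshold.
4. Write the report with our standard sections: Context, Results,
   Interpretation, Decision, Follow-ups.
5. Never call a result significant without showing sample sizes and the
   p-value behind it.
```

Because the rules live alongside the project, anyone on the team gets the same standardized write-up by invoking the same agent.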
The common mistake in building AI evals is jumping straight to writing automated tests. The correct first step is a manual process called "error analysis" or "open coding," where a product expert reviews real user interaction logs to understand what's actually going wrong. This grounds your entire evaluation process in reality.
Developers often test AI systems with well-formed, correctly spelled questions. However, real users submit vague, typo-ridden, and ambiguous prompts. Directly analyzing these raw logs is the most crucial first step to understanding how your product fails in the real world and where to focus quality improvements.
Before diving into SQL, analysts can use enterprise AI search (like Notion AI) to query internal documents, PRDs, and Slack messages. This rapidly generates context and hypotheses about metric changes, replacing hours of manual digging and leading to better, faster analysis.
Instead of seeking a "magical system" for AI quality, the most effective starting point is a manual process called error analysis. This involves spending a few hours reading through ~100 random user interactions, taking simple notes on failures, and then categorizing those notes to identify the most common problems.
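A sketch of the mechanical half of that process, assuming raw logs are exported to `interactions.jsonl`; the notes and categories themselves still come from a person reading each sampled trace.

```python
# Error-analysis helper sketch. Assumptions: raw logs are in interactions.jsonl;
# notes.jsonl is produced by a person who read each sampled trace and added a
# short "note" and a one-or-two-word "category".
import json
import random
from collections import Counter

# Step 1: draw ~100 random interactions to read.
interactions = [json.loads(line) for line in open("interactions.jsonl")]
sample = random.sample(interactions, k=min(100, len(interactions)))
with open("sample_to_review.jsonl", "w") as f:
    for t in sample:
        f.write(json.dumps(t) + "\n")

# Step 2 (after the manual read-through): tally the hand-written categories
# to surface the most common failure modes.
notes = [json.loads(line) for line in open("notes.jsonl")]
counts = Counter(n["category"] for n in notes if n.get("category"))
for category, count in counts.most_common():
    print(f"{category}: {count}")
```

The output of the tally is a ranked list of failure modes, which is exactly the prioritized backlog of quality problems the "magical system" was supposed to produce.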