We scan new podcasts and send you the top 5 insights daily.
Building non-deterministic AI products fundamentally changes the PM role. Instead of creating detailed, rigid specifications, the PM's primary task becomes defining and codifying "what good looks like." This is done by repeatedly grading AI outputs to train evaluation systems and guide the model's behavior.
As AI tools automate coding and prototyping, the product manager's core function is no longer detailed specification writing. Instead, their value lies in judgment, facilitation, and making the right strategic decisions quickly. The emphasis moves from the 'how' of building to the 'what' and 'why,' making decision-making the critical skill.
While evals involve testing, their purpose isn't just to report bugs (information), like traditional QA. For an AI PM, evals are a core tool to actively shape and improve the product's behavior and performance (transformation) by iteratively refining prompts, models, and orchestration layers.
Before building an AI agent, product managers must first create an evaluation set and scorecard. This 'eval-driven development' approach is critical for measuring whether training is improving the model and aligning its progress with the product vision. Without it, you cannot objectively demonstrate progress.
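A minimal sketch of what an eval set and scorecard might look like in practice. The case structure, the `must_include` check, and the `run_agent` stand-in are all illustrative assumptions, not a real framework.

```python
# Hypothetical eval set: each case pairs an input with the criteria
# a "good" output must satisfy. All names here are illustrative.
EVAL_SET = [
    {"input": "Summarize this refund policy in one sentence.",
     "must_include": ["refund"]},
    {"input": "List the main risk of shipping without tests.",
     "must_include": ["risk"]},
]

def run_agent(prompt: str) -> str:
    # Stand-in for the real model/agent call.
    return f"Draft answer mentioning refund and risk for: {prompt}"

def scorecard(eval_set) -> float:
    """Fraction of eval cases whose output contains every required term."""
    passed = 0
    for case in eval_set:
        output = run_agent(case["input"]).lower()
        if all(term in output for term in case["must_include"]):
            passed += 1
    return passed / len(eval_set)

print(f"pass rate: {scorecard(EVAL_SET):.0%}")
```

Writing the eval set first gives every subsequent model or prompt change a single number to beat, which is what makes progress demonstrable.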
Unlike traditional software, AI products are evolving systems. The role of an AI PM shifts from defining fixed specifications to managing uncertainty, bias, and trust. The focus is on creating feedback loops for continuous improvement and establishing guardrails for model behavior post-launch.
AI's rapid capability growth makes top-down product specs obsolete. Product Managers now work bottom-up with engineers, prototyping and even checking in code using AI tools. This blurs traditional roles, shifting the PM's focus to defining high-level customer needs and evaluating outcomes rather than prescribing features.
The primary bottleneck in improving AI is no longer data or compute, but the creation of 'evals'—tests that measure a model's capabilities. These evals act as product requirement documents (PRDs) for researchers, defining what success looks like and guiding the training process.
Because PMs deeply understand the customer's job, needs, and alternatives, they are the only ones qualified to write the evaluation criteria for what a successful AI output looks like. This critical task goes beyond technical metrics and is core to the PM's role in the AI era.
Instead of traditional product requirements documents, AI PMs should define success through a set of specific evaluation metrics. Engineers then work to improve the system's performance against these evals in a "hill climbing" process, making the evals the functional specification for the product.
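The "hill climbing" loop can be sketched as trying candidate changes and keeping whichever scores highest on a fixed eval set. The prompts and the `score` function below are invented stand-ins for the real system and grader.

```python
# Hypothetical hill climb over prompt variants: the evals are held
# fixed and act as the spec; engineers keep whatever scores best.
CANDIDATE_PROMPTS = [
    "Answer briefly.",
    "Answer briefly and cite the source.",
    "Answer briefly, cite the source, and flag uncertainty.",
]

def score(prompt: str) -> float:
    # Stand-in for running the full eval set; here longer, more
    # explicit instructions happen to score better, purely so the
    # loop has something to climb.
    return len(prompt) / 100

best_prompt, best_score = None, float("-inf")
for prompt in CANDIDATE_PROMPTS:
    s = score(prompt)
    if s > best_score:
        best_prompt, best_score = prompt, s

print(f"best: {best_prompt!r} at {best_score:.2f}")
```

The key property is that the eval score, not a written requirement, is the arbiter of which variant ships.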
To manage non-deterministic AI products, Shopify created an internal tool where PMs grade AI-generated outputs. This creates a "ground truth" dataset of what "good" looks like, which is then used to fine-tune a separate LLM that acts as an automated quality judge for new features and updates.
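One way to picture the "ground truth" such a grading tool accumulates is as a record per judged output. The field names and schema below are illustrative guesses, not Shopify's actual internal format.

```python
# Hypothetical grading record: each PM judgment becomes one labeled
# example for fine-tuning the automated judge LLM.
import json
from dataclasses import dataclass, asdict

@dataclass
class GradedOutput:
    feature: str     # which AI feature produced the output
    prompt: str      # the input sent to the model
    output: str      # what the model produced
    grade: str       # PM verdict, e.g. "good" / "bad"
    rationale: str   # why, so the judge model can learn the criteria

record = GradedOutput(
    feature="product-description-writer",
    prompt="Write a description for a ceramic mug.",
    output="A sturdy ceramic mug for daily coffee.",
    grade="good",
    rationale="Accurate, concise, on-brand tone.",
)

# Serialized, records like this form the judge's training set.
print(json.dumps(asdict(record), indent=2))
```

Capturing the rationale alongside the grade is what lets the judge model generalize the PM's taste rather than memorize verdicts.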
The prompts for your "LLM as a judge" evals function as a new form of PRD. They explicitly define the desired behavior, edge cases, and quality standards for your AI agent. Unlike static PRDs, these are living documents, derived from real user data, that constantly and automatically test whether the product meets its requirements.
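To make the "judge prompt as PRD" idea concrete, here is a sketch of what such a prompt might contain. The rubric, wording, and placeholder names are invented for illustration.

```python
# Hypothetical LLM-as-judge prompt: the numbered requirements read
# like a spec, and every graded output is a test against that spec.
JUDGE_PROMPT = """\
You are grading an AI shopping assistant's reply.

Requirements (treat these as the product spec):
1. Answers the customer's actual question.
2. Never invents prices or stock levels.
3. Tone is friendly and the reply is under 80 words.

Reply with PASS or FAIL, then one sentence of reasoning.

Customer message: {customer_message}
Assistant reply: {assistant_reply}
"""

print(JUDGE_PROMPT.format(
    customer_message="Do you have this in blue?",
    assistant_reply="Yes, the blue version is in stock online.",
))
```

Because the rubric lives in the prompt, updating the product's requirements and updating its automated tests become the same edit.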