Instead of manually crafting complex evaluation prompts, a more effective workflow is to have a human define the high-level criteria and red flags, then feed that guidance into a capable LLM to generate the final, detailed, and robust prompt for the evaluation system. LLMs are often better at turning such guidance into well-structured prompts than humans are at writing them from scratch.
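A minimal sketch of that workflow, assuming the OpenAI Python client; the criteria, red flags, and model name below are placeholders for your own:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Human-authored guidance: high-level criteria and red flags, not a finished prompt.
criteria = [
    "Answers cite only information present in the retrieved documents",
    "Tone stays professional and concise",
]
red_flags = [
    "Invents policy details not found in the source material",
    "Promises refunds or legal outcomes",
]

criteria_block = "\n".join(f"- {c}" for c in criteria)
red_flag_block = "\n".join(f"- {r}" for r in red_flags)

meta_prompt = f"""You are an expert prompt engineer. Write a detailed, robust
evaluation prompt for an LLM judge that grades customer-support responses.

The judge must check these criteria:
{criteria_block}

It must flag these failure modes:
{red_flag_block}

Instruct the judge to return a pass/fail verdict plus a one-sentence
justification for every criterion and red flag."""

resp = client.chat.completions.create(
    model="gpt-4o",  # illustrative model choice
    messages=[{"role": "user", "content": meta_prompt}],
)
print(resp.choices[0].message.content)  # the generated evaluation prompt, ready for human review
```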
Instead of manually crafting a system prompt, feed an LLM multiple "golden conversation" examples. Then, ask the LLM to analyze these examples and generate a system prompt that would produce similar conversational flows. This reverses the typical prompt engineering process, letting the ideal output define the instructions.
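A sketch of this reversal, again assuming the OpenAI Python client; the golden conversations here are toy placeholders:

```python
import json

from openai import OpenAI

client = OpenAI()

# Hand-picked "golden" conversations that show exactly how the assistant should behave.
golden_conversations = [
    [
        {"role": "user", "content": "My order arrived damaged."},
        {"role": "assistant", "content": "I'm sorry about that. Could you share your order number so I can arrange a replacement?"},
    ],
    [
        {"role": "user", "content": "Can I change my shipping address?"},
        {"role": "assistant", "content": "Yes, as long as the order hasn't shipped. What's the new address?"},
    ],
]

reverse_prompt = (
    "Below are example conversations that show exactly how our assistant should behave.\n"
    "Analyze their tone, structure, and decision-making, then write a system prompt\n"
    "that would reliably produce conversations like these.\n\n"
    + "\n\n".join(json.dumps(conv, indent=2) for conv in golden_conversations)
)

resp = client.chat.completions.create(
    model="gpt-4o",  # illustrative
    messages=[{"role": "user", "content": reverse_prompt}],
)
print(resp.choices[0].message.content)  # candidate system prompt, derived from the examples
```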
Don't ask an LLM to perform initial error analysis; it lacks the product context to spot subtle failures. Instead, have a human expert write detailed, freeform notes ("open codes"). Then, leverage an LLM's strength in synthesis to automatically categorize those hundreds of human-written notes into actionable failure themes ("axial codes").
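A minimal sketch of the synthesis step, assuming the OpenAI Python client; the notes are illustrative stand-ins for a human reviewer's real open codes:

```python
from openai import OpenAI

client = OpenAI()

# Freeform notes written by a human reviewer while reading real transcripts ("open codes").
open_codes = [
    "Bot repeated the refund policy three times without answering the question",
    "Asked for the order number even though the user already gave it",
    "Tone turned oddly formal mid-conversation",
    # ...hundreds more in practice
]

notes_block = "\n".join(f"- {note}" for note in open_codes)

synthesis_prompt = f"""Below are freeform error notes written by a human reviewer.
Group them into a small set of recurring failure themes ("axial codes").
For each theme, give a short name, a one-sentence definition, and the notes that belong to it.

Notes:
{notes_block}"""

resp = client.chat.completions.create(
    model="gpt-4o",  # illustrative
    messages=[{"role": "user", "content": synthesis_prompt}],
)
print(resp.choices[0].message.content)  # failure themes to prioritize and fix
```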
After deconstructing successful content into a playbook, build a master prompt. This prompt's function is to systematically interview you for the specific context, ideas, and details needed to generate new content that adheres to your proven, successful formula, effectively automating quality control.
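A small sketch of what such a master prompt might look like; the playbook rules are hypothetical and would come from your own deconstruction of past successes:

```python
# A "master prompt" built from a playbook of what made past content work.
playbook = [
    "Open with a concrete, surprising statistic",
    "One core argument per post, stated in the first 100 words",
    "Close with a single, specific call to action",
]

master_prompt = (
    "You are my content co-writer. Before drafting anything, interview me one question\n"
    "at a time until you have everything needed to follow this playbook:\n"
    + "\n".join(f"- {rule}" for rule in playbook)
    + "\n\nAsk about the topic, the audience, the core argument, supporting evidence,\n"
    "and the call to action. Only start writing once every playbook item is covered."
)

print(master_prompt)  # paste into your chat tool as the system or opening message
```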
Instead of manually refining a complex prompt, create a process where an AI agent evaluates its own output. By providing a framework for self-critique, including quantitative scores and qualitative reasoning, the AI can iteratively enhance its own system instructions and achieve a much stronger result.
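A rough sketch of such a self-critique loop, assuming the OpenAI Python client and that the critique model follows the requested JSON schema; the task, keys, and round count are illustrative:

```python
import json

from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o"  # illustrative model choice

task = "Summarize this support ticket for the on-call engineer: <ticket text here>"
system_prompt = "You write concise, accurate summaries."

for round_num in range(3):  # a few self-critique rounds is usually enough to see gains
    draft = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "system", "content": system_prompt},
                  {"role": "user", "content": task}],
    ).choices[0].message.content

    critique = client.chat.completions.create(
        model=MODEL,
        response_format={"type": "json_object"},  # ask the API for valid JSON
        messages=[{"role": "user", "content": (
            "Critique the output below against its system prompt. Return JSON with keys "
            '"scores" (1-10 each for accuracy, concision, usefulness), "reasoning", and '
            '"revised_system_prompt" containing an improved prompt that fixes the weaknesses.\n\n'
            f"System prompt: {system_prompt}\n\nOutput: {draft}"
        )}],
    ).choices[0].message.content

    review = json.loads(critique)  # assumes the critique follows the requested schema
    print(f"round {round_num}: scores={review.get('scores')}")
    system_prompt = review["revised_system_prompt"]  # the loop rewrites its own instructions
```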
Achieve higher-quality results by using an AI to first generate an outline or plan. Then, refine that plan with follow-up prompts before asking for the final execution. This lets you course-correct early instead of wasting time on flawed one-shot outputs.
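A minimal plan-then-execute sketch, assuming the OpenAI Python client; the topic and the correction in step 2 are placeholders for a human review pass:

```python
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o"  # illustrative

# Step 1: ask for a plan, not the finished artifact.
history = [{"role": "user", "content":
            "Outline a migration guide from REST polling to webhooks. Outline only, no prose."}]
outline = client.chat.completions.create(model=MODEL, messages=history).choices[0].message.content
history.append({"role": "assistant", "content": outline})

# Step 2: a human reviews the outline and requests corrections before any drafting happens.
history.append({"role": "user", "content":
                "Good, but add a section on retries and idempotency, and drop the pricing section."})
revised = client.chat.completions.create(model=MODEL, messages=history).choices[0].message.content
history.append({"role": "assistant", "content": revised})

# Step 3: only now ask for full execution against the approved plan.
history.append({"role": "user", "content": "Write the full guide following this outline exactly."})
final = client.chat.completions.create(model=MODEL, messages=history).choices[0].message.content
print(final)
```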
Product managers may lack the expertise to create comprehensive evals from scratch. A better approach is to generate initial outputs with a base model, have subject matter experts review them, and use their direct feedback to define what constitutes a failure. It's easier for experts to spot mistakes than to predict them.
A one-size-fits-all evaluation method is inefficient. Use simple code for deterministic checks like word count. Leverage an LLM-as-a-judge for subjective qualities like tone. Reserve costly human evaluation for ambiguous cases flagged by the LLM or for validating new features.
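A small routing sketch of this tiered setup, assuming the OpenAI Python client; the word-count limit, judge model, and tone criterion are placeholders:

```python
from openai import OpenAI

client = OpenAI()

def check_word_count(text: str, max_words: int = 150) -> bool:
    # Deterministic check: cheap, exact, no model needed.
    return len(text.split()) <= max_words

def judge_tone(text: str) -> str:
    # Subjective check: delegate to an LLM judge. Returns "pass", "fail", or "unsure".
    verdict = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative; judges can be smaller, cheaper models
        messages=[{"role": "user", "content":
            "Does this reply keep a professional, empathetic tone? "
            "Answer with exactly one word: pass, fail, or unsure.\n\n" + text}],
    ).choices[0].message.content.strip().lower()
    return verdict if verdict in {"pass", "fail", "unsure"} else "unsure"

def evaluate(reply: str) -> str:
    if not check_word_count(reply):
        return "fail: too long"          # code-level check, costs nothing
    verdict = judge_tone(reply)
    if verdict == "unsure":
        return "route to human review"   # reserve people for the ambiguous cases
    return verdict

print(evaluate("Thanks for reaching out! Here's how to reset your password: ..."))
```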
When a prompt yields poor results, use a meta-prompting technique. Feed the failing prompt back to the AI, describe the incorrect output, specify the desired outcome, and explicitly grant it permission to rewrite, add, or delete. The AI will then debug and improve its own instructions.
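A sketch of that debugging meta-prompt, assuming the OpenAI Python client; the failing prompt, observed problem, and desired outcome are hypothetical examples:

```python
from openai import OpenAI

client = OpenAI()

failing_prompt = "Summarize the meeting notes in bullet points."          # the prompt that misbehaves
observed_problem = "The summaries run 20+ bullets and bury the action items."
desired_outcome = "At most 5 bullets, with action items listed first."

debug_prompt = f"""Here is a prompt I'm using and the problem with its output.

PROMPT:
{failing_prompt}

WHAT IT PRODUCES:
{observed_problem}

WHAT I WANT INSTEAD:
{desired_outcome}

Rewrite the prompt so it produces the desired outcome. You may rewrite, add, or
delete anything. Explain the key changes in one short paragraph, then give the
final prompt."""

resp = client.chat.completions.create(
    model="gpt-4o",  # illustrative
    messages=[{"role": "user", "content": debug_prompt}],
)
print(resp.choices[0].message.content)  # the AI's diagnosis plus its rewritten prompt
```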
The prompts for your "LLM as a judge" evals function as a new form of PRD. They explicitly define the desired behavior, edge cases, and quality standards for your AI agent. Unlike static PRDs, these are living documents: derived from real user data, they constantly and automatically test whether the product meets its requirements.
The most effective way to build a powerful automation prompt is to interview a human expert, document their step-by-step process and decision criteria, and translate that knowledge directly into the AI's instructions. Don't invent; document and translate.