Instead of seeking a "magical system" for AI quality, the most effective starting point is a manual process called error analysis. This involves spending a few hours reading through ~100 random user interactions, taking simple notes on failures, and then categorizing those notes to identify the most common problems.
Generic evaluation metrics like "helpfulness" or "conciseness" are vague and untrustworthy. A better approach is to first perform manual error analysis to find recurring problems (e.g., "tour scheduling failures"). Then, build specific, targeted evaluations (evals) that directly measure the frequency of these concrete issues, making metrics meaningful.
Instead of using siloed note-taking apps, structure all your knowledge—code, writing, proposals, notes—into a single GitHub monorepo. This creates a unified, context-rich environment that any AI coding assistant can access. This approach avoids vendor lock-in and provides the AI with a comprehensive "second brain" to work from.
AI interactions often involve multiple steps (e.g., user prompt, tool calls, retrieval). When an error occurs, the entire chain can fail. The most efficient debugging heuristic is to analyze the sequence and stop at the very first mistake. Focusing on this "most upstream problem" addresses the root cause, as downstream failures are merely symptoms.
Developers often test AI systems with well-formed, correctly spelled questions. However, real users submit vague, typo-ridden, and ambiguous prompts. Directly analyzing these raw logs is the most crucial first step to understanding how your product fails in the real world and where to focus quality improvements.
When an LLM produces text with the wrong style, re-prompting is often ineffective. A superior technique is to use a tool that allows you to directly edit the model's output. This act of editing creates a perfect, in-context example for the next turn, teaching the LLM your preferred style much more effectively than descriptive instructions.
Fine-tuning an AI model is most effective when you use high-signal data. The best source for this is the set of difficult examples where your system consistently fails. The processes of error analysis and evaluation naturally curate this valuable dataset, making fine-tuning a logical and powerful next step after prompt engineering.
Do not blindly trust an LLM's evaluation scores. The biggest mistake is showing stakeholders metrics that don't match their perception of product quality. To build trust, first hand-label a sample of data with binary outcomes (good/bad), then compare the LLM judge's scores against these human labels to ensure agreement before deploying the eval.
