Don't rely solely on explicit feedback like thumbs up/down. Soft signals are powerful evaluation inputs: a user who repeatedly regenerates an answer, quickly abandons a session, or escalates to human support is telling you your AI is failing, even if they never say so explicitly.
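A minimal sketch of mining those soft signals from session telemetry; the `Event` schema, the event type names, and the 60-second abandonment threshold are all illustrative assumptions, not a real analytics API:

```python
from dataclasses import dataclass

# Hypothetical event schema -- adjust field names to match your own telemetry.
@dataclass
class Event:
    type: str                    # e.g. "regenerate", "escalate_to_human", "session_end"
    seconds_into_session: float

def soft_failure_signals(events: list[Event]) -> dict[str, bool]:
    """Flag implicit signs of failure in one session's event stream."""
    regenerations = sum(1 for e in events if e.type == "regenerate")
    escalated = any(e.type == "escalate_to_human" for e in events)
    # "Quick abandonment": the session ended less than a minute after it started.
    abandoned = any(
        e.type == "session_end" and e.seconds_into_session < 60 for e in events
    )
    return {
        "repeated_regeneration": regenerations >= 2,
        "escalated_to_human": escalated,
        "quick_abandonment": abandoned,
    }
```

Aggregated over many sessions, these flags become an implicit-feedback metric you can track alongside explicit ratings.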
When deploying AI tools, especially in sales, users exhibit no patience for mistakes. While a human making an error receives coaching and a second chance, an AI's single failure can cause users to abandon the tool permanently due to a complete loss of trust.
Don't treat evals as a mere checklist. Instead, use them as a creative tool to discover opportunities. A well-designed eval can reveal that a product is underperforming for a specific user segment, pointing directly to areas for high-impact improvement that a simple "vibe check" would miss.
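For instance, here is a sketch of slicing eval results by user segment; the result dict shape is an assumption for illustration, not a standard format:

```python
from collections import defaultdict

def pass_rate_by_segment(results: list[dict]) -> dict[str, float]:
    """Aggregate eval pass rates per user segment to surface weak spots.

    Each result is assumed to look like: {"segment": "enterprise", "passed": True}
    """
    passed = defaultdict(int)
    total = defaultdict(int)
    for r in results:
        total[r["segment"]] += 1
        passed[r["segment"]] += r["passed"]  # True counts as 1
    return {seg: passed[seg] / total[seg] for seg in total}

# A segment whose pass rate lags the rest is a concrete improvement
# opportunity that an aggregate score (or a "vibe check") averages away.
```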
AI models are trained to be agreeable, often providing uselessly positive feedback. To get real insights, you must explicitly prompt them to be rigorous and critical. Use phrases like "my standards of excellence are very high and you won't hurt my feelings" to bypass their people-pleasing nature.
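One way to operationalize this is to bake the instruction into a reusable prompt. A minimal sketch, using the common role/content chat-message convention and adapting the phrasing above:

```python
def critical_review_messages(draft: str) -> list[dict]:
    """Build a chat payload that asks the model for rigorous critique
    instead of its default praise. Framework-agnostic: returns
    OpenAI-style role/content dicts that most chat APIs accept.
    """
    return [
        {
            "role": "system",
            "content": (
                "You are a rigorous, critical reviewer. My standards of "
                "excellence are very high and you won't hurt my feelings. "
                "Do not praise the work; identify concrete weaknesses and "
                "rank them by severity."
            ),
        },
        {"role": "user", "content": f"Critique this draft:\n\n{draft}"},
    ]
```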
Generic evaluation metrics like "helpfulness" or "conciseness" are vague and untrustworthy. A better approach is to first perform manual error analysis to find recurring problems (e.g., "tour scheduling failures"), then build specific, targeted evals that directly measure how often those concrete issues occur. That grounding is what makes the metrics meaningful.
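Here is what such a targeted eval might look like for the hypothetical "tour scheduling" failure mode; the keyword heuristic is a deliberately crude stand-in for a few hand-written rules or a narrowly scoped LLM judge:

```python
def tour_scheduling_failure_rate(transcripts: list[str]) -> float:
    """Targeted eval: what fraction of conversations hit the specific
    'tour scheduling failure' mode surfaced by manual error analysis?
    """
    def failed(t: str) -> bool:
        t = t.lower()
        # Illustrative heuristic only -- replace with your real failure criteria.
        return "tour" in t and ("couldn't schedule" in t or "no availability" in t)

    failures = sum(1 for t in transcripts if failed(t))
    return failures / len(transcripts) if transcripts else 0.0
```

A number like "tour scheduling fails in 12% of conversations" is actionable in a way a 3.8/5 "helpfulness" score never is.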
Users mistakenly evaluate AI tools based on the quality of the first output. However, since 90% of the work is iterative, the superior tool is the one that handles a high volume of refinement prompts most effectively, not the one with the best initial result.
Counterintuitively, AI responses that are too fast can be perceived as low-quality or pre-scripted, harming user trust. There is a sweet spot for response time; a slight, human-like delay can signal that the AI is actually "thinking" and generating a considered answer.
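One way to hit that sweet spot is a response-time floor. This sketch assumes an async `generate` callable; the 1.2-second floor is a made-up starting point you would tune per product, ideally with A/B tests:

```python
import asyncio
import time

MIN_PERCEIVED_LATENCY = 1.2  # seconds -- illustrative, tune experimentally

async def respond_with_floor(generate, prompt: str) -> str:
    """Pad very fast responses so they don't read as canned or pre-scripted."""
    start = time.monotonic()
    answer = await generate(prompt)
    elapsed = time.monotonic() - start
    if elapsed < MIN_PERCEIVED_LATENCY:
        await asyncio.sleep(MIN_PERCEIVED_LATENCY - elapsed)
    return answer
```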
A key metric for AI coding agent performance is real-time sentiment analysis of user prompts. By measuring whether users say 'fantastic job' or 'this is not what I wanted,' teams get an immediate signal of the agent's comprehension and effectiveness, which is more telling than lagging indicators like bug counts.
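A toy version of that signal: a crude lexicon score plus a rolling average. A real pipeline would likely swap the lexicon for a small classifier or an LLM judge, but the shape of the metric is the same:

```python
POSITIVE = ("fantastic", "great job", "perfect", "exactly what i wanted")
NEGATIVE = ("not what i wanted", "this is wrong", "not working", "still broken")

def prompt_sentiment(user_prompt: str) -> int:
    """Crude lexicon score: +1 positive, -1 negative, 0 neutral."""
    p = user_prompt.lower()
    if any(k in p for k in NEGATIVE):
        return -1
    if any(k in p for k in POSITIVE):
        return 1
    return 0

def rolling_sentiment(prompts: list[str], window: int = 20) -> float:
    """Mean sentiment over the last `window` user prompts: a leading
    indicator of agent comprehension, unlike lagging bug counts."""
    recent = [prompt_sentiment(p) for p in prompts[-window:]]
    return sum(recent) / len(recent) if recent else 0.0
```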
When an AI tool fails, a common user mistake is to get stuck in a 'doom loop' by repeatedly using negative, low-context prompts like 'it's not working.' This is counterproductive. A better approach is to use a specific command or prompt that forces the AI to reflect and reset its approach.
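A sketch of a guardrail that detects the doom loop and substitutes a reflection prompt; the vague-complaint phrases and the reset wording are illustrative:

```python
RESET_PROMPT = (
    "Stop. Before writing more code, restate the goal in your own words, "
    "list what you have tried so far and why each attempt failed, then "
    "propose a different approach."
)

def next_prompt(history: list[str], user_input: str) -> str:
    """If the last few user turns are all vague complaints, inject a
    reflection prompt instead of another 'it's not working'."""
    vague = ("not working", "still broken", "doesn't work", "try again")
    recent = history[-2:] + [user_input]
    if len(recent) >= 3 and all(
        any(v in turn.lower() for v in vague) for turn in recent
    ):
        return RESET_PROMPT
    return user_input
```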
Standard AI models are often overly supportive. To get genuine, valuable feedback, explicitly instruct your AI to act as a critical thought partner. Use prompts like "push back on things" and "feel free to challenge me" to break the AI's default agreeableness and turn it into a true sparring partner.
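Unlike the one-off critique prompt above, this works best as a persistent persona. A starting point, with wording adapted from the phrases quoted here:

```python
# Hypothetical system prompt for a persistent "sparring partner" persona.
SPARRING_PARTNER_SYSTEM_PROMPT = """\
You are a critical thought partner, not a cheerleader.
Push back on things: if my reasoning has gaps, say so directly.
Feel free to challenge me. Before agreeing with any part of my
position, name the strongest counterargument to it.
"""
```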
Evals are useful for catching regressions, much like unit tests, but directly optimizing for an eval benchmark is misleading. Evals are, by definition, a lagging proxy for the real-world user experience; over-optimize for the metric and you end up gaming it while degrading the actual product.