To set realistic success metrics for new AI tools, Descript used its most popular pre-AI feature, "remove filler words," as the baseline. The team compared adoption and retention of new AI features against this known winner, giving a clear internal benchmark for what "good" looks like instead of guessing at targets.
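A minimal sketch of how that kind of baseline comparison could be expressed, assuming hypothetical retention figures; the feature names, the week-4 window, and the idea of setting the bar as a fraction of the baseline are illustrative, not Descript's actual methodology.

```python
# Hypothetical week-4 retention for the baseline feature and a new AI feature.
baseline = {"feature": "remove filler words", "week4_retention": 0.42}
candidate = {"feature": "AI eye contact", "week4_retention": 0.31}

# Assumed bar: a new feature "looks good" if it reaches at least 70% of the
# baseline feature's retention, rather than an arbitrary absolute target.
TARGET_FRACTION_OF_BASELINE = 0.70

ratio = candidate["week4_retention"] / baseline["week4_retention"]
verdict = "meets" if ratio >= TARGET_FRACTION_OF_BASELINE else "below"
print(f"{candidate['feature']}: {ratio:.0%} of baseline retention ({verdict} the bar)")
```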

Related Insights

To quantify the real-world impact of its AI tools, Block tracks a simple but powerful metric: "manual hours saved." This KPI combines qualitative and quantitative signals to provide a clear measure of ROI, with a target to save 25% of manual hours across the company.
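A minimal sketch of how a "manual hours saved" tally might be computed, assuming a hypothetical log of automated tasks with estimated manual durations; the task names, fields, and baseline figure are illustrative and not Block's actual implementation, and the 25% figure is simply the target quoted above.

```python
from dataclasses import dataclass

@dataclass
class AutomatedTask:
    name: str
    runs: int                  # times the AI handled the task this period
    est_manual_minutes: float  # estimated minutes a person would have spent per run

def manual_hours_saved(tasks: list[AutomatedTask]) -> float:
    """Sum the estimated manual time the AI tooling displaced, in hours."""
    return sum(t.runs * t.est_manual_minutes for t in tasks) / 60

# Illustrative numbers only.
tasks = [
    AutomatedTask("draft support replies", runs=1200, est_manual_minutes=6),
    AutomatedTask("summarize sales calls", runs=300, est_manual_minutes=15),
]
saved = manual_hours_saved(tasks)
baseline_manual_hours = 2000  # hypothetical total manual hours in the period
print(f"Saved {saved:.0f}h ({saved / baseline_manual_hours:.0%} of baseline, target 25%)")
```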

Unlike traditional software that optimizes for time-in-app, the most successful AI products will be measured by their ability to save users time. The new benchmark for value will be how much cognitive load or manual work is automated "behind the scenes," fundamentally changing the definition of a successful product.

Users mistakenly evaluate AI tools based on the quality of the first output. However, since 90% of the work is iterative, the superior tool is the one that handles a high volume of refinement prompts most effectively, not the one with the best initial result.

The current AI hype cycle can create misleading top-of-funnel metrics. The only companies that will survive are those demonstrating strong, above-benchmark user and revenue retention. Retention has become the ultimate litmus test for whether a product provides real, lasting value beyond the initial curiosity.

The primary bottleneck in improving AI is no longer data or compute, but the creation of 'evals'—tests that measure a model's capabilities. These evals act as product requirement documents (PRDs) for researchers, defining what success looks like and guiding the training process.

To ensure product quality, Fixer pitted its AI against 10 of its own human executive assistants on the same tasks. The company refused to launch features until the AI could consistently outperform the humans on accuracy, using its service business as a direct training and validation engine.

Open and click rates are ineffective for measuring AI-driven, two-way conversations. Instead, leaders should adopt new KPIs: outcome metrics (e.g., meetings booked), conversational quality (tracking an agent's 'I don't know' rate to measure trust), and, ultimately, customer lifetime value.
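A minimal sketch of one of those KPIs, the agent's 'I don't know' rate, assuming a hypothetical conversation log and simple phrase matching; in a real system the agent would more likely tag its own refusals at generation time rather than rely on regexes.

```python
import re

# Hypothetical conversation log: one dict per agent turn.
agent_turns = [
    {"conversation_id": "c1", "text": "I don't know the answer to that, let me connect you with a specialist."},
    {"conversation_id": "c1", "text": "Your meeting is booked for Tuesday at 10am."},
    {"conversation_id": "c2", "text": "Our starter plan is $49/month."},
]

# Assumed refusal phrasings; purely illustrative.
IDK_PATTERNS = re.compile(r"i don't know|i'm not sure|i can't answer", re.IGNORECASE)

def idk_rate(turns: list[dict]) -> float:
    """Share of agent turns where the agent declines to answer rather than guessing."""
    if not turns:
        return 0.0
    declined = sum(1 for t in turns if IDK_PATTERNS.search(t["text"]))
    return declined / len(turns)

print(f"'I don't know' rate: {idk_rate(agent_turns):.1%}")
```

Outcome metrics like meetings booked and downstream lifetime value would sit alongside a rate like this, not replace it.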

Founders can get objective performance feedback without waiting for a fundraising cycle. AI benchmarking tools can analyze routine documents like monthly investor updates or board packs, providing continuous, low-effort insight into how the company truly stacks up against the market.

Standardized AI benchmarks are saturated and becoming less relevant for real-world use cases. The true measure of a model's improvement is now found in custom, internal evaluations (evals) created by application-layer companies. A legal AI tool's gains on its own domain-specific evals, for example, are a more meaningful indicator of progress than a generic test score.

Instead of waiting for external reports, companies should develop their own AI model evaluations. By defining key tasks for specific roles and testing new models against them with standard prompts, businesses can create a relevant, internal benchmark.
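A minimal sketch of what such an internal benchmark could look like: a few role-specific tasks with standard prompts and pass checks, run against any model exposed as a `complete(prompt)` callable. The task definitions, keyword graders, and `call_new_model` wrapper are illustrative assumptions, not tied to any particular provider's API.

```python
from typing import Callable

# Standard prompts for a few role-specific tasks; the graders here are simple
# keyword checks, but could be rubric scoring or human review in practice.
EVAL_TASKS = [
    {
        "role": "support",
        "prompt": "A customer asks how to reset their password. Write a reply.",
        "passes": lambda out: "reset" in out.lower() and "password" in out.lower(),
    },
    {
        "role": "finance",
        "prompt": "Summarize this invoice: 3 seats at $20/month, billed annually.",
        "passes": lambda out: "720" in out,  # 3 * $20 * 12 months
    },
]

def run_eval(complete: Callable[[str], str]) -> dict[str, float]:
    """Run every task through the model and report a pass rate per role."""
    results: dict[str, list[bool]] = {}
    for task in EVAL_TASKS:
        output = complete(task["prompt"])
        results.setdefault(task["role"], []).append(task["passes"](output))
    return {role: sum(passed) / len(passed) for role, passed in results.items()}

# `complete` would wrap whichever model is being evaluated, e.g. (hypothetical):
# scores = run_eval(lambda prompt: call_new_model(prompt))
```

Running the same task set with the same prompts against each new model release turns "is this model better for us?" into a repeatable comparison rather than a judgment call.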