The team views its comprehensive SA-Co benchmark, with over 200,000 concepts, as a more lasting contribution than the SAM3 model itself. While models are quickly surpassed, a robust benchmark can guide and measure progress for the entire research community for years.

Related Insights

The key innovation was a data engine where AI models, fine-tuned on human verification data, took over mask verification and exhaustivity checks. This reduced the time to create a single training data point from over 2 minutes (human-only) to just 25 seconds, enabling massive scale.
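One way to picture that handoff is as a triage loop: a fine-tuned AI verifier scores each proposed annotation, auto-accepts the confident cases, and routes only the uncertain ones to humans. The sketch below is illustrative only; `ai_verify`, `human_verify`, the `Candidate` fields, and the thresholds are assumptions, not details of the SAM3 data engine.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Candidate:
    image_id: str
    phrase: str
    masks: list              # proposed segmentation masks for the phrase
    verifier_score: float = 0.0

def run_data_engine(
    candidates: List[Candidate],
    ai_verify: Callable[[Candidate], float],    # fine-tuned verifier: mask quality / exhaustivity score
    human_verify: Callable[[Candidate], bool],  # slow path: human annotator
    accept_threshold: float = 0.9,
    reject_threshold: float = 0.1,
) -> List[Candidate]:
    """Keep candidates the AI verifier is confident about; route the rest to humans."""
    accepted = []
    for cand in candidates:
        cand.verifier_score = ai_verify(cand)
        if cand.verifier_score >= accept_threshold:
            accepted.append(cand)                   # fast path: AI-only verification
        elif cand.verifier_score > reject_threshold:
            if human_verify(cand):                  # uncertain case: fall back to a human check
                accepted.append(cand)
        # confidently rejected candidates are dropped (or recycled as negatives)
    return accepted
```

The throughput gain comes from the fact that the expensive human call only fires on the uncertain middle band of verifier scores.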

Instead of one component doing everything, SAM3 first uses a specialized token to answer a simple question: "Is this concept in the image at all?" Only then does it proceed to localization. This simplifies the model's task, improving its ability to avoid hallucinating objects that aren't there.
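A minimal way to express this decoupling in code is a presence head whose output gates the per-query match scores, so localization queries cannot assert a concept that the global presence token has rejected. The PyTorch module below is a toy sketch of that idea; the class, layer sizes, and heads are invented for illustration and are not SAM3's actual decoder.

```python
import torch
import torch.nn as nn

class PresenceGatedDetector(nn.Module):
    """Toy decoder head: a global 'presence' token decides whether the concept
    exists in the image at all, separately from per-query localization."""
    def __init__(self, dim: int = 256, num_queries: int = 100):
        super().__init__()
        self.presence_token = nn.Parameter(torch.randn(1, 1, dim))
        self.object_queries = nn.Parameter(torch.randn(1, num_queries, dim))
        self.decoder = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.presence_head = nn.Linear(dim, 1)   # "is the concept in the image at all?"
        self.match_head = nn.Linear(dim, 1)      # "does this query match the concept?"
        self.box_head = nn.Linear(dim, 4)        # query -> box (cx, cy, w, h)

    def forward(self, image_feats: torch.Tensor):
        B = image_feats.shape[0]
        queries = torch.cat(
            [self.presence_token.expand(B, -1, -1),
             self.object_queries.expand(B, -1, -1)], dim=1)
        hs = self.decoder(queries, image_feats)
        presence_logit = self.presence_head(hs[:, :1])   # global recognition
        match_logit = self.match_head(hs[:, 1:])         # per-query matching
        boxes = self.box_head(hs[:, 1:]).sigmoid()       # localization only
        # Final score factorizes: P(detect) = P(present) * P(match | present),
        # so queries cannot "hallucinate" a concept the presence token rejects.
        scores = torch.sigmoid(presence_logit) * torch.sigmoid(match_logit)
        return scores, boxes
```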

The primary bottleneck in improving AI is no longer data or compute, but the creation of 'evals'—tests that measure a model's capabilities. These evals act as product requirement documents (PRDs) for researchers, defining what success looks like and guiding the training process.

Traditional AI benchmarks are seen as increasingly incremental and less interesting. The new frontier for evaluating a model's true capability lies in applied, complex tasks that mimic real-world interaction, such as building in Minecraft (MC Bench) or managing a simulated business (VendingBench), which are more revealing of raw intelligence.

Meta's chief AI scientist, Yann LeCun, is reportedly leaving to start a company focused on "world models"—AI that learns from video and spatial data to understand cause-and-effect. He argues the industry's focus on LLMs is a dead end and that his alternative approach will become dominant within five years.

To teach the model to recognize when a concept is *not* in an image, the team heavily annotated negative phrases. This massive volume of negative data was critical for building a robust recognition capability and preventing the model from falsely detecting objects that are not present.
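In training terms, those negative phrases become (image, phrase) pairs whose target presence label is zero. Below is a hedged sketch of such a loss term, assuming a hypothetical `presence_logits` output and an optional reweighting of negatives; the actual loss and weighting used for SAM3 are not specified here.

```python
import torch
import torch.nn.functional as F

def presence_loss(presence_logits: torch.Tensor,
                  phrase_is_present: torch.Tensor,
                  negative_weight: float = 1.0) -> torch.Tensor:
    """Binary presence loss over (image, phrase) pairs.

    presence_logits:   (B,) raw logits from the model's presence/recognition head
    phrase_is_present: (B,) 1.0 for positive phrases, 0.0 for hard-negative phrases
    """
    # Upweighting negatives is one way to exploit a large pool of annotated
    # negative phrases; the weighting here is an assumption for illustration.
    weights = torch.where(phrase_is_present > 0.5,
                          torch.ones_like(phrase_is_present),
                          torch.full_like(phrase_is_present, negative_weight))
    return F.binary_cross_entropy_with_logits(
        presence_logits, phrase_is_present, weight=weights)
```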

Instead of generic benchmarks, Superhuman tests its AI models against specific problem "dimensions" like deep search and date comprehension. It uses "canonical queries," including extreme edge cases from its CEO, to ensure high quality on tasks that matter most to demanding users.
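The same idea can be captured in a tiny eval harness that scores a model per dimension instead of producing one aggregate number. Everything below (the `CanonicalQuery` fields, the grader callables, the sample query) is hypothetical, meant only to show the shape of such a harness, not Superhuman's actual tooling.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class CanonicalQuery:
    dimension: str                  # e.g. "deep_search", "date_comprehension"
    prompt: str
    check: Callable[[str], bool]    # grader: does the model's output pass?

def run_eval(model: Callable[[str], str],
             queries: List[CanonicalQuery]) -> Dict[str, float]:
    """Score a model per problem dimension rather than with one aggregate number."""
    passed: Dict[str, int] = {}
    total: Dict[str, int] = {}
    for q in queries:
        total[q.dimension] = total.get(q.dimension, 0) + 1
        if q.check(model(q.prompt)):
            passed[q.dimension] = passed.get(q.dimension, 0) + 1
    return {dim: passed.get(dim, 0) / n for dim, n in total.items()}

# Example canonical query, including a leap-year edge case
queries = [
    CanonicalQuery("date_comprehension",
                   "What is the Friday after 2024-02-28?",
                   check=lambda out: "2024-03-01" in out),
]
```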

OpenAI's new GDPval benchmark evaluates models on complex, real-world knowledge work tasks, not abstract IQ tests. This pivot signifies that the true measure of AI progress is now its ability to perform economically valuable human jobs, making performance metrics directly comparable to professional output.

While SAM3 can act as a "tool" for LLMs, researchers argue that fundamental vision tasks like counting fingers should be a native, immediate capability of a frontier model, akin to human System 1 thinking. Relying on tool calls for simple perception indicates a critical missing capability in the core model.

Standardized AI benchmarks are saturated and becoming less relevant for real-world use cases. The true measure of a model's improvement is now found in custom, internal evaluations (evals) created by application-layer companies. A model's gains on a legal AI tool's internal evals, for example, say more about real progress than a higher generic test score.