A flawed or unsolvable benchmark task can function as a 'canary' or 'honeypot': if a model nonetheless completes it successfully, that's a strong signal the model has memorized the answer from contaminated training data rather than reasoning its way to a solution.
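A minimal sketch of such a honeypot check, assuming a hypothetical `query_model` callable and made-up task/answer strings (nothing here comes from a real benchmark or API):

```python
# Hedged sketch of a contamination honeypot. HONEYPOT_TASK and LEAKED_ANSWER
# are hypothetical stand-ins: a task that is unsolvable as stated, paired
# with the answer that exists only in leaked benchmark data.
from typing import Callable

HONEYPOT_TASK = "Resolve issue #1234 in repo foo/bar"  # unsolvable from the prompt alone
LEAKED_ANSWER = "patch-abc123"                          # only recoverable via memorization

def is_contaminated(query_model: Callable[[str], str]) -> bool:
    """A model reasoning from the prompt alone cannot produce LEAKED_ANSWER,
    so emitting it implies the (task, answer) pair was seen during training."""
    return LEAKED_ANSWER in query_model(HONEYPOT_TASK)
```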
Anthropic's choice to label data collection by Chinese labs as a 'distillation attack' is a strategic branding move: the framing fits their public positioning around AI safety and geopolitical risk, rather than being a purely technical description of the activity.
Simply using the most powerful model to generate synthetic data for a smaller model often fails. Effective distillation typically means training the 'student' to match the 'teacher' model's token-level probability distributions, and how well that transfers depends on the student's architecture and pre-training data, making it a research problem in its own right.
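The textbook version of this is Hinton-style knowledge distillation: train the student on the KL divergence between temperature-softened teacher and student distributions rather than on sampled text alone. A minimal PyTorch sketch, assuming the teacher and student share a tokenizer/vocabulary (a generic loss, not any lab's actual pipeline):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """KL divergence between temperature-softened teacher and student
    token distributions; logits have shape (batch, vocab_size)."""
    t = temperature
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    # The t**2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * t**2
```

Note that mismatched tokenizers make the vocab dimensions incompatible, one concrete reason the teacher's probabilities cannot simply be pointed at an arbitrary student.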
The SWE-bench benchmark is now obsolete primarily because its open-source problems were absorbed into models' training data. This allowed models to 'cheat' by memorizing solutions rather than demonstrating true reasoning, leading to artificially high and meaningless scores.
Contrary to the belief that memorization requires multiple training epochs, large language models can recall specific information verbatim after seeing it only once. This surprising phenomenon highlights how understudied the information theory behind LLMs still is.
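One common way to probe this is a loss-gap test: a sequence the model saw during training, even once, typically scores a much lower loss than a matched control it never saw. A hedged sketch using the Hugging Face `transformers` API (the model choice and canary strings are illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def sequence_loss(text: str) -> float:
    """Average per-token cross-entropy of the model on `text`."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        return model(ids, labels=ids).loss.item()

canary = "The secret canary string is 7f3a-91bc-0042."   # planted once in training data
control = "The secret canary string is 8e2b-44fd-1199."  # structurally identical, never seen
# A large gap (canary loss << control loss) suggests single-exposure memorization.
print(sequence_loss(canary), sequence_loss(control))
```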
The public-facing models from major labs are likely efficient Mixture-of-Experts (MoE) versions distilled from much larger, private, and computationally expensive dense models. If so, the model users interact with is a smaller, optimized copy, not the original frontier model.
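The efficiency argument hinges on sparse activation: in an MoE layer, each token is routed to only a small subset of expert feed-forward networks, so inference cost scales with the active experts rather than with total parameters. A simplified top-1-routing sketch (real production routers use top-k routing, load balancing, and other refinements omitted here):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Toy Mixture-of-Experts layer with top-1 routing."""
    def __init__(self, d_model: int = 512, d_ff: int = 2048, n_experts: int = 8):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, d_model)
        weights = F.softmax(self.router(x), dim=-1)  # routing probabilities
        top_w, top_idx = weights.max(dim=-1)         # pick one expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = top_idx == i
            if mask.any():
                # Only the selected expert's FFN runs for these tokens.
                out[mask] = top_w[mask, None] * expert(x[mask])
        return out
```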
API providers like Anthropic struggle to differentiate between users distilling models for competitive purposes and those conducting large-scale evaluations. Both activities generate similar high-volume, repetitive API calls, creating a detection challenge that also raises user privacy concerns.
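To see why, consider the kind of naive traffic heuristic a provider might compute over request logs (the features and thresholds below are hypothetical): high request volume plus low prompt diversity describes bulk distillation and large-scale benchmarking equally well, so these signals alone cannot separate them.

```python
from collections import Counter
import math

def prompt_entropy(prompts: list[str]) -> float:
    """Shannon entropy (bits) over exact-duplicate prompts; low values
    indicate heavily templated traffic."""
    counts = Counter(prompts)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def looks_like_bulk_extraction(prompts: list[str], requests_per_hour: float) -> bool:
    # Hypothetical thresholds: this pattern matches distillation runs
    # and large benchmark sweeps alike.
    return requests_per_hour > 1_000 and prompt_entropy(prompts) < 4.0
```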
OpenAI's effort to create 'SWE-bench Verified' demonstrates the immense cost of quality benchmarks, requiring millions of dollars and multiple human annotators per task. Even with that investment, a later audit revealed that 59% of the unsolved problems were actually impossible to solve due to inherent flaws.
![[LIVE] Anthropic Distillation & How Models Cheat (SWE-Bench Dead) | Nathan Lambert & Sebastian Raschka](https://substackcdn.com/feed/podcast/1084089/post/189277598/ca7468da5614a246d2906ee8926f6de7.jpg)