The next major advance for AI in software development is not just completing tasks, but deeply understanding entire codebases. This capability aims to "mind meld" the human with the AI, enabling them to collaboratively tackle problems that neither could solve alone.
When models achieve suspiciously high scores, it raises questions about benchmark integrity. Intentionally including impossible problems can serve as a check: a model that claims to "solve" them signals contamination or gaming, while a model that recognizes the request as unsolvable and refuses demonstrates a skill crucial for real-world reliability and safety.
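A minimal sketch of how such a check might sit in an evaluation harness, assuming curators mark impossible tasks with a `solvable` flag and models signal refusal with a fixed token; these names and the refusal protocol are illustrative, not details from the episode:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    prompt: str
    solvable: bool                 # curators mark deliberately impossible tasks as False
    check: Callable[[str], bool]   # verifies an answer/patch for solvable tasks

def refused(answer: str) -> bool:
    """Crude refusal detector; a real harness would use a stricter protocol."""
    return answer.strip().upper().startswith("CANNOT_SOLVE")

def score(tasks: list[Task], model: Callable[[str], str]) -> dict:
    solved = refused_impossible = suspicious = 0
    for task in tasks:
        answer = model(task.prompt)
        if task.solvable:
            solved += task.check(answer)
        elif refused(answer):
            refused_impossible += 1   # correct behavior: recognized an unsolvable request
        else:
            suspicious += 1           # "solved" an impossible task: integrity red flag
    return {"solved": solved,
            "refused_impossible": refused_impossible,
            "suspicious": suspicious}
```

Reporting the `suspicious` count separately keeps genuine task-solving ability from being conflated with benchmark gaming.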
Current benchmarks like SWE-bench test isolated, independent tasks. The new Code Clash benchmark aims to evaluate long-horizon development by having AI models compete in a tournament, each continuously improving its own codebase in response to competitive pressure from the others.
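Code Clash's actual rules are not spelled out here, but a round-robin loop conveys the flavor of long-horizon, competition-driven evaluation. The `run_match` stub, the revise-then-compete ordering, and the scoring below are assumptions for illustration only:

```python
import itertools
from typing import Callable, Dict, List, Tuple

Codebase = str  # simplified: a codebase is just source text in this sketch

def run_match(code_a: Codebase, code_b: Codebase) -> int:
    """Placeholder judge: +1 if A wins, -1 if B wins, 0 for a draw."""
    return 0

def tournament(agents: Dict[str, Callable[[Codebase, List[Tuple[str, int]]], Codebase]],
               rounds: int = 5) -> Dict[str, int]:
    codebases = {name: "" for name in agents}
    scores = {name: 0 for name in agents}
    history: Dict[str, List[Tuple[str, int]]] = {name: [] for name in agents}
    for _ in range(rounds):
        # Each agent revises its own codebase in light of its results so far.
        for name, agent in agents.items():
            codebases[name] = agent(codebases[name], history[name])
        # Round-robin: every pair of current codebases plays a head-to-head match.
        for a, b in itertools.combinations(agents, 2):
            result = run_match(codebases[a], codebases[b])
            scores[a] += result
            scores[b] -= result
            history[a].append((b, result))
            history[b].append((a, -result))
    return scores
```

The key contrast with SWE-bench-style evaluation is that each round's revision depends on what happened in earlier rounds, so performance reflects sustained development rather than one-shot patching.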
Early benchmark improvements focused on adding more languages and repositories. Now, the cutting edge involves creating more difficult evaluation splits through sophisticated curation techniques. Researchers must justify why their new benchmark is qualitatively harder, not just broader, than existing ones.
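One common heuristic for building a harder split is to keep only tasks that strong baseline models fail; the sketch below assumes per-model result tables keyed by task id and is illustrative, not any specific benchmark's curation pipeline:

```python
from typing import Dict, List

def harder_split(tasks: List[dict],
                 baseline_results: Dict[str, Dict[str, bool]],
                 max_solvers: int = 0) -> List[dict]:
    """
    Keep only tasks solved by at most `max_solvers` baseline models.
    `baseline_results[model][task_id]` is True if that model solved the task.
    """
    kept = []
    for task in tasks:
        solvers = sum(results.get(task["id"], False)
                      for results in baseline_results.values())
        if solvers <= max_solvers:
            kept.append(task)
    return kept
```

Filtering by baseline failure rate makes a split qualitatively harder rather than merely larger, which is exactly the justification researchers are now expected to provide.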
One vision pushes for long-running, autonomous AI agents that complete complex goals with minimal human input. The counter-argument, emphasized by teams like Cognition, is that real-world value comes from fast, interactive back-and-forth between humans and AI, as tasks are often underspecified.
![[State of Code Evals] After SWE-bench, Code Clash & SOTA Coding Benchmarks recap — John Yang](https://assets.flightcast.com/V2Uploads/nvaja2542wefzb8rjg5f519m/01K4D8FB4MNA071BM5ZDSMH34N/square.jpg)