
[State of Code Evals] After SWE-bench, Code Clash & SOTA Coding Benchmarks recap — John Yang
Latent Space · AI Summary
→ WHAT IT COVERS
John Yang discusses how SWE-bench has evolved since its October 2023 launch, including multilingual extensions across nine languages, the new Code Clash tournament framework for evaluating long-horizon development, and open challenges in coding evaluation methodology.

→ KEY INSIGHTS
- **SWE-bench Extensions:** The benchmark expanded beyond its Django-heavy Python origins to cover nine languages (including JavaScript, Rust, Java, C, and Ruby) and 40 repositories, and added multimodal tasks. Independent teams built variants such as SWE-bench Pro without the original authors' involvement, a sign of how widely the benchmark has been adopted.
- **Code Clash Framework:** This new evaluation method replaces unit tests with programming tournaments. Two or more language models each maintain a separate codebase, improve it every round, and then compete in an arena. Because each round's changes carry into the next, models must show long-horizon development skill with consequential, dependent changes rather than isolated task completion.
- **Benchmark Diversification:** New domain-specific benchmarks have emerged, including SWE-fficiency for optimizing code without changing its behavior, SciCode for scientific computing, SEC-bench for security, and SRE-bench for operations. Each targets a coding domain beyond general software engineering, which allows more targeted model evaluation and development.
- **Academic Data Limitations:** Academic researchers lack the user interaction data that companies like Cognition and Cursor collect through product usage. The two alternatives, building a compelling product or creating a realistic user simulator, are both hard, so academic research on human-AI collaboration lags behind industry.

→ NOTABLE MOMENT
Yang reveals that when Cognition released Devin with strong SWE-bench results, he received only two weeks' advance notice via email.
The release sparked an industry arms race in coding benchmarks, transforming SWE-bench from a little-used academic project into a central evaluation standard.

💼 SPONSORS
None detected

🏷️ Code Benchmarks, AI Evaluation, Software Engineering, Long-Horizon Tasks
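
The round structure described under Code Clash (each model edits its own persistent codebase, then the codebases compete in an arena) can be illustrated with a toy loop. This is a minimal sketch under stated assumptions, not the Code Clash implementation: `Player`, `run_tournament`, `line_count_arena`, and the two edit policies are all hypothetical names, the "models" are plain functions, and the arena simply rewards the longer `bot.py`.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Codebase = Dict[str, str]  # file path -> file contents


@dataclass
class Player:
    """One competitor: a name, an edit policy, and its own persistent codebase."""
    name: str
    # Hypothetical stand-in for a language model: (codebase, past winners) -> new codebase.
    edit: Callable[[Codebase, List[str]], Codebase]
    codebase: Codebase = field(default_factory=dict)
    wins: int = 0


def run_tournament(players: List[Player],
                   arena: Callable[[List[Codebase]], int],
                   rounds: int) -> List[str]:
    """Alternate an edit phase and a competition phase; return the winner of each round."""
    history: List[str] = []
    for _ in range(rounds):
        # Edit phase: changes persist, so every round builds on the previous ones.
        for player in players:
            player.codebase = player.edit(player.codebase, history)
        # Competition phase: the arena returns the index of the winning codebase.
        winner = players[arena([p.codebase for p in players])]
        winner.wins += 1
        history.append(winner.name)
    return history


def line_count_arena(codebases: List[Codebase]) -> int:
    """Toy arena (an assumption): the longest bot.py wins; ties go to the earlier player."""
    scores = [len(cb.get("bot.py", "").splitlines()) for cb in codebases]
    return scores.index(max(scores))


def steady(codebase: Codebase, history: List[str]) -> Codebase:
    """Toy policy: add one line every round."""
    return {**codebase, "bot.py": codebase.get("bot.py", "") + "x\n"}


def reactive(codebase: Codebase, history: List[str]) -> Codebase:
    """Toy policy: add three lines, but only after losing the previous round."""
    if history and history[-1] != "reactive":
        return {**codebase, "bot.py": codebase.get("bot.py", "") + "y\n" * 3}
    return codebase


if __name__ == "__main__":
    players = [Player("steady", steady), Player("reactive", reactive)]
    print(run_tournament(players, line_count_arena, rounds=4))
    print({p.name: p.wins for p in players})
```

The point of the sketch is the dependency between rounds: a player's result in round four depends on every edit it made earlier, which is the property the summary contrasts with isolated, unit-tested tasks.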