
John Yang

1 episode · 1 podcast

We have 1 summarized appearance for John Yang so far. Browse all podcasts to discover more episodes.

Featured On 1 Podcast

All Appearances

1 episode

AI Summary

→ WHAT IT COVERS

John Yang discusses the evolution of the SWE-bench coding benchmark since its October 2023 launch, including multilingual extensions across nine languages, the new Code Clash tournament framework for evaluating long-horizon development, and emerging challenges in coding-evaluation methodology.

→ KEY INSIGHTS

- **SWE-bench extensions:** The benchmark has grown beyond its Django-heavy Python origins to multilingual support across nine languages (including JavaScript, Rust, Java, C, and Ruby) and 40 repositories, plus a multimodal variant. Independent teams have created spin-offs such as SWE-bench Pro without involvement from the original authors, a sign of how widely the benchmark has been adopted.
- **Code Clash framework:** This new evaluation method replaces unit tests with programming tournaments in which two or more language models each maintain their own codebase, improving it round by round before the codebases compete in an arena (see the sketch after this summary). Models must demonstrate long-horizon development skill, making consequential changes that build on earlier ones, rather than completing isolated tasks.
- **Benchmark diversification:** New domain-specific benchmarks include SWE-ficiency for optimizing code without changing behavior, SciCode for scientific computing, SEC-bench for security, and SRE-bench for operations. Each targets a coding domain beyond general software engineering, enabling more focused model evaluation and development.
- **Academic data limitations:** Academic researchers lack access to the user-interaction data that companies such as Cognition and Cursor collect naturally through product usage. Building a compelling product and building a realistic user simulator are both hard, which limits academic progress on human-AI collaboration research relative to industry.

→ NOTABLE MOMENT

Yang reveals that when Cognition released Devin with strong SWE-bench results, he received only two weeks' advance notice via email. The release sparked an industry arms race in coding benchmarks, transforming SWE-bench from a little-used academic project into a central evaluation standard.

💼 SPONSORS None detected

🏷️ Code Benchmarks, AI Evaluation, Software Engineering, Long-Horizon Tasks
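To make the tournament idea concrete, here is a minimal toy sketch of a round-based loop in Python. It only illustrates the structure described in the summary (each model edits its own codebase each round, then the codebases meet in an arena); it is not the actual Code Clash implementation, and every name (`Codebase`, `run_agent`, `play_arena_match`) is a hypothetical placeholder.

```python
# Toy sketch of a round-based coding tournament in the spirit of Code Clash.
# All names and the "arena" rule are hypothetical placeholders, not the real framework.
from dataclasses import dataclass, field


@dataclass
class Codebase:
    """Each model maintains its own codebase across rounds."""
    owner: str
    files: dict = field(default_factory=dict)
    score: int = 0


def run_agent(codebase: Codebase, round_num: int) -> None:
    """Placeholder for a model making long-horizon edits to its own codebase."""
    codebase.files[f"round_{round_num}.py"] = f"# edits made by {codebase.owner}"


def play_arena_match(a: Codebase, b: Codebase) -> Codebase:
    """Placeholder arena: the two codebases compete and the winner is scored."""
    winner = a if len(a.files) >= len(b.files) else b  # stand-in for a real contest
    winner.score += 1
    return winner


def tournament(num_rounds: int = 3) -> None:
    model_a, model_b = Codebase("model_a"), Codebase("model_b")
    for r in range(1, num_rounds + 1):
        # Each model first improves its own codebase...
        run_agent(model_a, r)
        run_agent(model_b, r)
        # ...then the codebases are pitted against each other in an arena.
        winner = play_arena_match(model_a, model_b)
        print(f"Round {r}: {winner.owner} wins (score {winner.score})")


if __name__ == "__main__":
    tournament()
```

The point of the structure, as described in the episode, is that a model's changes in one round carry over to the next, so evaluation rewards sustained, dependent development rather than one-off task completion.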
