Tools mentioned by John Yang
Software and services John Yang has mentioned across podcast appearances.
SWE-bench
“John Yang discusses how SWE-bench has evolved since its October 2023 launch, including multilingual extensions across nine languages, the new CodeClash benchmark for long-horizon development, and emerging evaluation approaches for autonomous coding agents.”
Devin
by Cognition
“Yang reveals that Cognition contacted him just two weeks before the Devin launch to share strong SWE-bench results. The subsequent public release triggered an industry arms race in autonomous coding that transformed the benchmark from rarely used to widely adopted.”
CodeClash
“A new benchmark that evaluates long-horizon development: models maintain separate codebases that compete in programming tournaments over multiple rounds, testing iterative improvement and consequential changes rather than the isolated task completion typical of unit-test approaches.”
SciCode
“New domain-specific benchmarks have emerged, including SWE-fficiency for code optimization without behavior changes, SciCode for scientific computing, SEC-bench for security, and SRE-bench for operations.”
SWE-fficiency
“New domain-specific benchmarks have emerged, including SWE-fficiency for code optimization without behavior changes, SciCode for scientific computing, SEC-bench for security, and SRE-bench for operations.”
SRE-bench
“New domain-specific benchmarks have emerged, including SWE-fficiency for code optimization without behavior changes, SciCode for scientific computing, SEC-bench for security, and SRE-bench for operations.”
SEC-bench
“New domain-specific benchmarks have emerged, including SWE-fficiency for code optimization without behavior changes, SciCode for scientific computing, SEC-bench for security, and SRE-bench for operations.”
Tau-bench
“Current approaches like Tau-bench and Vending-bench sample single paths and lack realism, creating a need for better human-AI interaction data, gathered either through compelling products that generate real usage patterns or through sophisticated simulators that go beyond simple prompting.”
Vending-bench
“Current approaches like Tau-bench and Vending-bench sample single paths and lack realism, creating a need for better human-AI interaction data, gathered either through compelling products that generate real usage patterns or through sophisticated simulators that go beyond simple prompting.”
SWE-bench Pro
“Independent teams created variants like SWE-bench Pro without the original authors' involvement, a sign of the benchmark's adoption.”