Tools mentioned by John Yang

Software and services John Yang has mentioned across podcast appearances.

SWE-bench
“John Yang discusses SWE-bench evolution since its October 2023 launch, including multilingual extensions across nine languages, the new Code Clash benchmark for long-horizon development, and emerging evaluation approaches for autonomous coding agents.”
Mentioned on: Latent Space · [State of Code Evals] After SWE-bench, Code Cla...
Devin
by Cognition
“Yang reveals Cognition contacted him just two weeks before their Devin launch announcing strong SWE-bench results, with the subsequent public release triggering an industry arms race in autonomous coding that transformed the benchmark from rarely used to widely adopted.”
Mentioned on: Latent Space · [State of Code Evals] After SWE-bench, Code Cla...
Code Clash
“New benchmark evaluates long-horizon development by having models maintain separate codebases that compete in programming tournaments across multiple rounds, testing iterative improvement and consequential changes rather than isolated task completion typical of unit test approaches.”
Mentioned on: Latent Space · [State of Code Evals] After SWE-bench, Code Cla...
SciCode
“New domain-specific benchmarks emerged including SWE-fficiency for code optimization without behavior changes, SciCode for scientific computing, SEC-bench for security, and SRE-bench for operations.”
Mentioned on: Latent Space · [State of Code Evals] After SWE-bench, Code Cla...
SWE-fficiency
“New domain-specific benchmarks emerged including SWE-fficiency for code optimization without behavior changes, SciCode for scientific computing, SEC-bench for security, and SRE-bench for operations.”
Mentioned on: Latent Space · [State of Code Evals] After SWE-bench, Code Cla...
SRE-bench
“New domain-specific benchmarks emerged including SWE-fficiency for code optimization without behavior changes, SciCode for scientific computing, SEC-bench for security, and SRE-bench for operations.”
Mentioned on: Latent Space · [State of Code Evals] After SWE-bench, Code Cla...
SEC-bench
“New domain-specific benchmarks emerged including SWE-fficiency for code optimization without behavior changes, SciCode for scientific computing, SEC-bench for security, and SRE-bench for operations.”
Mentioned on: Latent Space · [State of Code Evals] After SWE-bench, Code Cla...
Tau-bench
“Current approaches like Tau-bench and Vending-bench sample single paths and lack realism, creating a need for better human-AI interaction data, either through compelling products that generate real usage patterns or through sophisticated simulators beyond simple prompting.”
Mentioned on: Latent Space · [State of Code Evals] After SWE-bench, Code Cla...
Vending-bench
“Current approaches like Tau-bench and Vending-bench sample single paths and lack realism, creating a need for better human-AI interaction data, either through compelling products that generate real usage patterns or through sophisticated simulators beyond simple prompting.”
Mentioned on: Latent Space · [State of Code Evals] After SWE-bench, Code Cla...
SWE-Bench Pro
“Independent teams created variants like SWE-bench Pro without original author involvement, showing benchmark adoption.”
Mentioned on: Latent Space · [State of Code Evals] After SWE-bench, Code Cla...
