New top score on ARC-AGI-2-pub (29.4%) - Jeremy Berman
Episode: 68 min
Read time: 3 min
AI-Generated Summary
Key Takeaways
- Natural language over Python for ARC-AGI-2: Switching from Python programs to plain English descriptions of transformation rules dramatically improves performance on ARC-AGI-2 because every task can be described in five to ten bullet points. Python becomes brittle and verbose for compositional grid tasks, while natural language lets the model's inductive bias express itself fully, yielding higher accuracy at roughly $30 per task on v2 versus $8 on v1.
- Breadth over depth for thinking models: On ARC-AGI-1, iterative revision loops were critical because models lacked internal reasoning. On ARC-AGI-2, RL-trained thinking models like Grok 4 perform deep revision internally, so the optimal strategy shifts toward maximizing entropy and breadth of initial generation rather than deep iterative refinement. Artificially increasing entropy in prompts consistently outperformed narrow, constrained prompting strategies.
- Model selection is domain-specific and spiky: ARC leaderboard performance varies dramatically by model in ways other benchmarks do not. Grok 4 outperforms GPT-class models on ARC-AGI-2 grid reasoning, while Sonnet 3.5 remains superior for code generation tasks. Testing each model directly on the target domain rather than relying on general leaderboard rankings is necessary to identify the right tool for a specific problem type.
- Reasoning as the meta-skill for AGI: Current LLMs acquire domain-specific reasoning circuits — math reasoning stays in math weights, science reasoning in science weights — with limited cross-domain transfer. The core AGI gap is not skill acquisition but the meta-skill of creating new skills. Aligning models purely toward general reasoning through RL, before layering domain knowledge, is the proposed path toward a foundation for general intelligence.
- Knowledge trees versus knowledge webs: Pretraining treats all knowledge as an associative web of embeddings without guaranteed causal structure. Reinforcement learning with verifiable rewards functions as a pruning mechanism, replacing web-like associations with deductive trees where each node is causally consistent with its ancestors. The hypothesis is that models with weight configurations reflecting actual deductive structure will generalize to novel problems, while web-based models cannot.
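The web-versus-tree distinction in the last takeaway can be made concrete with a toy check (the names and structures below are illustrative, not from the episode): an associative web tolerates cycles, while a deductive tree requires every claim to trace back to an axiom through its premises.

```python
# Toy illustration (not from the episode): associative "web" knowledge
# allows cycles, while "tree" knowledge requires every claim to terminate
# at an axiom when you follow its chain of premises.

web = {"force": "mass", "mass": "energy", "energy": "force"}  # cyclic associations

tree = {            # claim -> the premise it is derived from (None = axiom)
    "axiom": None,
    "lemma": "axiom",
    "theorem": "lemma",
}

def grounded(claim, knowledge):
    """True iff following premises from `claim` terminates at an axiom."""
    seen = set()
    while claim is not None:
        if claim in seen or claim not in knowledge:
            return False  # a cycle or a dangling premise: mere association
        seen.add(claim)
        claim = knowledge[claim]
    return True
```

On this toy, `grounded("theorem", tree)` succeeds while every claim in the cyclic web fails, which is the property the episode's hypothesis attributes to deductively structured weights.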
What It Covers
Jeremy Berman, research scientist at Reflection AI, explains how he reached 29.4% on the ARC-AGI-2 public leaderboard with an evolutionary algorithm that generates and refines natural-language descriptions of transformation rules rather than Python code. He then discusses why reasoning is the meta-skill required for AGI and the fundamental gap between knowledge webs and deductive knowledge trees.
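Berman's actual system has an LLM propose and revise English rule descriptions and score them against the task's training pairs. As a rough, runnable sketch of the evolutionary loop itself (all names hypothetical), a "rule" here is just a callable on grids standing in for a natural-language description:

```python
import random

# Sketch of the evolutionary loop described above (names hypothetical).
# In the real system candidates are English rule descriptions proposed and
# applied by an LLM; here a "rule" is a plain callable so the loop runs.

def fitness(rule, pairs):
    """Negative total absolute error of the rule across all training cells."""
    return -sum(abs(a - b) for x, y in pairs for a, b in zip(rule(x), y))

def shift(rule, d):
    """Stand-in 'mutation': compose the rule with adding d to every cell."""
    return lambda grid: [c + d for c in rule(grid)]

def evolve(pairs, pop=8, keep=3, gens=6, seed=0):
    rng = random.Random(seed)
    # Broad, high-entropy first wave of candidate rules.
    population = [shift(lambda g: g, rng.randint(-4, 4)) for _ in range(pop)]
    for _ in range(gens):
        population.sort(key=lambda r: fitness(r, pairs), reverse=True)
        parents = population[:keep]
        if fitness(parents[0], pairs) == 0:  # a rule explains every example
            return parents[0]
        # Keep the parents and branch each into small revisions.
        population = parents + [shift(p, d) for p in parents for d in (-1, 1)]
    return max(population, key=lambda r: fitness(r, pairs))

# Toy task: every cell is incremented by one.
train = [([1, 2, 3], [2, 3, 4]), ([0, 5], [1, 6])]
best = evolve(train)
```

The broad initial population plus small per-generation revisions mirrors the breadth-first strategy from the takeaways: diversity comes from generation, and only light refinement happens per candidate.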
Key Questions Answered
- Catastrophic forgetting blocks continual learning more than compute does: The fundamental barrier to adaptive, continuously learning AI is not computational cost but catastrophic forgetting — fine-tuning on new data drifts weights away from previously correct solutions. Proposed directions include freezing expert layers, composable model architectures analogous to Docker's immutable layers, and selective data mixtures during fine-tuning. Solving this problem is framed as the next S-curve after the current RL scaling wave.
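The Docker analogy above can be sketched with Python's `collections.ChainMap` (a loose illustration, not the episode's implementation): the base weights are never written, and each new skill is an overlay that shadows only the parameters it changes, so learning one skill cannot overwrite another.

```python
from collections import ChainMap

# Loose illustration of the "immutable layers" idea (not from the episode's
# code): the frozen base is never mutated, so fine-tuning a new skill
# cannot drift the parameters an earlier skill relies on.

base = {"w1": 0.5, "w2": -1.2, "w3": 0.8}   # frozen pretrained "weights"

math_delta = {"w2": -0.9}                    # overlay learned for math
code_delta = {"w3": 1.1}                     # overlay learned for code

math_model = ChainMap(math_delta, base)      # reads fall through to base
code_model = ChainMap(code_delta, base)
```

Each composed view sees its own delta plus the untouched base, which is the "no forgetting by construction" property the proposed architectures aim for.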
Notable Moment
Berman argues that heavy pretraining may actively slow reasoning development rather than accelerate it. His analogy contrasts consultants who know terminology but cannot derive conclusions with Feynman-style thinkers who deduce everything from first principles — and frames RL post-training as the process of converting one into the other.