New top score on ARC-AGI-2-pub (29.4%) - Jeremy Berman
Episode: 68 min
Read time: 3 min
Topics: Startups, Fundraising & VC, Artificial Intelligence
AI-Generated Summary
Key Takeaways
- ✓ Natural language over Python for ARC-AGI-2: Switching from Python programs to plain English descriptions of transformation rules dramatically improves performance on ARC-AGI-2, because every task can be described in five to ten bullet points. Python becomes brittle and verbose for compositional grid tasks, while natural language lets the model's inductive bias express itself fully. The approach yields higher accuracy, at a cost of roughly $30 per task on v2 versus $8 on v1.
- ✓ Breadth over depth for thinking models: On ARC-AGI-1, iterative revision loops were critical because models lacked internal reasoning. On ARC-AGI-2, RL-trained thinking models like Grok 4 perform deep revision internally, so the optimal strategy shifts toward maximizing the entropy and breadth of the initial generation rather than deep iterative refinement. Artificially increasing entropy in prompts consistently outperformed narrow, constrained prompting strategies.
- ✓ Model selection is domain-specific and spiky: ARC leaderboard performance varies dramatically by model in ways other benchmarks do not. Grok 4 outperforms GPT-class models on ARC-AGI-2 grid reasoning, while Sonnet 3.5 remains superior for code generation tasks. Identifying the right tool for a specific problem type requires testing each model directly on the target domain rather than relying on general leaderboard rankings.
- ✓ Reasoning as the meta-skill for AGI: Current LLMs acquire domain-specific reasoning circuits (math reasoning stays in math weights, science reasoning in science weights) with limited cross-domain transfer. The core AGI gap is not skill acquisition but the meta-skill of creating new skills. Aligning models purely toward general reasoning through RL, before layering domain knowledge, is the proposed path toward a foundation for general intelligence.
- ✓ Knowledge trees versus knowledge webs: Pretraining treats all knowledge as an associative web of embeddings without guaranteed causal structure. Reinforcement learning with verifiable rewards functions as a pruning mechanism, replacing web-like associations with deductive trees in which each node is causally consistent with its ancestors. The hypothesis is that models whose weight configurations reflect actual deductive structure will generalize to novel problems, while web-based models cannot.
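The point about testing each model directly on the target domain can be sketched as a small evaluation harness. This is a minimal illustration, not anything described in the episode: the `Solver` type, the names `model_a` and `model_b`, and the toy tasks are all assumptions, and the two toy functions stand in for real model calls. The only ARC-specific detail is that a prediction counts as correct when the output grid matches exactly.

```python
from typing import Callable, Dict, List, Tuple

Grid = List[List[int]]
Task = Tuple[Grid, Grid]          # (input grid, expected output grid)
Solver = Callable[[Grid], Grid]   # hypothetical wrapper around one model


def accuracy(solver: Solver, tasks: List[Task]) -> float:
    """Fraction of tasks where the predicted grid matches the target exactly."""
    if not tasks:
        return 0.0
    solved = sum(1 for grid_in, expected in tasks if solver(grid_in) == expected)
    return solved / len(tasks)


def rank_solvers(solvers: Dict[str, Solver], tasks: List[Task]) -> List[Tuple[str, float]]:
    """Score every solver on the same target-domain tasks, best first."""
    scores = [(name, accuracy(fn, tasks)) for name, fn in solvers.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Toy stand-ins for real model calls (assumptions, not real APIs).
def identity(grid: Grid) -> Grid:
    return [list(row) for row in grid]


def flip_rows(grid: Grid) -> Grid:
    return [list(reversed(row)) for row in grid]


# Toy tasks whose hidden rule is "mirror each row".
TOY_TASKS: List[Task] = [
    ([[1, 0], [2, 3]], [[0, 1], [3, 2]]),
    ([[4, 5, 6]], [[6, 5, 4]]),
    ([[7, 7]], [[7, 7]]),
]

ranking = rank_solvers({"model_a": identity, "model_b": flip_rows}, TOY_TASKS)
print(ranking)
```

In practice each solver would wrap a call to a different model, and the task list would be drawn from the actual problem domain rather than a general benchmark.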
What It Covers
Jeremy Berman, research scientist at Reflection AI, explains how he reached 29.4% on the ARC-AGI-2 public leaderboard using an evolutionary algorithm that generates and refines natural language descriptions of transformation rules rather than Python code. He then discusses why reasoning is the meta-skill required for AGI and the fundamental gap between knowledge webs and deductive knowledge trees.
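The general shape of an evolutionary search over rule descriptions can be sketched as follows. This is a toy under heavy assumptions, not Berman's implementation. A fixed lookup table (`PRIMITIVES`) stands in for a model that follows an English instruction, `revise` stands in for a model-written revision, and the population sizes are arbitrary. What it illustrates is the loop itself: generate candidate descriptions, score each against the training examples, keep the best, and revise them.

```python
import random
from typing import Callable, Dict, List, Tuple

Grid = List[List[int]]
Example = Tuple[Grid, Grid]  # (input grid, expected output grid)

# Hypothetical stand-in for "a model follows an English instruction".
# A real system would send the description and the grid to an LLM instead.
PRIMITIVES: Dict[str, Callable[[Grid], Grid]] = {
    "mirror each row left to right": lambda g: [row[::-1] for row in g],
    "flip the grid top to bottom": lambda g: [list(row) for row in g[::-1]],
    "swap rows and columns": lambda g: [list(col) for col in zip(*g)],
    "add 1 to every cell, wrapping at 10": lambda g: [[(c + 1) % 10 for c in row] for row in g],
}


def apply_description(steps: List[str], grid: Grid) -> Grid:
    """Apply each bullet of the description in order."""
    for step in steps:
        grid = PRIMITIVES[step](grid)
    return grid


def fitness(steps: List[str], examples: List[Example]) -> float:
    """Fraction of training examples the description reproduces exactly."""
    hits = sum(1 for x, y in examples if apply_description(steps, x) == y)
    return hits / len(examples)


def revise(steps: List[str], rng: random.Random) -> List[str]:
    """Toy revision: append, replace, or drop one bullet (never empties the list)."""
    new = list(steps)
    move = rng.choice(["append", "replace", "drop"])
    if move == "append":
        new.append(rng.choice(list(PRIMITIVES)))
    elif move == "replace":
        new[rng.randrange(len(new))] = rng.choice(list(PRIMITIVES))
    elif len(new) > 1:
        new.pop(rng.randrange(len(new)))
    return new


def evolve(examples: List[Example], population: int = 12, generations: int = 20,
           keep: int = 4, seed: int = 0) -> Tuple[List[str], float]:
    """Keep the fittest descriptions each generation and revise them."""
    rng = random.Random(seed)
    pool = [[rng.choice(list(PRIMITIVES))] for _ in range(population)]
    for _ in range(generations):
        pool.sort(key=lambda s: fitness(s, examples), reverse=True)
        if fitness(pool[0], examples) == 1.0:
            break
        parents = pool[:keep]
        children = [revise(rng.choice(parents), rng) for _ in range(population - keep)]
        pool = parents + children
    pool.sort(key=lambda s: fitness(s, examples), reverse=True)
    return pool[0], fitness(pool[0], examples)


# Toy training pairs whose hidden rule is "mirror each row, then add 1".
EXAMPLES: List[Example] = [
    ([[1, 2], [3, 4]], [[3, 2], [5, 4]]),
    ([[0, 9, 5]], [[6, 0, 1]]),
]

best, score = evolve(EXAMPLES)
print(best, score)
```

Because the score here is only the fraction of training pairs reproduced exactly, the toy search is close to random sampling. A real system would presumably need a more informative signal to guide revisions.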
Key Questions Answered
- • Catastrophic forgetting blocks continual learning more than compute does: The fundamental barrier to adaptive, continuously learning AI is not computational cost but catastrophic forgetting, in which fine-tuning on new data drifts weights away from previously correct solutions. Proposed directions include freezing expert layers, composable model architectures analogous to Docker's immutable layers, and selective data mixtures during fine-tuning. Solving this problem is framed as the next S-curve after the current RL scaling wave.
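The drift described above, and the effect of freezing, can be seen in a two-parameter toy. Everything here is an illustrative assumption rather than a method from the episode: the parameter names `expert` and `adapter`, the linear model, and the data are invented for the sketch. Fine-tuning every weight on the new task moves the `expert` weight that the old task relied on, while freezing it forces the new behavior into `adapter` alone.

```python
from typing import Dict, List, Set, Tuple

Params = Dict[str, float]
Sample = Tuple[float, float, float]  # (x_old, x_new, target)


def predict(p: Params, x_old: float, x_new: float) -> float:
    """Toy model: one 'expert' weight and one 'adapter' weight."""
    return p["expert"] * x_old + p["adapter"] * x_new


def mse(p: Params, data: List[Sample]) -> float:
    return sum((predict(p, a, b) - t) ** 2 for a, b, t in data) / len(data)


def fine_tune(p: Params, data: List[Sample], frozen: Set[str],
              lr: float = 0.05, epochs: int = 200) -> Params:
    """Plain SGD on squared error; parameters named in `frozen` are never updated."""
    p = dict(p)
    for _ in range(epochs):
        for x_old, x_new, target in data:
            err = predict(p, x_old, x_new) - target
            grads = {"expert": 2 * err * x_old, "adapter": 2 * err * x_new}
            for name, grad in grads.items():
                if name not in frozen:
                    p[name] -= lr * grad
    return p


# Old task: target = 2 * x_old. The "pretrained" weights already solve it.
OLD_TASK: List[Sample] = [(1.0, 0.0, 2.0), (2.0, 0.0, 4.0)]
PRETRAINED: Params = {"expert": 2.0, "adapter": 0.0}

# New task: both features are active and the target is 5 * x.
NEW_TASK: List[Sample] = [(1.0, 1.0, 5.0), (2.0, 2.0, 10.0)]

free = fine_tune(PRETRAINED, NEW_TASK, frozen=set())
safe = fine_tune(PRETRAINED, NEW_TASK, frozen={"expert"})

print("all weights trainable:", free, "old-task error:", mse(free, OLD_TASK))
print("expert frozen:        ", safe, "old-task error:", mse(safe, OLD_TASK))
```

Both runs fit the new task, but only the frozen run leaves the old-task error at zero. In this toy the adapter has enough capacity to absorb the new task on its own, which is an assumption real models may not satisfy.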
Notable Moment
Berman argues that heavy pretraining may actively slow reasoning development rather than accelerate it. His analogy contrasts consultants who know the terminology but cannot derive conclusions with Feynman-style thinkers who deduce everything from first principles, and it frames RL post-training as the process of converting one into the other.