Machine Learning Street Talk

New top score on ARC-AGI-2-pub (29.4%) - Jeremy Berman

68 min episode · 3 min read

AI-Generated Summary

Key Takeaways

  • Natural language over Python for ARC-AGI-2: Switching from Python programs to plain English descriptions of transformation rules dramatically improves performance on ARC-AGI-2 because every task can be described in five to ten bullet points. Python becomes brittle and verbose for compositional grid tasks, while natural language lets the model's inductive bias express itself fully, yielding higher accuracy at roughly $30 per task on v2 versus $8 on v1.
  • Breadth over depth for thinking models: On ARC-AGI-1, iterative revision loops were critical because models lacked internal reasoning. On ARC-AGI-2, RL-trained thinking models like Grok 4 perform deep revision internally, so the optimal strategy shifts toward maximizing entropy and breadth of initial generation rather than deep iterative refinement. Artificially increasing entropy in prompts consistently outperformed narrow, constrained prompting strategies.
  • Model selection is domain-specific and spiky: ARC leaderboard performance varies dramatically by model in ways other benchmarks do not. Grok 4 outperforms GPT-class models on ARC-AGI-2 grid reasoning, while Sonnet 3.5 remains superior for code generation tasks. Testing each model directly on the target domain rather than relying on general leaderboard rankings is necessary to identify the right tool for a specific problem type.
  • Reasoning as the meta-skill for AGI: Current LLMs acquire domain-specific reasoning circuits — math reasoning stays in math weights, science reasoning in science weights — with limited cross-domain transfer. The core AGI gap is not skill acquisition but the meta-skill of creating new skills. Aligning models purely toward general reasoning through RL, before layering domain knowledge, is the proposed path toward a foundation for general intelligence.
  • Knowledge trees versus knowledge webs: Pretraining treats all knowledge as an associative web of embeddings without guaranteed causal structure. Reinforcement learning with verifiable rewards functions as a pruning mechanism, replacing web-like associations with deductive trees where each node is causally consistent with its ancestors. The hypothesis is that models with weight configurations reflecting actual deductive structure will generalize to novel problems, while web-based models cannot.
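The breadth-over-depth point can be sketched as a best-of-N strategy: sample many independent candidates at high temperature and let agreement pick the answer, instead of deeply revising a single one. The toy below is a minimal runnable sketch, not the actual system; `sample_candidate` is a hypothetical stand-in for a high-temperature call to a thinking model:

```python
import random
from collections import Counter

def sample_candidate(task, temperature, rng):
    # Hypothetical stand-in for an LLM call; we simulate noisy guesses
    # around the true rule so the sketch runs without a model.
    wrong = rng.random() < 0.25 * temperature
    return "wrong-rule" if wrong else task["true_rule"]

def breadth_first_solve(task, n_candidates=50, temperature=1.0, seed=0):
    """Maximize entropy and breadth of the initial generation, then let
    agreement among independent samples pick the answer, rather than
    iteratively refining a single candidate."""
    rng = random.Random(seed)
    candidates = [sample_candidate(task, temperature, rng)
                  for _ in range(n_candidates)]
    rule, _ = Counter(candidates).most_common(1)[0]  # majority vote
    return rule
```

The design point is that diversity at generation time substitutes for the revision loop: with a model that already revises internally, spending the budget on independent high-entropy samples covers more of the hypothesis space than polishing one trajectory.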

What It Covers

Jeremy Berman, a research scientist at Reflection AI, explains how he reached 29.4% on the ARC-AGI-2 public leaderboard with an evolutionary algorithm that generates and refines natural language descriptions of transformation rules rather than Python code. He then discusses why reasoning is the meta-skill required for AGI, and the fundamental gap between associative knowledge webs and deductive knowledge trees.
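The evolutionary loop can be sketched in miniature. Everything below is a toy, not Berman's implementation: `propose` and `mutate` stand in for LLM calls that draft and refine plain-English rule descriptions, and `score` stands in for checking a description against the training grids (the real system would execute the described transformation; this toy just measures word overlap with a hypothetical hidden target rule):

```python
import random

TARGET = "flip rows"  # hypothetical hidden rule for this toy task

def propose(rng):
    # LLM stand-in: draft a candidate plain-English rule description.
    return rng.choice(["rotate grid", "flip columns",
                       "mirror rows", "recolor cells"])

def mutate(rule, rng):
    # LLM stand-in: revise one word of a surviving description.
    words = rule.split()
    words[rng.randrange(len(words))] = rng.choice(
        ["flip", "rows", "mirror", "columns"])
    return " ".join(words)

def score(rule):
    # Stand-in fitness: word overlap with the hidden rule. The real
    # system would score a description against the training pairs.
    return len(set(rule.split()) & set(TARGET.split())) / len(TARGET.split())

def evolve(generations=30, pop_size=8, seed=0):
    rng = random.Random(seed)
    population = [propose(rng) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=score, reverse=True)
        survivors = population[: pop_size // 2]          # selection
        children = [mutate(r, rng) for r in survivors]   # refinement
        population = survivors + children
    return max(population, key=score)
```

The structure mirrors the described approach: generate a diverse population of natural-language hypotheses, keep the ones that best explain the training examples, and refine the survivors rather than committing early to one program.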

Key Questions Answered

  • Catastrophic forgetting blocks continual learning more than compute does: The fundamental barrier to adaptive, continuously learning AI is not computational cost but catastrophic forgetting — fine-tuning on new data drifts weights away from previously correct solutions. Proposed directions include freezing expert layers, composable model architectures analogous to Docker's immutable layers, and selective data mixtures during fine-tuning. Solving this problem is framed as the next S-curve after the current RL scaling wave.
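The freezing-expert-layers idea can be illustrated with a toy SGD step (plain Python, not anything from the episode's actual systems): parameters marked frozen are simply excluded from the update, so fine-tuning on a new task cannot drift the weights that encode an already-solved skill.

```python
def sgd_step(params, grads, frozen, lr=0.1):
    """Apply an SGD update only to parameters not marked frozen."""
    return {name: (w if name in frozen
                   else [wi - lr * gi for wi, gi in zip(w, grads[name])])
            for name, w in params.items()}

# "expert_layer" encodes a previously learned skill; "adapter" is a
# small trainable block for the new task (names are illustrative).
params = {"expert_layer": [1.0, 2.0], "adapter": [0.0, 0.0]}
grads  = {"expert_layer": [5.0, 5.0], "adapter": [1.0, -1.0]}

updated = sgd_step(params, grads, frozen={"expert_layer"})
# The frozen layer is untouched; only the adapter moves.
```

The Docker analogy from the episode takes this one step further: instead of merely masking gradients, new skills would live in new immutable layers composed on top of old ones, so earlier solutions are never overwritten at all.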

Notable Moment

Berman argues that heavy pretraining may actively slow reasoning development rather than accelerate it. His analogy contrasts consultants who know terminology but cannot derive conclusions with Feynman-style thinkers who deduce everything from first principles — and frames RL post-training as the process of converting one into the other.

