a16z Podcast

What's Missing Between LLMs and AGI - Vishal Misra & Martin Casado

47 min episode · 2 min read

AI-Generated Summary

Key Takeaways

  • Bayesian Wind Tunnel methodology: To prove LLMs perform true Bayesian inference rather than superficial pattern matching, Misra's team created controlled experiments using freshly initialized architectures trained from scratch on tasks that are mathematically impossible to memorize. Transformers matched the analytically calculated Bayesian posterior to 10⁻³ bit accuracy; Mamba performed nearly as well, LSTMs only partially, and MLPs failed entirely. Architecture, not training data, determines this capability. (A toy version of this posterior comparison is sketched after this list.)
  • The Frozen Weights Problem: LLMs perform Bayesian updating within a conversation but reset completely when a new session begins, because weights are frozen after training. Human brains maintain synaptic plasticity throughout life, continuously updating from experience. Before that kind of plasticity becomes viable in models, continual-learning research must solve catastrophic forgetting: updating weights on new information without erasing previously learned knowledge.
  • Shannon Entropy vs. Kolmogorov Complexity: LLMs operate in the Shannon-entropy domain, learning correlations across all available data. Human reasoning operates closer to Kolmogorov complexity: finding the shortest causal program that explains observations. Einstein's field equation (Gμν = 8πTμν, written out after this list) is a minimal representation that simultaneously explains Mercury's orbit, gravitational lensing, and the relativistic corrections GPS depends on. LLMs cannot generate equivalent new representations.
  • The Einstein AGI Test: A concrete benchmark for AGI: train an LLM exclusively on pre-1911 physics data and determine whether it independently derives the theory of relativity. Current models would fail because they are bound to existing data manifolds and cannot construct new causal representations that reconcile anomalous observations, such as the Michelson-Morley result, with Newtonian mechanics.
  • Causation vs. Correlation as the Core Gap: Deep learning performs association, the first rung of Judea Pearl's causal hierarchy. It does not perform intervention or counterfactual reasoning, which require internal simulation models. When a person dodges a thrown object, the brain runs a causal simulation, not a probability calculation. Building architectures capable of causal modeling, not scaling existing ones, is the necessary research direction. (A toy contrast between conditioning and intervening is sketched after this list.)
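
The first takeaway rests on two quantitative ideas: comparing a model's predictions against an exact Bayesian posterior, and measuring the gap in bits. Here is a minimal sketch of that comparison. It is not Misra's actual experimental setup: the Beta-Bernoulli coin task, the uniform prior, and the stand-in model probability are all assumptions chosen for illustration.

```python
import numpy as np

def bayes_posterior_predictive(flips, alpha=1.0, beta=1.0):
    """Exact Beta-Bernoulli posterior predictive: P(next flip = heads | flips so far)."""
    heads = sum(flips)
    return (alpha + heads) / (alpha + beta + len(flips))

def kl_bits(p, q):
    """KL divergence, in bits, between Bernoulli(p) and Bernoulli(q)."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * np.log2(p / q) + (1 - p) * np.log2((1 - p) / (1 - q))

# Context observed so far: flips of a coin with unknown bias.
flips = [1, 0, 1, 1, 0, 1, 1, 1]

# Ground truth: the analytic Bayesian posterior predictive for the next flip.
p_bayes = bayes_posterior_predictive(flips)

# Stand-in for a trained model's next-token probability on the same context
# (in the real experiment this would come from the transformer's softmax).
p_model = 0.69  # hypothetical value

gap = kl_bits(p_bayes, p_model)
print(f"analytic posterior: {p_bayes:.3f}  model: {p_model:.3f}  gap: {gap:.2e} bits")
```

The reported result is that transformers trained from scratch on such unmemorizable tasks close this gap to roughly 10⁻³ bits, while MLPs do not.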
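
For reference, the field equation cited in the third takeaway is the compact, geometrized-unit form. Written out in LaTeX, with a second line restoring physical constants (a standard textbook form, not something stated in the episode):

```latex
% Compact form, geometrized units (G = c = 1), as quoted above:
G_{\mu\nu} = 8\pi \, T_{\mu\nu}

% Equivalent form with physical constants restored:
G_{\mu\nu} = \frac{8\pi G}{c^{4}} \, T_{\mu\nu}
```

The left-hand side encodes spacetime curvature; the right-hand side, the energy and momentum content. The episode's point is that this single minimal expression accounts for Mercury's perihelion shift, gravitational lensing, and GPS timing at once.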
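
The association-versus-intervention gap in the final takeaway can be made concrete with a toy structural causal model. This example is hypothetical and not from the episode: a confounder Z drives both X and Y, so the observational quantity P(Y | X = 1) (Pearl's first rung) differs from the interventional quantity P(Y | do(X = 1)) (the second rung), which can only be obtained by simulating what happens when X is set directly.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200_000

def simulate(do_x=None):
    """Toy structural causal model: Z -> X, Z -> Y, X -> Y.
    Passing do_x performs an intervention: X is set directly, severing the Z -> X edge."""
    z = rng.random(N) < 0.5                        # hidden confounder
    if do_x is None:
        x = rng.random(N) < np.where(z, 0.9, 0.1)  # X depends on Z (observational world)
    else:
        x = np.full(N, do_x, dtype=bool)           # X forced by intervention
    y = rng.random(N) < (0.2 + 0.3 * x + 0.4 * z)  # Y depends on both X and Z
    return x, y

# Rung 1 (association): condition on X = 1 in purely observational data.
x_obs, y_obs = simulate()
p_assoc = y_obs[x_obs].mean()

# Rung 2 (intervention): force X = 1 for everyone, then observe Y.
_, y_do = simulate(do_x=True)
p_do = y_do.mean()

print(f"P(Y=1 | X=1)     = {p_assoc:.3f}  # correlation, inflated by the confounder")
print(f"P(Y=1 | do(X=1)) = {p_do:.3f}  # causal effect of setting X directly")
```

A model trained purely on observational data recovers the first number; answering the second requires a causal model of how the data were generated, which is the capability Misra argues current architectures lack.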

What It Covers

Columbia University professor Vishal Misra presents mathematical proof that transformers perform precise Bayesian inference, matching theoretically correct posteriors to 10⁻³ bit accuracy. He argues two unsolved problems — continual learning plasticity and moving from correlation to causation — separate current LLMs from genuine artificial general intelligence.

Key Questions Answered

  • Do LLMs really perform Bayesian inference? Yes: in controlled wind-tunnel experiments on tasks that cannot be memorized, transformers matched the exact posterior to 10⁻³ bit accuracy, Mamba came close, LSTMs only partially, and MLPs failed.
  • Why do models reset between conversations? Their weights are frozen after training; genuine plasticity requires first solving catastrophic forgetting.
  • What kind of compression do humans do that LLMs don't? Kolmogorov-style: finding the shortest causal program that explains observations, rather than Shannon-style correlations over all available data.
  • Would today's models pass the Einstein test? No: trained only on pre-1911 physics, they could not construct the new representation relativity required to reconcile anomalies like Michelson-Morley with Newtonian mechanics.
  • What is the core missing capability? Causation: intervention and counterfactual reasoning on Pearl's hierarchy, which calls for new architectures rather than more scale.

Notable Moment

Misra describes Donald Knuth's viral Hamiltonian cycle result as validation of LLM limits rather than evidence of emerging generality — the models exhausted their search space and stalled, while Knuth himself constructed the novel mathematical proof, demonstrating that humans still supply the causal reasoning layer.
