The Evolution of Reasoning in Small Language Models with Yejin Choi - #761
Episode: 66 min
Read time: 3 min
Topics: Productivity, Fundraising & VC, Design & UX
AI-Generated Summary
Key Takeaways
- ✓ Prismatic Synthesis for Data Diversity: Generate synthetic math problems with a 32B-parameter teacher model, compute a gradient vector for each example with a 1.5B proxy model, apply k-means clustering to those vectors to identify overrepresented patterns, then aggressively filter redundant examples. This iterative over-generation and filtration process produces 1 million diverse training examples that outperform datasets generated by 20x larger models, because it maximizes qualitative differences in the training distribution.
- ✓ Mode Collapse After Post-Training: LLMs become strikingly homogeneous after supervised fine-tuning and reinforcement learning, even on open-ended questions. Asked to generate random numbers or creative content, even with higher temperature settings, models such as Llama, ChatGPT, and DeepSeek R1 produce nearly identical outputs, sometimes verbatim. Pretrained models show better diversity, but post-training optimization toward preferred answers creates intra-model and inter-model convergence that reduces output variety.
- ✓ Reinforcement Learning as Pretraining Objective: During pretraining, encourage the model to generate an intermediate thought before predicting the next tokens, and reward the thought only when it raises the conditional probability of those tokens relative to predicting without it. This information-gain approach requires more compute per token but produces models that perform better on reasoning benchmarks after standard post-training, similar to how humans benefit from learning logical thinking early in development.
- ✓ Making Out-of-Distribution In-Distribution: Current AI systems perform well only on data similar to their training examples, so they need comprehensive coverage of edge cases and corner scenarios. Supervised fine-tuning during post-training fills some gaps, while reinforcement learning pushes models to explore regions the data did not cover. Humans, by contrast, handle novel situations without extensive prior examples; this gap in learning efficiency is a core limitation of the data-dependent paradigm.
- ✓ Three Types of Pluralistic Alignment: Implement Overton pluralism by presenting multiple reasonable viewpoints on politically thorny questions rather than selecting the majority opinion. Apply distributional pluralism to ensure that the distribution of AI decisions matches the distribution of human decisions across populations, avoiding heavily skewed outcomes. Enable steerable pluralism so users can adjust models to different value systems within socially acceptable bounds, respecting cultural and religious diversity without enforcing artificial neutrality.
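The cluster-then-filter step in the Prismatic Synthesis takeaway can be illustrated with a minimal sketch. It is not the authors' pipeline: the feature vectors here are random stand-ins for per-example gradient vectors, and the function names, the cluster count `k`, and the per-cluster `cap` are all hypothetical choices made for illustration.

```python
import numpy as np

def kmeans(x, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster label per row of x."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        # Squared distance from every point to every center.
        dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return labels

def filter_overrepresented(features, k=8, cap=20, seed=0):
    """Keep at most `cap` examples per cluster (hypothetical filter rule)."""
    labels = kmeans(features, k, seed=seed)
    rng = np.random.default_rng(seed)
    keep = []
    for j in range(k):
        idx = np.flatnonzero(labels == j)
        if len(idx) > cap:
            # Overrepresented cluster: subsample it down to the cap.
            idx = rng.choice(idx, size=cap, replace=False)
        keep.extend(idx.tolist())
    return sorted(keep), labels

# Stand-in data: one large "common" pattern and one small "rare" pattern.
# In the described method these rows would be gradient vectors from a proxy model.
rng = np.random.default_rng(0)
common = rng.normal(0.0, 0.1, size=(500, 16))
rare = rng.normal(3.0, 0.1, size=(30, 16))
features = np.vstack([common, rare])

kept, labels = filter_overrepresented(features, k=4, cap=25)
print(f"kept {len(kept)} of {len(features)} examples")
```

Capping each cluster is only one possible filtering rule; the point of the sketch is that redundancy is judged in feature space, not by surface text.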
What It Covers
Yejin Choi, a professor at Stanford HAI, explores democratizing AI through small language models that can match their larger counterparts. She details synthetic data generation techniques, reinforcement learning during pretraining, and pluralistic alignment approaches. The conversation examines mode collapse in LLMs, the artificial hive mind phenomenon, and how academic research can make powerful AI accessible beyond resource-rich tech companies.
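The information-gain idea behind reinforcement learning during pretraining can be written as a difference of log-probabilities. The sketch below is an assumption-laden illustration, not the method from the episode: the function name is hypothetical, the probabilities are made-up numbers, and clipping the reward at zero is one possible reading of "rewarding only when the thought helps".

```python
import math

def information_gain_reward(logp_with_thought, logp_without_thought):
    """Reward a thought only if it raises the log-probability of the next tokens.

    Both arguments are log-probabilities of the same target tokens, scored
    with and without the intermediate thought in context. Clipping at zero
    is an assumption of this sketch.
    """
    gain = logp_with_thought - logp_without_thought
    return max(gain, 0.0)

# Made-up example: the thought doubles the probability of the correct continuation.
helpful = information_gain_reward(math.log(0.6), math.log(0.3))

# Made-up example: the thought makes the correct continuation less likely.
unhelpful = information_gain_reward(math.log(0.2), math.log(0.3))

print(helpful, unhelpful)
```

Scoring the same tokens twice, once with and once without the thought, is where the extra compute per token comes from.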
Key Questions Answered
- • Spectrum Tuning for Output Diversity: Post-training methods can teach models to retain diverse generation patterns instead of converging on a single "correct" answer. By designing algorithms that preserve the spectrum of valid outputs, and by ensuring the training data represents diverse perspectives, models can avoid the homogenization problem in which AI-generated content makes internet discourse less varied. This requires explicit algorithmic innovation beyond standard supervised fine-tuning and reinforcement learning.
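One simple way to quantify the homogenization described above is the Shannon entropy of repeated samples for the same open-ended prompt. This is a measurement sketch only, not the Spectrum Tuning algorithm; the function name and both sample lists are hypothetical.

```python
import math
from collections import Counter

def output_entropy(samples):
    """Shannon entropy (in bits) of the empirical distribution over samples."""
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Hypothetical answers to "pick a random number from 1 to 10", sampled 10 times.
collapsed = ["7"] * 9 + ["3"]               # mode-collapsed behavior
diverse = [str(i) for i in range(1, 11)]    # every value appears once

print(output_entropy(collapsed), output_entropy(diverse))
```

A collapsed model scores well below the maximum of log2(10) bits that a uniform choice over ten options would reach.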
Notable Moment
Choi reveals that when DeepSeek R1 undergoes pure reinforcement learning optimization, the model spontaneously begins code-switching between Chinese, English, and other languages mid-solution while solving math problems. The RL process rewards only correct final answers and places no constraint on the reasoning path, so bizarre behaviors can emerge and even be reinforced. This is why distillation and imitation learning become necessary to keep outputs human-interpretable.
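An outcome-only reward of the kind described here can be sketched in a few lines. The `Answer:` marker, the function name, and the sample responses are assumptions made for illustration; they are not DeepSeek's actual reward implementation.

```python
def outcome_reward(response, reference_answer):
    """Return 1.0 if the text after the last 'Answer:' marker matches the reference.

    Everything before the marker (the reasoning path) is ignored entirely.
    """
    marker = "Answer:"
    if marker not in response:
        return 0.0
    final = response.rsplit(marker, 1)[1].strip()
    return 1.0 if final == reference_answer.strip() else 0.0

# Hypothetical responses to "What is 12 * 12 - 4?"
readable = "12 times 12 is 144, minus 4 is 140. Answer: 140"
mixed = "12 乘以 12 是 144, then minus 4 gives 140. Answer: 140"
wrong = "12 times 12 is 124, minus 4 is 120. Answer: 120"

print(outcome_reward(readable, "140"), outcome_reward(mixed, "140"), outcome_reward(wrong, "140"))
```

Because the readable and the language-mixed solutions earn the same reward, nothing in this signal discourages code-switching.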