#335 Sriram Raghavan: Why IBM Is Betting Everything on Small AI Models
Podcast: Eye on AI
Episode: 60 min
Read time: 3 min
Topics: Artificial Intelligence
AI-Generated Summary
Key Takeaways
- Direct RL vs. Distillation: Training small models directly with reinforcement learning, rather than distilling from larger models, preserves both broad benchmark performance and safety alignment. Distillation improves targeted tasks like code and math but measurably degrades a model's general capabilities and strips away safety alignment, a critical liability for enterprise deployments where compliance and reliability are non-negotiable.
- Inference-Time Scaling Economics: IBM's Granite 3.3 (8B parameters) matches GPT-4o and Claude 3.5 on code and math benchmarks using inference-time scaling techniques including particle filtering and majority voting. The economic logic: a small model scaled up on demand for complex tasks costs far less than permanently hosting a large model for all tasks, nine out of ten of which may require no heavy compute.
- Model Parameters as a Misleading Metric: Parameter count is becoming an unreliable way to compare models. Mixture-of-experts architectures and hybrid state-space/attention models (like IBM's Bamba 2, developed with CMU and Princeton) mean a 50B-parameter model can require less memory than a 34B model. Memory footprint per user session at a given context length is a more actionable metric for hardware and deployment decisions.
- Generative Computing via LoRA Adapters: IBM's "generative computing" framework wraps Granite models in a runtime that activates or deactivates specialized LoRA adapters (for hallucination detection, query rewriting, and uncertainty quantification) without touching base model weights. "Activated LoRAs" share the base model's KV cache, delivering 2–3x efficiency gains over standard LoRA invocations and enabling modular, programmable AI application architectures.
- Data Quality Over Data Volume: Granite 3.3 was measurably ahead of Granite 3.0 at the halfway point of training, solely due to data quality improvements. The field has shifted from maximizing model size and raw token count to curating high-quality data at every training stage (pre-training, mid-training, instruction tuning, and RL), with IBM releasing its cleaning pipeline as an open-source project called Data Prep Kit.
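To make the "memory footprint per user session" metric concrete, here is a minimal sketch of the standard key/value cache size calculation for attention layers. The layer counts, head counts, and context length below are hypothetical illustrations, not published specifications of Bamba, Granite, or any other model. The sketch also ignores model weights and the fixed-size state that state-space layers keep, so it only shows why replacing attention layers shrinks the part of memory that grows with context length.

```python
def kv_cache_bytes(
    n_attention_layers: int,
    n_kv_heads: int,
    head_dim: int,
    context_len: int,
    bytes_per_value: int = 2,  # 2 bytes per value assumes fp16/bf16 storage
) -> int:
    """Bytes of key/value cache held by one session at a given context length."""
    # Factor 2: each attention layer stores one key tensor and one value tensor.
    return 2 * n_attention_layers * n_kv_heads * head_dim * context_len * bytes_per_value


GIB = 1024 ** 3

# Hypothetical configurations chosen only for illustration.
dense = kv_cache_bytes(n_attention_layers=40, n_kv_heads=8, head_dim=128, context_len=8192)
hybrid = kv_cache_bytes(n_attention_layers=4, n_kv_heads=8, head_dim=128, context_len=8192)

print(dense / GIB)   # 1.25
print(hybrid / GIB)  # 0.125
```

In this toy comparison the two models could have similar parameter counts, yet the per-session cache differs by a factor of ten because only attention layers contribute a cache that scales with context length.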
What It Covers
IBM Research VP Sriram Raghavan explains why IBM trains its Granite models — currently 2B and 8B parameters — directly using reinforcement learning rather than distilling from larger models, and how combining direct RL training with inference-time scaling allows small models to match GPT-4o and Claude 3.5 on code and math benchmarks at a fraction of the cost.
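Of the inference-time scaling techniques named in this summary, majority voting is the simplest to sketch: sample several candidate answers from the same small model and keep the most frequent one. The code below is an illustrative sketch only, not IBM's implementation, and it does not cover particle filtering. The `replay` helper is a hypothetical stand-in for a real model call and simply returns canned completions.

```python
from collections import Counter
from typing import Callable, Iterator


def majority_vote(sample_answer: Callable[[], str], n_samples: int) -> str:
    """Draw n_samples candidate answers and return the most frequent one."""
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    votes = Counter(sample_answer() for _ in range(n_samples))
    # most_common(1) returns a list with one (answer, count) pair.
    return votes.most_common(1)[0][0]


def replay(answers: list[str]) -> Callable[[], str]:
    """Hypothetical stand-in for a model call: replays canned completions."""
    it: Iterator[str] = iter(answers)
    return lambda: next(it)


# Five hypothetical completions for the same math question.
canned = ["42", "41", "42", "40", "42"]
final = majority_vote(replay(canned), n_samples=len(canned))
print(final)  # 42
```

Drawing `n_samples` completions costs roughly `n_samples` single-pass generations, so the extra compute is spent only on the requests that need it.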
Key Questions Answered
- Multi-Agent Architecture for Enterprise AI: IBM structures enterprise AI as collections of fit-for-purpose small models and agents rather than one large generalist system. Granite 3.3 ships alongside a 2B document-processing model, a speech model, and a vision model. Agents are being built specifically for COBOL/Java software development, IT automation (via Instana, Turbonomic, Apptio), and physical asset management via Maximo, with agent observability, security delegation, and lifecycle governance as active research priorities.
Notable Moment
Raghavan argues that the massive English-language prompts used in today's agent systems are essentially poorly written software programs. Decomposing those prompts into actual programming abstractions — declared instructions, explicit requirements, external verification functions — allows a small model to match a large model's output while eliminating the need to train the model on every formatting rule.
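The decomposition Raghavan describes can be sketched as a small program: a declared instruction, a list of explicit requirements, and a verification function attached to each requirement that runs outside the model. Every name below (`Task`, `Requirement`, `run`, `fake_model`) is hypothetical and chosen for illustration. This is not the API of IBM's generative computing runtime, and `fake_model` is a stub standing in for a real model call.

```python
import json
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Requirement:
    description: str              # stated explicitly instead of buried in a prompt
    check: Callable[[str], bool]  # external verification function


@dataclass
class Task:
    instruction: str
    requirements: list[Requirement] = field(default_factory=list)


def run(task: Task, generate: Callable[[str, list[str]], str], max_attempts: int = 3) -> str:
    """Call the model, verify the output in code, and retry with the failed requirements."""
    feedback: list[str] = []
    for _ in range(max_attempts):
        output = generate(task.instruction, feedback)
        failed = [r.description for r in task.requirements if not r.check(output)]
        if not failed:
            return output
        feedback = failed
    raise ValueError(f"requirements still failing after {max_attempts} attempts: {feedback}")


def is_json(text: str) -> bool:
    try:
        json.loads(text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    return True


def fake_model(instruction: str, feedback: list[str]) -> str:
    """Stub model: chatty on the first try, compliant once it receives feedback."""
    if not feedback:
        return "Sure! Here is the answer: 4"
    return '{"answer": 4}'


task = Task(
    instruction="Return the sum of 2 and 2.",
    requirements=[Requirement("Output must be valid JSON", is_json)],
)
result = run(task, fake_model)
print(result)  # {"answer": 4}
```

Because the formatting rule is enforced by `is_json` rather than by the prompt, the model does not need to be trained on that rule; it only needs to respond to the specific requirement that failed.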