#335 Sriram Raghavan: Why IBM Is Betting Everything on Small AI Models
Episode: 60 min
Read time: 3 min
Topics: Artificial Intelligence
AI-Generated Summary
Key Takeaways
- ✓ Direct RL vs. Distillation: Training small models directly with reinforcement learning, rather than distilling from larger models, preserves both broad benchmark performance and safety alignment. Distillation improves targeted tasks like code and math but measurably degrades a model's general capabilities and strips away safety alignment — a critical liability for enterprise deployments, where compliance and reliability are non-negotiable.
- ✓ Inference-Time Scaling Economics: IBM's Granite 3.3 (8B parameters) matches GPT-4o and Claude 3.5 on code and math benchmarks using inference-time scaling techniques such as particle filtering and majority voting (a toy illustration of voting follows this list). The economic logic: a small model scaled up on demand for complex tasks costs far less than permanently hosting a large model for all tasks, most of which may require no heavy compute.
- ✓ Model Parameters as a Misleading Metric: Parameter count is becoming an unreliable way to compare models. Mixture-of-experts architectures and hybrid state-space/attention models (like IBM's Bamba 2, developed with CMU and Princeton) mean a 50B-parameter model can require less memory than a 34B model. Memory footprint per user session at a given context length is the more actionable metric for hardware and deployment decisions (the second sketch below works a concrete example).
- ✓ Generative Computing via LoRA Adapters: IBM's "generative computing" framework wraps Granite models in a runtime that activates or deactivates specialized LoRA adapters — for hallucination detection, query rewriting, and uncertainty quantification — without touching the base model's weights. "Activated LoRAs" share the base model's KV cache, delivering 2–3x efficiency gains over standard LoRA invocations and enabling modular, programmable AI application architectures (the third sketch below shows the runtime shape).
- ✓ Data Quality Over Data Volume: Granite 3.3 was measurably ahead of Granite 3.0 at the halfway point of training, solely due to data quality improvements. The field has shifted from maximizing model size and raw token count to curating high-quality data at every training stage — pre-training, mid-training, instruction tuning, and RL — with IBM releasing its cleaning pipeline as the open-source Data Prep Kit.
- ✓ Multi-Agent Architecture for Enterprise AI: IBM structures enterprise AI as collections of fit-for-purpose small models and agents rather than one large generalist system. Granite 3.3 ships alongside a 2B document-processing model, a speech model, and a vision model. Agents are being built for COBOL/Java software development, IT automation (via Instana, Turbonomic, and Apptio), and physical asset management via Maximo, with agent observability, security delegation, and lifecycle governance as active research priorities.
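To make the inference-time scaling idea concrete, here is a minimal sketch of majority voting (self-consistency) in Python. The `sample_answer` function is a simulated stand-in for a real model call, not IBM's implementation; particle filtering, the other technique mentioned, is a more involved weighted-resampling variant of the same idea.

```python
import random
from collections import Counter

def sample_answer(prompt: str) -> str:
    # Stand-in for one sampled completion from a small hosted model.
    # Simulates a model that lands on the right answer ~60% of the time.
    return random.choices(["42", "41", "43"], weights=[6, 2, 2])[0]

def majority_vote(prompt: str, n_samples: int = 16) -> str:
    """Self-consistency: sample the same prompt many times and keep the
    most frequent final answer. Accuracy rises with n_samples, so extra
    compute is spent only on the queries that need it."""
    answers = [sample_answer(prompt) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

print(majority_vote("What is 6 * 7?"))  # almost always "42"
```

A single 8B model served this way can answer easy queries with one sample and hard ones with dozens, which is the cost asymmetry the takeaway describes.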
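The memory-footprint point reduces to back-of-the-envelope arithmetic. The configurations below are hypothetical, not the specs of any real model: they show how grouped-query attention can give a model with more parameters a far smaller KV cache per user session.

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_len: int, bytes_per_elt: int = 2) -> float:
    # Keys + values (factor of 2), per layer, per position, fp16/bf16.
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elt / 2**30

# Hypothetical configurations, for illustration only.
mha_34b = kv_cache_gib(n_layers=48, n_kv_heads=56, head_dim=128, context_len=128_000)
gqa_50b = kv_cache_gib(n_layers=64, n_kv_heads=8,  head_dim=128, context_len=128_000)
print(f"34B-class, full multi-head attention: {mha_34b:.0f} GiB/session")  # ~164 GiB
print(f"50B-class, grouped-query attention:   {gqa_50b:.0f} GiB/session")  # ~31 GiB
```

Hybrid state-space models like Bamba go further still, since state-space layers carry a fixed-size state instead of a cache that grows with context length.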
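Finally, a toy sketch of the "generative computing" runtime shape: named adapters toggled per request around an untouched base generator. Every name here is invented for illustration, and the real activated-LoRA efficiency win — reusing the base model's KV cache across adapter calls — is not modeled.

```python
from typing import Callable, Dict, Sequence

class AdapterRuntime:
    """Base generator plus named adapters that can be switched on per
    request, leaving the base model's weights untouched."""

    def __init__(self, base: Callable[[str], str]):
        self.base = base
        self.adapters: Dict[str, Callable[[str, str], str]] = {}

    def register(self, name: str, fn: Callable[[str, str], str]) -> None:
        self.adapters[name] = fn

    def run(self, prompt: str, checks: Sequence[str] = ()) -> str:
        draft = self.base(prompt)
        for name in checks:  # activate only the adapters this request asked for
            draft = self.adapters[name](prompt, draft)
        return draft

rt = AdapterRuntime(base=lambda p: f"draft answer to: {p!r}")
rt.register("hallucination_check", lambda p, d: d + " [grounding: ok]")
rt.register("uncertainty", lambda p, d: d + " [confidence: 0.72]")
print(rt.run("Summarize the contract", checks=["hallucination_check", "uncertainty"]))
```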
What It Covers
IBM Research VP Sriram Raghavan explains why IBM trains its Granite models — currently 2B and 8B parameters — directly using reinforcement learning rather than distilling from larger models, and how combining direct RL training with inference-time scaling allows small models to match GPT-4o and Claude 3.5 on code and math benchmarks at a fraction of the cost.
Notable Moment
Raghavan argues that the massive English-language prompts used in today's agent systems are essentially poorly written software programs. Decomposing those prompts into actual programming abstractions — declared instructions, explicit requirements, external verification functions — allows a small model to match a large model's output while eliminating the need to train the model on every formatting rule.
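A minimal sketch of that decomposition, with invented names: the validity rules live in an external verification function rather than in the prompt, and the model is simply re-sampled until the verifier passes.

```python
import json

def generate(prompt: str) -> str:
    # Stand-in for a small-model call; swap in a real client.
    return '{"customer": "Acme Corp", "total": 1200}'

def is_valid(output: str) -> bool:
    # The requirement, declared as code instead of prose in a mega-prompt.
    try:
        record = json.loads(output)
    except json.JSONDecodeError:
        return False
    return {"customer", "total"} <= record.keys() and record["total"] >= 0

def extract_invoice(text: str, max_tries: int = 3) -> dict:
    prompt = f"Extract customer and total as JSON from:\n{text}"
    for _ in range(max_tries):
        out = generate(prompt)
        if is_valid(out):  # external verification, not trust in the model
            return json.loads(out)
    raise ValueError("no verified output within budget")

print(extract_invoice("Invoice: Acme Corp owes $1,200."))
```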
More from Eye on AI
- #334 Abhishek Singh: The $1.2 Billion Plan to Turn India Into an AI Superpower (Apr 15 · 34 min)
- #333 Adi Kuruganti: Why Your AI Pilot Is Failing and What It Takes to Reach Production (Apr 15 · 58 min)
- #332 Dan Faulkner: The Code Is Clean. The App Is Broken. Why AI Development Has an Integrity Problem
- #331 Sergey Levine: The Robot Revolution Nobody Is Talking About
- #330 Sebastian Risi: Why AI Should Be Grown, Not Trained