#335 Sriram Raghavan: Why IBM Is Betting Everything on Small AI Models

April 19, 2026

60 min episode · 3 min read

Topics

Artificial Intelligence

AI-Generated Summary

Published Apr 20, 2026

Key Takeaways

✓ Direct RL vs. Distillation: Training small models directly with reinforcement learning, rather than distilling from larger models, preserves both broad benchmark performance and safety alignment. Distillation improves targeted tasks like code and math but measurably degrades a model's general capabilities and strips away safety alignment, a critical liability for enterprise deployments where compliance and reliability are non-negotiable.
✓ Inference-Time Scaling Economics: IBM's Granite 3.3 (8B parameters) matches GPT-4o and Claude 3.5 on code and math benchmarks when it uses inference-time scaling techniques, including particle filtering and majority voting. The economic logic: scaling up a small model on demand for complex tasks costs far less than permanently hosting a large model for every task, when nine out of ten tasks may require no heavy compute.
✓ Model Parameters as a Misleading Metric: Parameter count is becoming an unreliable way to compare models. With mixture-of-experts architectures and hybrid state-space/attention models (like IBM's Bamba 2, developed with CMU and Princeton), a 50B-parameter model can require less memory than a 34B model. Memory footprint per user session at a given context length is a more actionable metric for hardware and deployment decisions.
✓ Generative Computing via LoRA Adapters: IBM's "generative computing" framework wraps Granite models in a runtime that activates or deactivates specialized LoRA adapters (for hallucination detection, query rewriting, and uncertainty quantification) without touching base model weights. "Activated LoRAs" share the base model's KV cache, delivering 2–3x efficiency gains over standard LoRA invocations and enabling modular, programmable AI application architectures.
✓ Data Quality Over Data Volume: Granite 3.3 was measurably ahead of Granite 3.0 at the halfway point of training, solely due to data quality improvements. The field has shifted from maximizing model size and raw token count to curating high-quality data at every training stage: pre-training, mid-training, instruction tuning, and RL. IBM has released its data-cleaning pipeline as an open-source project called Data Prep Kit.
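
To make the majority-voting idea in the inference-time scaling takeaway concrete, here is a minimal Python sketch. It is not IBM's implementation, and it does not cover particle filtering. The `sample` callable is a hypothetical stand-in for one stochastic call to a small model; the point is only that compute scales with `n_samples`, which the caller can raise for hard tasks and keep at 1 for easy ones.

```python
import itertools
from collections import Counter
from typing import Callable


def majority_vote(answers: list[str]) -> str:
    """Return the most common answer; ties go to the answer seen first."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(a.strip() for a in answers)
    return counts.most_common(1)[0][0]


def answer_with_scaling(sample: Callable[[str], str], prompt: str, n_samples: int) -> str:
    """Draw n_samples independent answers and return the majority answer."""
    return majority_vote([sample(prompt) for _ in range(n_samples)])


# Hypothetical sampler: cycles through canned answers instead of calling a model.
_canned = itertools.cycle(["42", "41", "42", "42", "40"])


def fake_sample(prompt: str) -> str:
    return next(_canned)


print(answer_with_scaling(fake_sample, "What is 6 * 7?", n_samples=5))  # prints 42
```

Exact-match voting like this only works when answers are short and canonical (a number, a label, a test verdict), which is one reason the technique is usually reported on code and math benchmarks.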

What It Covers

IBM Research VP Sriram Raghavan explains why IBM trains its Granite models — currently 2B and 8B parameters — directly using reinforcement learning rather than distilling from larger models, and how combining direct RL training with inference-time scaling allows small models to match GPT-4o and Claude 3.5 on code and math benchmarks at a fraction of the cost.

Key Questions Answered

•Multi-Agent Architecture for Enterprise AI: IBM structures enterprise AI as collections of fit-for-purpose small models and agents rather than one large generalist system. Granite 3.3 ships alongside a 2B document-processing model, a speech model, and a vision model. Agents are being built specifically for COBOL/Java software development, IT automation (via Instana, Turbonomic, Apptio), and physical asset management via Maximo — with agent observability, security delegation, and lifecycle governance as active research priorities.
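
The "memory footprint per user session" metric from the takeaways can be illustrated with the standard key-value cache sizing arithmetic for attention layers. The sketch below is an illustration under stated assumptions: both configurations are hypothetical (they are not Granite or Bamba 2 figures), values are stored in 16 bits, and the small fixed-size state held by state-space layers is ignored.

```python
def kv_cache_bytes(
    n_attention_layers: int,
    n_kv_heads: int,
    head_dim: int,
    context_len: int,
    bytes_per_value: int = 2,  # assumption: 16-bit cache entries
) -> int:
    """Per-session KV cache size: one key and one value vector per token, head, and layer."""
    return 2 * n_attention_layers * n_kv_heads * head_dim * context_len * bytes_per_value


GIB = 2**30

# Hypothetical full-attention model: every one of 32 layers keeps a KV cache.
full_attention = kv_cache_bytes(32, 8, 128, context_len=8192)

# Hypothetical hybrid model: only 4 of its layers are attention layers.
hybrid = kv_cache_bytes(4, 8, 128, context_len=8192)

print(f"{full_attention / GIB:.3f} GiB")  # prints 1.000 GiB
print(f"{hybrid / GIB:.3f} GiB")          # prints 0.125 GiB
```

Because this term grows with context length and with the number of concurrent sessions, while the weights are loaded once, two models with similar parameter counts can differ widely in the hardware they need.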

Notable Moment

Raghavan argues that the massive English-language prompts used in today's agent systems are essentially poorly written software programs. Decomposing those prompts into actual programming abstractions — declared instructions, explicit requirements, external verification functions — allows a small model to match a large model's output while eliminating the need to train the model on every formatting rule.
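
As a rough illustration of that decomposition, the Python sketch below separates a declared instruction, explicit requirements, and external verification functions, then retries until the checks pass. It is not IBM's framework or API: `Requirement`, `Task`, `run`, and the `generate` callable are all hypothetical names chosen for this example.

```python
import json
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Requirement:
    description: str              # stated to the model in plain language
    check: Callable[[str], bool]  # verified outside the model, in ordinary code


@dataclass
class Task:
    instruction: str
    requirements: list[Requirement] = field(default_factory=list)

    def render_prompt(self) -> str:
        lines = [self.instruction] + [f"- {r.description}" for r in self.requirements]
        return "\n".join(lines)

    def failed(self, output: str) -> list[str]:
        return [r.description for r in self.requirements if not r.check(output)]


def run(task: Task, generate: Callable[[str], str], max_attempts: int = 3) -> str:
    """Call the (hypothetical) model until every requirement verifies."""
    prompt = task.render_prompt()
    for _ in range(max_attempts):
        output = generate(prompt)
        failures = task.failed(output)
        if not failures:
            return output
        # Feed back only the unmet requirements instead of growing one giant prompt.
        unmet = "\n".join(f"- {f}" for f in failures)
        prompt = task.render_prompt() + "\nUnmet requirements:\n" + unmet
    raise RuntimeError("requirements not met after retries")


def is_json(text: str) -> bool:
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


task = Task(
    instruction="Summarize the ticket as a JSON object.",
    requirements=[Requirement("Output must be valid JSON", is_json)],
)
```

In this arrangement the formatting rule lives in `is_json` rather than in the model's training data, which is the trade the episode describes: the program enforces the rule, so the model does not have to have learned it.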
