#490 – State of AI in 2026: LLMs, Coding, Scaling Laws, China, Agents, GPUs, AGI
Read time: 2 min
Topics: Startups, Artificial Intelligence, Software Development
AI-Generated Summary
Key Takeaways
- ✓Chinese Open Model Strategy: DeepSeek trained its model for approximately $5 million at cloud rates, while Olmo 3 spent around $2 million on cluster rental, including time lost to engineering issues. Chinese companies release open-weight models primarily to gain international distribution: overseas users won't pay for API subscriptions to Chinese services because of security concerns, so free access buys influence rather than direct revenue.
- ✓Pretraining Cost Economics: Training costs are a small fraction of the cost of serving hundreds of millions of users. Renting a single GPU runs roughly $100 per day, and frontier labs operate millions of GPUs. Companies now optimize for smaller, more efficient models because recurring serving costs reach billions of dollars, making model-size reduction more valuable than raw capability gains from larger pretraining runs (a back-of-envelope cost sketch follows this list).
- ✓Reinforcement Learning Scaling: Post-training with reinforcement learning from verifiable rewards unlocked major capability gains in 2025, enabling tool use, multi-step reasoning, and better code generation. AI2's November model used five days of RL training; running RL for another 3.5 weeks ahead of the December release produced notable improvements, showing that scaling RL is a more cost-effective way to gain intelligence than expanding pretraining compute at current model sizes (a minimal sketch of a verifiable reward follows this list).
- ✓Data Quality Over Quantity: Olmo 3 achieved better performance with less training data than its predecessors by focusing on data quality and mixing ratios. Labs train classifiers on samples from sources like GitHub, Stack Exchange, and Wikipedia, then use linear regression to determine the optimal dataset composition for a set of target evaluations (a toy version of that regression follows this list). Synthetic data also includes OCR extraction from PDFs, which yields trillions of tokens, not just AI-generated content.
- ✓Architecture Convergence: Modern frontier models remain fundamentally similar to the GPT-2 architecture, with incremental tweaks such as mixture of experts, multi-head latent attention, and grouped-query attention (a small grouped-query attention sketch follows this list). The differentiation comes from systems work, including FP8 and FP4 training, managing distributed compute across 10,000 to 100,000 GPUs, and post-training algorithms, rather than novel architectural paradigms. Converting between model architectures mostly amounts to adding specific components to the base transformer.
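The cost claim above is easy to sanity-check with arithmetic. The sketch below assumes the ~$100 per GPU per day rental rate and the ~$5 million pretraining figure mentioned in the takeaways, but the run shapes (2,000 GPUs for 25 days, a 1,000,000-GPU serving fleet) are invented purely to illustrate why serving dwarfs training; they are not figures from the episode.

```python
# Back-of-envelope GPU economics. Fleet sizes below are illustrative
# assumptions, not numbers from the episode; only the ~$100/GPU/day rate
# and the ~$5M pretraining order of magnitude come from the takeaways.

GPU_COST_PER_DAY = 100          # rough rental rate for one datacenter GPU (USD)

# One-time pretraining run: a hypothetical ~2,000 GPUs for ~25 days,
# which lands at roughly $5M at cloud rates.
pretrain_gpus, pretrain_days = 2_000, 25
pretrain_cost = pretrain_gpus * pretrain_days * GPU_COST_PER_DAY

# Recurring serving fleet: assume ~1,000,000 GPUs kept busy year-round.
serving_gpus = 1_000_000
serving_cost_per_year = serving_gpus * 365 * GPU_COST_PER_DAY

print(f"Pretraining (one-time): ${pretrain_cost / 1e6:.0f}M")
print(f"Serving (per year):     ${serving_cost_per_year / 1e9:.1f}B")
print(f"Serving / pretraining:  {serving_cost_per_year / pretrain_cost:,.0f}x")
```

Under these assumptions serving costs exceed the training run by a factor of several thousand per year, which is why shaving model size pays off more than a bigger pretraining run.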
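"Verifiable rewards" means the RL signal comes from a programmatic check, such as an exact answer match or a passing unit test, rather than from a learned preference model. The toy task, test cases, and function names below are illustrative assumptions, not details of AI2's actual training setup.

```python
# Minimal sketch of a "verifiable reward": the reward is a programmatic check
# against ground truth, not a learned reward model. Task and names are invented.

def verifiable_reward(model_answer: str, ground_truth: str) -> float:
    """Return 1.0 if the model's final answer matches the known answer, else 0.0."""
    return 1.0 if model_answer.strip() == ground_truth.strip() else 0.0

def code_reward(candidate_fn, test_cases) -> float:
    """Binary reward for generated code: 1.0 only if every unit test passes."""
    try:
        return 1.0 if all(candidate_fn(x) == y for x, y in test_cases) else 0.0
    except Exception:
        return 0.0

# Example: score generated solutions to "reverse a string" against unit tests.
tests = [("abc", "cba"), ("", ""), ("ab", "ba")]
print(code_reward(lambda s: s[::-1], tests))   # 1.0 -> positive training signal
print(code_reward(lambda s: s, tests))         # 0.0 -> no reward
```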
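A minimal sketch of the data-mixing regression described above: run small proxy trainings at different source ratios, fit a least-squares model from ratios to an evaluation score, and read off which source the fit values most. The source list echoes the takeaway; the mixing ratios and scores are made up.

```python
# Sketch of data-mix regression: map mixing ratios over data sources to a
# downstream eval score with least squares. Ratios and scores are invented.
import numpy as np

sources = ["github", "stack_exchange", "wikipedia", "web_crawl"]

# Each row: mixing ratios (summing to 1) tried in a small proxy training run.
mixes = np.array([
    [0.40, 0.20, 0.10, 0.30],
    [0.20, 0.30, 0.20, 0.30],
    [0.10, 0.10, 0.40, 0.40],
    [0.30, 0.30, 0.10, 0.30],
    [0.25, 0.25, 0.25, 0.25],
])
eval_scores = np.array([0.61, 0.58, 0.52, 0.60, 0.57])  # target benchmark accuracy

# Least-squares fit: score ≈ mixes @ w, so w estimates each source's marginal value.
w, *_ = np.linalg.lstsq(mixes, eval_scores, rcond=None)
for name, weight in zip(sources, w):
    print(f"{name:>15}: {weight:+.3f}")

# Readout: upweight the highest-coefficient sources, subject to how many
# quality-filtered tokens each source can actually supply.
print("most valuable source under this fit:", sources[int(np.argmax(w))])
```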
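Of the tweaks named above, grouped-query attention is the simplest to show: several query heads share one key/value head, shrinking the KV cache that dominates serving memory. The shapes below are arbitrary toy values, not any real model's configuration.

```python
# Toy grouped-query attention (GQA): 8 query heads share 2 key/value heads,
# so the KV cache is 4x smaller than full multi-head attention.
import numpy as np

seq, d_head = 8, 16
n_q_heads, n_kv_heads = 8, 2
group = n_q_heads // n_kv_heads          # query heads per shared KV head

q = np.random.randn(n_q_heads, seq, d_head)
k = np.random.randn(n_kv_heads, seq, d_head)   # only n_kv_heads entries cached
v = np.random.randn(n_kv_heads, seq, d_head)

outputs = []
for h in range(n_q_heads):
    kv = h // group                      # which shared KV head this query head uses
    scores = q[h] @ k[kv].T / np.sqrt(d_head)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    outputs.append(weights @ v[kv])

out = np.stack(outputs)                  # (n_q_heads, seq, d_head), same as MHA output
print(out.shape)
```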
What It Covers
Sebastian Raschka and Nathan Lambert analyze the 2025 AI landscape following DeepSeek's breakthrough, comparing Chinese and US model development, examining scaling laws across pretraining and inference, discussing open versus closed models, and evaluating the technical architecture evolution from GPT-2 to current frontier models like Claude Opus 4.5 and GPT-5.
Notable Moment
Nathan Lambert reveals that he exclusively uses extended thinking modes across multiple models, running five simultaneous GPT-5 Pro queries for different research tasks such as finding papers or checking equations. He finds the non-thinking GPT-5 model has a higher error rate and worse tone, and refuses to use it despite its speed advantage, showing how power users prioritize marginal intelligence gains over convenience.