#490 – State of AI in 2026: LLMs, Coding, Scaling Laws, China, Agents, GPUs, AGI
Read time
2 min
Topics
Startups, Artificial Intelligence, Software Development
AI-Generated Summary
Key Takeaways
- ✓Chinese Open Model Strategy: DeepSeek trained their model for approximately $5 million at cloud compute rates, while Olmo 3 spent around $2 million on cluster rental, including time lost to engineering issues. Chinese companies release open-weight models primarily to gain international distribution: users outside China won't pay for API subscriptions to Chinese services due to security concerns, so free access creates influence rather than direct revenue.
- ✓Pretraining Cost Economics: Training costs are a small fraction of the cost of serving hundreds of millions of users. Renting a GPU costs roughly $100 per day, so a thousand-GPU cluster runs on the order of $100,000 daily, and frontier labs operate millions of GPUs. Companies now optimize for smaller, more efficient models because recurring serving costs reach billions of dollars, making model-size reduction more valuable than raw capability gains from larger pretraining runs (a rough cost sketch follows this list).
- ✓Reinforcement Learning Scaling: Post-training through reinforcement learning with verifiable rewards unlocked major capability gains in 2025, enabling tool use, multi-step reasoning, and better code generation. AI2's November model used five days of RL training; a further 3.5 weeks of RL in December produced notable improvements, suggesting that scaling RL delivers more cost-effective intelligence gains than expanding pretraining compute at current model sizes (a minimal verifiable-reward sketch follows this list).
- ✓Data Quality Over Quantity: Olmo 3 achieved better performance with less training data than its predecessors by focusing on data quality and mixing ratios. Labs train quality classifiers on samples from sources like GitHub, Stack Exchange, and Wikipedia, then use linear regression to pick the dataset composition that best predicts scores on target evaluations (a small regression sketch follows this list). Synthetic data is not just model-generated content: it also includes OCR-extracted text from PDFs, which yields trillions of tokens.
- ✓Architecture Convergence: Modern frontier models remain fundamentally similar to the GPT-2 architecture, with incremental tweaks such as mixture of experts, multi-head latent attention, and grouped-query attention (a grouped-query attention sketch follows this list). Differentiation comes from systems optimization, including FP8 and FP4 training, managing distributed compute across 10,000-100,000 GPUs, and post-training algorithms, rather than from novel architectural paradigms. Converting from one architecture to another mostly amounts to adding or swapping these components on the base transformer.
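To make the training-versus-serving point concrete, here is a rough back-of-the-envelope sketch in Python. The per-GPU rate, cluster sizes, and run length are illustrative assumptions, not figures from the episode.

```python
# Back-of-the-envelope comparison of one-time training cost vs. recurring
# serving cost. All numbers below are illustrative assumptions.

gpu_day_cost = 100            # assumed ~$100 per GPU per day at cloud rates

# One-time pretraining run: assume 1,000 GPUs for 30 days.
training_cost = gpu_day_cost * 1_000 * 30              # $3,000,000 once

# Ongoing serving: assume 100,000 GPUs running all year.
serving_cost_per_year = gpu_day_cost * 100_000 * 365   # $3,650,000,000 per year

print(f"training (one-time): ${training_cost:,}")
print(f"serving (per year):  ${serving_cost_per_year:,}")
# Serving dwarfs training, which is why labs shrink models to cut per-token cost.
```

Even with these rough assumptions, a year of serving costs roughly a thousand times more than the training run, which is the economics behind optimizing for smaller models.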
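"Verifiable rewards" means the reward signal comes from checking an output against ground truth or tests rather than from a learned preference model. Here is a minimal sketch; the boxed-answer math format is an assumption for illustration, not a detail from the episode.

```python
# Minimal sketch of a "verifiable reward": score a completion by checking it
# against a known answer instead of asking a reward model. The \boxed{...}
# answer format is an illustrative assumption.
import re

def math_reward(model_output: str, reference_answer: str) -> float:
    """Return 1.0 if the final boxed answer matches the reference, else 0.0."""
    match = re.search(r"\\boxed\{([^}]*)\}", model_output)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == reference_answer.strip() else 0.0

# During RL, many sampled completions are scored this way and the policy is
# updated toward the completions that earn reward.
print(math_reward(r"Adding the terms gives \boxed{42}", "42"))  # 1.0
print(math_reward(r"The answer is 41", "42"))                   # 0.0
```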
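A minimal sketch of the mixing-ratio idea: run several small-scale trainings with different source proportions, record a target evaluation score for each, and fit a linear regression to decide which sources deserve a larger share. The ratios and scores below are made up for illustration.

```python
# Sketch of choosing a pretraining data mix via linear regression.
# Each row is one small-scale run; values are hypothetical.
import numpy as np

sources = ["github", "stack_exchange", "wikipedia"]

# Fraction of training tokens drawn from each source per run.
mix_ratios = np.array([
    [0.6, 0.2, 0.2],
    [0.3, 0.5, 0.2],
    [0.2, 0.2, 0.6],
    [0.4, 0.4, 0.2],
])
eval_scores = np.array([0.41, 0.44, 0.38, 0.46])  # e.g. held-out benchmark accuracy

# Plain least-squares fit: eval_score ≈ mix_ratios @ weights
weights, *_ = np.linalg.lstsq(mix_ratios, eval_scores, rcond=None)

for src, w in zip(sources, weights):
    print(f"{src:>14}: estimated contribution {w:+.2f}")
# Sources with higher weights get a larger share in the full-scale training mix.
```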
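As one example of the kind of incremental tweak mentioned above, here is a minimal grouped-query attention layer in PyTorch, in which many query heads share a smaller set of key/value heads to shrink the KV cache. The dimensions are illustrative, not those of any particular model.

```python
# Minimal grouped-query attention (GQA): 8 query heads share 2 KV heads.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupedQueryAttention(nn.Module):
    def __init__(self, d_model=512, n_q_heads=8, n_kv_heads=2):
        super().__init__()
        assert n_q_heads % n_kv_heads == 0
        self.n_q, self.n_kv = n_q_heads, n_kv_heads
        self.head_dim = d_model // n_q_heads
        self.q_proj = nn.Linear(d_model, n_q_heads * self.head_dim)
        self.k_proj = nn.Linear(d_model, n_kv_heads * self.head_dim)  # fewer KV heads
        self.v_proj = nn.Linear(d_model, n_kv_heads * self.head_dim)
        self.o_proj = nn.Linear(n_q_heads * self.head_dim, d_model)

    def forward(self, x):
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_q, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.n_kv, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.n_kv, self.head_dim).transpose(1, 2)
        # Repeat each KV head so every group of query heads shares one KV head.
        reps = self.n_q // self.n_kv
        k = k.repeat_interleave(reps, dim=1)
        v = v.repeat_interleave(reps, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(b, t, -1))

x = torch.randn(1, 16, 512)
print(GroupedQueryAttention()(x).shape)  # torch.Size([1, 16, 512])
```

The same pattern applies to the other tweaks: mixture of experts swaps the feed-forward block for a routed set of experts, and multi-head latent attention compresses the KV projections, but the surrounding transformer stays essentially GPT-2-shaped.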
What It Covers
Sebastian Raschka and Nathan Lambert analyze the 2025 AI landscape following DeepSeek's breakthrough, comparing Chinese and US model development, examining scaling laws across pretraining and inference, discussing open versus closed models, and evaluating the technical architecture evolution from GPT-2 to current frontier models like Claude Opus 4.5 and GPT-5.
Notable Moment
Nathan Lambert reveals that he relies exclusively on extended thinking modes across multiple models, running five simultaneous GPT-5 Pro queries for different research tasks such as finding papers or checking equations. He finds the non-thinking GPT-5 model has a higher error rate and poor tone and refuses to use it despite its speed advantage, illustrating how power users prioritize marginal intelligence gains over convenience.