Nathan Lambert

[LIVE] Anthropic Distillation & How Models Cheat (SWE-Bench Dead) | Nathan Lambert & Sebastian Raschka

Feb 26, 202652 min

AI Summary

→ WHAT IT COVERS Nathan Lambert, Sebastian Raschka, and Swyx analyze two converging AI stories: Anthropic's public accusation that Chinese labs — primarily MiniMax and DeepSeek — used distributed API accounts to extract training data, and OpenAI's formal deprecation of SWE-Bench Verified after discovering 59 unsolvable tasks and model memorization of benchmark solutions. → KEY INSIGHTS - **Distillation Detection Limits:** Anthropic identified distillation attempts by analyzing account patterns, request volume, and traffic shifts — MiniMax nearly halved its API traffic the moment Anthropic released a new model version. However, distinguishing distillation from legitimate large-scale evaluation or customer chatbot usage remains technically ambiguous, creating a gray zone that terms-of-service enforcement cannot cleanly resolve. - **Teacher-Student Model Mismatch:** The strongest model is not always the best distillation teacher. Open-weight models trained on Qwen outputs consistently outperform those trained on frontier API outputs, likely because token probability distributions must align between teacher and student. Labs should run ablations across multiple teacher models rather than defaulting to the highest-capability available model. - **SWE-Bench Verified Collapse:** OpenAI's audit of its own 500-task curated benchmark found 59 tasks were entirely unsolvable due to flawed test specifications — tasks that passed three rounds of human verification. Practitioners should treat any benchmark saturating above 80% across diverse model sizes as likely compromised, regardless of how many human verification rounds it underwent. - **Benchmark Memorization as Canary:** GPT-5's chain-of-thought reasoning on SWE-Bench tasks included knowledge of future Django API versions not available at the time the benchmark problems were written, revealing training data contamination. Benchmark designers should embed deliberately unsolvable "honeypot" tasks — problems with no valid solution — to detect memorization rather than genuine reasoning capability. - **SWE-Bench Pro Structural Fixes:** The replacement benchmark, SWE-Bench Pro, addresses three core flaws: it draws from more recent GitHub issues beyond the 2022–2023 window, maintains a private test set requiring answer submission rather than data download, and diversifies across more repositories and programming languages. Evaluators submitting to SWE-Bench Pro send only model outputs; Scale AI runs scoring server-side to prevent data leakage. → NOTABLE MOMENT OpenAI researchers prompted competing models with only a benchmark task ID — no problem statement — and the models reproduced the full problem description and solution verbatim, confirming that benchmark content had been absorbed wholesale into model weights during pretraining from public GitHub data. 💼 SPONSORS None detected 🏷️ LLM Distillation, SWE-Bench, Benchmark Contamination, AI Evaluation, Model Training Data

Read Full Summary Listen

#490 – State of AI in 2026: LLMs, Coding, Scaling Laws, China, Agents, GPUs, AGI

Lex Fridman Podcast

Feb 1, 2026Post Training Lead at Allen Institute for AI

AI Summary

→ WHAT IT COVERS Sebastian Raschka and Nathan Lambert analyze the 2025 AI landscape following DeepSeek's breakthrough, comparing Chinese and US model development, examining scaling laws across pretraining and inference, discussing open versus closed models, and evaluating the technical architecture evolution from GPT-2 to current frontier models like Claude Opus 4.5 and GPT-5. → KEY INSIGHTS - **Chinese Open Model Strategy:** DeepSeek trained their model for approximately $5 million at cloud rates, while Olmo 3 spent around $2 million for cluster rental including engineering issues. Chinese companies release open weight models primarily to gain international distribution where users won't pay for API subscriptions to Chinese services due to security concerns, creating influence through free access rather than direct revenue. - **Pretraining Cost Economics:** Training costs represent a small fraction compared to serving costs for hundreds of millions of users. A thousand GPU rental costs roughly $100 daily, while frontier labs operate millions of GPUs. Companies now optimize for smaller, more efficient models because recurring serving costs reach billions of dollars, making model size reduction more valuable than raw capability gains through larger pretraining runs. - **Reinforcement Learning Scaling:** Post-training through reinforcement learning with verifiable rewards unlocked major capability gains in 2025, enabling tool use, multi-step reasoning, and better code generation. AI2's November model used five days of RL training, then ran another 3.5 weeks in December for notable improvements, demonstrating that RL scaling provides more cost-effective intelligence gains than expanding pretraining compute at current model sizes. - **Data Quality Over Quantity:** Olmo 3 achieved better performance with less training data than predecessors by focusing on data quality and mixing ratios. Labs train classifiers on samples from different sources like GitHub, Stack Exchange, and Wikipedia, then use linear regression to determine optimal dataset composition based on target evaluations. Synthetic data includes OCR extraction from PDFs yielding trillions of tokens, not just AI-generated content. - **Architecture Convergence:** Modern frontier models remain fundamentally similar to GPT-2 architecture with incremental tweaks like mixture of experts, multi-head latent attention, and group query attention. The differentiation comes from systems optimization including FP8 and FP4 training, distributed compute management across 10,000-100,000 GPUs, and post-training algorithms rather than novel architectural paradigms. Converting between model architectures requires only adding specific components to the base transformer. → NOTABLE MOMENT Nathan Lambert reveals he exclusively uses extended thinking modes across multiple models, running five simultaneous GPT-5 pro queries for different research tasks like finding papers or checking equations. He finds the non-thinking GPT-5 model has higher error rates and poor tone, refusing to use it despite speed advantages, demonstrating how power users prioritize marginal intelligence gains over convenience. 💼 SPONSORS [{"name": "Box", "url": "https://box.com/ai"}, {"name": "Quo", "url": "https://quo.com/lex"}, {"name": "Uplift Desk", "url": "https://upliftdesk.com/lex"}, {"name": "Fin", "url": "https://fin.ai/lex"}, {"name": "Shopify", "url": "https://shopify.com/luxe"}, {"name": "CodeRabbit", "url": "https://coderabbit.ai/lex"}, {"name": "Element", "url": "https://drinkelement.com/lex"}, {"name": "Perplexity", "url": "not specified"}] 🏷️ Scaling Laws, Open Weight Models, Reinforcement Learning, AI Training Costs, Model Architecture

Read Full Summary Listen

Featured On 2 Podcasts

Latent Space

Lex Fridman Podcast

All Appearances

[LIVE] Anthropic Distillation & How Models Cheat (SWE-Bench Dead) | Nathan Lambert & Sebastian Raschka

AI Summary

#490 – State of AI in 2026: LLMs, Coding, Scaling Laws, China, Agents, GPUs, AGI

AI Summary

Explore More

Never miss Nathan Lambert's insights