
Sebastian Raschka

3 episodes · 3 podcasts


All Appearances


AI Summary

→ WHAT IT COVERS

Sebastian Raschka, independent LLM researcher, joins Sam Charrington to assess the LLM landscape in early 2026. They cover reasoning model advances, inference-time scaling techniques, the rise of agentic tools like OpenClaw, practical workflow automation using LLMs, and what to expect from post-training research through the rest of 2026.

→ KEY INSIGHTS

- **Post-training vs. pre-training R&D shift:** Research teams are now concentrating resources on post-training techniques rather than pre-training because low-hanging fruit remains in reinforcement learning and reasoning pipelines. Pre-training is already highly optimized — more data and better data mixes yield diminishing returns — while post-training algorithms like GRPO still have significant room for improvement through relatively accessible algorithmic tweaks.
- **Verifiable rewards as the reasoning engine:** DeepSeek R1's breakthrough relied on training models on math and code problems where correctness can be verified deterministically, using tools like SymPy for symbolic math comparison, or code compilers. This eliminates the need for human evaluators, enabling tens of thousands of answers to be generated and scored cheaply. Extending verifiable rewards to domains like drug design or protein structure modeling is the next frontier.
- **Inference-time scaling via self-consistency and self-refinement:** Two concrete techniques boost model accuracy without retraining. Self-consistency generates multiple answers at varied temperatures and selects the most frequent one via majority vote (best-of-N). Self-refinement feeds a model's output back to itself or another model along with a rubric, prompting iterative correction. DeepSeek Math 3.2 demonstrated that scaling up both techniques enabled gold-level competition math performance from the same base model.
- **LLMs as tool-builders, not just task-doers:** The highest-leverage use of LLMs for technical users is building deterministic workflow tools — native apps, scripts, custom web tools — rather than invoking an LLM for every task directly. Raschka built macOS apps for podcast chapter-mark insertion and metadata extraction from arXiv links; Charrington built a podcast analytics pipeline. Using LLMs to create deterministic tools avoids hallucination risk on repetitive structured tasks.
- **Agentic systems require model fine-tuning for multi-agent environments:** Current agentic tools like OpenClaw and Claude Code use standard LLMs not specifically trained for multi-agent interaction. OpenAI's Codex backend, by contrast, is a fork of GPT-5.3 fine-tuned specifically for agentic coding tasks. Raschka predicts major labs will fine-tune dedicated agent models for multi-agent settings, much as Codex diverged from its base model, improving reliability in looped, tool-using pipelines.
- **Mixture-of-experts and multi-head latent attention define 2025–2026 architecture trends:** DeepSeek V3's architecture — combining mixture-of-experts with multi-head latent attention (MLA) — became the dominant template, adopted by Kimi (scaled to 1 trillion parameters) and Mistral AI. MLA compresses the key-value cache via a low-rank projection (similar in spirit to LoRA), trading compute for memory efficiency. DeepSeek's sparse attention mechanism further reduces quadratic scaling costs, making these the practical, production-proven architectural choices to watch.

→ NOTABLE MOMENT

Raschka recounts attempting to add a dark mode to his personal website using an LLM, only to find that manually editing the CSS file himself was faster than iteratively prompting the model to reposition a misaligned button — illustrating that retained technical knowledge still outperforms LLM delegation on precise, structured tasks.
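The self-consistency technique described above is simple to sketch: sample several answers and keep the most frequent one. A minimal Python illustration, where `generate` is a hypothetical stand-in for an LLM sampling call (here simulated as a noisy solver):

```python
from collections import Counter
import random

def generate(prompt: str, temperature: float) -> str:
    # Stand-in for an LLM sampling call; a real system would query a model.
    # Simulates a solver that answers correctly about 70% of the time.
    return "42" if random.random() < 0.7 else random.choice(["41", "43"])

def self_consistency(prompt: str, n: int = 20) -> str:
    """Sample n answers at varied temperatures and return the majority vote."""
    answers = [generate(prompt, temperature=0.6 + 0.2 * (i % 3)) for i in range(n)]
    winner, _count = Counter(answers).most_common(1)[0]
    return winner
```

With enough samples, the majority vote converges on the model's most probable answer, which is why accuracy improves without any retraining.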
💼 SPONSORS
None detected

🏷️ Reasoning LLMs, Inference-Time Scaling, Agentic AI, Post-Training Techniques, Mixture of Experts, LLM Workflow Automation
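The verifiable-rewards idea from this episode can be made concrete: score a model's math answer by checking symbolic equivalence against a reference, with no human grader. A minimal sketch using SymPy, as mentioned in the summary (the function name `verify_math_answer` is my own):

```python
import sympy

def verify_math_answer(model_answer: str, reference: str) -> float:
    """Return reward 1.0 if the answers are symbolically equivalent, else 0.0."""
    try:
        # Equivalent expressions have a difference that simplifies to zero.
        diff = sympy.simplify(
            sympy.sympify(model_answer) - sympy.sympify(reference)
        )
        return 1.0 if diff == 0 else 0.0
    except (sympy.SympifyError, TypeError):
        return 0.0  # unparseable output earns no reward
```

Because the check is deterministic, tens of thousands of sampled answers can be scored cheaply, which is exactly what makes this reward signal usable for RL.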

AI Summary

→ WHAT IT COVERS

Nathan Lambert, Sebastian Raschka, and Swyx analyze two converging AI stories: Anthropic's public accusation that Chinese labs — primarily MiniMax and DeepSeek — used distributed API accounts to extract training data, and OpenAI's formal deprecation of SWE-Bench Verified after discovering 59 unsolvable tasks and model memorization of benchmark solutions.

→ KEY INSIGHTS

- **Distillation Detection Limits:** Anthropic identified distillation attempts by analyzing account patterns, request volume, and traffic shifts — MiniMax nearly halved its API traffic the moment Anthropic released a new model version. However, distinguishing distillation from legitimate large-scale evaluation or customer chatbot usage remains technically ambiguous, creating a gray zone that terms-of-service enforcement cannot cleanly resolve.
- **Teacher-Student Model Mismatch:** The strongest model is not always the best distillation teacher. Open-weight models trained on Qwen outputs consistently outperform those trained on frontier API outputs, likely because token probability distributions must align between teacher and student. Labs should run ablations across multiple teacher models rather than defaulting to the highest-capability model available.
- **SWE-Bench Verified Collapse:** OpenAI's audit of its own 500-task curated benchmark found that 59 tasks were entirely unsolvable due to flawed test specifications — tasks that had passed three rounds of human verification. Practitioners should treat any benchmark saturating above 80% across diverse model sizes as likely compromised, regardless of how many human verification rounds it underwent.
- **Benchmark Memorization as Canary:** GPT-5's chain-of-thought reasoning on SWE-Bench tasks included knowledge of Django API versions that did not yet exist when the benchmark problems were written, revealing training data contamination.
Benchmark designers should embed deliberately unsolvable "honeypot" tasks — problems with no valid solution — to detect memorization rather than genuine reasoning capability.
- **SWE-Bench Pro Structural Fixes:** The replacement benchmark, SWE-Bench Pro, addresses three core flaws: it draws from GitHub issues more recent than the 2022–2023 window, maintains a private test set requiring answer submission rather than data download, and diversifies across more repositories and programming languages. Evaluators submitting to SWE-Bench Pro send only model outputs; Scale AI runs scoring server-side to prevent data leakage.

→ NOTABLE MOMENT

OpenAI researchers prompted competing models with only a benchmark task ID — no problem statement — and the models reproduced the full problem description and solution verbatim, confirming that benchmark content had been absorbed wholesale into model weights during pretraining on public GitHub data.

💼 SPONSORS
None detected

🏷️ LLM Distillation, SWE-Bench, Benchmark Contamination, AI Evaluation, Model Training Data
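The teacher-student mismatch point above turns on aligning token probability distributions. The standard distillation objective makes this concrete: a KL divergence between the teacher's and student's next-token distributions. A minimal NumPy sketch of that loss (illustrative only, not any lab's actual pipeline):

```python
import numpy as np

def softmax(logits: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_kl(teacher_logits, student_logits, temperature: float = 2.0) -> float:
    """KL(teacher || student) over next-token distributions, averaged over positions."""
    p = softmax(np.asarray(teacher_logits, dtype=float), temperature)
    q = softmax(np.asarray(student_logits, dtype=float), temperature)
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1)))
```

A teacher whose output distribution sits far from what the student can represent yields a large, hard-to-minimize KL, which is one intuition for why the "strongest" teacher is not automatically the best one.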

AI Summary

→ WHAT IT COVERS

Sebastian Raschka and Nathan Lambert analyze the 2025 AI landscape following DeepSeek's breakthrough, comparing Chinese and US model development, examining scaling laws across pretraining and inference, discussing open versus closed models, and evaluating the technical architecture evolution from GPT-2 to current frontier models like Claude Opus 4.5 and GPT-5.

→ KEY INSIGHTS

- **Chinese Open Model Strategy:** DeepSeek trained their model for approximately $5 million at cloud rates, while Olmo 3 spent around $2 million on cluster rental, including engineering issues. Chinese companies release open-weight models primarily to gain international distribution where users won't pay for API subscriptions to Chinese services due to security concerns, creating influence through free access rather than direct revenue.
- **Pretraining Cost Economics:** Training costs are a small fraction of the cost of serving hundreds of millions of users. Renting a single GPU costs roughly $100 per day, and frontier labs operate millions of GPUs. Companies now optimize for smaller, more efficient models because recurring serving costs reach billions of dollars, making model size reduction more valuable than raw capability gains from larger pretraining runs.
- **Reinforcement Learning Scaling:** Post-training through reinforcement learning with verifiable rewards unlocked major capability gains in 2025, enabling tool use, multi-step reasoning, and better code generation. AI2's November model used five days of RL training, then ran another 3.5 weeks in December for notable improvements, demonstrating that RL scaling provides more cost-effective intelligence gains than expanding pretraining compute at current model sizes.
- **Data Quality Over Quantity:** Olmo 3 achieved better performance with less training data than its predecessors by focusing on data quality and mixing ratios.
Labs train classifiers on samples from different sources such as GitHub, Stack Exchange, and Wikipedia, then use linear regression to determine the optimal dataset composition for target evaluations. Synthetic data also includes OCR extraction from PDFs yielding trillions of tokens, not just AI-generated content.
- **Architecture Convergence:** Modern frontier models remain fundamentally similar to the GPT-2 architecture, with incremental tweaks like mixture of experts, multi-head latent attention, and grouped-query attention. The differentiation comes from systems optimization, including FP8 and FP4 training, distributed compute management across 10,000–100,000 GPUs, and post-training algorithms, rather than novel architectural paradigms. Converting between model architectures requires only adding specific components to the base transformer.

→ NOTABLE MOMENT

Nathan Lambert reveals he exclusively uses extended thinking modes across multiple models, running five simultaneous GPT-5 Pro queries for different research tasks such as finding papers or checking equations. He finds the non-thinking GPT-5 model has higher error rates and poor tone and refuses to use it despite its speed advantage, demonstrating how power users prioritize marginal intelligence gains over convenience.

💼 SPONSORS
- Box (https://box.com/ai)
- Quo (https://quo.com/lex)
- Uplift Desk (https://upliftdesk.com/lex)
- Fin (https://fin.ai/lex)
- Shopify (https://shopify.com/luxe)
- CodeRabbit (https://coderabbit.ai/lex)
- Element (https://drinkelement.com/lex)
- Perplexity (URL not specified)

🏷️ Scaling Laws, Open Weight Models, Reinforcement Learning, AI Training Costs, Model Architecture
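The data-mixing procedure described above (run small ablations on different source mixes, then regress eval scores on mix proportions to pick a composition) can be illustrated with NumPy least squares. All numbers here are hypothetical, not Olmo 3's actual recipe:

```python
import numpy as np

# Hypothetical small-scale ablations: each row is the fraction of training
# tokens drawn from (GitHub, Stack Exchange, Wikipedia), summing to 1.0.
mixes = np.array([
    [0.6, 0.2, 0.2],
    [0.2, 0.6, 0.2],
    [0.2, 0.2, 0.6],
    [0.4, 0.4, 0.2],
])
# Target-eval score observed for each ablation run (made-up values).
scores = np.array([0.52, 0.58, 0.49, 0.56])

# Fit score ≈ mixes @ w by ordinary least squares.
w, *_ = np.linalg.lstsq(mixes, scores, rcond=None)

# Score candidate mixes under the fitted model and keep the best one.
candidates = np.array([[0.3, 0.5, 0.2], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]])
predicted = candidates @ w
best = candidates[np.argmax(predicted)]
```

The fitted weights act as per-source value estimates, so the chosen mix leans toward whichever source the small ablations suggest helps the target evaluation most.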
