→ WHAT IT COVERS Alex Bowcut, head of engineering at Sphere, explains how the company built TRAM, an AI system for sales tax compliance across global jurisdictions. The system combines semantic chunking, hybrid dense-sparse retrieval, and reinforcement fine-tuning to help tax experts work nearly two orders of magnitude faster than traditional manual methods.
Latest Insights
Key takeaways from recent episodes
Is RAG Dead? Lessons from Building AI for Tax Law with Alex Bowcut - #769
- ✓**Hybrid Retrieval Architecture:** Combining dense semantic embeddings (OpenAI models via Pinecone) with sparse TF-IDF-style full-text search measurably improves citation accuracy over dense-only retrieval. When sparse search was reintroduced after starting with dense alone, Sphere saw a clear accuracy increase on retrieval evals, particularly for jurisdiction-specific legal terminology that semantic embeddings alone failed to surface reliably.
- ✓**Semantic Chunking Over Naive Splitting:** Splitting legal documents by character count discards critical hierarchical context. Build document-type-specific parsers — separate ones for statutes, case law, and department bulletins — that cut at natural legal section boundaries while preserving parent-child hierarchy metadata. This hierarchy enables downstream passage expansion and accurate citation reconstruction, directly reducing determination errors.
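As a rough illustration of the hybrid retrieval takeaway above, the sketch below fuses a dense ranking and a sparse TF-IDF ranking with reciprocal rank fusion. The episode summary does not say how Sphere combines the two signals, so the fusion method, the function names, the sample documents, and the two-dimensional toy vectors (standing in for real embeddings) are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def sparse_ranking(query, docs):
    """Rank documents by a simple TF-IDF score; non-matching documents are omitted."""
    doc_counts = [Counter(tokenize(d)) for d in docs]
    n_docs = len(docs)
    scores = {}
    for idx, counts in enumerate(doc_counts):
        length = sum(counts.values()) or 1
        score = 0.0
        for term in set(tokenize(query)):
            df = sum(1 for c in doc_counts if term in c)
            if df and counts[term]:
                idf = math.log((1 + n_docs) / (1 + df)) + 1.0
                score += (counts[term] / length) * idf
        if score > 0:
            scores[idx] = score
    return sorted(scores, key=lambda i: (-scores[i], i))


def dense_ranking(query_vec, doc_vecs):
    """Rank documents by cosine similarity to the query vector."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    sims = [cosine(query_vec, v) for v in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: (-sims[i], i))


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several rankings; each list contributes 1 / (k + rank) per document."""
    fused = Counter()
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return sorted(fused, key=lambda i: (-fused[i], i))


# Hypothetical corpus and toy vectors, for illustration only.
DOCS = [
    "nexus thresholds for remote sellers",
    "sales tax exemption for agricultural equipment",
    "use tax on software subscriptions",
]
DOC_VECS = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]
QUERY = "agricultural exemption"
QUERY_VEC = [1.0, 0.0]

dense = dense_ranking(QUERY_VEC, DOC_VECS)
sparse = sparse_ranking(QUERY, DOCS)
hybrid = reciprocal_rank_fusion([dense, sparse])
```

In this toy setup the dense ranking alone puts document 0 first, while the exact-term match from the sparse side lifts document 1 to the top of the fused list. That mirrors the pattern described for jurisdiction-specific terminology, though real systems would tune the fusion on retrieval evals.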
Relational Foundation Models for Enterprise Data with Jure Leskovec - #768
- ✓**Multi-table vs. Single-table ML:** The real performance gap in enterprise ML is not between XGBoost and deep learning on a single table — it's the information lost when collapsing relational databases into flat tables. Aggregating transactions into summary statistics (mean, median, count) discards signal that graph neural networks recover by attending directly over raw multi-table data, producing double-digit accuracy improvements on tasks like fraud detection and churn prediction.
- ✓**Relational Foundation Model (Zero-Shot Prediction):** Kumo's RFM-2 performs accurate predictions on unseen databases and tasks without any model training. It uses in-context learning: the system extracts labeled subgraphs from historical data, passes them alongside an unlabeled target entity through a frozen transformer in a single forward pass, and returns a prediction in under half a second — no backpropagation, no hyperparameter tuning, no feature engineering required.
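The information loss described in the multi-table takeaway above can be seen in a small example. The sketch below is not Kumo's method and does not involve a graph neural network; it only shows, with hypothetical transaction rows, how two customers with opposite spending trends collapse to identical mean, median, and count features once the transactions table is flattened.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical transactions table: customer A escalates, customer B declines.
TRANSACTIONS = [
    {"customer_id": "A", "day": 1, "amount": 10.0},
    {"customer_id": "A", "day": 2, "amount": 50.0},
    {"customer_id": "A", "day": 3, "amount": 90.0},
    {"customer_id": "B", "day": 1, "amount": 90.0},
    {"customer_id": "B", "day": 2, "amount": 50.0},
    {"customer_id": "B", "day": 3, "amount": 10.0},
]


def raw_sequences(rows):
    """Keep each customer's amounts in time order (the raw relational signal)."""
    by_customer = defaultdict(list)
    for row in sorted(rows, key=lambda r: (r["customer_id"], r["day"])):
        by_customer[row["customer_id"]].append(row["amount"])
    return dict(by_customer)


def flat_features(rows):
    """Collapse each customer's transactions into summary statistics."""
    return {
        cid: {"mean": mean(amounts), "median": median(amounts), "count": len(amounts)}
        for cid, amounts in raw_sequences(rows).items()
    }


flat = flat_features(TRANSACTIONS)
raw = raw_sequences(TRANSACTIONS)
```

Here `flat["A"]` and `flat["B"]` are identical, so any single-table model trained on them cannot tell the two customers apart, while `raw["A"]` and `raw["B"]` still carry the ordering that a model attending over the raw rows could use.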
How to Find the Agent Failures Your Evals Miss with Scott Clark - #767
- ✓**Observability Hierarchy:** Structure agent observability across three layers: telemetry (logging raw traces), monitoring (tracking known signals like latency and tool call counts in real time), and analytics (discovering unknown failure patterns via unsupervised clustering). Most teams stop at monitoring, missing the analytics layer where the highest-value insights about agent behavior actually emerge.
- ✓**Trace Enrichment for Clustering:** Convert raw OpenTelemetry traces that follow the GenAI semantic conventions into structured numerical vectors capturing tool call sequences, response patterns, and LLM-scored evals. These vectors enable clustering across thousands of sessions to identify behavioral sub-populations, such as the 5% of traces where agents claim tool calls occurred but trace logs confirm they never executed.
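A minimal sketch of the trace enrichment step above, under several assumptions: traces are simplified to plain dictionaries, the `output_text` field and the "i called" keyword check are hypothetical stand-ins for a real claim detector or LLM-scored eval, and grouping by identical feature vector is used in place of a real clustering algorithm. The attribute names `gen_ai.operation.name` and `gen_ai.tool.name` follow the OpenTelemetry GenAI semantic conventions as I understand them.

```python
from collections import defaultdict

# Hypothetical, simplified traces; real OpenTelemetry spans carry many more fields.
TRACES = [
    {"trace_id": "t1", "spans": [
        {"gen_ai.operation.name": "chat",
         "output_text": "I called the search tool and found 3 results."},
        {"gen_ai.operation.name": "execute_tool", "gen_ai.tool.name": "search"},
    ]},
    {"trace_id": "t2", "spans": [
        {"gen_ai.operation.name": "chat",
         "output_text": "I called the search tool and found nothing."},
    ]},
    {"trace_id": "t3", "spans": [
        {"gen_ai.operation.name": "chat", "output_text": "Here is a summary."},
    ]},
]


def featurize(trace):
    """Map one trace to (chat spans, tool spans, claims tool use, claim without execution)."""
    spans = trace["spans"]
    chat = [s for s in spans if s.get("gen_ai.operation.name") == "chat"]
    tools = [s for s in spans if s.get("gen_ai.operation.name") == "execute_tool"]
    # Naive keyword check, assumed here purely for illustration.
    claims_tool = any("i called" in s.get("output_text", "").lower() for s in chat)
    return (len(chat), len(tools), int(claims_tool), int(claims_tool and not tools))


def group_by_signature(traces):
    """Stand-in for clustering: bucket traces that share a feature vector."""
    groups = defaultdict(list)
    for trace in traces:
        groups[featurize(trace)].append(trace["trace_id"])
    return dict(groups)


groups = group_by_signature(TRACES)
suspicious = [t["trace_id"] for t in TRACES if featurize(t)[3] == 1]
```

With real data the vectors would feed an unsupervised method such as k-means or HDBSCAN, and the interesting sub-populations would be discovered rather than hard-coded as a feature. The sketch only shows the shape of the enrichment step.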
How to Engineer AI Inference Systems with Philip Kiely - #766
- ✓**Inference research-to-production timeline:** New inference techniques move from research paper to production implementation in hours, not weeks. Baseten's team implemented a PoloQuant CUDA kernel 31 hours after the paper was published. Engineers should monitor inference research continuously, as techniques dismissed a year ago can become immediately viable when model scales shift.
- ✓**Product maturity deployment cycle:** Companies follow a predictable inference path: start with per-token closed APIs, then move to hyperscaler provisioned throughput (AWS Bedrock, GCP Vertex), then dedicated inference providers like Baseten with weight-owned models, then in-house platforms. The trigger is typically cost overruns or capacity constraints, not company size.
Recent Episode Summaries
20 AI-powered summaries available
→ WHAT IT COVERS Jure Leskovec, Stanford professor and Kumo AI cofounder, presents relational deep learning as a fundamental shift in enterprise ML — moving from single-table feature engineering to multi-table graph-based neural networks, culminating in a foundation model that makes accurate predictions on any database without model training. → KEY INSIGHTS - **Multi-table vs.
→ WHAT IT COVERS Scott Clark, cofounder of Distributional, explains why production AI agents require analytics beyond monitoring and evals. Using a "Maslow's hierarchy of observability" framework, he outlines how unsupervised learning on agent traces surfaces unknown failure patterns that standard evaluation pipelines systematically miss. → KEY INSIGHTS - **Observability Hierarchy:** Structure agent observability across three layers: telemetry (logging raw traces), monitoring (tracking known...
→ WHAT IT COVERS Philip Kiely, Head of AI Education at Baseten, explains inference engineering as a discipline requiring expertise across CUDA programming, distributed systems, and applied research. He covers the maturity cycle from per-token APIs to dedicated deployments, hardware generations, and why inference optimization becomes critical at scale for agentic AI workloads.
→ WHAT IT COVERS Rashmi Shetty, Senior Director of Enterprise Generative AI Platform at Capital One, explains how the company built and deployed Chat Concierge, a multi-agent car-buying system, and outlines the platform strategy enabling developers to build governed agentic systems at scale across the enterprise. → KEY INSIGHTS - **Multi-agent trigger criteria:** Deploy multi-agent architecture only when a problem contains multiple distinct user intents that cannot be resolved by a single...
→ WHAT IT COVERS Stefano Ermon, Stanford professor and Inception CEO, explains how diffusion language models work as an alternative to autoregressive LLMs, covering the technical path from image diffusion to text generation, Mercury 2's benchmark performance against frontier speed-optimized models, and why inference-time economics now favor the diffusion approach.
→ WHAT IT COVERS Blitzy CTO Siddhant Pardeshi explains how his company achieves autonomous software development at enterprise scale using agent swarms, knowledge graphs, and database-driven orchestration. The system writes millions of lines of validated, compiled, tested code autonomously, completing roughly 80% of development work in a single run across large production codebases.
→ WHAT IT COVERS Sebastian Raschka, independent LLM researcher, joins Sam Charrington to assess the LLM landscape in early 2026. They cover reasoning model advances, inference-time scaling techniques, the rise of agentic tools like OpenClaw, practical workflow automation using LLMs, and what to expect from post-training research through the rest of 2026. → KEY INSIGHTS - **Post-training vs.
→ WHAT IT COVERS Yejin Choi, professor at Stanford HAI, explores democratizing AI through small language models that match larger counterparts. She details synthetic data generation techniques, reinforcement learning during pretraining, and pluralistic alignment approaches. The conversation examines mode collapse in LLMs, the artificial hive mind phenomenon, and how academic research can make powerful AI accessible beyond resource-rich tech companies.
→ WHAT IT COVERS Nikita Rudin, CEO of Flexion Robotics, explains the gap between robotics demos and real-world deployment, covering simulation-to-reality challenges, reinforcement learning techniques, and why no humanoid robot generates actual economic value as of 2025. → KEY INSIGHTS - **Sim-to-Real Gap:** Closing the simulation-to-reality gap requires deep understanding of both worlds, mapping every software layer from high-level commands down to motor currents.
→ WHAT IT COVERS Aakanksha Chowdhery from Reflection explains why pretraining language models specifically for agentic capabilities requires rethinking attention mechanisms, loss objectives, and training data composition beyond current post-training approaches that optimize static benchmarks. → KEY INSIGHTS - **Pretraining for agents:** Current models train on static benchmarks like GLUE or GSM8K, but agentic tasks require interactive environment capabilities.
→ WHAT IT COVERS Munawar Hayat from Qualcomm AI Research discusses three NeurIPS papers addressing critical failures in vision language models: why they ignore visual input, physics-based generation limitations, and multi-person image generation challenges with proposed solutions. → KEY INSIGHTS - **Vision Token Attention Failure:** Vision language models attend poorly to visual tokens despite having images as input.
→ WHAT IT COVERS Zain Asgar explains how Gimlet Labs optimizes AI inference costs through heterogeneous compute orchestration, using workload disaggregation, MLIR compilation, and LLM-generated kernel optimization across NVIDIA, AMD, and Intel hardware platforms. → KEY INSIGHTS - **Workload Disaggregation Strategy:** Gimlet splits agent workflows into granular components, assigns performance-critical pieces to premium hardware like B200s, and offloads less critical tasks to lower-cost...
→ WHAT IT COVERS Devi Parikh, co-founder of Yutori, explains how AI browser agents will replace manual web interactions through proactive monitoring and automation, starting with Scouts, their product that monitors websites for user-specified information changes. → KEY INSIGHTS - **Visual-based browser navigation:** Training models on website screenshots rather than DOM information proves more reliable and generalizable across different sites, solving challenges like date pickers that plagued...
→ WHAT IT COVERS Robin Braun from HPE and Luke Norris from Kamiwaza discuss deploying AI orchestration for smart city operations in Vail, Colorado, focusing on back-office automation, website accessibility compliance, and deed restriction management using private infrastructure. → KEY INSIGHTS - **Back-office automation priority:** Fortune 500 companies and municipalities achieve fastest ROI by automating finance, HR, and procurement workflows first rather than customer-facing chatbots,...
→ WHAT IT COVERS Carina Hong, founder of Axiom, explains building AI mathematicians through formal verification using the Lean programming language, combining auto-formalization, theorem proving, and self-play systems to achieve mathematical reasoning with provable guarantees. → KEY INSIGHTS - **Data Scarcity Challenge:** Formal math has only 10 million Lean tokens versus one trillion Python tokens, creating a 100,000x data gap that requires auto-formalization and synthetic generation to bridge...
→ WHAT IT COVERS Hung Bui explains how VinAI Research achieved efficient on-device AI by training smaller models that match larger model performance, developing one-step diffusion for real-time image generation, and building Vietnam's top AI research lab. → KEY INSIGHTS - **Model Size Reduction:** A sub-4-billion parameter Vietnamese language model outperformed the original 7-billion parameter version by iterating over the same dataset multiple times during training and applying minor...
→ WHAT IT COVERS Alexandre Pesant, AI lead at Lovable, discusses vibe coding's evolution from GPT Engineer, scaling challenges reaching $100M ARR in eight months, the technical architecture behind AI-assisted development, and why nontechnical users can learn software building skills. → KEY INSIGHTS - **Vibe Coding Progression:** Users achieve better results by planning in chat mode before implementation, thinking through sequencing and architecture upfront, knowing when to stop failed attempts...
→ WHAT IT COVERS Kunle Olukotun explains how SambaNova's reconfigurable dataflow architecture achieves 5-10x better performance per watt for AI inference by eliminating instruction fetching, maximizing memory bandwidth utilization, and enabling microsecond model switching across trillion-parameter systems. → KEY INSIGHTS - **Dataflow vs Instructions:** Reconfigurable dataflow architectures configure hardware to match PyTorch computation graphs rather than fetching instructions each cycle, using...
→ WHAT IT COVERS Jacob Buckman explains power retention architecture for transformers, combining recurrence and attention to achieve linear scaling for long context processing while maintaining computational efficiency through balanced weight-state FLOP ratios and chunked algorithms. → KEY INSIGHTS - **State Size Balance:** Transformers have states 100,000x larger than LSTMs at long context, while RNNs have states too small.
Resources mentioned on The TWIML AI Podcast
Books, tools, and gear cited by guests across episodes we've summarized.
- Pinecone (tool, by Pinecone): cited in 1 episode of The TWIML AI Podcast
- OpenAI (tool, by OpenAI): cited in 1 episode of The TWIML AI Podcast
- Sphere (company, by Sphere): cited in 1 episode of The TWIML AI Podcast
- Cerebras (hardware, by Cerebras): cited in 1 episode of The TWIML AI Podcast
- Claude Code (tool, by Anthropic): cited in 1 episode of The TWIML AI Podcast
- OpenClaw (tool): cited in 1 episode of The TWIML AI Podcast
- PyTorch (tool): cited in 1 episode of The TWIML AI Podcast
- RFM-2 (product, by Kumo): cited in 1 episode of The TWIML AI Podcast