How to Engineer AI Inference Systems with Philip Kiely - #766
Episode: 54 min
Read time: 2 min
Topics: Remote Work, Artificial Intelligence, Software Development
AI-Generated Summary
Key Takeaways
- ✓ Inference research-to-production timeline: New inference techniques can move from research paper to production implementation in hours, not weeks. Baseten's team implemented a PoloQuant CUDA kernel 31 hours after the paper was published. Engineers should monitor inference research continuously, because techniques dismissed a year ago can become viable as soon as model scales shift.
- ✓ Product maturity deployment cycle: Companies follow a predictable inference path. They start with per-token closed APIs, move to hyperscaler provisioned throughput (AWS Bedrock, GCP Vertex), then to dedicated inference providers like Baseten running models whose weights they own, and finally to in-house platforms. The trigger for each move is typically cost overruns or capacity constraints, not company size.
- ✓ KV cache prefix optimization: A single token difference early in a prompt invalidates the cached KV entries for every token after it, so almost nothing can be reused. Engineers should structure system prompts and chat templates so that requests share the longest possible prefix, placing static content first and per-request content last. This applies whether inference runs on owned infrastructure or on a third-party provider.
- ✓ Task-model matching for agentic speed: Agents that make hundreds of model calls per user action need a specialized runtime for each task type. Running named entity recognition (NER) on a frontier LLM costs significantly more than running it on a specialized runtime. Baseten's NER runtime runs in 1 millisecond versus 500 milliseconds on a small LLM, a 500x difference that eliminates visible latency in agent pipelines.
- ✓ Hopper GPU staying power: H100 rental prices are higher now than a year ago despite Blackwell availability. Hopper GPUs remain dominant because open-source models from Chinese labs are optimized for Hopper architecture due to export controls, and smaller models (1–8B parameters) run efficiently on MIG-partitioned Hopper slices without requiring full Blackwell NVL72 systems.
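The KV cache takeaway above can be illustrated with a small sketch. It is not how any particular inference server implements prefix caching; it only measures how many leading tokens two prompts share, which is the part a prefix cache could reuse. The whitespace `tokenize` function, the `SYSTEM` text, and both prompt builders are hypothetical stand-ins chosen for illustration.

```python
from typing import List


def tokenize(text: str) -> List[str]:
    # Hypothetical stand-in for a real tokenizer: split on whitespace.
    return text.split()


def shared_prefix_len(a: List[str], b: List[str]) -> int:
    # Count leading tokens that are identical in both sequences.
    # Only this shared prefix is eligible for KV cache reuse.
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Assumed static system prompt (8 whitespace tokens).
SYSTEM = "You are a helpful assistant for billing questions."


def prompt_stable_first(request_id: str, user_msg: str) -> str:
    # Static content first, per-request content last.
    return f"{SYSTEM} request_id={request_id} User: {user_msg}"


def prompt_variable_first(request_id: str, user_msg: str) -> str:
    # Per-request content first: the prompts diverge at token 0.
    return f"request_id={request_id} {SYSTEM} User: {user_msg}"


good = shared_prefix_len(
    tokenize(prompt_stable_first("a1", "Why was I charged twice?")),
    tokenize(prompt_stable_first("b2", "Cancel my plan")),
)
bad = shared_prefix_len(
    tokenize(prompt_variable_first("a1", "Why was I charged twice?")),
    tokenize(prompt_variable_first("b2", "Cancel my plan")),
)

print(f"stable-first shares {good} tokens, variable-first shares {bad}")
```

With the static system prompt first, the two requests share the whole system prompt; with the request ID first, they share nothing, even though the prompts contain the same text.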
What It Covers
Philip Kiely, Head of AI Education at Baseten, explains inference engineering as a discipline requiring expertise across CUDA programming, distributed systems, and applied research. He covers the maturity cycle from per-token APIs to dedicated deployments, GPU hardware generations, and why inference optimization becomes critical at scale for agentic AI workloads.
Notable Moment
Kiely describes how text-to-speech models have a hard ceiling of roughly 80–100 tokens per second for real-time audio output. Beyond that threshold, generating tokens faster does not help the listener, so the optimization goal flips entirely: engineers should increase batch size to serve more concurrent streams, or reduce hardware costs, rather than chase higher per-stream token throughput.
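The batch-size trade-off described above can be sketched as a simple selection problem: given measured per-stream throughput at each batch size, pick the largest batch that still meets the real-time rate. The function name and the benchmark numbers below are hypothetical and not from the episode; only the roughly 100 tokens-per-second target comes from the summary.

```python
from typing import Dict, Optional


def largest_realtime_batch(
    per_stream_tps: Dict[int, float], realtime_tps: float
) -> Optional[int]:
    # Keep only batch sizes whose per-stream throughput still meets
    # the real-time requirement, then take the largest one.
    ok = [batch for batch, tps in per_stream_tps.items() if tps >= realtime_tps]
    return max(ok) if ok else None


# Hypothetical benchmark: batch size -> tokens/second seen by each stream.
measured = {1: 400.0, 4: 310.0, 8: 220.0, 16: 130.0, 32: 70.0}

best = largest_realtime_batch(measured, realtime_tps=100.0)
print(f"largest batch that stays real-time: {best}")
```

In this made-up table, batch size 1 delivers far more per-stream throughput than playback needs, while the larger batch serves many more concurrent streams on the same hardware and still keeps up with real time.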