The TWIML AI Podcast

How to Engineer AI Inference Systems with Philip Kiely - #766

April 30, 2026

54 min episode · 2 min read

Philip Kiely

Topics

Remote Work, Artificial Intelligence, Software Development

AI-Generated Summary

Key Takeaways

✓ Inference research-to-production timeline: New inference techniques can move from research paper to production implementation in hours, not weeks. Baseten's team implemented a PoloQuant CUDA kernel 31 hours after the paper was published. Engineers should monitor inference research continuously, because techniques dismissed a year ago can become viable as soon as model scales shift.
✓ Product maturity deployment cycle: Companies follow a predictable inference path. They start with per-token closed APIs, move to hyperscaler provisioned throughput (AWS Bedrock, GCP Vertex), then to dedicated inference providers like Baseten with models whose weights they own, and finally to in-house platforms. The trigger for each move is typically cost overruns or capacity constraints, not company size.
✓ KV cache prefix optimization: A KV cache can only be reused for the portion of a prompt that exactly matches an earlier request, so a single token difference invalidates the cache from that point onward. A difference early in the sequence therefore eliminates almost all reuse. Engineers should structure system prompts and chat templates so that the content shared across requests comes first and the per-request content comes last, which maximizes cache hits. This applies whether inference runs on owned infrastructure or on a third-party provider.
✓ Task-model matching for agentic speed: Agents that make hundreds of model calls per user action need a specialized runtime for each task type. Running named entity recognition (NER) on a frontier LLM costs significantly more than running it on a specialized runtime. Baseten's NER runtime runs in 1 millisecond versus 500 milliseconds on a small LLM, a 500x difference that eliminates visible latency in agent pipelines.
✓ Hopper GPU staying power: H100 rental prices are higher now than a year ago despite Blackwell availability. Hopper GPUs remain dominant for two reasons: open-source models from Chinese labs are optimized for the Hopper architecture because of export controls, and smaller models (1–8B parameters) run efficiently on MIG-partitioned Hopper slices without requiring full Blackwell NVL72 systems.
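
The KV cache prefix takeaway above can be illustrated with a small sketch. It is not from the episode: the whitespace tokenizer, the prompt templates, and the names SYSTEM_PROMPT, prompt_variable_first and prompt_stable_first are all hypothetical stand-ins. It only counts how many leading tokens two prompts share, which is the most a prefix cache could reuse between them.

```python
from itertools import takewhile

# Hypothetical shared instructions; any fixed system prompt would do.
SYSTEM_PROMPT = "You are a support assistant. Answer briefly and cite the docs."


def tokenize(text: str) -> list[str]:
    # Stand-in tokenizer (whitespace split). A real model uses its own
    # tokenizer, so actual token counts will differ.
    return text.split()


def shared_prefix_len(a: list[str], b: list[str]) -> int:
    # Count leading positions where both token lists agree; reuse stops
    # at the first mismatch.
    return sum(1 for _ in takewhile(lambda pair: pair[0] == pair[1], zip(a, b)))


def prompt_variable_first(user: str, question: str) -> str:
    # Per-request data at the start: prompts diverge almost immediately.
    return f"User: {user}. {SYSTEM_PROMPT} Question: {question}"


def prompt_stable_first(user: str, question: str) -> str:
    # Shared instructions first, per-request data last.
    return f"{SYSTEM_PROMPT} User: {user}. Question: {question}"


requests = [("alice", "How do I reset my password?"),
            ("bob", "Where is my invoice?")]

bad = shared_prefix_len(*(tokenize(prompt_variable_first(u, q)) for u, q in requests))
good = shared_prefix_len(*(tokenize(prompt_stable_first(u, q)) for u, q in requests))

print(f"variable-first shares {bad} token(s); stable-first shares {good}")
```

With these two example requests, the variable-first template shares only the first token, while the stable-first template shares the whole system prompt. The same ordering principle carries over to real chat templates, although the exact savings depend on the tokenizer and on how the serving stack implements prefix caching.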

What It Covers

Philip Kiely, Head of AI Education at Baseten, explains inference engineering as a discipline that requires expertise across CUDA programming, distributed systems, and applied research. He covers the maturity cycle from per-token APIs to dedicated deployments, the current GPU hardware generations, and why inference optimization becomes critical at scale for agentic AI workloads.
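
To see why per-call latency matters for agentic workloads, the 1 millisecond and 500 millisecond figures quoted in the task-model matching takeaway can be multiplied out. This is a back-of-the-envelope sketch, not something shown in the episode: the call count of 200 is a hypothetical value for "hundreds of calls", and the calls are assumed to run strictly one after another with no batching or parallelism.

```python
def pipeline_latency_s(n_calls: int, per_call_ms: float) -> float:
    # Total wall-clock time for n sequential calls, in seconds.
    return n_calls * per_call_ms / 1000.0


N_CALLS = 200            # hypothetical: "hundreds" of calls per user action
SMALL_LLM_MS = 500.0     # per-call latency quoted in the summary
SPECIALIZED_MS = 1.0     # per-call latency quoted in the summary

llm_total = pipeline_latency_s(N_CALLS, SMALL_LLM_MS)
specialized_total = pipeline_latency_s(N_CALLS, SPECIALIZED_MS)

print(f"small LLM:           {llm_total:.1f} s")
print(f"specialized runtime: {specialized_total:.1f} s")
```

Under these assumptions the small-LLM path takes 100 seconds and the specialized runtime takes 0.2 seconds. Real pipelines overlap some calls, so the absolute numbers will be lower, but the ratio between the two paths stays the same.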

Notable Moment

Kiely describes how text-to-speech models have a hard ceiling of roughly 80–100 tokens per second for real-time audio output, because generating audio faster than it can be played back brings no benefit to the listener. Beyond that threshold the optimization goal flips entirely: engineers should increase batch size to serve more concurrent streams, or reduce hardware costs, rather than chase higher per-stream token throughput.
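
The trade-off described above can be sketched as simple arithmetic. This is an illustration under stated assumptions, not Kiely's method: the real-time rate uses the upper end of the 80–100 tokens per second range, the aggregate throughput of 1,250 tokens per second is a made-up measurement, and throughput is assumed to split evenly across streams.

```python
REALTIME_TOKENS_PER_S = 100.0  # upper end of the range quoted in the summary


def realtime_factor(stream_tokens_per_s: float,
                    realtime_rate: float = REALTIME_TOKENS_PER_S) -> float:
    # Values above 1.0 mean a stream generates faster than playback needs.
    return stream_tokens_per_s / realtime_rate


def max_realtime_streams(aggregate_tokens_per_s: float,
                         realtime_rate: float = REALTIME_TOKENS_PER_S) -> int:
    # How many concurrent streams can each be kept at the real-time rate,
    # assuming aggregate throughput divides evenly across streams.
    return int(aggregate_tokens_per_s // realtime_rate)


# Hypothetical measurement for one GPU serving a text-to-speech model.
aggregate = 1250.0

print(realtime_factor(300.0))           # one stream at 300 tok/s is 3x real time
print(max_realtime_streams(aggregate))  # streams that fit at the real-time rate
```

A single stream running at three times the real-time rate wastes capacity that could instead serve additional listeners, which is the reasoning behind raising batch size once the threshold is reached.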


