The TWIML AI Podcast

How to Engineer AI Inference Systems with Philip Kiely - #766

54 min episode · 2 min read
Philip Kiely

Topics

Remote Work, Artificial Intelligence, Software Development

AI-Generated Summary

Key Takeaways

  • Inference research-to-production timeline: New inference techniques can move from research paper to production implementation in hours, not weeks. Baseten's team implemented a PoloQuant CUDA kernel 31 hours after the paper was published. Engineers should monitor inference research continuously, because a technique dismissed a year ago can become viable as soon as model scales shift.
  • Product maturity deployment cycle: Companies follow a predictable inference path. They start with per-token closed APIs, move to hyperscaler provisioned throughput (AWS Bedrock, GCP Vertex), then to dedicated inference providers like Baseten running models whose weights they own, and finally to in-house platforms. The trigger for each move is typically cost overruns or capacity constraints, not company size.
  • KV cache prefix optimization: A single token difference early in a prompt invalidates the cached KV entries for every token after it, which eliminates most of the reuse benefit. Engineers should structure system prompts and chat templates so that requests share the longest possible prefix and therefore hit the cache. This applies whether inference runs on owned infrastructure or on a third-party provider.
  • Task-model matching for agentic speed: Agents that make hundreds of model calls per user action need a specialized runtime for each task type. Running named entity recognition on a frontier LLM costs significantly more than running it on a specialized runtime. Baseten's NER runtime runs in 1 millisecond versus 500 milliseconds on a small LLM, a 500x difference that eliminates visible latency in agent pipelines.
  • Hopper GPU staying power: H100 rental prices are higher now than a year ago despite Blackwell availability. Hopper GPUs remain dominant for two reasons: open-source models from Chinese labs are optimized for the Hopper architecture because of export controls, and smaller models (1–8B parameters) run efficiently on MIG-partitioned Hopper slices without requiring full Blackwell NVL72 systems.
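The KV cache takeaway above can be illustrated with a small sketch. It is not taken from the episode: the whitespace tokenizer, the system prompt text, and the two prompt-building functions are all hypothetical stand-ins, chosen only to show how the position of per-request data changes the length of the prefix two requests share.

```python
from itertools import takewhile

# Hypothetical system prompt, used only for illustration.
SYSTEM_PROMPT = "You are a support assistant. Answer briefly and cite the docs."


def toy_tokenize(text: str) -> list[str]:
    # Assumption: a whitespace split stands in for a real tokenizer.
    return text.split()


def shared_prefix_len(a: list[str], b: list[str]) -> int:
    """Count the leading tokens that two sequences have in common."""
    return sum(1 for _ in takewhile(lambda pair: pair[0] == pair[1], zip(a, b)))


def prompt_dynamic_first(user_id: str, question: str) -> str:
    # Per-request data comes first, so prompts diverge almost immediately.
    return f"User {user_id}. {SYSTEM_PROMPT} Question: {question}"


def prompt_static_first(user_id: str, question: str) -> str:
    # Static text comes first and per-request data last, so the whole
    # system prompt is a prefix shared by every request.
    return f"{SYSTEM_PROMPT} User {user_id}. Question: {question}"


requests = [("alice", "How do I reset my password?"), ("bob", "Where is my invoice?")]

dynamic = [toy_tokenize(prompt_dynamic_first(u, q)) for u, q in requests]
static = [toy_tokenize(prompt_static_first(u, q)) for u, q in requests]

dynamic_shared = shared_prefix_len(dynamic[0], dynamic[1])
static_shared = shared_prefix_len(static[0], static[1])

print(f"dynamic-first shared prefix: {dynamic_shared} tokens")  # 1
print(f"static-first shared prefix:  {static_shared} tokens")   # 12
```

With the user ID placed first, only one token is reusable across the two requests. With the static system prompt placed first, all eleven of its tokens plus the following word are shared, which is the portion a prefix cache could reuse under these assumptions.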

What It Covers

Philip Kiely, Head of AI Education at Baseten, explains inference engineering as a discipline requiring expertise across CUDA programming, distributed systems, and applied research. He covers the maturity cycle from per-token APIs to dedicated deployments, hardware generations, and why inference optimization becomes critical at scale for agentic AI workloads.

Notable Moment

Kiely describes how text-to-speech models have a hard ceiling of roughly 80–100 tokens per second for real-time audio output. Beyond that threshold, the optimization goal flips entirely: engineers should increase batch size to serve more concurrent streams, or reduce hardware costs, rather than chase higher per-stream token throughput.
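The arithmetic behind that flip can be sketched as follows. This is a simplified model, not something described in the episode: it takes 100 tokens per second (the upper end of the quoted range) as the real-time rate, and it assumes a deployment's aggregate throughput divides evenly across streams, which real batching only approximates. The function names and the sample throughput figures are hypothetical.

```python
import math

# Assumption: upper end of the roughly 80-100 tokens/s range quoted above.
REALTIME_TOKENS_PER_S = 100.0


def realtime_factor(stream_tokens_per_s: float,
                    realtime_tokens_per_s: float = REALTIME_TOKENS_PER_S) -> float:
    """How many times faster than playback a single stream is generated.

    Anything above 1.0 is headroom the listener cannot perceive.
    """
    return stream_tokens_per_s / realtime_tokens_per_s


def max_realtime_streams(aggregate_tokens_per_s: float,
                         realtime_tokens_per_s: float = REALTIME_TOKENS_PER_S) -> int:
    """Concurrent streams that can each keep up with playback.

    Simplification: aggregate throughput is split evenly across streams.
    """
    return math.floor(aggregate_tokens_per_s / realtime_tokens_per_s)


# Hypothetical figures: one stream at 250 tokens/s, a GPU at 1000 tokens/s total.
single_stream_factor = realtime_factor(250.0)
streams_on_gpu = max_realtime_streams(1000.0)

print(f"single stream runs at {single_stream_factor}x real time")
print(f"same hardware could serve {streams_on_gpu} real-time streams")
```

Under these assumptions, a stream generated at 2.5x real time gains nothing audible, while the same capacity spread across a batch serves ten listeners.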
