Scaling Agentic Inference Across Heterogeneous Compute with Zain Asgar - #757

December 2, 2025

48 min episode · 2 min read

Zain Asgar

Topics

Startups

AI-Generated Summary

Published Dec 31, 2025

Key Takeaways

✓ Workload Disaggregation Strategy: Gimlet splits agent workflows into granular components, assigns performance-critical pieces to premium hardware such as B200s, and offloads less critical tasks to lower-cost accelerators. Dynamic resource allocation lowers cost per token while still meeting SLA requirements.
✓ Kernel Optimization Performance: LLM-based automatic kernel synthesis delivers single-digit percentage improvements on mature H100 hardware, 20-40% gains on newer B200/RTX 6000 systems, and speedups of more than 2x on AMD, Intel, and Apple hardware, where optimization frameworks remain underdeveloped.
✓ Hardware Utilization Economics: Most GPU deployments run at only about 30% utilization, leaving roughly two-thirds of capacity idle. Heterogeneous orchestration captures the majority of the cost savings by packing workloads efficiently across hardware types according to compute cost, memory bandwidth, and capacity requirements.
✓ Multi-Agent Kernel Generation: The system tests with hardware in the loop. Supervisor agents generate candidate kernels, execute them on the target hardware with profiling and correctness checks, and iteratively optimize based on the performance data until convergence. Verified kernels are then cached offline.
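
The generate, check, profile, and cache loop in the last takeaway can be illustrated with a toy sketch. This is not Gimlet's implementation: the "kernels" here are plain Python functions standing in for LLM-generated GPU kernels, the candidate list is fixed rather than regenerated each round, and every name (`search`, `KERNEL_CACHE`, the `"toy-cpu"` hardware label) is hypothetical.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple

Kernel = Callable[[List[float]], float]


def reference(xs: List[float]) -> float:
    """Trusted but slow implementation used as the correctness oracle."""
    total = 0.0
    for x in xs:
        total += x * x
    return total


# Stand-ins for generated candidates; a real system would ask a model for these.
CANDIDATES: Dict[str, Kernel] = {
    "builtin_sum": lambda xs: sum(x * x for x in xs),
    "map_sum": lambda xs: sum(map(lambda x: x * x, xs)),
    "wrong_sum": lambda xs: sum(xs),  # deliberately incorrect candidate
}


def is_correct(kernel: Kernel, xs: List[float], tol: float = 1e-6) -> bool:
    """Compare a candidate against the reference within a relative tolerance."""
    expected = reference(xs)
    return abs(kernel(xs) - expected) <= tol * max(1.0, abs(expected))


def profile(kernel: Kernel, xs: List[float], repeats: int = 5) -> float:
    """Best-of-N wall-clock time, measured on the machine running this script."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        kernel(xs)
        best = min(best, time.perf_counter() - start)
    return best


def search(candidates: Dict[str, Kernel], xs: List[float]) -> Tuple[Optional[str], List[str]]:
    """Reject incorrect candidates, then return the fastest correct one."""
    best_name: Optional[str] = None
    best_time = float("inf")
    rejected: List[str] = []
    for name, kernel in candidates.items():
        if not is_correct(kernel, xs):
            rejected.append(name)
            continue
        elapsed = profile(kernel, xs)
        if elapsed < best_time:
            best_name, best_time = name, elapsed
    return best_name, rejected


INPUTS = [float(i % 7) for i in range(1000)]
best, rejected = search(CANDIDATES, INPUTS)

# Cache the verified winner, keyed by (operation, hardware label).
KERNEL_CACHE: Dict[Tuple[str, str], str] = {}
if best is not None:
    KERNEL_CACHE[("sum_of_squares", "toy-cpu")] = best
```

The design point the sketch tries to capture is ordering: correctness is checked before any timing counts, so a fast but wrong candidate can never win, and only verified results reach the cache.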

What It Covers

Zain Asgar explains how Gimlet Labs optimizes AI inference costs through heterogeneous compute orchestration, using workload disaggregation, MLIR compilation, and LLM-generated kernel optimization across NVIDIA, AMD, and Intel hardware platforms.
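
As a rough illustration of workload disaggregation, the sketch below places each stage of an agent workflow on the cheapest hardware tier that still meets its latency budget. The tiers, prices, speeds, stage names, and the greedy rule are all assumptions made up for the example, not details from the episode. The helper `effective_cost` shows the utilization arithmetic behind the 30% figure in the takeaways: hourly price divided by the fraction of capacity actually used.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Tier:
    name: str
    dollars_per_hour: float  # hypothetical price
    relative_speed: float    # throughput relative to the slowest tier


@dataclass(frozen=True)
class Stage:
    name: str
    base_latency_ms: float    # latency on a tier with relative_speed == 1.0
    latency_budget_ms: float  # SLA for this stage


def place(stage: Stage, tiers: List[Tier]) -> Tier:
    """Return the cheapest tier whose estimated latency fits the stage's budget."""
    feasible = [
        t for t in tiers
        if stage.base_latency_ms / t.relative_speed <= stage.latency_budget_ms
    ]
    if not feasible:
        raise ValueError(f"no tier meets the SLA for stage {stage.name!r}")
    return min(feasible, key=lambda t: t.dollars_per_hour)


def effective_cost(dollars_per_hour: float, utilization: float) -> float:
    """Cost per hour of useful work when only `utilization` of capacity is busy."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    return dollars_per_hour / utilization


# Illustrative numbers only.
TIERS = [
    Tier("premium", dollars_per_hour=6.0, relative_speed=4.0),
    Tier("mid", dollars_per_hour=2.0, relative_speed=2.0),
    Tier("budget", dollars_per_hour=1.0, relative_speed=1.0),
]
STAGES = [
    Stage("decode", base_latency_ms=400.0, latency_budget_ms=120.0),
    Stage("tool_call_parsing", base_latency_ms=150.0, latency_budget_ms=100.0),
    Stage("embedding", base_latency_ms=80.0, latency_budget_ms=100.0),
]

placement: Dict[str, str] = {s.name: place(s, TIERS).name for s in STAGES}
for stage_name, tier_name in placement.items():
    print(f"{stage_name} -> {tier_name}")
```

With these made-up numbers, only the latency-critical stage lands on the premium tier. A production scheduler would also have to account for memory bandwidth, capacity, and packing several stages onto shared devices, which this greedy per-stage rule ignores.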

Notable Moment

Asgar observes that AI training infrastructure has regressed to the supercomputer era, with fully vertically integrated rack-scale systems reaching 600 kilowatts, while inference workloads benefit from disaggregated, commodity-hardware approaches that enable sustainable scaling.