
Simon Moe

1 episode · 1 podcast

We have 1 summarized appearance for Simon Moe so far. Browse all podcasts to discover more episodes.

Featured On 1 Podcast

All Appearances

1 episode

AI Summary

→ WHAT IT COVERS

Simon Moe and Woosuk Kwon, cofounders of Infraact and creators of vLLM, explain how AI inference has evolved from a simple side project in 2022 into one of computing's most complex challenges, requiring sophisticated memory management, dynamic request scheduling, and support for an estimated 400,000-500,000 GPUs running diverse models on heterogeneous hardware architectures worldwide.

→ KEY INSIGHTS

- **PagedAttention Architecture:** vLLM addresses the fundamental problem that language-model requests vary dramatically in size, from single-word prompts to hundred-page documents, so serving requires dynamic batching and memory management rather than traditional static tensor operations. The system advances every concurrent request by one token per step and handles nondeterministic output lengths, since models decide their own stopping points rather than producing fixed-size outputs the way image-classification workloads do (a toy sketch follows this summary).
- **Open Source Scaling Model:** vLLM operates with 50+ full-time contributors and 2,000+ total contributors on GitHub, supported by model providers (Mistral, Hugging Face), hardware vendors (NVIDIA, AMD, Google, Intel), and infrastructure companies. This solves the M-times-N integration problem: each participant contributes to one universal layer instead of building separate integrations, with continuous-integration costs exceeding one million dollars annually to test every commit across deployment scenarios.
- **Agentic Workload Complexity:** Agent-based AI systems disrupt cache management because conversations extend to hundreds or thousands of turns, and external tool interactions (sandbox execution, web searches, Python scripts) create unpredictable wait times ranging from one second to hours. Traditional cache-eviction patterns fail when the system cannot tell whether an agent has finished thinking or is waiting on an external environment, so agent architecture must be co-optimized with the inference infrastructure.
- **Hardware-Model Co-Design:** Model architecture must be specialized for its compute target: designs optimized for NVIDIA H100 chips differ drastically from those for B200 or GB200 NVL72 systems, and differ again for TPUs. Vertical integration across data, model architecture, and hardware creates performance advantages that closed-source providers cannot deliver for diverse enterprise use cases with different context lengths, reasoning requirements, and deployment environments across heterogeneous infrastructure.
- **Deployment at Consumer Scale:** Amazon deploys vLLM to power the Rufus shopping assistant, processing every search query and bot interaction on its front-page feature. Character AI ran experimental speculative-decoding features on hundreds of GPUs before the code merged into the main branch, showing how production deployments adopt cutting-edge optimizations within days of initial implementation and requiring research-grade code to meet the reliability standards of features handling millions of consumer transactions (a minimal usage sketch appears after the tags below).

→ NOTABLE MOMENT

The first vLLM meetup in August 2023 drew so many attendees that Andreessen Horowitz's security team called to warn that the event exceeded safe capacity limits. Registration far surpassed the anticipated 10-20 people, demonstrating intense demand among systems engineers for inference-optimization knowledge, a narrow, sophisticated audience not typically known for attending in-person gatherings.
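The PagedAttention bullet above describes two mechanisms worth making concrete: KV-cache memory handed out in fixed-size blocks rather than one contiguous tensor per request, and a scheduler that advances every in-flight request by exactly one token per step. The Python sketch below is purely illustrative; the names (`PagedKVCache`, `Request`, `BLOCK_SIZE`, `step`) are invented for this example and are not vLLM's internal API.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative value)


@dataclass
class Request:
    """One generation request with a variable-length prompt."""
    rid: str
    prompt_len: int
    max_new_tokens: int
    generated: int = 0
    block_ids: list = field(default_factory=list)  # logical -> physical block map
    finished: bool = False

    @property
    def total_tokens(self) -> int:
        return self.prompt_len + self.generated


class PagedKVCache:
    """Toy block allocator: KV memory is granted in fixed-size blocks,
    so requests of wildly different lengths can share one GPU pool."""

    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))

    def ensure_capacity(self, req: Request) -> bool:
        """Allocate blocks until the request's tokens fit; False if out of memory."""
        needed = -(-req.total_tokens // BLOCK_SIZE)  # ceiling division
        while len(req.block_ids) < needed:
            if not self.free:
                return False
            req.block_ids.append(self.free.pop())
        return True

    def release(self, req: Request):
        self.free.extend(req.block_ids)
        req.block_ids.clear()


def step(batch: list, cache: PagedKVCache):
    """One scheduler step: every running request advances by exactly one token."""
    for req in batch:
        if req.finished:
            continue
        if not cache.ensure_capacity(req):
            continue  # a real engine would preempt or swap; here we just skip
        req.generated += 1  # stand-in for the model's forward pass + sampling
        # The model, not the scheduler, decides when a request stops.
        if req.generated >= req.max_new_tokens:
            req.finished = True
            cache.release(req)


if __name__ == "__main__":
    cache = PagedKVCache(num_blocks=64)
    batch = [
        Request("short", prompt_len=3, max_new_tokens=5),
        Request("long", prompt_len=400, max_new_tokens=40),
    ]
    steps = 0
    while not all(r.finished for r in batch):
        step(batch, cache)
        steps += 1
    print(f"finished after {steps} steps; free blocks: {len(cache.free)}")
```

Because memory is granted block by block, a three-token prompt and a hundred-page document can sit in the same batch without reserving worst-case space for either, which is the property the episode credits for vLLM's throughput gains.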
💼 SPONSORS: None detected
🏷️ TAGS: AI Inference, Open Source Infrastructure, GPU Optimization, Agentic AI, Distributed Systems
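For a sense of what "deploying vLLM" means at the smallest scale, here is a minimal offline-inference example using vLLM's Python API (`LLM` and `SamplingParams`). The model name and sampling settings are placeholders, not recommendations, and production deployments like those described above would typically run vLLM's OpenAI-compatible server behind their own routing layer instead.

```python
# Minimal offline inference with vLLM's Python API.
# pip install vllm  (most models require a CUDA-capable GPU)
from vllm import LLM, SamplingParams

# Placeholder model and sampling settings for illustration only.
llm = LLM(model="facebook/opt-125m")
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

prompts = [
    "What makes LLM inference hard to schedule?",
    "Explain paged memory in one sentence.",
]

# generate() batches the prompts internally using continuous batching.
for output in llm.generate(prompts, params):
    print(output.prompt)
    print("->", output.outputs[0].text.strip())
```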
