Benny Chen

Open-Weight AI Models

Apr 28, 2026 · 50 min · Cofounder of Fireworks AI

AI Summary

→ WHAT IT COVERS

Fireworks AI cofounder Benny Chen explains how his company serves and customizes open-weight models at scale, processing over 13 trillion tokens daily. The episode covers custom inference kernels, speculative decoding, multi-hardware strategy across NVIDIA and AMD, reinforcement fine-tuning, and why evals represent a durable business asset.

→ KEY INSIGHTS

- **Open-weight model selection:** Roughly one-third of Fireworks customers arrive knowing exactly which model to deploy; one-third need cost and scalability guidance between two or three candidates; one-third rely fully on Fireworks evaluations. Knowing which category you fall into determines how much internal ML expertise you need before engaging an inference platform.
- **Speculative decoding for production:** Fireworks trains custom speculator draft models specifically matched to each customer's fine-tuned target model, not generic open-source speculators. This pairing is critical for latency-sensitive workloads like Cursor's fast-apply feature, where a large file must be edited in one pass at high speed and low cost.
- **Reinforcement fine-tuning unlocks non-ML teams:** RFT removes the need for MLE-managed data labeling pipelines. A product manager who can articulate what "good output" looks like can author a language-model-as-judge eval, send it to Fireworks, and trigger a training run. Vercel used this approach with two to three people and achieved 40x faster code-fixing with improved output quality.
- **Evals as compounding business assets:** Unlike supervised fine-tuning datasets that require curation updates as models evolve, RL evaluation environments remain valid across model generations. Building evals now via Fireworks' open-source eval-protocol framework means the same asset used to benchmark models today can directly drive RFT training runs tomorrow without significant rework.
- **Multi-hardware supply chain strategy:** Running inference across both NVIDIA and AMD hardware is primarily a supply chain reliability decision, not a performance one. At peak demand, NVIDIA cards become unavailable at reasonable prices, so maintaining AMD kernel support through custom in-house fire-attention kernels ensures uninterrupted capacity and competitive pricing for customers.

→ NOTABLE MOMENT

Chen noted that Fireworks launched roughly five to six months before ChatGPT shipped, betting on open-weight models when the best available options could barely sustain a three-turn conversation and had no function-calling capability, a position he describes as genuinely contrarian at the time.

💼 SPONSORS

- TurboPuffer: https://turbopuffer.com/sed
- GuardSquare: https://www.guardsquare.com
- Unblocked: https://getunblocked.com/sedaily

🏷️ Open-Weight Models, Inference Infrastructure, Reinforcement Fine-Tuning, LLM Evaluation, AI Customization


Featured On 1 Podcast

Software Engineering Daily

