Benny Chen

Fireworks AI cofounder Benny Chen explains: open-weight model selection, speculative decoding for production, reinforcement fine-tuning for non-ML teams, and evals as compounding business assets
1 episode
1 podcast

We have 1 summarized appearance for Benny Chen so far. Browse all podcasts to discover more episodes.

All Appearances

1 episode
Software Engineering Daily

Open-Weight AI Models

50 min · Cofounder of Fireworks AI

AI Summary

→ WHAT IT COVERS

Fireworks AI cofounder Benny Chen explains how his company serves and customizes open-weight models at scale, processing over 13 trillion tokens daily. The episode covers custom inference kernels, speculative decoding, multi-hardware strategy across NVIDIA and AMD, reinforcement fine-tuning, and why evals represent a durable business asset.

→ KEY INSIGHTS

- **Open-weight model selection:** Roughly one-third of Fireworks customers arrive knowing exactly which model to deploy; one-third need cost and scalability guidance to choose between two or three candidates; one-third rely fully on Fireworks evaluations. Knowing which category you fall into determines how much internal ML expertise you need before engaging an inference platform.
- **Speculative decoding for production:** Fireworks trains custom speculator (draft) models matched to each customer's fine-tuned target model rather than relying on generic open-source speculators. This pairing is critical for latency-sensitive workloads like Cursor's fast-apply feature, where a large file must be edited in one pass at high speed and low cost.
- **Reinforcement fine-tuning unlocks non-ML teams:** RFT removes the need for MLE-managed data labeling pipelines. A product manager who can articulate what "good output" looks like can author a language-model-as-judge eval, send it to Fireworks, and trigger a training run. Vercel used this approach with two to three people and achieved 40x faster code-fixing with improved output quality.
- **Evals as compounding business assets:** Unlike supervised fine-tuning datasets, which require curation updates as models evolve, RL evaluation environments remain valid across model generations. Building evals now with Fireworks' open-source eval-protocol framework means the same asset used to benchmark models today can directly drive RFT training runs tomorrow without significant rework.
- **Multi-hardware supply chain strategy:** Running inference across both NVIDIA and AMD hardware is primarily a supply chain reliability decision, not a performance one. At peak demand, NVIDIA cards become unavailable at reasonable prices, so maintaining AMD support through custom in-house FireAttention kernels ensures uninterrupted capacity and competitive pricing for customers.

→ NOTABLE MOMENT

Chen noted that Fireworks launched roughly five to six months before ChatGPT shipped, betting on open-weight models when the best available options could barely sustain a three-turn conversation and had no function-calling capability, a position he describes as genuinely contrarian at the time.

💼 SPONSORS

- TurboPuffer: https://turbopuffer.com/sed
- GuardSquare: https://www.guardsquare.com
- Unblocked: https://getunblocked.com/sedaily

🏷️ Open-Weight Models, Inference Infrastructure, Reinforcement Fine-Tuning, LLM Evaluation, AI Customization
