Software Engineering Daily

Open-Weight AI Models

50 min episode · 2 min read

Topics: Artificial Intelligence

AI-Generated Summary

Key Takeaways

  • Open-weight model selection: Roughly one-third of Fireworks customers arrive knowing exactly which model to deploy; one-third need cost and scalability guidance between two or three candidates; one-third rely fully on Fireworks evaluations. Knowing which category you fall into determines how much internal ML expertise you need before engaging an inference platform.
  • Speculative decoding for production: Fireworks trains custom speculator draft models specifically matched to each customer's fine-tuned target model, not generic open-source speculators. This pairing is critical for latency-sensitive workloads like Cursor's fast-apply feature, where a large file must be edited in one pass at high speed and low cost.
  • Reinforcement fine-tuning unlocks non-ML teams: RFT removes the need for MLE-managed data labeling pipelines. A product manager who can articulate what "good output" looks like can author a language-model-as-judge eval, send it to Fireworks, and trigger a training run. Vercel used this approach with two to three people and achieved 40x faster code-fixing with improved output quality.
  • Evals as compounding business assets: Unlike supervised fine-tuning datasets that require curation updates as models evolve, RL evaluation environments remain valid across model generations. Building evals now via Fireworks' open-source eval-protocol framework means the same asset used to benchmark models today can directly drive RFT training runs tomorrow without significant rework.
  • Multi-hardware supply chain strategy: Running inference across both NVIDIA and AMD hardware is primarily a supply-chain reliability decision, not a performance one. At peak demand, NVIDIA cards become unavailable at reasonable prices, so maintaining AMD support through Fireworks' custom in-house FireAttention kernels ensures uninterrupted capacity and competitive pricing for customers.
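
The language-model-as-judge idea in the RFT takeaway can be sketched in a few lines. This is not the eval-protocol API or Fireworks' implementation, just the shape of the pattern: a rubric prompt turns a product manager's notion of "good output" into a numeric score, and the batch average can serve as an RL reward. The rubric text, function names, and the stubbed judge are all illustrative.

```python
# Minimal LLM-as-judge sketch (illustrative; not the eval-protocol API).
# A rubric prompt converts a quality description into a 0/1 score.

JUDGE_RUBRIC = """You are grading a code fix.
Score 1 if the fix addresses the reported error, else 0.
Reply with only the digit."""


def judge_score(candidate_output, call_model):
    """Grade one candidate. `call_model` is any function(prompt) -> str,
    e.g. a thin wrapper around a chat-completion endpoint."""
    reply = call_model(f"{JUDGE_RUBRIC}\n\nCandidate:\n{candidate_output}")
    return 1 if reply.strip().startswith("1") else 0


def eval_batch(outputs, call_model):
    """Mean judge score over a batch; this scalar is what an RFT run
    would optimize against."""
    scores = [judge_score(o, call_model) for o in outputs]
    return sum(scores) / len(scores)


# Stub judge for demonstration: "accepts" outputs containing 'fixed'.
stub = lambda prompt: "1" if "fixed" in prompt else "0"
print(eval_batch(["fixed: null check added", "did nothing"], stub))  # 0.5
```

In practice the stub would be a real judge-model call, but the contract stays the same: outputs in, scalar reward out, no labeled dataset required.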

What It Covers

Fireworks AI cofounder Benny Chen explains how his company serves and customizes open-weight models at scale, processing over 13 trillion tokens daily. The episode covers custom inference kernels, speculative decoding, multi-hardware strategy across NVIDIA and AMD, reinforcement fine-tuning, and why evals represent a durable business asset.


Notable Moment

Chen noted that Fireworks launched roughly five to six months before ChatGPT shipped, betting on open-weight models when the best available options could barely sustain a three-turn conversation and had no function-calling capability — a position he describes as genuinely contrarian at the time.

