Software Engineering Daily

Open-Weight AI Models

50 min episode · 2 min read

Topics: Artificial Intelligence

AI-Generated Summary

Key Takeaways

  • Open-weight model selection: Roughly one-third of Fireworks customers arrive knowing exactly which model to deploy; one-third need cost and scalability guidance between two or three candidates; one-third rely fully on Fireworks evaluations. Knowing which category you fall into determines how much internal ML expertise you need before engaging an inference platform.
  • Speculative decoding for production: Fireworks trains custom speculator draft models specifically matched to each customer's fine-tuned target model, not generic open-source speculators. This pairing is critical for latency-sensitive workloads like Cursor's fast-apply feature, where a large file must be edited in one pass at high speed and low cost.
  • Reinforcement fine-tuning unlocks non-ML teams: RFT removes the need for MLE-managed data labeling pipelines. A product manager who can articulate what "good output" looks like can author a language-model-as-judge eval, send it to Fireworks, and trigger a training run. Vercel used this approach with two to three people and achieved 40x faster code-fixing with improved output quality.
  • Evals as compounding business assets: Unlike supervised fine-tuning datasets that require curation updates as models evolve, RL evaluation environments remain valid across model generations. Building evals now via Fireworks' open-source eval-protocol framework means the same asset used to benchmark models today can directly drive RFT training runs tomorrow without significant rework.
  • Multi-hardware supply chain strategy: Running inference across both NVIDIA and AMD hardware is primarily a supply-chain reliability decision, not a performance one. At peak demand, NVIDIA cards become unavailable at reasonable prices, so maintaining AMD support through Fireworks' custom in-house FireAttention kernels ensures uninterrupted capacity and competitive pricing for customers.
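
The language-model-as-judge idea in the RFT takeaway can be sketched in a few lines. This is not the eval-protocol API or Fireworks' implementation, just the shape of the pattern: a rubric prompt turns a product manager's notion of "good output" into a numeric score, and the batch average can serve as an RL reward. The rubric text, function names, and the stubbed judge are all illustrative.

```python
# Minimal LLM-as-judge sketch (illustrative; not the eval-protocol API).
# A rubric prompt converts a quality description into a 0/1 score.

JUDGE_RUBRIC = """You are grading a code fix.
Score 1 if the fix addresses the reported error, else 0.
Reply with only the digit."""


def judge_score(candidate_output, call_model):
    """Grade one candidate. `call_model` is any function(prompt) -> str,
    e.g. a thin wrapper around a chat-completion endpoint."""
    reply = call_model(f"{JUDGE_RUBRIC}\n\nCandidate:\n{candidate_output}")
    return 1 if reply.strip().startswith("1") else 0


def eval_batch(outputs, call_model):
    """Mean judge score over a batch; this scalar is what an RFT run
    would optimize against."""
    scores = [judge_score(o, call_model) for o in outputs]
    return sum(scores) / len(scores)


# Stub judge for demonstration: "accepts" outputs containing 'fixed'.
stub = lambda prompt: "1" if "fixed" in prompt else "0"
print(eval_batch(["fixed: null check added", "did nothing"], stub))  # 0.5
```

In practice the stub would be a real judge-model call, but the contract stays the same: outputs in, scalar reward out, no labeled dataset required.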

What It Covers

Fireworks AI cofounder Benny Chen explains how his company serves and customizes open-weight models at scale, processing over 13 trillion tokens daily. The episode covers custom inference kernels, speculative decoding, multi-hardware strategy across NVIDIA and AMD, reinforcement fine-tuning, and why evals represent a durable business asset.


Notable Moment

Chen noted that Fireworks launched roughly five to six months before ChatGPT shipped, betting on open-weight models when the best available options could barely sustain a three-turn conversation and had no function-calling capability — a position he describes as genuinely contrarian at the time.

