NVIDIA AI Podcast

Snap’s Secret to Processing 10 Petabytes a Day: GPU-Accelerated Spark | NVIDIA AI Podcast Ep. 298

23 min episode · 2 min read

Topics

Fundraising & VC, Artificial Intelligence

AI-Generated Summary

Key Takeaways

  • GPU workload benchmarking by job type: Before committing to GPU acceleration, benchmark each distinct Spark job category separately. Snap found join-heavy jobs achieved 3x+ speedup, union jobs reached 2x, and aggregation jobs hit 1.5x — because CPUs already handle aggregations efficiently. Matching GPU investment to job type prevents overspending on workloads that won't benefit proportionally.
  • Zero-code migration via NVIDIA Spark RAPIDS: The NVIDIA RAPIDS Accelerator for Apache Spark integrates into existing PySpark workloads without code changes; only the environment and container image configuration need updating. For teams managing large Spark pipelines, this means GPU acceleration can be evaluated and deployed without rewriting jobs, dramatically reducing migration risk and engineering time.
  • Repurpose idle inference GPUs for batch workloads: Snap identified that online serving GPUs sat idle between 1AM and 5AM as major markets slept. By migrating batch Spark jobs onto Kubernetes-managed GKE clusters already hosting inference workloads, teams can reclaim unused GPU capacity at near-zero incremental cost, provided preemption logic returns resources immediately when live traffic spikes.
  • Build graceful fallback chains for production reliability: Snap engineered a three-tier fallback: GPU-accelerated Spark on GKE → CPU-based Spark on GKE → Dataproc clusters. NVIDIA Project Aether assisted by auto-tuning Spark parameters across environments, keeping performance consistent. Any team deploying GPU-accelerated pipelines should design explicit degradation paths before production launch to maintain SLA compliance during capacity constraints.
  • Quantify infrastructure savings across four dimensions: Snap's migration produced 76% job cost reduction, 62% fewer CPU cores required, 80% lower memory footprint, and elimination of 120 terabytes of disk and memory spill. When building the business case for GPU-accelerated Spark, measure all four metrics — not just runtime — to capture the full financial and operational impact for stakeholders.
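The "configuration-only" migration described above typically amounts to attaching the RAPIDS Accelerator jar and a handful of Spark conf keys to an existing job submission. A sketch of what that looks like (endpoint, jar version, and resource amounts are illustrative, not Snap's actual settings):

```shell
spark-submit \
  --master k8s://https://<gke-endpoint>:443 \
  --jars rapids-4-spark_2.12-<version>.jar \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.rapids.sql.enabled=true \
  --conf spark.executor.resource.gpu.amount=1 \
  --conf spark.task.resource.gpu.amount=0.25 \
  --conf spark.rapids.sql.explain=NOT_ON_GPU \
  existing_pyspark_job.py
```

The job script itself is untouched; `spark.rapids.sql.explain=NOT_ON_GPU` is useful during evaluation because it logs which operators fell back to the CPU.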
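The per-category speedups above can be combined into a blended estimate for a given workload mix, which is the arithmetic behind "matching GPU investment to job type." A minimal sketch, assuming a hypothetical mix of join, union, and aggregation runtime shares (the mix fractions are illustrative, not from the episode):

```python
# Per-category GPU speedups reported in the episode:
# joins 3x, unions 2x, aggregations 1.5x.
speedup = {"join": 3.0, "union": 2.0, "agg": 1.5}

# Hypothetical workload mix: each category's share of total CPU runtime.
mix = {"join": 0.5, "union": 0.3, "agg": 0.2}

# GPU runtime per category = CPU-time share / speedup; sum gives total
# GPU runtime relative to the CPU baseline of 1.0 (Amdahl-style blend).
gpu_runtime = sum(share / speedup[cat] for cat, share in mix.items())
overall_speedup = 1.0 / gpu_runtime
print(f"blended speedup: {overall_speedup:.2f}x")
```

A mix dominated by aggregations would pull the blended number toward 1.5x, which is why benchmarking each job category before committing hardware budget matters.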
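The three-tier degradation path (GPU on GKE → CPU on GKE → Dataproc) can be expressed as a simple ordered fallback chain. A minimal sketch with hypothetical runner names (the episode does not describe Snap's actual orchestration code):

```python
def run_with_fallback(job, runners):
    """Try each (name, runner) tier in order; return the first success."""
    errors = []
    for name, runner in runners:
        try:
            return name, runner(job)
        except RuntimeError as exc:  # e.g. GPU capacity exhausted
            errors.append((name, str(exc)))
    raise RuntimeError(f"all tiers failed: {errors}")

# Illustrative tiers mirroring the GPU-GKE -> CPU-GKE -> Dataproc chain.
def gpu_spark_on_gke(job):
    raise RuntimeError("no GPU capacity")  # simulate a capacity constraint

def cpu_spark_on_gke(job):
    return f"ran {job} on CPU Spark (GKE)"

def dataproc(job):
    return f"ran {job} on Dataproc"

tier, result = run_with_fallback(
    "experiment_agg",
    [("gpu-gke", gpu_spark_on_gke),
     ("cpu-gke", cpu_spark_on_gke),
     ("dataproc", dataproc)],
)
# With the GPU tier out of capacity, the job lands on the CPU-GKE tier.
```

The point of wiring this up before launch is that the SLA is carried by the chain as a whole, not by any single tier.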

What It Covers

Snap's head of engineering platforms, Pruevi Vatala, details how the company migrated its 10-petabyte-per-day A/B testing and experimentation pipeline to GPU-accelerated Apache Spark using the NVIDIA RAPIDS Accelerator for Apache Spark on Google Cloud, achieving a 76% cost reduction while serving nearly one billion monthly active users.


Notable Moment

Snap discovered that GPU capacity for its data pipelines already existed inside the company — sitting completely unused overnight on inference servers. Recognizing that a social platform's usage follows a daily cycle turned an infrastructure bottleneck into a solved problem without purchasing additional hardware.
