Snap’s Secret to Processing 10 Petabytes a Day: GPU-Accelerated Spark | NVIDIA AI Podcast Ep. 298
Episode: 23 min · Read time: 2 min
Topics: Fundraising & VC, Artificial Intelligence
AI-Generated Summary
Key Takeaways
- ✓GPU workload benchmarking by job type: Before committing to GPU acceleration, benchmark each distinct Spark job category separately. Snap found join-heavy jobs achieved 3x+ speedup, union jobs reached 2x, and aggregation jobs hit 1.5x — because CPUs already handle aggregations efficiently. Matching GPU investment to job type prevents overspending on workloads that won't benefit proportionally.
- ✓Zero-code migration via NVIDIA Spark RAPIDS: The RAPIDS Accelerator for Apache Spark plugs into existing PySpark workloads with no code changes; only the environment and container image need configuring. For teams managing large Spark pipelines, this means GPU acceleration can be evaluated and deployed without rewriting jobs, sharply reducing migration risk and engineering time during the transition.
- ✓Repurpose idle inference GPUs for batch workloads: Snap identified that online serving GPUs sat idle between 1AM and 5AM as major markets slept. By migrating batch Spark jobs onto Kubernetes-managed GKE clusters already hosting inference workloads, teams can reclaim unused GPU capacity at near-zero incremental cost, provided preemption logic returns resources immediately when live traffic spikes.
- ✓Build graceful fallback chains for production reliability: Snap engineered a three-tier fallback: GPU-accelerated Spark on GKE → CPU-based Spark on GKE → Dataproc clusters. NVIDIA Ether assisted by auto-tuning Spark parameters across environments, keeping performance consistent. Any team deploying GPU-accelerated pipelines should design explicit degradation paths before production launch to maintain SLA compliance during capacity constraints.
- ✓Quantify infrastructure savings across four dimensions: Snap's migration produced 76% job cost reduction, 62% fewer CPU cores required, 80% lower memory footprint, and elimination of 120 terabytes of disk and memory spill. When building the business case for GPU-accelerated Spark, measure all four metrics — not just runtime — to capture the full financial and operational impact for stakeholders.
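As a minimal sketch of the "zero-code" point above: enabling the RAPIDS Accelerator is a matter of Spark configuration rather than job rewrites. The property keys below are the plugin's documented switches, but the jar path, GPU amount, and the `apply_conf` helper are illustrative assumptions, not Snap's actual setup.

```python
# Hypothetical minimal configuration for GPU-accelerating an existing
# PySpark job via the RAPIDS Accelerator. The DataFrame/SQL code itself
# stays untouched; only this conf (and the container image) changes.
rapids_conf = {
    # Load the RAPIDS SQL plugin into the session.
    "spark.plugins": "com.nvidia.spark.SQLPlugin",
    # Master switch for GPU SQL execution; "false" falls back to CPU.
    "spark.rapids.sql.enabled": "true",
    # One GPU per executor (illustrative value).
    "spark.executor.resource.gpu.amount": "1",
    # Plugin jar baked into the container image (path is an assumption).
    "spark.jars": "/opt/sparkRapidsPlugin/rapids-4-spark.jar",
}

def apply_conf(builder, conf):
    """Apply each key to a SparkSession.builder-like object via .config()."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder
```

In a real deployment, `apply_conf(SparkSession.builder, rapids_conf)` would be followed by `.getOrCreate()`; keeping the conf as data makes it easy to toggle the GPU path per environment.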
What It Covers
Snap's head of engineering platforms, Pruevi Vatala, details how the company migrated its 10-petabyte-per-day A/B testing experimentation pipeline to GPU-accelerated Apache Spark using NVIDIA Spark RAPIDS on Google Cloud, achieving a 76% cost reduction while serving nearly one billion monthly active users.
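The three-tier fallback from the takeaways (GPU Spark on GKE → CPU Spark on GKE → Dataproc) follows a generic degradation pattern that can be sketched in plain Python. The backend names, submit functions, and `CapacityError` below are hypothetical stand-ins; a real implementation would catch the scheduler's actual capacity and preemption errors.

```python
# Sketch of a priority-ordered fallback chain: try each backend in turn,
# degrading to the next tier only on capacity failures.

class CapacityError(Exception):
    """Raised when a backend cannot schedule the job (e.g. no GPUs free)."""

def run_with_fallback(job, backends):
    """Try each (name, submit) pair in order; return (backend_name, result)."""
    errors = []
    for name, submit in backends:
        try:
            return name, submit(job)
        except CapacityError as exc:
            errors.append((name, exc))  # record and degrade to next tier
    raise RuntimeError(f"all backends exhausted: {errors}")

# Stub backends for illustration: the GPU tier is out of capacity,
# so the job lands on the CPU-on-GKE tier.
def gpu_gke_submit(job):
    raise CapacityError("no GPUs free in the inference fleet")

def cpu_gke_submit(job):
    return f"ran {job} on CPU Spark (GKE)"

def dataproc_submit(job):
    return f"ran {job} on Dataproc"

backends = [
    ("gpu-gke", gpu_gke_submit),
    ("cpu-gke", cpu_gke_submit),
    ("dataproc", dataproc_submit),
]
name, result = run_with_fallback("daily_ab_agg", backends)
```

Defining the chain as data makes the degradation path explicit and testable before launch, which is the episode's point about designing fallbacks ahead of SLA pressure.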
Notable Moment
Snap discovered that GPU capacity for its data pipelines already existed inside the company — sitting completely unused overnight on inference servers. Recognizing that a social platform's usage follows a daily cycle turned an infrastructure bottleneck into a solved problem without purchasing additional hardware.
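The overnight-reuse idea above reduces to a gating rule: batch jobs may borrow inference GPUs only inside the idle window (1AM to 5AM per the episode) and must yield outside it. This is a simplified sketch using naive local times; a production scheduler would gate on the serving fleet's live traffic signal rather than the clock alone.

```python
# Sketch of off-peak gating for batch jobs on inference GPUs.
# Window boundaries are the episode's 1AM-5AM lull; time-zone handling
# and traffic-based preemption are deliberately omitted.
from datetime import time

OFF_PEAK_START = time(1, 0)   # 1:00 AM local
OFF_PEAK_END = time(5, 0)     # 5:00 AM local

def batch_may_use_gpus(now: time) -> bool:
    """True only during the overnight lull on the inference fleet."""
    return OFF_PEAK_START <= now < OFF_PEAK_END
```

Making the window a pure function of time keeps the policy trivial to unit-test; the harder production requirement is the preemption path that returns GPUs the moment live traffic spikes.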
More from NVIDIA AI Podcast
- Harrison Chase of LangChain on Deep Agents, LangSmith, and Earning Trust | NVIDIA AI Podcast Ep. 297 (May 6 · 24 min)
- How Dassault Systèmes Is Building AI That Understands Physics - Ep. 296 (Apr 29 · 23 min)
- One Brain, Any Robot: Skild AI's Skild Brain Explained - Ep. 295
- How AI Will Change Quantum Computing - Ep. 294
- Building AI Factories: How Red Hat and NVIDIA Turn Enterprise Data Into Intelligence - Ep. 293
Similar Episodes
Related episodes from other podcasts:
- Equity (May 13): Amazon's Steve Schmidt on why your AI agents are your biggest security risk (Live at HumanX)
- Marketing School (May 13): Google Search Is Winning Again
- Bankless (May 13): Will AI Populism Decide the 2028 Election? | Jasmine Sun
- The Breakdown (May 13): Bhutan Times the Top, CLARITY Hits Markup, and the Onchain Pokemon Card Boom
- Foundr (May 13): 661: Donna’s Corporate Career Ended Overnight — So She Built A $51K Brand In 2 Months