NVIDIA AI Podcast

Snap’s Secret to Processing 10 Petabytes a Day: GPU-Accelerated Spark | NVIDIA AI Podcast Ep. 298

23 min episode · 2 min read

Topics

Fundraising & VC, Artificial Intelligence

AI-Generated Summary

Key Takeaways

  • GPU workload benchmarking by job type: Before committing to GPU acceleration, benchmark each distinct Spark job category separately. Snap found join-heavy jobs achieved 3x+ speedup, union jobs reached 2x, and aggregation jobs hit 1.5x — because CPUs already handle aggregations efficiently. Matching GPU investment to job type prevents overspending on workloads that won't benefit proportionally.
  • Zero-code migration via NVIDIA Spark RAPIDS: The NVIDIA RAPIDS Accelerator for Apache Spark integrates into existing PySpark workloads without code changes; only the environment and container image configuration need updating. For teams managing large Spark pipelines, this means GPU acceleration can be evaluated and deployed without rewriting jobs, dramatically reducing migration risk and engineering time.
  • Repurpose idle inference GPUs for batch workloads: Snap identified that online serving GPUs sat idle between 1AM and 5AM as major markets slept. By migrating batch Spark jobs onto Kubernetes-managed GKE clusters already hosting inference workloads, teams can reclaim unused GPU capacity at near-zero incremental cost, provided preemption logic returns resources immediately when live traffic spikes.
  • Build graceful fallback chains for production reliability: Snap engineered a three-tier fallback: GPU-accelerated Spark on GKE → CPU-based Spark on GKE → Dataproc clusters. NVIDIA Project Aether assisted by auto-tuning Spark parameters across environments, keeping performance consistent. Any team deploying GPU-accelerated pipelines should design explicit degradation paths before production launch to maintain SLA compliance during capacity constraints.
  • Quantify infrastructure savings across four dimensions: Snap's migration produced 76% job cost reduction, 62% fewer CPU cores required, 80% lower memory footprint, and elimination of 120 terabytes of disk and memory spill. When building the business case for GPU-accelerated Spark, measure all four metrics — not just runtime — to capture the full financial and operational impact for stakeholders.
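The "configuration-only" migration described above typically amounts to attaching the RAPIDS Accelerator jar and a handful of Spark conf keys to an existing job submission. A sketch of what that looks like (endpoint, jar version, and resource amounts are illustrative, not Snap's actual settings):

```shell
spark-submit \
  --master k8s://https://<gke-endpoint>:443 \
  --jars rapids-4-spark_2.12-<version>.jar \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.rapids.sql.enabled=true \
  --conf spark.executor.resource.gpu.amount=1 \
  --conf spark.task.resource.gpu.amount=0.25 \
  --conf spark.rapids.sql.explain=NOT_ON_GPU \
  existing_pyspark_job.py
```

The job script itself is untouched; `spark.rapids.sql.explain=NOT_ON_GPU` is useful during evaluation because it logs which operators fell back to the CPU.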
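The per-category speedups above can be combined into a blended estimate for a given workload mix, which is the arithmetic behind "matching GPU investment to job type." A minimal sketch, assuming a hypothetical mix of join, union, and aggregation runtime shares (the mix fractions are illustrative, not from the episode):

```python
# Per-category GPU speedups reported in the episode:
# joins 3x, unions 2x, aggregations 1.5x.
speedup = {"join": 3.0, "union": 2.0, "agg": 1.5}

# Hypothetical workload mix: each category's share of total CPU runtime.
mix = {"join": 0.5, "union": 0.3, "agg": 0.2}

# GPU runtime per category = CPU-time share / speedup; sum gives total
# GPU runtime relative to the CPU baseline of 1.0 (Amdahl-style blend).
gpu_runtime = sum(share / speedup[cat] for cat, share in mix.items())
overall_speedup = 1.0 / gpu_runtime
print(f"blended speedup: {overall_speedup:.2f}x")
```

A mix dominated by aggregations would pull the blended number toward 1.5x, which is why benchmarking each job category before committing hardware budget matters.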
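The three-tier degradation path (GPU on GKE → CPU on GKE → Dataproc) can be expressed as a simple ordered fallback chain. A minimal sketch with hypothetical runner names (the episode does not describe Snap's actual orchestration code):

```python
def run_with_fallback(job, runners):
    """Try each (name, runner) tier in order; return the first success."""
    errors = []
    for name, runner in runners:
        try:
            return name, runner(job)
        except RuntimeError as exc:  # e.g. GPU capacity exhausted
            errors.append((name, str(exc)))
    raise RuntimeError(f"all tiers failed: {errors}")

# Illustrative tiers mirroring the GPU-GKE -> CPU-GKE -> Dataproc chain.
def gpu_spark_on_gke(job):
    raise RuntimeError("no GPU capacity")  # simulate a capacity constraint

def cpu_spark_on_gke(job):
    return f"ran {job} on CPU Spark (GKE)"

def dataproc(job):
    return f"ran {job} on Dataproc"

tier, result = run_with_fallback(
    "experiment_agg",
    [("gpu-gke", gpu_spark_on_gke),
     ("cpu-gke", cpu_spark_on_gke),
     ("dataproc", dataproc)],
)
# With the GPU tier out of capacity, the job lands on the CPU-GKE tier.
```

The point of wiring this up before launch is that the SLA is carried by the chain as a whole, not by any single tier.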

What It Covers

Snap's head of engineering platforms, Pruevi Vatala, details how the company migrated its 10-petabyte-per-day A/B testing and experimentation pipeline to GPU-accelerated Apache Spark using the NVIDIA RAPIDS Accelerator for Apache Spark on Google Cloud, achieving a 76% cost reduction while serving nearly one billion monthly active users.


Notable Moment

Snap discovered that GPU capacity for its data pipelines already existed inside the company — sitting completely unused overnight on inference servers. Recognizing that a social platform's usage follows a daily cycle turned an infrastructure bottleneck into a solved problem without purchasing additional hardware.
