
Snap’s Secret to Processing 10 Petabytes a Day: GPU-Accelerated Spark | NVIDIA AI Podcast Ep. 298
NVIDIA AI Podcast | AI Summary
→ WHAT IT COVERS

Snap's head of engineering platforms, Pruevi Vatala, details how the company migrated its 10-petabyte-per-day A/B testing experimentation pipeline to GPU-accelerated Apache Spark using NVIDIA Spark RAPIDS on Google Cloud, achieving a 76% cost reduction while serving nearly one billion monthly active users.

→ KEY INSIGHTS

- **GPU workload benchmarking by job type:** Before committing to GPU acceleration, benchmark each distinct Spark job category separately. Snap found join-heavy jobs achieved 3x+ speedups, union jobs reached 2x, and aggregation jobs hit 1.5x — because CPUs already handle aggregations efficiently. Matching GPU investment to job type prevents overspending on workloads that won't benefit proportionally.
- **Zero-code migration via NVIDIA Spark RAPIDS:** NVIDIA Spark RAPIDS integrates into existing PySpark workloads without any code changes, requiring only environment and container-image configuration. For teams managing large Spark pipelines, this means GPU acceleration can be evaluated and deployed without rewriting jobs, dramatically reducing migration risk and engineering time during the transition.
- **Repurpose idle inference GPUs for batch workloads:** Snap identified that online serving GPUs sat idle between 1 AM and 5 AM as major markets slept. By migrating batch Spark jobs onto the GKE clusters already hosting inference workloads, teams can reclaim unused GPU capacity at near-zero incremental cost, provided preemption logic returns resources immediately when live traffic spikes.
- **Build graceful fallback chains for production reliability:** Snap engineered a three-tier fallback: GPU-accelerated Spark on GKE → CPU-based Spark on GKE → Dataproc clusters. NVIDIA Project Aether assisted by auto-tuning Spark parameters across environments, keeping performance consistent. Any team deploying GPU-accelerated pipelines should design explicit degradation paths before production launch to maintain SLA compliance during capacity constraints.
- **Quantify infrastructure savings across four dimensions:** Snap's migration produced a 76% job cost reduction, 62% fewer CPU cores required, an 80% lower memory footprint, and the elimination of 120 terabytes of disk and memory spill. When building the business case for GPU-accelerated Spark, measure all four metrics — not just runtime — to capture the full financial and operational impact for stakeholders.

→ NOTABLE MOMENT

Snap discovered that GPU capacity for its data pipelines already existed inside the company — sitting completely unused overnight on inference servers. Recognizing that a social platform's usage follows a daily cycle turned an infrastructure bottleneck into a solved problem without purchasing additional hardware.

💼 SPONSORS

None detected

🏷️ GPU-Accelerated Spark, NVIDIA Spark RAPIDS, Big Data Infrastructure, A/B Testing at Scale, Google Cloud Dataproc
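The "zero-code migration" point above amounts to configuration: the RAPIDS Accelerator loads as a Spark plugin, so existing PySpark jobs run unchanged. A minimal sketch of what that setup might look like, assuming the RAPIDS Accelerator jar is already on the cluster's classpath; the app name and GPU resource amounts are illustrative, not Snap's actual settings:

```python
from pyspark.sql import SparkSession

# Enable the RAPIDS Accelerator purely through configuration —
# no changes to the DataFrame/SQL code of existing jobs.
spark = (
    SparkSession.builder
    .appName("experimentation-pipeline")              # illustrative name
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
    .config("spark.rapids.sql.enabled", "true")
    .config("spark.executor.resource.gpu.amount", "1")   # 1 GPU per executor
    .config("spark.task.resource.gpu.amount", "0.25")    # 4 tasks share a GPU
    .getOrCreate()
)
```

Because acceleration is toggled by configuration alone, the same job can be A/B-compared on CPU and GPU simply by flipping `spark.rapids.sql.enabled`.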
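The three-tier fallback chain described in the key insights can be expressed as a simple priority cascade. A sketch of the pattern only — the backend names and submit functions below are hypothetical stand-ins for real cluster submissions, not Snap's implementation:

```python
def run_with_fallback(job, backends):
    """Try each backend in priority order; return (backend_name, result)
    from the first that succeeds. Raise only if every tier fails."""
    failures = []
    for name, submit in backends:
        try:
            return name, submit(job)
        except Exception as exc:
            failures.append(f"{name}: {exc}")
    raise RuntimeError("all backends failed: " + "; ".join(failures))

# Hypothetical submit functions standing in for real cluster submissions.
def submit_gpu_gke(job):
    raise RuntimeError("GPU capacity reclaimed by inference traffic")

def submit_cpu_gke(job):
    return f"ran {job['name']} on CPU Spark in GKE"

def submit_dataproc(job):
    return f"ran {job['name']} on Dataproc"

tiers = [
    ("gpu-gke", submit_gpu_gke),   # preferred: GPU Spark on GKE
    ("cpu-gke", submit_cpu_gke),   # degrade: CPU Spark on GKE
    ("dataproc", submit_dataproc), # last resort: Dataproc clusters
]
backend, result = run_with_fallback({"name": "ab-test-agg"}, tiers)
# The GPU tier fails here, so the job lands on the CPU tier.
```

Defining the degradation path explicitly, as a data structure, makes it testable before launch — which is exactly the pre-production design step the insight recommends.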