Join us for an evening with the Bay Area Apache Spark Meetup, featuring tech talks about using Apache Spark at scale from Grammarly’s Mikhail Chernetsov, Adobe’s Mandeep Gandhi, and Databricks’ Tim Hunter.
6:30 - 7:00 pm Mingling & Refreshments
7:00 - 7:10 pm Opening Remarks & Introductions
7:10 - 7:45 pm Grammarly Tech Talk 1 from Mikhail Chernetsov
7:45 - 8:20 pm Adobe Tech Talk 2 from Mandeep Gandhi
8:20 - 8:25 pm Short Break
8:25 - 9:00 pm Databricks Tech Talk 3 from Tim Hunter
Grammarly: Tech-Talk 1: Apache Spark as a Platform for a Powerful Custom Analytics Data Pipeline
Abstract: As consumer Internet companies grow, they often prefer building a custom analytics data pipeline in-house over relying on third-party tools, in order to cope with an increasing scale of data, maintain high data quality, support custom data enrichment, and produce sophisticated reports.
During this talk, we want to share how we tackled this challenge here at Grammarly, why we chose Apache Spark as a primary tool for the task, and what we learned along the way, including several Spark tweaks and gotchas:
– Outputting data to several storage systems in a single Spark job
– Working with Spark’s memory model using a custom spillable data structure for data traversal
– Implementing a custom query language with parser combinators on top of the Spark SQL parser
– Building a custom query optimizer and analyzer when you need something beyond plain SQL
– Storing flexible-schema data and querying multi-schema data with schema conflicts
– Writing custom aggregation functions in Spark SQL
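As a taste of the last item, here is a minimal sketch (not taken from the talk) of a custom typed aggregation function registered for use from Spark SQL, using the `Aggregator` API available since Spark 3.0. The `geo_mean` name and the geometric-mean logic are purely illustrative assumptions:

```scala
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.functions.udaf

// Intermediate buffer: running sum of logs plus a count.
case class GeoMeanBuffer(logSum: Double, count: Long)

// A typed aggregator computing the geometric mean of a Double column.
object GeoMean extends Aggregator[Double, GeoMeanBuffer, Double] {
  def zero: GeoMeanBuffer = GeoMeanBuffer(0.0, 0L)
  def reduce(b: GeoMeanBuffer, x: Double): GeoMeanBuffer =
    GeoMeanBuffer(b.logSum + math.log(x), b.count + 1)
  def merge(a: GeoMeanBuffer, b: GeoMeanBuffer): GeoMeanBuffer =
    GeoMeanBuffer(a.logSum + b.logSum, a.count + b.count)
  def finish(b: GeoMeanBuffer): Double = math.exp(b.logSum / b.count)
  def bufferEncoder: Encoder[GeoMeanBuffer] = Encoders.product[GeoMeanBuffer]
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

object Demo extends App {
  val spark = SparkSession.builder.appName("udaf-demo").master("local[*]").getOrCreate()

  // Register the aggregator under a SQL-callable name (name is illustrative).
  spark.udf.register("geo_mean", udaf(GeoMean))

  // geo_mean of 2.0 and 8.0 is sqrt(16) = 4.0
  spark.sql("SELECT geo_mean(x) AS gm FROM VALUES (2.0), (8.0) AS t(x)").show()
  spark.stop()
}
```

Splitting the aggregation into `reduce` (per-partition) and `merge` (across partitions) is what lets Spark run it distributed; the log-sum representation keeps the buffer small and numerically stable.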
Bio: Mikhail Chernetsov is currently a data team lead at Grammarly, working on analytics tooling that makes everyone in the company data-driven. Previously, he worked at Amazon, co-founded startups, and graduated from the Moscow Institute of Physics and Technology (MIPT).
Adobe: Tech-Talk 2: The Multi-Tenant Way: Challenges and Learnings Running ETL Using Apache Spark on Apache Mesos
Abstract: Adobe has developed a self-service infrastructure for Apache Spark on Apache Mesos that supports multi-tenancy, job queueing, quota guarantees, auto-scaling of Spark cluster nodes, and management of the network-related issues that result. We would like to share our lessons learned and discuss some of the issues still to be solved.
Bio: Mandeep Gandhi is a Computer Scientist at Adobe.
Databricks: Tech-Talk 3: From Pipelines to Refineries: Building Complex Data Applications with Apache Spark
Abstract: Big data tools are challenging to combine into a larger application: ironically, big data applications themselves do not tend to scale very well. These issues of integration and data management are only magnified by increasingly large volumes of data.
Apache Spark provides strong building blocks for batch processing, streaming, and ad-hoc interactive analysis. However, users face challenges when assembling a single coherent pipeline that can involve hundreds of transformation steps, especially when confronted with the need for rapid iteration.
This talk explores these issues through the lens of functional programming. It presents an experimental framework that provides full-pipeline guarantees by introducing more laziness to Apache Spark. This framework allows transformations to be seamlessly composed and alleviates common issues, thanks to whole program checks, auto-caching, and aggressive computation parallelization and reuse.
Bio: Tim Hunter is a software engineer at Databricks and contributes to the Apache Spark MLlib project. He has been building distributed Machine Learning systems with Spark since version 0.2, before Spark became an Apache Software Foundation project. He is also the creator of TensorFrames and a contributor to GraphFrames.
601 Townsend Street