Bay Area Apache Spark Meetup @ Databricks HQ in San Francisco

Thursday, July 19, 2018 - 18:00
Bay Area Spark Meetup
San Francisco

Join us for an evening at the Bay Area Apache Spark Meetup, featuring open-source tech talks from Databricks about using and innovating with Apache Spark.

Thanks to Databricks for hosting and sponsoring this meetup.


6:00 - 6:30 pm Mingling & Refreshments
6:30 - 6:40 pm Welcome, opening remarks, announcements & introductions (Jules S. Damji + Reynold Xin)
6:40 - 7:25 pm Tech-Talk-1 Project Hydrogen: State-of-the-Art Deep Learning on Apache Spark
7:25 - 8:10 pm Tech-Talk-2 MLflow: Infrastructure for a Complete Machine Learning Life Cycle
8:10 - 8:30 pm Mingling & Networking

Tech-Talk 1: Project Hydrogen: State-of-the-Art Deep Learning on Apache Spark

Abstract: Big data and AI are joined at the hip: the best AI applications require massive amounts of constantly updated training data to build state-of-the-art models. AI has always been one of the most exciting applications of big data and Apache Spark. Increasingly, Spark users want to integrate Spark with distributed deep learning and machine learning frameworks built for state-of-the-art training. On the other side, DL/AI users increasingly want to handle the large and complex data scenarios needed for their production pipelines.

This talk introduces a new project that substantially improves the performance and fault recovery of distributed deep learning and machine learning frameworks on Spark. We will introduce the major directions and provide progress updates, including 1) barrier execution mode for distributed DL training, 2) fast data exchange between Spark and DL frameworks, and 3) accelerator-aware scheduling.

Bio: Xiangrui Meng is an Apache Spark PMC member and a software engineer at Databricks. His main interests center around developing and implementing scalable algorithms for scientific applications. He has been actively involved in the development and maintenance of Spark MLlib since he joined Databricks. Before Databricks, he worked as an applied research engineer at LinkedIn, where he was the main developer of an offline machine learning framework in Hadoop MapReduce. His Ph.D. work at Stanford is on randomized algorithms for large-scale linear regression problems.

Tech-Talk 2: MLflow: Infrastructure for a Complete Machine Learning Life Cycle
Abstract: ML development brings many new complexities beyond the traditional software development lifecycle. Unlike in traditional software development, ML developers want to try multiple algorithms, tools and parameters to get the best results, and they need to track this information to reproduce work. In addition, developers need to use many distinct systems to productionize models. To address these problems, many companies are building custom “ML platforms” that automate this lifecycle, but even these platforms are limited to a few supported algorithms and to each company’s internal infrastructure.

In this talk, we will present MLflow, a new open source project from Databricks that aims to design an open ML platform where organizations can use any ML library and development tool of their choice to reliably build and share ML applications. MLflow introduces simple abstractions to package reproducible projects, track results, and encapsulate models that can be used with many existing tools, accelerating the ML lifecycle for organizations of any size.

Bio: Members from the MLflow Team

Databricks, Inc HQ

160 Spear St, Floor 13