Scalable Machine Learning Pipelines with Dask: Jason Carpenter, Manifold

Friday, June 21, 2019 - 12:30
University of San Francisco Seminar Series in Analytics
San Francisco

Talk Title: Scalable Machine Learning Pipelines with Dask

**Coffee is served at the in-person seminar**

A recording of the talk will be posted afterwards on our YouTube channel at

Jason Carpenter is a Machine Learning Engineer at Manifold, an artificial intelligence engineering services firm with offices in Boston and Silicon Valley. He has experience delivering machine learning and data engineering solutions that are integrated into core business strategy for Manifold's clients. The clients he has engineered solutions for range from developers of web and mobile applications to manufacturers of industrial hardware. Prior to joining Manifold full-time, he earned his Master of Science in Data Science from the University of San Francisco while simultaneously developing two open-source Python packages. His package swifter, which automatically decides the quickest way to apply any function to a pandas dataframe, relies on Dask as its workhorse for parallel processing. He previously co-presented this talk at AnacondaCon 2019.

Talk Description:
Dask is a powerful library within the PyData ecosystem. There are a number of great resources on how to use Dask for parallel processing, from documentation and blog posts to tutorial videos. However, we noticed that there is not yet any comprehensive resource specific to the applications of Dask in machine learning pipelines. This talk aims to fill that gap.

Dask is useful in various stages of machine learning pipelines, from data preprocessing to hyper-parameter tuning. We will present a unified approach for the application of Dask in ML workflows that can help you build scalable ML pipelines. We will focus on a case study where the goal is to classify journal papers into different topic categories.

Key audience takeaways will include:
1. How to identify challenges that can be addressed using Dask in Machine Learning
2. A set of design patterns for applying Dask to Machine Learning workflows
3. A set of examples with code, taken from real-world applications

101 Howard St, University of San Francisco - Downtown Campus, San Francisco, CA 94105
