Beyond Word2Vec: Recent Developments in Document Embedding

Tuesday, October 24, 2017 - 18:00
SF Bayarea Machine Learning
San Francisco

Main Talk: Beyond Word2Vec: Recent Developments in Document Embedding - Andrew Blevins (Metis)

Abstract: It is easy to be amazed by the seemingly magical power of word2vec. But in real business use cases, we rarely need to understand single words. So how do we apply the power of word2vec to phrases, sentences, paragraphs or entire documents? We will compare various techniques for generating useful representations of documents of indeterminate length and look at ways of comparing those techniques.

We will start with bag-of-words approaches and TF-IDF. From there we will look at dimensionality reduction techniques like LSA or NMF. After that, we will look at word2vec and sense2vec and various ways to aggregate those word vectors, including summing, weighting, clustering, Chinese restaurant processes, Gensim's Doc2Vec and developing parse tree representations. Finally, we will look at RNN methods such as LSTMs using Keras. Along the way, we will look at ways to evaluate each of these methods and discuss their strengths and weaknesses.
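
As a rough illustration of the "weighting" idea mentioned above, here is a minimal sketch that builds a document vector as a TF-IDF-weighted average of word vectors and compares documents by cosine similarity. It is not taken from the talk. The toy corpus, the two-dimensional `word_vecs` table (standing in for real word2vec output), the helper names `idf_table`, `doc_vector` and `cosine`, and the particular smoothed IDF formula are all assumptions made for the example.

```python
import math
from collections import Counter

# Toy corpus (assumed for illustration); each document is already tokenized.
docs = [
    ["cats", "chase", "mice"],
    ["dogs", "chase", "cats"],
    ["stocks", "fell", "sharply"],
]

# Hypothetical 2-d "word vectors", hand-made for this sketch.
# Real word2vec vectors would be learned and have hundreds of dimensions.
word_vecs = {
    "cats": [0.9, 0.1],
    "dogs": [0.8, 0.2],
    "mice": [0.7, 0.1],
    "chase": [0.6, 0.3],
    "stocks": [0.1, 0.9],
    "fell": [0.2, 0.7],
    "sharply": [0.3, 0.6],
}


def idf_table(corpus):
    """Return a smoothed inverse document frequency for every term."""
    n_docs = len(corpus)
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    # One common smoothed form; other IDF variants would work as well.
    return {
        term: math.log((1 + n_docs) / (1 + count)) + 1.0
        for term, count in doc_freq.items()
    }


def doc_vector(doc, idf, vecs):
    """Average the word vectors of a document, weighted by TF-IDF."""
    dim = len(next(iter(vecs.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for term, count in Counter(doc).items():
        if term not in vecs:
            continue  # skip out-of-vocabulary words
        weight = count * idf.get(term, 0.0)
        total = [t + weight * v for t, v in zip(total, vecs[term])]
        weight_sum += weight
    return [t / weight_sum for t in total] if weight_sum else total


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


idf = idf_table(docs)
doc_vecs = [doc_vector(doc, idf, word_vecs) for doc in docs]

# Compare the first document against the other two.
print(cosine(doc_vecs[0], doc_vecs[1]))
print(cosine(doc_vecs[0], doc_vecs[2]))
```

With these made-up vectors, the two animal-related documents should come out closer to each other than either is to the finance document. Comparing similarity scores against such known groupings is one simple way to sanity-check an aggregation scheme.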

Bio: Andrew comes to Metis from LinkedIn, where he worked as a data scientist on projects ranging from executive dashboarding and education to inferring profiles and skills standardization. He is passionate about helping people make rational decisions and building cool data products. Prior to that, he worked on fraud modelling at IMVU (the lean startup) and studied applied physics at Cornell. Andrew grew up on a sheep farm in North Idaho. He loves snowboarding, traveling, scotch and reading about all kinds of nerdy topics.

Lightning Talk: Machine Learning at TrueAccord - Nadav Samet (TrueAccord)

Abstract: TrueAccord reinvents debt collection and empowers consumers to regain financial health. Using machine learning and behavioral analytics, we replace the majority of human-to-human interactions with human-to-machine interactions and make a significant impact on millions of consumers in one of the most regulated industries in the US.

Bio: Nadav has over 20 years of coding experience, with more than 7 years as a solutions engineer for various startups. He began his career at the elite technological unit of the IDF’s Intelligence Corps where he specialized in data and network analysis.

Tentative Schedule:

6:00pm-6:45pm -- pre-reception

6:45pm-7:00pm -- lightning talk

7:00pm-8:00pm -- main talk

8:00pm-8:30pm -- post-reception


