Word Vector Series: Part 1 - Intro

July 20, 2017

Intro to Part 1

In this series, I’ll show how to use an off-the-shelf model that maps English words to vectors in your own programs. We’ll start with a simple React application that searches a dataset, then use the word vector model to enhance the search. The goal is to match more than the exact word the user types. For example, if I type sailing, the search might also match boat, yacht, and so on.

What is a word vector?

A word vector is the result of a machine learning technique that maps words into a vector space. You might also see it referred to as an embedding. There are a few different techniques for generating these mappings, but that’s not the focus of this blog series. We’ll simply download a well-known pre-trained model and use it in our application: GoogleNews-vectors-negative300.bin.gz.

This one was trained on roughly 100 billion words from Google News and is a few years old at this point. You can find many other pre-trained models online, some generated with different techniques; however, I’ve had excellent results using this one.
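To give a feel for what is inside that file once it is unzipped, here is a minimal sketch of a reader. It assumes the layout written by the original word2vec tool: a text header with the vocabulary size and vector length, then each word followed by its vector as little-endian 32-bit floats. The function name read_word2vec_binary is made up for illustration, and the sample bytes are invented so the snippet runs without the download.

```python
import io
import struct


def read_word2vec_binary(stream):
    """Read a word2vec-style binary stream into a {word: [floats]} dict."""
    # The header is one text line: "<vocabulary size> <dimensions>\n".
    vocab_size, dims = (int(n) for n in stream.readline().split())
    vectors = {}
    for _ in range(vocab_size):
        # Each word is stored as raw bytes terminated by a single space.
        chars = bytearray()
        while True:
            ch = stream.read(1)
            if not ch:
                raise ValueError("unexpected end of data while reading a word")
            if ch == b" ":
                break
            chars.extend(ch)
        # Some writers emit a newline after each vector; drop it if present.
        word = bytes(chars).lstrip(b"\n").decode("utf-8", errors="replace")
        # The vector follows as `dims` 32-bit floats (assumed little-endian).
        raw = stream.read(4 * dims)
        if len(raw) != 4 * dims:
            raise ValueError("unexpected end of data while reading a vector")
        vectors[word] = list(struct.unpack(f"<{dims}f", raw))
    return vectors


# A tiny invented "file" with two words in three dimensions, built in memory.
sample = (
    b"2 3\n"
    + b"chair " + struct.pack("<3f", 0.5, -0.25, 1.0) + b"\n"
    + b"table " + struct.pack("<3f", 0.75, 0.0, -1.5)
)
toy_vectors = read_word2vec_binary(io.BytesIO(sample))
print(toy_vectors["chair"])  # [0.5, -0.25, 1.0]
```

With the real model you would pass the unzipped file opened in "rb" mode instead. Holding every vector in plain Python lists takes a lot of memory, so a library such as gensim (KeyedVectors.load_word2vec_format(path, binary=True)) is likely the more practical route.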

Each word is encoded as a vector in 300 dimensions. For example, the word chair is represented by this array: [0.118652, -0.375000, 0.161133, 0.002151, ...295 numbers omitted..., 0.171875]. One interesting feature is that some intuitive concepts appear to be embedded as simple linear offsets. For example, adding a vector like [0.5, 0.3, 0.6, 0, 0, 0, 0, 0, ...] to the vector for man might land you at the vector for woman, and adding the same vector to king might get you queen. This is simplifying a bit: the real male->female vector is not as clean as my example, but it’s still a very neat result. Here’s a graphic from Google showing some linear relationships:

[Image: Wordvec linear relationships]

I’ve explored this feature a bit. It’s impressive, but messier than the diagrams suggest, and I haven’t found any practical use for it. (An analogy completer, maybe?)
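As a rough sketch of what an analogy completer could look like, the snippet below adds the man->woman offset to king and picks the closest remaining word by cosine similarity. The three-dimensional vectors are invented for illustration (real ones have 300 dimensions and come from the model), and the function names are hypothetical.

```python
import math

# Invented 3-dimensional vectors; "queen" is deliberately a little off
# from the ideal spot to mimic the messiness of real embeddings.
toy = {
    "man": [1.0, 0.0, 0.2],
    "woman": [1.0, 1.0, 0.2],
    "king": [1.0, 0.0, 0.9],
    "queen": [0.9, 1.1, 0.8],
    "chair": [0.1, 0.2, -0.9],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def complete_analogy(vectors, a, b, c):
    """Return the word d that best fits 'a is to b as c is to d'."""
    # Apply the a->b offset to c: target = b - a + c.
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # Skip the input words so the answer is not trivially one of them.
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates,
               key=lambda w: cosine_similarity(vectors[w], target))


print(complete_analogy(toy, "man", "woman", "king"))  # queen
```

Excluding the three input words matters: with real vectors the nearest neighbour of the target is often one of the inputs, which is part of why the results feel messier than the diagrams.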

Starting point: A simple app without any word vectors

Let’s start with a simple React app that loads a dataset into memory and lets you type to search it.
I made a quick one that searches a list of all S&P 500 companies and their descriptions.

Here is the app. Here is the source code.

I made it quickly using create-react-app. The CSS classes are from tachyons - the class names will look odd if you’ve never seen them before. The search itself uses lunrjs.

Up Next

Next, we’ll start building a related-words service on top of the GoogleNews-vectors-negative300.bin.gz model and wire the app up to it.
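As a preview of the idea (not the app’s actual code, which is React), here is a small Python sketch of search with query expansion. The RELATED table is a hand-written stand-in for what a related-words lookup might return, the company records are invented, and the function names are hypothetical.

```python
# Hand-written stand-in for a related-words lookup (an assumption).
RELATED = {"sailing": ["boat", "yacht"]}

# Invented records, not real S&P 500 data.
COMPANIES = [
    {"name": "Acme Marine", "description": "builds yacht engines and boat trailers"},
    {"name": "Example Foods", "description": "packaged snacks and drinks"},
    {"name": "Sample Sails", "description": "sells sailing gear online"},
]


def expand_query(query, related):
    """Return the query words plus any related words for each of them."""
    terms = []
    for word in query.lower().split():
        terms.append(word)
        terms.extend(related.get(word, []))
    return terms


def search(records, query, related=None):
    """Return names of records whose description contains any search term."""
    terms = expand_query(query, related or {})
    results = []
    for record in records:
        words = set(record["description"].lower().split())
        if any(term in words for term in terms):
            results.append(record["name"])
    return results


print(search(COMPANIES, "sailing"))           # ['Sample Sails']
print(search(COMPANIES, "sailing", RELATED))  # ['Acme Marine', 'Sample Sails']
```

The plain search only finds the record that literally mentions sailing, while the expanded one also picks up the record that mentions boat and yacht.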

Written by Matthew Reishus.
