Word Vector Series: Part 3 - Creating a similar word service
July 21, 2017
Intro to Part 3
In part 3, we’ll be using our new JavaScript word vector model to create a service that lists words similar to an input word.
Finding similar words
If we’re given a word, we want to find words similar to it. How do we do that? The most fundamental measure of distance is the Euclidean distance; however, it’s not used often in machine learning applications. What makes sense to our intuitions, tuned for 2d and 3d spaces, might not work in a 300d space (see: 1 2 3 4).
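For reference, the Euclidean distance between two $n$-dimensional vectors is $ d(\mathbf{a}, \mathbf{b}) = \sqrt{(a_1 - b_1)^2 + (a_2 - b_2)^2 + \cdots + (a_n - b_n)^2} $.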
Instead, we can use cosine similarity, which will tell us whether two vectors are roughly pointing in the same direction or not.
Calculating cosine similarity (theory)
This dot product formula contains the cosine of the angle between two vectors:
$\mathbf{a} \cdot \mathbf{b} = || \mathbf{a} ||_2 || \mathbf{b} ||_2 \cos \theta$
Solving for $\cos \theta$, we find:
$\cos \theta = \frac{\mathbf{a} \cdot \mathbf{b}}{ || \mathbf{a} ||_2 || \mathbf{b} ||_2 }$
The dot product can also be expressed as the sum of the products of the components of each vector, or:
$ \mathbf{a} \cdot \mathbf{b} = a_1 b_1 + a_2 b_2 + a_3 b_3 + \cdots + a_n b_n $
As a refresher, $ || \mathbf{a} ||_2 $ is known as the L2 norm, the Euclidean norm, or the magnitude of a vector.
$ || \mathbf{a} ||_2 := \sqrt{a_1^2 + a_2^2 + \cdots + a_n^2} $
A neat result is: $ || \mathbf{a} ||_2 = \sqrt{\mathbf{a} \cdot \mathbf{a}} $
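To make this concrete, here’s a worked example with two small vectors of my own choosing (not from the model), $ \mathbf{a} = (1, 2, 2) $ and $ \mathbf{b} = (2, 1, 2) $:
$ \mathbf{a} \cdot \mathbf{b} = 1 \cdot 2 + 2 \cdot 1 + 2 \cdot 2 = 8 $
$ || \mathbf{a} ||_2 = \sqrt{1^2 + 2^2 + 2^2} = 3, \quad || \mathbf{b} ||_2 = \sqrt{2^2 + 1^2 + 2^2} = 3 $
$ \cos \theta = \frac{8}{3 \cdot 3} \approx 0.89 $
The two vectors point in roughly the same direction, so the similarity comes out close to 1.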
Calculating cosine similarity (practice)
Now that we know $\cos \theta = \frac{\mathbf{a} \cdot \mathbf{b}}{ || \mathbf{a} ||_2 || \mathbf{b} ||_2 }$, let’s write some JavaScript to calculate it.
My first stab at it was:
const dotProduct = (a, b) => {
  if (!Array.isArray(a) || !Array.isArray(b) || a.length !== b.length) {
    throw new Error('invalid arguments');
  }
  // Pair each component of a with the matching component of b: [[a1, b1], [a2, b2], ...]
  const zippedVectors = a.map((x, idx) => [x, b[idx]]);
  // Multiply each pair: [a1*b1, a2*b2, ...]
  const products = zippedVectors.map(x => x[0] * x[1]);
  // Sum the products
  return products.reduce((acc, x) => acc + x, 0);
}

// ||a||_2 = sqrt(a · a)
const magnitude = a => Math.sqrt(dotProduct(a, a));

const cosineSimilarity = (a, b) => dotProduct(a, b) / (magnitude(a) * magnitude(b));
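As a quick sanity check, here’s the same toy example from above (these vectors are made up; real vectors from the model have 300 components):

console.log(dotProduct([1, 2, 2], [2, 1, 2]));        // 8
console.log(magnitude([1, 2, 2]));                    // 3
console.log(cosineSimilarity([1, 2, 2], [2, 1, 2]));  // 0.888... (8 / 9)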
I default to writing pseudo-immutable + functional style. However, in this case the function turned out to be much slower than it needed to be. I was able to get about a 10x speedup by not calculating the zippedVectors and products intermediate arrays:
const dotProduct = (a, b) => {
  if (!Array.isArray(a) || !Array.isArray(b) || a.length !== b.length) {
    throw new Error('invalid arguments');
  }
  // Multiply and accumulate in a single pass, with no intermediate arrays
  return a.reduce((acc, x, idx) => acc + (x * b[idx]), 0);
}

const magnitude = a => Math.sqrt(dotProduct(a, a));

const cosineSimilarity = (a, b) => dotProduct(a, b) / (magnitude(a) * magnitude(b));
Much better, even if the reduce is slightly awkward in my opinion.
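For the curious, a rough micro-benchmark along these lines is enough to compare the two versions yourself; the 300-component vectors match the model’s dimensionality, while the iteration count is an arbitrary choice:

// Time many calls so the per-call cost becomes visible
const randomVector = n => Array.from({length: n}, () => Math.random());
const a = randomVector(300);
const b = randomVector(300);

console.time('dotProduct');
for (let i = 0; i < 100000; i++) {
  dotProduct(a, b);
}
console.timeEnd('dotProduct');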
Almost done
Now I just need a quick function that uses cosineSimilarity to find me a list of similar words:
// `vectors` is the word → vector mapping loaded from the model
export const findSimiliar = (word, threshold = 0.45, maxResults = 50) => {
  if (!(word in vectors)) { return []; }
  const results = [];
  for (const candidate of Object.keys(vectors)) {
    if (candidate === word) { continue; }
    const similarity = cosineSimilarity(vectors[word], vectors[candidate]);
    if (similarity < threshold) { continue; }
    results.push({word: candidate, similarity});
  }
  // Most similar first, capped at maxResults
  return results.sort((a, b) => b.similarity - a.similarity).slice(0, maxResults);
}
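Calling it looks something like this; the exact words and scores below are illustrative, since the real output depends on the model:

findSimiliar('frog');
// => [ { word: 'toad', similarity: 0.82 },   // illustrative values
//      { word: 'frogs', similarity: 0.78 },
//      ... at most 50 entries, all with similarity >= 0.45 ]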
Making an API
With these functions and the model, we have the guts of our API ready. I used Express & ES6 REST API Boilerplate and wrote a little bit of glue code to make an API that takes an input word and lists similar words. One important detail: I told Babel not to transform my 75 megabyte model file. Here it is in action (with hard-to-read JSON):
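Roughly, the glue code boils down to a single route. Here’s a minimal sketch of the idea; the route path and import path are assumptions for illustration, not necessarily what’s in the repo:

import { Router } from 'express';

// Illustrative import path; findSimiliar is the function above
import { findSimiliar } from './similar';

const router = Router();

// GET /similar/:word -> JSON array of { word, similarity } objects
router.get('/similar/:word', (req, res) => {
  res.json(findSimiliar(req.params.word));
});

export default router;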
The code for this is on GitHub: Code.
Up next
In part 4, we’ll add the service to our React app.