MatthewReishus.com

Word Vector Series: Part 2 - Downloading the model

July 21, 2017

Intro to Part 2

In part 2, we’ll be downloading the model and trimming it to a reasonable size.

Previous: Part 1
Next: Part 3

Downloading the model

Download GoogleNews-vectors-negative300.bin.gz from either code.google.com (requires a browser, ctrl-f negative300) or this github archive (requires git-lfs).

Unzip the file with gzip. It’s quite large and in a binary format made for a python library called word2vec. I’d like to make the service in nodejs, so let’s convert it to text. We can use a program called convertvec to do this. An actual command list looks something like:

$ git clone [email protected]:marekrei/convertvec.git
$ cd convertvec
$ make
$ cp convertvec ../cv
$ cd ..
$ chmod +x cv
$ ./cv bin2txt GoogleNews-vectors-negative300.bin GoogleNews-vectors-negative300.txt
$ rm -rf convertvec

Now you should have GoogleNews-vectors-negative300.txt. It’s huge - 8 gigs - and contains 3 million lines (words). The format looks like this:

in 0.070312 0.086914 0.087891 0.062500 0.069336 -0.108887 -0.081543 -0.154297 0.020752 0.131836 -0.113770 -0.037354 0.069336 0.078125 -0.103027 -0.097656 0.044189 0.102539 -0.060791 -0.036133 -0.045410 0.047363 -0.120605 -0.063965 0.002258 0.037109 -0.002914 0.117676 0.061768 0.063965 0.081055 -0.068848 -0.021362 0.055176 -0.085449 0.068848 -0.127930 -0.033203 0.098633 0.175781 0.110840 -0.034668 -0.047119 -0.008484 0.035889 0.103027 0.026978 -0.028687 -0.005127 0.106445 0.059814 0.094238 0.033691 -0.027100 -0.094238 0.001030 -0.048340 0.034424 0.081055 -0.113281 -0.088867 0.035889 -0.145508 -0.244141 -0.061523 0.052979 0.056885 0.179688 0.061035 0.086914 0.124023 -0.040283 0.022583 0.177734 -0.029663 -0.029663 0.117188 0.031128 -0.096191 0.066406 0.004700 -0.080078 0.062988 -0.020630 -0.054688 -0.135742 -0.063477 0.083496 -0.063965 0.021484 0.077148 -0.037109 -0.033691 -0.183594 -0.072754 0.015869 0.093262 -0.061523 -0.014221 -0.003448 0.011108 -0.158203 -0.017090 0.006195 -0.008728 -0.080566 -0.015259 -0.087891 0.003479 -0.016113 -0.012329 0.097656 -0.139648 -0.085938 -0.026855 0.053955 0.132812 0.112793 0.121094 0.085449 -0.007111 0.044678 -0.145508 -0.003204 -0.117676 -0.065430 0.071289 -0.094238 -0.030273 0.120117 0.080078 -0.094727 -0.162109 -0.077637 0.021240 -0.081543 0.003937 -0.157227 -0.098145 0.039795 0.039307 -0.009094 0.103027 0.067871 -0.042725 0.063477 -0.049072 0.020874 -0.166992 0.093262 0.093750 0.006866 0.053711 0.052490 -0.024414 -0.032471 -0.061523 -0.005554 0.096191 0.037842 0.012207 -0.043945 -0.007477 0.105469 0.020386 0.145508 0.082031 0.005768 0.004578 -0.092773 -0.138672 -0.057373 -0.051514 -0.130859 -0.139648 -0.020508 -0.027100 0.032715 0.104980 -0.002335 -0.022583 0.000504 -0.110840 0.084961 -0.129883 -0.017456 -0.000359 0.107910 0.088867 0.044678 0.025146 0.023804 0.081055 0.023682 -0.109863 0.005371 -0.017700 -0.033936 -0.032959 -0.164062 0.095703 -0.018311 0.005310 -0.034424 -0.044189 -0.066406 -0.017944 -0.029663 -0.007599 -0.051270 -0.054199 0.089355 -0.071777 0.015259 -0.082520 -0.031738 0.035645 -0.021240 -0.059326 -0.013062 0.046875 0.023071 0.020996 -0.078613 -0.008057 0.019531 -0.005554 0.041504 0.027832 0.013611 0.034668 -0.182617 0.120117 0.074219 -0.041016 -0.009949 0.042969 -0.007294 0.123047 0.057617 -0.053467 -0.032227 -0.009094 -0.046631 0.043945 -0.050781 0.068848 0.002991 -0.004181 -0.044189 0.073730 -0.012756 0.067383 0.006287 0.075195 -0.037842 0.004883 0.044678 -0.067383 0.009705 0.004730 0.020508 0.071289 0.170898 0.173828 0.055664 0.091309 -0.037354 0.049805 -0.039307 0.044189 0.062500 0.048584 -0.053223 0.048828 -0.130859 -0.028931 -0.036133 -0.060791 -0.057373 0.123047 -0.082520 -0.011902 0.125000 0.001358 0.063965 -0.106445 -0.143555 -0.042236 0.024048 -0.168945 -0.088867 -0.080566 0.064941 0.061279 -0.047363 -0.058838 -0.047607 0.014465 -0.062500 

I’d like to make a simple service that has a smallish resource footprint, so let’s not load the whole thing.
The words at sorted by popularity. At the top of the file we have </s>, in, for, and that; at the bottom, we have Mezze_Cafe, pulverizes_boulders, and snowcapped_Caucasus.

Some things to notice: The model uses capital letters sometimes, it contains compound phrases separated by underscores, and there are some special codes. Our implementation will only look up single words for simplicity, but we need to make sure to take care of capitalization: the user could type monday but the model only contains Monday.

Trimming the model

So let’s cut off a section of the top of the file and convert it to a javascript object. I wrote a quick ruby program to do this (it won’t win any awards!):

#!env ruby
require 'json'

input_array = ARGV
if input_array.length < 1
  puts "#{$0}: How many words should I include in the model?"
  puts "Use \"./#{$0} 10000\" to make a 10,000 word model."
  exit
end

limit = ARGV.shift
if !(limit =~ /\A\d+\Z/)
  puts "#{$0}: Please provide a limit in numbers."
  puts "Use \"./#{$0} 10000\" to make a 10,000 word model."
  exit
end
limit = limit.to_i

puts "module.exports = {"
words_seen = {}
output_index = 0
File.open("./GoogleNews-vectors-negative300.txt", "r") do |file|
  file.each_with_index do |line, input_index|

    next if input_index == 0
    line_array = line.split
    word = line_array.shift.downcase

    next if words_seen.key?(word)
    words_seen[word] = 1

    vectorPart = line_array.join(", ")

    puts "#{JSON.generate(word)}: [#{vectorPart}],"
    output_index += 1
    break if output_index >= limit
  end
end
puts "};"

Use it like so: $ ./make_model.rb 25000 > ../words25k.js

A 25k word model is about 75 megabytes, and includes a reasonable number of words. Here’s a graph showing estimated vocabulary size for English native speakers from testyourvocab.com:

Vocab graph

Up next

In part 3, we’ll be using some basic linear algebra to find related words.

Previous: Part 1
Next: Part 3


Written by Matthew Reishus.

© 2020