Word2vec and the distance between words based on it

This is a short article showing how the distance between words can be computed using word2vec. I often follow deep theoretical and large practical articles with small articles covering the basics. The basics are important too, so today's article is a small one on computing the distance between words with the help of word2vec.

Word2vec is a neural model that converts a word into a vector. Given a word, say ‘cat’, it produces the vector form of ‘cat’. The big benefit is that computations applied to this vector are, in effect, applied to the word ‘cat’.

Today we shall discuss one such application of word2vec: representing words as vectors in order to compute the distance between them. Once words are represented in vector space, the distance between them can be computed using cosine similarity or any other similarity measure.

Here goes the code.

Let’s import the libraries and connect Google Drive. The file GoogleNews-vectors-negative300.bin.gz is too heavy to upload to Colab every time, so we keep it on Drive and simply mount the drive in each session.

import matplotlib.pyplot as plt
import gensim


from google.colab import drive
drive.mount('/content/drive')

We are not training the model; we are using a pre-trained word2vec file. These vectors were trained by Google on a Google News corpus. The pre-trained file is GoogleNews-vectors-negative300.bin.gz, which is available on the internet to download.

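# Load the pre-trained vectors (binary word2vec format) from Google Drive
model = gensim.models.KeyedVectors.load_word2vec_format('/content/drive/MyDrive/GoogleNews-vectors-negative300.bin.gz', binary=True)

# KeyedVectors exposes similarity() directly; it has no .wv attribute in gensim 4
model.similarity('gas', 'climate')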

The code above computes the similarity between the words ‘gas’ and ‘climate’.

The output gives a similarity of 0.1397.

Now let us extract the word vectors explicitly.


# Index the KeyedVectors object directly to get each word's vector
gas_wv = model['gas']
climate_wv = model['climate']

Each word vector is a 300-dimensional NumPy array.
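One practical caveat when indexing a word directly: a word that is not in the model's vocabulary raises a KeyError. Below is a minimal sketch of a guarded lookup. The name `get_vector` is a hypothetical helper, not part of gensim, and the dictionary with made-up numbers stands in for the loaded model so that the snippet runs without the large file. It relies only on the `in` and `[]` operations, which gensim's KeyedVectors also supports.

```python
def get_vector(vectors, word):
    """Return the vector for `word`, or None if the word is not in the vocabulary."""
    if word in vectors:
        return vectors[word]
    return None


# Stand-in for the loaded model: a plain dict with made-up 3-dimensional vectors
toy_vectors = {"gas": [0.1, 0.2, 0.3], "climate": [0.3, 0.1, 0.0]}

print(get_vector(toy_vectors, "gas"))       # a known word returns its vector
print(get_vector(toy_vectors, "notaword"))  # an unknown word returns None
```

With the real model, the same call would presumably look like `get_vector(model, 'gas')`.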

Now, compute the cosine similarity between these two word vectors. Cosine similarity measures how close two vectors are in direction: it is the dot product of the vectors divided by the product of their lengths. We already have the words in vector format as above, so we just need to compute the cosine similarity.
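For reference, the standard definition of cosine similarity can be written as follows. The symbols a and b (the two word vectors) and n (their dimension, 300 for this model) are notation chosen here for illustration.

```latex
\[
\cos(\theta)
  = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}
  = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^{2}} \; \sqrt{\sum_{i=1}^{n} b_i^{2}}}
\]
```

The value lies between -1 and 1, with 1 meaning the two vectors point in the same direction.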

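import numpy as np

def cosine(vector1, vector2):
    # Dot product of the two vectors
    dot_product = np.dot(vector1, vector2)
    # Euclidean (L2) norm of each vector
    norm1 = np.sqrt(np.sum(vector1**2))
    norm2 = np.sqrt(np.sum(vector2**2))
    # Cosine similarity: dot product divided by the product of the norms
    cosine_similarity = dot_product / (norm1 * norm2)
    return cosine_similarity

cosine(gas_wv, climate_wv)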

The output is a similarity of 0.1397.

This matches the value obtained above from the direct function call.
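Since the goal is a distance, it may help to see how a similarity can be turned into one. The sketch below assumes one common convention, cosine distance = 1 - cosine similarity, and also shows Euclidean distance as an example of another measure. The function names and the tiny made-up vectors are illustrative assumptions, chosen so that the snippet runs without the pre-trained file.

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def cosine_distance(u, v):
    """One common convention: distance = 1 - similarity."""
    return 1.0 - cosine_similarity(u, v)


def euclidean_distance(u, v):
    """Straight-line distance between the two vector endpoints."""
    return float(np.linalg.norm(u - v))


# Made-up 3-dimensional vectors, for illustration only
a = np.array([1.0, 0.0, 0.0])
b = np.array([0.0, 1.0, 0.0])  # perpendicular to a
c = np.array([2.0, 0.0, 0.0])  # same direction as a, twice as long

print(cosine_distance(a, b))     # perpendicular vectors: distance 1.0
print(cosine_distance(a, c))     # same direction: distance 0.0
print(euclidean_distance(a, c))  # lengths differ by 1: distance 1.0
```

Under this convention, the similarity of 0.1397 between ‘gas’ and ‘climate’ would correspond to a cosine distance of about 0.8603. Note that cosine distance ignores vector length, while Euclidean distance does not, which is why a and c are at cosine distance 0 but Euclidean distance 1.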

In the coming lecture, we’ll discuss this topic further.

Published by Nidhika
