This short article shows how to compute the distance between words using word2vec. I often follow deep theoretical and large practical articles with small ones covering the basics, because the basics matter too.
Word2vec is a neural model that converts a word into a vector. Given a word, say ‘cat’, it maps that word to a vector of numbers. The big benefit is that any computation that can be applied to vectors can now be applied to the word ‘cat’.
Today we discuss one such application of word2vec: representing words as vectors in order to compute the distance between them. Once words are represented in a vector space, the distance between them can be computed using cosine similarity or any other similarity measure.
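To make the idea concrete before loading the real model, here is a minimal sketch using made-up 3-dimensional vectors. The words, the numbers, and the function names are illustrative assumptions and are not values produced by word2vec. It compares cosine similarity with Euclidean distance, which is one possible alternative measure.

```python
import math

# Hypothetical toy vectors; real word2vec vectors are learned, not hand-written.
toy_vectors = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between the end points of a and b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

cat, dog, car = toy_vectors["cat"], toy_vectors["dog"], toy_vectors["car"]
print(cosine_similarity(cat, dog), cosine_similarity(cat, car))
print(euclidean_distance(cat, dog), euclidean_distance(cat, car))
```

In this toy setup both measures agree that ‘cat’ is closer to ‘dog’ than to ‘car’: the cosine similarity is higher and the Euclidean distance is lower.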
Here is the code.
Let’s import the libraries and connect Google Drive. The file GoogleNews-vectors-negative300.bin.gz is too heavy to upload to Colab every time, so we keep it on Drive and simply connect to Drive whenever we need it.
import gensim
import matplotlib.pyplot as plt
from google.colab import drive
drive.mount('/content/drive')
We are not training a model; we are using pre-trained word2vec vectors. The pre-trained file, GoogleNews-vectors-negative300.bin.gz, was trained on Google News text and is available for download on the internet.
model = gensim.models.KeyedVectors.load_word2vec_format('/content/drive/MyDrive/GoogleNews-vectors-negative300.bin.gz', binary=True)
The similarity between the words ‘gas’ and ‘climate’ can then be computed with a direct function call, model.similarity('gas', 'climate').
The output gives a similarity of 0.1397.
Now let us extract the word vectors explicitly.
# model is already a KeyedVectors object, so it is indexed directly by word
gas_wv = model['gas']
climate_wv = model['climate']
Each of these word vectors is a 300-dimensional array of floating-point numbers.
Now, compute the cosine similarity between these two word vectors. Cosine similarity is the dot product of the two vectors divided by the product of their norms, and it measures how closely the two vectors point in the same direction. We already have the words in vector form, so we only need to apply this formula.
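For reference, the standard definition of cosine similarity can be written as below. The symbols A, B, and n (the vector dimension, 300 for this model) are notation introduced here for illustration.

```latex
\[
\cos(\theta) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
= \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}
\]
```

The value ranges from -1 (opposite directions) through 0 (orthogonal vectors) to 1 (same direction).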
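import numpy as np

def cosine(vector1, vector2):
    # Dot product of the two vectors (the numerator)
    dot_product = np.dot(vector1, vector2)
    # Euclidean (L2) norm of each vector
    norm1 = np.sqrt(np.sum(vector1**2))
    norm2 = np.sqrt(np.sum(vector2**2))
    # Cosine similarity: dot product divided by the product of the norms
    return dot_product / (norm1 * norm2)

cosine(gas_wv, climate_wv)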
Output is as follows:
The similarity is 0.1397.
This matches the similarity obtained above from the direct function call.
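Since the aim is a distance between words, one common convention is to define cosine distance as 1 minus cosine similarity. The sketch below illustrates that convention on small hand-written vectors using only the standard library. The function names and the test vectors are assumptions chosen for illustration, not part of the original notebook.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    """Cosine distance under the '1 - similarity' convention."""
    return 1.0 - cosine_similarity(a, b)

same = cosine_distance([1.0, 2.0], [2.0, 4.0])         # parallel vectors
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])   # perpendicular vectors
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])    # opposite directions
print(same, orthogonal, opposite)
```

Under this convention, parallel vectors have a distance of about 0, perpendicular vectors a distance of 1, and opposite vectors a distance of 2. Applied to the similarity of 0.1397 reported above, the cosine distance between ‘gas’ and ‘climate’ would be 1 - 0.1397 = 0.8603.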
In the coming lecture, we’ll discuss more on this topic.