User-Specific Word Vectors, Customized on Climate Data

This article presents two ways to train your own word vectors. This is useful if you do not want to use vectors pre-trained on Wikipedia, Twitter, or similar datasets. You can load your own data file and build your own model, and this model can be saved for future work. The advantage is that self-trained vectors are customized to your application.

Here is the code, with explanations.

Connect to Google Drive to access the data file. The customization used here is on climate change, and the files used contain climate data.

from google.colab import drive
drive.mount('/content/drive')

!pip install nltk
import nltk
nltk.download('all')

This has been done both from scratch and with the help of toolkits such as NLTK.

The following is the from-scratch version, in which the textual data file on climate change is read. These are the pre-processing functions; note that the text file was cleaned before applying them.

def isSpecialCharacter(word):
    # True if the word contains any non-alphanumeric character
    for character in word:
        if not (character.isalpha() or character.isdigit()):
            return True
    return False

def parseFile(fileName):
    # Read the file, collecting one sentence per line
    sentences = []
    fileNameObj = open(fileName)
    for sentence in fileNameObj:
        # str.replace returns a new string, so the result must be assigned
        sentence = sentence.replace("\n", " ")
        sentences.append(sentence)
    fileNameObj.close()
    return sentences

def tokenizeSentences(sentenceText):
    # Split into words, drop words with special characters,
    # drop stopwords, and lowercase the rest
    words = sentenceText.split()
    words = [w for w in words if not isSpecialCharacter(w)]
    words = [w for w in words if w.lower() not in stopwords]
    words = [w.lower() for w in words]
    return words
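To see what these helpers do, here is a quick standalone check on a sample sentence. The functions are repeated so the snippet runs on its own, and a small hand-picked stopword set stands in for NLTK's list.

```python
# Minimal stand-in for nltk.corpus.stopwords.words("english")
stopwords = {"the", "is", "a", "of"}

def isSpecialCharacter(word):
    # True if the word contains any non-alphanumeric character
    return any(not (c.isalpha() or c.isdigit()) for c in word)

def tokenizeSentences(sentenceText):
    # Split, drop words with special characters and stopwords, lowercase
    words = sentenceText.split()
    words = [w for w in words if not isSpecialCharacter(w)]
    words = [w for w in words if w.lower() not in stopwords]
    return [w.lower() for w in words]

print(tokenizeSentences("The climate of Earth is warming rapidly"))
# → ['climate', 'earth', 'warming', 'rapidly']
```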

Here the two kinds of word-vector methods from gensim are called to compute the vectors. The first is the continuous bag-of-words model and the second is the skip-gram model. The similarity between the two words climate and weather is computed here.

import gensim
from gensim.models import Word2Vec

stopwords = nltk.corpus.stopwords.words("english")
fileName = "/content/test_file.txt"
sentencesInTraining = parseFile(fileName)

textTrainingInput = []


for sentenceText in sentencesInTraining:
    partTrainingInput = []
    wordsInSentence = tokenizeSentences(sentenceText)
    for word in wordsInSentence:
        partTrainingInput.append(word)
    textTrainingInput.append(partTrainingInput)


model_bog = gensim.models.Word2Vec(textTrainingInput, window=3)
print("Similarity Bag of Words: ", model_bog.wv.similarity('climate', 'weather'))

model_skipgram = gensim.models.Word2Vec(textTrainingInput, window=3, sg=1)
print("Similarity Skip Gram: ", model_skipgram.wv.similarity('climate', 'weather'))

The following is a shorter, more refined, and professional way to compute word vectors. Again the two word-vector methods from gensim are called: the first is the continuous bag-of-words model and the second is the skip-gram model, and the similarity between the two words climate and weather is computed. This method does not need custom pre-processing functions, since NLTK's inbuilt methods sent_tokenize and word_tokenize are used. You can try this part of the code yourself, as it is straightforward; hence that part is not provided here.

import gensim
from gensim.models import Word2Vec
from nltk.tokenize import sent_tokenize, word_tokenize

stopwords = nltk.corpus.stopwords.words("english")
fileName = "/content/test_file.txt"
fileNameObj = open(fileName)
sentencesInTraining = fileNameObj.read()
fileNameObj.close()

sentencesInTraining = sentencesInTraining.replace("\n", " ")

. . .

model_bog = gensim.models.Word2Vec(textTrainingInput, window=3)
print("Similarity Bag of Words: ", model_bog.wv.similarity('climate', 'weather'))

model_skipgram = gensim.models.Word2Vec(textTrainingInput, window=3, sg=1)
print("Similarity Skip Gram: ", model_skipgram.wv.similarity('climate', 'weather'))

Published by Nidhika

