User-Specific Word Vectors, Customized on Climate Data

This article presents two ways to train your own word vectors. This is useful if you do not want to use vectors pre-trained on Wikipedia, Twitter, or similar datasets. You can load your own data file and build your own model, and this model can be saved for future work. The advantage is that self-trained vectors are customized to your application.

Here is the code, with explanations.

Connect to Google Drive to access the data file. The customization used here is on climate change, and the files used contain climate data.

from google.colab import drive
drive.mount('/content/drive')

!pip install nltk
import nltk
nltk.download('all')

This has been done both from scratch and with the help of toolkits such as NLTK.

The following is the from-scratch version, in which the textual data file on climate change is read. These are the pre-processing functions; note that the text file was cleaned before applying them.

def isSpecialCharacter(word):
    # True if the word contains any non-alphanumeric character
    for character in word:
        if not (character.isalpha() or character.isdigit()):
            return True
    return False

def parseFile(fileName):
    # Read the file, collecting one sentence per line
    sentences = []
    fileNameObj = open(fileName)
    for sentence in fileNameObj:
        # str.replace returns a new string, so the result must be assigned
        sentence = sentence.replace("\n", " ")
        sentences.append(sentence)
    fileNameObj.close()
    return sentences

def tokenizeSentences(sentenceText):
    # Split into words, drop words with special characters,
    # drop stopwords, and lowercase the rest
    words = sentenceText.split()
    words = [w for w in words if not isSpecialCharacter(w)]
    words = [w for w in words if w.lower() not in stopwords]
    words = [w.lower() for w in words]
    return words
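To see what these helpers do, here is a quick standalone check on a sample sentence. The functions are repeated so the snippet runs on its own, and a small hand-picked stopword set stands in for NLTK's list.

```python
# Minimal stand-in for nltk.corpus.stopwords.words("english")
stopwords = {"the", "is", "a", "of"}

def isSpecialCharacter(word):
    # True if the word contains any non-alphanumeric character
    return any(not (c.isalpha() or c.isdigit()) for c in word)

def tokenizeSentences(sentenceText):
    # Split, drop words with special characters and stopwords, lowercase
    words = sentenceText.split()
    words = [w for w in words if not isSpecialCharacter(w)]
    words = [w for w in words if w.lower() not in stopwords]
    return [w.lower() for w in words]

print(tokenizeSentences("The climate of Earth is warming rapidly"))
# → ['climate', 'earth', 'warming', 'rapidly']
```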

Here the two kinds of word-vector methods from gensim are called to compute the vectors. The first is the continuous bag-of-words model and the second is the skip-gram model. The similarity between the two words climate and weather is computed here.

import gensim
from gensim.models import Word2Vec

stopwords = nltk.corpus.stopwords.words("english")
fileName = "/content/test_file.txt"
sentencesInTraining = parseFile(fileName)

textTrainingInput = []


for sentenceText in sentencesInTraining:
    partTrainingInput = []
    wordsInSentence = tokenizeSentences(sentenceText)
    for word in wordsInSentence:
        partTrainingInput.append(word)
    textTrainingInput.append(partTrainingInput)


model_bog = gensim.models.Word2Vec(textTrainingInput, window=3)
print("Similarity Bag of Words: ", model_bog.wv.similarity('climate', 'weather'))

model_skipgram = gensim.models.Word2Vec(textTrainingInput, window=3, sg=1)
print("Similarity Skip Gram: ", model_skipgram.wv.similarity('climate', 'weather'))

The following is a shorter, more refined, and professional way to compute word vectors. Again the two word-vector methods from gensim are called: the first is the continuous bag-of-words model and the second is the skip-gram model, and the similarity between the two words climate and weather is computed. This method does not need custom pre-processing functions, since NLTK's inbuilt methods sent_tokenize and word_tokenize are used. You can try this part of the code yourself, as it is straightforward; hence that part is not provided here.

import gensim
from gensim.models import Word2Vec
from nltk.tokenize import sent_tokenize, word_tokenize

stopwords = nltk.corpus.stopwords.words("english")
fileName = "/content/test_file.txt"
fileNameObj = open(fileName)
sentencesInTraining = fileNameObj.read()
fileNameObj.close()

sentencesInTraining = sentencesInTraining.replace("\n", " ")

. . .

model_bog = gensim.models.Word2Vec(textTrainingInput, window=3)
print("Similarity Bag of Words: ", model_bog.wv.similarity('climate', 'weather'))

model_skipgram = gensim.models.Word2Vec(textTrainingInput, window=3, sg=1)
print("Similarity Skip Gram: ", model_skipgram.wv.similarity('climate', 'weather'))

Published by Nidhika

