SVD on Text Embeddings - AI Exercise

#AI-Exercise

In this article you will find pointers to solve an AI exercise in which further reduction of dimension is validated.

The aim of this exercise is to study text data in lower dimensions, with the text represented as word embeddings. We compute the SVD of the embedding matrix to analyze the impact of using SVD on text data in word-embedding form.

AI Exercise:

1. Take a text file and represent it in the form of word embeddings.

2. Perform SVD on the embeddings formed above, called text embeddings.

3. Now perform reduction by taking various combinations of concepts from the SVD-decomposed matrices.

Doing this, you can learn which concept, or combination of concepts, is enough to describe the data. It also helps you understand what SVD means on top of averaged text embeddings, and lets you answer whether SVD on top of embeddings is required for your problem.

Some hints for the solution of this exercise:

Step 1. Take the text input and read it. I have taken a climate-change text file from the internet.

sentences = []
inputMediumfile = open("/content/CLIMATE_CHANGE1.txt")
for sentence in inputMediumfile:
  sentences.append(sentence)
print(sentences[10])

Step 2. Load the word vectors.

import gensim

# Load Google's pre-trained Word2Vec model.
model = gensim.models.KeyedVectors.load_word2vec_format('/content/drive/MyDrive/GoogleNews-vectors-negative300.bin.gz', binary=True)

Step 3. Generate a matrix with entries from the pre-trained word vectors.

In this exercise they are taken from GoogleNews-vectors-negative300.bin.gz. With these vectors, compute each sentence vector as the average of all the word vectors in the sentence. Words not found in the embedding vocabulary are handled in a try/except block.

import numpy as np
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

row = len(sentences)
col = model.vector_size  # 300 for the GoogleNews vectors
sentenceMatrix = np.zeros((row, col), dtype=float)
stopWords = set(stopwords.words('english'))

for i in range(row):
  words = word_tokenize(sentences[i])
  wordsFiltered = [w for w in words if w not in stopWords]  # reset per sentence
  vector_total = np.zeros(col)
  count = 0
  for word_ in wordsFiltered:
    try:
      vector_total = vector_total + model[word_]
      count += 1
    except KeyError:
      print("word '%s' not in vocabulary" % word_)
  if count > 0:
    vector_total = vector_total / count  # average of the word vectors
  sentenceMatrix[i, :] = vector_total

print("sentenceMatrix is ", sentenceMatrix[3])

This is the matrix obtained from the text data taken as input.

Step 4. Compute the SVD of the text-data matrix so found.

u,s,vh = np.linalg.svd(sentenceMatrix)

Step 5. Let's find concepts in the SVD to analyze, try some experimental values, and draw conclusions from them.

Let us check the dimensions

u.shape , vh.shape, s.shape

Output is 

((23, 23), (300, 300), (23,))
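These shapes come from NumPy's default full SVD: since the matrix is 23 x 300, only 23 singular values exist, and the extra rows of vh carry no information about the data. A minimal sketch with a synthetic stand-in matrix (the real sentenceMatrix depends on your text file) shows the difference between the full and thin decompositions:

```python
import numpy as np

# Synthetic stand-in for sentenceMatrix: 23 sentences x 300 dimensions.
A = np.random.default_rng(0).standard_normal((23, 300))

# Full SVD (the default): u is (23, 23), s has min(23, 300) = 23 entries,
# vh is (300, 300).
u, s, vh = np.linalg.svd(A)
print(u.shape, s.shape, vh.shape)  # (23, 23) (23,) (300, 300)

# Thin SVD keeps only the 23 rows of vh needed for reconstruction.
ut, st, vht = np.linalg.svd(A, full_matrices=False)
print(ut.shape, st.shape, vht.shape)  # (23, 23) (23,) (23, 300)
```

Passing full_matrices=False saves memory when, as here, one dimension is much larger than the other.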

Take the first 2 concepts from there:

sen1 = u[:,:2]
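To judge how much of the data the first few concepts capture, you can reconstruct a rank-k approximation of the matrix and measure the energy (squared singular values) retained. This is a sketch using a synthetic stand-in for sentenceMatrix; the helper names rank_k_approx and explained are illustrative, not part of the original exercise:

```python
import numpy as np

# Synthetic stand-in for sentenceMatrix: 23 sentences x 300 dimensions.
sentenceMatrix = np.random.default_rng(0).standard_normal((23, 300))
u, s, vh = np.linalg.svd(sentenceMatrix)

def rank_k_approx(u, s, vh, k):
    # Rebuild the matrix from the first k concepts (singular triplets).
    return u[:, :k] @ np.diag(s[:k]) @ vh[:k, :]

def explained(s, k):
    # Fraction of total energy (squared singular values) in the first k concepts.
    return (s[:k] ** 2).sum() / (s ** 2).sum()

approx2 = rank_k_approx(u, s, vh, 2)
print("rank-2 shape:", approx2.shape)
print("energy kept by 2 concepts: %.3f" % explained(s, 2))
```

Sweeping k from 1 up to 23 and watching explained(s, k) grow toward 1 suggests how many concepts are enough for your data.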

Draw your own conclusions from your experiments; this is a good AI exercise to learn a lot from.

Published by Nidhika
