# AI Exercise
In this article you will find pointers for solving an AI exercise that validates dimensionality reduction.
The aim of this exercise is to study text data in lower dimensions, with the text represented as word embeddings. We compute the SVD of the embedding matrix to analyze the impact of applying SVD to text data in word-embedding form.
AI Exercise:
1. Take a text file and represent it as word embeddings.
2. Perform SVD on the embeddings formed above (the text embeddings).
3. Perform dimensionality reduction by taking various combinations of concepts from the SVD-decomposed matrices.
Doing this shows which concept, or combination of concepts, is enough to describe the text. It also helps you understand what SVD does on top of averaged text embeddings, so you can decide whether SVD on top of the embeddings is needed for your problem.
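The three exercise steps can be sketched end to end with toy embeddings. The three-dimensional vectors and words below are made up purely for illustration; real pre-trained vectors such as GoogleNews are 300-dimensional:

```python
import numpy as np

# Made-up 3-d word vectors, just for illustration.
embeddings = {
    "climate": np.array([0.1, 0.3, 0.5]),
    "change": np.array([0.2, 0.1, 0.4]),
    "is": np.array([0.0, 0.0, 0.1]),
    "real": np.array([0.4, 0.2, 0.0]),
}

def sentence_vector(sentence, emb):
    """Average the vectors of the in-vocabulary words of a sentence."""
    vectors = [emb[w] for w in sentence.lower().split() if w in emb]
    return np.mean(vectors, axis=0)

sentences = ["climate change is real", "climate change"]
matrix = np.vstack([sentence_vector(s, embeddings) for s in sentences])
print(matrix.shape)  # one row per sentence, one column per embedding dimension

# SVD of the sentence matrix; the columns of u are the "concepts".
u, s, vh = np.linalg.svd(matrix)
```

The same structure carries over to the real exercise below: only the source of the word vectors and the size of the matrix change.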
Some hints for solving this exercise:
Step 1. Read the text input. I have taken a climate-change text file from the internet.
sentences = []
inputMediumfile = open("/content/CLIMATE_CHANGE1.txt")
for sentence in inputMediumfile:  # one sentence per line
    sentences.append(sentence)
inputMediumfile.close()
print(sentences[10])
Step 2. Load the word vectors.
import gensim
# Load Google's pre-trained Word2Vec model.
model = gensim.models.KeyedVectors.load_word2vec_format('/content/drive/MyDrive/GoogleNews-vectors-negative300.bin.gz', binary=True)
Step 3. Generate a matrix with entries from the pre-trained word vectors.
In this exercise they are taken from GoogleNews-vectors-negative300.bin.gz. With these vectors, compute each sentence vector as the average of the vectors of all words in the sentence. Words not found in the embedding vocabulary are handled in a try/except block.
import numpy as np
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

row = len(sentences)
col = model.vector_size  # 300 for the GoogleNews vectors
sentenceMatrix = np.zeros((row, col), dtype=float)
stopWords = set(stopwords.words('english'))
for i in range(row):
    words = word_tokenize(sentences[i])
    # keep only non-stopwords for this sentence
    wordsFiltered = [w for w in words if w not in stopWords]
    vector_total = np.zeros(col)
    count = 0
    for word_ in wordsFiltered:
        try:
            vector_total = vector_total + model[word_]
            count += 1
        except KeyError:
            print("word '%s' not in vocabulary" % word_)
    if count > 0:
        vector_total = vector_total / count  # average of the word vectors
    sentenceMatrix[i, :] = vector_total
print("sentenceMatrix is ", sentenceMatrix[3])
This is the matrix obtained from the input text data.
Step 4. Compute the SVD of the text-data matrix.
u, s, vh = np.linalg.svd(sentenceMatrix)
Step 5. Let's find concepts in the SVD to analyze, try some experimental values, and draw conclusions from them.
First, check the dimensions:
u.shape , vh.shape, s.shape
The output is:
((23, 23), (300, 300), (23,))
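These shapes follow from NumPy's default `full_matrices=True`: for a 23×300 matrix, `u` is 23×23, `vh` is 300×300, and `s` holds min(23, 300) = 23 singular values. A self-contained check, using a random matrix as a stand-in for `sentenceMatrix`:

```python
import numpy as np

A = np.random.rand(23, 300)        # stand-in for sentenceMatrix
u, s, vh = np.linalg.svd(A)        # full_matrices=True by default
print(u.shape, s.shape, vh.shape)  # (23, 23) (23,) (300, 300)

# The SVD factors reconstruct A exactly: A = u @ Sigma @ vh,
# where Sigma is 23x300 with s on its diagonal.
Sigma = np.zeros((23, 300))
np.fill_diagonal(Sigma, s)
print(np.allclose(A, u @ Sigma @ vh))  # True
```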
Take the first two concepts:
sen1 = u[:,:2]
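Keeping only the first k concepts gives a rank-k approximation of the sentence matrix, and increasing k can only shrink the reconstruction error. A sketch of this reduction step, again with a random stand-in for `sentenceMatrix`:

```python
import numpy as np

A = np.random.rand(23, 300)  # stand-in for sentenceMatrix
u, s, vh = np.linalg.svd(A, full_matrices=False)

def rank_k_approx(k):
    """Reconstruct A from the first k concepts only."""
    return (u[:, :k] * s[:k]) @ vh[:k, :]

# Frobenius-norm error of the rank-k reconstruction for several k.
errors = [np.linalg.norm(A - rank_k_approx(k)) for k in (2, 5, 10, 23)]
print(errors)  # decreasing; k = 23 reconstructs A exactly
```

Comparing these errors on your own `sentenceMatrix` shows how many concepts are enough to describe the text, which is the question the exercise asks.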