# SVD on Text Embeddings - AI Exercise

#AI-Exercise

In this article you will find pointers for solving an AI exercise in which a further reduction of dimensionality is validated.

The aim of this exercise is to study text data in lower dimensions, with the text represented as word embeddings. Compute the SVD of the embedding matrix to analyze the impact of applying SVD to text data in word-embedding form.

## AI Exercise:

1. Take a text file and represent it in the form of word embeddings.

2. Perform SVD on the embeddings formed above, called text embeddings.

3. Now perform the reduction by taking various combinations of concepts from the SVD-decomposed matrices.

Doing this, you can learn which concept or combination of concepts is enough to describe the data, and it also helps you understand what SVD on top of averaged text embeddings means, so you can decide whether SVD on top of embeddings is required for your problem.
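The three steps above can be sketched end to end on a toy example. The embedding dictionary and sentences below are hypothetical stand-ins for a real pretrained model, kept tiny so the sketch runs on its own:

```python
import numpy as np

# Toy word embeddings (hypothetical 4-dimensional vectors standing in
# for real pretrained word vectors).
embeddings = {
    "climate": np.array([1.0, 0.0, 0.5, 0.2]),
    "change":  np.array([0.9, 0.1, 0.4, 0.3]),
    "is":      np.array([0.0, 1.0, 0.0, 0.0]),
    "real":    np.array([0.2, 0.8, 0.1, 0.6]),
}

sentences = ["climate change is real", "climate is change"]

# Step 1: represent each sentence as the average of its word vectors.
dim = len(next(iter(embeddings.values())))
matrix = np.zeros((len(sentences), dim))
for i, sentence in enumerate(sentences):
    words = [w for w in sentence.split() if w in embeddings]
    if words:
        matrix[i] = np.mean([embeddings[w] for w in words], axis=0)

# Step 2: SVD of the sentence-embedding matrix.
u, s, vh = np.linalg.svd(matrix)

# Step 3: reduce by keeping only the first k concepts.
k = 1
reduced = u[:, :k] * s[:k]   # sentence coordinates in concept space
print(reduced.shape)         # (2, 1)
```

With a real model the only change is where the word vectors come from; the averaging and the SVD are identical.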

## Some hints for solving this exercise

```python
sentences = []

inputMediumfile = open("/content/CLIMATE_CHANGE1.txt")

for sentence in inputMediumfile:
    sentences.append(sentence)

print(sentences[10])
```

```python
import gensim
```

In this exercise the pretrained word vectors are taken from GoogleNews-vectors-negative300.bin.gz. With these vectors, compute each sentence vector as the average of all word vectors in the sentence. Words not found in the embedding vocabulary are handled in a try/except block.
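With gensim, such pretrained vectors are typically loaded via `gensim.models.KeyedVectors.load_word2vec_format(path, binary=True)`. The sketch below uses a plain dict as a stand-in for the loaded model (so it runs without the large download): like the real model, the lookup raises `KeyError` for out-of-vocabulary words, which is exactly what the try/except relies on.

```python
import numpy as np

# Hypothetical stand-in for the loaded gensim model: a mapping from
# word to vector that raises KeyError for unknown words.
model = {
    "ocean": np.array([0.3, 0.7, 0.1]),
    "level": np.array([0.5, 0.2, 0.9]),
}

words = ["ocean", "level", "xyzzy"]  # "xyzzy" is not in the vocabulary
vector_total = np.zeros(3)
found = 0
for word_ in words:
    try:
        vector_total += model[word_]  # KeyError if the word is unknown
        found += 1
    except KeyError:
        print("word '%s' not in vocabulary" % word_)

sentence_vector = vector_total / max(found, 1)  # average of found vectors
```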

```python
import numpy as np
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

row = len(sentences)
col = model.vector_size  # embedding dimensionality (300)

sentenceMatrix = np.zeros((row, col), dtype=float)
stopWords = set(stopwords.words('english'))

for i in range(row):
    # Filter stop words out of the current sentence.
    words = word_tokenize(sentences[i])
    wordsFiltered = [w for w in words if w not in stopWords]

    # Sum the word vectors; words missing from the vocabulary
    # raise KeyError and are skipped.
    vector_total = np.zeros(col)
    found = 0
    for word_ in wordsFiltered:
        try:
            vector_total = model[word_] + vector_total
            found += 1
        except KeyError:
            print("word '%s' not in vocabulary" % word_)

    # Sentence vector = average of the word vectors found.
    sentenceMatrix[i, :] = vector_total / max(found, 1)

print("sentenceMatrix is ", sentenceMatrix[3])
```

This is the sentence matrix built from the input text data.

```python
u, s, vh = np.linalg.svd(sentenceMatrix)
```

Let us check the dimensions:

```python
u.shape, vh.shape, s.shape
```

The output is:

```
((23, 23), (300, 300), (23,))
```
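These shapes follow from the full SVD of an m x n matrix: `u` is (m, m), `s` holds the min(m, n) singular values, and `vh` is (n, n). A small sketch on a random matrix of the same size confirms this, and also shows the more memory-friendly "economy" SVD via `full_matrices=False`:

```python
import numpy as np

# A random matrix with the same 23 x 300 shape as the sentence matrix.
A = np.random.rand(23, 300)

u, s, vh = np.linalg.svd(A)  # full SVD
print(u.shape, s.shape, vh.shape)   # (23, 23) (23,) (300, 300)

u2, s2, vh2 = np.linalg.svd(A, full_matrices=False)  # economy SVD
print(u2.shape, s2.shape, vh2.shape)  # (23, 23) (23,) (23, 300)
```

The economy form keeps only the 23 right singular vectors that carry information, which is usually all you need for the reduction step.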

Take the first two concepts from the decomposition:

```python
sen1 = u[:, :2]
```
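Keeping the first k concepts gives the best rank-k approximation of the original matrix, and the approximation error is exactly the norm of the discarded singular values (the Eckart-Young theorem). A sketch on a random matrix of the same shape:

```python
import numpy as np

A = np.random.rand(23, 300)
u, s, vh = np.linalg.svd(A, full_matrices=False)

k = 2
# Rank-2 approximation built from the first two concepts.
A_k = u[:, :k] @ np.diag(s[:k]) @ vh[:k, :]

# Frobenius-norm error equals the norm of the discarded singular values.
err = np.linalg.norm(A - A_k)
assert np.isclose(err, np.linalg.norm(s[k:]))
```

Sweeping k and watching this error (or the singular-value spectrum itself) is one way to decide how many concepts are enough to describe your text data.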