# AI Exercise
In this article you will find pointers for solving an AI exercise that validates dimensionality reduction.
The aim of this exercise is to study text data in lower dimensions, with the text represented as word embeddings. We compute the SVD of the embedding matrix to analyze the impact of applying SVD to text data in word-embedding form.
AI Exercise:
1. Take a text file and represent it as word embeddings.
2. Perform SVD on the embeddings formed above (the text embeddings).
3. Perform dimensionality reduction by taking various combinations of concepts from the SVD-decomposed matrices.
Doing this shows which concept, or combination of concepts, is enough to describe the text. It also helps you understand what SVD does on top of averaged text embeddings, so you can decide whether SVD on top of the embeddings is needed for your problem.
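The three exercise steps can be sketched end to end with toy embeddings. The three-dimensional vectors and words below are made up purely for illustration; real pre-trained vectors such as GoogleNews are 300-dimensional:

```python
import numpy as np

# Made-up 3-d word vectors, just for illustration.
embeddings = {
    "climate": np.array([0.1, 0.3, 0.5]),
    "change": np.array([0.2, 0.1, 0.4]),
    "is": np.array([0.0, 0.0, 0.1]),
    "real": np.array([0.4, 0.2, 0.0]),
}

def sentence_vector(sentence, emb):
    """Average the vectors of the in-vocabulary words of a sentence."""
    vectors = [emb[w] for w in sentence.lower().split() if w in emb]
    return np.mean(vectors, axis=0)

sentences = ["climate change is real", "climate change"]
matrix = np.vstack([sentence_vector(s, embeddings) for s in sentences])
print(matrix.shape)  # one row per sentence, one column per embedding dimension

# SVD of the sentence matrix; the columns of u are the "concepts".
u, s, vh = np.linalg.svd(matrix)
```

The same structure carries over to the real exercise below: only the source of the word vectors and the size of the matrix change.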
Some hints for solving this exercise:
Step 1. Read the text input. I have taken a climate-change text file from the internet.
sentences = []
inputMediumfile = open("/content/CLIMATE_CHANGE1.txt")
for sentence in inputMediumfile:  # one sentence per line
    sentences.append(sentence)
inputMediumfile.close()
print(sentences[10])
Step 2. Load the word vectors.
import gensim
# Load Google's pre-trained Word2Vec model.
model = gensim.models.KeyedVectors.load_word2vec_format('/content/drive/MyDrive/GoogleNews-vectors-negative300.bin.gz', binary=True)
Step 3. Generate a matrix with entries from the pre-trained word vectors.
In this exercise they are taken from GoogleNews-vectors-negative300.bin.gz. With these vectors, compute each sentence vector as the average of the vectors of all words in the sentence. Words not found in the embedding vocabulary are handled in a try/except block.
import numpy as np
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

row = len(sentences)
col = model.vector_size  # 300 for the GoogleNews vectors
sentenceMatrix = np.zeros((row, col), dtype=float)
stopWords = set(stopwords.words('english'))
for i in range(row):
    words = word_tokenize(sentences[i])
    # keep only non-stopwords for this sentence
    wordsFiltered = [w for w in words if w not in stopWords]
    vector_total = np.zeros(col)
    count = 0
    for word_ in wordsFiltered:
        try:
            vector_total = vector_total + model[word_]
            count += 1
        except KeyError:
            print("word '%s' not in vocabulary" % word_)
    if count > 0:
        vector_total = vector_total / count  # average of the word vectors
    sentenceMatrix[i, :] = vector_total
print("sentenceMatrix is ", sentenceMatrix[3])
This is the matrix obtained from the input text data.
Step 4. Compute the SVD of the text-data matrix.
u, s, vh = np.linalg.svd(sentenceMatrix)
Step 5. Let's find concepts in the SVD to analyze, try some experimental values, and draw conclusions from them.
First, check the dimensions:
u.shape , vh.shape, s.shape
The output is:
((23, 23), (300, 300), (23,))
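These shapes follow from NumPy's default `full_matrices=True`: for a 23×300 matrix, `u` is 23×23, `vh` is 300×300, and `s` holds min(23, 300) = 23 singular values. A self-contained check, using a random matrix as a stand-in for `sentenceMatrix`:

```python
import numpy as np

A = np.random.rand(23, 300)        # stand-in for sentenceMatrix
u, s, vh = np.linalg.svd(A)        # full_matrices=True by default
print(u.shape, s.shape, vh.shape)  # (23, 23) (23,) (300, 300)

# The SVD factors reconstruct A exactly: A = u @ Sigma @ vh,
# where Sigma is 23x300 with s on its diagonal.
Sigma = np.zeros((23, 300))
np.fill_diagonal(Sigma, s)
print(np.allclose(A, u @ Sigma @ vh))  # True
```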
Take the first two concepts:
sen1 = u[:,:2]
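Keeping only the first k concepts gives a rank-k approximation of the sentence matrix, and increasing k can only shrink the reconstruction error. A sketch of this reduction step, again with a random stand-in for `sentenceMatrix`:

```python
import numpy as np

A = np.random.rand(23, 300)  # stand-in for sentenceMatrix
u, s, vh = np.linalg.svd(A, full_matrices=False)

def rank_k_approx(k):
    """Reconstruct A from the first k concepts only."""
    return (u[:, :k] * s[:k]) @ vh[:k, :]

# Frobenius-norm error of the rank-k reconstruction for several k.
errors = [np.linalg.norm(A - rank_k_approx(k)) for k in (2, 5, 10, 23)]
print(errors)  # decreasing; k = 23 reconstructs A exactly
```

Comparing these errors on your own `sentenceMatrix` shows how many concepts are enough to describe the text, which is the question the exercise asks.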