This article offers pointers for solving an AI exercise that examines whether further dimensionality reduction on top of word embeddings is worthwhile.
The aim of this exercise is to study text data in lower dimensions when the text is represented as word embeddings. You compute the SVD of the embedding matrix and analyze the effect of applying SVD to text data in word-embedding form.
1. Take a text file and represent it in the form of word embeddings.
2. Perform SVD on the resulting matrix of text embeddings.
3. Perform the reduction by taking various combinations of concepts from the SVD-decomposed matrices.
Doing this shows which concept, or combination of concepts, is enough to describe the text. It also helps you understand what SVD means on top of averaged text embeddings, and to answer whether SVD on top of the embeddings is needed for your problem.
Some hints toward a solution of this exercise:
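Step 3 can be sketched as follows. This is a minimal illustration and not the exercise's reference solution: the helper names `reconstruct_from_concepts` and `relative_error` are hypothetical, and a small random matrix stands in for the real sentence matrix so that the snippet runs without the text file or the Word2Vec model.

```python
import numpy as np

def reconstruct_from_concepts(u, s, vh, concept_ids):
    """Rebuild the matrix from the selected concepts only (hypothetical helper)."""
    idx = np.asarray(concept_ids)
    # Scale the chosen columns of u by their singular values,
    # then combine them with the matching rows of vh.
    return (u[:, idx] * s[idx]) @ vh[idx, :]

def relative_error(original, approx):
    """Frobenius-norm error of the approximation, relative to the original."""
    return np.linalg.norm(original - approx) / np.linalg.norm(original)

# Assumption: random data standing in for a 6-sentence, 10-dimensional matrix.
rng = np.random.default_rng(0)
A = rng.normal(size=(6, 10))
u, s, vh = np.linalg.svd(A, full_matrices=False)

# Try several combinations of concepts and compare how much is lost.
for concept_ids in ([0], [0, 1], [1, 2], [0, 1, 2, 3, 4, 5]):
    approx = reconstruct_from_concepts(u, s, vh, concept_ids)
    print(concept_ids, round(relative_error(A, approx), 4))
```

A combination whose relative error is already small is a candidate for being "enough" to describe the text; using all concepts reproduces the matrix up to rounding.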
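sentences = []
inputMediumfile = open("/content/CLIMATE_CHANGE1.txt")
# Each line of the file is treated as one sentence.
for sentence in inputMediumfile:
    sentences.append(sentence.strip())
inputMediumfile.close()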
import gensim
# Load Google's pre-trained Word2Vec model.
model = gensim.models.KeyedVectors.load_word2vec_format('/content/drive/MyDrive/GoogleNews-vectors-negative300.bin.gz', binary=True)
In this exercise the word vectors are taken from GoogleNews-vectors-negative300.bin.gz. With these vectors, compute each sentence vector as the average of all word vectors in the sentence. Words not found in the embedding vocabulary are handled in a try/except block.
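import numpy as np
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
row = len(sentences)
col = model.vector_size  # embedding dimension, 300 for these vectors
sentenceMatrix = np.zeros((row, col), dtype=float)
stopWords = set(stopwords.words('english'))
for i in range(row):
    words = word_tokenize(sentences[i])
    # Keep only the words that are not stop words.
    wordsFiltered = []
    for w in words:
        if w not in stopWords:
            wordsFiltered.append(w)
    vector_total = np.zeros(col)
    found = 0  # number of words that have an embedding
    for word_ in wordsFiltered:
        try:
            vector1 = model[word_]
            vector_total = vector1 + vector_total
            found += 1
        except KeyError:
            print("word '%s' not in vocabulary" % word_)
    # Average over the words found; the row stays zero if none were found.
    sentenceMatrix[i, :] = vector_total / max(found, 1)
print("sentenceMatrix is ", sentenceMatrix)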
This is the sentence matrix built from the input text: one row per sentence and one column per embedding dimension.
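To try the averaging logic without downloading the full model, a toy version can help. This is only a sketch: the three-dimensional `toy_vectors` and the helper `sentence_vector` are made up for illustration and are not part of gensim or of the exercise.

```python
import numpy as np

# Assumption: hypothetical 3-dimensional embeddings instead of the real 300-d ones.
toy_vectors = {
    "climate": np.array([1.0, 0.0, 2.0]),
    "change": np.array([3.0, 2.0, 0.0]),
}

def sentence_vector(tokens, vectors, dim):
    """Average the vectors of the tokens that have an embedding (hypothetical helper)."""
    total = np.zeros(dim)
    found = 0
    for token in tokens:
        try:
            total += vectors[token]
            found += 1
        except KeyError:
            pass  # out-of-vocabulary word: skip it
    # Return the mean, or the zero vector if no token was found.
    return total / found if found else total

vec = sentence_vector(["climate", "change", "xyzzy"], toy_vectors, dim=3)
print(vec)  # [2. 1. 1.]
```

The unknown token "xyzzy" is ignored, so the result is the mean of the two known vectors rather than a sum diluted by a missing word.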
u,s,vh = np.linalg.svd(sentenceMatrix)
Let us check the dimensions
u.shape , vh.shape, s.shape
((23, 23), (300, 300), (23,))
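These shapes follow from NumPy's default `full_matrices=True`: with 23 sentences and 300 embedding dimensions, `u` is 23 x 23, `vh` is 300 x 300, and there are only 23 singular values, so only the first 23 rows of `vh` are paired with a concept. The sketch below uses a random matrix of the same shape as a stand-in for `sentenceMatrix` (an assumption made so that it runs on its own) and compares the default with `full_matrices=False`.

```python
import numpy as np

# Assumption: random data with the same shape as the 23 x 300 sentence matrix.
rng = np.random.default_rng(0)
demo = rng.normal(size=(23, 300))

# Default (full) SVD: vh is square in the embedding dimension.
u_full, s_full, vh_full = np.linalg.svd(demo)
print(u_full.shape, vh_full.shape, s_full.shape)  # (23, 23) (300, 300) (23,)

# Reduced SVD: vh keeps only the rows that pair with a singular value.
u_red, s_red, vh_red = np.linalg.svd(demo, full_matrices=False)
print(u_red.shape, vh_red.shape, s_red.shape)  # (23, 23) (23, 300) (23,)

# The reduced factors are enough to rebuild the matrix.
rebuilt = (u_red * s_red) @ vh_red
print(np.allclose(rebuilt, demo))  # True
```

The reduced form is usually the more convenient one for picking concepts, since row `j` of `vh_red` always corresponds to singular value `s_red[j]`.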
Take the first two concepts from there:
sen1 = u[:,:2]
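One way to judge whether two concepts are enough is to look at the share of the total squared singular values they carry, and to scale the selected columns of `u` by their singular values to obtain two-dimensional coordinates for each sentence. The following is a hedged sketch on random stand-in data; the names `coords` and `energy` are illustrative and not taken from the exercise.

```python
import numpy as np

# Assumption: random data standing in for the 23 x 300 sentence matrix.
rng = np.random.default_rng(1)
demo = rng.normal(size=(23, 300))
u, s, vh = np.linalg.svd(demo, full_matrices=False)

k = 2  # number of concepts to keep

# One k-dimensional point per sentence: columns of u scaled by singular values.
coords = u[:, :k] * s[:k]
print(coords.shape)  # (23, 2)

# The same coordinates come from projecting the rows onto the first k rows of vh.
print(np.allclose(coords, demo @ vh[:k].T))  # True

# Fraction of the total squared singular values captured by the first k concepts.
energy = np.sum(s[:k] ** 2) / np.sum(s ** 2)
print(round(float(energy), 3))
```

On real text the first few concepts typically carry a much larger share than on random data, which is what makes the comparison between different values of `k` informative.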