This is code for word sense disambiguation where, instead of a bag of common words, we use the semantic similarity between each sense of the target word and the word's context.
The algorithm is similar to the Simplified Lesk Algorithm, but in place of counting the words shared between the context of the word and the sense gloss, we compute the semantic similarity between the two. The sense with the highest semantic similarity is declared to be the sense of the word. Various variations can be formulated from this as well. I have not conducted the accuracy computation as of now, as I want to improve the code further; in case you want to try it, the code is here and the data sets can be found online. I have been experimenting with the Senseval-2 and Senseval-3 datasets. My earlier paper evaluates the 2007, 2013 and 2015 datasets too. The link to my previous work, which produced nice results, is given here:
(PDF) Fuzzy Rough Set Span based Unsupervised Word Sense Disambiguation (researchgate.net)
https://www.researchgate.net/publication/354130784_Fuzzy_Rough_Set_Span_based_Unsupervised_Word_Sense_Disambiguation?channel=doi&linkId=61268d205567a03d006e672c&showFulltext=true
This article is not drawn from my research paper above. Here I do not present a research article but an AI exercise; it may turn into a research article if the results are good and more improvements can be made. For now I am presenting it as an AI research exercise.
Consider the input data from the Senseval-3 dataset; let us take the following lines from it. The data file is an XML file, so one needs to read the XML and extract the tagged entries (a minimal parsing sketch follows the snippet below).
<corpus lang="en" source="senseval3">
<text id="d000">
<sentence id="d000.s001">
<wf lemma="haney" pos="NOUN">Haney</wf>
<instance id="d000.s001.t000" lemma="peer" pos="VERB">peered</instance>
<wf lemma="doubtfully" pos="ADV">doubtfully</wf>
<wf lemma="at" pos="ADP">at</wf>
<wf lemma="he" pos="PRON">his</wf>
<wf lemma="drinking" pos="NOUN">drinking</wf>
<instance id="d000.s001.t001" lemma="companion" pos="NOUN">companion</instance>
<wf lemma="through" pos="ADP">through</wf>
<instance id="d000.s001.t002" lemma="bleary" pos="ADJ">bleary</instance>
<wf lemma="," pos=".">,</wf>
<wf lemma="tear-filled" pos="ADJ">tear-filled</wf>
<instance id="d000.s001.t003" lemma="eye" pos="NOUN">eyes</instance>
<wf lemma="." pos=".">.</wf>
</sentence>
</text>
</corpus>
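The evaluation code further down relies on a helper, get_Target_Sentence, that reads this XML into a list of data points; its definition is not included in this article. Below is a minimal sketch of what such a reader could look like. The gold-key file path, the unused second argument and the dictionary keys are my assumptions, chosen to mirror how the data points are used later.

import lxml.etree as et

# Sketch only: one possible reader for the senseval3 data file together with its gold key file.
def get_Target_Sentence(fileName, flag, goldFile='/content/senseval3.gold.key.txt'):
    # flag is kept only to match the call signature used below; it is unused in this sketch.
    # gold line "d000.s001.t000 peer%2:39:00::" -> {"d000.s001.t000": ["peer%2:39:00::"]}
    gold = {}
    with open(goldFile) as g:
        for line in g:
            parts = line.split()
            gold[parts[0]] = parts[1:]          # keep all gold keys (there can be more than one)
    data_points = []
    tree = et.parse(fileName)
    for sentence in tree.iter('sentence'):
        context = [tok.get('lemma') for tok in sentence]   # the sentence lemmas form the context
        for tok in sentence:
            if tok.tag == 'instance':                      # only <instance> tokens are disambiguated
                data_points.append({'target_word': tok.get('lemma'),
                                    'pos_target': tok.get('pos'),
                                    'context': context,
                                    'target_sense': gold.get(tok.get('id'))})
    return data_points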
The task here is to disambiguate the sense of the word “peer”.
The senses to disambiguate are:
[Synset('peer.n.01'), Synset('peer.n.02'), Synset('peer.v.01')]
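These candidate senses come straight from WordNet; for example, with NLTK's WordNet corpus installed:

from nltk.corpus import wordnet as wn
print(wn.synsets('peer'))
# [Synset('peer.n.01'), Synset('peer.n.02'), Synset('peer.v.01')]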
The gold file has the target senses, which were tagged by human experts in most cases. Examples of some tags are as follows:
d000.s000.t000 man%1:18:00::
d000.s000.t001 say%2:32:01::
d000.s001.t000 peer%2:39:00::
d000.s001.t001 companion%1:18:00::
d000.s001.t002 bleary%5:00:00:indistinct:00
d000.s001.t003 eye%1:08:00::
d000.s002.t000 have%2:40:00::
d000.s002.t001 ready%5:00:01:available:00
d000.s002.t002 answer%1:04:00::
d000.s002.t003 much%3:00:00::
d000.s002.t004 surprise%1:12:00::
d000.s002.t005 fit%1:26:00::
d000.s002.t006 coughing%1:26:00::
d000.s003.t000 man%1:18:00::
These are sense keys and can be converted to synsets as explained below.
Let us start reading this file and apply the modified WSD algorithm explained at the top of this article.
Let us import the libraries.
import re
import numpy as np
import nltk
nltk.download('wordnet')
nltk.download('stopwords')
from nltk.corpus import wordnet as wn
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import lxml.etree as et

wordnet_lemmatizer = WordNetLemmatizer()
Now we start the evaluations: we read the XML file, and from its tags we obtain the context of each target word, viz. the sentence, which needs to be assembled.
def start_evaluations_semeval3(fileName):
    # read the XML file into a list of data points (see the parsing sketch above)
    dataset_semeval3 = get_Target_Sentence(fileName, 1)
    correct_classified = 0
    with open('/gdrive/My Drive/output_D_new_file.txt', 'a') as f:
        f.write('success ' + 'predicted ' + ' target ' + ' word\n\n')
    valid = 0
    i = 0
    correct_sense_predicted = []
    for data_point in dataset_semeval3:
        target_word = data_point['target_word']
        pos_target = data_point['pos_target']
        sentence_input = data_point['context']
        target_sense_key = data_point['target_sense']
        # convert the gold sense key(s) into WordNet synset(s)
        target_synset = get_Target_Synset(target_sense_key)
        print('target synset is', target_synset)
        sentence_words = sentence_input
        result = get_WSD(sentence_words, pos_target, target_word, target_synset)
        correct_sense_predicted.append(result[1])
        print('predicted senses so far ::', correct_sense_predicted)
        if result[0] >= 0:
            correct_classified += result[0]
            valid += 1
        i += 1
        with open('/gdrive/My Drive/output_D_new_file.txt', 'a') as f:
            f.write(str(i) + ' ' + str(result[0]) + ' ' + str(result[1]) + ' '
                    + str(result[2]) + ' ' + str(result[3]) + ' ' + str(correct_classified))
            f.write('\n')
        print('sentence_words', sentence_words)
    accuracy1 = correct_classified / valid
    return accuracy1
fileName = '/content/senseval3.data.xml'
print(start_evaluations_semeval3(fileName))
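get_Target_Synset is another helper whose definition is not included here. Since the prediction is later tested with membership in target_synset, I assume it returns a list of synsets for the gold key or keys; a minimal sketch under that assumption, using the same wn.lemma_from_key conversion as get_synset_from_key further down:

def get_Target_Synset(target_sense_key):
    # Sketch: accept a single key or a list of keys and return the matching synsets as a list,
    # so that "sense_predicted in target_synset" in get_WSD works as a membership test.
    if isinstance(target_sense_key, str):
        target_sense_key = [target_sense_key]
    return [wn.lemma_from_key(key).synset() for key in target_sense_key]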
Now to the word sense disambiguation itself: we compute the similarity between the definition (gloss) of each sense and the word's context.
def get_WSD(sentence_words, pos, word_target, target_synset):
    # candidate senses of the target word from WordNet (pos is currently unused,
    # but could be passed to get_all_sense to restrict the candidates)
    all_senses = get_all_sense(word_target)
    count_senses = len(all_senses)
    coverSenses = np.zeros(count_senses)
    # score each sense by the similarity of its gloss with the sentence context
    i = 0
    for sense_i in all_senses:
        gloss_main = get_gloss_from_synset(sense_i)
        print(gloss_main)
        coverSenses[i] = sentenceCoverage(sense_i, sentence_words, gloss_main)
        i += 1
    # pick the sense with the highest similarity score
    max_val = coverSenses[0]
    index_max = 0
    for i in range(len(coverSenses)):
        if max_val < coverSenses[i]:
            max_val = coverSenses[i]
            index_max = i
    correct_sense_predicted = [all_senses[index_max]]
    # count the prediction as correct if it matches one of the gold synsets
    for sense_predicted in correct_sense_predicted:
        if sense_predicted in target_synset:
            return 1, correct_sense_predicted, target_synset, word_target
    return 0, correct_sense_predicted, target_synset, word_target
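get_gloss_from_synset is not shown in this article either. Because sentenceCoverage below indexes gloss_main[0], I assume it returns a list whose first entry is the sense definition; a minimal sketch under that assumption:

def get_gloss_from_synset(synset):
    # assumed: the WordNet definition first, followed by the example sentences
    return [synset.definition()] + synset.examples()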
We use spaCy's basic English model to compute the similarity of words, instead of counting the bag of shared words as in the simplified algorithm.
import spacy
nlp = spacy.load('en_core_web_sm')

# Note: en_core_web_sm ships without static word vectors, so similarity scores are less
# reliable and spaCy will warn about it; en_core_web_md or en_core_web_lg work better here.
def overlap(gloss_main, U1):
    len_overlap = 0
    gloss_nlp = nlp(gloss_main)
    U1_nlp = nlp(U1)
    print('The sentences are:')
    print(U1_nlp)
    print(gloss_nlp)
    # sum the pairwise token similarities between the gloss and the context sentence
    for tok1 in gloss_nlp:
        for tok2 in U1_nlp:
            len_overlap = len_overlap + tok1.similarity(tok2)
    # normalise by the gloss length (token count) so longer glosses are not favoured
    return (len_overlap + 1) / (1 + len(gloss_nlp))
def sentenceCoverage(sense_i, U1, gloss_main):
    print('gloss_main is', gloss_main[0])
    print('Universe is')
    print(U1)
    # join the context words back into a single sentence string for spaCy
    sent_U = ""
    for word1 in U1:
        sent_U = sent_U + word1 + " "
    # similarity between the first gloss entry (the definition) and the context sentence
    overlapping_score = overlap(gloss_main[0], sent_U)
    return overlapping_score
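As a quick illustration of how the scoring pieces fit together, one can score the gloss of peer.v.01 against the example context from the XML above (the exact value depends on the spaCy model used; get_gloss_from_synset is the sketch from earlier):

context_words = ['haney', 'peer', 'doubtfully', 'at', 'he', 'drinking',
                 'companion', 'through', 'bleary', 'tear-filled', 'eye']
sense = wn.synset('peer.v.01')
print(sentenceCoverage(sense, context_words, get_gloss_from_synset(sense)))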
Some helper functions are here:
def get_synset_from_key(key):
    # map a WordNet sense key (as used in the gold file) to its synset
    lemma = wn.lemma_from_key(key)
    synset = lemma.synset()
    return synset
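For example, the gold key of the 'peer' instance above should map back to its verb sense:

print(get_synset_from_key('peer%2:39:00::'))
# expected: Synset('peer.v.01')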
Another helper function
def get_all_sense(word, pos=None):
    # all WordNet synsets for the word, optionally restricted to a part of speech
    if pos:
        synsets = wn.synsets(word, pos)
    else:
        synsets = wn.synsets(word)
    return synsets
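Passing a part of speech narrows the candidates; restricting 'peer' to verbs, for instance, should leave a single sense (the POS tags from the XML, such as 'VERB', would first have to be mapped to WordNet tags such as wn.VERB):

print(get_all_sense('peer', wn.VERB))
# expected: [Synset('peer.v.01')]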
Conclusion & Future Work
We have tried to disambiguate a word to its right sense, given its context, and we will try to improve the model further using some nice research ideas. For now this is it; its accuracy can still be measured and computed. This is an AI research exercise, as it can be extended to research tasks in word sense disambiguation. These are the basics, with a mild variation. All the best.