Note: This same article also appears on my median.com account, and possibly on researchgate.com as well.
Abstract- There is a profound need to build robust evaluation techniques for NLP tasks. The current evaluation techniques are quite old, while the algorithms they evaluate are the latest state-of-the-art methods. Evaluation today still rests on BLEU, ROUGE, or basic n-gram precision- and recall-based measures, so there is a need to redefine these evaluation tasks. This article explains that need and also suggests some techniques that can be used to evaluate the latest models.
1. Introduction
Natural Language Processing (NLP) is moving rapidly ahead with the latest state-of-the-art papers, and many of the best AI models now outperform prior methods. Do you think that word-matching measures of precision and recall, whether on their own or within BLEU and ROUGE scores, are enough for today's challenges? BLEU was declared the standard in many NLP tasks, such as machine translation, while ROUGE, a recall-based formula, is a widely used measure of accuracy in text summarization tasks. Many other NLP tasks use precision and recall as such. These need to change as we aim so high in today's computations, be it ChatGPT or efficient search algorithms. The problem is simply not a word-to-word comparison, unless it is text categorization, where the target class is a single word.
This is an era of advanced developments, wherein every other week there is news of a new benchmark being reached. Consider the task of text summarization as an example. Novel deep learning-based techniques have now been developed for both extractive and abstractive summarization. So why use a word n-gram model alone; why not a more inclusive computation of results? Here the reference summaries are written by experts, so why not devise new formulas that consider not a word-to-word match but a meaning-to-meaning match? More on this follows in Section 2.
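To make the limitation concrete, here is a minimal, self-contained sketch of modified n-gram precision, the word-matching core of BLEU, applied to a paraphrase pair. The sentences and the function are illustrative examples of mine, not taken from any particular toolkit:

```python
from collections import Counter

def ngram_precision(reference, candidate, n=1):
    """Fraction of candidate n-grams that also appear in the reference,
    with clipped counts, as in BLEU's modified precision."""
    def ngrams(text, n):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    ref, cand = ngrams(reference, n), ngrams(candidate, n)
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    total = sum(cand.values())
    return overlap / total if total else 0.0

reference = "the economy grew rapidly last year"
paraphrase = "last year the economy expanded quickly"  # same meaning, different words

print(ngram_precision(reference, paraphrase, n=1))  # 0.666…: "expanded", "quickly" get no credit
print(ngram_precision(reference, paraphrase, n=2))  # 0.4: word order shifts hurt even more
```

Although the paraphrase preserves the meaning perfectly, the synonyms "expanded quickly" score zero against "grew rapidly", which is exactly the word-to-word blindness this article argues against.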
2. Proposed Techniques for NLP-Based Evaluations
The current evaluation of many NLP techniques relies on n-gram models, with or without precision and recall. These methods were good when they were introduced, but the field has since moved on, and the outputs that NLP engines produce today are not the same as those of a few years ago. Image recognition has reached major advances too, so its evaluations also deserve reconsideration, not just NLP's. However, image recognition largely depends on accuracy, whereas in NLP accuracy suffices only for tasks such as sentiment classification, text categorization, spam detection, and the like. Many NLP tasks need to compare a generated text with another text, even in chatbot performance, and here we need NLP expertise to evaluate these engines.
Traditionally it was recall and precision, which over time grew into n-gram models combined with recall and precision. But now we have a translation generated with a transformer or a sequence-to-sequence model, or a summary made with an LSTM. Why would we evaluate these with a word-to-word n-gram model? It was understandable that we tested the output of a transformer with an n-gram model, since deep learning entered our lives only recently, but there is still time now to update the evaluation criteria. Here are some of my recommendations, and how they can be used to produce better results.
Some ways in which evaluations in NLP can be performed are as follows:
I. Similarity assessment using deep learning. Here a similarity assessment is made between the target and the system-generated outputs. This keeps the overall texts inclusive and provides a genuine mark of performance between 0 and 1, computed with deep learning techniques.
II. N-gram assessments with deep learning. The current models check for exact word matches between n-grams; here, similar recall- and precision-based models can be built, but with n-gram similarities computed by deep learning models.
III. Predictions made on both texts using an LSTM, with the evaluation derived from those predictions.
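As a hedged illustration of item I, the sketch below scores a candidate against a reference by the cosine similarity of mean-pooled embeddings. The two-dimensional word vectors are invented toy values standing in for a real learned encoder, which is what a deep learning model would actually supply; the names and numbers here are my own assumptions, not an established metric:

```python
import math

# Toy word vectors standing in for a pretrained deep learning encoder.
# The values are invented for illustration only.
TOY_VECTORS = {
    "grew": [0.9, 0.1], "expanded": [0.85, 0.15],
    "rapidly": [0.2, 0.8], "quickly": [0.25, 0.75],
    "economy": [0.5, 0.5],
}

def embed(text):
    """Mean-pool word vectors into one sentence vector; unknown words are skipped."""
    vecs = [TOY_VECTORS[w] for w in text.lower().split() if w in TOY_VECTORS]
    if not vecs:
        return [0.0, 0.0]
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(2)]

def similarity_score(reference, candidate):
    """Cosine similarity of sentence embeddings: a score in [0, 1]
    for non-negative vectors, as proposal I asks for."""
    a, b = embed(reference), embed(candidate)
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

print(similarity_score("the economy grew rapidly", "the economy expanded quickly"))
```

For this paraphrase pair, a word-overlap measure would penalize "expanded quickly", while the embedding score stays near 1 because the toy vectors for the synonyms are close, which is the meaning-to-meaning behaviour proposal I is after.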
This does not mean we will retire the older evaluation techniques. We still use the older word-to-word n-gram match models, and many results are built on top of them. Hence this is simply extra work for now: carry out the older computations as well as compute the new scores. The exact code and formulas for the newer evaluations will take time to emerge; I shall publish one once what I have in mind is worked out. Until then it is just some extra work, because this is the future and we are heading into it. Cohesion and coherence, too, should be defined in deep learning terms for the newly generated text.
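Item II above, which keeps the familiar precision/recall shape but softens the exact n-gram match into a similarity, could be sketched as follows. Again, the word vectors are invented toy values in place of a trained deep learning model, and the scoring scheme is one plausible formulation of my own, not a standard formula:

```python
import math

# Toy word vectors standing in for a learned embedding model (invented values).
VEC = {"grew": [0.9, 0.1], "expanded": [0.85, 0.15],
       "rapidly": [0.2, 0.8], "quickly": [0.25, 0.75],
       "economy": [0.5, 0.5], "the": [0.4, 0.4]}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def embed_ngram(gram):
    """Mean-pool the word vectors of one n-gram; unknown words are skipped."""
    vecs = [VEC[w] for w in gram if w in VEC]
    if not vecs:
        return [0.0, 0.0]
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(2)]

def ngrams(text, n):
    toks = text.lower().split()
    return [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]

def soft_ngram_f1(reference, candidate, n=2):
    """Precision/recall over n-grams, but each n-gram is matched by
    embedding similarity instead of exact word identity."""
    ref = [embed_ngram(g) for g in ngrams(reference, n)]
    cand = [embed_ngram(g) for g in ngrams(candidate, n)]
    precision = sum(max(cosine(c, r) for r in ref) for c in cand) / len(cand)
    recall = sum(max(cosine(r, c) for c in cand) for r in ref) / len(ref)
    return 2 * precision * recall / (precision + recall)

print(soft_ngram_f1("the economy grew rapidly", "the economy expanded quickly"))
```

Exact bigram matching would give this paraphrase pair little credit, whereas the soft score stays high because each candidate bigram finds a semantically close counterpart, so the old precision/recall machinery is preserved while the matching becomes meaning-based.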
3. Conclusion and Future Work
In this article we have explained why plain n-gram, word-to-word evaluation needs to be supported by newer deep learning-based evaluation techniques. This does not mean that the old n-gram-based recall and precision models will no longer be needed; they will still be required. The point of this article is simply that, because the most recent research is based on state-of-the-art deep learning techniques, it also requires a state-of-the-art deep learning-based evaluation model: one based on the similarity of results computed with deep learning, i.e., how similar the system-generated result is to the provided reference, alongside human evaluations.