- Query Expansion is the technique to expand the query terms to improve the retrieval results.
- It adds more relevant terms in query which helps in right retrieval and index searching.
- Words should be identified by the meaning and semantic relations within the domain it is part of.
- Multiple sense of same word may at times be not easy to comprehend, and hence need NLP solutions.
- Sometimes the query terms given to the IR are not the main words that are indexed by the search engine.
- Query written in natural language is often vague and may not result in higher precision of results.
- Require natural language techniques
- Humans are provided with an enormous amount of knowledge but algorithms do not have this vast knowledge, so we can use WordNet for different languages.
- Expanded query gives right direction to Information Retrieval Algorithms.
- This can be done by query expansion which does Word Sense Disambiguation in case of anonymous and multiple ambiguous meanings.
- WordNet is used here to provide a Knowledge Base to solve the above problem.
- In WordNet terms are connected to each other with Part of Speech annotation as synonymy, antonymy, hypernymy, hyponymy, entailment and so on.
- These connections have weights and distance can be calculated between terms for their relative similarity.
- Graph can be built for the query with words as nodes and weights between words as edges.
- Fuzzy Graph can be formed with membership measured to belonging to [0,1].
- Important nodes i.e. words are determined by local connectivity measures or global connectivity measures. These include and are not limited by: Degree Centrality, PageRank, HITS, and Closeness Centrality, edge density.
- Synonyms can be added to the original query for query expansion but Word Sense Disambiguation may be needed here.
- Consider WordNet as a graph in which nodes represent concepts/synsets and edges represent their relationships.
- To represent Query Expansion, the query can be represented as a subgraph of WordNet.
A generalized algorithm based on the algorithm in [1] made especially for Hindi Query Expansion and Hindi Information Retrieval (IR) is as follows.
Note: The modified algorithm below has not been published as of now. This can be modified and implemented in your works, based on the language you are working with. This has not been evaluated yet, it proposed from Hindi IR [1] to Generic IR.
Note: This algorithm is the transformation of Hindi language algorithm to English or any other language WordNet. It has not been tested on the English datasets, but just converted to English WordNet to be understood by more researchers and to be reused.
Note: The algorithm is converted step by step from original algorithm [1]. The original algorithm in Hindi dataset have been tested and evaluated on 3 datasets and produced improvements as per authors.
Algorithm for Generic Language Query Expansion using Graph Measures and Information Retrieval
- Initialize the Graph Gf = (Vf, Ef) where V and E are empty in the beginning.
- Add the senses of query terms Qi to vertices of Graph Gq = (Vq, Eq).
- For each word in Vq perform a search in WordNet considered as a graph. The search can be a Depth First Search (DFS) or a Breath First Search (BFS).
- The Traversal in WordNet should be up to depth delta, in case DFS is used.
- Add new vertices and paths encounters in this traversal to graph Gf if these relations are not present there previously.
- In the graph Gf assigns weights to nodes as per the relation the nodes=words senses have with the word. The relation can be synonymy, hypernymy, hyponymy, entailment and so on. These can also represent word and word gloss relationships.
- Find the Graph based connectivity measures of Gf. These include, Degree Centrality, PageRank, HITS, and Closeness Centrality, and so on.
- For each v in Vf, compute average connectivity measure by calculating the average of above connectivity scores.
- Find V_lcm as all those v such that v is not in Vq and the average score is more than a threshold alpha.
- Let Vi represent words in the i_th sense.
- V_i and V_lcm, is the the i_th interpretation of the query.
- V_int is set of nodes connecting V_i and V_lcm.
- G`i=(V`i, E`i), V`i contains all of V_i, V_int, V_lcm,E`i corresponding edges in these paths.
- Discard any G`i where G`i contains a disconnected node, as disconnection may lead to null interpretation in meanings.
The final words that remains are the words in expanded query.
The algorithm seem fine, however it needs changes and improvements to be used in generic WordNet or to be used with English WordNet. Further, this needs robust modifications to be used in English and other languages. Still, we provided what was can be a proposed paper, our algorithm from scratch shall be presented in coming articles.
References
[1] Jain, A., Vij, S., & Castillo, O. (2019). Hindi query expansion based on semantic importance of Hindi WordNet relations and fuzzy graph connectivity measures. Computación y Sistemas, 23(4), 1337–1355.