Genetic Algorithm (GA)
Genetic Algorithm is one of the nature inspired algorithm which works on Darwins theory of survival of fittest. The key operators that we have in Genetic Algorithms are
- Selection
- Crossover
- Mutation
These three operators also forms the basis reproduction and subsequent workings in natural process be it be in human biology or in botony and so on.
The terms are self explanatory, but I can explain them in one line each for completion.
- Selection: Select the best parents form a population for reproduction
- Crossover: Choose two new offsprings from parents by one of the several crossover techniques
- Mutation: Flip some bits to explore unexplored solutions to problems or in biological terms to get new features not in parents
Now how we can use these terms in solving the problem of Information Retrieval.
Information Retrieval (IR)
Here, from a set of documents, documents related to a query is to be found. This can be used in search engines or desktop search as well. The documents which are most relevant to the query are produced as outputs.
There are various algorithms to compute similarity between a document and query and then to rank the document based on similarity score. Many kind of similarity are applied here, most popular one in cosine similarity. Several Graph based solutions are produced as well, one that is very famous is PageRank Algorithm.
Information Retrieval with Genetic Algorithm
Here we take the following points:
- The documents numbered from 1 to N
- Chromosome represented to solution, where 1 represents that gene is a 1, that is a document selected for ranking
- The similarity score between query and document, computed per chromosome
- Running iterations to get best chromosome, the output, where 1 represents a document is selected and 0 represents that document is not selected.
Code and steps in IR with GA
Similarity Computation is the key factor that drives the GA algorithm and selection of next chromosome.
Here is how we compute similarity between text and query
pip install semantic-text-similarity
pip install geneticalgorithm
from semantic_text_similarity.models import WebBertSimilarity
web_model = WebBertSimilarity(device='cpu', batch_size=10)
def similarityBetweenText(text1, text2):
return web_model.predict([(text1,text2)])
Query initialization
Initialize the query and call the function to compute similarity between query and document
query = "With climate change glaciers are melting"
def similarity(doc):
return similarityBetweenText(query, doc)
Call to this method:
Assuming all documents are loaded in document_text third dimention, we can compute similarity as follows.
You can load documents in any other way that is convenient to you.
simly = similarity(document_text[0][2])
print(simly)
Code to Run GA for IR
The optimization objective function is defined as follows
The following is demo fitness function, this can be updated
import numpy as np
from geneticalgorithm import geneticalgorithm as ga
length = len(document_text)
similarity_i = np.zeros(length)
print("length is : ", length)
def f(combinational_array):
for i in range(20):
if combinational_array[i] >0:
similarity_i[i] = similarity(document_text[i][2])
else:
similarity_i[i] = 0
sum = np.sum(similarity_i)
return -1*sum
The penalty function is here, this will make sure how many documents we have to retrieve. This is a demo penalty function, it can be optimised.
def p(combinational_array):
penality=0
M= 50
N=length/4
if sum(combinational_array) < N:
penality= np.sum(combinational_array))
else:
penalty = M
return penality
Now final call.
combinational_array=np.array([[0,1]]*length)
model=ga(function=f+p,dimension=length,variable_type='int',variable_boundaries=varbound)
model.run()
print(combinational_array)