Note: This article also appears on my median.com account.

In the last article, we discussed how Soft Sets, a tool for handling uncertainty, can be used to summarize text data. We also emphasized the role of the WordCloud in this.

In the previous article, a Soft Set was defined as:

**Definition 1:** [Molodtsov 1999] A pair (F, E) is called a Soft Set over a universe under consideration U if F is a mapping from the set E to P(U), the power set of U.

Here F is a mapping (function) from E to P(U), where P(U) is the power set of U, that is, the set of all subsets of U. In other words, F maps each attribute in E to a subset of the universe under consideration. See the example below.

Let the attribute set be E = {a1, a2, …, an}. These are the attributes that describe the universe under consideration. Then each F(ai) is a subset of U; in particular, F(a1), …, F(an) are all subsets of U and hence belong to P(U), the power set of U.
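To make the definition concrete, here is a minimal Python sketch of a soft set stored as a dictionary. The toy universe, the attribute names and the helper functions `power_set` and `is_soft_set` are my own illustrative assumptions, not part of Molodtsov's definition.

```python
from itertools import chain, combinations

# Toy universe and attribute set (illustrative values only).
U = {"s1", "s2", "s3"}
E = ["a1", "a2"]

# F maps each attribute to a subset of U, so (F, E) is a soft set over U.
F = {
    "a1": {"s1", "s2"},
    "a2": {"s3"},
}

def power_set(universe):
    """Return P(U): every subset of the universe, as frozensets."""
    items = sorted(universe)
    subsets = chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1)
    )
    return {frozenset(s) for s in subsets}

def is_soft_set(F, E, U):
    """Check that F is defined on all of E and every F(e) is a subset of U."""
    return all(e in F and set(F[e]) <= set(U) for e in E)

P_U = power_set(U)
```

Each value F(a) is an element of `P_U`, which is exactly what the definition asks for. For a universe of sentences you would never build the power set explicitly; it is shown here only to illustrate the definition.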

Now, let's come back to the problem at hand, which is text summarization. Here the universe is U = {collection of all sentences},

and the attributes can be extracted from the WordCloud, as described in the last article.

The topmost words and phrases in the WordCloud are the attributes that describe the text.

Say wc1, wc2, …, wck are the top k WordCloud words or phrases. These are taken as the attributes defining the universe.
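As a rough illustration of picking the top k attributes, the sketch below counts non-stop words with the standard library instead of a real WordCloud tool. The stop-word list, the tokenizer and the function name `top_k_attributes` are assumptions made for this example only.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "and", "in", "for", "on"}

def tokenize(text):
    """Lowercase the text and return its alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def top_k_attributes(text, k):
    """Return the k most frequent non-stop words as the attribute list E."""
    counts = Counter(t for t in tokenize(text) if t not in STOP_WORDS)
    return [word for word, _ in counts.most_common(k)]

text = "Soft sets handle uncertainty. Soft sets map attributes to sets of sentences."
E = top_k_attributes(text, 2)
print(E)
```

This handles single words only; phrases would need an extra n-gram step.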

Here, the mapping is determined automatically by the semantic similarity between each sentence and each attribute. Thus, construct the mapping F as follows:

F(wc1) = { Si in U such that the similarity between Si and wc1 is at least a chosen threshold }

and similarly for wc2, …, wck. This is the function F, and E is the set of topmost WordCloud attributes.
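The construction of F can be sketched as follows. Note that I use plain word overlap as a stand-in for semantic similarity, which is a simplifying assumption; the names `relatedness` and `build_soft_set` and the default threshold of 0.5 are hypothetical as well.

```python
import re

def tokens(text):
    """Lowercase alphabetic tokens of a sentence or phrase, as a set."""
    return set(re.findall(r"[a-z]+", text.lower()))

def relatedness(sentence, attribute):
    """Stand-in similarity: fraction of the attribute's words found in the sentence."""
    attr_tokens = tokens(attribute)
    if not attr_tokens:
        return 0.0
    return len(attr_tokens & tokens(sentence)) / len(attr_tokens)

def build_soft_set(sentences, attributes, threshold=0.5):
    """Return F as a dict: attribute -> set of indices of related sentences."""
    return {
        wc: {i for i, s in enumerate(sentences) if relatedness(s, wc) >= threshold}
        for wc in attributes
    }

sentences = [
    "Soft sets handle uncertainty in data.",
    "Text summarization selects key sentences.",
    "Soft sets can drive text summarization.",
]
F = build_soft_set(sentences, ["soft sets", "summarization"])
print(F)
```

In practice `relatedness` could be swapped for an embedding-based similarity without changing `build_soft_set`.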

Now what? We have defined a soft set (F, E) over a universe U of sentences.

How to proceed further is the question now.

Let us start with the summarization model.

Consider the information system we have:

This is the information table derived from the text data, which means we need to generate it from the text. The table has sentences as rows and words as columns; the words are typically non-stop words.

The information table can take one of the following forms:

- It can be a frequency-based table, where each cell holds the frequency of a word in a sentence.
- The columns can be phrases as well as words.
- The matrix can be a tf-idf matrix built with the usual tf-idf rules, with idf replaced by inverse sentence frequency.
- Or it can be built from word embeddings; you may try sentence embeddings as well.
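Two of these options, the frequency table and the tf-idf variant with inverse sentence frequency, can be sketched in plain Python. The weighting log(N / sf) is one common choice that I assume here; the function names and the sample sentences are illustrative.

```python
import math
import re
from collections import Counter

def tokenize(sentence):
    """Lowercase alphabetic tokens of a sentence."""
    return re.findall(r"[a-z]+", sentence.lower())

def frequency_table(sentences, vocabulary):
    """Rows are sentences, columns are vocabulary words, cells are raw counts."""
    rows = []
    for s in sentences:
        counts = Counter(tokenize(s))
        rows.append([counts[w] for w in vocabulary])
    return rows

def tf_isf_table(sentences, vocabulary):
    """Weight each count by log(N / sf), where sf counts sentences containing the word."""
    freq = frequency_table(sentences, vocabulary)
    n = len(sentences)
    isf = []
    for j in range(len(vocabulary)):
        sf = sum(1 for row in freq if row[j] > 0)
        isf.append(math.log(n / sf) if sf else 0.0)
    return [[row[j] * isf[j] for j in range(len(vocabulary))] for row in freq]

sentences = [
    "soft sets model uncertainty",
    "rough sets model vagueness",
    "soft sets summarize text",
]
vocabulary = ["soft", "sets", "rough"]
freq = frequency_table(sentences, vocabulary)
weights = tf_isf_table(sentences, vocabulary)
```

A word that occurs in every sentence, such as "sets" here, gets weight 0, which is the intended effect of the inverse sentence frequency.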

Now that we have an information system, let us define the properties of the information table. The following lines give a brief introduction to Rough Sets. A set that cannot be precisely defined with the available knowledge is a Rough Set; these will be discussed in coming articles. For now, a basic definition is enough to start the implementation.

**Definition 2 (Rough Set).** Let *U* be the universe, let the information system be the table (U, P), and let X be a subset of U. Then we define the following:

1. Lower approximation: the elements that surely belong to X, given the knowledge in P.

2. Upper approximation: the elements that may possibly belong to X.

3. Boundary region: the elements in the difference between the upper approximation and the lower approximation.

4. Span: a weighted mean of the lower approximation and the boundary region:

span(X, P) = u1 * lower(X, P) + u2 * boundary(X, P), where u1, u2 ∈ [0, 1] and u1 + u2 = 1

Here the span of X measures how much of the universe X covers.
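The approximations can be sketched in Python using the standard rough set construction, where sentences with identical rows in the table are indiscernible. How the span turns sets into a number is not spelled out here, so the sketch assumes set sizes normalised by |U|; that normalisation, the default weights and the toy table are my assumptions.

```python
from collections import defaultdict

def equivalence_classes(table):
    """Group objects whose attribute rows are identical (indiscernible under P)."""
    groups = defaultdict(set)
    for obj, row in table.items():
        groups[tuple(row)].add(obj)
    return list(groups.values())

def lower_approximation(table, X):
    """Union of the classes that lie entirely inside X."""
    return set().union(*(c for c in equivalence_classes(table) if c <= X))

def upper_approximation(table, X):
    """Union of the classes that share at least one element with X."""
    return set().union(*(c for c in equivalence_classes(table) if c & X))

def boundary_region(table, X):
    """Elements in the upper approximation but not in the lower one."""
    return upper_approximation(table, X) - lower_approximation(table, X)

def span(table, X, u1=0.5, u2=0.5):
    """Assumed numeric reading of the span: set sizes normalised by |U|."""
    n = len(table)
    lower = len(lower_approximation(table, X))
    boundary = len(boundary_region(table, X))
    return (u1 * lower + u2 * boundary) / n

# Toy information table: sentence -> row of attribute values.
table = {"s1": (1, 0), "s2": (1, 0), "s3": (0, 1), "s4": (1, 1)}
X = {"s1", "s3"}
print(lower_approximation(table, X), boundary_region(table, X), span(table, X))
```

Here s1 and s2 have the same row, so s1 cannot be told apart from s2; only s3 surely belongs to X, while s1 and s2 fall in the boundary region.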

So we now have a soft set (F, E) on U and an information system (U, P).

Now consider what summarization means: we need to find a set X such that X is the best representation of U.

This means we need a subset of the universe that spans the universe, that is, one that covers all the essential information and is minimal in description. We can also control the amount of summarization, shrinking the original text to x%, but that is a different issue we’ll deal with later. Let us now see how to get this extract.

Let’s start with a greedy approach and the literal meaning of span, without getting into technicalities. The span of a set is how much of the text it covers, and an extract or summary should have maximum span; this is the aim. For how the span is computed, you can refer to my earlier paper or the coming articles.

This is a proposed greedy algorithm to create the extract.

    cover = 0
    extract = {}                          # empty set
    for each wc in E:                     # wc is a WordCloud attribute
        if span(F(wc)) > cover:
            extract = extract ∪ F(wc)
            cover = cover + span(F(wc))
    output extract

Let me explain how the extract is formed with the greedy algorithm. This is not a complete model, as some points are not covered here yet; they will be explained in coming articles. One such area is redundancy: I shall present the algorithm for handling duplication in the data soon.

Initially, set cover to 0, since at the beginning of the algorithm the set X, the extract, is empty. Then iterate over all the attributes wc of the soft set, derived from the WordCloud and collected in E. For each one, check whether the span of F(wc), a subset of U, is greater than cover. If so, take the extract to be the union of the extract and F(wc), and update the covering ability of the extract by adding the span of F(wc). Repeat this until all attributes in E are exhausted.
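The greedy loop described above can be sketched in Python as follows. The span is passed in as a function; the `coverage_span` used in the example, which simply returns the fraction of sentences covered, is only a placeholder assumption and not the rough set span itself. The soft set F and the attribute list E are toy values.

```python
def greedy_extract(F, E, span):
    """Add F(wc) to the extract whenever its span exceeds the running cover."""
    cover = 0.0
    extract = set()
    for wc in E:
        s = span(F[wc])
        if s > cover:
            extract |= F[wc]   # union of the extract and F(wc)
            cover += s         # update the covering ability of the extract
    return extract, cover

# Toy soft set over 10 sentences, indexed 0..9 (illustrative values).
n_sentences = 10
F = {"wc1": {0, 1, 2}, "wc2": {2, 3}, "wc3": {4, 5, 6, 7, 8}}
E = ["wc1", "wc2", "wc3"]

def coverage_span(subset):
    """Placeholder span: fraction of all sentences contained in the subset."""
    return len(subset) / n_sentences

extract, cover = greedy_extract(F, E, coverage_span)
print(sorted(extract), cover)
```

In this run wc2 is skipped because its span (0.2) does not exceed the cover accumulated so far (0.3). The result depends on the order of E, which is one reason this is a starting model and not a finished one.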

**NOTE: This is a model; it may not be the best technique for solving the problem of text summarization. But an attempt to solve it may allow this technique to be merged with other, more sophisticated techniques to yield high-accuracy output for text summarization or other problems.**

We solve because we are problem solvers. Solving problems is what tells us which path to follow next, and it leads to real-time applications of theoretical topics like these.

**References**

Molodtsov, D. (1999). Soft set theory — first results. *Computers & Mathematics with Applications*, *37*(4–5), 19–31.

Sezgin, A., & Atagün, A. O. (2011). On operations of soft sets. *Computers & Mathematics with Applications*, *61*(5), 1457–1467.

Feng, F., Liu, X., Leoreanu-Fotea, V., & Jun, Y. B. (2011). Soft sets and soft rough sets. *Information Sciences*, *181*(6), 1125–1137.