I recently read an article about non-conventional features. The article is given in reference . It interested me all the more because I had previously worked on non-conventional features and feature clustering as part of my research. Building features for an NLP task is not just a matter of taking the words as given and applying stemming and lemmatization. Some people have recently suggested using Soundex as a feature in the vector space model as well.
Using Soundex as a feature in ML, NLP and AI tasks has various uses and benefits. They are not very evident at first, but they show up when you look in depth. Words that sound similar can tell us about the origins of words and how words travelled along with trade merchants, not just along the Silk Road but along many other trade routes. This makes Soundex useful for the study of civilizations: knowing ancient history and the way words developed. The sounds that are more popular in a particular place, script or language are the more dominant ones, and they must have travelled to other parts of the world, where they were adopted as a gesture of, maybe, love, or under the dominance of other civilizations. This is an application in the study of history.
Further questions follow: how does one use a particular sound, which societies use a particular sound more often, does it benefit them, and how do some sounds correlate with sounds present in nature? For example, the words "cool" and "kool" are both often used in English text chats. The point here is that there is a bird called the koyal that makes this sound naturally: the sound is “KOO”.
All this doesn’t end at Soundex. The Soundex implementation works by hardcoded rules: it replaces certain letters in the word with digits and drops the more commonly occurring letters, such as vowels. This gives a good similarity measure between pronunciations, but we need a more comprehensive library than Soundex, one that paves the way for in-depth analysis of the emphasis on words, how long certain sounds within a text fragment are spoken, and so on. The text fragment can be a word, a phrase or even a sentence.
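To make the hardcoded rules concrete, here is a minimal sketch of the commonly described American Soundex variant. The function name `soundex` and the way non-letters are handled are my own choices for illustration, not taken from any particular library.

```python
def soundex(word: str) -> str:
    """Return a 4-character Soundex code (American Soundex variant)."""
    # Hardcoded letter groups: consonants that sound alike share a digit.
    groups = (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
              ("L", "4"), ("MN", "5"), ("R", "6"))
    codes = {ch: digit for letters, digit in groups for ch in letters}

    # Assumption: anything that is not a letter is simply ignored.
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""

    result = letters[0]                 # the first letter is kept as-is
    prev = codes.get(letters[0], "")    # its code still blocks a repeat
    for ch in letters[1:]:
        if ch in "HW":
            continue                    # H and W do not separate equal codes
        code = codes.get(ch, "")        # vowels and Y have no code
        if code and code != prev:
            result += code
        prev = code                     # a vowel resets the "previous" code
    return (result + "000")[:4]         # pad with zeros, truncate to 4


print(soundex("Robert"), soundex("Rupert"))  # both R163
print(soundex("cool"), soundex("kool"))      # C400 and K400
```

Note that "cool" and "kool" end up with different codes, because Soundex always keeps the first letter. That is one small illustration of why a richer phonetic representation than Soundex may be needed.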
Once this data is appended to the feature space in ML, AI and NLP tasks, it can help reveal patterns in the kind of speech and the way it is said. Further, this is not limited to sound. Many more feature compression techniques, which I have personally experimented with, can be included in the feature map: one, to reduce dimensionality; two, to make the data more dense; and three, to find more hidden details and, in a way, make it easier to understand the logic behind some of the black box algorithms.
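As a rough sketch of what appending sound-based data to the feature space could look like, the snippet below adds a Soundex feature next to each plain word feature in a bag-of-words representation. The names `featurize`, `w=` and `sdx=` are hypothetical, and the compact `soundex` helper is an illustrative American Soundex implementation, not a specific library's API.

```python
from collections import Counter

# Hardcoded Soundex letter groups (American Soundex variant).
_GROUPS = (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
           ("L", "4"), ("MN", "5"), ("R", "6"))
_CODES = {ch: digit for letters, digit in _GROUPS for ch in letters}


def soundex(word: str) -> str:
    """Compact Soundex helper; returns "" if the word has no letters."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    result, prev = letters[0], _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue
        code = _CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code
    return (result + "000")[:4]


def featurize(text: str) -> Counter:
    """Bag of words with one extra Soundex feature per token."""
    features = Counter()
    for token in text.lower().split():
        features["w=" + token] += 1       # the conventional word feature
        code = soundex(token)
        if code:
            features["sdx=" + code] += 1  # the appended sound feature
    return features


a = featurize("nice colour")
b = featurize("nice color")
# The spellings differ, but both documents share the same sound feature.
print(sorted(set(a) & set(b)))
```

Since several spellings collapse onto one code, using the `sdx=` features in place of the `w=` features would also shrink the vocabulary, which is one simple form of the dimensionality reduction mentioned above.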
To be continued..