Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Find substring in text which has the highest similarity to a given keyword

Say I have this text = I love apples, kiwis, oranges and bananas and the searchString = kiwis and bananas and a similarity algorithm say Jaccard index. How can I efficiently find the substring in text which has the highest similarity to searchString.

Basically I am trying to find portions of text (text has high errors, misspellings, extra symbols and spaces) which match a list of keywords I have.

like image 744
pathikrit Avatar asked Sep 13 '16 23:09

pathikrit


3 Answers

Jaccard index is "lucky" similarity algorithm, because you can update it's value for new symbol without recalculating all previous stuff. So, you can view text as a sequence of diffs for resulting index value. After that, problem can be reduced to https://en.wikipedia.org/wiki/Maximum_subarray_problem.

What about your second paragraph, if you are doing some NLP-like research, I'd suggest to clean your data (remove those extra symbols and spaces, whenever that's possible) before further processing. That's known as "spelling correction", and there are tons of different algorithms and libraries. To choose appropriate one, extra information about your domain is needed.

like image 140
dveim Avatar answered Oct 13 '22 07:10

dveim


Take a look at shingling technique, and try to find the similarity. you can follow this link: http://nlp.stanford.edu/IR-book/html/htmledition/near-duplicates-and-shingling-1.html

For example use 9 shingle and compare each subset with your specific keyword

like image 28
Masoud Avatar answered Oct 13 '22 06:10

Masoud


I Use Stemming and Levenshtein distance

This is the algorithm in action: https://wizsearch.wizsoft.com/index.php/demo/

This demo searches all wiki titles, try the "show search terms" option to see the Levenshtein distance and error correction algorithm in action.

like image 1
O_Z Avatar answered Oct 13 '22 08:10

O_Z