Find substring in text which has the highest similarity to a given keyword

Question

Say I have this text = I love apples, kiwis, oranges and bananas and the searchString = kiwis and bananas and a similarity algorithm say Jaccard index. How can I efficiently find the substring in text which has the highest similarity to searchString.

Basically I am trying to find portions of text (text has high errors, misspellings, extra symbols and spaces) which match a list of keywords I have.

dveim · Accepted Answer

Jaccard index is "lucky" similarity algorithm, because you can update it's value for new symbol without recalculating all previous stuff. So, you can view text as a sequence of diffs for resulting index value. After that, problem can be reduced to https://en.wikipedia.org/wiki/Maximum_subarray_problem.

What about your second paragraph, if you are doing some NLP-like research, I'd suggest to clean your data (remove those extra symbols and spaces, whenever that's possible) before further processing. That's known as "spelling correction", and there are tons of different algorithms and libraries. To choose appropriate one, extra information about your domain is needed.

Masoud · Answer

Take a look at shingling technique, and try to find the similarity. you can follow this link: http://nlp.stanford.edu/IR-book/html/htmledition/near-duplicates-and-shingling-1.html

For example use 9 shingle and compare each subset with your specific keyword

O_Z · Answer

I Use Stemming and Levenshtein distance

This is the algorithm in action: https://wizsearch.wizsoft.com/index.php/demo/

This demo searches all wiki titles, try the "show search terms" option to see the Levenshtein distance and error correction algorithm in action.

Find substring in text which has the highest similarity to a given keyword

Tags:

text

machine-learning

data-mining

string-algorithm

pathikrit

3 Answers

dveim

Masoud

O_Z

Recent Activity

Donate For Us

Find substring in text which has the highest similarity to a given keyword

Tags:

text

machine-learning

data-mining

string-algorithm

pathikrit

3 Answers

dveim

Masoud

O_Z

Related questions

Recent Activity

Donate For Us