Say I have this text = I love apples, kiwis, oranges and bananas
and the searchString = kiwis and bananas
and a similarity algorithm say Jaccard index. How can I efficiently find the substring in text
which has the highest similarity to searchString
.
Basically I am trying to find portions of text (text has high errors, misspellings, extra symbols and spaces) which match a list of keywords I have.
Jaccard index is "lucky" similarity algorithm, because you can update it's value for new symbol without recalculating all previous stuff. So, you can view text
as a sequence of diffs for resulting index value. After that, problem can be reduced to https://en.wikipedia.org/wiki/Maximum_subarray_problem.
What about your second paragraph, if you are doing some NLP-like research, I'd suggest to clean your data (remove those extra symbols and spaces, whenever that's possible) before further processing. That's known as "spelling correction", and there are tons of different algorithms and libraries. To choose appropriate one, extra information about your domain is needed.
Take a look at shingling technique, and try to find the similarity. you can follow this link: http://nlp.stanford.edu/IR-book/html/htmledition/near-duplicates-and-shingling-1.html
For example use 9 shingle and compare each subset with your specific keyword
I Use Stemming and Levenshtein distance
This is the algorithm in action: https://wizsearch.wizsoft.com/index.php/demo/
This demo searches all wiki titles, try the "show search terms" option to see the Levenshtein distance and error correction algorithm in action.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With