
Performing a fuzzy contains check

I would like to check whether a keyword string is contained within a text string. This must be a fuzzy contains check.

My first attempt was to use the fuzzywuzzy library. Its partial ratio behaved unexpectedly, producing high match values even when the strings differed considerably.

I've tried using Levenshtein distance, which works for comparing one string to another but not for finding whether a string contains a keyword.
One idea I tried was to split the text into individual words and loop through them, calculating the distance to each one to see if there is a match. The problem is that the keyword may contain whitespace, in which case this method wouldn't find any matches.
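To make the limitation concrete, here is a minimal sketch of that word-by-word approach. The class name `WordWiseFuzzyMatch`, the helper `anyWordWithin`, and the maximum distance of 1 are assumptions chosen for illustration, and the Levenshtein routine is a plain textbook implementation rather than any particular library's.

```java
final class WordWiseFuzzyMatch {

    // Textbook dynamic-programming Levenshtein distance using two rolling rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Hypothetical helper: true if any single whitespace-separated word of the
    // text is within maxDistance edits of the keyword (case-insensitive).
    static boolean anyWordWithin(String text, String keyword, int maxDistance) {
        String needle = keyword.toLowerCase();
        for (String word : text.toLowerCase().split("\\s+")) {
            if (levenshtein(word, needle) <= maxDistance) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String text = "The Quick Brown Fox Jumps Over the Lazy Dog";
        System.out.println(anyWordWithin(text, "br0wn", 1));     // true: one substitution from "brown"
        System.out.println(anyWordWithin(text, "br0wn fox", 1)); // false: no single word is close to a two-word keyword
    }
}
```

The second call fails because the keyword spans two words, so every individual word is several edits away from it.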

I've now tried using a Bitap algorithm to find whether the keyword is in the text, but it returns true even when the keyword and text are very different. The algorithm can be found here.

final String keyword = "br0wn foxes very nice and hfhjdfgdfgdfgfvffdbdffgjfjfhjgjfdghfghghfg".toLowerCase();
final String text = "The Quick Brown Fox Jumps Over the Lazy Dog".toLowerCase();

final Bitap bitap = new Bitap(keyword, alphabet);   
bitap.within(text, 20);    // Returns true

I've looked into using Lucene. The problem is that much of it is built around creating an index from all the data and then performing the search. In my case this can't be done, as it needs to be a method that takes a keyword and a text separately. If there are any resources on performing a fuzzy contains with Lucene without indexing, they would be very useful.

What is the best approach for this?

Michael asked Jan 24 '18 11:01



1 Answer

I had the same problem a while ago. The requirement was that incoming texts containing URLs registered as blocked in the system should be detected and removed.

However, they wouldn't match 100%, because the incoming texts were produced by an OCR algorithm.

Let's say we have a blocked String "www.blockedwebsite.com" and an incoming String "I like the website www.blockdwebsite.com :)" (notice the missing 'e' in the URL). Calculating the Levenshtein distance between the two whole strings would give a large distance because of the 'I like the website ' prefix, so no match. (I use the LevenshteinDistance class from Apache Commons Text, org.apache.commons.text.similarity.LevenshteinDistance.)

What I did was iterate over the incoming String, taking at each index i the substring that starts at i and has the same length as the blocked String.

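    import org.apache.commons.text.similarity.LevenshteinDistance;

    LevenshteinDistance ld = LevenshteinDistance.getDefaultInstance();
    String incomingString = "I like the website www.blockdwebsite.com";
    String blockedString = "www.blockedwebsite.com";
    // Slide a window the length of the blocked string across the incoming string.
    // Use <= so the final window, ending at the last character, is also checked.
    for (int i = 0; i <= incomingString.length() - blockedString.length(); i++) {
        String substring = incomingString.substring(i, i + blockedString.length());
        int distance = ld.apply(substring, blockedString);
        if (distance < 5) {
            System.out.println("match found");
            break; // stop at the first match; neighbouring windows would match too
        }
    }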

When the distance drops below 5, a match has been detected. You can replace the fixed threshold with a relative one, such as a 90% match. I hope this helps. Good luck.
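A relative threshold could look something like the sketch below. It is only an illustration under a few assumptions: the class name `SlidingWindowFuzzyContains` and the method `fuzzyContains` are made up here, similarity is taken to mean (keyword length - distance) / keyword length, and a plain textbook Levenshtein routine stands in for the library call so the example runs without dependencies.

```java
final class SlidingWindowFuzzyContains {

    // Textbook dynamic-programming Levenshtein distance using two rolling rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Hypothetical helper: true if some keyword-length window of the text is at
    // least minPercent similar to the keyword. Integer arithmetic avoids
    // floating-point rounding at the threshold boundary.
    static boolean fuzzyContains(String text, String keyword, int minPercent) {
        int n = keyword.length();
        if (n == 0) {
            return true; // an empty keyword is trivially contained
        }
        // If the text is shorter than the keyword, compare the whole text once.
        int lastStart = Math.max(0, text.length() - n);
        for (int i = 0; i <= lastStart; i++) {
            String window = text.substring(i, Math.min(text.length(), i + n));
            int distance = levenshtein(window, keyword);
            if ((n - distance) * 100 >= minPercent * n) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String blocked = "www.blockedwebsite.com";
        // true: the best window is 2 edits away from the 22-character keyword (about 91%)
        System.out.println(fuzzyContains("I like the website www.blockdwebsite.com :)", blocked, 90));
        // false: the keyword from the question is far longer than the text
        System.out.println(fuzzyContains(
                "the quick brown fox jumps over the lazy dog",
                "br0wn foxes very nice and hfhjdfgdfgdfgfvffdbdffgjfjfhjgjfdghfghghfg", 90));
    }
}
```

Because the window always has the keyword's length, a missing or extra character costs two edits within one window (one to realign, one at the edge), so the percentage may need to be a little more lenient than the error rate you actually expect.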

Jellis Torfs answered Oct 17 '22 17:10
