Trying to optimise fuzzy matching

I have 2,500,000 product names and I want to try and group them together, i.e. find products that have similar names. For example, I could have three products:

  • Heinz Baked Beans 400g;
  • Hz Bkd Beans 400g;
  • Heinz Beans 400g.

that are actually the same product and can be merged together.

My plan was to use an implementation of Jaro–Winkler distance to find matches. The process works as follows:

  • make a big list of all the product names in memory;
  • pick the first product on the list;
  • compare it to every product that comes after it in the list and calculate a "Jaro Score";
  • any products that have a high match (say 0.95f or higher) are reported;
  • move to the next product.

So this has some optimisation in that it only matches each product one way around, saving half the processing time.

I coded this up and tested it. It works perfectly and found dozens of matches to investigate.

It takes roughly 20 seconds to compare 1 product to 2,500,000 other products and calculate the "Jaro Score". Assuming my calculations are correct this means it will take the best part of one year to complete the processing.
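As a quick sanity check on that figure (a back-of-envelope sketch using the numbers from the question, nothing more):

```csharp
using System;

// One product takes ~20 s to compare against the rest of the list,
// and the one-way matching scheme halves the total number of passes.
double products = 2_500_000;
double secondsPerProduct = 20;
double totalDays = products * secondsPerProduct / 2 / 86_400;
Console.WriteLine($"{totalDays:0} days");  // ≈ 289 days
```

So even with the one-way optimisation, the naive all-pairs approach really is on the order of a year.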

Obviously this isn't practical.

I have had colleagues go over the code and they managed to make a 20% improvement to the speed of the Jaro Score calculation part. They made the process multithreaded and that has made it a little bit faster. We also removed some of the pieces of information being stored, reducing it to just a list of product names and unique identifiers; this didn't seem to make any difference to the processing time.
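The multithreaded version my colleagues produced could look roughly like this (a sketch only, with a stand-in similarity function in place of the real Jaro code; names here are illustrative):

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

// Placeholder similarity; the real code calls Jaro.GetJaro here.
static float Similarity(string a, string b) => a == b ? 1.0f : 0.0f;

var names = new List<string>
{
    "HEINZ BAKED BEANS 400G",
    "HZ BKD BEANS 400G",
    "HEINZ BAKED BEANS 400G",
};
const float target = 0.95f;

// Each outer index runs on the thread pool; ConcurrentBag makes the
// result collection safe to add to from multiple threads at once.
var matches = new ConcurrentBag<(int A, int B, float Score)>();
Parallel.For(0, names.Count, i =>
{
    for (int j = i + 1; j < names.Count; j++)
    {
        float score = Similarity(names[i], names[j]);
        if (score > target)
            matches.Add((i, j, score));
    }
});
```

Parallelising the outer loop like this scales with core count, but it only divides the run time by a small constant; it doesn't change the O(n²) number of comparisons.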

With these improvements we still think this will take a number of months to process and we need it to take hours (or a few days at most).

I don't want to go into too much detail as I don't think this is entirely relevant, but I load the product details into a list:

private class Product
{
    public int MemberId;
    public string MemberName;
    public int ProductId;
    public string ProductCode;
    public string ProductName;
}
private class ProductList : List<Product> { }
private readonly ProductList _pl = new ProductList();

Then I use the following to process each product:

//Outer loop (matchCount iterates over _pl)...
var match = _pl[matchCount];

for (int count = 1; count < _pl.Count; count++)
{
    var search = _pl[count];
    //Don't match products with themselves (redundant in a one-tailed match)
    if (search.MemberId == match.MemberId && search.ProductId == match.ProductId)
        continue;
    float jaro = Jaro.GetJaro(search.ProductName, match.ProductName);

    //We only log matches that pass the criteria
    if (jaro > target)
    {
        //Load the details into the grid
        var row = new string[7];
        row[0] = search.MemberName;
        row[1] = search.ProductCode;
        row[2] = search.ProductName;
        row[3] = match.MemberName;
        row[4] = match.ProductCode;
        row[5] = match.ProductName;
        row[6] = (jaro*100).ToString("#,##0.0000");
        JaroGrid.Rows.Add(row);
    }
}

I think for the purposes of this question we can assume that the Jaro.GetJaro method is a "black box", i.e. it doesn't really matter how it works as this part of the code has been optimised as much as possible and I can't think how it could be improved.

Any ideas for a better way to fuzzy match this list of products?

I am wondering if there is a "clever" way to pre-process the list to get most matches out at the start of the matching process. For example, if it takes 3 months to compare all products but only 3 days to compare "likely" products then we could live with this.
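One standard pre-processing trick from record linkage (a suggestion, not something in the question itself) is "blocking": only compare products that share at least one token, so wildly different names are never scored at all. A minimal sketch:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var names = new List<string>
{
    "HEINZ BAKED BEANS 400G",
    "HZ BKD BEANS 400G",
    "HEINZ BEANS 400G",
    "WALKERS CRISPS 25G",
};

// Build an index from token -> indices of products containing that token.
var buckets = new Dictionary<string, List<int>>();
for (int i = 0; i < names.Count; i++)
    foreach (var token in names[i].Split(' ', StringSplitOptions.RemoveEmptyEntries).Distinct())
    {
        if (!buckets.TryGetValue(token, out var list))
            buckets[token] = list = new List<int>();
        list.Add(i);
    }

// Candidate pairs are those sharing at least one token;
// everything else is never compared at all.
var candidates = new HashSet<(int, int)>();
foreach (var bucket in buckets.Values)
    for (int a = 0; a < bucket.Count; a++)
        for (int b = a + 1; b < bucket.Count; b++)
            candidates.Add((bucket[a], bucket[b]));
```

Here the three bean products pair up (via "BEANS" and "400G") while the crisps are never scored, cutting six pairs down to three. The caveat is that two names sharing no token at all (e.g. fully abbreviated ones) would be missed, so the choice of blocking key matters.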

Okay, two common things are coming up. Firstly, yes, I do take advantage of a single tailed matching process. The real code for this is:

for (int count = matchCount + 1; count < _pl.Count; count++)

I regret posting the amended version; I was trying to simplify this a little (bad idea).

Secondly, a lot of people want to see the Jaro code, so here goes (it is rather long and it wasn't originally mine - I might have even found it on here somewhere?). By the way I love the idea of exiting before completion as soon as a bad match is indicated. I will start looking at that now!

using System;
using System.Text;

namespace EPICFuzzyMatching
{
    public static class Jaro
    {
        private static string CleanString(string clean)
        {
            clean = clean.ToUpper();
            return clean;
        }

        //Gets the similarity of the two strings using Jaro distance
        //param string1 the first input string
        //param string2 the second input string
        //return a value between 0-1 of the similarity
        public static float GetJaro(String string1, String string2)
        {
            //Clean the strings, we do some tricks here to help matching
            string1 = CleanString(string1);
            string2 = CleanString(string2);

            //Get half the length of the string rounded up - (this is the distance used for acceptable transpositions)
            int halflen = ((Math.Min(string1.Length, string2.Length)) / 2) + ((Math.Min(string1.Length, string2.Length)) % 2);

            //Get common characters
            String common1 = GetCommonCharacters(string1, string2, halflen);
            String common2 = GetCommonCharacters(string2, string1, halflen);

            //Check for zero in common
            if (common1.Length == 0 || common2.Length == 0)
                return 0.0f;

            //Check the two common-character strings are the same length; return 0.0f if not
            if (common1.Length != common2.Length)
                return 0.0f;

            //Get the number of transpositions
            int transpositions = 0;
            int n = common1.Length;
            for (int i = 0; i < n; i++)
            {
                if (common1[i] != common2[i])
                    transpositions++;
            }
            transpositions /= 2;

            //Calculate jaro metric
            return (common1.Length / ((float)string1.Length) + common2.Length / ((float)string2.Length) + (common1.Length - transpositions) / ((float)common1.Length)) / 3.0f;
        }

        //Returns the characters from string1 that also appear in string2 within a given
        //distance (distanceSep) of their position in string1.
        //param string1
        //param string2
        //param distanceSep
        //return a string buffer of the common characters
        private static String GetCommonCharacters(String string1, String string2, int distanceSep)
        {
            //Create a return buffer of characters
            var returnCommons = new StringBuilder(string1.Length);

            //Create a copy of string2 for processing
            var copy = new StringBuilder(string2);

            //Iterate over string1
            int n = string1.Length;
            int m = string2.Length;
            for (int i = 0; i < n; i++)
            {
                char ch = string1[i];

                //Set boolean for quick loop exit if found
                bool foundIt = false;

                //Compare char with range of characters to either side
                for (int j = Math.Max(0, i - distanceSep); !foundIt && j < Math.Min(i + distanceSep, m); j++)
                {
                    //Check if found
                    if (copy[j] == ch)
                    {
                        foundIt = true;
                        //Append character found
                        returnCommons.Append(ch);
                        //Alter copied string2 for processing
                        copy[j] = (char)0;
                    }
                }
            }
            return returnCommons.ToString();
        }
    }
}
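On the early-exit idea: before calling GetJaro at all, a length-based upper bound can reject hopeless pairs almost for free (a sketch; the bound follows from the Jaro formula in the code above, since the common-character count can never exceed the shorter string's length and transpositions only lower the score):

```csharp
using System;

public static class JaroPrefilter
{
    // Upper bound on the Jaro score from lengths alone. With c common
    // characters, c <= min(|s1|, |s2|) and transpositions >= 0, so
    // jaro <= (c/|s1| + c/|s2| + 1) / 3 <= (2 + min/max) / 3.
    public static float UpperBound(string s1, string s2)
    {
        float min = Math.Min(s1.Length, s2.Length);
        float max = Math.Max(s1.Length, s2.Length);
        return max == 0 ? 1.0f : (2.0f + min / max) / 3.0f;
    }
}
```

In the inner loop, `if (JaroPrefilter.UpperBound(search.ProductName, match.ProductName) < target) continue;` skips the full calculation; at a target of 0.95 this rejects any pair whose lengths differ by more than roughly 15%, for the cost of two length lookups.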

Seeing as this question is still getting some views, I thought I would give a quick update on what happened:

  • I really wish I had originally posted the actual code I was using, as people are still telling me to halve my iterations (obviously not reading beyond the first paragraph or so);
  • I took some of the suggestions made here, and some that other people outside SO came up with, and got the run time down to around 70 hours;
  • the main improvement was preprocessing the data to only consider items that had a reasonably high number of sales attached to them. Not great, but it made the workload substantially smaller;
  • I had an issue with my laptop overheating, and so I ran most of the job with the laptop in my fridge over a weekend. In doing this I learned that fridges are NOT a good environment for a laptop (too damp), and my laptop died about a week later;
  • the net result was that I achieved what I set out to do, maybe not as comprehensively as I had hoped, but overall I would judge it as being a success;
  • why didn't I accept an answer? Well really none of the answers below entirely solved my initial problem, and although they mostly helped (some answers that came in years after I first posted this question certainly didn't help), I felt it was unfair to pick one as "THE ANSWER".
asked Sep 20 '13 by Richard Hansell



2 Answers

IMHO you should definitely post the GetJaro() code, as it is the part of your program that needs the time.

It involves string operations and is executed millions of times. Shaving even a fraction off the calculation time will do more for the overall run time than shaving microseconds off the list handling.

tl;dr: optimize the code that takes time, not the loop around it.

Edit: I have to put this into the answer: do not use floats, use integer types. They are a lot faster as they do not need the FPU. I also suggest normalising the input, as in ToUpper() or something, to make all items "similar" in their appearance.

answered Oct 02 '22 by damaltor


For a start it looks like "outer loop" is also looping over _pl, since you have a matchCount and then take a match out of it.

If I am correct in that, then your loop counter count should start at matchCount + 1, so that you don't test a vs b and then later test b vs a again. That would eliminate the need to check an item against itself at the top of the loop, and cut your number of iterations in half.

edit, another idea

Some people have said you should pre-process your match string so you aren't repeating operations like the ToUpper on it. If you do that, here's something else (possibly small) you can do.

Search your match string for double letters. If you find any, remove them from your match string, but mark that you did so (store a list of indexes where letters were doubled). Inside GetCommonCharacters, just add 1 to the loop end condition when comparing with the single remaining instance of that letter. Subsequent comparisons need to adjust for the missing letter as well. Specifically, make your loop go from i - distanceSep to i + distanceSep + 1 (keep the min and max checks of course).

Let's say your string1 contains "ee", with distanceSep of 1. Instead of 6 comparisons, you have 4, 33% savings. The savings is greater with higher distanceSep. If it were 2, you would drop from 10 to 6, 40% savings.

An example if that was confusing. string1 has "ee", string2 just has "abcd" so it won't match. distanceSep is 1. Instead of having to compare "e/a", "e/b", "e/c" ... and then "e/b", "e/c", "e/d", kill the second 'e' in string1 and only compare that e with all four letters.

answered Oct 02 '22 by Tesserex