Trying to optimise fuzzy matching

I have 2,500,000 product names and I want to try and group them together, i.e. find products that have similar names. For example, I could have three products:

  • Heinz Baked Beans 400g;
  • Hz Bkd Beans 400g;
  • Heinz Beans 400g.

that are actually the same product and can be merged together.

My plan was to use an implementation of Jaro–Winkler distance to find matches. The process works as follows:

  • make a big list of all the product names in memory;
  • pick the first product on the list;
  • compare it to every product that comes after it in the list and calculate a "Jaro Score";
  • any products that have a high match (say 0.95f or higher) are reported;
  • move to the next product.

So this has some optimisation in that it only matches each product one way around, saving half the processing time.

I coded this up and tested it. It works perfectly and found dozens of matches to investigate.

It takes roughly 20 seconds to compare 1 product to 2,500,000 other products and calculate the "Jaro Score". Assuming my calculations are correct this means it will take the best part of one year to complete the processing.
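As a quick sanity check on that figure (a back-of-envelope sketch using the numbers from the question, nothing more):

```csharp
using System;

// One product takes ~20 s to compare against the rest of the list,
// and the one-way matching scheme halves the total number of passes.
double products = 2_500_000;
double secondsPerProduct = 20;
double totalDays = products * secondsPerProduct / 2 / 86_400;
Console.WriteLine($"{totalDays:0} days");  // ≈ 289 days
```

So even with the one-way optimisation, the naive all-pairs approach really is on the order of a year.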

Obviously this isn't practical.

I have had colleagues go over the code and they managed to make a 20% improvement to the speed of the Jaro Score calculation part. They made the process multithreaded and that has made it a little bit faster. We also removed some of the pieces of information being stored, reducing it to just a list of product names and unique identifiers; this didn't seem to make any difference to the processing time.
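The multithreaded version my colleagues produced could look roughly like this (a sketch only, with a stand-in similarity function in place of the real Jaro code; names here are illustrative):

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

// Placeholder similarity; the real code calls Jaro.GetJaro here.
static float Similarity(string a, string b) => a == b ? 1.0f : 0.0f;

var names = new List<string>
{
    "HEINZ BAKED BEANS 400G",
    "HZ BKD BEANS 400G",
    "HEINZ BAKED BEANS 400G",
};
const float target = 0.95f;

// Each outer index runs on the thread pool; ConcurrentBag makes the
// result collection safe to add to from multiple threads at once.
var matches = new ConcurrentBag<(int A, int B, float Score)>();
Parallel.For(0, names.Count, i =>
{
    for (int j = i + 1; j < names.Count; j++)
    {
        float score = Similarity(names[i], names[j]);
        if (score > target)
            matches.Add((i, j, score));
    }
});
```

Parallelising the outer loop like this scales with core count, but it only divides the run time by a small constant; it doesn't change the O(n²) number of comparisons.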

With these improvements we still think this will take a number of months to process and we need it to take hours (or a few days at most).

I don't want to go into too much detail as I don't think this is entirely relevant, but I load the product details into a list:

private class Product
{
    public int MemberId;
    public string MemberName;
    public int ProductId;
    public string ProductCode;
    public string ProductName;
}
private class ProductList : List<Product> { }
private readonly ProductList _pl = new ProductList();

Then I use the following to process each product:

//Outer loop (matchCount iterates over _pl)...
var match = _pl[matchCount];

for (int count = 1; count < _pl.Count; count++)
{
    var search = _pl[count];
    //Don't match products with themselves (redundant in a one-tailed match)
    if (search.MemberId == match.MemberId && search.ProductId == match.ProductId)
        continue;
    float jaro = Jaro.GetJaro(search.ProductName, match.ProductName);

    //We only log matches that pass the criteria
    if (jaro > target)
    {
        //Load the details into the grid
        var row = new string[7];
        row[0] = search.MemberName;
        row[1] = search.ProductCode;
        row[2] = search.ProductName;
        row[3] = match.MemberName;
        row[4] = match.ProductCode;
        row[5] = match.ProductName;
        row[6] = (jaro*100).ToString("#,##0.0000");
        JaroGrid.Rows.Add(row);
    }
}

I think for the purposes of this question we can assume that the Jaro.GetJaro method is a "black box", i.e. it doesn't really matter how it works as this part of the code has been optimised as much as possible and I can't think how it could be improved.

Any ideas for a better way to fuzzy match this list of products?

I am wondering if there is a "clever" way to pre-process the list to get most matches out at the start of the matching process. For example, if it takes 3 months to compare all products but only 3 days to compare "likely" products then we could live with this.
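One standard pre-processing trick from record linkage (a suggestion, not something in the question itself) is "blocking": only compare products that share at least one token, so wildly different names are never scored at all. A minimal sketch:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var names = new List<string>
{
    "HEINZ BAKED BEANS 400G",
    "HZ BKD BEANS 400G",
    "HEINZ BEANS 400G",
    "WALKERS CRISPS 25G",
};

// Build an index from token -> indices of products containing that token.
var buckets = new Dictionary<string, List<int>>();
for (int i = 0; i < names.Count; i++)
    foreach (var token in names[i].Split(' ', StringSplitOptions.RemoveEmptyEntries).Distinct())
    {
        if (!buckets.TryGetValue(token, out var list))
            buckets[token] = list = new List<int>();
        list.Add(i);
    }

// Candidate pairs are those sharing at least one token;
// everything else is never compared at all.
var candidates = new HashSet<(int, int)>();
foreach (var bucket in buckets.Values)
    for (int a = 0; a < bucket.Count; a++)
        for (int b = a + 1; b < bucket.Count; b++)
            candidates.Add((bucket[a], bucket[b]));
```

Here the three bean products pair up (via "BEANS" and "400G") while the crisps are never scored, cutting six pairs down to three. The caveat is that two names sharing no token at all (e.g. fully abbreviated ones) would be missed, so the choice of blocking key matters.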

Okay, two common things are coming up. Firstly, yes, I do take advantage of a single tailed matching process. The real code for this is:

for (int count = matchCount + 1; count < _pl.Count; count++)

I regret posting the amended version; I was trying to simplify this a little (bad idea).

Secondly, a lot of people want to see the Jaro code, so here goes (it is rather long and it wasn't originally mine - I might have even found it on here somewhere?). By the way I love the idea of exiting before completion as soon as a bad match is indicated. I will start looking at that now!

using System;
using System.Text;

namespace EPICFuzzyMatching
{
    public static class Jaro
    {
        private static string CleanString(string clean)
        {
            clean = clean.ToUpper();
            return clean;
        }

        //Gets the similarity of the two strings using Jaro distance
        //param string1 the first input string
        //param string2 the second input string
        //return a value between 0-1 of the similarity
        public static float GetJaro(String string1, String string2)
        {
            //Clean the strings, we do some tricks here to help matching
            string1 = CleanString(string1);
            string2 = CleanString(string2);

            //Get half the length of the string rounded up - (this is the distance used for acceptable transpositions)
            int halflen = ((Math.Min(string1.Length, string2.Length)) / 2) + ((Math.Min(string1.Length, string2.Length)) % 2);

            //Get common characters
            String common1 = GetCommonCharacters(string1, string2, halflen);
            String common2 = GetCommonCharacters(string2, string1, halflen);

            //Check for zero in common
            if (common1.Length == 0 || common2.Length == 0)
                return 0.0f;

            //Check the two common-character strings are the same length; return 0.0f if not
            if (common1.Length != common2.Length)
                return 0.0f;

            //Get the number of transpositions
            int transpositions = 0;
            int n = common1.Length;
            for (int i = 0; i < n; i++)
            {
                if (common1[i] != common2[i])
                    transpositions++;
            }
            transpositions /= 2;

            //Calculate jaro metric
            return (common1.Length / ((float)string1.Length) + common2.Length / ((float)string2.Length) + (common1.Length - transpositions) / ((float)common1.Length)) / 3.0f;
        }

        //Returns the characters from string1 that also appear in string2 within a given
        //distance (distanceSep) of their position in string1.
        //param string1
        //param string2
        //param distanceSep
        //return a string buffer of the common characters
        private static String GetCommonCharacters(String string1, String string2, int distanceSep)
        {
            //Create a return buffer of characters
            var returnCommons = new StringBuilder(string1.Length);

            //Create a copy of string2 for processing
            var copy = new StringBuilder(string2);

            //Iterate over string1
            int n = string1.Length;
            int m = string2.Length;
            for (int i = 0; i < n; i++)
            {
                char ch = string1[i];

                //Set boolean for quick loop exit if found
                bool foundIt = false;

                //Compare char with range of characters to either side
                for (int j = Math.Max(0, i - distanceSep); !foundIt && j < Math.Min(i + distanceSep, m); j++)
                {
                    //Check if found
                    if (copy[j] == ch)
                    {
                        foundIt = true;
                        //Append character found
                        returnCommons.Append(ch);
                        //Alter copied string2 for processing
                        copy[j] = (char)0;
                    }
                }
            }
            return returnCommons.ToString();
        }
    }
}
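On the early-exit idea: before calling GetJaro at all, a length-based upper bound can reject hopeless pairs almost for free (a sketch; the bound follows from the Jaro formula in the code above, since the common-character count can never exceed the shorter string's length and transpositions only lower the score):

```csharp
using System;

public static class JaroPrefilter
{
    // Upper bound on the Jaro score from lengths alone. With c common
    // characters, c <= min(|s1|, |s2|) and transpositions >= 0, so
    // jaro <= (c/|s1| + c/|s2| + 1) / 3 <= (2 + min/max) / 3.
    public static float UpperBound(string s1, string s2)
    {
        float min = Math.Min(s1.Length, s2.Length);
        float max = Math.Max(s1.Length, s2.Length);
        return max == 0 ? 1.0f : (2.0f + min / max) / 3.0f;
    }
}
```

In the inner loop, `if (JaroPrefilter.UpperBound(search.ProductName, match.ProductName) < target) continue;` skips the full calculation; at a target of 0.95 this rejects any pair whose lengths differ by more than roughly 15%, for the cost of two length lookups.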

Seeing as this question is still getting some views, I thought I would give a quick update on what happened:

  • I really wish I had originally posted the actual code I was using, as people are still telling me to halve my iterations (obviously not reading beyond the first paragraph or so);
  • I took some of the suggestions made here, and some that other people outside SO came up with, and got the run time down to around 70 hours;
  • the main improvement was preprocessing the data to only consider items that had a reasonably high number of sales attached to them. Not great, but it made the workload substantially smaller;
  • I had an issue with my laptop overheating, and so I ran most of the job with the laptop in my fridge over a weekend. In doing this I learned that fridges are NOT a good environment for a laptop (too damp), and my laptop died about a week later;
  • the net result was that I achieved what I set out to do, maybe not as comprehensively as I had hoped, but overall I would judge it as being a success;
  • why didn't I accept an answer? Well really none of the answers below entirely solved my initial problem, and although they mostly helped (some answers that came in years after I first posted this question certainly didn't help), I felt it was unfair to pick one as "THE ANSWER".
asked Sep 20 '13 by Richard Hansell



2 Answers

IMHO you should definitely post the GetJaro() code, as it is the part of your program that needs the time.

It involves string operations and is executed millions of times. Shaving even a fraction off the calculation time will do more for the overall run time than shaving microseconds off the list handling.

tl;dr: optimize the code that takes time, not the loop around it.

Edit: I have to put this into the answer: do not use floats, use integer types. They are a lot faster as they do not need the FPU. I also suggest normalising the input, as in ToUpper() or something, to make all items "similar" in their appearance.

answered Oct 02 '22 by damaltor


For a start it looks like "outer loop" is also looping over _pl, since you have a matchCount and then take a match out of it.

If I am correct in that, then your loop counter count should start at matchCount + 1, so that you don't test a vs b and then later test b vs a again. That would eliminate the need to check an item against itself at the top of the loop, and cut your number of iterations in half.

edit, another idea

Some people have said you should pre-process your match string so you aren't repeating operations like the ToUpper on it. If you do that, here's something else (possibly small) you can do.

Search your match string for double letters. If you find any, remove them from your match string, but mark that you did so (store a list of indexes where letters were doubled). Inside GetCommonCharacters, just add 1 to the loop end condition when comparing with the single remaining instance of that letter. Subsequent comparisons need to adjust for the missing letter as well. Specifically, make your loop go from i - distanceSep to i + distanceSep + 1 (keep the min and max checks of course).

Let's say your string1 contains "ee", with distanceSep of 1. Instead of 6 comparisons, you have 4, 33% savings. The savings is greater with higher distanceSep. If it were 2, you would drop from 10 to 6, 40% savings.

An example if that was confusing. string1 has "ee", string2 just has "abcd" so it won't match. distanceSep is 1. Instead of having to compare "e/a", "e/b", "e/c" ... and then "e/b", "e/c", "e/d", kill the second 'e' in string1 and only compare that e with all four letters.

answered Oct 02 '22 by Tesserex