Let's say that I have an MDM system (Master Data Management), whose primary application is to detect and prevent duplication of records.
Every time a sales rep enters a new customer in the system, my MDM platform performs a check against existing records: it computes the Levenshtein, Jaccard, or XYZ distance between pairs of words, phrases, or attributes, applies weights and coefficients, outputs a similarity score, and so on.
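To make the setup concrete, here is a minimal sketch of that kind of weighted scoring using only Python's standard library. The attribute names and weights are invented for illustration, and `difflib`'s `ratio` stands in for a normalized Levenshtein similarity (it is a related but not identical measure):

```python
from difflib import SequenceMatcher

def edit_similarity(a: str, b: str) -> float:
    """Normalized string similarity in [0, 1] via difflib
    (a stand-in for a true normalized Levenshtein ratio)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def jaccard(a: str, b: str) -> float:
    """Token-set overlap in [0, 1]."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

# Hypothetical per-attribute weights, hand-tuned for the dataset.
WEIGHTS = {"name": 0.5, "city": 0.2, "email": 0.3}

def similarity(rec_a: dict, rec_b: dict) -> float:
    """Weighted combination of per-attribute similarities."""
    return sum(w * edit_similarity(rec_a[f], rec_b[f])
               for f, w in WEIGHTS.items())

a = {"name": "Acme Corporation", "city": "Berlin", "email": "sales@acme.com"}
b = {"name": "ACME Corp.", "city": "Berlin", "email": "sales@acme.com"}
score = similarity(a, b)  # a single similarity score for the pair
```

A record pair is then flagged as a potential duplicate whenever `score` exceeds some threshold, which is exactly the knob (along with the weights) that has to be tuned by hand.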
Your typical fuzzy matching scenario.
I would like to know whether it makes sense at all to apply machine learning techniques to optimize the matching output, i.e. to find duplicates with maximum accuracy, and where exactly it makes the most sense.
There's also this excellent answer on the topic, but I didn't quite get whether the author actually made use of ML or not.
Also, my understanding is that weighted fuzzy matching is already a good-enough solution, probably even from a financial perspective: whenever you deploy such an MDM system you have to do some analysis and preprocessing anyway, whether that means manually encoding the matching rules or training an ML algorithm.
So I'm not sure that the addition of ML would represent a significant value proposition.
Any thoughts are appreciated.
If you have historical, manually tagged examples of duplicate and non-duplicate pairs, you can train a machine learning algorithm on their fuzzy matching scores to identify which record pairs are most likely to be duplicates and which are not. Once trained, the model will predict whether or not a pair of customer records is truly a duplicate.
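As a sketch of that idea: assuming the per-attribute similarity scores have already been computed for historically reviewed pairs, a small logistic-regression classifier can learn the attribute weighting instead of you tuning it by hand. The classifier is hand-rolled here to stay dependency-free, and the training data is made up:

```python
import math

def train_logreg(X, y, lr=0.5, epochs=2000):
    """Fit a tiny logistic-regression classifier by gradient descent.
    X: feature vectors (per-attribute fuzzy similarity scores),
    y: 1 if the pair was confirmed a duplicate, 0 otherwise."""
    n = len(X[0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))   # predicted duplicate probability
            err = p - yi                     # gradient of the log loss
            for j in range(n):
                w[j] -= lr * err * xi[j]
            b -= lr * err
    return w, b

def predict(w, b, x):
    """Probability that the pair described by x is a duplicate."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

# Toy labeled history: [name_sim, address_sim] -> 1 = duplicate, 0 = distinct.
X = [[0.95, 0.90], [0.88, 0.97], [0.92, 0.35],
     [0.30, 0.40], [0.20, 0.10], [0.15, 0.85]]
y = [1, 1, 1, 0, 0, 0]

w, b = train_logreg(X, y)
p_dup = predict(w, b, [0.91, 0.88])  # high similarity on both attributes
p_new = predict(w, b, [0.25, 0.20])  # low similarity on both attributes
```

The learned weights play the same role as the hand-tuned coefficients in the fuzzy matcher; the difference is that they are fit to the labeled history rather than guessed.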
One legacy artificial intelligence technique in this area is fuzzy logic. Traditional, classical logic categorizes information into binary patterns such as yes/no, true/false, or day/night. Fuzzy logic instead focuses on characterizing the space between these black-or-white scenarios.
Fuzzy matching is a technique used in computer-assisted translation and, more generally, a special case of record linkage: it finds correspondences between segments of a text and entries in a database of previous translations, working with matches that may be less than 100% perfect.
The main advantage of using machine learning is the time it saves.
It is very likely that, given enough time, you could hand tune weights and come up with matching rules that are very good for your particular dataset. A machine learning approach could have a hard time outperforming your hand made system customized for a particular dataset.
However, building a good matching system by hand will probably take days. If you use an existing ML-based matching tool, like Dedupe, then good weights and rules can be learned in an hour (including setup time).
So, if you have already built a matching system that is performing well on your data, it may not be worth investigating ML. But, if this is a new data project, then it almost certainly will be.