Fuzzy record matching with multiple columns of information

Tags:

I have a question that is somewhat high level, so I'll try to be as specific as possible.

I'm doing a lot of research that involves combining disparate data sets with header information that refers to the same entity, usually a company or a financial security. This record linking usually involves header information in which the name is the only common primary identifier, but where some secondary information is often available (such as city and state, dates of operation, relative size, etc). These matches are usually one-to-many, but may be one-to-one or even many-to-many. I have usually done this matching by hand or with very basic text comparison of cleaned substrings. I have occasionally used a simple matching algorithm like a Levenshtein distance measure, but I never got much out of it, in part because I didn't have a good formal way of applying it.

My guess is that this is a fairly common question and that there must be some formalized processes that have been developed to do this type of thing. I've read a few academic papers on the subject that deal with theoretical appropriateness of given approaches, but I haven't found any good source that walks through a recipe or at least a practical framework.

My question is the following:

Does anyone know of a good source for implementing multi-dimensional fuzzy record matching, like a book or a website or a published article or working paper?
I'd prefer something that had practical examples and a well defined approach.
The approach could be iterative, with human checks for improvement at intermediate stages.
(edit) The linked data is used for statistical analysis. As such, a little bit of noise is OK, but there is a strong preference for fewer "incorrect matches" over fewer "incorrect non-matches".
If they were in Python that would be fantastic, but not necessary.

One last thing, if it matters, is that I don't care much about computational efficiency. I'm not implementing this dynamically and I'm usually dealing with a few thousand records.

447

asked Mar 08 '11 19:03

WildGunman

1 Answers

One common method that shouldn't be terribly expensive for "a few thousand records" would be cosine similarity. Although most often used for comparing text documents, you can easily modify it to work with any kind of data.

The linked Wikipedia article is pretty sparse on details, but following links and doing a few searches will get you some good info. Potentially an implementation that you can modify to fit your purposes. In fact, take a look at Simple implementation of N-Gram, tf-idf and Cosine similarity in Python

A simpler calculation, and one that might be "good enough" for your purposes would be a Jaccard index. The primary difference is that typically cosine similarity takes into account the number of times a word is used in a document and in the entire set of documents, whereas the Jaccard index only cares that a particular word is in the document. There are other differences, but that one strikes me as the most important.

answered Nov 15 '22 10:11

Jim Mischel

Related questions
                            
                                Optimizing a vector image by removing unnecessary points and stacking shapes
                            
                                Structure/algorithm for solving game with overlapping cards
                            
                                Do you have genetic algorithm in production?
                            
                                Teleporting Traveler, Optimal Profit over time Problem
                            
                                How would you go about tackling this exercise?
                            
                                How to determine the best case and worst case of an program(algorithm)?
                            
                                Is there "good" PRNG generating values without hidden state?
                            
                                Required Working Precision for the BBP Algorithm?
                            
                                Algorithm for activation key- Security
                            
                                Trilateration in a 2D plane with signal strengths
                            
                                Finding The Max of sum of elements in matrix in distinct rows and columns
                            
                                What is the concatenation complexity of balanced ropes?
                            
                                What's the name of this algorithm/routine?
                            
                                Can I calculate an element without looping through all preceding elements in my case (see the question body)?
                            
                                Generating synthetic social networks?
                            
                                Circle Separation Distance - Nearest Neighbor Problem
                            
                                Data structure and algorithm for representing/allocating free space in a file
                            
                                Error with optimality in Iterative Deepening Depth First Search algorithm
                            
                                Resource placement (optimal strategy)
                            
                                My naive maximal clique finding algorithm runs faster than Bron-Kerbosch's. What's wrong?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Fuzzy record matching with multiple columns of information

Tags:

string-matching

algorithm

theory

fuzzy-comparison

record-linkage

WildGunman

People also ask

1 Answers

Jim Mischel

Recent Activity

Donate For Us