Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fuzzy record matching with multiple columns of information

I have a question that is somewhat high level, so I'll try to be as specific as possible.

I'm doing a lot of research that involves combining disparate data sets with header information that refers to the same entity, usually a company or a financial security. This record linking usually involves header information in which the name is the only common primary identifier, but where some secondary information is often available (such as city and state, dates of operation, relative size, etc). These matches are usually one-to-many, but may be one-to-one or even many-to-many. I have usually done this matching by hand or with very basic text comparison of cleaned substrings. I have occasionally used a simple matching algorithm like a Levenshtein distance measure, but I never got much out of it, in part because I didn't have a good formal way of applying it.

My guess is that this is a fairly common question and that there must be some formalized processes that have been developed to do this type of thing. I've read a few academic papers on the subject that deal with theoretical appropriateness of given approaches, but I haven't found any good source that walks through a recipe or at least a practical framework.

My question is the following:

  • Does anyone know of a good source for implementing multi-dimensional fuzzy record matching, like a book or a website or a published article or working paper?

  • I'd prefer something that had practical examples and a well defined approach.

  • The approach could be iterative, with human checks for improvement at intermediate stages.

  • (edit) The linked data is used for statistical analysis. As such, a little bit of noise is OK, but there is a strong preference for fewer "incorrect matches" over fewer "incorrect non-matches".

  • If they were in Python that would be fantastic, but not necessary.

One last thing, if it matters, is that I don't care much about computational efficiency. I'm not implementing this dynamically and I'm usually dealing with a few thousand records.

like image 447
WildGunman Avatar asked Mar 08 '11 19:03

WildGunman


People also ask

What is fuzzy matching example?

Fuzzy Matching (also called Approximate String Matching) is a technique that helps identify two elements of text, strings, or entries that are approximately similar but are not exactly the same. For example, let's take the case of hotels listing in New York as shown by Expedia and Priceline in the graphic below.

What is fuzzy data matching?

Fuzzy matching uses natural language processing (a form of AI) to find errors, and the uniqueness of a word is used as an anchor to correct other elements. Fuzzy matching takes place after two letters have been typed when an exact match has not been found.

What is the difference between exact matching and fuzzy matching?

You can use the exact matching method for almost any field, including custom fields. The fuzzy matching methods look for strings that approximately match a pattern. Some fuzzy matching methods, such as Acronym and Name Variant, identify similarities using hard-coded dictionaries.

What is fuzzy Wuzzy matching?

FuzzyWuzzy is a library of Python which is used for string matching. Fuzzy string matching is the process of finding strings that match a given pattern. Basically it uses Levenshtein Distance to calculate the differences between sequences.


1 Answers

One common method that shouldn't be terribly expensive for "a few thousand records" would be cosine similarity. Although most often used for comparing text documents, you can easily modify it to work with any kind of data.

The linked Wikipedia article is pretty sparse on details, but following links and doing a few searches will get you some good info. Potentially an implementation that you can modify to fit your purposes. In fact, take a look at Simple implementation of N-Gram, tf-idf and Cosine similarity in Python

A simpler calculation, and one that might be "good enough" for your purposes would be a Jaccard index. The primary difference is that typically cosine similarity takes into account the number of times a word is used in a document and in the entire set of documents, whereas the Jaccard index only cares that a particular word is in the document. There are other differences, but that one strikes me as the most important.

like image 52
Jim Mischel Avatar answered Nov 15 '22 10:11

Jim Mischel