Match entities by fuzzy matching of multiple variables

I have a fuzzy string matching problem of multiple dimensions:

Assume I have a pandas dataframe which contains the variables "Company name", "Ticker" and "Country". A simplified subset may look like this:

import numpy as np
import pandas as pd

pd.DataFrame(columns=["Company name", "Ticker", "Country"],
             data=[["Vestas Wind Systems", "VWS.CO", "Denmark"],
                   ["Vestas", "VWS", "Denmark"],
                   ["Vestas Wind", "VWS", np.nan],
                   ["Amazon.com Inc", np.nan, "United States of America"],
                   ["AMAZONIA", "BAZA3 BZ", "Brazil"],
                   ["AMAZON.COM", "AMZN US", "United States"]])

Subset of dataset

In its entirety, the dataframe will contain several hundred thousand rows.

What I want is to identify the companies in the dataframe that are the same. In this case that means recognizing that rows 0, 1 and 2 are all different expressions of the company "Vestas Wind Systems", that rows 3 and 5 both represent "Amazon.com Inc", and that row 4 represents "Amazonia".

To increase the chance of correct matching, I assume that utilizing the information in all three columns is preferable.

However, all three columns need to be compared through fuzzy logic: the company name, the ticker and the country may each be written in different ways, e.g. "Vestas Wind Systems" versus "Vestas", or "United States of America" versus "United States".

An additional complexity is that both the Ticker and the Country column may contain NaN values (the Company name is never null).

QUESTION 1: What is the ideal approach for tackling this problem?


My current plan:

I would like to match companies by utilizing information across the three columns. The more similar the entities are across the columns, the higher the probability of a match. Furthermore, each column should carry a different weight: just because two companies are based in the US doesn't mean they are the same company. The Country column, for example, should therefore have a low weight.

So far I have tried running a fuzzy algorithm on each column separately to identify similar string representations. This yields results like the following, where the score represents the string similarity:

pd.DataFrame(columns = ["Company name 1", "Company name 2", "Score"], 
             data = [["vestas wind systems", "vestas wind", 0.9],
                     ["vestas wind", "vestas", 0.85],
                     ["amazon.com inc", "amazon.com", 0.84],
                     ["amazon.com", "amazonia", 0.79],
                     ["vestas wind systems", "vestas", 0.75],
                     ["amazon.com inc", "amazonia", 0.70], 
                     ["vestas", "amazonia", 0.4],
                     ["...", "...", "..."]])

Company name matching

pd.DataFrame(columns = ["Ticker 1", "Ticker 2", "Score"], 
             data = [["vws.co", "vws", 0.8],
                     ["baza3 bz", "amzn us", 0.6],
                     ["vws", "amzn us", 0.4],
                     ["vws.co", "amzn us", 0.35],
                     ["baza3 bz", "vws.co", 0.3],
                     ["baza3 bz", "vws", 0.28]])

Ticker matching

pd.DataFrame(columns = ["Country 1", "Country 2", "Score"], 
             data = [["united states", "united states of america", 0.8],
                     ["brazil", "denmark", 0.3],
                     ["brazil", "united states", 0.28],
                     ["brazil", "united states of america", 0.26],
                     ["denmark", "united states", 0.25],
                     ["denmark", "united states of america", 0.23]])

Country matching

NB: I realize that I should do some simple string cleaning with regexes before fuzzy matching, but for simplicity let's assume that I have already done this. Likewise, I have converted all strings to lowercase in the above results.
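Per-column scores like the ones above can be produced with any string-similarity function. A minimal sketch using only the standard library's difflib (the name list is the toy data from the question; a dedicated fuzzy-matching library would likely give better scores):

```python
from difflib import SequenceMatcher
from itertools import combinations

names = ["vestas wind systems", "vestas wind", "vestas",
         "amazon.com inc", "amazon.com", "amazonia"]

def similarity(a: str, b: str) -> float:
    # Ratio in [0, 1]; 1.0 means an exact match.
    return SequenceMatcher(None, a, b).ratio()

# Score every unique pair of names, highest first.
pairs = sorted(
    ((a, b, similarity(a, b)) for a, b in combinations(names, 2)),
    key=lambda t: t[2], reverse=True,
)
for a, b, score in pairs[:3]:
    print(f"{a!r} vs {b!r}: {score:.2f}")
```

The exact scores will differ from the tables above because every similarity metric scales differently, but the ranking of pairs is what matters for clustering.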

So now I have similarity scores across the different columns. I then want to use these similarities to identify which rows of the initial dataframe represent the same companies. As I mentioned earlier, I want to apply different weightings of the column similarities: Let's say I want to use the following weights:

weights = {"Company name" : 0.45, "Ticker" : 0.45, "Country" : 0.1}

That is, when comparing any two rows in the dataframe, their similarity score would be

similarity_score = 0.45 * Company Name similarity score + 0.45 * Ticker Name similarity score + 0.1 * Country similarity score

E.g. the similarity score of rows 0 and 1 is:

similarity_score_0_1 = 0.45 * 0.75 + 0.45 * 0.8 + 0.1 * 1.0 = 0.7975

This of course becomes a problem when some rows have null values for tickers and/or countries.
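One common way to handle the nulls is to drop the missing columns from the comparison and renormalize the remaining weights, so a NaN ticker neither helps nor hurts the match. A sketch of that idea (the `combined_score` helper and its dict-based interface are illustrative, not an established API):

```python
weights = {"Company name": 0.45, "Ticker": 0.45, "Country": 0.1}

def combined_score(col_scores: dict) -> float:
    """Weighted average of per-column similarity scores.

    col_scores maps column name -> similarity in [0, 1], or None when
    either row is missing a value for that column. Missing columns are
    dropped and the remaining weights renormalized.
    """
    usable = {c: s for c, s in col_scores.items() if s is not None}
    total_w = sum(weights[c] for c in usable)
    if total_w == 0:
        return 0.0
    return sum(weights[c] * s for c, s in usable.items()) / total_w

# Rows 0 and 1 from the question: 0.45*0.75 + 0.45*0.8 + 0.1*1.0 = 0.7975
score_0_1 = combined_score({"Company name": 0.75, "Ticker": 0.8, "Country": 1.0})

# With a missing ticker and country, only the name similarity counts.
score_name_only = combined_score({"Company name": 0.8, "Ticker": None, "Country": None})
```

Whether renormalizing is the right call is a judgment: it treats "unknown" as neutral, whereas you could instead penalize missing values if absence itself is informative.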

And finally, when I have several hundred thousand rows in the dataframe, computing similarity scores between all pairs of rows becomes very time-consuming.

QUESTION 2: How do I accomplish this in the most efficient way?

Asked by Emjora on Aug 10 '18



1 Answer

I would approach it the following way:

  1. Make sure the 'Country' column is clean. Do some exploration to detect cases such as 'USA' vs. 'United States' or 'Russia' vs. 'Russian Federation', and make sure every country is spelled in a consistent way.

  2. If your goal is to find identical companies, you can narrow your comparison space by only comparing a record to companies from the same country (given that you've done step 1). So you would only compare, e.g., a Danish company to other Danish companies, which will save you time. Records with missing countries will have to be compared against all records, though.

  3. Look into TF-IDF, a simple and efficient method used in information retrieval. I've worked on a very similar task and TF-IDF proved to be better than Levenshtein distance. Its advantage here is that it gives less weight to common tokens (inc., co., company, ltd, etc.), whereas fuzzy string matching will see "ltd" and think it's a good match (even though you might be comparing CocaCola Ltd. and Pepsi Ltd.). For TF-IDF you might consider concatenating all relevant columns when doing the comparison. I used the TfidfVectorizer from sklearn.
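A minimal sketch of step 3, assuming scikit-learn is available; the character n-gram range and the toy name list are illustrative choices, not part of the original answer:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

names = ["vestas wind systems", "vestas wind", "vestas",
         "amazon.com inc", "amazon.com", "amazonia"]

# Character n-grams make TF-IDF robust to word order and small spelling
# differences; rare n-grams get high weight, boilerplate like "inc" low.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
tfidf = vectorizer.fit_transform(names)

# Pairwise cosine similarities; sims[i, j] is in [0, 1].
sims = cosine_similarity(tfidf)
```

For several hundred thousand rows, avoid materializing the dense similarity matrix: keep the TF-IDF matrix sparse and compute `tfidf @ tfidf.T` blockwise (or use an approximate nearest-neighbour index), keeping only pairs above a threshold.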

Answered by nagini on Sep 17 '22