Match entities by fuzzy matching of multiple variables

I have a fuzzy string matching problem of multiple dimensions:

Assume I have a pandas dataframe which contains the variables "Company name", "Ticker" and "Country". A simplified subset may look like this:

import numpy as np
import pandas as pd

pd.DataFrame(columns=["Company name", "Ticker", "Country"],
             data=[["Vestas Wind Systems", "VWS.CO", "Denmark"],
                   ["Vestas", "VWS", "Denmark"],
                   ["Vestas Wind", "VWS", np.nan],
                   ["Amazon.com Inc", np.nan, "United States of America"],
                   ["AMAZONIA", "BAZA3 BZ", "Brazil"],
                   ["AMAZON.COM", "AMZN US", "United States"]])

Subset of dataset

In its entirety, the dataframe will contain several hundred thousand rows.

What I want is to identify the companies in the dataframe that are the same. In this case that means recognizing that rows 0, 1 and 2 are all different expressions of the company "Vestas Wind Systems", that rows 3 and 5 both represent "Amazon.com Inc", and that row 4 represents "Amazonia".

To increase the chance of correct matching, I assume that utilizing the information in all three columns is preferable.

However, all three columns need to be compared through fuzzy logic: the company name, the ticker and the country may each be written in different ways, e.g. "Vestas Wind Systems" versus "Vestas", or "United States of America" versus "United States".

An additional complexity is that both the Ticker and the Country column may contain NaN values (the Company name is never null).

QUESTION 1: What is the ideal approach for tackling this problem?


My current plan:

I would like to match companies by utilizing information across the three columns. The more similar the entities are across the columns, the higher the probability of a match. Furthermore, each column should carry a different weight: just because two companies are based in the US doesn't mean they are the same company. The Country column, for example, should therefore have a low weight.

So far I have tried running a fuzzy algorithm on each column separately to identify similar string representations. This yields results like the following, where the score represents the string similarity:

pd.DataFrame(columns = ["Company name 1", "Company name 2", "Score"], 
             data = [["vestas wind systems", "vestas wind", 0.9],
                     ["vestas wind", "vestas", 0.85],
                     ["amazon.com inc", "amazon.com", 0.84],
                     ["amazon.com", "amazonia", 0.79],
                     ["vestas wind systems", "vestas", 0.75],
                     ["amazon.com inc", "amazonia", 0.70], 
                     ["vestas", "amazonia", 0.4],
                     ["...", "...", "..."]])

Company name matching

pd.DataFrame(columns = ["Ticker 1", "Ticker 2", "Score"], 
             data = [["vws.co", "vws", 0.8],
                     ["baza3 bz", "amzn us", 0.6],
                     ["vws", "amzn us", 0.4],
                     ["vws.co", "amzn us", 0.35],
                     ["baza3 bz", "vws.co", 0.3],
                     ["baza3 bz", "vws", 0.28]])

Ticker matching

pd.DataFrame(columns = ["Country 1", "Country 2", "Score"], 
             data = [["united states", "united states of america", 0.8],
                     ["brazil", "denmark", 0.3],
                     ["brazil", "united states", 0.28],
                     ["brazil", "united states of america", 0.26],
                     ["denmark", "united states", 0.25],
                     ["denmark", "united states of america", 0.23]])

Country matching

NB: I realize that I should do some simple string cleaning with regexes before fuzzy matching, but for simplicity let's assume that I have already done this. Likewise, I have converted all strings to lowercase in the above results.
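Per-column scores like the ones above can be produced with any string-similarity function. A minimal sketch using only the standard library's difflib (the name list is the toy data from the question; a dedicated fuzzy-matching library would likely give better scores):

```python
from difflib import SequenceMatcher
from itertools import combinations

names = ["vestas wind systems", "vestas wind", "vestas",
         "amazon.com inc", "amazon.com", "amazonia"]

def similarity(a: str, b: str) -> float:
    # Ratio in [0, 1]; 1.0 means an exact match.
    return SequenceMatcher(None, a, b).ratio()

# Score every unique pair of names, highest first.
pairs = sorted(
    ((a, b, similarity(a, b)) for a, b in combinations(names, 2)),
    key=lambda t: t[2], reverse=True,
)
for a, b, score in pairs[:3]:
    print(f"{a!r} vs {b!r}: {score:.2f}")
```

The exact scores will differ from the tables above because every similarity metric scales differently, but the ranking of pairs is what matters for clustering.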

So now I have similarity scores across the different columns. I then want to use these similarities to identify which rows of the initial dataframe represent the same companies. As I mentioned earlier, I want to apply different weightings of the column similarities: Let's say I want to use the following weights:

weights = {"Company name" : 0.45, "Ticker" : 0.45, "Country" : 0.1}

That is, when comparing any two rows in the dataframe, their similarity score would be

similarity_score = 0.45 * Company Name similarity score + 0.45 * Ticker Name similarity score + 0.1 * Country similarity score

E.g. the similarity score of rows 0 and 1 is:

similarity_score_0_1 = 0.45 * 0.75 + 0.45 * 0.8 + 0.1 * 1.0 = 0.7975

This of course becomes a problem when some rows have null values for tickers and/or countries.
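One common way to handle the nulls is to drop the missing columns from the comparison and renormalize the remaining weights, so a NaN ticker neither helps nor hurts the match. A sketch of that idea (the `combined_score` helper and its dict-based interface are illustrative, not an established API):

```python
weights = {"Company name": 0.45, "Ticker": 0.45, "Country": 0.1}

def combined_score(col_scores: dict) -> float:
    """Weighted average of per-column similarity scores.

    col_scores maps column name -> similarity in [0, 1], or None when
    either row is missing a value for that column. Missing columns are
    dropped and the remaining weights renormalized.
    """
    usable = {c: s for c, s in col_scores.items() if s is not None}
    total_w = sum(weights[c] for c in usable)
    if total_w == 0:
        return 0.0
    return sum(weights[c] * s for c, s in usable.items()) / total_w

# Rows 0 and 1 from the question: 0.45*0.75 + 0.45*0.8 + 0.1*1.0 = 0.7975
score_0_1 = combined_score({"Company name": 0.75, "Ticker": 0.8, "Country": 1.0})

# With a missing ticker and country, only the name similarity counts.
score_name_only = combined_score({"Company name": 0.8, "Ticker": None, "Country": None})
```

Whether renormalizing is the right call is a judgment: it treats "unknown" as neutral, whereas you could instead penalize missing values if absence itself is informative.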

And finally, when I have several hundred thousand rows in the dataframe, computing similarity scores between all pairs of rows becomes very time-consuming.

QUESTION 2: How do I accomplish this in the most efficient way?

Asked by Emjora on Aug 10 '18



1 Answer

I would approach it the following way:

  1. Make sure the 'Country' column is clean. Do some exploration to detect cases such as 'USA' vs. 'United States' or 'Russia' vs. 'Russian Federation', and make sure every country is spelled in a consistent way.

  2. If your goal is to find identical companies, you can narrow your comparison space by only comparing a record to companies from the same country (given that you've done step 1). So you would only compare, e.g., a Danish company to other Danish companies, which will save you time. Records with missing countries will have to be compared against all records, though.

  3. Look into TF-IDF, a simple and efficient method used in information retrieval. I've worked on a very similar task and TF-IDF proved to be better than Levenshtein distance. Its advantage here is that it gives less weight to common tokens (inc., co., company, ltd, etc.), whereas fuzzy string matching will see "ltd" and think it's a good match (even though you might be comparing CocaCola Ltd. and Pepsi Ltd.). For TF-IDF you might consider concatenating all relevant columns when doing the comparison. I used the TfidfVectorizer from sklearn.
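A minimal sketch of step 3, assuming scikit-learn is available; the character n-gram range and the toy name list are illustrative choices, not part of the original answer:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

names = ["vestas wind systems", "vestas wind", "vestas",
         "amazon.com inc", "amazon.com", "amazonia"]

# Character n-grams make TF-IDF robust to word order and small spelling
# differences; rare n-grams get high weight, boilerplate like "inc" low.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
tfidf = vectorizer.fit_transform(names)

# Pairwise cosine similarities; sims[i, j] is in [0, 1].
sims = cosine_similarity(tfidf)
```

For several hundred thousand rows, avoid materializing the dense similarity matrix: keep the TF-IDF matrix sparse and compute `tfidf @ tfidf.T` blockwise (or use an approximate nearest-neighbour index), keeping only pairs above a threshold.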

Answered by nagini on Sep 17 '22