Background I have 2 data frames which has no common key to which I can merge them. Both df have a column that contains "entity name". One df contains 8000+ entities and the other close to 2000 entities. Sample Data: <pre class="prettyprint"><code>vendor_df= Name of Vendor City State ZIP FREDDIE LEES AMERICAN GOURMET SAUCE St. Louis MO 63101 CITYARCHRIVER 2015 FOUNDATION St. Louis MO 63102 GLAXOSMITHKLINE CONSUMER HEALTHCARE St. Louis MO 63102 LACKEY SHEET METAL St. Louis MO 63102 regulator_df = Name of Entity Committies LACKEY SHEET METAL Private PRIMUS STERILIZER COMPANY LLC Private HELGET GAS PRODUCTS INC Autonomous ORTHOQUEST LLC Governmant </code></pre> Problem Stmt: I have to fuzzy match the entities of these two(<code>Name of vendor</code> & <code>Name of Entity</code>) columns and get a score. So, need to know if 1st value of dataframe 1(<code>vendor_df</code>) is matching with any of the 2000 entities of dataframe2(regulator_df). StackOverflow Links I checked: <blockquote> fuzzy match between 2 columns (Python) create new column in dataframe using fuzzywuzzy Apply fuzzy matching across a dataframe column and save results in a new column </blockquote> Code <pre class="prettyprint"><code>import pandas as pd from fuzzywuzzy import fuzz from fuzzywuzzy import process vendor_df = pd.read_excel('C:\\Users\\40101584\\Desktop\\AUS CUB AML\\Vendors_Sheet.xlsx', sheet_name=0) regulator_df = pd.read_excel('C:\\Users\\40101584\\Desktop\\AUS CUB AML\\Regulated_Vendors_Sheet.xlsx', sheet_name=0) compare = pd.MultiIndex.from_product([vendor_df['Name of vendor'], regulator_df['Name of Entity']]).to_series() def metrics(tup): return pd.Series([fuzz.ratio(*tup), fuzz.token_sort_ratio(*tup)], ['ratio', 'token']) #compare.apply(metrics) -- Either this works or the below line result = compare.apply(metrics).unstack().idxmax().unstack(0) </code></pre> Problems with Above Code: The code works if the two dataframes are small but it is taking forever when I give the complete dataset. Above code is taken from 3rd link. Any solution if the same thing can work fast or can work with large dataset? UPDATE 1 Can the above code be made faster if we pass or hard-code a score say 80 which will filter series/dataframe only with fuzzyscore > 80 ?

in my case also i need to look for only above 80. i modified your code as per my use case.hope it helps. <pre class="prettyprint"><code>compare = compare.apply(metrics) compare_80=compare[(compare['ratio'] >80) & (compare['token'] >80)] </code></pre>

Fuzzy Match columns of Different Dataframe

Background

I have 2 data frames which has no common key to which I can merge them. Both df have a column that contains "entity name". One df contains 8000+ entities and the other close to 2000 entities.

Sample Data:

vendor_df=
     Name of Vendor                             City         State  ZIP
     FREDDIE LEES AMERICAN GOURMET SAUCE       St. Louis    MO     63101
     CITYARCHRIVER 2015 FOUNDATION             St. Louis    MO     63102
     GLAXOSMITHKLINE CONSUMER HEALTHCARE       St. Louis    MO     63102
     LACKEY SHEET METAL                        St. Louis    MO     63102

regulator_df = 
     Name of Entity                    Committies
     LACKEY SHEET METAL                 Private
     PRIMUS STERILIZER COMPANY LLC      Private  
     HELGET GAS PRODUCTS INC            Autonomous
     ORTHOQUEST LLC                     Governmant

Problem Stmt:

I have to fuzzy match the entities of these two(Name of vendor & Name of Entity) columns and get a score. So, need to know if 1st value of dataframe 1(vendor_df) is matching with any of the 2000 entities of dataframe2(regulator_df).

StackOverflow Links I checked:

fuzzy match between 2 columns (Python)

create new column in dataframe using fuzzywuzzy

Apply fuzzy matching across a dataframe column and save results in a new column

Code

import pandas as pd
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

vendor_df = pd.read_excel('C:\\Users\\40101584\\Desktop\\AUS CUB AML\\Vendors_Sheet.xlsx', sheet_name=0)

regulator_df = pd.read_excel('C:\\Users\\40101584\\Desktop\\AUS CUB AML\\Regulated_Vendors_Sheet.xlsx', sheet_name=0)

compare = pd.MultiIndex.from_product([vendor_df['Name of vendor'],
                                      regulator_df['Name of Entity']]).to_series()


def metrics(tup):
    return pd.Series([fuzz.ratio(*tup),
                      fuzz.token_sort_ratio(*tup)],
                     ['ratio', 'token'])

#compare.apply(metrics) -- Either this works or the below line

result = compare.apply(metrics).unstack().idxmax().unstack(0)

Problems with Above Code:

The code works if the two dataframes are small but it is taking forever when I give the complete dataset. Above code is taken from 3rd link.

Any solution if the same thing can work fast or can work with large dataset?

UPDATE 1

Can the above code be made faster if we pass or hard-code a score say 80 which will filter series/dataframe only with fuzzyscore > 80 ?

How to use fuzzy string matching for column names in MySQL?

The given name, surname, address columns will probably have typos and inconsistencies, so we will use fuzzy string matching for them: For fuzzy string matching, we will use .string method. The parameters for column names are the same.

What is the use of fuzzy matching in SQL Server?

Fuzzy row matching helps to remove duplicates and introduces consistency to your data. With that goal in mind, let me introduce you to recordlinkage package. It provides all the tools needed for record linkage and deduplication.

What is fuzzy string matching in Excel?

Fuzzy string matching or searching is a process of approximating strings that match a particular pattern. It is a very popular add on in Excel. It gives an approximate match and there is no guarantee that the string can be exact, however, sometimes the string accurately matches the pattern.

How to match only the closest Dataframe between two DataFrames?

Now we will use the process.extractOne () method to match only the closest between the two dataframes. Inside this method we will apply different fuzzy matching functions which are as follows:

Below solution is faster than what I posted but if someone has a more faster approach please tell:

matched_vendors = []

for row in vendor_df.index:
    vendor_name = vendor_df.get_value(row,"Name of vendor")
    for columns in regulator_df.index:
        regulated_vendor_name=regulator_df.get_value(columns,"Name of Entity")
        matched_token=fuzz.partial_ratio(vendor_name,regulated_vendor_name)
        if matched_token> 80:
            matched_vendors.append([vendor_name,regulated_vendor_name,matched_token])

I've implemented the code in Python with parallel processing, which will be much faster than serial computation. Furthermore, where a fuzzy metric score exceeds a threshold, only those computations are performed in parallel. Please see the link below for the code:

https://github.com/ankitcoder123/Important-Python-Codes/blob/main/Faster%20Fuzzy%20Match%20between%20two%20columns/Fuzzy_match.py

Vesrion Compatibility:

pandas version :: 1.1.5 ,
numpy vesrion:: 1.19.5,
fuzzywuzzy version :: 1.1.0 ,
joblib version :: 0.18.0

Fuzzywuzzy metric explanation: link text

Output from code: enter image description here

in my case also i need to look for only above 80. i modified your code as per my use case.hope it helps.

compare = compare.apply(metrics)
compare_80=compare[(compare['ratio'] >80) & (compare['token'] >80)]

Fuzzy Match columns of Different Dataframe

Tags:

python

pandas

fuzzy-comparison

fuzzywuzzy

fuzzy-logic

Rahul Agarwal

People also ask

3 Answers

Rahul Agarwal

Ankit Kamboj

Sathish Kothandam

Recent Activity

Donate For Us

Fuzzy Match columns of Different Dataframe

Tags:

python

pandas

fuzzy-comparison

fuzzywuzzy

fuzzy-logic

Rahul Agarwal

People also ask

3 Answers

Rahul Agarwal

Ankit Kamboj

Sathish Kothandam

Related questions

Recent Activity

Donate For Us