 

Searching one Python dataframe / dictionary for fuzzy matches in another dataframe

I have the following pandas dataframe with 50,000 unique rows and 20 columns (included is a snippet of the relevant columns):

df1:

            PRODUCT_ID           PRODUCT_DESCRIPTION
0           165985858958         "Fish Burger with Lettuce"
1           185965653252         "Chicken Salad with Dressing"
2           165958565556         "Pork and Honey Rissoles"
3           655262522233         "Cheese, Ham and Tomato Sandwich"
4           857485966653         "Coleslaw with Yoghurt Dressing"
5           524156285551         "Lemon and Raspberry Cheesecake"

I also have the following dataframe (which I also have saved in dictionary form) which has 2 columns and 20,000 unique rows:

df2 (also saved as dict_2)

       PROD_ID   PROD_DESCRIPTION
0      548576    "Fish Burger"
1      156956    "Chckn Salad w/Ranch Dressing"
2      257848    "Rissoles - Lamb & Rosemary"
3      298770    "Lemn C-cake"
4      651452    "Potato Salad with Bacon"
5      100256    "Cheese Cake - Lemon Raspberry Coulis"

What I want to do is compare the "PRODUCT_DESCRIPTION" field in df1 to the "PROD_DESCRIPTION" field in df2 and find the closest match/matches to help with the heavy-lifting part. I would then need to check the matches manually, but it would be a lot quicker. The ideal outcome would look like this, e.g. with one or more partial matches noted:

     PRODUCT_ID      PRODUCT_DESCRIPTION               PROD_ID   PROD_DESCRIPTION
0    165985858958    "Fish Burger with Lettuce"        548576    "Fish Burger"
1    185965653252    "Chicken Salad with Dressing"     156956    "Chckn Salad w/Ranch Dressing"
2    165958565556    "Pork and Honey Rissoles"         257848    "Rissoles - Lamb & Rosemary"     
3    655262522233    "Cheese, Ham and Tomato Sandwich" NaN       NaN
4    857485966653    "Coleslaw with Yoghurt Dressing"  NaN       NaN
5    524156285551    "Lemon and Raspberry Cheesecake"  298770    "Lemn C-cake"
6    524156285551    "Lemon and Raspberry Cheesecake"  100256    "Cheese Cake - Lemon Raspberry Coulis"

I have already completed a join which has identified the exact matches. It's not important that the index is retained, as the product IDs in each df are unique. The results can also be saved into a new dataframe, as this will then be applied to a third dataframe that has around 14 million rows.

I've used the following questions and answers (amongst others):

Is it possible to do fuzzy match merge with python pandas
Fuzzy merge match with duplicates including trying jellyfish module as suggested in one of the answers
Python fuzzy matching fuzzywuzzy keep only the best match
Fuzzy match items in a column of an array

and also various loops, functions, mapping etc., but have had no success: I either get the first "fuzzy match", which has a low score, or no matches are detected.

I like the idea of a matching/distance score column being generated as per here as it would then allow me to speed up the manual checking process.

I'm using Python 2.7, pandas and have fuzzywuzzy installed.

asked Jan 04 '17 by gincard

People also ask

Is it possible to do fuzzy match merge with Python pandas?

Often you may want to join together two datasets in pandas based on imperfectly matching strings. This is called fuzzy matching. The easiest way to perform fuzzy matching in pandas is to use the get_close_matches() function from the difflib package.
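For what it's worth, a minimal sketch of that difflib route against the two dataframes above might look like this (the 0.6 cutoff is just difflib's default, not something from the question):

import difflib
import pandas as pd

def closest_match(desc, candidates, cutoff=0.6):
    # get_close_matches returns the nearest strings, or [] if none clear the cutoff
    matches = difflib.get_close_matches(desc, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

df1['PROD_DESCRIPTION'] = df1['PRODUCT_DESCRIPTION'].apply(
    closest_match, candidates=df2['PROD_DESCRIPTION'].tolist())

# an ordinary left merge then brings in the matching PROD_ID
merged = df1.merge(df2, on='PROD_DESCRIPTION', how='left')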

How do I find a DataFrame from a dictionary?

We can convert a dictionary to a pandas dataframe by using the pd.DataFrame.from_dict() class method.
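For example, with a tiny hypothetical dictionary keyed by product ID (the columns= argument needs a reasonably recent pandas):

import pandas as pd

dict_2 = {548576: "Fish Burger", 156956: "Chckn Salad w/Ranch Dressing"}

# orient='index' turns each key into a row; reset_index recovers the ID as a column
df2 = pd.DataFrame.from_dict(dict_2, orient='index', columns=['PROD_DESCRIPTION'])
df2 = df2.reset_index().rename(columns={'index': 'PROD_ID'})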

What is fuzzy logic matching?

Fuzzy matching (FM), also known as fuzzy logic, approximate string matching, fuzzy name matching, or fuzzy string matching, is an artificial intelligence and machine learning technique that identifies similar, but not identical, elements in data sets or tables.
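For instance, with the fuzzywuzzy package the question already uses, similar strings score high out of 100 while unrelated ones score low (the values in the comments are approximate):

from fuzzywuzzy import fuzz

print(fuzz.ratio("Fish Burger with Lettuce", "Fish Burger"))   # roughly 63
print(fuzz.ratio("Fish Burger with Lettuce", "Lemn C-cake"))   # far lower
# token-based variants are more forgiving of word order and abbreviations
print(fuzz.token_set_ratio("Chicken Salad with Dressing",
                           "Chckn Salad w/Ranch Dressing"))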


2 Answers

Using fuzz.ratio as my similarity metric, calculate the similarity matrix like this:

from fuzzywuzzy import fuzz
import numpy as np
import pandas as pd

# pairwise similarity matrix: one row per df1 product, one column per df2 product
df3 = pd.DataFrame(index=df1.index, columns=df2.index)

for i in df3.index:
    for j in df3.columns:
        vi = df1.at[i, 'PRODUCT_DESCRIPTION']
        vj = df2.at[j, 'PROD_DESCRIPTION']
        df3.at[i, j] = fuzz.ratio(vi, vj)

# cast to int so max/idxmax below work on a numeric dtype
df3 = df3.astype(int)
print(df3)

    0   1   2   3   4   5
0  63  15  24  23  34  27
1  26  84  19  21  52  32
2  18  31  33  12  35  34
3  10  31  35  10  41  42
4  29  52  32  10  42  12
5  15  28  21  49   8  55

Set a threshold for an acceptable similarity score. I set 50.
Find the index value (for df2) that has the maximum score in every row.

threshold = df3.max(1) > 50
idxmax = df3.idxmax(1)

Make assignments

df1['PROD_ID'] = np.where(threshold, df2.loc[idxmax, 'PROD_ID'].values, np.nan)
df1['PROD_DESCRIPTION'] = np.where(threshold, df2.loc[idxmax, 'PROD_DESCRIPTION'].values, np.nan)
df1

[Image: df1 with the matched PROD_ID and PROD_DESCRIPTION columns filled in]
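As a side note, the nested loop computes len(df1) * len(df2) ratios, which is 50,000 x 20,000 = 1 billion calls for the full data here. If that turns out to be too slow, one option (a sketch, assuming the same 50 cutoff as above) is to let fuzzywuzzy's process helper do the inner loop for each row:

from fuzzywuzzy import fuzz, process
import numpy as np

choices = df2['PROD_DESCRIPTION'].tolist()

def best_match(desc):
    # extractOne returns (match, score) or None when nothing reaches the cutoff
    result = process.extractOne(desc, choices, scorer=fuzz.ratio, score_cutoff=50)
    return result[0] if result else np.nan

df1['PROD_DESCRIPTION_MATCH'] = df1['PRODUCT_DESCRIPTION'].apply(best_match)

process.extract (rather than extractOne) would return the top few candidates per product instead, which matches the one-to-many rows in the desired output.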

answered Oct 13 '22 by piRSquared


You should be able to iterate over both dataframes and populate either a dict or a 3rd dataframe with your desired information:

d = {
    'df1_id': [],
    'df1_prod_desc': [],
    'df2_id': [],
    'df2_prod_desc': [],
    'fuzzywuzzy_sim': []
}
for _, df1_row in df1.iterrows():
    for _, df2_row in df2.iterrows():
        # append (rather than assign) so each compared pair adds a new row
        d['df1_id'].append(df1_row['PRODUCT_ID'])
        ...
df3 = pd.DataFrame.from_dict(d)
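Filled out, that sketch might look like the following (the fuzz.ratio scorer and the final similarity filter are assumptions on my part, not something this answer specifies):

from fuzzywuzzy import fuzz
import pandas as pd

d = {'df1_id': [], 'df1_prod_desc': [], 'df2_id': [], 'df2_prod_desc': [], 'fuzzywuzzy_sim': []}

for _, df1_row in df1.iterrows():
    for _, df2_row in df2.iterrows():
        d['df1_id'].append(df1_row['PRODUCT_ID'])
        d['df1_prod_desc'].append(df1_row['PRODUCT_DESCRIPTION'])
        d['df2_id'].append(df2_row['PROD_ID'])
        d['df2_prod_desc'].append(df2_row['PROD_DESCRIPTION'])
        d['fuzzywuzzy_sim'].append(fuzz.ratio(df1_row['PRODUCT_DESCRIPTION'],
                                              df2_row['PROD_DESCRIPTION']))

df3 = pd.DataFrame.from_dict(d)
# keep only plausible pairs, e.g. a similarity above 50, for manual review
df3 = df3[df3['fuzzywuzzy_sim'] > 50]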
answered Oct 13 '22 by Rick