I have 2 dataframes of restaurant information to merge.
df1 = pd.DataFrame ({'Restaurant_Name': ['Apple', 'Banana', 'Orange', 'apple','apple1'],
'Postal Code': [12345, 12345, 54321, 54321,1111]})
df2 = pd.DataFrame ({'Restaurant_Name': ['apple', 'apple', 'Banana'],
'Postal Code': [12345, 54321, 12345],
'Phone':[100,200,300]})
d1:

df2:
Ideally I want to produce a table that looks like this:

I tried to match the restaurant names based on fuzzy matching followed by a match of postal code, but was not able to get a very accurate result. I also tried to concatenate the restaurant name with postal code for each of the dataframe and do a fuzzy matching of the concatenated result but I don't think this is the best way.
Is there any way to achieve 100% accuracy in matching the two dataframes?
Check difflib.get_close_matches().
I tried this using your sample dataframe. Does it help?
import pandas as pd
import difflib
df1 = pd.DataFrame ({'Restaurant_Name': ['Apple', 'Banana', 'Orange', 'apple','apple1'],
'Postal Code': [12345, 12345, 54321, 54321,1111]})
df2 = pd.DataFrame ({'Restaurant_Name': ['apple', 'apple', 'Banana'],
'Postal Code': [12345, 54321, 12345],
'Phone':[100,200,300]})
df1['key'] = df1['Restaurant_Name']+df1['Postal Code'].astype(str)
df2['key'] = df2['Restaurant_Name']+df2['Postal Code'].astype(str)
df2['key'] = df2['key'].apply(lambda x: difflib.get_close_matches(x, df1['key'])[0])
df1.merge(df2, on='key', how='outer')[['Restaurant_Name_x','Restaurant_Name_y','Postal Code_x','Phone']]
Output:
Restaurant_Name_x Restaurant_Name_y Postal Code_x Phone
0 Apple apple 12345 100.0
1 Banana Banana 12345 300.0
2 Orange NaN 54321 NaN
3 apple apple 54321 200.0
4 apple1 NaN 1111 NaN
As you said, I did concatenate the restaurant name with postal code to get a unique combination.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With