I have two dataframe df1 and df2.
df1 = pd.DataFrame ({'Name': ['Adam Smith', 'Anne Kim', 'John Weber', 'Ian Ford'],
'Age': [43, 21, 55, 24]})
df2 = pd.DataFrame ({'Name': ['adam Smith', 'Annie Kim', 'John Weber', 'Ian Ford'],
'gender': ['M', 'F', 'M', 'M']})
I need to join these two dataframe with pandas.merge on the column Name. However, as you notice, there are some slight difference between column Name from the two dataframe. Let's assume they are the same person. If I simply do:
pd.merge(df1, df2, how='inner', on='Name')
I only got a dataframe back with only one row, which is 'Ian Ford'.
Does anyone know how to merge these two dataframe ? I guess this is pretty common situation if we join two tables on a string column. I have absolutely no idea how to handle this. Thanks a lot in advance.
I am using fuzzywuzzy here
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
df2['key']=df2.Name.apply(lambda x : [process.extract(x, df1.Name, limit=1)][0][0][0])
df2.merge(df1,left_on='key',right_on='Name')
Out[1238]:
Name_x gender key Age Name_y
0 adam Smith M Adam Smith 43 Adam Smith
1 Annie Kim F Anne Kim 21 Anne Kim
2 John Weber M John Weber 55 John Weber
3 Ian Ford M Ian Ford 24 Ian Ford
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With