Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

how to 'fuzzy' match strings when merge two dataframe in pandas

Tags:

python

pandas

I have two dataframe df1 and df2.

df1 = pd.DataFrame ({'Name': ['Adam Smith', 'Anne Kim', 'John Weber', 'Ian Ford'],
                     'Age': [43, 21, 55, 24]})
df2 = pd.DataFrame ({'Name': ['adam Smith', 'Annie Kim', 'John  Weber', 'Ian Ford'],
                     'gender': ['M', 'F', 'M', 'M']})

I need to join these two dataframe with pandas.merge on the column Name. However, as you notice, there are some slight difference between column Name from the two dataframe. Let's assume they are the same person. If I simply do:

pd.merge(df1, df2, how='inner', on='Name')

I only got a dataframe back with only one row, which is 'Ian Ford'.

Does anyone know how to merge these two dataframe ? I guess this is pretty common situation if we join two tables on a string column. I have absolutely no idea how to handle this. Thanks a lot in advance.

like image 913
zesla Avatar asked Mar 05 '18 22:03

zesla


1 Answers

I am using fuzzywuzzy here

from fuzzywuzzy import fuzz
from fuzzywuzzy import process



df2['key']=df2.Name.apply(lambda x : [process.extract(x, df1.Name, limit=1)][0][0][0])

df2.merge(df1,left_on='key',right_on='Name')
Out[1238]: 
        Name_x gender         key  Age      Name_y
0   adam Smith      M  Adam Smith   43  Adam Smith
1    Annie Kim      F    Anne Kim   21    Anne Kim
2  John  Weber      M  John Weber   55  John Weber
3     Ian Ford      M    Ian Ford   24    Ian Ford
like image 164
BENY Avatar answered Sep 18 '22 23:09

BENY