I have two example dataframes as follows:
import pandas as pd
df1 = pd.DataFrame({'Name': {0: 'John', 1: 'Bob', 2: 'Shiela'},
                    'Degree': {0: 'Masters', 1: 'Graduate', 2: 'Graduate'},
                    'Age': {0: 27, 1: 23, 2: 21}})
df2 = pd.DataFrame({'Name': {0: 'John S.', 1: 'Bob K.', 2: 'Frank'},
                    'Degree': {0: 'Master', 1: 'Graduated', 2: 'Graduated'},
                    'GPA': {0: 3, 1: 3.5, 2: 4}})
I want to merge them on the two columns Name and Degree, using fuzzy matching to catch possible duplicates. This is what I have so far, with help from this reference:
Apply fuzzy matching across a dataframe column and save results in a new column
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
# every (df1 name, df2 name) pair as a Series of tuples
compare = pd.MultiIndex.from_product([df1['Name'],
                                      df2['Name']]).to_series()
def metrics(tup):
    # two similarity scores for one pair of names
    return pd.Series([fuzz.ratio(*tup),
                      fuzz.token_sort_ratio(*tup)],
                     ['ratio', 'token'])
compare.apply(metrics)
compare.apply(metrics).unstack().idxmax().unstack(0)
compare.apply(metrics).unstack(0).idxmax().unstack(0)
Let's say that if fuzz.ratio is higher than 80 for both Name and Degree, we consider the two rows to be the same person, and we take Name and Degree from df1 by default. How can I get the following expected result? Thanks.
df = df1.merge(df2, on = ['Name', 'Degree'], how = 'outer')
     Name     Degree   Age  GPA duplicatedName duplicatedDegree
0    John    Masters  27.0  3.0        John S.           Master
1     Bob   Graduate  23.0  3.5         Bob K.        Graduated
2  Shiela   Graduate  21.0  NaN            NaN        Graduated
3   Frank  Graduated   NaN  4.0            NaN         Graduate
I think the ratio threshold should be lower; 60 works for me. Create a Series with a dict comprehension, filter by N and keep the maximal value per group. Then use map with fillna, and finally merge:
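from fuzzywuzzy import fuzz
from fuzzywuzzy import process
from itertools import product
N = 60
# score every (df1 name, df2 name) pair
names = {tup: fuzz.ratio(*tup) for tup in
         product(df1['Name'].tolist(), df2['Name'].tolist())}
s1 = pd.Series(names)
s1 = s1[s1 > N]
# keep the best-scoring df2 candidate for each df1 name
s1 = s1[s1.groupby(level=0).idxmax()]
# turn the pairs into a df2 name -> df1 name lookup, as map needs below
s1 = pd.Series(s1.index.get_level_values(0), index=s1.index.get_level_values(1))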
print (s1)
John S.    John
Bob K.      Bob
dtype: object
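# same steps for the Degree column
degrees = {tup: fuzz.ratio(*tup) for tup in
           product(df1['Degree'].tolist(), df2['Degree'].tolist())}
s2 = pd.Series(degrees)
s2 = s2[s2 > N]
s2 = s2[s2.groupby(level=0).idxmax()]
# df2 degree -> df1 degree lookup, as map needs below
s2 = pd.Series(s2.index.get_level_values(0), index=s2.index.get_level_values(1))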
print (s2)
Graduated    Graduate
Master        Masters
dtype: object
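As a sanity check on the threshold, the scores behind these matches can be inspected directly. The sketch below is only an approximation: it uses the standard library's difflib.SequenceMatcher as a stand-in for fuzz.ratio so that it runs without extra packages, and ratio and same_person are hypothetical helper names, not part of fuzzywuzzy. Scores may differ slightly from fuzzywuzzy when python-Levenshtein is installed.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    """Approximate fuzz.ratio (assumption): difflib similarity scaled to 0-100."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def same_person(row1, row2, threshold):
    """Hypothetical helper: True when both Name and Degree score above threshold."""
    return (ratio(row1[0], row2[0]) > threshold
            and ratio(row1[1], row2[1]) > threshold)


# (Name, Degree) for the John rows of df1 and df2
john_1 = ('John', 'Masters')
john_2 = ('John S.', 'Master')

name_score = ratio('John', 'John S.')
degree_score = ratio('Masters', 'Master')
print(name_score, degree_score)

# With the cut-off of 80 from the question the name score is too low,
# while the lower cut-off of 60 accepts the pair.
strict = same_person(john_1, john_2, threshold=80)
relaxed = same_person(john_1, john_2, threshold=60)
print(strict, relaxed)
```

Under this stand-in the name pair scores in the low 70s, which is why a cut-off of 80 would reject John / John S. even though the degrees match closely.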
df2['Name'] = df2['Name'].map(s1).fillna(df2['Name'])
df2['Degree'] = df2['Degree'].map(s2).fillna(df2['Degree'])
#generally slower alternative
#df2['Name'] = df2['Name'].replace(s1)
#df2['Degree'] = df2['Degree'].replace(s2)
df = df1.merge(df2, on = ['Name', 'Degree'], how = 'outer')
print (df)
     Name    Degree   Age  GPA
0    John   Masters  27.0  3.0
1     Bob  Graduate  23.0  3.5
2  Shiela  Graduate  21.0  NaN
3   Frank  Graduate   NaN  4.0
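The approach above matches Name and Degree independently of each other. If the rule should be applied per row (both columns of the same pair of rows must pass), and the original df2 values should be kept in duplicatedName and duplicatedDegree as in the question, one possible sketch is a cross join followed by a filter. This is an illustration under assumptions, not a definitive implementation: ratio is a hypothetical difflib-based stand-in for fuzz.ratio, the (Name, Degree) pairs are assumed to be unique within each frame, and how='cross' needs pandas 1.2 or later.

```python
from difflib import SequenceMatcher

import pandas as pd


def ratio(a, b):
    """Stand-in for fuzz.ratio (assumption): difflib similarity scaled to 0-100."""
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


df1 = pd.DataFrame({'Name': ['John', 'Bob', 'Shiela'],
                    'Degree': ['Masters', 'Graduate', 'Graduate'],
                    'Age': [27, 23, 21]})
df2 = pd.DataFrame({'Name': ['John S.', 'Bob K.', 'Frank'],
                    'Degree': ['Master', 'Graduated', 'Graduated'],
                    'GPA': [3, 3.5, 4]})

N = 60  # same threshold as in the answer

# Pair every df1 row with every df2 row; df2's clashing columns get the _2 suffix.
pairs = df1.merge(df2, how='cross', suffixes=('', '_2'))
pairs['name_score'] = [ratio(a, b) for a, b in zip(pairs['Name'], pairs['Name_2'])]
pairs['degree_score'] = [ratio(a, b) for a, b in zip(pairs['Degree'], pairs['Degree_2'])]

# Keep only pairs where BOTH columns pass, then the best candidate per df1 row.
matched = pairs[(pairs['name_score'] > N) & (pairs['degree_score'] > N)]
matched = (matched.sort_values('name_score', ascending=False)
                  .drop_duplicates(['Name', 'Degree'])
                  .rename(columns={'Name_2': 'duplicatedName',
                                   'Degree_2': 'duplicatedDegree'}))

# df1 rows with their matched df2 data (NaN where nothing matched).
cols = ['Name', 'Degree', 'GPA', 'duplicatedName', 'duplicatedDegree']
left = df1.merge(matched[cols], on=['Name', 'Degree'], how='left')

# df2 rows that were not matched to any df1 row are appended unchanged.
used = set(zip(matched['duplicatedName'], matched['duplicatedDegree']))
mask = [key not in used for key in zip(df2['Name'], df2['Degree'])]
result = pd.concat([left, df2.loc[mask]], ignore_index=True)
print(result)
```

Unlike the map-based version, unmatched df2 rows keep their original spelling here (Frank stays Graduated), because a value is only replaced when the whole row has been matched.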