I would like to merge two dataframes, df1 and df2, in order to compare two values, Info1 and Info2. The merge key is hidden in the name columns. df1 is 'clean': it has a first-name column and a last-name column. df2, however, is tricky: it has only a single name column, and the names can be written in different ways. The standard case is first name followed by last name, but as the dummy data below shows, an entry can also contain two first names joined by 'and' or '&', or be something else entirely, such as a school.
Here is the dummy data in code:
import pandas as pd
data1 = [['Anna','Tessmann',10], ['Ben','Fachmann',20], ['John','Smith',10]]
df1 = pd.DataFrame(data1, columns=['FirstName','LastName','Info1'])
data2 = [['Ben Fachmann',30], ['School AAA',40], ['John and Melissa Smith',50], ['Bob & Anna Tessmann',20]]
df2 = pd.DataFrame(data2, columns=['Name','Info2'])
Does anyone know an efficient way to merge these two? Is it possible to merge on something like 'df2.Name contains df1.LastName'? Alternatively, I looked into parsing df2.Name; I found HumanName from the nameparser package, but I don't think it can handle 'and' and '&'.
I apologize if something is unclear. Thanks a lot for any help in advance!
You can use a double substring merge:
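import re
# build one alternation pattern per key column, escaping regex metacharacters
pattern1 = '|'.join(map(re.escape, df1['FirstName']))
pattern2 = '|'.join(map(re.escape, df1['LastName']))
# extract every known first/last name found in df2['Name'], keeping df2's index
match1 = df2['Name'].str.extractall(f'(?P<FirstName>{pattern1})').droplevel(1)
match2 = df2['Name'].str.extractall(f'(?P<LastName>{pattern2})').droplevel(1)
# attach the extracted names to df2, then merge on both keys
out = df1.merge(df2.join(match1).join(match2),
                on=['FirstName', 'LastName'])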
Output:
  FirstName  LastName  Info1                    Name  Info2
0      Anna  Tessmann     10     Bob & Anna Tessmann     20
1       Ben  Fachmann     20            Ben Fachmann     30
2      John     Smith     10  John and Melissa Smith     50
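One caveat with plain substring patterns is that a short name such as 'Ben' would also match inside a longer one such as 'Benjamin'. A minimal sketch of a variant that wraps the alternation in `\b` word boundaries is shown here; the helper name `word_pattern` and the extra 'Benjamin Fachmann' row are hypothetical and only serve to illustrate the difference.

```python
import re
import pandas as pd

df1 = pd.DataFrame(
    [['Anna', 'Tessmann', 10], ['Ben', 'Fachmann', 20], ['John', 'Smith', 10]],
    columns=['FirstName', 'LastName', 'Info1'],
)
df2 = pd.DataFrame(
    [['Ben Fachmann', 30], ['School AAA', 40],
     ['John and Melissa Smith', 50], ['Bob & Anna Tessmann', 20],
     ['Benjamin Fachmann', 60]],  # hypothetical row, not in the original data
    columns=['Name', 'Info2'],
)

def word_pattern(names, group):
    """Build a named-group regex that matches any of `names` as a whole word."""
    alternatives = '|'.join(map(re.escape, names))
    return rf'\b(?P<{group}>{alternatives})\b'

# Extract whole-word matches only, keeping df2's original index.
first = df2['Name'].str.extractall(
    word_pattern(df1['FirstName'], 'FirstName')).droplevel(1)
last = df2['Name'].str.extractall(
    word_pattern(df1['LastName'], 'LastName')).droplevel(1)

# 'Benjamin Fachmann' gets no FirstName match, so it drops out of the merge.
out_bounded = df1.merge(df2.join(first).join(last),
                        on=['FirstName', 'LastName'])
```

With the word boundaries in place, the 'Benjamin Fachmann' row is not paired with 'Ben Fachmann'; without them it would be.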
I think you need to make a column that can match names. Then it will work fine.
Here is something that works for the example. It may not work in every case, because it relies on each first-name/last-name combination matching only one entry in df2.
There was also a typo in your example data, which I fixed below (tessmann was written as testmann).
import pandas as pd
data1 = [['Anna','Tessmann',10], ['Ben','Fachmann',20], ['John','Smith',10]]
df1 = pd.DataFrame(data1, columns=['FirstName','LastName','Info1'])
data2 = [['Ben Fachmann',30], ['School AAA',40], ['John and Melissa Smith',50], ['Bob & Anna Tessmann',20]]
df2 = pd.DataFrame(data2, columns=['Name','Info2'])
# make a column to identify which indices in df1 match to df2
df2['merge_index'] = None
for _ind, _row in enumerate(df1.to_dict(orient='records')):
    # flag df2 rows whose Name contains both the first and the last name of this df1 row
    df2.loc[df2.Name.str.contains(_row['FirstName']) & df2.Name.str.contains(_row['LastName']), 'merge_index'] = _ind
# merge df1 index to df2.merge_index column and select columns to keep
merged = pd.merge(left=df1, right=df2, how='left', left_index=True, right_on='merge_index')[['FirstName', 'LastName', 'Info1', 'Info2']]
Output (merged):
  FirstName  LastName  Info1  Info2
3      Anna  Tessmann     10     20
0       Ben  Fachmann     20     30
2      John     Smith     10     50
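For the 'df2.Name contains df1.LastName' idea from the question, one possible alternative without an explicit loop is a cross join followed by a filter. This is only a sketch: it assumes pandas 1.2 or later (for `how='cross'`) and data small enough that pairing every df1 row with every df2 row is affordable. The helper `names_in` is a hypothetical name introduced here.

```python
import pandas as pd

df1 = pd.DataFrame(
    [['Anna', 'Tessmann', 10], ['Ben', 'Fachmann', 20], ['John', 'Smith', 10]],
    columns=['FirstName', 'LastName', 'Info1'],
)
df2 = pd.DataFrame(
    [['Ben Fachmann', 30], ['School AAA', 40],
     ['John and Melissa Smith', 50], ['Bob & Anna Tessmann', 20]],
    columns=['Name', 'Info2'],
)

# Pair every row of df1 with every row of df2 (len(df1) * len(df2) rows).
pairs = df1.merge(df2, how='cross')

def names_in(row):
    """True if both the first and the last name appear as whole tokens in Name."""
    tokens = row['Name'].split()
    return row['FirstName'] in tokens and row['LastName'] in tokens

# Keep only the pairs where the name columns actually agree.
matched = pairs[pairs.apply(names_in, axis=1)].reset_index(drop=True)
```

Because the comparison is done on whitespace-separated tokens, partial matches inside longer names are not counted, at the cost of building the full cross product first.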