
In Python, merge two DataFrames where the merge key of one is contained in the key of the other

I would like to merge two dataframes, df1 and df2, in order to compare two values, Info1 and Info2. The key to merge them on is hidden in the name columns. df1 is 'clean': it has a first name column and a last name column. df2, however, is tricky. It has only a single Name column, and the names can be given in different ways. The standard case is first and last name, but as the dummy data below shows, a cell can also contain two names separated by 'and' or '&', or even something totally different, like a school.

Here is the dummy data in code:

import pandas as pd

data1 = [['Anna','Tessmann',10], ['Ben','Fachmann',20], ['John','Smith',10]]
df1 = pd.DataFrame(data1, columns=['FirstName','LastName','Info1'])


data2 = [['Ben Fachmann',30], ['School AAA',40], ['John and Melissa Smith',50], ['Bob & Anna Tessmann',20]]
df2 = pd.DataFrame(data2, columns=['Name','Info2'])

Would anyone know an efficient way to merge these two? Is it possible to merge on something like 'df2.Name contains df1.LastName'? Alternatively, I was looking into parsing df2.Name. I found HumanName from the nameparser package, but I think it can't deal with 'and' and '&'.

I apologize if something is unclear. Thanks a lot for any help in advance!

Anna asked Sep 19 '25 20:09


2 Answers

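You can use a double substring merge: build one regex alternation from the first names in df1 and one from the last names, extract both from df2['Name'], then merge on the two extracted columns: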

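import re

# one alternation per column, e.g. 'Anna|Ben|John'; re.escape keeps names literal
pattern1 = '|'.join(map(re.escape, df1['FirstName']))
pattern2 = '|'.join(map(re.escape, df1['LastName']))

# one row per match, re-indexed by the df2 row the match came from
match1 = df2['Name'].str.extractall(f'(?P<FirstName>{pattern1})').droplevel(1)
match2 = df2['Name'].str.extractall(f'(?P<LastName>{pattern2})').droplevel(1)

# attach the extracted names to df2, then merge on both columns
out = df1.merge(df2.join(match1).join(match2),
                on=['FirstName', 'LastName'])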

Output:

  FirstName  LastName  Info1                    Name  Info2
0      Anna  Tessmann     10     Bob & Anna Tessmann     20
1       Ben  Fachmann     20            Ben Fachmann     30
2      John     Smith     10  John and Melissa Smith     50
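
One caveat with plain substring patterns is that a short name such as 'Ben' also matches inside a longer one such as 'Benjamin'. A possible refinement, only a sketch and not part of the answer as posted, is to wrap each alternation in `\b` word boundaries. The helper `name_pattern` and the extra 'Benjamin Fachmann' row are hypothetical, added purely for illustration:

```python
import re
import pandas as pd

df1 = pd.DataFrame(
    [['Anna', 'Tessmann', 10], ['Ben', 'Fachmann', 20], ['John', 'Smith', 10]],
    columns=['FirstName', 'LastName', 'Info1'],
)
# 'Benjamin Fachmann' is a made-up row, not in the original data
df2 = pd.DataFrame(
    [['Ben Fachmann', 30], ['School AAA', 40], ['John and Melissa Smith', 50],
     ['Bob & Anna Tessmann', 20], ['Benjamin Fachmann', 60]],
    columns=['Name', 'Info2'],
)

def name_pattern(values, group):
    """Build a whole-word alternation with a single named capture group."""
    alternation = '|'.join(map(re.escape, values))
    return rf'\b(?P<{group}>{alternation})\b'

# extractall names the output column after the capture group
first = df2['Name'].str.extractall(name_pattern(df1['FirstName'], 'FirstName')).droplevel(1)
last = df2['Name'].str.extractall(name_pattern(df1['LastName'], 'LastName')).droplevel(1)

# rows of df2 without a first-name or last-name match get NaN and drop out of the merge
out = df1.merge(df2.join(first).join(last), on=['FirstName', 'LastName'])
print(out)
```

With the word boundaries in place, 'Benjamin Fachmann' gets no FirstName match and is left out of the result; without them it would be merged with Ben Fachmann's row.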
mozway answered Sep 21 '25 11:09


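I think you need to add a column to df2 that records which row of df1 each name matches. After that, a regular merge works fine.

Here is something that works on the sample data. It may give wrong matches when names are not unique, for example when two people share the same first and last name or when one name is a substring of another.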

Also, there was a typo in your example data, which I fixed below ('Tessmann' was spelled 'Testmann').

import pandas as pd

data1 = [['Anna','Tessmann',10], ['Ben','Fachmann',20], ['John','Smith',10]]
df1 = pd.DataFrame(data1, columns=['FirstName','LastName','Info1'])


data2 = [['Ben Fachmann',30], ['School AAA',40], ['John and Melissa Smith',50], ['Bob & Anna Tessmann',20]]
df2 = pd.DataFrame(data2, columns=['Name','Info2'])

# record, for each df2 row, the index of the df1 row whose names it contains
df2['merge_index'] = None
for _ind, _row in enumerate(df1.to_dict(orient='records')):
    # regex=False treats the names as literal text rather than regex patterns
    has_first = df2['Name'].str.contains(_row['FirstName'], regex=False)
    has_last = df2['Name'].str.contains(_row['LastName'], regex=False)
    df2.loc[has_first & has_last, 'merge_index'] = _ind

# merge df1's index to the df2.merge_index column and select the columns to keep
merged = pd.merge(left=df1, right=df2, how='left',
                  left_index=True, right_on='merge_index')[['FirstName', 'LastName', 'Info1', 'Info2']]

Output (merged):

  FirstName  LastName  Info1  Info2
3      Anna  Tessmann     10     20
0       Ben  Fachmann     20     30
2      John     Smith     10     50
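
The same "Name contains both names" idea can also be written without the loop and the helper column. The following is a sketch of a different technique, a cross join followed by a filter, not the code from this answer. It assumes pandas 1.2 or later for `how='cross'`, and it builds len(df1) * len(df2) candidate rows, so it is only reasonable for small frames. The names `pairs` and `is_match` are made up for the example:

```python
import pandas as pd

df1 = pd.DataFrame(
    [['Anna', 'Tessmann', 10], ['Ben', 'Fachmann', 20], ['John', 'Smith', 10]],
    columns=['FirstName', 'LastName', 'Info1'],
)
df2 = pd.DataFrame(
    [['Ben Fachmann', 30], ['School AAA', 40], ['John and Melissa Smith', 50],
     ['Bob & Anna Tessmann', 20]],
    columns=['Name', 'Info2'],
)

# pair every df1 row with every df2 row
pairs = df1.merge(df2, how='cross')

# keep only the pairs where both names occur as whole words in Name
is_match = [
    first in name.split() and last in name.split()
    for first, last, name in zip(pairs['FirstName'], pairs['LastName'], pairs['Name'])
]
merged = pairs[is_match].reset_index(drop=True)
print(merged[['FirstName', 'LastName', 'Info1', 'Info2']])
```

Unlike the `how='left'` merge above, this keeps only df1 rows that found a match, so a person missing from df2 would be dropped rather than kept with a missing Info2.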
ak_slick answered Sep 21 '25 12:09