Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

pandas dataframe merge based on str.contains

I have two dataframe that I would like to merge based on if column value from df2 contains column value from df1. I've been trying to use str.contains and series.isin. But no luck so far. Example below.

df1

       Domain              Visits
         aaa                  1 
         bbb                  3
         ddd                  5

df2

       Domain                Ads
         aaa.com              2 
         bbb                  4
         c.com                3
         e.net                6

Merged Dataframe should be like this

mergeDF

       Domain              Visits       Ads
         aaa.com              1          2
         bbb                  3          4
         c.com                           3
         ddd                  5          
         e.net                           6

Thx in advance.

like image 375
SwagZ Avatar asked Sep 20 '25 05:09

SwagZ


2 Answers

Setup

We can start by creating a new series which is the row each domain in df2 corresponds to in df1, if such a row exists:

r = '({})'.format('|'.join(df1.Domain))
merge_df = df2.Domain.str.extract(r, expand=False).fillna(df2.Domain)

Our merge_df now looks like this:

0      aaa
1      bbb
2    c.com
3    e.net
Name: Domain, dtype: object

Now we merge. We want an outer merge here, merging our first DataFrame on the Domain column, and our second DataFrame on our merge_df series:

df1.merge(df2.drop('Domain', 1), left_on='Domain', right_on=merge_df, how='outer')

 Domain  Visits  Ads
0    aaa     1.0  2.0
1    bbb     3.0  4.0
2    ddd     5.0  NaN
3  c.com     NaN  3.0
4  e.net     NaN  6.0

Since we stored a separate series to merge on, and didn't modify df2, we are done after this step, and you have your merged DataFrame.

like image 169
user3483203 Avatar answered Sep 22 '25 19:09

user3483203


Try and preprocess df2, to get domain in the same format as df1:

df2['key'] = df2['Domain'].str.extract('(\w+)\.?')

df_out = df1.merge(df2, left_on='Domain', right_on='key', suffixes=('_x',''), how='outer')

df_out['Domain'] = df_out['Domain'].combine_first(df_out['Domain_x'])

df_out = df_out.drop(['Domain_x','key'], axis=1)

print(df_out)

Output:

   Visits   Domain  Ads
0     1.0  aaa.com  2.0
1     3.0      bbb  4.0
2     5.0      ddd  NaN
3     NaN    c.com  3.0
4     NaN    e.net  6.0
like image 28
Scott Boston Avatar answered Sep 22 '25 19:09

Scott Boston