Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python Pandas: Merge or Filter DataFrame by Another. Is there a Better Way?

One situation I sometimes encounter is, I have two dataframes (df1, df2) and I want to create a new dataframe (df3) based on the intersection of multiple columns between df1 and df2.

For example, I want to create df3 by filtering df1 by columns Campaign and Group.

import pandas as pd
df1 = pd.DataFrame({'Campaign':['Campaign 1', 'Campaign 2', 'Campaign 3', 'Campaign 3', 'Campaign 4'], 'Group':['Some group', 'Arbitrary Group', 'Group 1', 'Group 2', 'Done Group'], 'Metric':[245,91,292,373,32]}, columns=['Campaign', 'Group', 'Metric'])
df2 = pd.DataFrame({'Campaign':['Campaign 3', 'Campaign 3'], 'Group':['Group 1', 'Group 2'], 'Metric':[23, 456]}, columns=['Campaign', 'Group', 'Metric'])

df1

     Campaign            Group  Metric
0  Campaign 1       Some group     245
1  Campaign 2  Arbitrary Group      91
2  Campaign 3          Group 1     292
3  Campaign 3          Group 2     373
4  Campaign 4       Done Group      32

df2

     Campaign    Group  Metric
0  Campaign 3  Group 1      23
1  Campaign 3  Group 2     456

I know I can do this with merge...

df3 = df1.merge(df2, how='inner', on=['Campaign', 'Group'], suffixes=('','_del'))
#df3
     Campaign    Group  Metric  Metric_del
0  Campaign 3  Group 1     292          23
1  Campaign 3  Group 2     373         456

but then I have to figure out how to drop columns that end with _del. I guess this:

df3.select(lambda x: not re.search('_del', x), axis=1)
##The result I'm going for but required merge, then select (2-steps)
     Campaign    Group  Metric
0  Campaign 3  Group 1     292
1  Campaign 3  Group 2     373

Questions

What I'm mainly interested in is returning df1 that's simply filtered on df2's Campaign|Group values.

  1. Is there a better way to return df1 without resorting to merge?

  2. Is there a way to merge but NOT return df2's columns to the merge and returning only df1's columns?

like image 298
Jarad Avatar asked Aug 10 '15 17:08

Jarad


People also ask

Is pandas merge efficient?

Merge can be used in cases where both the left and right columns are not unique, and therefore cannot be an index. A merge is also just as efficient as a join as long as: Merging is done on indexes if possible.

Is join or merge faster pandas?

As you can see, the merge is faster than joins, though it is small value, but over 4000 iterations, that small value becomes a huge number, in minutes.

What is the difference between merging and concatenation in pandas?

Concat function concatenates dataframes along rows or columns. We can think of it as stacking up multiple dataframes. Merge combines dataframes based on values in shared columns. Merge function offers more flexibility compared to concat function because it allows combinations based on a condition.

When to Use join or merge pandas?

Both join and merge can be used to combines two dataframes but the join method combines two dataframes on the basis of their indexes whereas the merge method is more versatile and allows us to specify columns beside the index to join on for both dataframes.


1 Answers

Assuming that your df1 and df2 have exactly the same columns. You can first set those join-key columns as index and use df1.reindex(df2.index) and a further .dropna() to produce the intersection.

df3 = df1.set_index(['Campaign', 'Group'])
df4 = df2.set_index(['Campaign', 'Group'])
# reindex first and dropna will produce the intersection
df3.reindex(df4.index).dropna(how='all').reset_index()

     Campaign    Group  Metric
0  Campaign 3  Group 1     292
1  Campaign 3  Group 2     373

Edit:

Use .isin when key is not unique.

# create some duplicated keys and values
df3 = df3.append(df3)
df4 = df4.append(df4)

# isin
df3[df3.index.isin(df4.index)].reset_index()

     Campaign    Group  Metric
0  Campaign 3  Group 1     292
1  Campaign 3  Group 2     373
2  Campaign 3  Group 1     292
3  Campaign 3  Group 2     373
like image 196
Jianxun Li Avatar answered Sep 23 '22 00:09

Jianxun Li