One situation I sometimes encounter is, I have two dataframes (df1
, df2
) and I want to create a new dataframe (df3
) based on the intersection of multiple columns between df1
and df2
.
For example, I want to create df3
by filtering df1
by columns Campaign
and Group
.
import pandas as pd
df1 = pd.DataFrame({'Campaign':['Campaign 1', 'Campaign 2', 'Campaign 3', 'Campaign 3', 'Campaign 4'], 'Group':['Some group', 'Arbitrary Group', 'Group 1', 'Group 2', 'Done Group'], 'Metric':[245,91,292,373,32]}, columns=['Campaign', 'Group', 'Metric'])
df2 = pd.DataFrame({'Campaign':['Campaign 3', 'Campaign 3'], 'Group':['Group 1', 'Group 2'], 'Metric':[23, 456]}, columns=['Campaign', 'Group', 'Metric'])
df1
Campaign Group Metric
0 Campaign 1 Some group 245
1 Campaign 2 Arbitrary Group 91
2 Campaign 3 Group 1 292
3 Campaign 3 Group 2 373
4 Campaign 4 Done Group 32
df2
Campaign Group Metric
0 Campaign 3 Group 1 23
1 Campaign 3 Group 2 456
I know I can do this with merge...
df3 = df1.merge(df2, how='inner', on=['Campaign', 'Group'], suffixes=('','_del'))
#df3
Campaign Group Metric Metric_del
0 Campaign 3 Group 1 292 23
1 Campaign 3 Group 2 373 456
but then I have to figure out how to drop
columns that end with _del
. I guess this:
df3.select(lambda x: not re.search('_del', x), axis=1)
##The result I'm going for but required merge, then select (2-steps)
Campaign Group Metric
0 Campaign 3 Group 1 292
1 Campaign 3 Group 2 373
Questions
What I'm mainly interested in is returning df1
that's simply filtered on df2
's Campaign|Group
values.
Is there a better way to return df1
without resorting to merge
?
Is there a way to merge
but NOT return df2
's columns to the merge
and returning only df1
's columns?
Merge can be used in cases where both the left and right columns are not unique, and therefore cannot be an index. A merge is also just as efficient as a join as long as: Merging is done on indexes if possible.
As you can see, the merge is faster than joins, though it is small value, but over 4000 iterations, that small value becomes a huge number, in minutes.
Concat function concatenates dataframes along rows or columns. We can think of it as stacking up multiple dataframes. Merge combines dataframes based on values in shared columns. Merge function offers more flexibility compared to concat function because it allows combinations based on a condition.
Both join and merge can be used to combines two dataframes but the join method combines two dataframes on the basis of their indexes whereas the merge method is more versatile and allows us to specify columns beside the index to join on for both dataframes.
Assuming that your df1
and df2
have exactly the same columns. You can first set those join-key columns as index and use df1.reindex(df2.index)
and a further .dropna()
to produce the intersection.
df3 = df1.set_index(['Campaign', 'Group'])
df4 = df2.set_index(['Campaign', 'Group'])
# reindex first and dropna will produce the intersection
df3.reindex(df4.index).dropna(how='all').reset_index()
Campaign Group Metric
0 Campaign 3 Group 1 292
1 Campaign 3 Group 2 373
Use .isin
when key is not unique.
# create some duplicated keys and values
df3 = df3.append(df3)
df4 = df4.append(df4)
# isin
df3[df3.index.isin(df4.index)].reset_index()
Campaign Group Metric
0 Campaign 3 Group 1 292
1 Campaign 3 Group 2 373
2 Campaign 3 Group 1 292
3 Campaign 3 Group 2 373
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With