I have two data frames and the second is a subset of the first. How do I now find the portion of the first dataframe that is not contained in the second one? For example:
new_dataframe_1
A B C D
1 a b c d
2 e f g h
3 i j k l
4 m n o p
new_dataframe_2
A B C D
1 a b c d
3 i j k l
new_dataframe_3 = not intersection of new_dataframe_1 and new_dataframe_2
A B C D
2 e f g h
4 m n o p
Thanks for your help!
Edit: I initially was calling the intersection the union, but have since changed this.
The intersection (or, as here, the set difference) of two dataframes in pandas can be achieved in a roundabout way using the merge() function. You can also use pandas.concat to concatenate the two dataframes row-wise, followed by drop_duplicates(keep=False) to remove every row that appears in both. If merge() raises "MergeError: No common columns to perform merge on", pass the left_on and right_on arguments to tell pandas explicitly which key columns to merge the dataframes on; everything else stays the same. join() is similar, but combines data on a key column or on the index.
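A minimal sketch of that concat + drop_duplicates route, assuming df2 really is a row-for-row subset of df1 and that neither frame contains duplicate rows of its own (the frame construction below is just for illustration):
>>> import pandas as pd
>>> df1 = pd.DataFrame([list('abcd'), list('efgh'), list('ijkl'), list('mnop')], columns=list('ABCD'))
>>> df2 = df1.iloc[[0, 2]]
>>> # rows present in both frames appear twice after concat; keep=False drops every such row
>>> pd.concat([df1, df2]).drop_duplicates(keep=False)
   A  B  C  D
1  e  f  g  h
3  m  n  o  p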
Well, one way to do this is using isin (but you can also do it with the merge command ... I show examples for both). For example:
>>> df1
A B C D
0 a b c d
1 e f g h
2 i j k l
3 m n o p
>>> df2
A B C D
0 a b c d
1 i j k l
>>> df1[~df1.isin(df2.to_dict('list')).all(axis=1)]
A B C D
1 e f g h
3 m n o p
Explanation: isin can check against multiple columns if you feed it a dict:
>>> df2.to_dict('list')
{'A': ['a', 'i'], 'C': ['c', 'k'], 'B': ['b', 'j'], 'D': ['d', 'l']}
isin then creates a boolean dataframe, which we can use to select the rows we want (in this case we require all columns to match with .all(axis=1) and then negate with ~):
>>> df1.isin(df2.to_dict('list'))
A B C D
0 True True True True
1 False False False False
2 True True True True
3 False False False False
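For reference, collapsing that boolean frame with .all(axis=1) (continuing from the df1/df2 above) gives the row mask that then gets negated with ~:
>>> df1.isin(df2.to_dict('list')).all(axis=1)
0     True
1    False
2     True
3    False
dtype: bool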
In this specific example we don't need to feed isin a dict version of the dataframe, because we can identify the valid rows by looking at column A alone:
>>> df1[~df1['A'].isin(df2['A'])]
A B C D
1 e f g h
3 m n o p
You can also do this with merge. Create a marker column in the subset dataframe; when you merge, the rows that exist only in the larger dataframe will have NaN in the column you created:
>>> df2['test'] = 1
>>> new = df1.merge(df2,on=['A','B','C','D'],how='left')
>>> new
A B C D test
0 a b c d 1
1 e f g h NaN
2 i j k l 1
3 m n o p NaN
So select the rows where test is NaN and drop the test column:
>>> new[new.test.isnull()].drop('test',axis=1)
A B C D
1 e f g h
3 m n o p
Edit: @user3654387 notes that the merge method performs much better for large dataframes.
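If your pandas is recent enough (0.17 or later), here is a minimal sketch of the same merge-based anti-join using merge's indicator argument, which avoids adding a dummy test column by hand (this reuses the df1/df2 from above, selecting only the original columns of df2):
>>> merged = df1.merge(df2[['A','B','C','D']], on=['A','B','C','D'], how='left', indicator=True)
>>> merged[merged['_merge'] == 'left_only'].drop('_merge', axis=1)
   A  B  C  D
1  e  f  g  h
3  m  n  o  p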