
Python 2.7 with Pandas: How does one recover the non-intersecting parts of two dataframes?

I have two dataframes, and the second is a subset of the first. How do I find the portion of the first dataframe that is not contained in the second? For example:

new_dataframe_1

    A   B   C   D
1   a   b   c   d
2   e   f   g   h
3   i   j   k   l
4   m   n   o   p


new_dataframe_2

    A   B   C   D
1   a   b   c   d
3   i   j   k   l


new_dataframe_3 = the rows of new_dataframe_1 that are not in its intersection with new_dataframe_2


    A   B   C   D
2   e   f   g   h
4   m   n   o   p

Thanks for your help!

Edit: I initially was calling the intersection the union, but have since changed this.

user3654387 asked May 25 '14 02:05





1 Answer

One way to do this is with isin, but you can also do it with merge; I show examples of both. For example:

>>> df1

   A  B  C  D
0  a  b  c  d
1  e  f  g  h
2  i  j  k  l
3  m  n  o  p

>>> df2

   A  B  C  D
0  a  b  c  d
1  i  j  k  l

>>> df1[~df1.isin(df2.to_dict('list')).all(axis=1)]

   A  B  C  D
1  e  f  g  h
3  m  n  o  p

Explanation: isin can check multiple columns at once if you pass it a dict that maps each column name to its allowed values:

>>> df2.to_dict('list')

{'A': ['a', 'i'], 'C': ['c', 'k'], 'B': ['b', 'j'], 'D': ['d', 'l']}

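isin then creates a boolean dataframe that I can use to select the rows we want (in this case, require all the columns to match and then negate with ~). Each column is checked independently, so this is exact only when no row of df1 mixes values taken from different rows of df2: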

>>> df1.isin(df2.to_dict('list'))

      A      B      C      D
0   True   True   True   True
1  False  False  False  False
2   True   True   True   True
3  False  False  False  False
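The column-by-column nature of this check can be seen in a small sketch. It assumes a hypothetical fifth row (a, j, c, l) that is not part of the original example; the names columnwise_result and rowwise_result, and the tuple-based row comparison, are illustrative choices rather than part of the approach above:

```python
import pandas as pd

df1 = pd.DataFrame(
    [list("abcd"), list("efgh"), list("ijkl"), list("mnop"), list("ajcl")],
    columns=list("ABCD"),
)
df2 = pd.DataFrame([list("abcd"), list("ijkl")], columns=list("ABCD"))

# Column-wise check: every value in row 4 occurs somewhere in the matching
# column of df2, so the row is treated as present even though df2 has no
# such row.
columnwise_mask = df1.isin(df2.to_dict("list")).all(axis=1)
columnwise_result = df1[~columnwise_mask]

# Row-wise check (an alternative, not the method shown above): compare
# whole rows as tuples so values must occur together in one row of df2.
df2_rows = set(df2.itertuples(index=False, name=None))
rowwise_mask = pd.Series(
    [row in df2_rows for row in df1.itertuples(index=False, name=None)],
    index=df1.index,
)
rowwise_result = df1[~rowwise_mask]
```

Here columnwise_result drops the extra row along with the true matches, while rowwise_result keeps it. For data like the original example, where no such mixed row exists, both give the same answer.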

In this specific example we don't need to pass isin a dict version of the dataframe, because column A alone identifies the matching rows:

>>> df1[~df1['A'].isin(df2['A'])]

   A  B  C  D
1  e  f  g  h
3  m  n  o  p
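If df2 was produced by slicing df1 and still carries the original index labels, as the question's example suggests (labels 1 and 3), the index itself can serve as the key. This is a minimal sketch under that assumption; the frames are rebuilt here for illustration and remaining is a hypothetical name:

```python
import pandas as pd

df1 = pd.DataFrame(
    [list("abcd"), list("efgh"), list("ijkl"), list("mnop")],
    columns=list("ABCD"),
    index=[1, 2, 3, 4],
)
# Assumption: df2 is a slice of df1, so it keeps the labels 1 and 3.
df2 = df1.loc[[1, 3]]

# Keep the rows of df1 whose index label does not appear in df2.
remaining = df1[~df1.index.isin(df2.index)]

# With unique labels that all exist in df1, dropping them is equivalent.
same_result = df1.drop(df2.index)
```

This does not apply if df2 was built separately or had its index reset, since the labels would then no longer line up with df1.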

You can also do this with merge. Add a marker column to the subset dataframe. After a left merge, the rows that appear only in the larger dataframe will have NaN in that column:

>>> df2['test'] = 1
>>> new = df1.merge(df2, on=['A', 'B', 'C', 'D'], how='left')
>>> new

   A  B  C  D  test
0  a  b  c  d     1
1  e  f  g  h   NaN
2  i  j  k  l     1
3  m  n  o  p   NaN

So select the rows where test is NaN (using isnull, because a direct == comparison with NaN is always False) and drop the test column:

>>> new[new.test.isnull()].drop('test',axis=1)

   A  B  C  D
1  e  f  g  h
3  m  n  o  p
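More recent pandas releases (0.17 and later, as far as I know) let merge add the marker column itself through the indicator parameter, so df2 does not have to be modified. This is a sketch of that variant, not the method used above; merged and result are hypothetical names:

```python
import pandas as pd

df1 = pd.DataFrame(
    [list("abcd"), list("efgh"), list("ijkl"), list("mnop")],
    columns=list("ABCD"),
)
df2 = pd.DataFrame([list("abcd"), list("ijkl")], columns=list("ABCD"))

# indicator=True adds a "_merge" column whose values are "left_only",
# "right_only", or "both", depending on where each row was found.
merged = df1.merge(df2, on=["A", "B", "C", "D"], how="left", indicator=True)

# Rows found only in df1 are the non-intersecting part.
result = merged[merged["_merge"] == "left_only"].drop("_merge", axis=1)
```

As with the manual marker column, the merged frame gets a fresh integer index, so the labels in result match df1 only when df1 already uses the default index. Duplicate rows in df2 would also produce repeated rows in the merged frame.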

Edit: @user3654387 notes that the merge method performs much better for large dataframes.

Karl D. answered Nov 03 '22 06:11
