I have two data frames and the second is a subset of the first. How do I now find the portion of the first dataframe that is not contained in the second one? For example:
new_dataframe_1
A B C D
1 a b c d
2 e f g h
3 i j k l
4 m n o p
new_dataframe_2
A B C D
1 a b c d
3 i j k l
new_dataframe_3 = not intersection of new_dataframe_1 and new_dataframe_2
A B C D
2 e f g h
4 m n o p
Thanks for your help!
Edit: I initially was calling the intersection the union, but have since changed this.
The intersection (or, as here, the set difference) of two dataframes in pandas can be achieved in a roundabout way using the merge() function. You can also use pandas.concat to concatenate the two dataframes row-wise, followed by drop_duplicates(keep=False) to remove every row that appears in both. If merge() raises "MergeError: No common columns to perform merge on", pass the left_on and right_on arguments to tell pandas explicitly which key columns to merge the dataframes on; everything else stays the same. join() is similar, but combines data on a key column or on the index.
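A minimal sketch of that concat + drop_duplicates route, assuming df2 really is a row-for-row subset of df1 and that neither frame contains duplicate rows of its own (the frame construction below is just for illustration):
>>> import pandas as pd
>>> df1 = pd.DataFrame([list('abcd'), list('efgh'), list('ijkl'), list('mnop')], columns=list('ABCD'))
>>> df2 = df1.iloc[[0, 2]]
>>> # rows present in both frames appear twice after concat; keep=False drops every such row
>>> pd.concat([df1, df2]).drop_duplicates(keep=False)
   A  B  C  D
1  e  f  g  h
3  m  n  o  p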
Well, one way to do this is using isin (but you can also do it with the merge command ... I show examples for both). For example:
>>> df1
A B C D
0 a b c d
1 e f g h
2 i j k l
3 m n o p
>>> df2
A B C D
0 a b c d
1 i j k l
>>> df1[~df1.isin(df2.to_dict('list')).all(axis=1)]
A B C D
1 e f g h
3 m n o p
Explanation: isin can check against multiple columns if you feed it a dict:
>>> df2.to_dict('list')
{'A': ['a', 'i'], 'C': ['c', 'k'], 'B': ['b', 'j'], 'D': ['d', 'l']}
isin then creates a boolean dataframe, which we can use to select the rows we want (in this case we require all columns to match with .all(axis=1) and then negate with ~):
>>> df1.isin(df2.to_dict('list'))
A B C D
0 True True True True
1 False False False False
2 True True True True
3 False False False False
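For reference, collapsing that boolean frame with .all(axis=1) (continuing from the df1/df2 above) gives the row mask that then gets negated with ~:
>>> df1.isin(df2.to_dict('list')).all(axis=1)
0     True
1    False
2     True
3    False
dtype: bool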
In this specific example we don't need to feed isin a dict version of the dataframe, because we can identify the valid rows by looking at column A alone:
>>> df1[~df1['A'].isin(df2['A'])]
A B C D
1 e f g h
3 m n o p
You can also do this with merge. Create a marker column in the subset dataframe; when you merge, the rows that exist only in the larger dataframe will have NaN in the column you created:
>>> df2['test'] = 1
>>> new = df1.merge(df2,on=['A','B','C','D'],how='left')
>>> new
A B C D test
0 a b c d 1
1 e f g h NaN
2 i j k l 1
3 m n o p NaN
So select the rows where test is NaN and drop the test column:
>>> new[new.test.isnull()].drop('test',axis=1)
A B C D
1 e f g h
3 m n o p
Edit: @user3654387 notes that the merge method performs much better for large dataframes.
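If your pandas is recent enough (0.17 or later), here is a minimal sketch of the same merge-based anti-join using merge's indicator argument, which avoids adding a dummy test column by hand (this reuses the df1/df2 from above, selecting only the original columns of df2):
>>> merged = df1.merge(df2[['A','B','C','D']], on=['A','B','C','D'], how='left', indicator=True)
>>> merged[merged['_merge'] == 'left_only'].drop('_merge', axis=1)
   A  B  C  D
1  e  f  g  h
3  m  n  o  p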