
set difference for pandas

A simple pandas question:

Is there a drop_duplicates() functionality to drop every row involved in the duplication?

An equivalent question is the following: Does pandas have a set difference for dataframes?

For example:

    In [5]: df1 = pd.DataFrame({'col1':[1,2,3], 'col2':[2,3,4]})

    In [6]: df2 = pd.DataFrame({'col1':[4,2,5], 'col2':[6,3,5]})

    In [7]: df1
    Out[7]:
       col1  col2
    0     1     2
    1     2     3
    2     3     4

    In [8]: df2
    Out[8]:
       col1  col2
    0     4     6
    1     2     3
    2     5     5

So maybe something like df2.set_diff(df1) would produce this:

       col1  col2
    0     4     6
    2     5     5

However, I don't want to rely on indexes because in my case, I have to deal with dataframes that have distinct indexes.

By the way, I initially thought about an extension of the current drop_duplicates() method, but now I realize that the second approach using properties of set theory would be far more useful in general. Both approaches solve my current problem, though.

Thanks!

Robert Smith asked Aug 12 '13 06:08




2 Answers

This is a bit convoluted, but it works if you want to ignore the index entirely. Convert the contents of each dataframe to a set of tuples, one tuple per row:

    ds1 = set(map(tuple, df1.values))
    ds2 = set(map(tuple, df2.values))

This step also removes any duplicate rows within each dataframe (the index is ignored):

set([(1, 2), (3, 4), (2, 3)])   # ds1 

You can then use the usual set methods on these. E.g., to find the difference:

ds1.difference(ds2) 

gives: set([(1, 2), (3, 4)])
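For what it's worth, the other set operations work the same way on these sets of tuples. Here is a small sketch using the question's df1 and df2. I'm going through `.values.tolist()` here (my own choice, not part of the snippet above) so the tuples hold plain Python ints and print cleanly:

```python
import pandas as pd

df1 = pd.DataFrame({'col1': [1, 2, 3], 'col2': [2, 3, 4]})
df2 = pd.DataFrame({'col1': [4, 2, 5], 'col2': [6, 3, 5]})

# One tuple per row; .tolist() converts NumPy scalars to plain Python ints.
ds1 = set(map(tuple, df1.values.tolist()))
ds2 = set(map(tuple, df2.values.tolist()))

only_in_df2 = ds2 - ds1   # difference: rows of df2 that are not in df1
in_both = ds1 & ds2       # intersection: rows present in both
in_either = ds1 | ds2     # union: every distinct row
in_one_only = ds1 ^ ds2   # symmetric difference: rows in exactly one frame

print(only_in_df2)
print(in_both)
```

Note that `ds2 - ds1` is the direction the question asks for (df2 minus df1), while `ds1.difference(ds2)` goes the other way.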

You can turn the result back into a dataframe if needed. Note that the set has to be converted to a list first, because a set cannot be used to construct a dataframe:

pd.DataFrame(list(ds1.difference(ds2))) 
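One thing to watch: building the dataframe from a bare list of tuples loses the column names, and a set does not keep row order. Below is a rough sketch of a helper that wraps the same idea but keeps both. The name `set_diff_rows` is made up for illustration (it is not a pandas function), and it assumes the row values are hashable:

```python
import pandas as pd

df1 = pd.DataFrame({'col1': [1, 2, 3], 'col2': [2, 3, 4]})
df2 = pd.DataFrame({'col1': [4, 2, 5], 'col2': [6, 3, 5]})


def set_diff_rows(left, right):
    """Rows of `left` that do not appear in `right`, ignoring both indexes.

    Hypothetical helper, not part of pandas. Duplicate rows in `left`
    are collapsed to one, as with a real set.
    """
    # itertuples(index=False, name=None) yields one plain tuple per row.
    right_rows = set(right.itertuples(index=False, name=None))
    seen = set()
    kept = []
    for row in left.itertuples(index=False, name=None):
        if row not in right_rows and row not in seen:
            seen.add(row)
            kept.append(row)
    # Rebuild with the original column names; the index is reset to 0..n-1.
    return pd.DataFrame(kept, columns=left.columns)


result = set_diff_rows(df2, df1)
print(result)
```

The original index labels are not carried over, which should be fine if you don't want to rely on indexes anyway.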
Joop answered Oct 05 '22 00:10


Here's another answer that keeps the index and does not require identical indexes in the two data frames. (EDIT: make sure there are no duplicates in df2 beforehand.)

pd.concat([df2, df1, df1]).drop_duplicates(keep=False) 

It is fast and the result is

       col1  col2
    0     4     6
    2     5     5
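To illustrate why the duplicates caveat matters, here is a small sketch where df2 is modified (my own variation on the question's data) to contain the row (4, 6) twice. With `keep=False`, that row counts as a duplicate of itself and disappears, so deduplicating df2 first is one way to guard against it:

```python
import pandas as pd

df1 = pd.DataFrame({'col1': [1, 2, 3], 'col2': [2, 3, 4]})
# Variation on the question's df2: the row (4, 6) now appears twice.
df2 = pd.DataFrame({'col1': [4, 2, 5, 4], 'col2': [6, 3, 5, 6]})

# Without deduplication, both copies of (4, 6) are dropped along with
# every row that also appears in df1.
naive = pd.concat([df2, df1, df1]).drop_duplicates(keep=False)

# Deduplicating df2 first keeps one copy of (4, 6) in the result.
safe = pd.concat([df2.drop_duplicates(), df1, df1]).drop_duplicates(keep=False)

print(naive)
print(safe)
```

df1 is passed twice so that every row of df1 is guaranteed to be a duplicate and gets removed, even if it does not appear in df2.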
radream answered Oct 04 '22 22:10