How do I do a SQL style disjoint or set difference on two Pandas DataFrame objects?

Tags: python, pandas

I'm trying to use Pandas to solve an issue courtesy of an idiot DBA not doing a backup of a now crashed data set, so I'm trying to find differences between two columns. For reasons I won't get into, I'm using Pandas rather than a database.

What I'd like to do is, given:

Dataset A = [A, B, C, D, E]  
Dataset B = [C, D, E, F]

I would like to find values which are disjoint.

Dataset A!=B = [A, B, F]

In SQL, this is standard set logic, accomplished differently depending on the dialect, but a standard function. How do I elegantly apply this in Pandas? I would love to input some code, but nothing I have is even remotely correct. It's a situation in which I don't know what I don't know. Pandas has set logic for intersection and union, but nothing for disjoint/set difference.

Thanks!

JPKab asked Jan 18 '13 19:01



1 Answer

You can use the set.symmetric_difference function:

In [1]: from pandas import DataFrame

In [2]: df1 = DataFrame(list('ABCDE'), columns=['x'])

In [3]: df1
Out[3]:
   x
0  A
1  B
2  C
3  D
4  E

In [4]: df2 = DataFrame(list('CDEF'), columns=['y'])

In [5]: df2
Out[5]:
   y
0  C
1  D
2  E
3  F

In [6]: set(df1.x).symmetric_difference(df2.y)
Out[6]: {'A', 'B', 'F'}
Zelazny7 answered Oct 18 '22 06:10
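
For anyone who would rather stay inside pandas than drop down to Python sets, here is a minimal sketch of two alternatives. It is not part of the answer above. It reuses the column names `x` and `y` from that answer and assumes a reasonably recent pandas, since `merge` needs the `indicator` argument. The variable names (`sym_diff_values`, `merged`, `disjoint`, `merged_values`) are made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame(list("ABCDE"), columns=["x"])
df2 = pd.DataFrame(list("CDEF"), columns=["y"])

# Option 1: Index.symmetric_difference keeps the values that appear
# in exactly one of the two indexes.
sym_diff_values = list(pd.Index(df1["x"]).symmetric_difference(pd.Index(df2["y"])))

# Option 2: SQL-style full outer join. indicator=True adds a "_merge"
# column saying whether each row came from the left frame, the right
# frame, or both; dropping "both" leaves the disjoint rows.
merged = df1.merge(df2, left_on="x", right_on="y", how="outer", indicator=True)
disjoint = merged[merged["_merge"] != "both"]

# Collapse the two key columns into one list of values
# (x is NaN for right-only rows, so fall back to y there).
merged_values = disjoint["x"].fillna(disjoint["y"]).tolist()

print(sym_diff_values)
print(disjoint)
```

The merge version is closer to a `FULL OUTER JOIN ... WHERE` filter in SQL, and it keeps track of which side each value came from, which may help when reconciling two data sets.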