set difference for pandas

A simple pandas question:

Is there a drop_duplicates() functionality to drop every row involved in the duplication?

An equivalent question is the following: Does pandas have a set difference for dataframes?

For example:

In [5]: df1 = pd.DataFrame({'col1':[1,2,3], 'col2':[2,3,4]})

In [6]: df2 = pd.DataFrame({'col1':[4,2,5], 'col2':[6,3,5]})

In [7]: df1
Out[7]:
   col1  col2
0     1     2
1     2     3
2     3     4

In [8]: df2
Out[8]:
   col1  col2
0     4     6
1     2     3
2     5     5

so maybe something like df2.set_diff(df1) will produce this:

   col1  col2
0     4     6
2     5     5

However, I don't want to rely on indexes because in my case, I have to deal with dataframes that have distinct indexes.

By the way, I initially thought about an extension of the current drop_duplicates() method, but now I realize that the second approach using properties of set theory would be far more useful in general. Both approaches solve my current problem, though.

Thanks!

How is the difference in pandas calculated?

The difference between consecutive rows or columns of a pandas DataFrame is computed with the diff() method. The axis parameter decides whether the difference is taken between rows or between columns. When the periods parameter is positive, each row has the row that many positions before it subtracted from it. This is an element-wise numeric difference, not a set difference.
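To make the behaviour of diff() concrete, here is a minimal sketch. The frame `df` and its values are made up for illustration and are not part of the question.

```python
import pandas as pd

# Hypothetical example data, chosen so the differences are easy to check by hand.
df = pd.DataFrame({"a": [1, 4, 9], "b": [10, 20, 40]})

# Default axis=0, periods=1: each row minus the row before it.
# The first row has no predecessor, so it becomes NaN.
row_diff = df.diff()

# axis=1: each column minus the column to its left.
# The first column has nothing to its left, so it becomes NaN.
col_diff = df.diff(axis=1)

print(row_diff)
print(col_diff)
```

Note that the result has the same shape as the input, which is why diff() does not help with removing rows that two dataframes have in common.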

How do you use sets in pandas?

Although pandas does not offer dedicated methods for set operations, we can mimic them as follows:

Union: concat() + drop_duplicates()
Intersection: merge()
Difference: isin() + Boolean indexing
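The three recipes above could look roughly like this on the dataframes from the question. This is a sketch, not the only way to do it: in particular, converting rows to a MultiIndex so that isin() compares whole rows is my own choice for the multi-column case, and the variable names are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({"col1": [1, 2, 3], "col2": [2, 3, 4]})
df2 = pd.DataFrame({"col1": [4, 2, 5], "col2": [6, 3, 5]})

# Union: stack both frames, then drop rows that appear more than once.
union = pd.concat([df1, df2], ignore_index=True).drop_duplicates()

# Intersection: an inner merge on all shared columns keeps rows present in both.
# (If either frame contains repeated rows, the merge will repeat them too.)
intersection = df1.merge(df2, how="inner")

# Difference (df2 minus df1): treat every row as one MultiIndex entry so that
# isin() compares complete rows, then keep the df2 rows not found in df1.
in_df1 = pd.MultiIndex.from_frame(df2).isin(pd.MultiIndex.from_frame(df1))
difference = df2[~in_df1]

print(difference)
```

The difference keeps the original index labels of df2 (0 and 2 here) and never looks at the index of df1.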

What is diff() in Python?

numpy.diff(arr[, n[, axis]]) calculates the n-th order discrete difference along the given axis. The first-order difference is given by out[i] = arr[i+1] - arr[i] along that axis. Higher-order differences are calculated by applying diff recursively.
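A small sketch of first- and second-order differences with numpy.diff; the array is arbitrary example data.

```python
import numpy as np

arr = np.array([1, 4, 9, 16])

# First-order difference: arr[i+1] - arr[i]
first = np.diff(arr)

# Second-order difference: diff applied to the first-order result
second = np.diff(arr, n=2)

print(first)
print(second)
```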

How do I compare two pandas DataFrames?

The compare method in pandas shows the differences between two DataFrames. It compares them cell by cell and presents the differing values side by side. The compare method can only compare DataFrames of the same shape with identical row and column labels.
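As a rough illustration of compare(), the sketch below uses two hypothetical frames, `left` and `right`, that differ in a single cell. It assumes a pandas version that provides DataFrame.compare (1.1 or later).

```python
import pandas as pd

# Hypothetical frames with identical shape and labels; one cell differs.
left = pd.DataFrame({"col1": [1, 2, 3], "col2": [2, 3, 4]})
right = pd.DataFrame({"col1": [1, 2, 3], "col2": [2, 9, 4]})

# Only rows and columns containing a difference are kept; each remaining
# column is split into "self" (left) and "other" (right).
changes = left.compare(right)

print(changes)
```

Because compare() matches cells by label, it answers "what changed at the same position" rather than "which rows exist in one frame but not the other", so it does not replace a set difference.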

This is a bit convoluted, but it works if you want to ignore the index entirely. Convert the contents of each dataframe to a set of tuples, one tuple per row:

ds1 = set(map(tuple, df1.values))
ds2 = set(map(tuple, df2.values))

This step also gets rid of any duplicate rows in the dataframes (the index is ignored):

set([(1, 2), (3, 4), (2, 3)])   # ds1

You can then use set methods to compare the two, e.g. to find the difference:

ds1.difference(ds2)

gives: set([(1, 2), (3, 4)])

You can turn that back into a dataframe if needed. Note that the set has to be converted to a list first, because a set cannot be used to construct a dataframe:

pd.DataFrame(list(ds1.difference(ds2)))
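Putting the steps together, an end-to-end version might look like the sketch below. It departs from the snippets above in three places, all of them my own assumptions rather than part of the original answer: it uses itertuples() instead of .values (which avoids casting mixed-type columns to a common dtype), it computes df2 minus df1 (the direction asked for in the question), and it sorts the result because sets are unordered.

```python
import pandas as pd

df1 = pd.DataFrame({"col1": [1, 2, 3], "col2": [2, 3, 4]})
df2 = pd.DataFrame({"col1": [4, 2, 5], "col2": [6, 3, 5]})

# itertuples(index=False, name=None) yields one plain tuple per row,
# leaving the index out entirely.
ds1 = set(df1.itertuples(index=False, name=None))
ds2 = set(df2.itertuples(index=False, name=None))

# Rows that occur in df2 but not in df1.
only_in_df2 = ds2 - ds1

# Sets have no order, so sort for a reproducible result and
# restore the original column names.
result = pd.DataFrame(sorted(only_in_df2), columns=df2.columns)

print(result)
```

The resulting dataframe gets a fresh 0..n-1 index, since the original index labels were discarded when the rows were turned into tuples.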

Here's another answer that keeps the index and does not require identical indexes in the two data frames. (EDIT: make sure there are no duplicates in df2 beforehand, otherwise rows repeated within df2 are dropped as well.)

pd.concat([df2, df1, df1]).drop_duplicates(keep=False)

It is fast and the result is:

   col1  col2
0     4     6
2     5     5
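The caveat about duplicates in df2 can be seen in the sketch below, where df2 is extended with a repeated row (5, 5) purely for illustration. It also shows one possible alternative that is not part of the original answer: an anti-join built from merge() with indicator=True, which keeps rows that are repeated inside df2.

```python
import pandas as pd

df1 = pd.DataFrame({"col1": [1, 2, 3], "col2": [2, 3, 4]})
# Illustrative variant of df2 with the row (5, 5) repeated.
df2 = pd.DataFrame({"col1": [4, 2, 5, 5], "col2": [6, 3, 5, 5]})

# The concat trick: (5, 5) is duplicated inside df2 itself,
# so keep=False removes it along with the rows shared with df1.
via_concat = pd.concat([df2, df1, df1]).drop_duplicates(keep=False)

# Anti-join: a left merge with indicator=True adds a "_merge" column that
# marks each df2 row as "left_only" or "both". Keeping "left_only" rows
# preserves df2's own repeats. df1 is de-duplicated first so the merge
# cannot multiply rows.
merged = df2.merge(df1.drop_duplicates(), how="left", indicator=True)
via_merge = merged.loc[merged["_merge"] == "left_only", ["col1", "col2"]]

print(via_concat)
print(via_merge)
```

One trade-off: merge() returns a new integer index rather than df2's original labels, so the concat approach is still the simpler choice when df2 has no repeated rows and the index must be kept.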

Tags: python, pandas, dataframe

Robert Smith

2 Answers

Joop

radream
