A simple pandas question:
Is there a drop_duplicates()
functionality to drop every row involved in the duplication?
An equivalent question is the following: Does pandas have a set difference for dataframes?
For example:
In [5]: df1 = pd.DataFrame({'col1':[1,2,3], 'col2':[2,3,4]}) In [6]: df2 = pd.DataFrame({'col1':[4,2,5], 'col2':[6,3,5]}) In [7]: df1 Out[7]: col1 col2 0 1 2 1 2 3 2 3 4 In [8]: df2 Out[8]: col1 col2 0 4 6 1 2 3 2 5 5
so maybe something like df2.set_diff(df1)
will produce this:
col1 col2 0 4 6 2 5 5
However, I don't want to rely on indexes because in my case, I have to deal with dataframes that have distinct indexes.
By the way, I initially thought about an extension of the current drop_duplicates()
method, but now I realize that the second approach using properties of set theory would be far more useful in general. Both approaches solve my current problem, though.
Thanks!
Difference between rows or columns of a pandas DataFrame object is found using the diff() method. The axis parameter decides whether difference to be calculated is between rows or between columns. When the periods parameter assumes positive values, difference is found by subtracting the previous row from the next row.
Set Operations in Pandas Although pandas does not offer specific methods for performing set operations, we can easily mimic them using the below methods: Union: concat() + drop_duplicates() Intersection: merge() Difference: isin() + Boolean indexing.
diff(arr[, n[, axis]]) function is used when we calculate the n-th order discrete difference along the given axis. The first order difference is given by out[i] = arr[i+1] – arr[i] along the given axis. If we have to calculate higher differences, we are using diff recursively. Syntax: numpy.diff()
The compare method in pandas shows the differences between two DataFrames. It compares two data frames, row-wise and column-wise, and presents the differences side by side. The compare method can only compare DataFrames of the same shape, with exact dimensions and identical row and column labels.
Bit convoluted but if you want to totally ignore the index data. Convert the contents of the dataframes to sets of tuples containing the columns:
ds1 = set(map(tuple, df1.values)) ds2 = set(map(tuple, df2.values))
This step will get rid of any duplicates in the dataframes as well (index ignored)
set([(1, 2), (3, 4), (2, 3)]) # ds1
can then use set methods to find anything. Eg to find differences:
ds1.difference(ds2)
gives: set([(1, 2), (3, 4)])
can take that back to dataframe if needed. Note have to transform set to list 1st as set cannot be used to construct dataframe:
pd.DataFrame(list(ds1.difference(ds2)))
Here's another answer that keeps the index and does not require identical indexes in two data frames. (EDIT: make sure there is no duplicates in df2 beforehand)
pd.concat([df2, df1, df1]).drop_duplicates(keep=False)
It is fast and the result is
col1 col2 0 4 6 2 5 5
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With