I have a data frame with two columns, A and B. The order of A and B is unimportant in this context; for example, I would consider (0,50) and (50,0) to be duplicates. In pandas, what is an efficient way to remove these duplicates from a dataframe?
import pandas as pd
# Initial data frame.
data = pd.DataFrame({'A': [0, 10, 11, 21, 22, 35, 5, 50],
'B': [50, 22, 35, 5, 10, 11, 21, 0]})
data
A B
0 0 50
1 10 22
2 11 35
3 21 5
4 22 10
5 35 11
6 5 21
7 50 0
# Desired output with "duplicates" removed.
data2 = pd.DataFrame({'A': [0, 5, 10, 11],
'B': [50, 21, 22, 35]})
data2
A B
0 0 50
1 5 21
2 10 22
3 11 35
Ideally, the output would be sorted by the values of column A.
Use DataFrame.drop_duplicates() to drop duplicate rows and keep the first occurrence of each. Called without any arguments, DataFrame.drop_duplicates() drops rows that have the same values in all columns.
To drop duplicate columns from a pandas DataFrame, use df.T.drop_duplicates().T; this removes all columns that contain the same data, regardless of their names.
To find duplicate columns, iterate over the columns of the DataFrame and, for each one, check whether another column with the same contents has already been seen. If so, store that column's name in a set of duplicate columns.
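You can sort each row of the data frame before dropping the duplicates:
# result_type='broadcast' (pandas 0.23+) keeps the result a DataFrame with the
# original columns; without it, recent pandas returns a Series of lists here.
data.apply(lambda r: sorted(r), axis=1, result_type='broadcast').drop_duplicates()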
# A B
#0 0 50
#1 10 22
#2 11 35
#3 5 21
If you prefer the result to be sorted by column A:
data.apply(lambda r: sorted(r), axis=1, result_type='broadcast').drop_duplicates().sort_values('A')
# A B
#0 0 50
#3 5 21
#1 10 22
#2 11 35
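To get exactly the frame shown in the question, the index also has to be renumbered after sorting. Below is a minimal end-to-end sketch, assuming a pandas version that supports the result_type argument of apply (0.23 or later); the names row_sorted and deduped are only illustrative.

```python
import pandas as pd

data = pd.DataFrame({'A': [0, 10, 11, 21, 22, 35, 5, 50],
                     'B': [50, 22, 35, 5, 10, 11, 21, 0]})

# Sort the two values inside every row so that (50, 0) becomes (0, 50).
# result_type='broadcast' keeps the original shape and column names.
row_sorted = data.apply(lambda r: sorted(r), axis=1, result_type='broadcast')

deduped = (row_sorted.drop_duplicates()        # keep the first row of each pair
                     .sort_values('A')         # order by column A
                     .reset_index(drop=True))  # renumber the rows from 0
print(deduped)
```

After this, deduped should hold the same four rows as data2 in the question, in the same order.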
Here is a slightly uglier, but faster, solution (it assumes numpy has been imported as np):
In [44]: pd.DataFrame(np.sort(data.values, axis=1), columns=data.columns).drop_duplicates()
Out[44]:
A B
0 0 50
1 10 22
2 11 35
3 5 21
Timing for an 8,000-row DataFrame:
In [50]: big = pd.concat([data] * 10**3, ignore_index=True)
In [51]: big.shape
Out[51]: (8000, 2)
In [52]: %timeit big.apply(lambda r: sorted(r), axis = 1).drop_duplicates()
1 loop, best of 3: 3.04 s per loop
In [53]: %timeit pd.DataFrame(np.sort(big.values, axis=1), columns=big.columns).drop_duplicates()
100 loops, best of 3: 3.96 ms per loop
In [59]: %timeit big.apply(np.sort, axis = 1).drop_duplicates()
1 loop, best of 3: 2.69 s per loop
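Given these timings, the numpy version is probably the one worth wrapping up for reuse. The following is only a sketch: drop_unordered_duplicates and its cols parameter are names made up here, and it assumes the listed columns hold mutually comparable values (for example, all integers). Any other columns are carried along from the first row of each duplicate group.

```python
import numpy as np
import pandas as pd

def drop_unordered_duplicates(df, cols):
    """Drop rows that repeat the same values in `cols`, ignoring their order.

    Hypothetical helper, not part of pandas.
    """
    out = df.copy()
    # Sort the values within each row across the chosen columns only.
    out[cols] = np.sort(df[cols].to_numpy(), axis=1)
    return (out.drop_duplicates(subset=cols)  # keep the first row of each group
               .sort_values(cols[0])          # order by the first listed column
               .reset_index(drop=True))       # renumber the rows from 0

data = pd.DataFrame({'A': [0, 10, 11, 21, 22, 35, 5, 50],
                     'B': [50, 22, 35, 5, 10, 11, 21, 0]})
result = drop_unordered_duplicates(data, ['A', 'B'])
print(result)
```

Restricting the sort to cols means the same helper can be used on a wider frame where only some columns form the unordered pair.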