Remove reverse duplicates from dataframe

Tags:

python

pandas

dataframe

I have a data frame with two columns, A and B. The order of A and B is unimportant in this context; for example, I would consider (0,50) and (50,0) to be duplicates. In pandas, what is an efficient way to remove these duplicates from a dataframe?

import pandas as pd

# Initial data frame.
data = pd.DataFrame({'A': [0, 10, 11, 21, 22, 35, 5, 50], 
                     'B': [50, 22, 35, 5, 10, 11, 21, 0]})
data
    A   B
0   0  50
1  10  22
2  11  35
3  21   5
4  22  10
5  35  11
6   5  21
7  50   0

# Desired output with "duplicates" removed. 
data2 = pd.DataFrame({'A': [0, 5, 10, 11], 
                      'B': [50, 21, 22, 35]})
data2
    A   B
0   0  50
1   5  21
2  10  22
3  11  35

Ideally, the output would be sorted by values of column A.

asked Nov 07 '16 21:11

Adam

2 Answers

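You can sort the values within each row before dropping the duplicates. In pandas 0.23 and later, pass result_type='broadcast' so that apply returns a DataFrame with the original columns rather than a Series of lists:

data.apply(lambda r: sorted(r), axis=1, result_type='broadcast').drop_duplicates()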

#     A   B
# 0   0  50
# 1  10  22
# 2  11  35
# 3   5  21

If you prefer the result to be sorted by column A:

data.apply(lambda r: sorted(r), axis=1, result_type='broadcast').drop_duplicates().sort_values('A')

#     A   B
# 0   0  50
# 3   5  21
# 1  10  22
# 2  11  35
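
As a rough sketch of how the steps above could be combined, the snippet below wraps them in a helper and resets the index so the result lines up with data2 from the question. The function name drop_reversed_duplicates and its sort_by parameter are made up for illustration, and result_type='broadcast' assumes pandas 0.23 or later.

```python
import pandas as pd

data = pd.DataFrame({'A': [0, 10, 11, 21, 22, 35, 5, 50],
                     'B': [50, 22, 35, 5, 10, 11, 21, 0]})


def drop_reversed_duplicates(df, sort_by='A'):
    """Hypothetical helper: treat (a, b) and (b, a) as the same row."""
    # Sort the values inside each row so that reversed pairs become identical.
    # result_type='broadcast' keeps the original columns and index.
    ordered = df.apply(lambda r: sorted(r), axis=1, result_type='broadcast')
    # Drop the now-identical rows, order by the chosen column,
    # and renumber the index from 0.
    return (ordered.drop_duplicates()
                   .sort_values(sort_by)
                   .reset_index(drop=True))


result = drop_reversed_duplicates(data)
print(result)
```

Note that sorting within rows rewrites a pair such as (21, 5) as (5, 21), so the smaller value always ends up in column A.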

answered Sep 30 '22 08:09

Psidom

Here is a slightly uglier but faster solution (assuming NumPy is imported as np):

In [44]: pd.DataFrame(np.sort(data.values, axis=1), columns=data.columns).drop_duplicates()
Out[44]:
    A   B
0   0  50
1  10  22
2  11  35
3   5  21
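
For anyone running this outside an IPython session, here is a minimal self-contained sketch of the same NumPy approach, including the imports. It also sorts by column A and resets the index, which the question asked for; the variable names sorted_rows, deduped and expected are placeholders, and to_numpy() assumes pandas 0.24 or later (data.values works the same way on older versions).

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'A': [0, 10, 11, 21, 22, 35, 5, 50],
                     'B': [50, 22, 35, 5, 10, 11, 21, 0]})

# np.sort with axis=1 orders the values within each row,
# so (50, 0) becomes (0, 50) and matches the first row.
sorted_rows = np.sort(data.to_numpy(), axis=1)

deduped = (pd.DataFrame(sorted_rows, columns=data.columns)
             .drop_duplicates()
             .sort_values('A')
             .reset_index(drop=True))

# The desired output from the question, for comparison.
expected = pd.DataFrame({'A': [0, 5, 10, 11],
                         'B': [50, 21, 22, 35]})

print(deduped)
print((deduped.to_numpy() == expected.to_numpy()).all())
```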

Timing for an 8,000-row DataFrame:

In [50]: big = pd.concat([data] * 10**3, ignore_index=True)

In [51]: big.shape
Out[51]: (8000, 2)

In [52]: %timeit big.apply(lambda r: sorted(r), axis = 1).drop_duplicates()
1 loop, best of 3: 3.04 s per loop

In [53]: %timeit pd.DataFrame(np.sort(big.values, axis=1), columns=big.columns).drop_duplicates()
100 loops, best of 3: 3.96 ms per loop

In [59]: %timeit big.apply(np.sort, axis = 1).drop_duplicates()
1 loop, best of 3: 2.69 s per loop
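
The comparison above relies on IPython's %timeit. A rough equivalent with the standard timeit module might look like the sketch below; the function names with_apply and with_numpy are invented here, the apply variant passes result_type='broadcast' (an assumption for pandas 0.23 or later), and the absolute numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

data = pd.DataFrame({'A': [0, 10, 11, 21, 22, 35, 5, 50],
                     'B': [50, 22, 35, 5, 10, 11, 21, 0]})

# Same 8,000-row frame as in the timings above.
big = pd.concat([data] * 10**3, ignore_index=True)


def with_apply(df):
    # Row-wise Python sort: one function call per row.
    return df.apply(lambda r: sorted(r), axis=1,
                    result_type='broadcast').drop_duplicates()


def with_numpy(df):
    # Single vectorized sort over the underlying array.
    return pd.DataFrame(np.sort(df.values, axis=1),
                        columns=df.columns).drop_duplicates()


for fn in (with_apply, with_numpy):
    # Best of 3 single runs, loosely mirroring "best of 3" in %timeit.
    seconds = min(timeit.repeat(lambda: fn(big), number=1, repeat=3))
    print(f"{fn.__name__}: {seconds:.4f} s")
```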

answered Sep 30 '22 09:09

MaxU - stop WAR against UA
