What's the most efficient way to select the second-to-last row of each duplicated set in a pandas DataFrame?

For instance, I basically want this operation:

df = df.drop_duplicates(['Person', 'Question'], keep='last')

except keeping the second-to-last occurrence instead, as if there were a

df = df.drop_duplicates(['Person', 'Question'], keep='second_last')  # no such option

Abstracted question: how do you choose which duplicate to keep when the one you want is neither the first nor the last occurrence?
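A toy illustration of the desired behaviour (the data here is made up for this question):

import pandas as pd

df = pd.DataFrame({'Person':   ['alice', 'alice', 'alice', 'bob'],
                   'Question': [1, 1, 1, 2],
                   'Answer':   ['first try', 'revised', 'final', 'only']})

# Desired result: for the duplicated ('alice', 1) rows, keep 'revised'
# (the second-to-last occurrence); 'bob' has a single row, so keep it as-is.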
For reference, DataFrame.drop_duplicates(subset=None, keep='first', inplace=False) removes duplicate rows. subset takes a column label or a list of labels to consider when identifying duplicates; by default all columns are used. keep determines which duplicates (if any) to keep: 'first' drops all duplicates except the first occurrence, 'last' drops all except the last occurrence, and False drops every duplicated row. If inplace=True, the source DataFrame is modified and None is returned. Called with no arguments at all, drop_duplicates() keeps the first of each set of rows that are identical across all columns.
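Using the toy frame from the question above, the three keep options behave like this:

df.drop_duplicates(['Person', 'Question'], keep='first')  # keeps rows 0 and 3
df.drop_duplicates(['Person', 'Question'], keep='last')   # keeps rows 2 and 3
df.drop_duplicates(['Person', 'Question'], keep=False)    # keeps row 3 only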
With groupby.apply:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 1, 2, 2, 2, 3, 3, 4],
                   'B': np.arange(10), 'C': np.arange(10)})
df
Out:
A B C
0 1 0 0
1 1 1 1
2 1 2 2
3 1 3 3
4 2 4 4
5 2 5 5
6 2 6 6
7 3 7 7
8 3 8 8
9 4 9 9
# Keep the second-to-last row of each group; a single-row group is kept as-is.
# The double brackets in x.iloc[[-2]] return a one-row DataFrame, not a Series.
(df.groupby('A', as_index=False).apply(lambda x: x if len(x)==1 else x.iloc[[-2]])
 .reset_index(level=0, drop=True))
Out:
A B C
2 1 2 2
5 2 5 5
7 3 7 7
9 4 9 9
With a different DataFrame, subsetting on two columns:
df = pd.DataFrame({'A': [1, 1, 1, 1, 2, 2, 2, 3, 3, 4],
                   'B': [1, 1, 2, 1, 2, 2, 2, 3, 3, 4], 'C': np.arange(10)})
df
Out:
A B C
0 1 1 0
1 1 1 1
2 1 2 2
3 1 1 3
4 2 2 4
5 2 2 5
6 2 2 6
7 3 3 7
8 3 3 8
9 4 4 9
# Same idea, grouping on both key columns.
(df.groupby(['A', 'B'], as_index=False).apply(lambda x: x if len(x)==1 else x.iloc[[-2]])
 .reset_index(level=0, drop=True))
Out:
A B C
1 1 1 1
2 1 2 2
5 2 2 5
7 3 3 7
9 4 4 9
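If you need a position other than second-to-last, the same apply can be parameterized. A sketch (the helper name keep_nth_from_last and the fall-back-to-first policy for short groups are my own choices, not part of the answer above):

def keep_nth_from_last(g, n=2):
    # Take the n-th row from the end of the group; if the group has
    # fewer than n rows, fall back to its first row.
    return g.iloc[[-n]] if len(g) >= n else g.iloc[[0]]

(df.groupby(['A', 'B'], as_index=False)
   .apply(keep_nth_from_last)
   .reset_index(level=0, drop=True))

With the default n=2 this reproduces the result above.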
You could groupby/tail(2) to take the last two items, then groupby/head(1) to take the first item from the tail:

df.groupby(['A','B']).tail(2).groupby(['A','B']).head(1)

If there is only one item in the group, tail(2) returns just that one item.
For example,
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(10, size=(10**2, 3)), columns=list('ABC'))
result = df.groupby(['A','B']).tail(2).groupby(['A','B']).head(1)
expected = (df.groupby(['A', 'B'], as_index=False)
              .apply(lambda x: x if len(x)==1 else x.iloc[[-2]])
              .reset_index(level=0, drop=True))
assert expected.sort_index().equals(result)
The builtin groupby methods (such as tail and head) are often much faster than groupby/apply with custom Python functions. This is especially true when there are a lot of groups:
In [96]: %timeit df.groupby(['A','B']).tail(2).groupby(['A','B']).head(1)
1000 loops, best of 3: 1.7 ms per loop
In [97]: %timeit (df.groupby(['A', 'B'], as_index=False).apply(lambda x: x if len(x)==1 else x.iloc[[-2]]).reset_index(level=0, drop=True))
100 loops, best of 3: 17.9 ms per loop
Alternatively, ayhan suggests a nice improvement:
alt = df.groupby(['A','B']).tail(2).drop_duplicates(['A','B'])
assert expected.sort_index().equals(alt)
In [99]: %timeit df.groupby(['A','B']).tail(2).drop_duplicates(['A','B'])
1000 loops, best of 3: 1.43 ms per loop
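The same trick generalizes beyond the second-to-last (n below is an ordinary variable, not a pandas option): tail(n) keeps at most the last n rows of each group in their original order, and drop_duplicates, whose default is keep='first', then keeps the earliest of those, i.e. the n-th row from the end, or the first available row for shorter groups.

n = 2
result = df.groupby(['A', 'B']).tail(n).drop_duplicates(['A', 'B'])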