What's the most efficient way to select the second-to-last row of each duplicated set in a pandas DataFrame?

For instance, I basically want this operation:

df = df.drop_duplicates(['Person', 'Question'], keep='last')

except keeping the second-to-last occurrence instead, as if there were a

df = df.drop_duplicates(['Person', 'Question'], keep='second_last')  # no such option

Abstracted question: how do you choose which duplicate to keep when the one you want is neither the first nor the last occurrence?
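A toy illustration of the desired behaviour (the data here is made up for this question):

import pandas as pd

df = pd.DataFrame({'Person':   ['alice', 'alice', 'alice', 'bob'],
                   'Question': [1, 1, 1, 2],
                   'Answer':   ['first try', 'revised', 'final', 'only']})

# Desired result: for the duplicated ('alice', 1) rows, keep 'revised'
# (the second-to-last occurrence); 'bob' has a single row, so keep it as-is.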
For reference, DataFrame.drop_duplicates(subset=None, keep='first', inplace=False) removes duplicate rows. subset takes a column label or a list of labels to consider when identifying duplicates; by default all columns are used. keep determines which duplicates (if any) to keep: 'first' drops all duplicates except the first occurrence, 'last' drops all except the last occurrence, and False drops every duplicated row. If inplace=True, the source DataFrame is modified and None is returned. Called with no arguments at all, drop_duplicates() keeps the first of each set of rows that are identical across all columns.
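Using the toy frame from the question above, the three keep options behave like this:

df.drop_duplicates(['Person', 'Question'], keep='first')  # keeps rows 0 and 3
df.drop_duplicates(['Person', 'Question'], keep='last')   # keeps rows 2 and 3
df.drop_duplicates(['Person', 'Question'], keep=False)    # keeps row 3 only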
With groupby.apply:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 1, 2, 2, 2, 3, 3, 4],
                   'B': np.arange(10), 'C': np.arange(10)})
df
Out:
A B C
0 1 0 0
1 1 1 1
2 1 2 2
3 1 3 3
4 2 4 4
5 2 5 5
6 2 6 6
7 3 7 7
8 3 8 8
9 4 9 9
# Keep the second-to-last row of each group; a single-row group is kept as-is.
# The double brackets in x.iloc[[-2]] return a one-row DataFrame, not a Series.
(df.groupby('A', as_index=False).apply(lambda x: x if len(x)==1 else x.iloc[[-2]])
 .reset_index(level=0, drop=True))
Out:
A B C
2 1 2 2
5 2 5 5
7 3 7 7
9 4 9 9
With a different DataFrame, subsetting on two columns:
df = pd.DataFrame({'A': [1, 1, 1, 1, 2, 2, 2, 3, 3, 4],
                   'B': [1, 1, 2, 1, 2, 2, 2, 3, 3, 4], 'C': np.arange(10)})
df
Out:
A B C
0 1 1 0
1 1 1 1
2 1 2 2
3 1 1 3
4 2 2 4
5 2 2 5
6 2 2 6
7 3 3 7
8 3 3 8
9 4 4 9
# Same idea, grouping on both key columns.
(df.groupby(['A', 'B'], as_index=False).apply(lambda x: x if len(x)==1 else x.iloc[[-2]])
 .reset_index(level=0, drop=True))
Out:
A B C
1 1 1 1
2 1 2 2
5 2 2 5
7 3 3 7
9 4 4 9
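If you need a position other than second-to-last, the same apply can be parameterized. A sketch (the helper name keep_nth_from_last and the fall-back-to-first policy for short groups are my own choices, not part of the answer above):

def keep_nth_from_last(g, n=2):
    # Take the n-th row from the end of the group; if the group has
    # fewer than n rows, fall back to its first row.
    return g.iloc[[-n]] if len(g) >= n else g.iloc[[0]]

(df.groupby(['A', 'B'], as_index=False)
   .apply(keep_nth_from_last)
   .reset_index(level=0, drop=True))

With the default n=2 this reproduces the result above.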
You could groupby/tail(2) to take the last two items, then groupby/head(1) to take the first item from the tail:

df.groupby(['A','B']).tail(2).groupby(['A','B']).head(1)

If there is only one item in the group, tail(2) returns just that one item.
For example,
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(10, size=(10**2, 3)), columns=list('ABC'))
result = df.groupby(['A','B']).tail(2).groupby(['A','B']).head(1)
expected = (df.groupby(['A', 'B'], as_index=False)
              .apply(lambda x: x if len(x)==1 else x.iloc[[-2]])
              .reset_index(level=0, drop=True))
assert expected.sort_index().equals(result)
The builtin groupby methods (such as tail and head) are often much faster than groupby/apply with custom Python functions. This is especially true when there are a lot of groups:
In [96]: %timeit df.groupby(['A','B']).tail(2).groupby(['A','B']).head(1)
1000 loops, best of 3: 1.7 ms per loop
In [97]: %timeit (df.groupby(['A', 'B'], as_index=False).apply(lambda x: x if len(x)==1 else x.iloc[[-2]]).reset_index(level=0, drop=True))
100 loops, best of 3: 17.9 ms per loop
Alternatively, ayhan suggests a nice improvement:
alt = df.groupby(['A','B']).tail(2).drop_duplicates(['A','B'])
assert expected.sort_index().equals(alt)
In [99]: %timeit df.groupby(['A','B']).tail(2).drop_duplicates(['A','B'])
1000 loops, best of 3: 1.43 ms per loop
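The same trick generalizes beyond the second-to-last (n below is an ordinary variable, not a pandas option): tail(n) keeps at most the last n rows of each group in their original order, and drop_duplicates, whose default is keep='first', then keeps the earliest of those, i.e. the n-th row from the end, or the first available row for shorter groups.

n = 2
result = df.groupby(['A', 'B']).tail(n).drop_duplicates(['A', 'B'])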