
Python Pandas Drop Duplicates keep second to last

Tags: python, pandas

What's the most efficient way to select the second-to-last row of each duplicated set in a pandas DataFrame?

For instance, I basically want this operation:

df = df.drop_duplicates(['Person','Question'], keep='last')

but with something like this instead:

df = df.drop_duplicates(['Person','Question'], keep='second_last')

(which doesn't exist). Abstracted question: how do I choose which duplicate to keep when the one I want is neither the first nor the last occurrence?
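For context, in current pandas the old take_last flag has been replaced by keep, which only accepts 'first', 'last', or False, so there is no built-in way to keep the second-to-last occurrence. A minimal sketch of what the built-in options do (the DataFrame here is made up for illustration):

```python
import pandas as pd

# three duplicates for ('a', 'q1'), one row for ('b', 'q1')
df = pd.DataFrame({'Person':   ['a', 'a', 'a', 'b'],
                   'Question': ['q1', 'q1', 'q1', 'q1'],
                   'Answer':   [1, 2, 3, 4]})

# keep='first' retains the first row of each duplicated set
first = df.drop_duplicates(['Person', 'Question'], keep='first')
print(first['Answer'].tolist())  # [1, 4]

# keep='last' retains the last row of each duplicated set
last = df.drop_duplicates(['Person', 'Question'], keep='last')
print(last['Answer'].tolist())   # [3, 4]
```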

asked Aug 15 '16 by David Yang


2 Answers

With groupby.apply:

df = pd.DataFrame({'A': [1, 1, 1, 1, 2, 2, 2, 3, 3, 4], 
                   'B': np.arange(10), 'C': np.arange(10)})

df
Out: 
   A  B  C
0  1  0  0
1  1  1  1
2  1  2  2
3  1  3  3
4  2  4  4
5  2  5  5
6  2  6  6
7  3  7  7
8  3  8  8
9  4  9  9

(df.groupby('A', as_index=False).apply(lambda x: x if len(x)==1 else x.iloc[[-2]])
   .reset_index(level=0, drop=True))
Out: 
   A  B  C
2  1  2  2
5  2  5  5
7  3  7  7
9  4  9  9

With a different DataFrame, subset two columns:

df = pd.DataFrame({'A': [1, 1, 1, 1, 2, 2, 2, 3, 3, 4], 
                   'B': [1, 1, 2, 1, 2, 2, 2, 3, 3, 4], 'C': np.arange(10)})

df
Out: 
   A  B  C
0  1  1  0
1  1  1  1
2  1  2  2
3  1  1  3
4  2  2  4
5  2  2  5
6  2  2  6
7  3  3  7
8  3  3  8
9  4  4  9

(df.groupby(['A', 'B'], as_index=False).apply(lambda x: x if len(x)==1 else x.iloc[[-2]])
   .reset_index(level=0, drop=True))
Out: 
   A  B  C
1  1  1  1
2  1  2  2
5  2  2  5
7  3  3  7
9  4  4  9
answered Oct 12 '22 by ayhan


You could groupby/tail(2) to take the last 2 items, then groupby/head(1) to take the first item from the tail:

df.groupby(['A','B']).tail(2).groupby(['A','B']).head(1)

If there is only one item in the group, tail(2) returns just the one item.


For example,

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(10, size=(10**2, 3)), columns=list('ABC'))
result = df.groupby(['A','B']).tail(2).groupby(['A','B']).head(1)

expected = (df.groupby(['A', 'B'], as_index=False).apply(lambda x: x if len(x)==1 else x.iloc[[-2]]).reset_index(level=0, drop=True))
assert expected.sort_index().equals(result)

The builtin groupby methods (such as tail and head) are often much faster than groupby/apply with custom Python functions. This is especially true if there are a lot of groups:

In [96]: %timeit df.groupby(['A','B']).tail(2).groupby(['A','B']).head(1)
1000 loops, best of 3: 1.7 ms per loop

In [97]: %timeit (df.groupby(['A', 'B'], as_index=False).apply(lambda x: x if len(x)==1 else x.iloc[[-2]]).reset_index(level=0, drop=True))
100 loops, best of 3: 17.9 ms per loop

Alternatively, ayhan suggests a nice improvement:

alt = df.groupby(['A','B']).tail(2).drop_duplicates(['A','B'])
assert expected.sort_index().equals(alt)

In [99]: %timeit df.groupby(['A','B']).tail(2).drop_duplicates(['A','B'])
1000 loops, best of 3: 1.43 ms per loop
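The tail/drop_duplicates trick generalizes naturally to "keep the k-th row from the end" (falling back to the earliest available row when a group has fewer than k rows). A small sketch of that idea, with a hypothetical helper name:

```python
import numpy as np
import pandas as pd

def keep_kth_from_last(df, subset, k):
    """Keep the k-th-from-last row of each group defined by `subset`.

    tail(k) takes the last k rows of each group; drop_duplicates then
    keeps the first of those, i.e. the k-th row from the end. Groups
    with fewer than k rows keep their earliest row.
    """
    return df.groupby(subset).tail(k).drop_duplicates(subset)

# same data as the first answer's example
df = pd.DataFrame({'A': [1, 1, 1, 1, 2, 2, 2, 3, 3, 4],
                   'B': np.arange(10), 'C': np.arange(10)})

result = keep_kth_from_last(df, ['A'], 2)
print(result['B'].tolist())  # [2, 5, 7, 9]
```

With k=2 this reproduces the second-to-last rows from the first answer's output.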
answered Oct 12 '22 by unutbu