What is the most Pythonic way to drop only the first in a series of duplicates?
I have a dataframe:
Group Value
a 0
a 1
a 2
b 6
b 7
b 8
and I want the following result:
Group Value
a 1
a 2
b 7
b 8
drop_duplicates keeps the first or last item depending on its keep argument. I want to drop the first occurrence where there is a duplicate and keep the rest.
Use duplicated() to create a boolean mask and filter on it:
df[df.Group.duplicated()]
#Group Value
#1 a 1
#2 a 2
#4 b 7
#5 b 8
By default, duplicated marks every duplicate except the first occurrence as True:
df.Group.duplicated()
#0 False
#1 True
#2 True
#3 False
#4 True
#5 True
#Name: Group, dtype: bool
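For anyone who wants to run this end to end, here is a minimal self-contained sketch. The DataFrame construction is my assumption based on the table in the question; the names mask and result are only illustrative.

```python
import pandas as pd

# Rebuild the frame from the question (assumed construction)
df = pd.DataFrame({"Group": list("aaabbb"), "Value": [0, 1, 2, 6, 7, 8]})

# duplicated() is True for every repeat of a Group value,
# and False for the first time each value appears
mask = df["Group"].duplicated()

# Keeping only the True rows drops the first row of each group
result = df[mask]
print(result)
```

Note that this also drops any group that has only a single row, since that row is a "first occurrence" too.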
To handle the edge case where a group has only one row and that row should be kept (this is less efficient):
df[df.Group.duplicated() | df.Group.groupby(df.Group).transform('count').eq(1)]
# Group Value
#1 a 1
#2 a 2
#4 b 7
#5 b 8
Or:
df[df.Group.groupby(df.Group).transform(lambda x: (x.size == 1) | x.duplicated())]
# Group Value
#1 a 1
#2 a 2
#4 b 7
#5 b 8
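As a possible alternative that avoids groupby, the sketch below combines two duplicated() masks. This is not one of the approaches above, just a variation; it assumes a frame with an extra single-row group c, and the names is_repeat and is_singleton are illustrative.

```python
import pandas as pd

# Assumed sample data: same as the question plus a single-row group "c"
df = pd.DataFrame({"Group": list("aaabbbc"), "Value": [0, 1, 2, 6, 7, 8, 11]})

# True for every occurrence of a Group value after its first
is_repeat = df["Group"].duplicated()

# duplicated(keep=False) flags all members of repeated groups,
# so negating it is True only for values that appear exactly once
is_singleton = ~df["Group"].duplicated(keep=False)

# Keep later occurrences, plus rows whose group has a single member
result = df[is_repeat | is_singleton]
print(result)
```

Whether this is actually faster than the transform versions would need to be measured on your own data.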
If a group has only one row and you want to keep it:
df.groupby('Group').Value.apply(lambda x : x.iloc[1:] if len(x)>1 else x).reset_index('Group')
Out[144]:
Group Value
1 a 1
2 a 2
4 b 7
5 b 8
6 c 11
Data input
df
Out[138]:
Group Value
0 a 0
1 a 1
2 a 2
3 b 6
4 b 7
5 b 8
6 c 11