What is the most Pythonic way to drop only the first in a series of duplicates?
I have a dataframe:
Group    Value
  a        0
  a        1
  a        2
  b        6
  b        7
  b        8
and I want the following result:
Group    Value
  a        1
  a        2
  b        7
  b        8
drop_duplicates keeps the first or last item depending on how you set it. I want to drop the first occurrence where there is a duplicate and keep the rest.
Use duplicated() to create a boolean mask and filter based on it:
df[df.Group.duplicated()]
#Group  Value
#1   a      1
#2   a      2
#4   b      7
#5   b      8
By default, duplicated marks every duplicate except the first occurrence as True:
df.Group.duplicated()
#0    False
#1     True
#2     True
#3    False
#4     True
#5     True
#Name: Group, dtype: bool
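For anyone who wants to run this end to end, here is a minimal self-contained sketch. It rebuilds the dataframe from the question by hand (the construction and the variable names `mask` and `result` are my own, not part of the original answer) and spells out the default `keep="first"` explicitly:

```python
import pandas as pd

# Rebuild the example dataframe from the question.
df = pd.DataFrame({
    "Group": ["a", "a", "a", "b", "b", "b"],
    "Value": [0, 1, 2, 6, 7, 8],
})

# keep="first" is the default: the first occurrence of each Group value
# is marked False, every later occurrence is marked True.
mask = df["Group"].duplicated(keep="first")

# Selecting with the mask drops the first row of each group.
result = df[mask]
print(result)
```

Note that the original index labels (1, 2, 4, 5) are preserved; call `reset_index(drop=True)` on the result if a fresh index is wanted.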
To handle the edge case where a group has only one row and that row should be kept (this is less efficient):
df[df.Group.duplicated() | df.Group.groupby(df.Group).transform('count').eq(1)]
# Group Value
#1    a     1
#2    a     2
#4    b     7
#5    b     8
Or:
df[df.Group.groupby(df.Group).transform(lambda x: (x.size == 1) | x.duplicated())]
# Group  Value
#1    a      1
#2    a      2
#4    b      7
#5    b      8
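A possible alternative for the single-row edge case that avoids groupby altogether: combine the default mask with `duplicated(keep=False)`, which marks every member of a repeated value as True, so its negation picks out values that occur exactly once. This is only a sketch, not taken from the answers above; the extra group `c` and the names `not_first`, `singleton`, and `kept` are assumptions added for illustration.

```python
import pandas as pd

# Same data as the question, plus a hypothetical single-row group "c".
df = pd.DataFrame({
    "Group": ["a", "a", "a", "b", "b", "b", "c"],
    "Value": [0, 1, 2, 6, 7, 8, 11],
})

# True for every occurrence after the first one.
not_first = df["Group"].duplicated()

# keep=False marks all members of a repeated value as True,
# so the negation is True only for values that appear once.
singleton = ~df["Group"].duplicated(keep=False)

# Keep later duplicates and lone rows; drop only the first of each repeated group.
kept = df[not_first | singleton]
print(kept)
```

Both masks are computed in a single pass over the column, so this should scale better than a `transform` with a Python lambda, though I have not benchmarked it.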
If a group has only one row, you probably want to keep it:
df.groupby('Group').Value.apply(lambda x : x.iloc[1:] if len(x)>1 else x).reset_index('Group')
Out[144]: 
  Group  Value
1     a      1
2     a      2
4     b      7
5     b      8
6     c     11
Data input
df
Out[138]: 
  Group  Value
0     a      0
1     a      1
2     a      2
3     b      6
4     b      7
5     b      8
6     c     11