Remove consecutive duplicates while keeping the max value

Tags: python, pandas

I am trying to remove consecutive duplicates from column X while keeping, within each run of duplicates, the row with the maximum value in column Y, so far without success. The data frame is as follows:

idx X Y
0 A 3
1 B 2
2 A 7
3 A 10
4 B 1
5 C 4
6 A 3
7 A 3

What I want to achieve is:

idx X Y
0 A 3
1 B 2
3 A 10
4 B 1
5 C 4
7 A 3

Most of the solutions I found simply remove all duplicates, without accounting for values that repeat again later in the column.

Please note that consecutive duplicates might share the same Y value.

asked Dec 22 '21 11:12 by massigarg


2 Answers

You need an itertools-style groupby, i.e. one that groups only consecutive equal values of X, and then keep the rows where Y is maximal within each group.

>>> df 
   idx  X   Y
0    0  A   3
1    1  B   2
2    2  A   7
3    3  A  10
4    4  B   1
5    5  C   4
6    6  A   3
7    7  A   5
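>>> # A new run starts wherever X differs from the previous row; cumsum() numbers the runs.
>>> y_max = df.groupby(df['X'].ne(df['X'].shift()).cumsum())['Y'].transform('max')
>>> # Keep every row whose Y equals the maximum of its run.
>>> df[df['Y'] == y_max]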
   idx  X   Y
0    0  A   3
1    1  B   2
3    3  A  10
4    4  B   1
5    5  C   4
7    7  A   5
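
To make the grouping key less opaque, here is a minimal sketch that builds it step by step. It assumes the same X and Y values as the session above; the names `starts_new_run` and `run_id` are introduced only for illustration, and the default integer index stands in for the idx column.

```python
import pandas as pd

# Same X and Y values as in the session above (row 7 has Y = 5 there).
df = pd.DataFrame({
    "X": list("ABAABCAA"),
    "Y": [3, 2, 7, 10, 1, 4, 3, 5],
})

# True wherever X differs from the previous row. The first row is always True
# because shift() leaves a missing value there.
starts_new_run = df["X"].ne(df["X"].shift())

# Running total of the True flags: each consecutive run gets its own number.
run_id = starts_new_run.cumsum()
print(run_id.tolist())  # [1, 2, 3, 3, 4, 5, 6, 6]

# Broadcast each run's maximum Y back onto its rows, then filter.
y_max = df.groupby(run_id)["Y"].transform("max")
result = df[df["Y"] == y_max]
print(result.index.tolist())  # [0, 1, 3, 4, 5, 7]
```

Rows 2 and 3 share run number 3, and rows 6 and 7 share run number 6, which is why only those two runs lose a row.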

edit:

Initial solution had a bug and only produced the correct idx column by accident.

edit 2:

If you only want to keep one row per group (the filter above keeps every row that ties for the maximum), you can use

>>> # idxmax() returns the index label of the first maximum in each run.
>>> y_idxmax = df.groupby(df['X'].ne(df['X'].shift()).cumsum())['Y'].idxmax()
>>> df.loc[y_idxmax]
   idx  X   Y
0    0  A   3
1    1  B   2
3    3  A  10
4    4  B   1
5    5  C   4
7    7  A   5

Credit goes to Ch3steR for this one.
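
The difference between the two approaches only shows up with ties, which the question's own data contains (rows 6 and 7 both have Y = 3). The sketch below is one possible way to compare them, and to keep the last tied row as in the question's desired output. The reversal trick and the names `keep_all_ties`, `keep_first` and `keep_last` are assumptions made for illustration, not part of the answer.

```python
import pandas as pd

# Data from the question: rows 6 and 7 tie at Y = 3.
df = pd.DataFrame({
    "X": list("ABAABCAA"),
    "Y": [3, 2, 7, 10, 1, 4, 3, 3],
})
run_id = df["X"].ne(df["X"].shift()).cumsum()

# Filtering on the per-run maximum keeps every tied row.
y_max = df.groupby(run_id)["Y"].transform("max")
keep_all_ties = df[df["Y"] == y_max].index.tolist()

# idxmax() reports the first maximum of each run.
keep_first = df.groupby(run_id)["Y"].idxmax().tolist()

# Walking the rows in reverse turns "first maximum" into the last one
# in the original order; sorting restores the original row order.
rev = df.iloc[::-1]
keep_last = sorted(rev.groupby(run_id.iloc[::-1])["Y"].idxmax().tolist())

print(keep_all_ties)  # [0, 1, 3, 4, 5, 6, 7]
print(keep_first)     # [0, 1, 3, 4, 5, 6]
print(keep_last)      # [0, 1, 3, 4, 5, 7]
result = df.loc[keep_last]
```

Which variant is appropriate depends on whether tied rows should all survive, or only the first or last of them.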

answered Oct 04 '22 22:10 by timgeb


Alternatively, I'd prefer to simply pass the group key to groupby and take the maximum:

df.groupby(df['X'].ne(df['X'].shift()).cumsum(), as_index=False).max()

Or:

df.groupby(df['X'].ne(df['X'].shift()).cumsum()).max().reset_index(drop=True)

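Both output the following. Note that max() is applied to each column independently, so idx and Y line up here only because the largest idx and the largest Y fall on the same row in every run: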

   idx  X   Y
0    0  A   3
1    1  B   2
2    3  A  10
3    4  B   1
4    5  C   4
5    7  A   5
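
A small sketch of what a column-wise `max()` does when the largest Y is not on the last row of a run. The three-row frame is hypothetical data chosen to expose the effect, and the key is renamed to `run` (also a made-up name) so it cannot be confused with the X column.

```python
import pandas as pd

# Hypothetical data: within the run of A's, the largest Y sits on the first row.
df = pd.DataFrame({
    "idx": [0, 1, 2],
    "X": ["A", "A", "B"],
    "Y": [10, 7, 1],
})
run_id = df["X"].ne(df["X"].shift()).cumsum().rename("run")

# Column-wise aggregation: idx and Y are maximised independently,
# so the first result row combines idx = 1 with Y = 10.
aggregated = df.groupby(run_id).max().reset_index(drop=True)
print(aggregated)

# Row-based selection returns rows that actually exist in df:
# the first result row is idx = 0 with Y = 10.
selected = df.loc[df.groupby(run_id)["Y"].idxmax()]
print(selected)
```

If the other columns have to stay attached to the row holding the maximum, the `idxmax()` approach from the first answer is the safer choice.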

answered Oct 04 '22 22:10 by U12-Forward