 

Remove consecutive duplicates, keeping the middle one

Tags: python, pandas

This is my DataFrame:

                 dt    value
2000-01-01 00:00:00    296.0
2000-01-01 00:05:00    296.0
2000-01-01 00:10:00    296.0
2000-01-01 00:15:00    296.25
2000-01-01 00:20:00    296.25
2000-01-01 00:25:00    296.25
2000-01-01 00:30:00    296.25
2000-01-01 00:35:00    296.25
2000-01-01 00:40:00    296.25
2000-01-01 00:45:00    296.5
2000-01-01 00:50:00    296.5
2000-01-01 00:55:00    296.5
2000-01-01 01:00:00    296.5
2000-01-01 01:05:00    296.5
2000-01-01 01:10:00    296.5
2000-01-01 01:15:00    296.75
2000-01-01 01:20:00    296.75
2000-01-01 01:50:00    297.0
2000-01-01 01:55:00    297.0
2000-01-01 02:00:00    297.0
2000-01-01 02:05:00    297.0
2000-01-01 02:10:00    297.0
2000-01-01 02:15:00    297.0

I want to remove adjacent duplicates, keeping only the row in the middle of each run. If a run has an even number of rows, take the one just past the middle; with 2 duplicates, keep the second. For example, the 296.25 run has six rows (positions 0-5), so position 3 (00:30:00) is kept.

Expected output:

                 dt    value
2000-01-01 00:05:00    296.0
2000-01-01 00:30:00    296.25
2000-01-01 01:00:00    296.5
2000-01-01 01:20:00    296.75
2000-01-01 02:05:00    297.0

I read this post about duplicates, but it doesn't handle my requirement of keeping the middle element:
Pandas: Drop consecutive duplicates
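
For a reproducible setup, here is a minimal sketch that rebuilds the first few rows of the frame (assuming the timestamps are parsed with pd.to_datetime):

import pandas as pd

# first nine rows of the sample frame
df = pd.DataFrame({
    'dt': pd.to_datetime([
        '2000-01-01 00:00:00', '2000-01-01 00:05:00', '2000-01-01 00:10:00',
        '2000-01-01 00:15:00', '2000-01-01 00:20:00', '2000-01-01 00:25:00',
        '2000-01-01 00:30:00', '2000-01-01 00:35:00', '2000-01-01 00:40:00',
    ]),
    'value': [296.0, 296.0, 296.0,
              296.25, 296.25, 296.25, 296.25, 296.25, 296.25],
})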

Asked Dec 13 '22 by Igor K.

2 Answers

Use:

import math

# group by value and take the middle row (the upper middle for even-length groups)
df.groupby('value')['dt'].apply(lambda x: x.iloc[math.floor(len(x) / 2)])

Output

value
296.00    2000-01-01 00:05:00
296.25    2000-01-01 00:30:00
296.50    2000-01-01 01:00:00
296.75    2000-01-01 01:20:00
297.00    2000-01-01 02:05:00
Name: dt, dtype: object

Needless to say, you have to import math (included above). Note that this groups by value globally rather than by consecutive runs, so it assumes a value never reappears in a later run (which holds for the sample data).
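
If a value can recur in a separate run, a variant along the same lines (a sketch, reusing the run-labelling trick from the answer below) groups by consecutive-run ids instead:

import math

# label each run of consecutive equal values: 1, 2, 3, ...
run_id = df['value'].ne(df['value'].shift()).cumsum()

# take the middle row of each run
df.groupby(run_id)['dt'].apply(lambda x: x.iloc[math.floor(len(x) / 2)])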

Timings

@ALollz's answer:

3.82 ms ± 442 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

This one:

2.09 ms ± 129 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Answered Dec 25 '22 by Vivek Kalyanarangan


To avoid the slower groupby + apply(lambda ...), we can instead use the built-in methods implemented in Cython. Form groups of consecutive values by checking against the shifted Series, then aggregate those groups with size and cumcount to subset the original DataFrame. (Bonus: this preserves the original DataFrame index.)

# label each run of consecutive equal values
gp = df['value'].ne(df['value'].shift()).cumsum()
gp = gp.groupby(gp)

# keep the row whose position within its run equals half the run's size
df[gp.transform('size')//2 == gp.cumcount()]

                     dt   value
1   2000-01-01 00:05:00  296.00
6   2000-01-01 00:30:00  296.25
12  2000-01-01 01:00:00  296.50
16  2000-01-01 01:20:00  296.75
20  2000-01-01 02:05:00  297.00
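
To make the mechanism concrete, here is a small sketch (assuming the sample df from the question) that prints each row's run label, run size, and position within its run:

import pandas as pd

run_id = df['value'].ne(df['value'].shift()).cumsum()
g = run_id.groupby(run_id)

# one row per original row: its run id, the run's size, and its 0-based position
print(pd.concat({'value': df['value'],
                 'run': run_id,
                 'size': g.transform('size'),
                 'pos': g.cumcount()}, axis=1))

The kept rows are exactly those where pos == size // 2.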

To illustrate the relative performance, here's an example. I create a test DataFrame where each value is repeated twice (for simplicity), and we can see how the two methods scale as a function of the number of groups (the quantity that matters for the scaling of groupby + apply).

import perfplot
import pandas as pd
import numpy as np

def groupby_apply(df):
    # python-level apply: one lambda call per run
    gp = df['value'].ne(df['value'].shift()).cumsum()
    return df.groupby(gp).apply(lambda x: x.iloc[len(x) // 2])

def groupby_builtin(df):
    # cython-backed size/cumcount, then a single boolean mask
    gp = df['value'].ne(df['value'].shift()).cumsum()
    gp = gp.groupby(gp)

    return df[gp.transform('size')//2 == gp.cumcount()]

perfplot.show(
    setup=lambda N: pd.DataFrame({'dt': range(N),
                                  'value': np.repeat(np.arange(N//2), 2)}),
    kernels=[groupby_apply, groupby_builtin],
    labels=['apply', 'built-in'],
    n_range=[2 ** k for k in range(1, 20)],
    equality_check=np.allclose,
    xlabel='Proxy for # of Groups'
)
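
If perfplot isn't installed, a quick spot check with timeit (a sketch reusing the functions above; the test size is arbitrary) should show the same gap:

import timeit

# each value repeated twice, as in the perfplot setup
df_test = pd.DataFrame({'dt': range(10_000),
                        'value': np.repeat(np.arange(5_000), 2)})

print(timeit.timeit(lambda: groupby_apply(df_test), number=10))
print(timeit.timeit(lambda: groupby_builtin(df_test), number=10))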

[perfplot chart: runtime of the 'apply' and 'built-in' methods vs. the number of groups]

Answered Dec 25 '22 by ALollz