This is my DataFrame:
dt value
2000-01-01 00:00:00 296.0
2000-01-01 00:05:00 296.0
2000-01-01 00:10:00 296.0
2000-01-01 00:15:00 296.25
2000-01-01 00:20:00 296.25
2000-01-01 00:25:00 296.25
2000-01-01 00:30:00 296.25
2000-01-01 00:35:00 296.25
2000-01-01 00:40:00 296.25
2000-01-01 00:45:00 296.5
2000-01-01 00:50:00 296.5
2000-01-01 00:55:00 296.5
2000-01-01 01:00:00 296.5
2000-01-01 01:05:00 296.5
2000-01-01 01:10:00 296.5
2000-01-01 01:15:00 296.75
2000-01-01 01:20:00 296.75
2000-01-01 01:50:00 297.0
2000-01-01 01:55:00 297.0
2000-01-01 02:00:00 297.0
2000-01-01 02:05:00 297.0
2000-01-01 02:10:00 297.0
2000-01-01 02:15:00 297.0
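For reference, the frame can be rebuilt like this (a sketch; I'm assuming dt holds real datetimes rather than strings, and note the gap between 01:20 and 01:50):

import pandas as pd

dt = (pd.date_range('2000-01-01 00:00', periods=17, freq='5min')
        .append(pd.date_range('2000-01-01 01:50', periods=6, freq='5min')))
value = [296.0] * 3 + [296.25] * 6 + [296.5] * 6 + [296.75] * 2 + [297.0] * 6
df = pd.DataFrame({'dt': dt, 'value': value})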
I want to remove adjacent duplicates, keeping only the middle row of each run. If the number of duplicates is even, take the one just after the middle; if there are exactly 2 duplicates, take the second.
Expected output:
dt value
2000-01-01 00:05:00 296.0
2000-01-01 00:30:00 296.25
2000-01-01 01:00:00 296.5
2000-01-01 01:20:00 296.75
2000-01-01 02:05:00 297.0
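In other words, within each run of n equal values the row to keep is at zero-based position n // 2 (the middle of an odd run, the one just past the middle of an even run). A quick check of that arithmetic:

# run length n -> zero-based position of the row to keep
for n in (2, 3, 6):
    print(n, '->', n // 2)  # 2 -> 1 (second), 3 -> 1 (middle), 6 -> 3 (just past the middle)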
I read this post about dropping duplicates, but it doesn't satisfy my condition of keeping the middle element:
Pandas: Drop consecutive duplicates
Use -
df.groupby(['value'])['dt'].apply(lambda x: x.iloc[math.floor(len(x)/2)])
Output
value
296.00 2000-01-01 00:05:00
296.25 2000-01-01 00:30:00
296.50 2000-01-01 01:00:00
296.75 2000-01-01 01:20:00
297.00 2000-01-01 02:05:00
Name: dt, dtype: object
Needless to say, you have to import math
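If you need the result in the same two-column shape as the expected output, the grouped Series can be reset back into a frame (a small sketch building on the line above; out is just a throwaway name):

import math

out = df.groupby(['value'])['dt'].apply(lambda x: x.iloc[math.floor(len(x) / 2)])
out = out.reset_index()[['dt', 'value']]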
Timings
@ALollz's answer: 3.82 ms ± 442 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
This answer: 2.09 ms ± 129 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
To avoid the slower groupby + apply(lambda ...), we can instead use the built-in methods implemented in Cython. Form groups of consecutive values by checking against the shifted Series, then aggregate those groups with size and cumcount to subset the original DataFrame. (Bonus: this preserves the original DataFrame index.)
# run identifier: increments every time 'value' changes from the previous row
gp = df['value'].ne(df['value'].shift()).cumsum()
gp = gp.groupby(gp)
# keep the row whose position within its run equals half the run length
df[gp.transform('size')//2 == gp.cumcount()]
dt value
1 2000-01-01 00:05:00 296.00
6 2000-01-01 00:30:00 296.25
12 2000-01-01 01:00:00 296.50
16 2000-01-01 01:20:00 296.75
20 2000-01-01 02:05:00 297.00
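For intuition, gp is just a run identifier that goes up by one every time value changes, so each block of consecutive duplicates shares a single label. A quick way to inspect it, assuming df is the question's frame (run_id is only an illustrative column name):

gp = df['value'].ne(df['value'].shift()).cumsum()
print(df.assign(run_id=gp).head(10))  # the three 296.0 rows share run_id 1, the six 296.25 rows share run_id 2, ...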
To illustrate the relative performance, here's an example. I create a test DataFrame where each value is repeated twice (just for simplicity), and we can see how the two methods scale by timing them as a function of the number of groups (which is the important quantity for the scaling of groupby + apply).
import perfplot
import pandas as pd
import numpy as np

def groupby_apply(df):
    gp = df['value'].ne(df['value'].shift()).cumsum()
    return df.groupby(gp).apply(lambda x: x.iloc[len(x) // 2])

def groupby_builtin(df):
    gp = df['value'].ne(df['value'].shift()).cumsum()
    gp = gp.groupby(gp)
    return df[gp.transform('size')//2 == gp.cumcount()]

perfplot.show(
    setup=lambda N: pd.DataFrame({'dt': range(N),
                                  'value': np.repeat(np.arange(N//2), 2)}),
    kernels=[
        lambda df: groupby_apply(df),
        lambda df: groupby_builtin(df)],
    labels=['apply', 'built-in'],
    n_range=[2 ** k for k in range(1, 20)],
    equality_check=np.allclose,
    xlabel='Proxy for # of Groups'
)