I have a column of data with values of 0, 1, and 2. I would like to replace all instances of 1 and 2 with 0 if those occurrences do not persist for n rows. If the values do persist for n rows, I would like to leave them as-is. Example, where n=4:
import pandas as pd

df = pd.DataFrame({'data':    [1,0,1,2,0,0,0,1,0,2,2,2,2,2,2,0,1,0,1,0,2],
                   'desired': [0,0,0,0,0,0,0,0,0,2,2,2,2,2,2,0,0,0,0,0,0]})
data desired
0 1 0
1 0 0
2 1 0
3 2 0
4 0 0
5 0 0
6 0 0
7 1 0
8 0 0
9 2 2
10 2 2
11 2 2
12 2 2
13 2 2
14 2 2
15 0 0
16 1 0
17 0 0
18 1 0
19 0 0
20 2 0
The intended functionality is essentially to "filter" the underlying data for brief changes in values. I'd like to be able to declare the number of consecutive values required to keep the underlying data, and keep the entire run once that threshold is met. (If n = 4 and there are 6 consecutive values, I want all 6, not just the 2 that extend beyond the cutoff of 4.) Is there a vectorized way of doing this in Pandas?
Use the cumsum of the difference to label each run of consecutive values, then get the size of each group:
n = 4
# a new group starts wherever the value changes from the previous row
groups = df['data'].diff().ne(0).cumsum()
# keep rows whose run is longer than n rows; use .ge(n) if runs of exactly n rows should also be kept
df['desired'] = df['data'].where(df.groupby(groups)['data'].transform('size').gt(n), other=0)
data desired
0 1 0
1 0 0
2 1 0
3 2 0
4 0 0
5 0 0
6 0 0
7 1 0
8 0 0
9 2 2
10 2 2
11 2 2
12 2 2
13 2 2
14 2 2
15 0 0
16 1 0
17 0 0
18 1 0
19 0 0
20 2 0
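For reuse, the same idea fits in a small helper. This is just a sketch (the name filter_short_runs is illustrative); it uses .ge(n) so that a run of exactly n rows is also kept, per the question's "persist for n rows" wording:
import pandas as pd

def filter_short_runs(s, n, fill=0):
    # label each run of identical consecutive values
    groups = s.diff().ne(0).cumsum()
    # length of the run each row belongs to
    run_lengths = s.groupby(groups).transform('size')
    # keep rows in runs of at least n rows, replace the rest with `fill`
    return s.where(run_lengths.ge(n), other=fill)

df = pd.DataFrame({'data': [1,0,1,2,0,0,0,1,0,2,2,2,2,2,2,0,1,0,1,0,2]})
df['desired'] = filter_short_runs(df['data'], n=4)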
Timings:
# create a sample dataframe of roughly 1 million rows
import numpy as np
import pandas as pd

df = pd.DataFrame({'data': [1,0,1,2,0,0,0,1,0,2,2,2,2,2,2,0,1,0,1,0,2]})
dfbig = pd.concat([df]*50000, ignore_index=True)
dfbig.shape
(1050000, 1)
Erfan
%%timeit
n = 4
groups = dfbig['data'].diff().ne(0).cumsum()
dfbig['data'].where(dfbig.groupby(groups)['data'].transform('size').gt(n), other=0)
268 ms ± 15.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Quang Hoang
%%timeit
n=4
# flag rows equal to the previous row, then sum those flags over a window of n-1 rows
s = dfbig['data'].diff().eq(0).rolling(n-1).sum()
# a window sum of n-1 marks the tail of a run of length >= n; back-fill up to n-1 rows
# so the whole run is kept, and everything else becomes 0
np.where(s.where(s>=n-1).bfill(limit=n-1).notnull(), dfbig['data'], 0)
164 ms ± 3.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
piRSquared
%%timeit
a = dfbig.data.to_numpy()
# label each run of identical consecutive values, NumPy-style
b = np.append(0, a[:-1] != a[1:]).cumsum()
# np.bincount(b)[b] is the length of the run each row belongs to
dfbig.data.where(np.bincount(b)[b] >= 4, 0)
62 ms ± 735 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
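The np.bincount(b)[b] step is what makes the NumPy version fast: bincount counts how many rows carry each run label in a single pass, and indexing the counts with b broadcasts each run's length back onto its rows. A minimal standalone sketch on the question's sample data:
import numpy as np

a = np.array([1,0,1,2,0,0,0,1,0,2,2,2,2,2,2,0,1,0,1,0,2])
b = np.append(0, a[:-1] != a[1:]).cumsum()   # run label for each row: 0, 1, 2, ...
run_lengths = np.bincount(b)[b]              # length of the run each row belongs to
filtered = np.where(run_lengths >= 4, a, 0)  # matches the 'desired' column above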
Conclusion: the plain-NumPy solution (piRSquared) is the fastest of the three here, roughly 4x faster than the groupby version, with the rolling-sum approach in between.