Replacing Values in a Column on the Basis of n Consecutive Entries

Tags:

python

pandas

I have a column of data with values of 0, 1, and 2. I would like to replace all instances of 1 and 2 with 0 if those occurrences do not persist for n rows; if the values do persist for n rows, I would like to leave them as-is. Example, where n=4:

import pandas as pd

df = pd.DataFrame({'data':    [1,0,1,2,0,0,0,1,0,2,2,2,2,2,2,0,1,0,1,0,2],
                   'desired': [0,0,0,0,0,0,0,0,0,2,2,2,2,2,2,0,0,0,0,0,0]})

    data  desired
0      1        0
1      0        0
2      1        0
3      2        0
4      0        0
5      0        0
6      0        0
7      1        0
8      0        0
9      2        2
10     2        2
11     2        2
12     2        2
13     2        2
14     2        2
15     0        0
16     1        0
17     0        0
18     1        0
19     0        0
20     2        0

The intended functionality is essentially to "filter" the underlying data for brief changes in value. I'd like to be able to declare the number of consecutive rows, n, required to keep the underlying data, and to keep every row of a qualifying run (if n = 4 and there are 6 consecutive values, I want all 6, not just the 2 that persisted beyond the cutoff of 4). Is there a vectorized way of doing this in pandas?
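For reference, here is the intended behavior spelled out as a plain loop. This is a sketch of the spec only, not the vectorized solution being asked for; filter_short_runs is a hypothetical name:

import pandas as pd

def filter_short_runs(s: pd.Series, n: int) -> pd.Series:
    # Zero out every run of identical consecutive values shorter than n rows.
    out = s.copy()
    start = 0                            # position where the current run began
    for i in range(1, len(s) + 1):
        # a run ends at the end of the series or when the value changes
        if i == len(s) or s.iloc[i] != s.iloc[start]:
            if i - start < n:            # run too short: replace it with 0
                out.iloc[start:i] = 0
            start = i
    return out

# filter_short_runs(df['data'], 4) reproduces the 'desired' column above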

Asked by MattM on Dec 18 '22


1 Answer

Label each run of consecutive values by taking the cumulative sum of the non-zero differences, then compute the size of each group and keep only values whose run spans at least n rows:

n = 4
# label runs: the label increments each time the value changes
groups = df['data'].diff().ne(0).cumsum()
# keep a value only if its run spans at least n rows
df['desired'] = df['data'].where(df.groupby(groups)['data'].transform('size').ge(n), other=0)

    data  desired
0      1        0
1      0        0
2      1        0
3      2        0
4      0        0
5      0        0
6      0        0
7      1        0
8      0        0
9      2        2
10     2        2
11     2        2
12     2        2
13     2        2
14     2        2
15     0        0
16     1        0
17     0        0
18     1        0
19     0        0
20     2        0
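To see what the grouping does, here are the intermediate run labels and run lengths for the example (an inspection sketch; the rename calls are only for display):

groups = df['data'].diff().ne(0).cumsum()              # run label per row
sizes = df.groupby(groups)['data'].transform('size')   # length of the run each row belongs to
print(pd.concat([df['data'], groups.rename('group'), sizes.rename('size')], axis=1))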

Timings:

import numpy as np

# create a sample dataframe of ~1 million rows by tiling the example
df = pd.DataFrame({'data': [1,0,1,2,0,0,0,1,0,2,2,2,2,2,2,0,1,0,1,0,2]})
dfbig = pd.concat([df]*50000, ignore_index=True)
dfbig.shape

(1050000, 1)

Erfan

%%timeit
n = 4

groups = dfbig['data'].diff().ne(0).cumsum()
dfbig['data'].where(dfbig.groupby(groups)['data'].transform('size').ge(n), other=0)

268 ms ± 15.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Quang Hoang

%%timeit
n=4
s = dfbig['data'].diff().eq(0).rolling(n-1).sum()   # consecutive repeats over the last n-1 rows

# keep rows where a run has reached length n, back-filling the mark to the run's start
np.where(s.where(s >= n-1).bfill(limit=n-1).notnull(), dfbig['data'], 0)

164 ms ± 3.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
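To unpack the rolling/bfill trick on the small 21-row example (an inspection sketch; the intermediate names are mine):

n = 4
eq_prev = df['data'].diff().eq(0)            # True where a row repeats the previous value
s = eq_prev.rolling(n - 1).sum()             # repeats observed over the last n-1 rows
marked = s.where(s >= n - 1)                 # non-null only where a run has reached length n
keep = marked.bfill(limit=n - 1).notnull()   # back-fill the mark to the start of the run
print(df.assign(s=s, keep=keep))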

piRSquared

%%timeit
a = dfbig.data.to_numpy()
b = np.append(0, a[:-1] != a[1:]).cumsum()   # run label for every row

# np.bincount(b)[b] gives the length of each row's run
dfbig.data.where(np.bincount(b)[b] >= 4, 0)

62 ms ± 735 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
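The np.bincount(b)[b] idiom maps every row to the length of its own run; a minimal demonstration on a toy array (the values are arbitrary):

import numpy as np

a = np.array([5, 5, 7, 7, 7, 5])
b = np.append(0, a[:-1] != a[1:]).cumsum()   # run labels:          [0 0 1 1 1 2]
print(np.bincount(b))                        # run lengths:         [2 3 1]
print(np.bincount(b)[b])                     # each row's run size: [2 2 3 3 3 1]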

Conclusion (fastest to slowest):

  1. piRSquared
  2. Quang Hoang
  3. Erfan
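
As a sanity check, all three approaches reproduce the desired column on the original example (a sketch; >= n treats a run of exactly n rows as persisting):

import numpy as np
import pandas as pd

df = pd.DataFrame({'data':    [1,0,1,2,0,0,0,1,0,2,2,2,2,2,2,0,1,0,1,0,2],
                   'desired': [0,0,0,0,0,0,0,0,0,2,2,2,2,2,2,0,0,0,0,0,0]})
n = 4

groups = df['data'].diff().ne(0).cumsum()
out1 = df['data'].where(df.groupby(groups)['data'].transform('size').ge(n), 0)

s = df['data'].diff().eq(0).rolling(n - 1).sum()
out2 = np.where(s.where(s >= n - 1).bfill(limit=n - 1).notnull(), df['data'], 0)

a = df['data'].to_numpy()
b = np.append(0, a[:-1] != a[1:]).cumsum()
out3 = df['data'].where(np.bincount(b)[b] >= n, 0)

assert (out1 == df['desired']).all()
assert (out2 == df['desired'].to_numpy()).all()
assert (out3 == df['desired']).all()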
Answered by Erfan on Dec 21 '22