Replacing Values in a Column on the Basis of n Consecutive Entries

Tags:

python

pandas

I have a column of data with values of 0, 1, and 2. I would like to replace all instances of 1 and 2 with 0 if those occurrences do not persist for n rows; if the values do persist for n rows, I would like to leave them as-is. Example, where n=4:

import pandas as pd

df = pd.DataFrame({'data':    [1,0,1,2,0,0,0,1,0,2,2,2,2,2,2,0,1,0,1,0,2],
                   'desired': [0,0,0,0,0,0,0,0,0,2,2,2,2,2,2,0,0,0,0,0,0]})

    data  desired
0      1        0
1      0        0
2      1        0
3      2        0
4      0        0
5      0        0
6      0        0
7      1        0
8      0        0
9      2        2
10     2        2
11     2        2
12     2        2
13     2        2
14     2        2
15     0        0
16     1        0
17     0        0
18     1        0
19     0        0
20     2        0

The intended functionality is essentially to "filter" the underlying data for brief changes in value. I'd like to be able to declare the number of consecutive rows, n, required to keep the underlying data, and to keep every row of a qualifying run (if n = 4 and there are 6 consecutive values, I want all 6, not just the 2 that persisted beyond the cutoff of 4). Is there a vectorized way of doing this in pandas?
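For reference, here is the intended behavior spelled out as a plain loop. This is a sketch of the spec only, not the vectorized solution being asked for; filter_short_runs is a hypothetical name:

import pandas as pd

def filter_short_runs(s: pd.Series, n: int) -> pd.Series:
    # Zero out every run of identical consecutive values shorter than n rows.
    out = s.copy()
    start = 0                            # position where the current run began
    for i in range(1, len(s) + 1):
        # a run ends at the end of the series or when the value changes
        if i == len(s) or s.iloc[i] != s.iloc[start]:
            if i - start < n:            # run too short: replace it with 0
                out.iloc[start:i] = 0
            start = i
    return out

# filter_short_runs(df['data'], 4) reproduces the 'desired' column above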

Asked by MattM on Dec 18 '22


1 Answer

Label each run of consecutive values by taking the cumulative sum of the non-zero differences, then compute the size of each group and keep only values whose run spans at least n rows:

n = 4
# label runs: the label increments each time the value changes
groups = df['data'].diff().ne(0).cumsum()
# keep a value only if its run spans at least n rows
df['desired'] = df['data'].where(df.groupby(groups)['data'].transform('size').ge(n), other=0)

    data  desired
0      1        0
1      0        0
2      1        0
3      2        0
4      0        0
5      0        0
6      0        0
7      1        0
8      0        0
9      2        2
10     2        2
11     2        2
12     2        2
13     2        2
14     2        2
15     0        0
16     1        0
17     0        0
18     1        0
19     0        0
20     2        0
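To see what the grouping does, here are the intermediate run labels and run lengths for the example (an inspection sketch; the rename calls are only for display):

groups = df['data'].diff().ne(0).cumsum()              # run label per row
sizes = df.groupby(groups)['data'].transform('size')   # length of the run each row belongs to
print(pd.concat([df['data'], groups.rename('group'), sizes.rename('size')], axis=1))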

Timings:

import numpy as np

# create a sample dataframe of ~1 million rows by tiling the example
df = pd.DataFrame({'data': [1,0,1,2,0,0,0,1,0,2,2,2,2,2,2,0,1,0,1,0,2]})
dfbig = pd.concat([df]*50000, ignore_index=True)
dfbig.shape

(1050000, 1)

Erfan

%%timeit
n = 4

groups = dfbig['data'].diff().ne(0).cumsum()
dfbig['data'].where(dfbig.groupby(groups)['data'].transform('size').ge(n), other=0)

268 ms ± 15.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Quang Hoang

%%timeit
n=4
s = dfbig['data'].diff().eq(0).rolling(n-1).sum()   # consecutive repeats over the last n-1 rows

# keep rows where a run has reached length n, back-filling the mark to the run's start
np.where(s.where(s >= n-1).bfill(limit=n-1).notnull(), dfbig['data'], 0)

164 ms ± 3.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
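To unpack the rolling/bfill trick on the small 21-row example (an inspection sketch; the intermediate names are mine):

n = 4
eq_prev = df['data'].diff().eq(0)            # True where a row repeats the previous value
s = eq_prev.rolling(n - 1).sum()             # repeats observed over the last n-1 rows
marked = s.where(s >= n - 1)                 # non-null only where a run has reached length n
keep = marked.bfill(limit=n - 1).notnull()   # back-fill the mark to the start of the run
print(df.assign(s=s, keep=keep))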

piRSquared

%%timeit
a = dfbig.data.to_numpy()
b = np.append(0, a[:-1] != a[1:]).cumsum()   # run label for every row

# np.bincount(b)[b] gives the length of each row's run
dfbig.data.where(np.bincount(b)[b] >= 4, 0)

62 ms ± 735 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
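The np.bincount(b)[b] idiom maps every row to the length of its own run; a minimal demonstration on a toy array (the values are arbitrary):

import numpy as np

a = np.array([5, 5, 7, 7, 7, 5])
b = np.append(0, a[:-1] != a[1:]).cumsum()   # run labels:          [0 0 1 1 1 2]
print(np.bincount(b))                        # run lengths:         [2 3 1]
print(np.bincount(b)[b])                     # each row's run size: [2 2 3 3 3 1]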

Conclusion (fastest to slowest):

  1. piRSquared
  2. Quang Hoang
  3. Erfan
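
As a sanity check, all three approaches reproduce the desired column on the original example (a sketch; >= n treats a run of exactly n rows as persisting):

import numpy as np
import pandas as pd

df = pd.DataFrame({'data':    [1,0,1,2,0,0,0,1,0,2,2,2,2,2,2,0,1,0,1,0,2],
                   'desired': [0,0,0,0,0,0,0,0,0,2,2,2,2,2,2,0,0,0,0,0,0]})
n = 4

groups = df['data'].diff().ne(0).cumsum()
out1 = df['data'].where(df.groupby(groups)['data'].transform('size').ge(n), 0)

s = df['data'].diff().eq(0).rolling(n - 1).sum()
out2 = np.where(s.where(s >= n - 1).bfill(limit=n - 1).notnull(), df['data'], 0)

a = df['data'].to_numpy()
b = np.append(0, a[:-1] != a[1:]).cumsum()
out3 = df['data'].where(np.bincount(b)[b] >= n, 0)

assert (out1 == df['desired']).all()
assert (out2 == df['desired'].to_numpy()).all()
assert (out3 == df['desired']).all()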
Answered by Erfan on Dec 21 '22