I would like to slice a dataframe to return rows where element x=0 appears consecutively at least n=3 times, and then dropping the first i=2 instances in each mini-sequence
is there an efficient way of achieving in pandas, and if not, using numpy or scipy?
import pandas as pd
import numpy as np
Example 1
df=pd.DataFrame({'A':[0,1,0,0,1,1,0,0,0,0,1,1,0,0,0,1,1],'B':np.random.randn(17)})
A B
0 0 0.748958
1 1 0.254730
2 0 0.629609
3 0 0.272738
4 1 -1.885906
5 1 1.206371
6 0 -0.332471
7 0 0.217553
8 0 0.768986
9 0 -1.607236
10 1 1.613650
11 1 -1.096892
12 0 -0.435762
13 0 0.131284
14 0 -0.177188
15 1 1.393890
16 1 0.174803
Desired output:
A B
8 0 0.768986
9 0 -1.607236
14 0 -0.177188
Example 2
x=0 (element of interest)
n=5 (min length of sequence)
i=2 (drop first two in each sequence)
df2=pd.DataFrame({'A':[0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,1,0,0,0,0],'B':np.random.randn(20)})
A B
0 0 0.703803
1 0 -0.144088
2 0 0.635577
3 0 -0.834611
4 0 1.472271
5 0 -0.554860
6 0 -0.167016
7 1 0.578847
8 1 -1.873663
9 0 0.197062
10 0 1.458845
11 0 -1.921660
12 0 -1.301481
13 0 0.240197
14 0 -1.425058
15 1 -2.801151
16 0 0.766757
17 0 1.249806
18 0 0.595366
19 0 -1.447632
Desired output:
A B
2 0 0.635577
3 0 -0.834611
4 0 1.472271
5 0 -0.554860
6 0 -0.167016
11 0 -1.921660
12 0 -1.301481
13 0 0.240197
14 0 -1.425058
Here's an approach using some NumPy manipulations -
def slice_consc(df,n):
Acol = np.array(df['A'])
Acol_ext = np.concatenate(([0],(Acol==0)+0,[0]))
starts = np.where(np.diff(Acol_ext)==1)[0]
stops = np.where(np.diff(Acol_ext)==-1)[0]
id = np.zeros(Acol.size+2,dtype=int)
valid_mask = stops - starts >= n
id[stops[valid_mask]] = -1
id[starts[valid_mask]+2] = 1
return df[(id.cumsum()==1)[:-2]]
Sample runs -
Case #1:
>>> df
A B
0 0 0.977325
1 1 -0.408457
2 0 -0.377973
3 0 0.567537
4 1 -0.222019
5 1 -1.167422
6 0 -0.142546
7 0 0.675458
8 0 -0.184456
9 0 -0.826050
10 1 -0.772413
11 1 -1.556440
12 0 -0.687249
13 0 -0.481676
14 0 0.420400
15 1 0.031999
16 1 -1.092540
>>> slice_consc(df,3)
A B
8 0 -0.184456
9 0 -0.826050
14 0 0.420400
Case #2:
>>> df2
A B
0 0 0.757102
1 0 2.114935
2 0 -0.352309
3 0 -0.214931
4 0 -1.626064
5 0 -0.989776
6 0 0.639635
7 1 0.049358
8 1 -2.600326
9 0 0.057792
10 0 1.263418
11 0 0.618495
12 0 -1.637054
13 0 1.220862
14 0 1.245484
15 1 1.388218
16 0 -0.499900
17 0 0.761310
18 0 -1.308176
19 0 -2.005983
>>> slice_consc(df2,5)
A B
2 0 -0.352309
3 0 -0.214931
4 0 -1.626064
5 0 -0.989776
6 0 0.639635
11 0 0.618495
12 0 -1.637054
13 0 1.220862
14 0 1.245484
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With