Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

slice pandas df based on n consecutive instances of element

I would like to slice a dataframe to return rows where element x=0 appears consecutively at least n=3 times, and then dropping the first i=2 instances in each mini-sequence

is there an efficient way of achieving in pandas, and if not, using numpy or scipy?

import pandas as pd
import numpy as np

Example 1

df=pd.DataFrame({'A':[0,1,0,0,1,1,0,0,0,0,1,1,0,0,0,1,1],'B':np.random.randn(17)})

    A         B
0   0  0.748958
1   1  0.254730
2   0  0.629609
3   0  0.272738
4   1 -1.885906
5   1  1.206371
6   0 -0.332471
7   0  0.217553
8   0  0.768986
9   0 -1.607236
10  1  1.613650
11  1 -1.096892
12  0 -0.435762
13  0  0.131284
14  0 -0.177188
15  1  1.393890
16  1  0.174803

Desired output:

    A         B
8   0  0.768986
9   0 -1.607236
14  0 -0.177188

Example 2

x=0 (element of interest)

n=5 (min length of sequence)

i=2 (drop first two in each sequence)

df2=pd.DataFrame({'A':[0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,1,0,0,0,0],'B':np.random.randn(20)})

    A         B
0   0  0.703803
1   0 -0.144088
2   0  0.635577
3   0 -0.834611
4   0  1.472271
5   0 -0.554860
6   0 -0.167016
7   1  0.578847
8   1 -1.873663
9   0  0.197062
10  0  1.458845
11  0 -1.921660
12  0 -1.301481
13  0  0.240197
14  0 -1.425058
15  1 -2.801151
16  0  0.766757
17  0  1.249806
18  0  0.595366
19  0 -1.447632

Desired output:

    A         B
2   0  0.635577
3   0 -0.834611
4   0  1.472271
5   0 -0.554860
6   0 -0.167016
11  0 -1.921660
12  0 -1.301481
13  0  0.240197
14  0 -1.425058
like image 831
themachinist Avatar asked Oct 11 '15 10:10

themachinist


1 Answers

Here's an approach using some NumPy manipulations -

def slice_consc(df,n):

    Acol = np.array(df['A'])
    Acol_ext = np.concatenate(([0],(Acol==0)+0,[0]))

    starts = np.where(np.diff(Acol_ext)==1)[0]
    stops = np.where(np.diff(Acol_ext)==-1)[0]

    id = np.zeros(Acol.size+2,dtype=int)

    valid_mask = stops - starts >= n
    id[stops[valid_mask]] = -1
    id[starts[valid_mask]+2] = 1

    return df[(id.cumsum()==1)[:-2]]

Sample runs -

Case #1:

>>> df
    A         B
0   0  0.977325
1   1 -0.408457
2   0 -0.377973
3   0  0.567537
4   1 -0.222019
5   1 -1.167422
6   0 -0.142546
7   0  0.675458
8   0 -0.184456
9   0 -0.826050
10  1 -0.772413
11  1 -1.556440
12  0 -0.687249
13  0 -0.481676
14  0  0.420400
15  1  0.031999
16  1 -1.092540
>>> slice_consc(df,3)
    A         B
8   0 -0.184456
9   0 -0.826050
14  0  0.420400

Case #2:

>>> df2
    A         B
0   0  0.757102
1   0  2.114935
2   0 -0.352309
3   0 -0.214931
4   0 -1.626064
5   0 -0.989776
6   0  0.639635
7   1  0.049358
8   1 -2.600326
9   0  0.057792
10  0  1.263418
11  0  0.618495
12  0 -1.637054
13  0  1.220862
14  0  1.245484
15  1  1.388218
16  0 -0.499900
17  0  0.761310
18  0 -1.308176
19  0 -2.005983
>>> slice_consc(df2,5)
    A         B
2   0 -0.352309
3   0 -0.214931
4   0 -1.626064
5   0 -0.989776
6   0  0.639635
11  0  0.618495
12  0 -1.637054
13  0  1.220862
14  0  1.245484
like image 133
Divakar Avatar answered Nov 04 '22 00:11

Divakar