Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Get n rows before specific value in pandas

Tags:

Say, i have the following dataframe:

import pandas as pd
dict = {'val':[3.2, 2.4, -2.3, -4.9, 3.2, 2.4, -2.3, -4.9, 2.4, -2.3, -4.9], 
        'label': [0, 2, 1, -1, 1, 2, -1, -1,1, 1, -1]} 
df = pd.DataFrame(dict) 
df
     val    label
0    3.2     0
1    2.4     2
2   -2.3     1
3   -4.9    -1
4    3.2     1
5    2.4     2
6   -2.3    -1
7   -4.9    -1
8    2.4     1
9   -2.3     1
10  -4.9    -1

I want to take each n (for example 2) rows before -1 value in column label. In the given df first -1 appears at index 3, we take 2 rows before it and drop index 3, then next -1 appears at index 6, we again keep 2 rows before and etc. The desired output is as following:

    val     label
1    2.4     2
2   -2.3     1
4    3.2     1
5    2.4     2
6   -2.3    -1
8    2.4     1
9   -2.3     1

Thanks for any ideas!

like image 869
Sascha Avatar asked May 25 '20 09:05

Sascha


1 Answers

You can get the index values and then get the previous two row index values:

idx = df[df.label == -1].index
filtered_idx = (idx-1).union(idx-2)
filtered_idx = filtered_idx[filtered_idx > 0]

df_new = df.iloc[filtered_idx]

output:

   val  label
1  2.4      2
2 -2.3      1
4  3.2      1
5  2.4      2
6 -2.3     -1
8  2.4      1
9 -2.3      1

Speed comparison with for a for loop solution:

# create large df:
import numpy as np
df = pd.DataFrame(np.random.random((20000000,2)), columns=["val","label"])
df.loc[df.sample(frac=0.01).index, "label"] = - 1

def vectorized_filter(df):
    idx = df[df.label == -1].index
    filtered_idx = (idx -1).union(idx-2)
    df_new = df.iloc[filtered_idx]
    return df_new

def loop_filter(df):
    filter = df.loc[df['label'] == -1].index
    req_idx = []
    for idx in filter:
        if idx == 0:
            continue
        elif idx == 1:
            req_idx.append(idx-1)
        else:
            req_idx.append(idx-2)
            req_idx.append(idx-1)    
    req_idx = list(set(req_idx))
    df2 = df.loc[df.index.isin(req_idx)]
    return df2

%timeit vectorized_filter(df)
%timeit loop_filter(df)

vectorized runs ~20x faster on my machine
like image 94
anon01 Avatar answered Sep 28 '22 09:09

anon01