Say, i have the following dataframe:
import pandas as pd
dict = {'val':[3.2, 2.4, -2.3, -4.9, 3.2, 2.4, -2.3, -4.9, 2.4, -2.3, -4.9],
'label': [0, 2, 1, -1, 1, 2, -1, -1,1, 1, -1]}
df = pd.DataFrame(dict)
df
val label
0 3.2 0
1 2.4 2
2 -2.3 1
3 -4.9 -1
4 3.2 1
5 2.4 2
6 -2.3 -1
7 -4.9 -1
8 2.4 1
9 -2.3 1
10 -4.9 -1
I want to take each n (for example 2) rows before -1 value in column label. In the given df first -1 appears at index 3, we take 2 rows before it and drop index 3, then next -1 appears at index 6, we again keep 2 rows before and etc. The desired output is as following:
val label
1 2.4 2
2 -2.3 1
4 3.2 1
5 2.4 2
6 -2.3 -1
8 2.4 1
9 -2.3 1
Thanks for any ideas!
You can get the index
values and then get the previous two row index values:
idx = df[df.label == -1].index
filtered_idx = (idx-1).union(idx-2)
filtered_idx = filtered_idx[filtered_idx > 0]
df_new = df.iloc[filtered_idx]
output:
val label
1 2.4 2
2 -2.3 1
4 3.2 1
5 2.4 2
6 -2.3 -1
8 2.4 1
9 -2.3 1
Speed comparison with for a for loop
solution:
# create large df:
import numpy as np
df = pd.DataFrame(np.random.random((20000000,2)), columns=["val","label"])
df.loc[df.sample(frac=0.01).index, "label"] = - 1
def vectorized_filter(df):
idx = df[df.label == -1].index
filtered_idx = (idx -1).union(idx-2)
df_new = df.iloc[filtered_idx]
return df_new
def loop_filter(df):
filter = df.loc[df['label'] == -1].index
req_idx = []
for idx in filter:
if idx == 0:
continue
elif idx == 1:
req_idx.append(idx-1)
else:
req_idx.append(idx-2)
req_idx.append(idx-1)
req_idx = list(set(req_idx))
df2 = df.loc[df.index.isin(req_idx)]
return df2
%timeit vectorized_filter(df)
%timeit loop_filter(df)
vectorized runs ~20x faster on my machine
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With