def filter_data(df, raw_col, threshold, filt_col):
    df['pct'] = None
    df[filt_col] = None
    df[filt_col][0] = df[raw_col][0]
    max_val = df[raw_col][0]
    for i in range(1, len(df)):
        df['pct'][i] = (df[raw_col][i] - max_val) * 1.0 / max_val
        if abs(df['pct'][i]) < threshold:
            df[filt_col][i] = None
        else:
            df[filt_col][i] = df[raw_col][i]
            max_val = df[raw_col][i]
    df = df.dropna(axis=0, how='any').reset_index()
    return df
from random import randint
import pandas as pd

some_lst = [randint(50, 100) for i in range(0, 50)]
some_df = pd.DataFrame({'raw_col': some_lst})
some_df_filt = filter_data(some_df, 'raw_col', 0.01, 'raw_col_filt')
The goal is to create a new column (filt_col) in which records from a numeric column (raw_col) are removed using the following logic: if the rate of change between two adjacent rows is less than the threshold, remove the latter. It works, but it is very inefficient in terms of running time. Any hints on how I could optimise it?
IIUC, you can do this very simply using .pct_change() and .loc.
First, compute the percentage change between consecutive rows:
df['pctn'] = df.raw_col.pct_change()
Then keep only the rows whose absolute change is at least the threshold:
threshold = 0.01
df.loc[df.pctn.abs() >= threshold]
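The two steps above can be wrapped into one small function. This is only a minimal sketch: the name `filter_by_pct_change` and the demo values are made up for illustration, and it assumes the intent is to compare each row with the row directly above it.

```python
import pandas as pd

def filter_by_pct_change(df, raw_col, threshold):
    """Keep rows whose relative change vs. the previous row is >= threshold.

    Hypothetical helper, not part of pandas.
    """
    # Relative change of each row with respect to the row directly above it.
    pct = df[raw_col].pct_change()
    # The first row has no predecessor, so its change is NaN; NaN fails the
    # comparison and the first row is dropped.
    return df.loc[pct.abs() >= threshold]

# Illustrative data: repeated values change by 0% and are filtered out.
demo = pd.DataFrame({'raw_col': [100, 100, 105, 105, 90]})
result = filter_by_pct_change(demo, 'raw_col', 0.01)
print(result)
```

Unlike the loop, this leaves the input DataFrame untouched and keeps the original index labels; call `.reset_index(drop=True)` on the result if a fresh index is wanted.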
You can check that this solution yields the same result as yours, which you said works but is slow:
df.loc[df.pctn.abs() >= 0.01].raw_col.tolist() == some_df_filt.raw_col.tolist()
True
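One caveat, hedged because it depends on how the loop in the question is meant to behave: .pct_change() always compares a row with the row directly above it. If the loop is instead meant to compare each row with the last value that was kept, the two approaches can disagree on slowly drifting data, even though they agree on the random integers used here. Below is a rough sketch of that sequential variant. It iterates over a plain NumPy array rather than indexing the DataFrame element by element, which is usually much faster. The name `filter_vs_last_kept` and the demo values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def filter_vs_last_kept(df, raw_col, threshold):
    """Keep rows that differ from the last kept value by >= threshold.

    Hypothetical helper; assumes the reference value is only updated
    when a row is kept, and that reference values are non-zero.
    """
    values = df[raw_col].to_numpy(dtype=float)
    keep = np.zeros(len(values), dtype=bool)
    if len(values) == 0:
        return df
    # The first row only seeds the reference; as with pct_change, it is not kept.
    ref = values[0]
    for i in range(1, len(values)):
        if abs((values[i] - ref) / ref) >= threshold:
            keep[i] = True
            ref = values[i]  # update the reference only when a row is kept
    return df.loc[keep]

# Slow drift: every adjacent step is below 1%, but the cumulative change is not.
drift = pd.DataFrame({'raw_col': [100.0, 100.6, 101.2, 101.8]})
by_last_kept = filter_vs_last_kept(drift, 'raw_col', 0.01)
by_adjacent = drift.loc[drift['raw_col'].pct_change().abs() >= 0.01]
print(by_last_kept['raw_col'].tolist(), by_adjacent['raw_col'].tolist())
```

On this drifting series the adjacent-row filter keeps nothing, while the last-kept variant keeps the row where the accumulated change first reaches the threshold.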