I have a big numeric Pandas dataframe df, and I want to select the rows whose value in a certain column lies within the range from min_value to max_value.
I can do this with:
filtered_df = df[(df[col_name].values >= min_value) & (df[col_name].values <= max_value)]
I am looking for ways to speed this up. I tried the following:
df.sort(col_name, inplace=True)
left_idx = np.searchsorted(df[col_name].values, min_value, side='left')
right_idx = np.searchsorted(df[col_name].values, max_value, side='right')
filtered_df = df[left_idx:right_idx]
But this does not help, because df.sort() itself takes more time than the lookup saves.
So, any tips to speed up the selection?
(Pandas 0.11)
I think your best bet is to use numexpr to speed this up:
import pandas as pd
import numpy as np
import numexpr as ne
data = np.random.normal(size=100000000)
df = pd.DataFrame(data=data, columns=['col'])
a = df['col']
min_val = a.min()
max_val = a.max()
expr = '(a >= min_val) & (a <= max_val)'
And the timings ...
%timeit eval(expr)
1 loops, best of 3: 668 ms per loop
%timeit ne.evaluate(expr)
1 loops, best of 3: 197 ms per loop
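The timings above only build the boolean mask. To get the filtered rows from the question, the mask still has to be applied to the dataframe. Here is a minimal sketch of how that might look; the helper name filter_range and the small random test frame are just for illustration, and I pass the variables through local_dict explicitly rather than relying on numexpr picking them up from the calling frame.

```python
import numpy as np
import pandas as pd
import numexpr as ne


def filter_range(df, col_name, min_value, max_value):
    """Return the rows of df whose col_name value lies in [min_value, max_value].

    Hypothetical helper for illustration; not part of pandas or numexpr.
    """
    # Work on the underlying NumPy array so numexpr gets a plain ndarray.
    a = df[col_name].values
    # Build the boolean mask with numexpr, naming the variables explicitly.
    mask = ne.evaluate(
        '(a >= lo) & (a <= hi)',
        local_dict={'a': a, 'lo': min_value, 'hi': max_value},
    )
    # Boolean indexing with the mask keeps only the matching rows.
    return df[mask]


# Small example frame (assumed data, much smaller than the benchmark above).
rng = np.random.RandomState(0)
df = pd.DataFrame({'col': rng.normal(size=1000)})
filtered_df = filter_range(df, 'col', -0.5, 0.5)
```

Note that the final df[mask] step copies the selected rows, so on a very large frame the total time will be higher than the mask timing alone.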