I have 2 dataframes in python that I would like to query for data.
DF1: 4M records x 3 columns. The query function seams more efficient than the loc function.
DF2: 2K records x 6 columns. The loc function seams much more efficient than the query function.
Both queries return a single record. The simulation was done by running the same operation in a loop 10K times.
Running python 2.7 and pandas 0.16.0
Any recommendations to improve the query speed?
For improve performance is possible use numexpr
:
import numexpr
np.random.seed(125)
N = 40000000
df = pd.DataFrame({'A':np.random.randint(10, size=N)})
def ne(df):
x = df.A.values
return df[numexpr.evaluate('(x > 5)')]
print (ne(df))
In [138]: %timeit (ne(df))
1 loop, best of 3: 494 ms per loop
In [139]: %timeit df[df.A > 5]
1 loop, best of 3: 536 ms per loop
In [140]: %timeit df.query('A > 5')
1 loop, best of 3: 781 ms per loop
In [141]: %timeit df[df.eval('A > 5')]
1 loop, best of 3: 770 ms per loop
import numexpr
np.random.seed(125)
def ne(x):
x = x.A.values
return x[numexpr.evaluate('(x > 5)')]
def be(x):
return x[x.A > 5]
def q(x):
return x.query('A > 5')
def ev(x):
return x[x.eval('A > 5')]
def make_df(n):
df = pd.DataFrame(np.random.randint(10, size=n), columns=['A'])
return df
perfplot.show(
setup=make_df,
kernels=[ne, be, q, ev],
n_range=[2**k for k in range(2, 25)],
logx=True,
logy=True,
equality_check=False,
xlabel='len(df)')
Edit:
Graph with modified ne
and improved be
:
def ne(x):
return x[numexpr.evaluate('(x > 5)')]
def bex(x):
return x[x.A.values > 5]
I've added the .loc function and used the same code @jezrael used to test the performance in python 3.9.5
and pandas 1.2.5
.
import numpy as np
import pandas as pd
import numexpr
import perfplot
np.random.seed(125)
def ne(x):
return x[numexpr.evaluate('(x > 5)')]
def bex(x):
return x[x.A.values > 5]
def be(x):
return x[x.A > 5]
def lc(x):
return x.loc[x.A > 5]
def lcx(x):
return x.loc[x.A.values > 5]
def q(x):
return x.query('A > 5')
def ev(x):
return x[x.eval('A > 5')]
def make_df(n):
df = pd.DataFrame(np.random.randint(10, size=n), columns=['A'])
return df
perfplot.show(
setup=make_df,
kernels=[ne, lc, lcx, be, bex, q, ev],
n_range=[2**k for k in range(2, 25)],
logx=True,
logy=True,
equality_check=False,
xlabel='len(df)')
The using the .loc
function does not afect the result of the be
funcion and until 10^6 lines, using te column .values
seens to be fastest option. The query
method was the slowest.
here just the loc and be, to see the overlap.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With