Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

pandas dataframe: loc vs query performance

I have 2 dataframes in python that I would like to query for data.

  • DF1: 4M records x 3 columns. The query function seams more efficient than the loc function.

  • DF2: 2K records x 6 columns. The loc function seams much more efficient than the query function.

Both queries return a single record. The simulation was done by running the same operation in a loop 10K times.

Running python 2.7 and pandas 0.16.0

Any recommendations to improve the query speed?

like image 643
Syntax_Error Avatar asked Apr 20 '18 07:04

Syntax_Error


2 Answers

For improve performance is possible use numexpr:

import numexpr

np.random.seed(125)
N = 40000000
df = pd.DataFrame({'A':np.random.randint(10, size=N)})

def ne(df):
    x = df.A.values
    return df[numexpr.evaluate('(x > 5)')]
print (ne(df))

In [138]: %timeit (ne(df))
1 loop, best of 3: 494 ms per loop

In [139]: %timeit df[df.A > 5]
1 loop, best of 3: 536 ms per loop

In [140]: %timeit df.query('A > 5')
1 loop, best of 3: 781 ms per loop

In [141]: %timeit df[df.eval('A > 5')]
1 loop, best of 3: 770 ms per loop

import numexpr
np.random.seed(125)

def ne(x):
    x = x.A.values
    return x[numexpr.evaluate('(x > 5)')]

def be(x):
    return x[x.A > 5]

def q(x):
    return x.query('A > 5')

def ev(x):
    return x[x.eval('A > 5')]


def make_df(n):
    df = pd.DataFrame(np.random.randint(10, size=n), columns=['A'])
    return df


perfplot.show(
    setup=make_df,
    kernels=[ne, be, q, ev],
    n_range=[2**k for k in range(2, 25)],
    logx=True,
    logy=True,
    equality_check=False,  
    xlabel='len(df)')

graph

Edit:

Graph with modified ne and improved be:

def ne(x):
    return x[numexpr.evaluate('(x > 5)')]

def bex(x):
    return x[x.A.values > 5]

enter image description here

like image 110
jezrael Avatar answered Oct 26 '22 14:10

jezrael


I've added the .loc function and used the same code @jezrael used to test the performance in python 3.9.5 and pandas 1.2.5.

import numpy as np
import pandas as pd
import numexpr
import perfplot
np.random.seed(125)

def ne(x):
    return x[numexpr.evaluate('(x > 5)')]

def bex(x):
    return x[x.A.values > 5]

def be(x):
    return x[x.A > 5]

def lc(x):
    return x.loc[x.A > 5]

def lcx(x):
    return x.loc[x.A.values > 5]

def q(x):
    return x.query('A > 5')

def ev(x):
    return x[x.eval('A > 5')]

def make_df(n):
    df = pd.DataFrame(np.random.randint(10, size=n), columns=['A'])
    return df


perfplot.show(
    setup=make_df,
    kernels=[ne, lc, lcx, be, bex, q, ev],
    n_range=[2**k for k in range(2, 25)],
    logx=True,
    logy=True,
    equality_check=False,  
    xlabel='len(df)')

The using the .loc function does not afect the result of the be funcion and until 10^6 lines, using te column .values seens to be fastest option. The query method was the slowest.

top5

here just the loc and be, to see the overlap. loc vs. be

like image 3
Rafael Braga Avatar answered Oct 26 '22 12:10

Rafael Braga