optimize pandas query on multiple columns / multiindex

I have a very large table (currently 55 million rows, could be more), and I need to select subsets of it and perform very simple operations on those subsets, lots and lots of times. It seemed like pandas might be the best way to do this in python, but I'm running into optimization problems.

I've tried to create a fake dataset that closely matches my real dataset (although it's ~5-10 times smaller). This is still big, takes a lot of memory, etc. There are four columns that I'm querying on, and two that I'm using for calculations.

import pandas
import numpy as np
import timeit

# fake dataset: four integer columns to query on (A-D)
# and two columns (x, y) that are only used for calculations
n = 10000000
mdt = pandas.DataFrame()
mdt['A'] = np.random.choice(range(10000, 45000, 1000), n)
mdt['B'] = np.random.choice(range(10, 400), n)
mdt['C'] = np.random.choice(range(1, 150), n)
mdt['D'] = np.random.choice(range(10000, 45000), n)
mdt['x'] = np.random.choice(range(400), n)
mdt['y'] = np.random.choice(range(25), n)


test_A = 25000
test_B = 25
test_C = 40
test_D = 35000

eps_A = 5000
eps_B = 5
eps_C = 5
eps_D = 5000


f1 = lambda : mdt.query('@test_A-@eps_A <= A <= @test_A+@eps_A  &  ' +
                        '@test_B-@eps_B <= B <= @test_B+@eps_B  &  ' +
                        '@test_C-@eps_C <= C <= @test_C+@eps_C  &  ' +
                        '@test_D-@eps_D <= D <= @test_D+@eps_D')

This selects (for my random dataset) 1848 rows:

len(f1())
Out[289]: 1848

And it takes about .1-.15 seconds per query:

timeit.timeit(f1,number=10)/10
Out[290]: 0.10734589099884033

So I thought I must be able to do better by sorting and indexing the table, right? And I can take advantage of the fact that everything is an int, so I can do slices...

mdt2 = mdt.set_index(['A', 'B', 'C', 'D']).sortlevel()

f2 = lambda : mdt2.loc[(slice(test_A-eps_A, test_A+eps_A),
                        slice(test_B-eps_B, test_B+eps_B),
                        slice(test_C-eps_C, test_C+eps_C),
                        slice(test_D-eps_D, test_D+eps_D)), :]

len(f2())
Out[299]: 1848

And it takes a lot longer:

timeit.timeit(f2,number=10)/10
Out[295]: 7.335134506225586

Am I missing something here? It seems like I could do something like numpy.searchsorted, but I can't think of how to do that on multiple columns. Is pandas the wrong choice?
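
For example, on a single column a searchsorted-based lookup might look roughly like the sketch below (sort once on A, binary-search that column for the window, then fall back to plain boolean masks for B, C and D), but that still doesn't extend the binary search itself to the other columns:

mdt_sorted = mdt.sort_values('A')
a_vals = mdt_sorted['A'].values

def f3():
    # binary search the sorted 'A' column for the window boundaries
    lo = np.searchsorted(a_vals, test_A - eps_A, side='left')
    hi = np.searchsorted(a_vals, test_A + eps_A, side='right')
    chunk = mdt_sorted.iloc[lo:hi]
    # the remaining three columns still need ordinary boolean masks
    mask = (chunk['B'].between(test_B - eps_B, test_B + eps_B) &
            chunk['C'].between(test_C - eps_C, test_C + eps_C) &
            chunk['D'].between(test_D - eps_D, test_D + eps_D))
    return chunk[mask]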

1 Answer

So there are 2 issues here.

First, a convenience that makes the slicing syntax a little nicer:

In [111]: idx = pd.IndexSlice
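
(Side note, not part of the original answer: pd.IndexSlice is only syntactic sugar. Indexing it returns the same tuple of slice objects that the question built by hand, so the two spellings are interchangeable inside .loc.)

import pandas as pd

idx = pd.IndexSlice

# idx[a:b, c:d] is just a tuple of slices
print(idx[1:5, 10:20])                                   # (slice(1, 5, None), slice(10, 20, None))
print(idx[1:5, 10:20] == (slice(1, 5), slice(10, 20)))   # True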

1) Your .query does not have the correct precedence. The & operator has a higher precedence than comparison operators like <= and needs parentheses around its left and right operands.

In [102]: result3 = mdt.query("(@test_A-@eps_A <= A <= @test_A+@eps_A) & (@test_B-@eps_B <= B <= @test_B+@eps_B) & (@test_C-@eps_C <= C <= @test_C+@eps_C) & (@test_D-@eps_D <= D <= @test_D+@eps_D)").set_index(['A','B','C','D']).sortlevel()

This is your original query using MultiIndex slicers

In [103]: result1 = mdt2.loc[idx[test_A-eps_A:test_A+eps_A,test_B-eps_B:test_B+eps_B,test_C-eps_C:test_C+eps_C,test_D-eps_D:test_D+eps_D],:]

Here is a chained version of this query. In other words, it's a repeated selection on the result set:

In [104]: result2 = mdt2.loc[idx[test_A-eps_A:test_A+eps_A],:].loc[idx[:,test_B-eps_B:test_B+eps_B],:].loc[idx[:,:,test_C-eps_C:test_C+eps_C],:].loc[idx[:,:,:,test_D-eps_D:test_D+eps_D],:]

Always confirm correctness before working on performance

In [109]: (result1==result2).all().all()
Out[109]: True

In [110]: (result1==result3).all().all()
Out[110]: True
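
(DataFrame.equals, not used in the original answer, makes the same check a bit more robustly, since it also treats NaNs in matching positions as equal.)

result1.equals(result2)   # should also be True
result1.equals(result3)   # should also be True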

Performance

The .query approach will IMHO actually scale very well and can use multiple cores (it is evaluated with numexpr when that is available). For a large selection set this will be the way to go:

In [107]: %timeit mdt.query("(@test_A-@eps_A <= A <= @test_A+@eps_A) & (@test_B-@eps_B <= B <= @test_B+@eps_B) & (@test_C-@eps_C <= C <= @test_C+@eps_C) & (@test_D-@eps_D <= D <= @test_D+@eps_D)").set_index(['A','B','C','D']).sortlevel()
10 loops, best of 3: 107 ms per loop

2) The original multi-index slicing. There is an issue here, see below. I am not sure exactly why this is non-performant and will investigate it.

In [106]: %timeit  mdt2.loc[idx[test_A-eps_A:test_A+eps_A,test_B-eps_B:test_B+eps_B,test_C-eps_C:test_C+eps_C,test_D-eps_D:test_D+eps_D],:]
1 loops, best of 3: 4.34 s per loop

Repeated selections make this quite performant. Note that I wouldn't normally recommend doing this, as you cannot assign to the result (it is chained indexing), but for this purpose it is OK:

In [105]: %timeit mdt2.loc[idx[test_A-eps_A:test_A+eps_A],:].loc[idx[:,test_B-eps_B:test_B+eps_B],:].loc[idx[:,:,test_C-eps_C:test_C+eps_C],:].loc[idx[:,:,:,test_D-eps_D:test_D+eps_D],:]
10 loops, best of 3: 140 ms per loop
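
If you need this pattern for more than a handful of levels, the chained selection can be wrapped in a small helper. The sketch below is my own illustration rather than part of the original answer; select_by_levels and bounds are made-up names, and it assumes the frame is indexed and sorted exactly like mdt2 above:

def select_by_levels(df, level_bounds):
    """Apply one .loc slice per MultiIndex level, narrowing the frame each time.

    level_bounds is a list of (low, high) tuples, one per level, in index order.
    (My own helper; not from the original answer.)
    """
    out = df
    for level, (low, high) in enumerate(level_bounds):
        # build a slicer equivalent to idx[:, ..., low:high] for this level
        key = tuple([slice(None)] * level + [slice(low, high)])
        out = out.loc[key, :]
    return out

bounds = [(test_A - eps_A, test_A + eps_A),
          (test_B - eps_B, test_B + eps_B),
          (test_C - eps_C, test_C + eps_C),
          (test_D - eps_D, test_D + eps_D)]

result4 = select_by_levels(mdt2, bounds)   # same rows as result2 above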