I have a very large table (currently 55 million rows, could be more), and I need to select subsets of it and perform very simple operations on those subsets, lots and lots of times. It seemed like pandas might be the best way to do this in python, but I'm running into optimization problems.
I've tried to create a fake dataset that closely matches my real dataset (although it's ~5-10 times smaller). This is still big, takes a lot of memory, etc. There are four columns that I'm querying on, and two that I'm using for calculations.
import pandas
import numpy as np
import timeit
n=10000000
mdt = pandas.DataFrame()
mdt['A'] = np.random.choice(range(10000,45000,1000), n)
mdt['B'] = np.random.choice(range(10,400), n)
mdt['C'] = np.random.choice(range(1,150), n)
mdt['D'] = np.random.choice(range(10000,45000), n)
mdt['x'] = np.random.choice(range(400), n)
mdt['y'] = np.random.choice(range(25), n)
test_A = 25000
test_B = 25
test_C = 40
test_D = 35000
eps_A = 5000
eps_B = 5
eps_C = 5
eps_D = 5000
f1 = lambda : mdt.query('@test_A-@eps_A <= A <= @test_A+@eps_A & ' +
'@test_B-@eps_B <= B <= @test_B+@eps_B & ' +
'@test_C-@eps_C <= C <= @test_C+@eps_C & ' +
'@test_D-@eps_D <= D <= @test_D+@eps_D')
This selects (for my random dataset) 1848 rows:
len(f1())
Out[289]: 1848
And it takes about .1-.15 seconds per query:
timeit.timeit(f1,number=10)/10
Out[290]: 0.10734589099884033
So I thought I must be able to do better by sorting and indexing the table, right? And I can take advantage of the fact that everything is an int, so I can do slices..
mdt2 = mdt.set_index(['A', 'B', 'C', 'D']).sortlevel()
f2 = lambda : mdt2.loc[(slice(test_A-eps_A, test_A+eps_A),
slice(test_B-eps_B, test_B+eps_B),
slice(test_C-eps_C, test_C+eps_C),
slice(test_D-eps_D, test_D+eps_D)), :]
len(f2())
Out[299]: 1848
And it takes a lot longer:
timeit.timeit(f2,number=10)/10
Out[295]: 7.335134506225586
Am I missing something here? It seems like I could do something like numpy.searchsorted, but I can't think of how to do that on multiple columns. Is pandas the wrong choice?
So there are 2 issues here.
This is an artifice that makes the syntax a little nicer
In [111]: idx = pd.IndexSlice
1) Your .query
does not have the correct precedence. The &
operator has a higher precedence than comparison operators like <=
and needs parentheses around its left and right operands.
In [102]: result3 = mdt.query("(@test_A-@eps_A <= A <= @test_A+@eps_A) & (@test_B-@eps_B <= B <= @test_B+@eps_B) & (@test_C-@eps_C <= C <= @test_C+@eps_C) & (@test_D-@eps_D <= D <= @test_D+@eps_D)").set_index(['A','B','C','D']).sortlevel()
This is your original query using MultiIndex slicers
In [103]: result1 = mdt2.loc[idx[test_A-eps_A:test_A+eps_A,test_B-eps_B:test_B+eps_B,test_C-eps_C:test_C+eps_C,test_D-eps_D:test_D+eps_D],:]
Here is a chained version of this query. IOW its a repeated selection on the result set.
In [104]: result2 = mdt2.loc[idx[test_A-eps_A:test_A+eps_A],:].loc[idx[:,test_B-eps_B:test_B+eps_B],:].loc[idx[:,:,test_C-eps_C:test_C+eps_C],:].loc[idx[:,:,:,test_D-eps_D:test_D+eps_D],:]
Always confirm correctness before working on performance
In [109]: (result1==result2).all().all()
Out[109]: True
In [110]: (result1==result3).all().all()
Out[110]: True
Performance
The .query
IMHO will actually scale very well and uses multi-cores. For a large selection set this will be the way to go
In [107]: %timeit mdt.query("(@test_A-@eps_A <= A <= @test_A+@eps_A) & (@test_B-@eps_B <= B <= @test_B+@eps_B) & (@test_C-@eps_C <= C <= @test_C+@eps_C) & (@test_D-@eps_D <= D <= @test_D+@eps_D)").set_index(['A','B','C','D']).sortlevel()
10 loops, best of 3: 107 ms per loop
2) The original multi-index slicing. There is an issues here, see below. I am not sure exactly why this is non-performant, and will investigate this here
In [106]: %timeit mdt2.loc[idx[test_A-eps_A:test_A+eps_A,test_B-eps_B:test_B+eps_B,test_C-eps_C:test_C+eps_C,test_D-eps_D:test_D+eps_D],:]
1 loops, best of 3: 4.34 s per loop
Repeated selections make this quite performant. Note that I won't normally recommend one do this as you cannot assign to it, but for this purpose it is ok.
In [105]: %timeit mdt2.loc[idx[test_A-eps_A:test_A+eps_A],:].loc[idx[:,test_B-eps_B:test_B+eps_B],:].loc[idx[:,:,test_C-eps_C:test_C+eps_C],:].loc[idx[:,:,:,test_D-eps_D:test_D+eps_D],:]
10 loops, best of 3: 140 ms per loop
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With