 

Fastest way to query a dataframe

Tags: python, pandas

I want to perform aggregation operations (sum) on the rows of a big pandas DataFrame (millions of rows), where the rows to aggregate are selected by a condition on several fixed columns (at most 10). These columns contain only integer values.

My problem is that I have to perform this operation (querying + aggregating) thousands of times (~100,000 times). I think there is not much to optimize in the aggregating part, as it is just a simple sum. What would be the most efficient way to perform this task? Is there some way I could build an 'index' on my condition columns in order to speed up each query?
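For concreteness, here is a minimal sketch of the kind of workload I mean (the column names, sizes, and condition values are made up for illustration):

import numpy as np
import pandas as pd

# millions of rows; c1 and c2 stand in for the integer condition columns
N = 5000000
df = pd.DataFrame({
    'c1': np.random.randint(0, 10, N),
    'c2': np.random.randint(0, 10, N),
    'value': np.random.rand(N)})

# one of the ~100,000 query + aggregate operations
total = df.loc[(df['c1'] == 3) & (df['c2'] == 7), 'value'].sum()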

asked Dec 18 '13 by elyase




2 Answers

I would try something along these lines:

Suppose you have the following dataframe

import numpy as np
import pandas as pd

# 10 million rows: two small integer "condition" columns (A, B)
# and two numeric columns to aggregate (nume1, nume2)
N = 10000000
df = pd.DataFrame({
    'A': np.random.binomial(1, 0.5, N),
    'B': np.random.binomial(2, 0.5, N),
    'nume1': np.random.uniform(0, 1, N),
    'nume2': np.random.normal(0, 1, N)})

then doing this

tmp = (df[['A', 'B', 'nume1', 'nume2']]
       .query('A > 0.5')
       .groupby('B')
       .sum()
       .reset_index()[['B', 'nume1', 'nume2']])

is the SQL equivalent of

select B, sum(nume1),sum(nume2)
from df
where A > 0.5
group by B

this takes a little less than a second (926 ms, using %timeit) on my moderate (i7 quad-core, 16 GB RAM) machine.
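For comparison, the same aggregation can also be written with a plain boolean mask instead of query(); this is just a sketch of an equivalent formulation, and which one is faster will depend on your pandas version and data:

# equivalent formulation using a boolean mask instead of query()
tmp2 = (df.loc[df['A'] > 0.5, ['B', 'nume1', 'nume2']]
          .groupby('B')
          .sum()
          .reset_index())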

I hope this helps.

answered Sep 20 '22 by tipanverella


Without more details it's hard to answer your question.

You should indeed build an index from your condition columns, for example:

# precompute the condition once as a boolean column (the formula here is
# just an illustration), then make that column the index
df['idx'] = (df['col1'] * df['col2']) ** (df['col3'] + df['col4']) * df['col5'] == 0.012
df = df.set_index('idx')
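With the boolean condition as the index, each of the repeated lookups becomes a plain label selection. A sketch ('col1' here just stands in for whichever column you aggregate):

# label lookup against the boolean index: rows whose condition is True
matching = df.loc[True]
total = matching['col1'].sum()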

Rewriting your condition to a single indexable column may be hard. Keep in mind you can also set all the condition columns as the index:

df = df.set_index(['col1', 'col2', 'col3', 'col4', 'col5', ...])
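With all the condition columns in a MultiIndex, each of the ~100,000 repeated queries turns into an index lookup instead of a full-column scan. A sketch, with made-up key values, of what one lookup would look like after sorting the index once:

# sort the MultiIndex once so lookups can use binary search
df = df.sort_index()

# one repeated query: exact match on all five condition columns,
# then sum the remaining (value) columns
totals = df.loc[(1, 0, 2, 5, 3)].sum()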

This documentation on advanced indexing in Pandas may help you think about your problem: http://pandas.pydata.org/pandas-docs/stable/indexing.html#multiindex-query-syntax

answered Sep 21 '22 by firelynx