I want to perform aggregation operations (sums) on the rows of a big pandas DataFrame (millions of rows), where the rows are selected by a condition on several fixed columns (at most 10). These columns contain only integer values.
My problem is that I have to perform this operation (querying + aggregating) thousands of times (~100,000). I don't think there is much to optimize in the aggregation itself, as it is just a simple sum. What would be the most efficient way to perform this task? Is there some way I could build an 'index' on my condition columns in order to speed up each query?
The results show that apply massively outperforms iterrows. As mentioned previously, this is because apply loops over DataFrame rows much more quickly than iterrows does. While slower than apply, itertuples is quicker than iterrows, so if looping is required, try itertuples instead.
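A minimal sketch of the three row-iteration styles compared above (the DataFrame and column names are illustrative, not from the original benchmark):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.arange(1000), 'b': np.arange(1000) * 2})

# Slowest: iterrows yields a full Series per row (heavy boxing overhead)
total_iterrows = sum(row['a'] + row['b'] for _, row in df.iterrows())

# Quicker: itertuples yields lightweight namedtuples
total_itertuples = sum(t.a + t.b for t in df.itertuples(index=False))

# apply(axis=1), which the text above reports as the fastest of the three;
# note that a fully vectorized df['a'] + df['b'] beats all row-wise loops
total_apply = df.apply(lambda row: row['a'] + row['b'], axis=1).sum()
```

All three produce the same result; they differ only in per-row overhead.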
On a large DataFrame, the query function seems more efficient than the loc function; on a small one (DF2: 2K records x 6 columns), the loc function seems much more efficient than the query function.
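The two forms compared above are interchangeable in what they select; the size-dependent speed difference is the only distinction (column names here are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': np.random.binomial(1, 0.5, 2000),
                   'B': np.random.normal(0, 1, 2000)})

# query: parses a string expression (numexpr-backed, pays off on big frames)
via_query = df.query('A == 1')

# loc with a boolean mask: no parsing overhead, often wins on small frames
via_loc = df.loc[df['A'] == 1]
```

Both return exactly the same rows with the same index.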
However, while the Pandas library is excellent for formulating and running heavy calculations, it can be inefficient for deep or complex data queries. Fortunately, it contains many functions that make complex calculations, which are not the forte of SQL, much easier and faster.
I would try something in this flavor:
Suppose you have the following dataframe
import numpy as np
import pandas as pd

N = 10_000_000
df = pd.DataFrame({
    'A': np.random.binomial(1, 0.5, N),
    'B': np.random.binomial(2, 0.5, N),
    'nume1': np.random.uniform(0, 1, N),
    'nume2': np.random.normal(0, 1, N)})
then doing this
tmp = (df[['A', 'B', 'nume1', 'nume2']]
       .query('A > 0.5')
       .groupby('B')
       .sum()
       .reset_index()[['B', 'nume1', 'nume2']])
is the SQL equivalent of
select B, sum(nume1), sum(nume2)
from df
where A > 0.5
group by B
This takes a little less than a second (926 ms, using %timeit) on my moderate machine (i7 quad-core, 16 GB RAM).
I hope this helps.
Without more details it's hard to answer your question.
You should indeed build an index on your condition columns.
df['idx'] = (df['col1'] * df['col2']) ** (df['col3'] + df['col4']) * df['col5'] == 0.012  # precompute the boolean condition once
df = df.set_index('idx')
Rewriting your condition into an indexable column may be hard. Keep in mind that you can also set all of the condition columns as the index:
df = df.set_index(['col1', 'col2', 'col3', 'col4', 'col5'])  # ... and so on, up to all 10
This documentation on advanced indexing in Pandas may help you think about your problem: http://pandas.pydata.org/pandas-docs/stable/indexing.html#multiindex-query-syntax
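A minimal sketch of that multi-column-index approach, assuming two integer condition columns named c1 and c2 (names and sizes are illustrative): set them as a sorted MultiIndex once, and each of the ~100,000 repeated lookups becomes a fast index slice instead of a full-table boolean scan.

```python
import numpy as np
import pandas as pd

N = 1_000_000
df = pd.DataFrame({'c1': np.random.randint(0, 10, N),
                   'c2': np.random.randint(0, 10, N),
                   'val': np.random.uniform(0, 1, N)})

# Build the "index" once; sort_index enables fast lexsorted lookups
indexed = df.set_index(['c1', 'c2']).sort_index()

def agg(c1, c2):
    # each repeated query is now an index lookup, not a scan
    return indexed.loc[(c1, c2), 'val'].sum()
```

Each call to agg(c1, c2) is equivalent to the scan-based df.loc[(df['c1'] == c1) & (df['c2'] == c2), 'val'].sum(), but avoids re-evaluating the boolean mask over all N rows on every query.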