I want to perform aggregation operations (sums) on the rows of a big pandas DataFrame (millions of rows), where the rows are selected by a condition on several fixed columns (at most 10). These columns contain only integer values.
My problem is that I have to perform this operation (querying + aggregating) thousands of times (~100,000). I don't think there is much to optimize in the aggregation itself, as it is just a simple sum. What would be the most efficient way to perform this task? Is there some way I could build an 'index' on my condition columns in order to speed up each query?
The results show that apply massively outperforms iterrows. As mentioned previously, this is because apply loops over DataFrame rows much more quickly than iterrows does. While slower than apply, itertuples is quicker than iterrows, so if looping is required, try itertuples instead.
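A minimal sketch of the three row-iteration styles compared above (the DataFrame and column names are illustrative, not from the original benchmark):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.arange(1000), 'b': np.arange(1000) * 2})

# Slowest: iterrows yields a full Series per row (heavy boxing overhead)
total_iterrows = sum(row['a'] + row['b'] for _, row in df.iterrows())

# Quicker: itertuples yields lightweight namedtuples
total_itertuples = sum(t.a + t.b for t in df.itertuples(index=False))

# apply(axis=1), which the text above reports as the fastest of the three;
# note that a fully vectorized df['a'] + df['b'] beats all row-wise loops
total_apply = df.apply(lambda row: row['a'] + row['b'], axis=1).sum()
```

All three produce the same result; they differ only in per-row overhead.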
On a large DataFrame, the query function seems more efficient than the loc function; on a small one (DF2: 2K records x 6 columns), the loc function seems much more efficient than the query function.
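The two forms compared above are interchangeable in what they select; the size-dependent speed difference is the only distinction (column names here are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': np.random.binomial(1, 0.5, 2000),
                   'B': np.random.normal(0, 1, 2000)})

# query: parses a string expression (numexpr-backed, pays off on big frames)
via_query = df.query('A == 1')

# loc with a boolean mask: no parsing overhead, often wins on small frames
via_loc = df.loc[df['A'] == 1]
```

Both return exactly the same rows with the same index.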
However, while the Pandas library is excellent for formulating and running heavy calculations, it can be inefficient for deep or complex data queries. Fortunately, it contains many functions that make complex calculations, which are not the forte of SQL, much easier and faster.
I would try something in this flavor:
Suppose you have the following dataframe
import numpy as np
import pandas as pd

N = 10_000_000
df = pd.DataFrame({
    'A': np.random.binomial(1, 0.5, N),
    'B': np.random.binomial(2, 0.5, N),
    'nume1': np.random.uniform(0, 1, N),
    'nume2': np.random.normal(0, 1, N)})
then doing this
tmp = (df[['A', 'B', 'nume1', 'nume2']]
       .query('A > 0.5')
       .groupby('B')
       .sum()
       .reset_index()[['B', 'nume1', 'nume2']])
is the SQL equivalent of
select B, sum(nume1), sum(nume2)
from df
where A > 0.5
group by B
This takes a little less than a second (926 ms, using %timeit) on my moderate machine (i7 quad-core, 16 GB RAM).
I hope this helps.
Without more details it's hard to answer your question.
You should indeed build an index on your condition columns.
df['idx'] = (df['col1'] * df['col2']) ** (df['col3'] + df['col4']) * df['col5'] == 0.012  # precompute the boolean condition once
df = df.set_index('idx')
Rewriting your condition into an indexable column may be hard. Keep in mind that you can also set all of the condition columns as the index:
df = df.set_index(['col1', 'col2', 'col3', 'col4', 'col5'])  # ... and so on, up to all 10
This documentation on advanced indexing in Pandas may help you think about your problem: http://pandas.pydata.org/pandas-docs/stable/indexing.html#multiindex-query-syntax
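A minimal sketch of that multi-column-index approach, assuming two integer condition columns named c1 and c2 (names and sizes are illustrative): set them as a sorted MultiIndex once, and each of the ~100,000 repeated lookups becomes a fast index slice instead of a full-table boolean scan.

```python
import numpy as np
import pandas as pd

N = 1_000_000
df = pd.DataFrame({'c1': np.random.randint(0, 10, N),
                   'c2': np.random.randint(0, 10, N),
                   'val': np.random.uniform(0, 1, N)})

# Build the "index" once; sort_index enables fast lexsorted lookups
indexed = df.set_index(['c1', 'c2']).sort_index()

def agg(c1, c2):
    # each repeated query is now an index lookup, not a scan
    return indexed.loc[(c1, c2), 'val'].sum()
```

Each call to agg(c1, c2) is equivalent to the scan-based df.loc[(df['c1'] == c1) & (df['c2'] == c2), 'val'].sum(), but avoids re-evaluating the boolean mask over all N rows on every query.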