
Pandas DataFrame groupby describe ~8x slower than computing separately

The following code summarizes numeric data using two different approaches.

The first approach uses groupby().describe() and passes some specific extra percentiles.

The second approach separately computes the summary stats (mean, std, N), stacks them, computes the same quantiles, then appends the two and sorts by the index, so the result is essentially the same as the first approach.

There are some minor naming differences that we can clean up afterwards (see the sketch after the timings below), and since the summarized data is small, that cleanup is very fast.

It turns out that using the describe function was about 8x slower in this example.

I am looking for reasons why, and for suggestions on any other approaches that may speed this up even more. The filters, groups, and values are all passed in from a UI to a Tornado service, so speed is important: the user is waiting for results, and the data can be even larger than this example.

    import pandas as pd
    import numpy as np
    from datetime import datetime

    def make_data (n):
        
        ts = datetime.now().timestamp() + abs(np.random.normal(60, 30, n)).cumsum()
        
        df = pd.DataFrame({
            'c1': np.random.choice(list('ABCDEFGH'), n),
            'c2': np.random.choice(list('ABCDEFGH'), n),
            'c3': np.random.choice(list('ABCDEFGH'), n),
            't1': np.random.randint(1, 20, n),
            't2': pd.to_datetime(ts, unit='s'),
            'x1': np.random.randn(n),
            'x2': np.random.randn(n),
            'x3': np.random.randn(n)
            })
        
        return df
    
    def summarize_numeric_1 (df, mask, groups, values, quantiles): 
        
        dfg = df[mask].groupby(groups)[values]
        
        return dfg.describe(percentiles = quantiles).stack()
    
    def summarize_numeric_2 (df, mask, groups, values, quantiles): 

        dfg = df[mask].groupby(groups)[values]

        # compute mean/std/count, then the quantiles, and stitch the two together
        dfg_stats = dfg.agg([np.mean, np.std, len]).stack()
        dfg_quantiles = dfg.quantile(quantiles)

        return dfg_stats.append(dfg_quantiles).sort_index()

    %time df = make_data(1000000)
    
    groups = ['c1', 'c2', 't1']
    mask = df['c3'].eq('H') & df['c1'].eq('A')
    values = ['x1', 'x3']
    base_quantiles = [0, .5, 1] 
    extd_quantiles = [0.25, 0.75, 0.9]
    all_quantiles = base_quantiles + extd_quantiles
    
    %timeit summarize_numeric_1(df, mask, groups, values, extd_quantiles)
    %timeit summarize_numeric_2(df, mask, groups, values, all_quantiles)

The timings on my PC for this are:

Using describe: 873 ms ± 8.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Using the two-step method: 105 ms ± 490 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
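
For reference, here is a minimal sketch of the naming cleanup mentioned above. The exact keys in the mapping are my assumption about how agg()/quantile() label the last index level ('mean', 'std', 'len' plus the raw quantile floats), so treat it as illustrative:

    # Hypothetical cleanup: align the two-step output's row labels with
    # describe()'s ('len' -> 'count', quantile floats -> percent-style strings).
    label_map = {'len': 'count', 0.0: 'min', 0.25: '25%', 0.5: '50%',
                 0.75: '75%', 0.9: '90%', 1.0: 'max'}

    result = summarize_numeric_2(df, mask, groups, values, all_quantiles)
    result = result.rename(index=label_map, level=-1)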

All inputs welcome!

asked Jun 24 '18 by Randall Goodwin




2 Answers

Educated guess

I will post this as an answer, maybe to be deleted later, since it is more of an educated guess than an actual answer. Also it is a bit too long for a comment.

So the first thing I did after reading your question was to re-run your timings in a profiler to take a closer look at the issue. Since the time of the computation itself was rather short, it got overshadowed quite a bit by the data generation. However, in general the times were similar to what you described. Not only that, the difference got even more pronounced: 1094 ms for the first approach vs 63 ms for the second one, a factor of roughly 17.

Since the lower of the two times was rather small, I decided it was too small to be trusted and re-ran the test with 10x the generated data sample size. That pushed the data generation step up to about a minute, and the numbers got weird: 1173 ms for the first approach vs 506 ms for the second one, a factor of only slightly more than two.

I started to suspect something. To confirm my suspicion, I ran one last test, once again increasing the data size by a factor of 10. The result might surprise you: 12258 ms for the first approach vs 3646 ms for the second one. The tables have turned completely, the factor being ~0.3.
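
A rough sketch of how this kind of scaling comparison could be reproduced (my own sketch, reusing the question's setup; the sizes are illustrative and the largest needs a lot of RAM):

    # Hypothetical scaling experiment: time both approaches at increasing sizes
    # and watch how the describe() / two-step ratio changes.
    import timeit

    for n in (1_000_000, 10_000_000, 100_000_000):
        df = make_data(n)
        mask = df['c3'].eq('H') & df['c1'].eq('A')
        t1 = timeit.timeit(lambda: summarize_numeric_1(df, mask, groups, values, extd_quantiles), number=3)
        t2 = timeit.timeit(lambda: summarize_numeric_2(df, mask, groups, values, all_quantiles), number=3)
        print(f"n={n:>11,}: describe {t1 / 3:6.2f}s  two-step {t2 / 3:6.2f}s  ratio {t1 / t2:4.1f}")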

My guess in this case is that the pandas computation is actually the one with the better optimization/algorithm. However, since it is pandas, it has quite a bit of additional baggage around it - the price paid for convenience and robustness. This means that there is a layer of "unnecessary" (computation-wise) baggage that needs to be carried around no matter how large the data set is.

So in case you want to be faster than pandas even on data sets of your size, take their operations and write them yourself - in the most straightforward manner possible. This will keep their optimizations and discard the baggage paid for convenience.
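
Purely as an illustration of that idea (this is my sketch, not the answerer's code, and not necessarily faster - with many groups the Python-level loop can dominate), a hand-rolled version of the same summary might look like:

    # Minimal hand-rolled sketch: still use groupby to split the data, but
    # compute each statistic with plain numpy and skip describe()'s extras.
    def summarize_numeric_manual(df, mask, groups, values, quantiles):
        rows = {}
        for key, block in df[mask].groupby(groups):
            arr = block[values].to_numpy()
            stats = {'mean': arr.mean(axis=0),
                     'std': arr.std(axis=0, ddof=1),   # sample std, like pandas
                     'count': np.full(len(values), len(arr))}
            for q in quantiles:
                stats['{:g}%'.format(q * 100)] = np.quantile(arr, q, axis=0)
            for name, vals in stats.items():
                rows[key + (name,)] = pd.Series(vals, index=values)
        return pd.DataFrame(rows).T.sort_index()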

answered by Shamis


Note: this answer is for pandas version 1.0.5. Things might be different for other versions.

tl;dr

pandas' describe() method will always be slower than your version, because under the hood it uses almost exactly the same logic, plus extra work such as making sure the data has the right dimensions, ordering the results, and checking for NaNs and the correct data types.


Longer Answer

Taking a look at the source code of the describe method, we can see a few things:

  • pandas uses the same logic as your code to compute the statistics. See this line inside the describe() method for an example of how it's using the same logic. This means that pandas describe will always be slower.
  • pandas counts non-NaN values using s.count(), but your code counts all values. Let's try and modify your code to use that same method instead of len():
    def summarize_numeric_3(df, mask, groups, values, quantiles): 
        dfg = df[mask].groupby(groups)[values]
        # count non-NaN values the way describe() does, instead of using len()
        dfg_stats = dfg.agg([np.mean, np.std, pd.Series.count]).stack()
        dfg_quantiles = dfg.quantile(quantiles)
        return dfg_stats.append(dfg_quantiles).sort_index()

    %timeit -n 10 summarize_numeric_3(df, mask, groups, values, all_quantiles)

    # outputs
    # 48.9 ms ± 1.11 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

It takes ~49 ms, vs 42 ms for your version on my machine. It took 7 more ms just for this relatively small modification!

  • pandas does a lot more than your code to ensure the data has the correct type and shape, and to present it in a pretty format. I've extracted the pandas describe method code into a "self-contained"* version that you can profile and tinker with here (too long to be posted in this answer). Profiling that, I see that a very large portion of the time is taken to "set a convenient order for rows". Removing that ordering improved the describe timing by ~8%, from 530 ms down to 489 ms.
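
If you want to see that for yourself, one way (my own sketch, not the answerer's extracted code) is to profile the describe-based call directly and look at the cumulative times:

    # Profile the describe()-based summary to see where the time goes.
    # In IPython/Jupyter, `%prun summarize_numeric_1(...)` gives the same view.
    import cProfile
    import pstats

    cProfile.run('summarize_numeric_1(df, mask, groups, values, extd_quantiles)',
                 'describe_profile.stats')
    pstats.Stats('describe_profile.stats').sort_stats('cumulative').print_stats(15)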
answered by Qusai Alothman