
Pandas DataFrame groupby describe ~8x slower than computing separately

The following code summarizes numeric data using two different approaches.

The first approach uses groupby().describe() and passes some specific extra percentiles.

The second approach separately computes the summary stats (mean, std, N), stacks them, computes the same quantiles, then appends the two and sorts by the index, so the result is essentially the same as the first approach.

There are some minor naming differences that we can clean up afterwards (see the sketch after the timings below), and since the summarized data is small, that cleanup is very fast.

It turns out that using the describe function was about 8x slower in this example.

I am looking for reasons why, and for suggestions on any other approaches that may speed this up even more. The filters, groups, and values are all passed in from a UI to a Tornado service, so speed is important: the user is waiting for results, and the data can be even larger than this example.

    import pandas as pd
    import numpy as np
    from datetime import datetime

    def make_data (n):
        
        ts = datetime.now().timestamp() + abs(np.random.normal(60, 30, n)).cumsum()
        
        df = pd.DataFrame({
            'c1': np.random.choice(list('ABCDEFGH'), n),
            'c2': np.random.choice(list('ABCDEFGH'), n),
            'c3': np.random.choice(list('ABCDEFGH'), n),
            't1': np.random.randint(1, 20, n),
            't2': pd.to_datetime(ts, unit='s'),
            'x1': np.random.randn(n),
            'x2': np.random.randn(n),
            'x3': np.random.randn(n)
            })
        
        return df
    
    def summarize_numeric_1 (df, mask, groups, values, quantiles): 
        
        dfg = df[mask].groupby(groups)[values]
        
        return dfg.describe(percentiles = quantiles).stack()
    
    def summarize_numeric_2 (df, mask, groups, values, quantiles): 

        dfg = df[mask].groupby(groups)[values]

        # compute mean/std/count, then the quantiles, and stitch the two together
        dfg_stats = dfg.agg([np.mean, np.std, len]).stack()
        dfg_quantiles = dfg.quantile(quantiles)

        return dfg_stats.append(dfg_quantiles).sort_index()

    %time df = make_data(1000000)
    
    groups = ['c1', 'c2', 't1']
    mask = df['c3'].eq('H') & df['c1'].eq('A')
    values = ['x1', 'x3']
    base_quantiles = [0, .5, 1] 
    extd_quantiles = [0.25, 0.75, 0.9]
    all_quantiles = base_quantiles + extd_quantiles
    
    %timeit summarize_numeric_1(df, mask, groups, values, extd_quantiles)
    %timeit summarize_numeric_2(df, mask, groups, values, all_quantiles)

The timings on my PC for this are:

Using describe: 873 ms ± 8.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Using the two-step method: 105 ms ± 490 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
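
For reference, here is a minimal sketch of the naming cleanup mentioned above. The exact keys in the mapping are my assumption about how agg()/quantile() label the last index level ('mean', 'std', 'len' plus the raw quantile floats), so treat it as illustrative:

    # Hypothetical cleanup: align the two-step output's row labels with
    # describe()'s ('len' -> 'count', quantile floats -> percent-style strings).
    label_map = {'len': 'count', 0.0: 'min', 0.25: '25%', 0.5: '50%',
                 0.75: '75%', 0.9: '90%', 1.0: 'max'}

    result = summarize_numeric_2(df, mask, groups, values, all_quantiles)
    result = result.rename(index=label_map, level=-1)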

All inputs welcome!

asked Jun 24 '18 by Randall Goodwin




2 Answers

Educated guess

I will post this as an answer, maybe to be deleted later, since it is more of an educated guess than an actual answer. Also it is a bit too long for a comment.

So the first thing I did after reading your question was to re-run your timings in a profiler to take a closer look at the issue. Since the time of the computation itself was rather short, it got overshadowed quite a bit by the data generation. However, in general the times were similar to what you described. Not only that, the difference got even more pronounced: 1094 ms for the first approach vs 63 ms for the second one, a factor of roughly 17.

Since the lower of the two times was rather small, I decided it was too small to be trusted and re-ran the test with 10x the generated data sample size. That pushed the data generation step up to about a minute, and the numbers got weird: 1173 ms for the first approach vs 506 ms for the second one, a factor of only slightly more than two.

I started to suspect something. To confirm my suspicion, I ran one last test, once again increasing the data size by a factor of 10. The result might surprise you: 12258 ms for the first approach vs 3646 ms for the second one. The tables have turned completely, the factor being ~0.3.
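
A rough sketch of how this kind of scaling comparison could be reproduced (my own sketch, reusing the question's setup; the sizes are illustrative and the largest needs a lot of RAM):

    # Hypothetical scaling experiment: time both approaches at increasing sizes
    # and watch how the describe() / two-step ratio changes.
    import timeit

    for n in (1_000_000, 10_000_000, 100_000_000):
        df = make_data(n)
        mask = df['c3'].eq('H') & df['c1'].eq('A')
        t1 = timeit.timeit(lambda: summarize_numeric_1(df, mask, groups, values, extd_quantiles), number=3)
        t2 = timeit.timeit(lambda: summarize_numeric_2(df, mask, groups, values, all_quantiles), number=3)
        print(f"n={n:>11,}: describe {t1 / 3:6.2f}s  two-step {t2 / 3:6.2f}s  ratio {t1 / t2:4.1f}")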

My guess in this case is that the pandas computation is actually the one with the better optimization/algorithm. However, since it is pandas, it has quite a bit of additional baggage around it - the price paid for convenience and robustness. This means that there is a layer of "unnecessary" (computation-wise) baggage that needs to be carried around no matter how large the data set is.

So in case you want to be faster than pandas even on data sets of your size, take their operations and write them yourself - in the most straightforward manner possible. This will keep their optimizations and discard the baggage paid for convenience.
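
Purely as an illustration of that idea (this is my sketch, not the answerer's code, and not necessarily faster - with many groups the Python-level loop can dominate), a hand-rolled version of the same summary might look like:

    # Minimal hand-rolled sketch: still use groupby to split the data, but
    # compute each statistic with plain numpy and skip describe()'s extras.
    def summarize_numeric_manual(df, mask, groups, values, quantiles):
        rows = {}
        for key, block in df[mask].groupby(groups):
            arr = block[values].to_numpy()
            stats = {'mean': arr.mean(axis=0),
                     'std': arr.std(axis=0, ddof=1),   # sample std, like pandas
                     'count': np.full(len(values), len(arr))}
            for q in quantiles:
                stats['{:g}%'.format(q * 100)] = np.quantile(arr, q, axis=0)
            for name, vals in stats.items():
                rows[key + (name,)] = pd.Series(vals, index=values)
        return pd.DataFrame(rows).T.sort_index()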

answered by Shamis


Note: this answer is for pandas version 1.0.5. Things might be different for other versions.

tl;dr

pandas' describe() method will always be slower than your version, because under the hood it uses almost exactly the same logic, plus extra work such as making sure the data has the right dimensions, ordering the results, and checking for NaNs and the correct data types.


Longer Answer

Taking a look at the source code of the describe method, we can see a few things:

  • pandas uses the same logic as your code to compute the statistics. See this line inside the describe() method for an example of how it's using the same logic. This means that pandas describe will always be slower.
  • pandas counts non-NaN values using s.count(), but your code counts all values. Let's try and modify your code to use that same method instead of len():
    def summarize_numeric_3(df, mask, groups, values, quantiles): 
        dfg = df[mask].groupby(groups)[values]
        # count non-NaN values the way describe() does, instead of using len()
        dfg_stats = dfg.agg([np.mean, np.std, pd.Series.count]).stack()
        dfg_quantiles = dfg.quantile(quantiles)
        return dfg_stats.append(dfg_quantiles).sort_index()

    %timeit -n 10 summarize_numeric_3(df, mask, groups, values, all_quantiles)

    # outputs
    # 48.9 ms ± 1.11 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

It takes ~49 ms, vs 42 ms for your version on my machine. It took 7 more ms just for this relatively small modification!

  • pandas does a lot more than your code to ensure the data has the correct type and shape, and to present it in a pretty format. I've extracted the pandas describe method code into a "self-contained"* version that you can profile and tinker with here (too long to be posted in this answer). Profiling that, I see that a very large portion of the time is taken to "set a convenient order for rows". Removing that ordering improved the describe timing by ~8%, from 530 ms down to 489 ms.
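
If you want to see that for yourself, one way (my own sketch, not the answerer's extracted code) is to profile the describe-based call directly and look at the cumulative times:

    # Profile the describe()-based summary to see where the time goes.
    # In IPython/Jupyter, `%prun summarize_numeric_1(...)` gives the same view.
    import cProfile
    import pstats

    cProfile.run('summarize_numeric_1(df, mask, groups, values, extd_quantiles)',
                 'describe_profile.stats')
    pstats.Stats('describe_profile.stats').sort_stats('cumulative').print_stats(15)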
answered by Qusai Alothman