Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas performance: pivot_table vs groupby

I am struggling with performance of pivot_table versus groupby

On one hand I have:

%time df.groupby(['INDEX', 'COLUMN']).agg({'VALUE':['sum','size']}).unstack(level='COLUMN')
CPU times: user 2.28 s, sys: 29.8 ms, total: 2.31 s
Wall time: 2.36 s

On the other hand I get:

%time pd.pivot_table(df, index='INDEX', columns='COLUMN', values='VALUE', aggfunc=[len, np.sum], fill_value=0)
CPU times: user 1min 51s, sys: 1.57 s, total: 1min 53s
Wall time: 1min 54s

These are essentially the same things but I get like 60x performance difference. Why is that?

There are 800k rows with about 400k unique INDEX and COLUMN has 16 unique values in my sample.

like image 937
snovik Avatar asked Nov 18 '22 19:11

snovik


1 Answers

TL;DR: pivot_table loops over aggfunc no matter what's passed to it while groupby checks if cython-optimized implementation is available first and loops if not.


If we peek into the source code of pivot_table(), the way it is implemented is that, when you pass a list of aggregator functions a.k.a. aggfuncs to it, for each func() in the list, groupby().func().unstack() is called and the resulting list of dataframes are concatenated later on. Meanwhile, groupby().agg() tries to first call cython-optimized methods and use loop as a last resort.

So if the functions in aggfuncs are all cython-optimized such as 'sum' or 'size', groupby().agg() will perform as many times faster than pivot_table() as the number of functions in aggfuncs. In particular, for a single aggregator function, they will perform about the same (although, I imagine pivot_table() will still be slightly slower since it has a larger overhead).

However, if the list of functions are not cython-optimized, then since both calls each function in a loop, they will perform about the same. N.B. groupby().agg().unstack() makes a call to unstack() only once while pivot_table() makes the call len(aggfuncs) number of times; so naturally, pivot_table() will also be slightly slower.

A demonstration of this in code is as follows:

Setup:

def groupby_unstack(funcs):
    return df.groupby(['INDEX', 'COLUMN'])['VALUE'].agg(funcs).unstack(level='COLUMN', fill_value=0)

def pivot_table_(funcs):
    return df.pivot_table(index='INDEX', columns='COLUMN', values='VALUE', aggfunc=funcs, fill_value=0)

def get_df(k):
    return pd.DataFrame({'INDEX': np.random.default_rng().choice(k // 2, size=k), 
                         'COLUMN': np.random.default_rng().choice(16, size=k), 
                         'VALUE': np.random.rand(k).round(2)})

Cython-optimized functions

As can be seen from the below benchmarks, the gap between the performance of groupby().agg().unstack() and pivot_table() increases as the number of aggregator functions increase. For a single aggregator function, they perform about the same but for two functions, pivot_table() is about twice as slow and for three functions, it is about 3 times as slow etc.

df = get_df(800_000)

cython_funcs1 = ['sum', 'size']

%timeit groupby_unstack(cython_funcs1)
# 1.41 s ± 35.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit pivot_table_(cython_funcs1)
# 3.51 s ± 263 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


cython_funcs2 = ['sum', 'size', 'mean']

%timeit groupby_unstack(cython_funcs2)
# 1.63 s ± 16.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit pivot_table_(cython_funcs2)
# 5.08 s ± 57 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


cython_funcs3 = ['median']

%timeit groupby_unstack(cython_funcs3)
# 1.17 s ± 92.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit pivot_table_(cython_funcs3)
# 1.84 s ± 70.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Non-cython-optimized functions

For non-cython optimized functions, groupby().agg().unstack() and pivot_table() perform about the same even for multiple aggregator functions because both loop over the list of functions under the hood.

df = get_df(80_000)

funcs = [lambda x: list(x.mode()), lambda x: x.nunique()**2]

%timeit groupby_unstack(funcs)
# 26.6 s ± 5.99 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit pivot_table_(funcs)
# 27.2 s ± 6.46 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
like image 182
cottontail Avatar answered Dec 10 '22 08:12

cottontail