I am struggling with the performance of `pivot_table` versus `groupby`.
On one hand I have:
%time df.groupby(['INDEX', 'COLUMN']).agg({'VALUE':['sum','size']}).unstack(level='COLUMN')
CPU times: user 2.28 s, sys: 29.8 ms, total: 2.31 s
Wall time: 2.36 s
On the other hand I get:
%time pd.pivot_table(df, index='INDEX', columns='COLUMN', values='VALUE', aggfunc=[len, np.sum], fill_value=0)
CPU times: user 1min 51s, sys: 1.57 s, total: 1min 53s
Wall time: 1min 54s
These are essentially the same computation, yet I see about a 60x performance difference. Why is that?
There are 800k rows, with about 400k unique `INDEX` values and 16 unique `COLUMN` values in my sample.
TL;DR: `pivot_table` loops over `aggfunc` no matter what is passed to it, while `groupby` first checks whether a cython-optimized implementation is available and loops only if one is not.
If we peek into the source code of `pivot_table()`, we see that when you pass a list of aggregator functions (a.k.a. `aggfunc`s), then for each `func` in the list, `groupby().func().unstack()` is called and the resulting list of DataFrames is concatenated afterwards. Meanwhile, `groupby().agg()` first tries to call a cython-optimized method and falls back to a loop only as a last resort.
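To make that concrete, here is a rough sketch of the difference on a made-up toy frame (the column names mirror the question's `INDEX`/`COLUMN`/`VALUE`; this is an illustration of the idea, not the actual pandas internals):

```python
import numpy as np
import pandas as pd

# Toy frame for illustration; names mirror the question's schema.
rng = np.random.default_rng(0)
df = pd.DataFrame({'INDEX': rng.choice(50, size=1000),
                   'COLUMN': rng.choice(4, size=1000),
                   'VALUE': rng.random(1000)})

# Roughly what pivot_table does with a list of aggfuncs: one full
# groupby -> aggregate -> unstack pass per function, concatenated at the end.
pieces = []
for func in ['sum', 'size']:
    piece = df.groupby(['INDEX', 'COLUMN'])['VALUE'].agg(func).unstack('COLUMN')
    pieces.append(piece)
per_func = pd.concat(pieces, axis=1, keys=['sum', 'size']).fillna(0)

# groupby().agg() computes all requested aggregations in a single pass instead.
one_pass = (df.groupby(['INDEX', 'COLUMN'])['VALUE']
              .agg(['sum', 'size'])
              .unstack('COLUMN', fill_value=0))
```

Both produce the same numbers; the per-function loop just pays the groupby/unstack cost once per aggregator.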
So if the functions in `aggfunc` are all cython-optimized, such as `'sum'` or `'size'`, `groupby().agg()` will perform as many times faster than `pivot_table()` as there are functions in `aggfunc`. In particular, for a single aggregator function they will perform about the same (although I imagine `pivot_table()` will still be slightly slower, since it has a larger overhead).
However, if the functions are not cython-optimized, then since both call each function in a loop, they will perform about the same. N.B. `groupby().agg().unstack()` calls `unstack()` only once, while `pivot_table()` calls it `len(aggfunc)` times; so naturally `pivot_table()` will still be slightly slower.
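As a quick sanity check that the two spellings really compute the same table (on a made-up frame, using `'sum'` and `'mean'` as the aggregators):

```python
import numpy as np
import pandas as pd

# Made-up frame for the sanity check; names follow the question's schema.
rng = np.random.default_rng(42)
df = pd.DataFrame({'INDEX': rng.choice(100, size=2000),
                   'COLUMN': rng.choice(8, size=2000),
                   'VALUE': rng.random(2000).round(2)})

via_groupby = (df.groupby(['INDEX', 'COLUMN'])['VALUE']
                 .agg(['sum', 'mean'])
                 .unstack(level='COLUMN', fill_value=0))
via_pivot = df.pivot_table(index='INDEX', columns='COLUMN', values='VALUE',
                           aggfunc=['sum', 'mean'], fill_value=0)

# Same shape and the same numbers; only metadata such as
# column-level names may differ between the two results.
```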
A demonstration of this in code is as follows:
import numpy as np
import pandas as pd

def groupby_unstack(funcs):
    return df.groupby(['INDEX', 'COLUMN'])['VALUE'].agg(funcs).unstack(level='COLUMN', fill_value=0)

def pivot_table_(funcs):
    return df.pivot_table(index='INDEX', columns='COLUMN', values='VALUE', aggfunc=funcs, fill_value=0)

def get_df(k):
    # k rows, ~k/2 unique INDEX values, 16 unique COLUMN values
    rng = np.random.default_rng()
    return pd.DataFrame({'INDEX': rng.choice(k // 2, size=k),
                         'COLUMN': rng.choice(16, size=k),
                         'VALUE': rng.random(k).round(2)})
As can be seen from the benchmarks below, the gap between `groupby().agg().unstack()` and `pivot_table()` widens as the number of aggregator functions increases. For a single aggregator function they perform about the same, but for two functions `pivot_table()` is about twice as slow, for three functions about three times as slow, etc.
df = get_df(800_000)
cython_funcs1 = ['sum', 'size']
%timeit groupby_unstack(cython_funcs1)
# 1.41 s ± 35.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit pivot_table_(cython_funcs1)
# 3.51 s ± 263 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
cython_funcs2 = ['sum', 'size', 'mean']
%timeit groupby_unstack(cython_funcs2)
# 1.63 s ± 16.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit pivot_table_(cython_funcs2)
# 5.08 s ± 57 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
cython_funcs3 = ['median']
%timeit groupby_unstack(cython_funcs3)
# 1.17 s ± 92.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit pivot_table_(cython_funcs3)
# 1.84 s ± 70.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
For non-cython-optimized functions, `groupby().agg().unstack()` and `pivot_table()` perform about the same even for multiple aggregator functions, because both loop over the list of functions under the hood.
df = get_df(80_000)
funcs = [lambda x: list(x.mode()), lambda x: x.nunique()**2]
%timeit groupby_unstack(funcs)
# 26.6 s ± 5.99 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit pivot_table_(funcs)
# 27.2 s ± 6.46 s per loop (mean ± std. dev. of 7 runs, 1 loop each)