 

Pandas vectorised function cumsum versus numpy

While answering the question Vectorize calculation of a Pandas Dataframe, I noticed an interesting issue regarding performance.

I was under the impression that functions such as df.min(), df.mean(), df.cumsum(), etc, are vectorised. However, I am seeing a massive discrepancy in performance between df.cumsum() and a numpy alternative.

Given pandas uses numpy arrays in its infrastructure, I expected performance to be closer. I tried investigating the source code for df.cumsum() but found it intractable. Can someone explain why it is so much slower?

As seen in the answer by @HYRY, the issue reduces to the question of why the following two commands give such a huge discrepancy in timings:

import pandas as pd, numpy as np
df_a = pd.DataFrame(np.arange(1,1000*1000+1).reshape(1000,1000))

%timeit pd.DataFrame(np.nancumsum(df_a.values))    #  4.18 ms
%timeit df_a.cumsum()                              # 15.7  ms

(Timing performed by one of the commenters, since my numpy v1.11 does not have nancumsum.)

asked Mar 06 '23 by jpp

2 Answers

There seem to be a couple of things worth noting here.

First, df_a.cumsum() defaults to axis=0 (Pandas has no concept of summing the whole DataFrame in one call), while the NumPy call defaults to axis=None. So by specifying an axis on one operation and effectively flattening the other, you're comparing apples to oranges.
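The default-axis difference can be demonstrated on a tiny frame (a minimal sketch, showing only shapes and values, no timings):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]])

# NumPy's top-level cumsum flattens by default (axis=None)...
flat = np.cumsum(df.values)

# ...while pandas' cumsum defaults to axis=0 (down each column).
cols = df.cumsum()

print(flat)          # [ 1  3  6 10]
print(cols.values)   # [[1 2]
                     #  [4 6]]
```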

That said, there are three calls that you could compare:

>>> np.cumsum(df_a, axis=0)
>>> df_a.cumsum()
>>> val.cumsum(axis=0)  # val = df_a.values

where, in the final call, val is the underlying NumPy array and we don't count getting the .values attribute in runtime.

So, if you're working in an IPython shell, give profiling with %prun a try:

>>> %prun -q -T pdcumsum.txt df_a.cumsum()

>>> val = df_a.values
>>> %prun -q -T ndarraycumsum.txt val.cumsum(axis=0)

>>> %prun -q -T df_npcumsum.txt np.cumsum(df_a, axis=0)

-T saves the output to text so that you can view all three matched up with one another. Here's what you end up with:

  • df_a.cumsum(): 186 function calls, 0.022 seconds. 0.013 of that is spent on numpy.ndarray.cumsum(). (My guess is that if there are no NaNs, then nancumsum() isn't needed, but please don't quote me on that.) Another chunk is spent on copying the array.
  • val.cumsum(axis=0): 5 function calls, 0.020 seconds. No copy is made (although this isn't an in-place operation).
  • np.cumsum(df_a, axis=0): 204 function calls, 0.026 seconds. Suffice it to say that passing a Pandas object to a top-level NumPy function seems to eventually invoke the equivalent method on the Pandas object, which goes through a whole bunch of overhead and then re-calls the NumPy function.

Now, unlike %timeit, you're only making one call here (as you would with %time), so I wouldn't lean too heavily on the relative timing differences from %prun; comparing the internal function calls is what's useful. But in this case, when you specify the same axis for both, the timing differences aren't actually that drastic, even though the number of calls made by Pandas dwarfs that of NumPy. In other words, the time of all three calls is dominated by np.ndarray.cumsum(), and the ancillary Pandas calls don't eat up much time here. There are other instances where those ancillary calls do eat up a lot more runtime, but this doesn't seem to be one of them.
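For a fairer apples-to-apples comparison, here's a sketch of timing all three calls with the same axis via the timeit module (absolute numbers will vary by machine and library versions, so treat them as illustrative only):

```python
import timeit
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(1, 1000 * 1000 + 1).reshape(1000, 1000))
val = df.values  # underlying ndarray, extracted outside the timed region

# Time each call with the same axis; all three compute identical results.
for label, stmt in [
    ("df.cumsum()          ", lambda: df.cumsum()),
    ("val.cumsum(axis=0)   ", lambda: val.cumsum(axis=0)),
    ("np.cumsum(df, axis=0)", lambda: np.cumsum(df, axis=0)),
]:
    t = timeit.timeit(stmt, number=50)
    print(f"{label}: {t / 50 * 1e3:.2f} ms per call")
```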

Big picture: as acknowledged by Wes McKinney,

Fairly simple operations, from indexing to summary statistics, may pass through multiple layers of scaffolding before hitting the lowest tier of computations.

with the tradeoff, you could argue, being flexibility and increased functionality.

One last detail: within NumPy, you can avoid a tiny bit of overhead by calling the instance method ndarray.cumsum() rather than the top-level function np.cumsum(), because the latter just ends up routing to the former. But as a wise man once said, premature optimization is the root of all evil.
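As a quick sanity check that the top-level function is just a thin wrapper, you can verify that both spellings produce the same array (a minimal sketch):

```python
import numpy as np

a = np.arange(1, 7).reshape(2, 3)

# np.cumsum(a, axis=0) routes to the instance method a.cumsum(axis=0);
# the results are identical, only the dispatch overhead differs.
top = np.cumsum(a, axis=0)
meth = a.cumsum(axis=0)
print(np.array_equal(top, meth))  # True
```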


For reference:

>>> pd.__version__, np.__version__
('0.22.0', '1.14.0')
answered Mar 15 '23 by Brad Solomon


Pandas can deal with NaN, which is why its cumsum takes the slower NaN-aware path; you can check the cost of that by comparing:

a = np.random.randn(1000000)
%timeit np.nancumsum(a)
%timeit np.cumsum(a)

outputs:

9.02 ms ± 189 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
4.37 ms ± 18.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
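The NaN handling that pandas pays for can also be seen directly in the results (a small sketch; note that pandas skips NaN but leaves it in place, whereas np.nancumsum treats it as zero):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 2.0])

# Plain np.cumsum propagates NaN from the first occurrence onward...
print(np.cumsum(s.values))     # [ 1. nan nan]

# ...np.nancumsum treats NaN as zero...
print(np.nancumsum(s.values))  # [1. 1. 3.]

# ...while pandas skips NaN but keeps it in place in the output.
print(s.cumsum().tolist())     # [1.0, nan, 3.0]
```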
answered Mar 15 '23 by HYRY