
Why is numpy sum 10 times slower than the + operator?

I noticed that, very strangely, np.sum is 10x slower than a hand-written sum.

np.sum with axis:

p1 = np.random.rand(10000, 2)
def test(p1):
    return p1.sum(axis=1)
%timeit test(p1)

186 µs ± 4.21 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

np.sum without axis:

p1 = np.random.rand(10000, 2)
def test(p1):
    return p1.sum()
%timeit test(p1)

17.9 µs ± 236 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

+:

p1 = np.random.rand(10000, 2)
def test(p1):
    return p1[:,0] + p1[:,1]
%timeit test(p1)

15.8 µs ± 328 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Multiplication:

p1 = np.random.rand(10000, 2)
def test(p1):
    return p1[:,0]*p1[:,1]
%timeit test(p1)

15.7 µs ± 701 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

I don't see any reason for this. Any idea why? My numpy version is 1.15.3.

EDIT: with 10000000:

np.sum (with axis): 202 ms (5 x)
np.sum (without axis): 12 ms
+ : 46 ms (1 x)
* : 44.3 ms 

So I guess there is some overhead at play, to some extent...

asked Oct 31 '18 by beesleep



1 Answer

The main difference is the larger overhead when a.sum(axis=1) is calculated. Calculating a reduction (in this case a sum) is not a trivial matter:

  • one has to take round-off errors into account, and thus NumPy uses pairwise summation to reduce them (a simplified sketch follows this list)
  • tiling is important for bigger arrays, as it makes the most out of the available cache
  • in order to use the SIMD instructions and out-of-order-execution abilities of modern CPUs, one should calculate multiple rows in parallel
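
To illustrate the first point, here is a simplified Python sketch of the idea behind pairwise summation. This is not NumPy's actual C implementation, just the recursive scheme: below a cut-off, accumulate naively; otherwise split the range in half and recurse, which keeps the accumulated round-off error small.

import numpy as np

def pairwise_sum(x, block=128):
    # Below the block size, accumulate naively ...
    n = len(x)
    if n <= block:
        total = 0.0
        for v in x:
            total += v
        return total
    # ... otherwise split the range in half and recurse.
    mid = n // 2
    return pairwise_sum(x[:mid], block) + pairwise_sum(x[mid:], block)

x = np.random.rand(10000)
print(pairwise_sum(x), x.sum())  # the two results agree closely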

I have discussed the topics above in more detail, for example here and here.

However, all of this is neither needed nor better than a naive summation if there are only two elements to add: you get the same result, but with much less overhead and faster.
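
A quick sanity check (using np.allclose to stay on the safe side) that the elementwise + gives the same values as the reduction along axis=1 when each row has only two elements:

import numpy as np

p1 = np.random.rand(10000, 2)
# With only two elements per row, the elementwise + produces the same
# values as the reduction along axis=1.
print(np.allclose(p1.sum(axis=1), p1[:, 0] + p1[:, 1]))  # True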

For only 1000 elements, the overhead of calling numpy functionality is probably higher than actually doing the 1000 additions (or multiplications, for that matter, because on modern CPUs pipelined additions and multiplications have the same cost). As you can see, for 10^4 the running time is only about 2 times higher, a sure sign that overhead plays a bigger role for 10^3! In this answer the impact of overhead and cache misses is investigated in more detail.
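
A quick way to see this effect yourself is to time both variants for growing sizes. The exact numbers depend on the machine; this is just a rough sketch, not the timings from the question.

import numpy as np
from timeit import timeit

# Rough benchmark sketch: for small n the fixed per-call overhead dominates,
# for larger n the gap between sum(axis=1) and the + operator narrows.
for n in (10**3, 10**4, 10**5, 10**6):
    a = np.random.rand(n, 2)
    t_sum = timeit(lambda: a.sum(axis=1), number=100) / 100
    t_add = timeit(lambda: a[:, 0] + a[:, 1], number=100) / 100
    print(f"n={n:>8}: sum(axis=1) {t_sum*1e6:9.1f} us   + {t_add*1e6:9.1f} us")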

Let's take a look at the profiler results to see whether the theory above holds (I use perf):

For a.sum(axis=1):

  17,39%  python   umath.cpython-36m-x86_64-linux-gnu.so       [.] reduce_loop
  11,41%  python   umath.cpython-36m-x86_64-linux-gnu.so       [.] pairwise_sum_DOUBLE
   9,78%  python   multiarray.cpython-36m-x86_64-linux-gnu.so  [.] npyiter_buffered_reduce_iternext_ite
   9,24%  python   umath.cpython-36m-x86_64-linux-gnu.so       [.] DOUBLE_add
   4,35%  python   python3.6                                   [.] _PyEval_EvalFrameDefault
   2,17%  python   multiarray.cpython-36m-x86_64-linux-gnu.so  [.] _aligned_strided_to_contig_size8_src
   2,17%  python   python3.6                                   [.] lookdict_unicode_nodummy
   ...

The overhead of reduce_loop and pairwise_sum_DOUBLE dominates.

For a[:,0]+a[:,1]:

   7,24%  python   python3.6                                   [.] _PyEval_EvalF
   5,26%  python   python3.6                                   [.] PyObject_Mall
   3,95%  python   python3.6                                   [.] visit_decref
   3,95%  python   umath.cpython-36m-x86_64-linux-gnu.so       [.] DOUBLE_add
   2,63%  python   python3.6                                   [.] PyDict_SetDef
   2,63%  python   python3.6                                   [.] _PyTuple_Mayb
   2,63%  python   python3.6                                   [.] collect
   2,63%  python   python3.6                                   [.] fast_function
   2,63%  python   python3.6                                   [.] visit_reachab
   1,97%  python   python3.6                                   [.] _PyObject_Gen

As expected: Python overhead plays a big role, and a simple DOUBLE_add is used.


There is less overhead when calling a.sum():

  • for one, reduce_loop isn't called for every row but only once, which means considerably less overhead.
  • no new result array is created, so there is no longer a need to write 1000 doubles to memory.

So it can be expected that a.sum() is faster (despite the fact that 2000 rather than 1000 additions must be made; but as we have seen, it is mostly about overhead and not the actual work: the additions aren't responsible for a big share of the running time).


Data obtained by running:

perf record python run.py
perf report

and

#run.py
import numpy as np
a=np.random.rand(1000,2)

for _ in range(10000):
  a.sum(axis=1)
  #a[:,0]+a[:,1]  # swap the comment with the line above to profile the + variant

answered Oct 12 '22 by ead