I noticed, very strangely, that np.sum is 10x slower than a hand-written sum.
np.sum with axis:
p1 = np.random.rand(10000, 2)
def test(p1):
    return p1.sum(axis=1)
%timeit test(p1)
186 µs ± 4.21 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
np.sum without axis:
p1 = np.random.rand(10000, 2)
def test(p1):
    return p1.sum()
%timeit test(p1)
17.9 µs ± 236 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
+:
p1 = np.random.rand(10000, 2)
def test(p1):
    return p1[:,0] + p1[:,1]
%timeit test(p1)
15.8 µs ± 328 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Multiplication:
p1 = np.random.rand(10000, 2)
def test(p1):
    return p1[:,0]*p1[:,1]
%timeit test(p1)
15.7 µs ± 701 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
I don't see any reason for this. Any idea why? My numpy version is 1.15.3.
EDIT: with 10000000:
np.sum (with axis): 202 ms (5 x)
np.sum (without axis): 12 ms
+ : 46 ms (1 x)
* : 44.3 ms
So I guess there is some overhead playing a role, to some extent...
The main difference is the larger overhead when a.sum(axis=1) is calculated. Calculating a reduction (in this case sum) is not a trivial matter: among other things, NumPy has to keep rounding errors in check, which is why it uses pairwise summation (see pairwise_sum_DOUBLE in the profile below). I have discussed the topics above in more detail, for example here and here.
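As a small side illustration of why that machinery exists (a sketch of mine with arbitrary sizes, not part of the benchmark): pairwise summation keeps rounding errors much smaller than a plain left-to-right accumulation, which is easy to see with float32 data:
import numpy as np
# one million float32 values of 0.1; the exact sum is 100000
a = np.full(10**6, 0.1, dtype=np.float32)
# plain left-to-right accumulation in float32 (np.cumsum is sequential),
# so rounding errors pile up as the running sum grows
naive = np.cumsum(a)[-1]
# a.sum() reduces with pairwise summation and stays much closer to 100000
pairwise = a.sum()
print("naive   :", naive)     # noticeably off from 100000
print("pairwise:", pairwise)  # approximately 100000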
However, none of this is needed, and it is no better than a naive summation when there are only two elements to add per row - you get the same result, just with much less overhead and therefore faster.
For only 1000 elements, the overhead of calling NumPy functionality is probably higher than actually doing those 1000 additions (or multiplications, for that matter, because on modern CPUs pipelined additions and multiplications have the same cost). As you can see, for 10^4 elements the running time is only about 2 times higher - a sure sign that overhead plays a bigger role for 10^3! In this answer the impact of overhead and cache misses is investigated in more detail.
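A rough way to see this overhead effect (a sketch with arbitrarily chosen sizes and repeat counts, not the exact benchmark from the question): time both variants for growing arrays - the ratio shrinks as the actual work starts to dominate the fixed call overhead:
import timeit
import numpy as np

def bench(n, repeat):
    a = np.random.rand(n, 2)
    # both variants compute the same row sums
    assert np.allclose(a.sum(axis=1), a[:, 0] + a[:, 1])
    t_axis = timeit.timeit(lambda: a.sum(axis=1), number=repeat) / repeat
    t_plus = timeit.timeit(lambda: a[:, 0] + a[:, 1], number=repeat) / repeat
    print(f"n={n:>10,}: sum(axis=1) {t_axis*1e6:10.1f} µs | "
          f"a[:,0]+a[:,1] {t_plus*1e6:10.1f} µs | ratio {t_axis/t_plus:4.1f}")

bench(10**3, repeat=10000)
bench(10**4, repeat=1000)
bench(10**7, repeat=20)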
Let's take a look at the profiler results to see whether the theory above holds (I use perf).
For a.sum(axis=1):
17,39% python umath.cpython-36m-x86_64-linux-gnu.so [.] reduce_loop
11,41% python umath.cpython-36m-x86_64-linux-gnu.so [.] pairwise_sum_DOUBLE
9,78% python multiarray.cpython-36m-x86_64-linux-gnu.so [.] npyiter_buffered_reduce_iternext_ite
9,24% python umath.cpython-36m-x86_64-linux-gnu.so [.] DOUBLE_add
4,35% python python3.6 [.] _PyEval_EvalFrameDefault
2,17% python multiarray.cpython-36m-x86_64-linux-gnu.so [.] _aligned_strided_to_contig_size8_src
2,17% python python3.6 [.] lookdict_unicode_nodummy
...
The overhead of using reduce_loop and pairwise_sum_DOUBLE dominates.
For a[:,0]+a[:,1]:
7,24% python python3.6 [.] _PyEval_EvalF
5,26% python python3.6 [.] PyObject_Mall
3,95% python python3.6 [.] visit_decref
3,95% python umath.cpython-36m-x86_64-linux-gnu.so [.] DOUBLE_add
2,63% python python3.6 [.] PyDict_SetDef
2,63% python python3.6 [.] _PyTuple_Mayb
2,63% python python3.6 [.] collect
2,63% python python3.6 [.] fast_function
2,63% python python3.6 [.] visit_reachab
1,97% python python3.6 [.] _PyObject_Gen
As expected: Python overhead plays a big role; a simple DOUBLE_add is used.
There is less overhead when calling a.sum(): reduce_loop isn't called for every row but only once, which means considerably less overhead. So it can be expected that a.sum() is faster (despite the fact that 2000 and not 1000 additions must be made - but as we have seen, it is mostly about overhead and not the actual work; the additions aren't responsible for the big share of the running time).
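To sanity-check that claim on the same 1000x2 array (again just a sketch; the exact numbers vary by machine), the single full reduction should come out well ahead of the per-row reduction even though it performs twice as many additions:
import timeit
import numpy as np

a = np.random.rand(1000, 2)

# one reduction over all 2000 elements: reduce_loop is entered only once
t_full = timeit.timeit(lambda: a.sum(), number=100000) / 100000
# 1000 tiny reductions of 2 elements each: far more setup work per element
t_rows = timeit.timeit(lambda: a.sum(axis=1), number=100000) / 100000

print(f"a.sum():       {t_full*1e6:6.1f} µs")
print(f"a.sum(axis=1): {t_rows*1e6:6.1f} µs")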
The profiling data above was obtained by running:
perf record python run.py
perf report
and
#run.py
import numpy as np
a=np.random.rand(1000,2)
for _ in range(10000):
    a.sum(axis=1)
    #a[:,0]+a[:,1]