
Why is numpy's einsum faster than numpy's built in functions?

Let's start with some arrays of dtype=np.double. The timings were performed on an Intel CPU using numpy 1.7.1 compiled with icc and linked to Intel's MKL. An AMD CPU with numpy 1.6.1 compiled with gcc and without MKL was also used to verify the timings. Please note that the timings scale nearly linearly with system size; they are not due to the small overhead incurred by the if statements inside the numpy functions, since those differences show up in microseconds, not milliseconds:

arr_1D = np.arange(500, dtype=np.double)
large_arr_1D = np.arange(100000, dtype=np.double)
arr_2D = np.arange(500**2, dtype=np.double).reshape(500, 500)
arr_3D = np.arange(500**3, dtype=np.double).reshape(500, 500, 500)

First let's look at the np.sum function:

np.all(np.sum(arr_3D) == np.einsum('ijk->', arr_3D))
True

%timeit np.sum(arr_3D)
10 loops, best of 3: 142 ms per loop

%timeit np.einsum('ijk->', arr_3D)
10 loops, best of 3: 70.2 ms per loop

Powers:

np.allclose(arr_3D*arr_3D*arr_3D, np.einsum('ijk,ijk,ijk->ijk', arr_3D, arr_3D, arr_3D))
True

%timeit arr_3D*arr_3D*arr_3D
1 loops, best of 3: 1.32 s per loop

%timeit np.einsum('ijk,ijk,ijk->ijk', arr_3D, arr_3D, arr_3D)
1 loops, best of 3: 694 ms per loop

Outer product:

np.all(np.outer(arr_1D, arr_1D) == np.einsum('i,k->ik', arr_1D, arr_1D))
True

%timeit np.outer(arr_1D, arr_1D)
1000 loops, best of 3: 411 us per loop

%timeit np.einsum('i,k->ik', arr_1D, arr_1D)
1000 loops, best of 3: 245 us per loop

All of the above are roughly twice as fast with np.einsum. These should be apples-to-apples comparisons, as everything is specifically of dtype=np.double. I would expect a speedup in a fused operation like this, where einsum avoids allocating the intermediate product array:

np.allclose(np.sum(arr_2D*arr_3D), np.einsum('ij,oij->', arr_2D, arr_3D))
True

%timeit np.sum(arr_2D*arr_3D)
1 loops, best of 3: 813 ms per loop

%timeit np.einsum('ij,oij->', arr_2D, arr_3D)
10 loops, best of 3: 85.1 ms per loop

Einsum seems to be at least twice as fast for np.inner, np.outer, np.kron, and np.sum, regardless of axes selection. The primary exception is np.dot, as it calls DGEMM from a BLAS library. So why is np.einsum faster than other numpy functions that are equivalent?
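The equivalences claimed above can be spelled out explicitly. As a small sketch (the array names and sizes here are illustrative, not the timing arrays from above), each of those numpy functions has a direct einsum subscript form:

```python
import numpy as np

a = np.arange(5, dtype=np.double)
b = np.arange(4, dtype=np.double)
M = np.arange(9, dtype=np.double).reshape(3, 3)

# np.inner: the shared index is contracted away completely
inner = np.einsum('i,i->', a, a)

# np.outer: two free indices, no contraction
outer = np.einsum('i,k->ik', a, b)

# np.kron of 1-D arrays: an outer product, flattened
kron = np.einsum('i,j->ij', a, b).ravel()

# np.sum over all axes: every index appears only on the input side
total = np.einsum('ij->', M)

assert np.allclose(inner, np.inner(a, a))
assert np.allclose(outer, np.outer(a, b))
assert np.allclose(kron, np.kron(a, b))
assert total == M.sum()
```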

The DGEMM case for completeness:

np.allclose(np.dot(arr_2D, arr_2D), np.einsum('ij,jk', arr_2D, arr_2D))
True

%timeit np.einsum('ij,jk', arr_2D, arr_2D)
10 loops, best of 3: 56.1 ms per loop

%timeit np.dot(arr_2D, arr_2D)
100 loops, best of 3: 5.17 ms per loop
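As an aside for readers on newer NumPy (1.12 and later, so well after the versions timed here), einsum grew an optimize keyword that lets it rewrite a contraction and dispatch to BLAS where possible, which narrows exactly this gap. A minimal sketch:

```python
import numpy as np

A = np.arange(200.0 * 200).reshape(200, 200)
B = A.T.copy()

# Plain einsum runs its own C loops; optimize=True (NumPy >= 1.12)
# lets it route a matrix-matrix contraction through BLAS.
plain = np.einsum('ij,jk', A, B)
opt = np.einsum('ij,jk', A, B, optimize=True)

assert np.allclose(plain, np.dot(A, B))
assert np.allclose(opt, np.dot(A, B))
```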

The leading theory is from @seberg's comment that np.einsum can make use of SSE2, while numpy's ufuncs will not until numpy 1.8 (see the change log). I believe this is the correct answer, but have not been able to confirm it. Some limited evidence can be found by changing the dtype of the input arrays and observing the speed difference, and from the fact that not everyone observes the same trends in timings.
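One way to probe that dtype sensitivity is sketched below (the sizes, dtypes, and repeat count are arbitrary choices, not taken from the question). If einsum's edge comes from SSE2-vectorized loops for some dtypes only, the sum/einsum gap should vary as the dtype changes:

```python
import timeit

import numpy as np

# Time the same full reduction under several dtypes and compare.
for dtype in (np.float64, np.float32, np.int64):
    arr = np.arange(10**6).astype(dtype)
    t_sum = timeit.timeit(lambda: np.sum(arr), number=20)
    t_ein = timeit.timeit(lambda: np.einsum('i->', arr), number=20)
    print('%-8s sum: %.4fs  einsum: %.4fs' % (dtype.__name__, t_sum, t_ein))
```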

Asked Aug 21 '13 by Daniel



1 Answer

First off, there's been a lot of past discussion about this on the numpy list. For example, see:

http://numpy-discussion.10968.n7.nabble.com/poor-performance-of-sum-with-sub-machine-word-integer-types-td41.html
http://numpy-discussion.10968.n7.nabble.com/odd-performance-of-sum-td3332.html

Some of it boils down to the fact that einsum is new, and is presumably trying to be better about cache alignment and other memory access issues, while many of the older numpy functions focus on an easily portable implementation over a heavily optimized one. I'm just speculating there, though.


However, some of what you're doing isn't quite an "apples-to-apples" comparison.

In addition to what @Jamie already said, sum is more careful about checking the type of the input and using an appropriate accumulator for small integer dtypes. For example, consider the following:

In [1]: x = 255 * np.ones(100, dtype=np.uint8)

In [2]: x
Out[2]:
array([255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
       255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
       255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
       255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
       255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
       255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
       255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
       255, 255, 255, 255, 255, 255, 255, 255, 255], dtype=uint8)

Note that the sum is correct:

In [3]: x.sum()
Out[3]: 25500

while einsum will give the wrong result:

In [4]: np.einsum('i->', x)
Out[4]: 156

But if we use a less limited dtype, we'll still get the result you'd expect:

In [5]: y = 255 * np.ones(100)

In [6]: np.einsum('i->', y)
Out[6]: 25500.0
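If you do want einsum's speed with a narrow integer input, einsum also accepts a dtype argument that widens the accumulation (a small sketch; casting uint8 up to int64 falls under einsum's default 'safe' casting rule):

```python
import numpy as np

x = 255 * np.ones(100, dtype=np.uint8)

# By default einsum accumulates in the input dtype, so uint8 wraps around:
assert np.einsum('i->', x) == 156

# Asking for a wider dtype restores the exact sum:
assert np.einsum('i->', x, dtype=np.int64) == 25500
```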
Answered Sep 26 '22 by Joe Kington