
Huge speed difference in numpy between similar code

Why is there such a large speed difference between the following L2 norm calculations:

a = np.arange(1200.0).reshape((-1,3))

%timeit [np.sqrt((a*a).sum(axis=1))]
100000 loops, best of 3: 12 µs per loop

%timeit [np.sqrt(np.dot(x,x)) for x in a]
1000 loops, best of 3: 814 µs per loop

%timeit [np.linalg.norm(x) for x in a]
100 loops, best of 3: 2 ms per loop

All three produce identical results as far as I can see.
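The equivalence can be checked directly; a minimal sketch comparing the three row-norm computations with `np.allclose`:

```python
import numpy as np

a = np.arange(1200.0).reshape((-1, 3))

# compute the L2 norm of each row three ways
norms_vectorized = np.sqrt((a * a).sum(axis=1))
norms_dot = np.array([np.sqrt(np.dot(x, x)) for x in a])
norms_linalg = np.array([np.linalg.norm(x) for x in a])

# all three agree to floating-point precision
assert np.allclose(norms_vectorized, norms_dot)
assert np.allclose(norms_vectorized, norms_linalg)
```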

Here's the source code for the numpy.linalg.norm function:

x = asarray(x)

# Check the default case first and handle it immediately.
if ord is None and axis is None:
    x = x.ravel(order='K')
    if isComplexType(x.dtype.type):
        sqnorm = dot(x.real, x.real) + dot(x.imag, x.imag)
    else:
        sqnorm = dot(x, x)
    return sqrt(sqnorm)
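In that default path (`ord=None`, `axis=None`) the array is flattened and the whole computation reduces to a single dot product. A quick sketch confirming this against the snippet above:

```python
import numpy as np

a = np.arange(1200.0).reshape((-1, 3))

# with ord=None and axis=None, norm ravels the array and takes
# sqrt(dot(x, x)) over the flattened data, as in the source above
flat = a.ravel(order='K')
assert np.isclose(np.linalg.norm(a), np.sqrt(np.dot(flat, flat)))
```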

EDIT: Someone suggested that one version could be parallelized, but I checked and that's not the case. All three versions consume 12.5% of CPU (as is usually the case with Python code on my 4 physical / 8 virtual core Xeon CPU).

asked Sep 27 '22 by MichaelSB


1 Answer

np.dot will usually call a BLAS library function - its speed will thus depend on which BLAS library your version of numpy is linked against. In general I would expect it to have a greater constant overhead, but to scale much better as the array size increases. However, the fact that you are calling it from within a list comprehension (effectively a normal Python for loop) will likely negate any performance benefits of using BLAS.

If you get rid of the list comprehension and use the axis= kwarg, np.linalg.norm is comparable to your first example, but np.einsum is much faster than both:

In [1]: %timeit np.sqrt((a*a).sum(axis=1))
The slowest run took 10.12 times longer than the fastest. This could mean that an intermediate result is being cached 
100000 loops, best of 3: 11.1 µs per loop

In [2]: %timeit np.linalg.norm(a, axis=1)
The slowest run took 14.63 times longer than the fastest. This could mean that an intermediate result is being cached 
100000 loops, best of 3: 13.5 µs per loop

# this is what np.linalg.norm does internally
In [3]: %timeit np.sqrt(np.add.reduce(a * a, axis=1))
The slowest run took 34.05 times longer than the fastest. This could mean that an intermediate result is being cached 
100000 loops, best of 3: 10.7 µs per loop

In [4]: %timeit np.sqrt(np.einsum('ij,ij->i',a,a))
The slowest run took 5.55 times longer than the fastest. This could mean that an intermediate result is being cached 
100000 loops, best of 3: 5.42 µs per loop
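The einsum subscripts `'ij,ij->i'` multiply the two operands elementwise and sum over `j` in a single pass, which avoids allocating the temporary array that `a * a` creates. A short sketch showing it matches the other row-wise computations:

```python
import numpy as np

a = np.arange(1200.0).reshape((-1, 3))

# 'ij,ij->i': elementwise product summed over j, one row-wise
# squared norm per row, computed without an intermediate a * a
row_sq = np.einsum('ij,ij->i', a, a)
assert np.allclose(row_sq, (a * a).sum(axis=1))
assert np.allclose(np.sqrt(row_sq), np.linalg.norm(a, axis=1))
```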
answered Oct 17 '22 by ali_m