Numpy performance differences depending on numerical values

I found a strange performance difference while evaluating an expression in Numpy.

I executed the following code:

import numpy as np
myarr = np.random.uniform(-1,1,[1100,1100])

and then

%timeit np.exp( - 0.5 * (myarr / 0.001)**2 )
>> 184 ms ± 301 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

and

%timeit np.exp( - 0.5 * (myarr / 0.1)**2 )
>> 12.3 ms ± 34.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

That's an almost 15x faster computation in the second case! Note that the only difference is the factor being 0.1 or 0.001.

What's the reason for this behaviour? Can I change something to make the first calculation as fast as the second?

asked Nov 21 '18 by Ethunxxx



2 Answers

Use Intel SVML

I don't have a numexpr build with working Intel SVML at hand, but numexpr with working SVML should perform about as well as Numba. The Numba benchmarks below show much the same behaviour without SVML, but improve substantially with SVML.
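As a side note, whether Numba has actually picked up SVML can be checked with numba -s, which prints an SVML section in its system report; the NUMBA_DISABLE_INTEL_SVML=1 environment variable used in the timings below switches it off.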

Code

import numpy as np
import numba as nb

myarr = np.random.uniform(-1, 1, [1100, 1100])

@nb.njit(error_model="numpy", parallel=True)
def func(arr, div):
    # Same expression as in the question, applied to the function argument
    return np.exp(-0.5 * (arr / div)**2)
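
For reference, a minimal way to reproduce the timings below (a sketch; the first call triggers JIT compilation, which the numbers below exclude):

import timeit

func(myarr, 0.1)  # warm-up call so compilation time is not measured

for div in (0.1, 0.001):
    t = timeit.timeit(lambda: func(myarr, div), number=100) / 100
    print("div=%s: %.2f ms" % (div, t * 1e3))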

Timings

#Core i7 4771
#Windows 7 x64
#Anaconda Python 3.5.5
#Numba 0.41 (compilation overhead excluded)
func(myarr,0.1)                      -> 3.6ms
func(myarr,0.001)                    -> 3.8ms

#Numba (set NUMBA_DISABLE_INTEL_SVML=1), parallel=True
func(myarr,0.1)                      -> 5.19ms
func(myarr,0.001)                    -> 12.0ms

#Numba (set NUMBA_DISABLE_INTEL_SVML=1), parallel=False
func(myarr,0.1)                      -> 16.7ms
func(myarr,0.001)                    -> 63.2ms

#Numpy (1.13.3), set OMP_NUM_THREADS=4
np.exp( - 0.5 * (myarr / 0.001)**2 ) -> 70.82ms
np.exp( - 0.5 * (myarr / 0.1)**2 )   -> 12.58ms

#Numpy (1.13.3), set OMP_NUM_THREADS=1
np.exp( - 0.5 * (myarr / 0.001)**2 ) -> 189.4ms
np.exp( - 0.5 * (myarr / 0.1)**2 )   -> 17.4ms

#Numexpr (2.6.8), no SVML, parallel
ne.evaluate("exp( - 0.5 * (myarr / 0.001)**2 )") -> 17.2ms
ne.evaluate("exp( - 0.5 * (myarr / 0.1)**2 )")   -> 4.38ms

#Numexpr (2.6.8), no SVML, single threaded
ne.evaluate("exp( - 0.5 * (myarr / 0.001)**2 )") -> 50.85ms
ne.evaluate("exp( - 0.5 * (myarr / 0.1)**2 )")   -> 13.9ms
answered Sep 28 '22 by max9111


With the 0.001 divisor, the argument of exp becomes very large and negative, so the results underflow. This produces denormalised (subnormal) numbers, and computations involving them are much slower on most CPUs.
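
A quick way to confirm this (a minimal sketch; np.finfo(np.float64).tiny is the smallest normal double, so any non-zero result below it is subnormal):

import numpy as np

myarr = np.random.uniform(-1, 1, [1100, 1100])
res = np.exp(-0.5 * (myarr / 0.001)**2)

# Non-zero results below the smallest normal double are subnormal
tiny = np.finfo(np.float64).tiny
print("subnormal results:", np.count_nonzero((res != 0) & (res < tiny)))
print("results underflowed to zero:", np.count_nonzero(res == 0))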

You can disable denormalised numbers using the daz library:

import daz
daz.set_daz()
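
A minimal before/after check (a sketch, assuming the daz package from PyPI is installed; actual timings will vary):

import timeit
import daz
import numpy as np

myarr = np.random.uniform(-1, 1, [1100, 1100])

def stmt():
    return np.exp(-0.5 * (myarr / 0.001)**2)

print(timeit.timeit(stmt, number=10))  # with default subnormal handling
daz.set_daz()  # subnormal inputs are now treated as zero
print(timeit.timeit(stmt, number=10))  # expected to be faster if subnormals caused the slowdown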

More info: x87 and SSE Floating Point Assists in IA-32: Flush-To-Zero (FTZ) and Denormals-Are-Zero (DAZ):

To avoid serialization and performance issues due to denormals and underflow numbers, use the SSE and SSE2 instructions to set Flush-to-Zero and Denormals-Are-Zero modes within the hardware to enable highest performance for floating-point applications.

Note that in 64-bit mode floating point computations use SSE instructions, not x87.

answered Sep 28 '22 by Maxim Egorushkin