I'm trying to understand the performance differences I am seeing when using various Numba implementations of an algorithm. In particular, I would expect func1d below to be the fastest implementation, since it is the only one that does not copy data; however, from my timings, func1b appears to be the fastest.
import numpy
import numba


def func1a(data, a, b, c):
    # pure numpy
    return a * (1 + numpy.tanh((data / b) - c))


@numba.njit(fastmath=True)
def func1b(data, a, b, c):
    new_data = a * (1 + numpy.tanh((data / b) - c))
    return new_data


@numba.njit(fastmath=True)
def func1c(data, a, b, c):
    new_data = numpy.empty(data.shape)
    for i in range(new_data.shape[0]):
        for j in range(new_data.shape[1]):
            new_data[i, j] = a * (1 + numpy.tanh((data[i, j] / b) - c))
    return new_data


@numba.njit(fastmath=True)
def func1d(data, a, b, c):
    for i in range(data.shape[0]):
        for j in range(data.shape[1]):
            data[i, j] = a * (1 + numpy.tanh((data[i, j] / b) - c))
    return data
Helper functions for testing memory copying
def get_data_base(arr):
    """For a given NumPy array, find the base array
    that owns the actual data.

    https://ipython-books.github.io/45-understanding-the-internals-of-numpy-to-avoid-unnecessary-array-copying/
    """
    base = arr
    while isinstance(base.base, numpy.ndarray):
        base = base.base
    return base


def arrays_share_data(x, y):
    return get_data_base(x) is get_data_base(y)
def test_share(func):
    data = numpy.random.randn(100, 3)
    print(arrays_share_data(data, func(data, 0.5, 2.5, 2.5)))
Timings
# force compiling
data = numpy.random.randn(10_000, 300)
_ = func1a(data, 0.5, 2.5, 2.5)
_ = func1b(data, 0.5, 2.5, 2.5)
_ = func1c(data, 0.5, 2.5, 2.5)
_ = func1d(data, 0.5, 2.5, 2.5)
data = numpy.random.randn(10_000, 300)
%timeit func1a(data, 0.5, 2.5, 2.5)
%timeit func1b(data, 0.5, 2.5, 2.5)
%timeit func1c(data, 0.5, 2.5, 2.5)
%timeit func1d(data, 0.5, 2.5, 2.5)
67.2 ms ± 230 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
13 ms ± 10.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
69.8 ms ± 60.4 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
69.8 ms ± 105 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Test which implementations copy memory
test_share(func1a)
test_share(func1b)
test_share(func1c)
test_share(func1d)
False
False
False
True
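As a cross-check, the same question can be answered with numpy.shares_memory instead of walking the .base chain; a minimal sketch, assuming the functions above are already defined:

data = numpy.random.randn(100, 3)
out_b = func1b(data, 0.5, 2.5, 2.5)  # allocates a fresh array inside the jitted function
out_d = func1d(data, 0.5, 2.5, 2.5)  # writes into the input array and returns it
print(numpy.shares_memory(data, out_b))  # False: copy
print(numpy.shares_memory(data, out_d))  # True: in-place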
Numba can speed things up: this runs in 0.19 seconds, about 13× faster; not bad for just reusing the same code!
For larger input data, the Numba version of the function is much faster than the NumPy version, even taking the compilation time into account. In fact, the ratio of the NumPy and Numba run times depends on the data size and the number of loops, or more generally on the nature of the function being compiled.
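One hedged way to see how much of the first call is compilation cost is to time the first call (which triggers JIT compilation for the given argument types) separately from a later call; a rough sketch, assuming func1b from above is in scope:

import time

data = numpy.random.randn(10_000, 300)

t0 = time.perf_counter()
func1b(data, 0.5, 2.5, 2.5)  # first call: compiles for these argument types, then runs
t1 = time.perf_counter()
func1b(data, 0.5, 2.5, 2.5)  # later call: runs the cached machine code only
t2 = time.perf_counter()

print(f"first call (incl. compilation): {t1 - t0:.3f} s")
print(f"subsequent call:                {t2 - t1:.3f} s")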
Introduction
For the uninitiated, Numba is an open-source JIT compiler that translates a subset of Python/NumPy code into optimized machine code using the LLVM compiler library. In short, Numba makes Python/NumPy code run faster by compiling it to native machine code.
Benchmark results show that the Cython and Numba libraries can significantly speed up Python code; the computation time for plain Python and Cython increases much faster than for Numba.
Here, copying of data doesn't play a big role: the bottleneck is how fast the tanh function is evaluated. There are many algorithms: some are faster, some are slower, some are more precise, some less.
Different numpy distributions use different implementations of the tanh function; e.g., it could be the one from mkl/vml or the one from the gnu-math-library.
Depending on the numba version, either the mkl/svml implementation or the gnu-math-library is used as well.
The easiest way to look inside is to use a profiler, for example perf.
For the numpy-version on my machine I get:
>>> perf record python run.py
>>> perf report
Overhead Command Shared Object Symbol
46,73% python libm-2.23.so [.] __expm1
24,24% python libm-2.23.so [.] __tanh
4,89% python _multiarray_umath.cpython-37m-x86_64-linux-gnu.so [.] sse2_binary_scalar2_divide_DOUBLE
3,59% python [unknown] [k] 0xffffffff8140290c
As one can see, numpy uses the slow gnu-math-library (libm) functionality.
For the numba-function I get:
53,98% python libsvml.so [.] __svml_tanh4_e9
3,60% python [unknown] [k] 0xffffffff81831c57
2,79% python python3.7 [.] _PyEval_EvalFrameDefault
which means that fast mkl/svml functionality is used.
That is (almost) all there is to it.
As @user2640045 has rightly pointed out, the numpy performance will be hurt by additional cache misses due to the creation of temporary arrays.
However, cache misses don't play as big a role as the calculation of tanh:
%timeit func1a(data, 0.5, 2.5, 2.5) # 91.5 ms ± 2.88 ms per loop
%timeit numpy.tanh(data) # 76.1 ms ± 539 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
i.e. creation of temporary objects is responsible for around 20% of the running time.
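If one wants to remove that overhead from the pure-NumPy version as well, the out= argument of the ufuncs lets the whole expression reuse a single buffer instead of allocating a temporary per operation; a sketch, where func1a_outbuf is just an illustrative name and not part of the original code:

import numpy


def func1a_outbuf(data, a, b, c, out=None):
    # same math as func1a, but every ufunc writes into one preallocated buffer
    if out is None:
        out = numpy.empty_like(data)
    numpy.divide(data, b, out=out)    # out = data / b
    numpy.subtract(out, c, out=out)   # out = out - c
    numpy.tanh(out, out=out)          # out = tanh(out)
    numpy.add(out, 1.0, out=out)      # out = out + 1
    numpy.multiply(out, a, out=out)   # out = out * a
    return out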
FWIW, for the version with the handwritten loops, my numba version (0.50.1) is also able to vectorize and call the mkl/svml functionality. If this doesn't happen for some other version, numba falls back to the gnu-math-library functionality, which seems to be what is happening on your machine.
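To check which tanh implementation your own Numba install can use, the numba -s system report prints whether SVML was detected; the numba.config.USING_SVML flag gives the same answer from Python in recent Numba versions (treat the attribute name as an assumption if your version differs):

# from a shell, `numba -s` reports whether Intel SVML was detected
import numba

print(numba.__version__)
print(getattr(numba.config, "USING_SVML", "flag not available in this Numba version"))
# setting NUMBA_DISABLE_INTEL_SVML=1 in the environment before importing numba
# forces the fallback to the gnu-math-library implementation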
Listing of run.py:
import numpy


# TODO: define func1b for checking numba
def func1a(data, a, b, c):
    # pure numpy
    return a * (1 + numpy.tanh((data / b) - c))


data = numpy.random.randn(10_000, 300)
for _ in range(100):
    func1a(data, 0.5, 2.5, 2.5)