Consider the following four functions (python, numba, cython and smart), which calculate identical responses when given the same integer inputs:
    def python(n):
        total = 0
        for m in range(1, n+1):
            total += m
        return total

    from numba import jit
    numba = jit(python)
    cpdef int cython(int n):
        cdef int total = 0
        cdef int m
        for m in range(1, n+1):
            total += m
        return total
    def smart(n):
        return n * (n + 1) // 2
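For reference, the comparisons below can be reproduced with a plain timeit harness along these lines (a sketch: the choice of n and the repeat count are illustrative, and the jitted function is called once up front so compilation time isn't measured):

    import timeit

    n = 10000
    numba(n)  # warm up: the first call triggers JIT compilation

    for name, f in [("python", python), ("numba", numba),
                    ("cython", cython), ("smart", smart)]:
        # total time for 100,000 calls with the same argument
        t = timeit.timeit(lambda: f(n), number=100000)
        print(name, t)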
Timing their execution, I was somewhat surprised to discover that

- numba's run-time is independent of n (while cython's is linear in n), and
- numba is slower than smart.

This immediately raises two questions:

- How does numba manage a run-time independent of n?
- Why is numba nevertheless slower than smart?

As I am no assembler maven, looking at the generated code doesn't really give me much of a clue, beyond that the intermediate LLVM code generated by Numba still appears (I might have misunderstood, though) to contain a loop ... and I get hopelessly lost in the x64 that is eventually generated from that. (Unless someone asks, I won't post the generated code, as it is rather long.)
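For what it's worth, the generated code I refer to can be dumped straight from the dispatcher; inspect_llvm and inspect_asm each return a dict with one entry per compiled signature (a minimal sketch, using the numba function defined above):

    # Call once so the int64 specialization actually gets compiled.
    numba(10000)

    # Intermediate LLVM IR, per compiled signature.
    for sig, llvm_ir in numba.inspect_llvm().items():
        print(sig)
        print(llvm_ir)

    # Final x64 assembly generated from that IR.
    for sig, asm in numba.inspect_asm().items():
        print(sig)
        print(asm)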
I am running this on x64 Linux, in a Jupyter notebook, so I suspect that Cython is using the GCC 4.4.7 that was used to compile Python, and that Numba is using llvmlite 0.20.0, which would imply LLVM 4.0.x.
I have also timed

    smart_numba = jit(smart)

and

    cpdef int smart_cython(int n):
        return n * (n + 1) // 2

smart_numba and numba give identical timings, which are 25% slower than smart (pure Python) and 175% slower than smart_cython.
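(A note on reproduction: the cpdef functions above presumably need Cython's cell magic to compile in a notebook; a sketch, assuming %load_ext cython has been run first and the magic starts its own cell:)

    %%cython
    cpdef int smart_cython(int n):
        return n * (n + 1) // 2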
Does this indicate that Cython does a very good job of efficiently crossing the Python/low-level boundary, while Numba does a poor job? Or is there something else to it?
This appears to be an LLVM vs GCC thing - see the example in Compiler Explorer here, which is less noisy than what numba spits out. I get a bit lost in the assembly, but it is fairly clear that the GCC output has a loop (the jge to .L6) and the clang output does not: LLVM recognizes the loop as the sum of an arithmetic progression and replaces it with the closed-form expression, which is why numba's run-time is independent of n. See also this issue on the GCC bugtracker.
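You can also confirm the loop elimination from Python, without reading any assembly: if LLVM has reduced the loop to the closed form, the run-time should not grow with n (a sketch; absolute numbers will of course vary):

    import timeit

    numba(10)  # compile first so the JIT isn't timed

    # With the closed form, both of these take essentially the same time;
    # if the loop had survived, the second would be ~10,000x slower.
    print(timeit.timeit(lambda: numba(10**4), number=100000))
    print(timeit.timeit(lambda: numba(10**8), number=100000))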
On my machine (Windows x64) numba is not significantly slower than smart, only by about 9 ns. This overhead appears to be due to numba's type-dispatch mechanism - if you elide it by picking a specific overload, the numba version is faster than the pure-Python one.
Here are my timings:

    In [73]: %timeit numba_sum(10000)
    182 ns ± 1.69 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

    In [74]: %timeit smart(10000)
    171 ns ± 2.26 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

    # pick out the int64 overload
    i64_numba_sum = numba_sum.get_overload((numba.int64,))

    In [75]: %timeit i64_numba_sum(10000)
    94 ns ± 1.41 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
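For completeness, numba also supports eager compilation against an explicit signature; this is my own sketch, not something timed above, and calls still pass through the dispatcher, so it does not bypass dispatch the way get_overload does:

    from numba import jit, int64

    # Eagerly compile a single int64 specialization at definition time.
    # Calls still go through the dispatcher, unlike get_overload, which
    # hands back the raw compiled entry point.
    @jit(int64(int64), nopython=True)
    def numba_sum_eager(n):
        total = 0
        for m in range(1, n + 1):
            total += m
        return total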