Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Numba vs Cython loop optimization

Consider the following four functions (python, numba, cython and smart), which calculate identical responses when given the same integer inputs

def python(n):
    total = 0
    for m in range(1,n+1):
        total += m
    return total

from numba import jit
numba = jit(python)

cpdef int cython(int n):
    cdef int total = 0
    cdef int m
    for m in range(1, n+1):
        total += m
    return total

def smart(n):
    return n * (n + 1) // 2

Timing their execution I was somewhat surprised to discover that

  1. numba's run-time is independent of n (while cython's is linear in n)
  2. numba is slower than smart

This immediately raises two questions:

  1. Why is Numba, but not Cython, able to turn it into a constant-time algorithm?
  2. Given that Numba does manage to turn it into a contstant-time algorithm, why is it slower than the pure Python constant-time function smart?

As I am no assembler maven, looking at the generated code doesn't really give me much of a clue, beyond that the intermediate LLVM code generated by Numba still appears (I might have misunderstood, though) to contain a loop ... and I get hopelessly lost in the x64 that is eventually generated from that. (Unless someone asks, I won't post the generated codes, as they are rather long.)

I am running this on a x64 Linux, in a Jupyter notebook, so I suspect that Cython is using the GCC 4.4.7 which was used to compile Python; and llvmlite 0.20.0, which would imply LLVM 4.0.x.

Edit:

I have also timed

smart_numba = jit(smart)

and

cpdef int smart_cython(int n):
    return n * (n + 1) // 2

smart_numba and numba give identical timings, which are 25% slower than smart (pure-Python) and 175% slower than smart_cython.

Does this indicate that Cython does a very good job of efficiently crossing the Python/low-level boundary, while Numba does a poor job? Or is there something else to it?

like image 892
jacg Avatar asked Dec 06 '17 21:12

jacg


People also ask

Is Numba faster than Cython?

Following benchmark result shows Cython and Numba library can significantly speed up Python code. Computation time for Python and Cython increase much faster compared to Numba.

Does Numba speed up for loops?

Numba can speed things up Numba is a just-in-time compiler for Python specifically focused on code that runs in loops over NumPy arrays.

Does Cython improve performance?

The CPython + Cython implementation is the fastest; it is 44 times faster than the CPython implementation. This is an impressive speed improvement, especially considering that the Cython code is very close to the original Python code in its design.

How much faster is Cython than Python?

Note that regular Python takes more than 500 seconds for executing the above code while Cython just takes around 1 second. Thus, Cython is 500x times faster than Python for summing 1 billion numbers.


Video Answer


1 Answers

  1. This appears to be a LLVM vs GCC thing - see example in compiler explorer here, which is less noisy than what numba spits out. I get a bit lost in the assembly, but fairly clear that the GCC output has a loop (the jge to .L6) and the clang output does not. See also this issue on the GCC bugtracker.

  2. On my machine (Windows x64) numba is not significantly slower than smart, only about 9 ns. This overhead appears to be due to the type dispatch mechanism of numba - if you elide it by picking a specific overload, the numba version is faster than the python one

Here are my timings

In [73]: %timeit numba_sum(10000)
182 ns ± 1.69 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

In [74]: %timeit smart(10000)
171 ns ± 2.26 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

# pick out int64 overload
i64_numba_sum = numba_sum.get_overload((numba.int64,))

In [75]: %timeit i64_numba_sum(10000)
94 ns ± 1.41 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
like image 138
chrisb Avatar answered Sep 22 '22 00:09

chrisb