Consider the following four functions (python, numba, cython and smart), which calculate identical responses when given the same integer inputs:
    def python(n):
        total = 0
        for m in range(1, n+1):
            total += m
        return total

    from numba import jit
    numba = jit(python)
    cpdef int cython(int n):
        cdef int total = 0
        cdef int m
        for m in range(1, n+1):
            total += m
        return total
    def smart(n):
        return n * (n + 1) // 2
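For reference, the comparisons below can be reproduced with a plain timeit harness along these lines (a sketch: the choice of n and the repeat count are illustrative, and the jitted function is called once up front so compilation time isn't measured):

    import timeit

    n = 10000
    numba(n)  # warm up: the first call triggers JIT compilation

    for name, f in [("python", python), ("numba", numba),
                    ("cython", cython), ("smart", smart)]:
        # total time for 100,000 calls with the same argument
        t = timeit.timeit(lambda: f(n), number=100000)
        print(name, t)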
Timing their execution, I was somewhat surprised to discover that

- numba's run-time is independent of n (while cython's is linear in n), and
- numba is slower than smart.

This immediately raises two questions:

- How does numba manage a run-time independent of n?
- Why is numba nevertheless slower than smart?

As I am no assembler maven, looking at the generated code doesn't really give me much of a clue, beyond that the intermediate LLVM code generated by Numba still appears (I might have misunderstood, though) to contain a loop ... and I get hopelessly lost in the x64 that is eventually generated from that. (Unless someone asks, I won't post the generated code, as it is rather long.)
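For what it's worth, the generated code I refer to can be dumped straight from the dispatcher; inspect_llvm and inspect_asm each return a dict with one entry per compiled signature (a minimal sketch, using the numba function defined above):

    # Call once so the int64 specialization actually gets compiled.
    numba(10000)

    # Intermediate LLVM IR, per compiled signature.
    for sig, llvm_ir in numba.inspect_llvm().items():
        print(sig)
        print(llvm_ir)

    # Final x64 assembly generated from that IR.
    for sig, asm in numba.inspect_asm().items():
        print(sig)
        print(asm)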
I am running this on x64 Linux, in a Jupyter notebook, so I suspect that Cython is using the GCC 4.4.7 that was used to compile Python, and that Numba is using llvmlite 0.20.0, which would imply LLVM 4.0.x.
I have also timed

    smart_numba = jit(smart)

and

    cpdef int smart_cython(int n):
        return n * (n + 1) // 2

smart_numba and numba give identical timings, which are 25% slower than smart (pure Python) and 175% slower than smart_cython.
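(A note on reproduction: the cpdef functions above presumably need Cython's cell magic to compile in a notebook; a sketch, assuming %load_ext cython has been run first and the magic starts its own cell:)

    %%cython
    cpdef int smart_cython(int n):
        return n * (n + 1) // 2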
Does this indicate that Cython does a very good job of efficiently crossing the Python/low-level boundary, while Numba does a poor job? Or is there something else to it?
This appears to be an LLVM vs GCC thing - see the example in Compiler Explorer here, which is less noisy than what numba spits out. I get a bit lost in the assembly, but it is fairly clear that the GCC output has a loop (the jge to .L6) and the clang output does not: LLVM recognizes the loop as the sum of an arithmetic progression and replaces it with the closed-form expression, which is why numba's run-time is independent of n. See also this issue on the GCC bugtracker.
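You can also confirm the loop elimination from Python, without reading any assembly: if LLVM has reduced the loop to the closed form, the run-time should not grow with n (a sketch; absolute numbers will of course vary):

    import timeit

    numba(10)  # compile first so the JIT isn't timed

    # With the closed form, both of these take essentially the same time;
    # if the loop had survived, the second would be ~10,000x slower.
    print(timeit.timeit(lambda: numba(10**4), number=100000))
    print(timeit.timeit(lambda: numba(10**8), number=100000))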
On my machine (Windows x64) numba is not significantly slower than smart, only by about 9 ns. This overhead appears to be due to numba's type-dispatch mechanism - if you elide it by picking a specific overload, the numba version is faster than the pure-Python one.
Here are my timings:

    In [73]: %timeit numba_sum(10000)
    182 ns ± 1.69 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

    In [74]: %timeit smart(10000)
    171 ns ± 2.26 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

    # pick out the int64 overload
    i64_numba_sum = numba_sum.get_overload((numba.int64,))

    In [75]: %timeit i64_numba_sum(10000)
    94 ns ± 1.41 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
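For completeness, numba also supports eager compilation against an explicit signature; this is my own sketch, not something timed above, and calls still pass through the dispatcher, so it does not bypass dispatch the way get_overload does:

    from numba import jit, int64

    # Eagerly compile a single int64 specialization at definition time.
    # Calls still go through the dispatcher, unlike get_overload, which
    # hands back the raw compiled entry point.
    @jit(int64(int64), nopython=True)
    def numba_sum_eager(n):
        total = 0
        for m in range(1, n + 1):
            total += m
        return total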