Compare a pure Python no-op function with a no-op function decorated with @numba.njit, that is:
import numba

@numba.njit
def boring_numba():
    pass

def call_numba(x):
    for t in range(x):
        boring_numba()

def boring_normal():
    pass

def call_normal(x):
    for t in range(x):
        boring_normal()
If we time this with %timeit, we get the following:
%timeit call_numba(int(1e7))
792 ms ± 5.51 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit call_normal(int(1e7))
737 ms ± 2.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
All perfectly reasonable; there's a small overhead for the numba function, but not much.
If, however, we use cProfile to profile this code, we get the following:
import cProfile
cProfile.run('call_numba(int(1e7)); call_normal(int(1e7))', sort='cumulative')
ncalls tottime percall cumtime percall filename:lineno(function)
76/1 0.003 0.000 8.670 8.670 {built-in method builtins.exec}
1 6.613 6.613 7.127 7.127 experiments.py:10(call_numba)
1 1.111 1.111 1.543 1.543 experiments.py:17(call_normal)
10000000 0.432 0.000 0.432 0.000 experiments.py:14(boring_normal)
10000000 0.428 0.000 0.428 0.000 experiments.py:6(boring_numba)
1 0.000 0.000 0.086 0.086 dispatcher.py:72(compile)
cProfile thinks there is a massive overhead in calling the numba function.
This extends to "real" code: I had a function that simply called my expensive computation (the computation being numba-JIT-compiled), and cProfile reported that the wrapper function was taking around a third of the total time.
I don't mind cProfile adding a bit of overhead, but if it's massively inconsistent about where it adds that overhead, it's not very helpful. Does anyone know why this happens, whether there is anything that can be done about it, and/or whether there are any alternative profiling tools that don't interact badly with numba?
When you create a numba function, you actually create a numba Dispatcher object. This object "redirects" a call to boring_numba to the correct (as far as types are concerned) internal "jitted" function. So even though you created a function called boring_numba, this function isn't called directly; what is called is a compiled function based on your function.
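You can see the Dispatcher from plain Python (a small sketch of my own, not part of the claim above; the exact class path printed varies across numba versions):

import numba

@numba.njit
def boring_numba():
    pass

# boring_numba is not a plain Python function but a Dispatcher instance;
# the original, undecorated function is still reachable via .py_func.
print(type(boring_numba))    # e.g. <class 'numba.core.registry.CPUDispatcher'>
print(boring_numba.py_func)  # <function boring_numba at 0x...>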
Just so the profiler can see that the function boring_numba is called (even though it isn't; what is actually called is CPUDispatcher.__call__), the Dispatcher object needs to hook into the current thread state and check if there's a profiler/tracer running, and if "yes", it makes it look like boring_numba is called. This last step is what incurs the overhead, because it has to fake a "Python stack frame" for boring_numba.
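If you want to watch this happen, here is a minimal sketch (my own illustration, using boring_numba as defined above; the exact events printed depend on the CPython and numba versions) that installs a profile hook via sys.setprofile, which is enough to trigger the frame-faking path:

import sys

def show_events(frame, event, arg):
    # Print each profiler event together with the name of the code object
    # on the current frame; expect the synthesized boring_numba frame to
    # appear among the printed events.
    print(event, frame.f_code.co_name)

boring_numba()              # warm up: compile outside the profiled region
sys.setprofile(show_events)
boring_numba()              # the Dispatcher now fakes a frame for the profiler
sys.setprofile(None)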
A bit more technical:
When you call the numba function boring_numba, it actually calls Dispatcher_Call, which is a wrapper around call_cfunc, and here is the major difference: when you have a profiler running, the code dealing with the profiler makes up a majority of the function call (just compare the if (tstate->use_tracing && tstate->c_profilefunc) branch with the else branch that runs if there is no profiler/tracer):
static PyObject *
call_cfunc(DispatcherObject *self, PyObject *cfunc, PyObject *args, PyObject *kws, PyObject *locals)
{
    PyCFunctionWithKeywords fn;
    PyThreadState *tstate;
    assert(PyCFunction_Check(cfunc));
    assert(PyCFunction_GET_FLAGS(cfunc) == METH_VARARGS | METH_KEYWORDS);
    fn = (PyCFunctionWithKeywords) PyCFunction_GET_FUNCTION(cfunc);
    tstate = PyThreadState_GET();
    if (tstate->use_tracing && tstate->c_profilefunc)
    {
        /*
         * The following code requires some explaining:
         *
         * We want the jit-compiled function to be visible to the profiler, so we
         * need to synthesize a frame for it.
         * The PyFrame_New() constructor doesn't do anything with the 'locals' value if the 'code's
         * 'CO_NEWLOCALS' flag is set (which is always the case nowadays).
         * So, to get local variables into the frame, we have to manually set the 'f_locals'
         * member, then call `PyFrame_LocalsToFast`, where a subsequent call to the `frame.f_locals`
         * property (by virtue of the `frame_getlocals` function in frameobject.c) will find them.
         */
        PyCodeObject *code = (PyCodeObject*)PyObject_GetAttrString((PyObject*)self, "__code__");
        PyObject *globals = PyDict_New();
        PyObject *builtins = PyEval_GetBuiltins();
        PyFrameObject *frame = NULL;
        PyObject *result = NULL;
        if (!code) {
            PyErr_Format(PyExc_RuntimeError, "No __code__ attribute found.");
            goto error;
        }
        /* Populate builtins, which is required by some JITted functions */
        if (PyDict_SetItemString(globals, "__builtins__", builtins)) {
            goto error;
        }
        frame = PyFrame_New(tstate, code, globals, NULL);
        if (frame == NULL) {
            goto error;
        }
        /* Populate the 'fast locals' in `frame` */
        Py_XDECREF(frame->f_locals);
        frame->f_locals = locals;
        Py_XINCREF(frame->f_locals);
        PyFrame_LocalsToFast(frame, 0);
        tstate->frame = frame;
        C_TRACE(result, fn(PyCFunction_GET_SELF(cfunc), args, kws));
        tstate->frame = frame->f_back;
    error:
        Py_XDECREF(frame);
        Py_XDECREF(globals);
        Py_XDECREF(code);
        return result;
    }
    else
        return fn(PyCFunction_GET_SELF(cfunc), args, kws);
}
I assume that it is this extra code (run only when a profiler is active) that slows down the function when you're cProfile-ing.
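A quick way to see the effect without cProfile is to install a do-nothing profile hook and time the loop with and without it (a sketch of my own, using call_numba from above; sys.setprofile alone is enough to trigger the tracing branch):

import sys
import time

def timed(f, n=int(1e6)):
    # Wall-clock timing helper; returns seconds for one call of f(n).
    start = time.perf_counter()
    f(n)
    return time.perf_counter() - start

call_numba(10)                      # warm up so compilation isn't timed
print("no profiler:  ", timed(call_numba))
sys.setprofile(lambda *args: None)  # a profiler that does nothing at all
print("with profiler:", timed(call_numba))
sys.setprofile(None)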
It's a bit unfortunate that numba functions add so much overhead when you run a profiler, but the slowdown will actually be almost negligible if you do anything substantial in the numba function, and even more so if you also move the for loop into a numba function.
If you notice that the numba function (with or without a profiler running) takes too much time, then you probably call it too often. In that case, check whether you can move the loop inside the numba function, or wrap the code containing the loop in another numba function, as in the sketch below.
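For example, a minimal sketch of the second option (the name call_numba_wrapped is mine; it wraps the loop itself in a jitted function, so the Dispatcher, and with it the profiler bookkeeping, is entered once instead of ten million times):

import numba

@numba.njit
def boring_numba():
    pass

@numba.njit
def call_numba_wrapped(x):
    # The whole loop runs in nopython mode; the Python-level Dispatcher
    # is entered only once per call instead of once per iteration.
    for t in range(x):
        boring_numba()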
Note: All of this is (a bit of) speculation; I haven't actually built numba with debug symbols and profiled the C code with a profiler running. However, the amount of work done when a profiler is running makes this seem very plausible. And all of this assumes numba 0.39; I'm not sure whether it applies to past versions as well.