I am testing the performance of the Numba JIT against a Python C extension. The C extension appears to be about 3-4 times faster than the Numba equivalent for a for-loop-based function that calculates the sum of all elements in a 2D array.
Based on valuable comments, I realized my mistake: I should have called (and thereby compiled) the Numba JIT function once before timing. I provide the results after the fix, along with extra cases. But the question remains of when and why each method should be preferred.
Here's the result (time_s, value):
```
# 200 tests mean (including JIT compile inside the loop)
Pure Python: (0.09232537984848023, 29693825)
Numba:       (0.003188209533691406, 29693825)
C Extension: (0.000905141830444336, 29693825.0)

# JIT called once before the test loop (to avoid compile time)
Pure Python: (0.0948486328125, 29685065)
Numba:       (0.00031280517578125, 29685065)
C Extension: (0.0025129318237304688, 29685065.0)

# JIT with no warm-up and no test loop (only a single call)
Pure Python: (0.10458517074584961, 29715115)
Numba:       (0.314251184463501, 29715115)
C Extension: (0.0025091171264648438, 29715115.0)
```
main.py
```python
import numpy as np
import pandas as pd
import numba
import time

import loop_test  # the compiled C extension


def test(fn, *args):
    res = []
    val = None
    for _ in range(100):
        start = time.time()
        val = fn(*args)
        res.append(time.time() - start)
    return np.mean(res), val


sh = (30_000, 20)
col_names = [f"col_{i}" for i in range(sh[1])]
df = pd.DataFrame(np.random.randint(0, 100, size=sh), columns=col_names)
arr = df.to_numpy()


def sum_columns(arr):
    _sum = 0
    for i in range(arr.shape[0]):
        for j in range(arr.shape[1]):
            _sum += arr[i, j]
    return _sum


@numba.njit
def sum_columns_numba(arr):
    _sum = 0
    for i in range(arr.shape[0]):
        for j in range(arr.shape[1]):
            _sum += arr[i, j]
    return _sum


print("Pure Python:", test(sum_columns, arr))
print("Numba:", test(sum_columns_numba, arr))
print("C Extension:", test(loop_test.loop_fn, arr))
```
ext.c
```c
#define PY_SSIZE_T_CLEAN  /* note: the macro is PY_SSIZE_T_CLEAN, not PY_SSIZE_CLEAN */
#include <Python.h>
#include <numpy/arrayobject.h>

static PyObject *loop_fn(PyObject *module, PyObject *args)
{
    PyObject *arr;
    if (!PyArg_ParseTuple(args, "O!", &PyArray_Type, &arr))
        return NULL;

    npy_intp *dims = PyArray_DIMS((PyArrayObject *)arr);
    npy_intp rows = dims[0];
    npy_intp cols = dims[1];

    double sum = 0;
    /* Ensure a C-contiguous double array (converts/copies if necessary). */
    PyArrayObject *arr_new = (PyArrayObject *)PyArray_FROM_OTF(arr, NPY_DOUBLE, NPY_ARRAY_IN_ARRAY);
    if (arr_new == NULL)
        return NULL;
    double *data = (double *)PyArray_DATA(arr_new);

    npy_intp i, j;
    for (i = 0; i < rows; i++)
        for (j = 0; j < cols; j++)
            sum += data[i * cols + j];

    Py_DECREF(arr_new);
    return Py_BuildValue("d", sum);
}

static PyMethodDef Methods[] = {
    {
        .ml_name = "loop_fn",
        .ml_meth = loop_fn,
        .ml_flags = METH_VARARGS,
        .ml_doc = "Returns the sum using a for loop, but in C.",
    },
    {NULL, NULL, 0, NULL},
};

static struct PyModuleDef Module = {
    PyModuleDef_HEAD_INIT,
    "loop_test",
    "A benchmark module test",
    -1,
    Methods};

PyMODINIT_FUNC PyInit_loop_test(void)
{
    import_array();
    return PyModule_Create(&Module);
}
```
setup.py
```python
from setuptools import setup, Extension  # distutils is removed in Python 3.12+
import numpy as np

module = Extension(
    "loop_test",
    sources=["ext.c"],
    include_dirs=[
        np.get_include(),
    ],
)

setup(
    name="loop_test",
    version="1.0",
    description="This is a test package",
    ext_modules=[module],
)
```
```
python3 setup.py install
```
I would like to complement the good answer of John Bollinger:
First of all, C extensions tend to be compiled with GCC on Linux (possibly MSVC on Windows and Clang on macOS, AFAIK), while Numba uses the LLVM compilation toolchain internally. If you want to compare the two, you should use Clang, which is based on the LLVM toolchain. In fact, you should also use the same version of LLVM as Numba for the comparison to be fair. Clang, GCC and MSVC do not optimize code the same way, so the resulting programs can have quite different performance.
Moreover, Numba is a JIT, so it does not need to care about the compatibility (of instruction-set extensions) between different platforms. This means it can use the AVX2 SIMD instruction set if it is available on your machine, while mainstream compilers will not do that by default for the sake of compatibility. In fact, Numba actually does this. You can tell Clang and GCC to optimize the code for the target machine, ignoring compatibility between machines, with the compilation flag -march=native. As a result, the resulting package will certainly be faster, but it can also crash on older machines (or possibly be significantly slower). You can also enable specific instruction sets with flags like -mavx2.
Additionally, Numba uses an aggressive optimization level by default, while AFAIK C extensions are built with the -O2 flag, which does not auto-vectorize the code by default on either GCC or Clang (i.e. no use of packed SIMD instructions). You should certainly specify the -O3 flag manually if this is not already done. On MSVC, the equivalent flag is /O2 (AFAIK there is no /O3 yet).
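For instance, these flags can be passed to the extension build through extra_compile_args (a sketch, assuming a GCC or Clang toolchain on Linux; flag names differ on MSVC):

```python
from setuptools import Extension

# Hypothetical build configuration: -O3 enables auto-vectorization,
# -march=native tunes for (and ties the binary to) the build machine.
module = Extension(
    "loop_test",
    sources=["ext.c"],
    extra_compile_args=["-O3", "-march=native"],
)
```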
Please note that Numba functions can be compiled eagerly (as opposed to lazily, the default) by providing a specific signature (possibly multiple ones). This means you need to know the types of the input parameters in advance, and the start-up time of your application can increase significantly. Numba functions can also be cached so the function is not recompiled over and over on the same platform. This can be done with the flag cache=True. It may not always work for your specific use case, though.
Last but not least, the two codes are not equivalent. This is certainly the most important point. The Numba code deals with an int32-typed arr and accumulates the value in a 64-bit integer _sum, while the C extension accumulates the value in a double-precision floating-point variable. Floating-point addition is not associative (unless you tell the compiler to assume it is, with the flag -ffast-math, which is not enabled by default since it is unsafe), so accumulating floating-point numbers is far more expensive than accumulating integers due to the high latency of the FMA unit on most platforms.
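The non-associativity is easy to demonstrate: regrouping the same three additions changes the result, which is exactly why the compiler cannot reorder a sequential floating-point accumulation into a vectorized one without -ffast-math.

```python
# Floating-point addition is not associative: the grouping changes the result.
a, b, c = 1e16, -1e16, 1.0

left = (a + b) + c   # a + b cancels exactly to 0.0, then + 1.0
right = a + (b + c)  # b + c rounds back to -1e16 (1.0 is below the ulp of 1e16)
```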
Besides, I actually wonder whether PyArray_FROM_OTF performs the correct conversion, but if it does, I expect the conversion to be fairly expensive. You should use the same types in both codes for the comparison to be fair (possibly 64-bit integers in both).
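A small sketch (using NumPy for brevity) of why the int64-versus-double distinction matters even though both benchmarks report the same sum here:

```python
import numpy as np

arr = np.random.randint(0, 100, size=(30_000, 20))

# 64-bit integer accumulation (what the Numba version effectively does).
int_sum = int(arr.sum(dtype=np.int64))

# Double-precision accumulation (what the C extension does). It happens to be
# exact here because the total stays far below 2**53, the range in which a
# double represents every integer exactly.
float_sum = float(arr.astype(np.float64).sum())

# With large 64-bit integers, summing as doubles silently loses precision:
lossy = float(2**62) + 1.0  # 2**62 + 1 is not representable as a double
```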
Are Python C extensions faster than Numba JIT?
It depends. A well-written C extension should not be slower than JITed Python code accomplishing the same work, even if you discount the once-per-run cost of performing the JIT compilation. After all, in principle, you can write C code that compiles to the exact same machine code the JIT would produce. Beyond that, clever and experienced humans may have insights into the details of the computation that enable them to express it in C more efficiently than the JITed Python code. Humans might also do worse, but they shouldn't.
Is my implementation correct?
It looks OK to me, and the fact that all versions compute the same sum is a good basic check. The C implementation is not general, in the sense that it will not properly handle arrays in which any dimension has a stride different from 1, but that's fine for your test because it doesn't need to handle that.
Is there a reason for why C extensions are faster?
To the extent that C extensions can be faster than JITed functions, it's largely because (1) they avoid paying the JIT compilation cost at run time, and (2) a human can sometimes express the computation in C more efficiently than the JIT compiler derives from the Python source.
The latter is more likely to be relevant for tasks somewhat more complicated than your example.
Should I probably always use C extensions if I want the best performance? (non-vectorized functions)
Using C to implement performance-critical code is a longstanding approach to improving performance over that of pure Python. It is probably the most difficult of the available methods, but sometimes taking on that difficulty is warranted. Other times, JIT or Cython or just using modules and packages that handle the performance details for you is an easier way forward, with almost or even fully as much performance benefit.
If you want the absolute best performance possible then writing extensions in C is the doorway to that, but you still have to do a good job of writing the extension. Oftentimes, however, you just need fast, not maximum possible speed. In these cases, ease of implementation and maintenance is an important consideration, and it probably points toward one of the other options.