Broadcasted NumPy arithmetic - why is one method so much more performant?

This question is a follow-up to my answer in Efficient way to compute the Vandermonde matrix.

Here's the setup:

x = np.arange(5000)  # an integer array
N = 4

Now, I'll compute the Vandermonde matrix in two different ways:

m1 = (x ** np.arange(N)[:, None]).T

And,

m2 = x[:, None] ** np.arange(N)

Sanity check:

np.array_equal(m1, m2)
True

These methods produce identical results, but their performance is not:

%timeit m1 = (x ** np.arange(N)[:, None]).T
42.7 µs ± 271 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit m2 = x[:, None] ** np.arange(N)
150 µs ± 995 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

So, the first method, despite requiring a transposition at the end, is still over 3X faster than the second method.

The only difference is that in the first case, the smaller array is broadcasted, whereas with the second case, it is the larger.
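One way to make that concrete (a small sketch of my own; the stride values assume 8-byte integers, so they are platform-dependent) is to inspect the shapes and strides of the two broadcast operand pairs:

import numpy as np

x = np.arange(5000)
N = 4

# Method 1: the small exponent array is the one broadcast;
# the long axis of length 5000 ends up as the innermost axis.
a1, b1 = np.broadcast_arrays(x, np.arange(N)[:, None])
print(a1.shape, a1.strides)  # (4, 5000) (0, 8)
print(b1.shape, b1.strides)  # (4, 5000) (8, 0)

# Method 2: the large array x is the one reshaped and broadcast;
# the innermost axis is only 4 wide, so the outer loop runs 5000 times.
a2, b2 = np.broadcast_arrays(x[:, None], np.arange(N))
print(a2.shape, a2.strides)  # (5000, 4) (8, 0)
print(b2.shape, b2.strides)  # (5000, 4) (0, 8)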

So, with a fairly decent understanding of how numpy works, I can guess that the answer would involve the cache. The first method is a lot more cache friendly than the second. However, I'd like an official word from someone with more experience than me.

What could be the reason for this stark contrast in timings?

Asked by cs95, Jan 14 '18


2 Answers

I too tried to look at broadcast_arrays:

In [121]: X,Y = np.broadcast_arrays(np.arange(4)[:,None], np.arange(1000))
In [122]: timeit X+Y
10.1 µs ± 31.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [123]: X,Y = np.broadcast_arrays(np.arange(1000)[:,None], np.arange(4))
In [124]: timeit X+Y
26.1 µs ± 30.6 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [125]: X.shape, X.strides
Out[125]: ((1000, 4), (4, 0))
In [126]: Y.shape, Y.strides
Out[126]: ((1000, 4), (0, 4))

np.ascontiguousarray converts the 0-strided dimensions into full ones:

In [132]: Y1 = np.ascontiguousarray(Y)
In [134]: Y1.strides
Out[134]: (16, 4)
In [135]: X1 = np.ascontiguousarray(X)
In [136]: X1.shape
Out[136]: (1000, 4)

Operating with the full arrays is faster:

In [137]: timeit X1+Y1
4.66 µs ± 161 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

So there's some sort of time penalty to using the 0-strided arrays, even though it doesn't explicitly expand the arrays first. And the cost is tied to the shapes, and possibly to which dimension is expanded.
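One way to probe that shape dependence (a sketch of my own, not part of the timings above) is to keep the total element count fixed and vary how it is split between the outer and inner axes, with both operands remaining 0-strided views:

import numpy as np
from timeit import timeit

# Keep the total number of elements fixed (~4096) and vary which factor
# becomes the inner axis, to see how the cost tracks the shape.
total = 4096
for inner in (4, 16, 64, 256, 1024):
    outer = total // inner
    X = np.broadcast_to(np.arange(inner), (outer, inner))
    Y = np.broadcast_to(np.arange(outer)[:, None], (outer, inner))
    t = timeit(lambda: X + Y, number=10000)
    print(f"{outer:5d} x {inner:5d}: {t:.4f} s")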

Answered by hpaulj


I'm not convinced caching is really the single most influential factor here.

I'm also not a trained computer scientist, so I may well be wrong, but let me walk you through a couple of observations. For simplicity I'm following @hpaulj's observation that '+' shows essentially the same effect as '**'.

My working hypothesis is that it is the overhead of the outer loops, which I believe are substantially more expensive than the contiguous, vectorizable innermost loops.

So let us first minimize the amount of data that is reused, so that caching is unlikely to have much impact:

>>> from timeit import repeat
>>> import numpy as np
>>> 
>>> def mock_data(k, N, M):
...     x = list(np.random.randint(0, 10000, (k, N, M)))
...     y = list(np.random.randint(0, 10000, (k, M)))
...     z = list(np.random.randint(0, 10000, (k, N, 1)))
...     return x, y, z
...   
>>> k, N, M = 500, 5000, 4
>>>
>>> repeat('x.pop() + y.pop()', setup='x, y, z = mock_data(k, M, N)', globals=globals(), number=k)
[0.017986663966439664, 0.018148145987652242, 0.018077059998176992]
>>> repeat('x.pop() + y.pop()', setup='x, y, z = mock_data(k, N, M)', globals=globals(), number=k)
[0.026680009090341628, 0.026304758968763053, 0.02680662798229605]

Here both scenarios have contiguous data and the same number of additions, but the version with 5000 outer iterations is substantially slower. When we bring caching back (albeit only across trials), the absolute difference stays roughly the same, but the ratio becomes even more pronounced:

>>> repeat('x[0] + y[0]', setup='x, y, z = mock_data(k, M, N)', globals=globals(), number=k)
[0.011324503924697638, 0.011121788993477821, 0.01106808998156339]
>>> repeat('x[0] + y[0]', setup='x, y, z = mock_data(k, N, M)', globals=globals(), number=k)
[0.020170683041214943, 0.0202067659702152, 0.020624138065613806]

Returning to the original "outer sum" scenario, we see that the noncontiguous case with the long dimension fares even worse. Since we have to read no more data than in the contiguous scenario, this cannot be explained by data not being cached.

>>> repeat('z.pop() + y.pop()', setup='x, y, z = mock_data(k, M, N)', globals=globals(), number=k)
[0.013918839977122843, 0.01390116906259209, 0.013737019035033882]
>>> repeat('z.pop() + y.pop()', setup='x, y, z = mock_data(k, N, M)', globals=globals(), number=k)
[0.0335254140663892, 0.03351909795310348, 0.0335453050211072]

Further, both profit from across-trial caching:

>>> repeat('z[0] + y[0]', setup='x, y, z = mock_data(k, M, N)', globals=globals(), number=k)
[0.012061356916092336, 0.012182610924355686, 0.012071475037373602]
>>> repeat('z[0] + y[0]', setup='x, y, z = mock_data(k, N, M)', globals=globals(), number=k)
[0.03265167598146945, 0.03277428599540144, 0.03247103898320347]

From a cachist's point of view this is inconclusive at best.

So let's have a look at the source. After building a current NumPy from the tarball, you'll find somewhere in the tree almost 15000 lines worth of computer-generated code in a file called 'loops.c'. These loops are the innermost loops of ufuncs; the most relevant bit for our situation appears to be

#define BINARY_LOOP\
    char *ip1 = args[0], *ip2 = args[1], *op1 = args[2];\
    npy_intp is1 = steps[0], is2 = steps[1], os1 = steps[2];\
    npy_intp n = dimensions[0];\
    npy_intp i;\
    for(i = 0; i < n; i++, ip1 += is1, ip2 += is2, op1 += os1)

/*
 * loop with contiguous specialization
 * op should be the code working on `tin in1`, `tin in2` and
 * storing the result in `tout * out`
 * combine with NPY_GCC_OPT_3 to allow autovectorization
 * should only be used where its worthwhile to avoid code bloat
 */
#define BASE_BINARY_LOOP(tin, tout, op) \
    BINARY_LOOP { \
        const tin in1 = *(tin *)ip1; \
        const tin in2 = *(tin *)ip2; \
        tout * out = (tout *)op1; \
        op; \
    }

etc.

The payload in our case seems lean enough, especially if I interpret the comment about contiguous specialization and autovectorization correctly. Now, if we do only 4 iterations per inner loop, the overhead-to-payload ratio starts to look a bit troubling, and it doesn't end there.

In the file ufunc_object.c we find the following snippet

/*
 * If no trivial loop matched, an iterator is required to
 * resolve broadcasting, etc
 */

NPY_UF_DBG_PRINT("iterator loop\n");
if (iterator_loop(ufunc, op, dtypes, order,
                buffersize, arr_prep, arr_prep_args,
                innerloop, innerloopdata) < 0) {
    return -1;
}

return 0;

the actual loop looks like

    NPY_BEGIN_THREADS_NDITER(iter);

    /* Execute the loop */
    do {
        NPY_UF_DBG_PRINT1("iterator loop count %d\n", (int)*count_ptr);
        innerloop(dataptr, count_ptr, stride, innerloopdata);
    } while (iternext(iter));

    NPY_END_THREADS;

innerloop is the inner loop we looked at above. How much overhead comes with iternext?

For this we need to turn to file nditer_templ.c.src where we find

/*NUMPY_API
 * Compute the specialized iteration function for an iterator
 *
 * If errmsg is non-NULL, it should point to a variable which will
 * receive the error message, and no Python exception will be set.
 * This is so that the function can be called from code not holding
 * the GIL.
 */
NPY_NO_EXPORT NpyIter_IterNextFunc *
NpyIter_GetIterNext(NpyIter *iter, char **errmsg)
{

etc.

This function returns a pointer to one of the specialized functions that the template preprocessing generates from

/* Specialized iternext (@const_itflags@,@tag_ndim@,@tag_nop@) */
static int
npyiter_iternext_itflags@tag_itflags@_dims@tag_ndim@_iters@tag_nop@(
                                                      NpyIter *iter)
{

etc.

Parsing this is beyond me, but in any case it is a function pointer that must be called at every iteration of the outer loop, and as far as I know function pointers cannot be inlined, so compared to 4 iterations of a trivial loop body this will be substantial.

I should probably profile this but my skills are insufficient.
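One way to probe this without a C-level profiler (a rough sketch only, using '+' as before) is to hold the inner dimension at 4 and grow the number of outer iterations R: if the per-outer-iteration overhead (the iternext call plus the inner-loop dispatch) dominates, the gap between the broadcast computation and a pre-expanded, contiguous one should grow roughly linearly with R.

import numpy as np
from timeit import timeit

e = np.arange(4)
for R in (1000, 5000, 25000, 125000):
    x = np.arange(R)
    # Pre-expanded, contiguous copies of the same data for comparison.
    xc = np.ascontiguousarray(np.broadcast_to(x[:, None], (R, 4)))
    ec = np.ascontiguousarray(np.broadcast_to(e, (R, 4)))
    t_bcast = timeit(lambda: x[:, None] + e, number=1000)
    t_contig = timeit(lambda: xc + ec, number=1000)
    print(R, t_bcast, t_contig, t_bcast - t_contig)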

Answered by Paul Panzer