
Best way to iterate over an array that uses its own output

Firstly, I'd like to apologize for the badly worded title - I can't currently think of a better way to phrase it. I'm wondering if there's a faster way to implement an array operation in Python where each step depends on previous outputs in an iterative fashion (e.g. forward differencing operations, filtering, etc.), i.e. operations of a form like:

for n in range(1, len(X)):
    Y[n] = X[n] + X[n - 1] + Y[n-1]

Here X is an array of values and Y is the output; in this case, Y[0] is assumed to be known or calculated separately before the above loop. My question is: is there any NumPy functionality to speed up this sort of self-referential loop? It is the major bottleneck in almost all my scripts. I know NumPy routines benefit from being implemented in C, so I was curious whether anyone knew of any NumPy routines that would help here. Failing that, are there better ways to program this loop (in Python) that would speed up its execution for large array sizes (>500,000 data points)?

asked Nov 28 '22 by Yoshi


2 Answers

Accessing single NumPy array elements or (elementwise-)iterating over a NumPy array is slow (like really slow). If you ever want to do a manual iteration over a NumPy array: Just don't do it!
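
To get a rough feel for that overhead, here's a minimal, purely illustrative micro-benchmark sketch (the array size and repeat count are arbitrary): summing by indexing each element of a NumPy array is much slower than doing the same with the equivalent list, because every access boxes the element into a Python float.

import timeit

import numpy as np

arr = np.arange(100_000, dtype=float)
lst = arr.tolist()

# element-wise indexing of the NumPy array vs. the equivalent Python list
print(timeit.timeit(lambda: sum(arr[i] for i in range(len(arr))), number=10))
print(timeit.timeit(lambda: sum(lst[i] for i in range(len(lst))), number=10))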

But you have some options. The easiest is to convert the array to a Python list and iterate over the list (sounds silly, but stay with me - I'll present some benchmarks at the end of the answer [1]):

X = X.tolist()
Y = Y.tolist()
for n in range(1, len(X)):
    Y[n] = X[n] + X[n - 1] + Y[n-1]

If you also use direct iteration over the lists, it could be even faster:

X = X.tolist()
Y = Y.tolist()
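# zip() consumes the lists lazily, so Y_n_m1 on each iteration is Y[idx - 1],
# including the values written by earlier iterations of this loop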
for idx, (Y_n_m1, X_n, X_n_m1) in enumerate(zip(Y, X[1:], X), 1):
    Y[idx] = X_n + X_n_m1 + Y_n_m1

Then there are more sophisticated options that require additional packages, most notably Cython and Numba. These are designed to work on the array elements directly and avoid Python overhead wherever possible. For example, with Numba you can just use your approach inside a jitted (just-in-time compiled) function:

import numba as nb

@nb.njit
def func(X, Y):
    for n in range(1, len(X)):
        Y[n] = X[n] + X[n - 1] + Y[n-1]

Here X and Y can be NumPy arrays, but Numba will work on the buffer directly, out-speeding the other approaches (possibly by orders of magnitude).
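
For completeness, a minimal usage sketch (the size and initial value here are just placeholders); the first call compiles the function for these dtypes, later calls reuse the compiled code:

import numpy as np

X = np.random.random(500_000)
Y = np.empty_like(X)
Y[0] = 0.0     # assume the initial value is known
func(X, Y)     # first call: compile + run; Y is filled in place
func(X, Y)     # subsequent calls reuse the compiled version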

Numba is a "heavier" dependency than Cython, but it can be faster and easier to use. Without conda, however, it's hard to install numba... YMMV

However, here's also a Cython version of the code (compiled using IPython magic; it's a bit different if you're not using IPython):

In [1]: %load_ext cython

In [2]: %%cython
   ...:
   ...: cimport cython
   ...:
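   ...: # the decorators below turn off bounds checking and negative-index
   ...: # wrapping inside the loop, which avoids per-access checks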
   ...: @cython.boundscheck(False)
   ...: @cython.wraparound(False)
   ...: cpdef cython_indexing(double[:] X, double[:] Y):
   ...:     cdef Py_ssize_t n
   ...:     for n in range(1, len(X)):
   ...:         Y[n] = X[n] + X[n - 1] + Y[n-1]
   ...:     return Y
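
Just to show how it's called (again a minimal sketch with an arbitrary size): the typed memoryviews accept float64 NumPy arrays directly, and the returned memoryview can be wrapped back into an ndarray with np.asarray:

In [3]: import numpy as np
   ...: X = np.random.random(500_000)
   ...: Y = np.empty_like(X)
   ...: Y[0] = 0.0
   ...: Y = np.asarray(cython_indexing(X, Y))  # memoryview -> ndarray (same buffer)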

Just to give an example of the timings (based on the timing framework from my answer to another question):

import numpy as np
import numba as nb
import scipy.signal

def numpy_indexing(X, Y):
    for n in range(1, len(X)):
        Y[n] = X[n] + X[n - 1] + Y[n-1]
    return Y

def list_indexing(X, Y):
    X = X.tolist()
    Y = Y.tolist()
    for n in range(1, len(X)):
        Y[n] = X[n] + X[n - 1] + Y[n-1]
    return Y

def list_direct(X, Y):
    X = X.tolist()
    Y = Y.tolist()
    for idx, (Y_n_m1, X_n, X_n_m1) in enumerate(zip(Y, X[1:], X), 1):
        Y[idx] = X_n + X_n_m1 + Y_n_m1
    return Y

@nb.njit
def numba_indexing(X, Y):
    for n in range(1, len(X)):
        Y[n] = X[n] + X[n - 1] + Y[n-1]
    return Y


def numpy_cumsum(X, Y):
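    # Unrolling the recurrence gives Y[n] = Y[0] + sum(X[k] + X[k-1] for k in 1..n),
    # so fill Y[1:] with the pairwise sums (Y[0] keeps the known initial value)
    # and take a cumulative sum in place.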
    Y[1:] = X[1:] + X[:-1]
    np.cumsum(Y, out=Y)
    return Y

def scipy_lfilter(X, Y):
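    # lfilter implements a[0]*y[n] = b[0]*x[n] + b[1]*x[n-1] - a[1]*y[n-1],
    # so b = [1, 1], a = [1, -1] gives y[n] = x[n] + x[n-1] + y[n-1].
    # The filter starts from a zero state (y[0] = x[0]), so the constant
    # offset Y[0] - X[0] shifts the result to match the known Y[0].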
    a = [1, -1]
    b = [1, 1]
    return Y[0] - X[0] + scipy.signal.lfilter(b, a, X)

# Make sure the approaches give the same result
X = np.random.random(10000)
Y = np.zeros(10000)
Y[0] = np.random.random()

np.testing.assert_array_equal(numba_indexing(X, Y), numpy_indexing(X, Y))
np.testing.assert_array_equal(numba_indexing(X, Y), numpy_cumsum(X, Y))
np.testing.assert_almost_equal(numba_indexing(X, Y), scipy_lfilter(X, Y))
np.testing.assert_array_equal(numba_indexing(X, Y), cython_indexing(X, Y))

# Timing setup
timings = {numpy_indexing: [], 
           list_indexing: [], 
           list_direct: [],
           numba_indexing: [],
           numpy_cumsum: [],
           scipy_lfilter: [],
           cython_indexing: []}
sizes = [2**i for i in range(1, 20, 2)]

# Timing
for size in sizes:
    X = np.random.random(size=size)
    Y = np.zeros(size)
    Y[0] = np.random.random()
    for func in timings:
        res = %timeit -o func(X, Y)
        timings[func].append(res)

# Plotting absolute times

%matplotlib notebook
import matplotlib.pyplot as plt

fig = plt.figure(1)
ax = plt.subplot(111)

for func in timings:
    ax.plot(sizes, 
            [time.best for time in timings[func]], 
            label=str(func.__name__))
ax.set_xscale('log')
ax.set_yscale('log')
ax.set_xlabel('size')
ax.set_ylabel('time [seconds]')
ax.grid(which='both')
ax.legend()
plt.tight_layout()

# Plotting relative times

fig = plt.figure(2)
ax = plt.subplot(111)

baseline = numba_indexing # choose one function as baseline
for func in timings:
    ax.plot(sizes, 
            [time.best / ref.best for time, ref in zip(timings[func], timings[baseline])], 
            label=str(func.__name__))
ax.set_yscale('log')
ax.set_xscale('log')
ax.set_xlabel('size')
ax.set_ylabel('time relative to "{}"'.format(baseline.__name__))
ax.grid(which='both')
ax.legend()

plt.tight_layout()

With the following results:

Absolute runtimes [plot not reproduced here]

Relative runtimes, compared to the numba function [plot not reproduced here]

So, just by converting the arrays to lists you will be roughly 3 times faster! Iterating directly over these lists gives another (smaller) speedup, only 20% in this benchmark, but now we're almost 4 times faster compared to the original solution. With Numba you can speed it up by a factor of more than 100 compared to the list operations! And Cython is only a bit slower than Numba (~40-50%), probably because I haven't squeezed out every possible Cython optimization (usually it's not more than 10-20% slower). However, for large arrays the difference gets smaller.

[1] I went into more detail in another answer. That Q&A was about converting to a set, but because set uses (hidden) "manual iteration", it also applies here.

I also included the timings for the NumPy cumsum and SciPy lfilter approaches. These were roughly 20 times slower for small arrays and 4 times slower for large arrays compared to the Numba function. However, if I interpret the question correctly, you were looking for general approaches, not only ones that apply to this particular example. Not every self-referencing loop can be implemented using cum* functions from NumPy or SciPy's filters. But even then it seems like they can't compete with Cython and/or Numba.

answered Dec 09 '22 by MSeifert


It's pretty simple using np.cumsum:

#!/usr/bin/env python3
import numpy as np
import random

def r():
    return random.randint(100, 1000)
X = np.array([r() for _ in range(10)])
fast_Y = np.empty(X.shape, dtype=X.dtype)
slow_Y = np.empty(X.shape, dtype=X.dtype)
slow_Y[0] = fast_Y[0] = r()

# fast method
fast_Y[1:] = X[1:] + X[:-1]
np.cumsum(fast_Y, out=fast_Y)

# original method
for n in range(1, len(X)):
    slow_Y[n] = X[n] + X[n - 1] + slow_Y[n-1]


assert (fast_Y == slow_Y).all()

answered Dec 09 '22 by o11c