Firstly, I'd like to apologize for the badly worded title - I can't currently think of a better way to phrase it. Basically, I'm wondering if there's a faster way to implement an array operation in Python where each operation depends on previous outputs in an iterative fashion (e.g. forward differencing operations, filtering, etc.). Basically, operations that are of a form like:
for n in range(1, len(X)):
Y[n] = X[n] + X[n - 1] + Y[n-1]
Where X
is an array of values, and Y
is the output. In this case, Y[0]
is assumed to be known or calculated separately before the above loop. My question is: Is there a NumPy functionality to speed up this sort of self-referential loop? This is the major bottleneck in almost all the scripts I have. I know NumPy routines benefit from being executed from C routines, so I was curious if anyone knew of any numpy routines that would help here. Else failing that, are there better ways to program this loop (in Python) that would speed up its execution for large array sizes? (>500,000 data points).
Accessing single NumPy array elements or (elementwise-)iterating over a NumPy array is slow (like really slow). If you ever want to do a manual iteration over a NumPy array: Just don't do it!
But you got some options. The easiest is to convert the array to a Python list and iterate over the list (sounds silly, but stay with me - I'll present some benchmarks at the end of the answer 1):
X = X.tolist()
Y = Y.tolist()
for n in range(1, len(X)):
Y[n] = X[n] + X[n - 1] + Y[n-1]
If you also use direct iteration over the lists, it could be even faster:
X = X.tolist()
Y = Y.tolist()
for idx, (Y_n_m1, X_n, X_n_m1) in enumerate(zip(Y, X[1:], X), 1):
Y[idx] = X_n + X_n_m1 + Y_n_m1
Then there are more sophisticated options that require additional packages. Most notably Cython
and Numba
, these are designed to work on the array-elements directly and avoid Python overhead whenever possible. For example with Numba you could just use your approach inside a jitted (just-in-time compiled) function:
import numba as nb
@nb.njit
def func(X, Y):
for n in range(1, len(X)):
Y[n] = X[n] + X[n - 1] + Y[n-1]
There X
and Y
can be NumPy arrays but numba will work on the buffer directly, out-speeding the other approaches (possibly by orders of magnitude).
Numba is a "heavier" dependency than Cython, but it can be faster and easier to use. But without conda
it's hard to install numba... YMMV
However here's also a Cython version of the code (compiled using IPython magic, it's a bit different if you're not using IPython):
In [1]: %load_ext cython
In [2]: %%cython
...:
...: cimport cython
...:
...: @cython.boundscheck(False)
...: @cython.wraparound(False)
...: cpdef cython_indexing(double[:] X, double[:] Y):
...: cdef Py_ssize_t n
...: for n in range(1, len(X)):
...: Y[n] = X[n] + X[n - 1] + Y[n-1]
...: return Y
Just to give an example (based on the timing framework from my answer to another question), regarding the timings:
import numpy as np
import numba as nb
import scipy.signal
def numpy_indexing(X, Y):
for n in range(1, len(X)):
Y[n] = X[n] + X[n - 1] + Y[n-1]
return Y
def list_indexing(X, Y):
X = X.tolist()
Y = Y.tolist()
for n in range(1, len(X)):
Y[n] = X[n] + X[n - 1] + Y[n-1]
return Y
def list_direct(X, Y):
X = X.tolist()
Y = Y.tolist()
for idx, (Y_n_m1, X_n, X_n_m1) in enumerate(zip(Y, X[1:], X), 1):
Y[idx] = X_n + X_n_m1 + Y_n_m1
return Y
@nb.njit
def numba_indexing(X, Y):
for n in range(1, len(X)):
Y[n] = X[n] + X[n - 1] + Y[n-1]
return Y
def numpy_cumsum(X, Y):
Y[1:] = X[1:] + X[:-1]
np.cumsum(Y, out=Y)
return Y
def scipy_lfilter(X, Y):
a = [1, -1]
b = [1, 1]
return Y[0] - X[0] + scipy.signal.lfilter(b, a, X)
# Make sure the approaches give the same result
X = np.random.random(10000)
Y = np.zeros(10000)
Y[0] = np.random.random()
np.testing.assert_array_equal(numba_indexing(X, Y), numpy_indexing(X, Y))
np.testing.assert_array_equal(numba_indexing(X, Y), numpy_cumsum(X, Y))
np.testing.assert_almost_equal(numba_indexing(X, Y), scipy_lfilter(X, Y))
np.testing.assert_array_equal(numba_indexing(X, Y), cython_indexing(X, Y))
# Timing setup
timings = {numpy_indexing: [],
list_indexing: [],
list_direct: [],
numba_indexing: [],
numpy_cumsum: [],
scipy_lfilter: [],
cython_indexing: []}
sizes = [2**i for i in range(1, 20, 2)]
# Timing
for size in sizes:
X = np.random.random(size=size)
Y = np.zeros(size)
Y[0] = np.random.random()
for func in timings:
res = %timeit -o func(X, Y)
timings[func].append(res)
# Plottig absolute times
%matplotlib notebook
import matplotlib.pyplot as plt
fig = plt.figure(1)
ax = plt.subplot(111)
for func in timings:
ax.plot(sizes,
[time.best for time in timings[func]],
label=str(func.__name__))
ax.set_xscale('log')
ax.set_yscale('log')
ax.set_xlabel('size')
ax.set_ylabel('time [seconds]')
ax.grid(which='both')
ax.legend()
plt.tight_layout()
# Plotting relative times
fig = plt.figure(1)
ax = plt.subplot(111)
baseline = numba_indexing # choose one function as baseline
for func in timings:
ax.plot(sizes,
[time.best / ref.best for time, ref in zip(timings[func], timings[baseline])],
label=str(func.__name__))
ax.set_yscale('log')
ax.set_xscale('log')
ax.set_xlabel('size')
ax.set_ylabel('time relative to "{}"'.format(baseline.__name__))
ax.grid(which='both')
ax.legend()
plt.tight_layout()
With the following results:
So, just by converting it to a list you will be roughly 3 times faster! By iterating directly over these lists you get another (yet smaller) speedup, only 20% in this benchmark but we're almost 4 times faster now compared to the original solution. With numba you can speed it by a factor of more than 100 compared to the list operations! And Cython is only a bit slower than numba (~40-50%), probably because I haven't squeezed out every possible optimization (usually it's not more than 10-20% slower) you could do with Cython. However for large arrays the difference gets smaller.
1 I did go into more details in another answer. That Q+A was about converting to a set
but because set
uses (hidden) "manual iteration" it also applies here.
I included the timings for the NumPy cumsum
and Scipy lfilter
approaches. These were roughly 20 times slower for small arrays and 4 times slower for large arrays compared to the numba function. However if I interpret the question correctly you looked for general ways not only ones that applied in the example. Not every self-referencing loop can be implemented using cum*
functions from NumPy or SciPys filters. But even then it seems like they can't compete with Cython and/or numba.
It's pretty simple using np.cumsum
:
#!/usr/bin/env python3
import numpy as np
import random
def r():
return random.randint(100, 1000)
X = np.array([r() for _ in range(10)])
fast_Y = np.ndarray(X.shape, dtype=X.dtype)
slow_Y = np.ndarray(X.shape, dtype=X.dtype)
slow_Y[0] = fast_Y[0] = r()
# fast method
fast_Y[1:] = X[1:] + X[:-1]
np.cumsum(fast_Y, out=fast_Y)
# original method
for n in range(1, len(X)):
slow_Y[n] = X[n] + X[n - 1] + slow_Y[n-1]
assert (fast_Y == slow_Y).all()
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With