There seems to be common wisdom that using np.take is considerably faster than array indexing. See, for example, http://wesmckinney.com/blog/numpy-indexing-peculiarities/, "Fast numpy fancy indexing", and "Fast(er) numpy fancy indexing and reduction?". There are also suggestions that np.ix_ is better in some cases.
I've done some profiling, and that does seem to be true in the majority of cases, although the difference decreases as the arrays become larger.
Performance is impacted by the size of the array, the length of the index (for rows), and the number of columns taken. The number of rows seems to have the largest effect; the number of columns in the array also has an effect even when the index is 1D. Changing the size of the index doesn't seem to make much difference between the methods.
So, the question is twofold: 1. Why is there such a large difference in performance between methods? 2. When does it make sense to use one method over another? Are there some array types, orderings, or shapes that one will always work better for?
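To make it concrete which operations are being compared, here is a small sketch showing that the three methods select the same elements (the toy array and index values here are just for illustration; the benchmark code further down times these same patterns):

import numpy as np

arr = np.arange(20.0).reshape(5, 4)   # toy 5x4 array
rows = np.array([0, 2, 4])
cols = np.array([1, 3])

# Whole rows: fancy indexing and np.take select the same thing.
assert np.array_equal(arr[rows], arr.take(rows, axis=0))

# Row/column cross-product: broadcast fancy indexing, a flat take,
# and np.ix_ are three ways to express the same selection.
sub_fancy = arr[rows.reshape(-1, 1), cols]
sub_take = arr.take(rows.reshape(-1, 1) * arr.shape[1] + cols)
sub_ix = arr[np.ix_(rows, cols)]
assert np.array_equal(sub_fancy, sub_take)
assert np.array_equal(sub_fancy, sub_ix)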
There are a lot of things that could be impacting performance, so I've shown a few of them below, and included the code used to try and make this reproducible.
Edit: I've updated the y-axis on the plots to show the complete range of values. It makes it clearer that the difference is smaller than it appeared for 1D data.
Looking at running time compared to the number of rows shows that indexing is pretty consistent, with a slight upward trend; take gets consistently slower as the number of rows increases.

As the number of columns increases, both become slower, but take shows a larger increase (and this is still with a 1D index).

With 2D data the results are similar. Using ix_ is also shown, and it seems to have the worst performance overall.
import timeit
from pylab import *

def get_test(M, T, C):
    """
    Return an array and a random, sorted index into its rows.
    M : number of rows
    T : number of rows to take
    C : number of columns
    """
    arr = randn(M, C)
    idx = sort(randint(0, M, T))
    return arr, idx

def draw_time(call, N=10, V='M', T=1000, M=5000, C=300, **kwargs):
    """
    call : function that does the indexing, accepts (arr, idx)
    N : number of times to run timeit
    V : which parameter to vary: number of rows (M), rows taken (T), or columns (C)
    **kwargs : passed to plot
    """
    pts = {
        'M': [10, 20, 50, 100, 500, 1000, 2000, 5000, 10000, 20000, 50000, 100000, 200000, 500000],
        'T': [10, 50, 100, 500, 1000, 5000, 10000, 50000],
        'C': [5, 10, 20, 50, 100, 200, 500, 1000],
    }
    res = []
    kw = dict(T=T, M=M, C=C)  # default values
    for v in pts[V]:
        kw[V] = v
        try:
            arr, idx = get_test(**kw)
        except ValueError:  # skip parameter combinations that can't be built
            res.append(None)
        else:
            res.append(timeit.timeit(lambda: call(arr, idx), number=N))
    plot(pts[V], res, marker='x', **kwargs)
    xscale('log')
    ylabel('runtime [s]')
    if V == 'M':
        xlabel('size of array [rows]')
    elif V == 'T':
        xlabel('number of rows taken')
    elif V == 'C':
        xlabel('number of columns in data')
funcs1D = {
    'fancy': lambda arr, idx: arr[idx],
    'take': lambda arr, idx: arr.take(idx, axis=0),
}

cidx = r_[1, 3, 7, 15, 29]  # fixed columns selected in the 2D tests

funcs2D = {
    'fancy2D': lambda arr, idx: arr[idx.reshape(-1, 1), cidx],
    'take2D': lambda arr, idx: arr.take(idx.reshape(-1, 1) * arr.shape[1] + cidx),
    'ix_': lambda arr, idx: arr[ix_(idx, cidx)],
}
def test(funcs, N=100, **kwargs):
    for descr, f in funcs.items():
        draw_time(f, label="{}".format(descr), N=N, **kwargs)
    legend()

figure()
title('1D index, 30 columns in data')
test(funcs1D, V='M', C=30)
ylim(0, 0.25)
# savefig('perf_1D_arraysize')

figure()
title('1D index, 5000 rows in data')
test(funcs1D, V='C', M=5000)
ylim(0, 0.07)
# savefig('perf_1D_numbercolumns')

figure()
title('2D index, 300 columns in data')
test(funcs2D, V='M')
ylim(0, 0.01)
# savefig('perf_2D_arraysize')

figure()
title('2D index, 30 columns in data')
test(funcs2D, V='M', C=30)
ylim(0, 0.01)
# savefig('perf_2D_arraysize_C30')
The answer is very low-level and has to do with the C compiler and CPU cache optimizations. Please see the active discussion with Sebastian Berg and Max Bolingbroke (both numpy contributors) on this numpy issue.
Fancy indexing tries to be "smart" about how the memory is read and written (C-order vs. F-order), while .take always keeps C-order. This means fancy indexing will usually be much faster for F-ordered arrays, and should be faster for huge arrays in any case. However, numpy decides on the "smart" order without taking the size of the array into consideration, or the particular hardware it is running on. Therefore, for smaller arrays, choosing the "wrong" memory order might actually get better performance thanks to better use of reads from the CPU cache.
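As a rough way to see this, one can time both methods on C-ordered and F-ordered copies of the same array; the sizes below are arbitrary, and the exact numbers will depend on the hardware and cache:

import timeit
import numpy as np

arr_c = np.random.randn(100000, 100)   # C-ordered (row-major)
arr_f = np.asfortranarray(arr_c)       # F-ordered (column-major) copy
idx = np.sort(np.random.randint(0, arr_c.shape[0], 1000))

for label, a in [('C-order', arr_c), ('F-order', arr_f)]:
    t_fancy = timeit.timeit(lambda: a[idx], number=100)
    t_take = timeit.timeit(lambda: a.take(idx, axis=0), number=100)
    print("%s: fancy %.4fs, take %.4fs" % (label, t_fancy, t_take))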