Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

sorting numpy structured and record arrays is very slow

it looks like sorting numpy structured and record arrays by a single column is much slower than doing a sort on a similar standalone array:

In [111]: a = np.random.rand(1e4)

In [112]: b = np.random.rand(1e4)

In [113]: rec = np.rec.fromarrays([a,b])

In [114]: timeit rec.argsort(order='f0')
100 loops, best of 3: 18.8 ms per loop

In [115]: timeit a.argsort()
1000 loops, best of 3: 891 µs per loop

There is a marginal improvement using the structured array, but it's not dramatic:

In [120]: struct = np.empty(len(a),dtype=[('a','f8'),('b','f8')])

In [121]: struct['a'] = a

In [122]: struct['b'] = b

In [124]: timeit struct.argsort(order='a')
100 loops, best of 3: 15.8 ms per loop

This indicates that it's potentially faster to create an index array from argsort and then use that to reorder the individual arrays. This is OK except that I expect to be dealing with very large arrays and would like to avoid copying data as much as possible. Is there a more efficient way of doing this that I'm missing?

like image 956
Rok Avatar asked Oct 30 '13 12:10

Rok


People also ask

Is NumPy sort faster?

But then I thought using numpy would significantly improve the performance for a simple task like sorting. I was surprised that operating with numpy array is ~3 times slower than with python list, and ~100 times slower than Fortran.

Is NumPy array slow?

NumPy random for generating an array of random numbers ndarray of 1000 random numbers. The reason why NumPy is fast when used right is that its arrays are extremely efficient. They are like C arrays instead of Python lists.

Is NumPy sort stable?

NumPy's np. argsort is able to do stable sorting through passing kind = 'stable' argument.

Why NumPy array operations are faster than looping?

NumPy Arrays are faster than Python Lists because of the following reasons: An array is a collection of homogeneous data-types that are stored in contiguous memory locations. On the other hand, a list in Python is a collection of heterogeneous data types stored in non-contiguous memory locations.


2 Answers

What´s slowing you is the use of order, not the fact that you have a record array. If you want to sort by a single field, do it like this:

In [12]: %timeit np.argsort(rec['f0'])
1000 loops, best of 3: 829 us per loop

Once order is used, performance goes south no matter how many fields you want to sort by:

In [16]: %timeit np.argsort(rec, order=['f0'])
10 loops, best of 3: 27.9 ms per loop

In [17]: %timeit np.argsort(rec, order=['f0', 'f1'])
10 loops, best of 3: 28.4 ms per loop
like image 183
Jaime Avatar answered Sep 27 '22 17:09

Jaime


As Jaime have said, you can use argsort to sort the record array.

inds = np.argsort(rec['f0'])

And use take to avoid making a copy

np.take(rec, inds, out=rec)
like image 28
imsc Avatar answered Sep 27 '22 18:09

imsc