It looks like sorting NumPy structured and record arrays by a single column is much slower than sorting a similar standalone array:
In [111]: a = np.random.rand(int(1e4))
In [112]: b = np.random.rand(int(1e4))
In [113]: rec = np.rec.fromarrays([a,b])
In [114]: timeit rec.argsort(order='f0')
100 loops, best of 3: 18.8 ms per loop
In [115]: timeit a.argsort()
1000 loops, best of 3: 891 µs per loop
There is a marginal improvement using the structured array, but it's not dramatic:
In [120]: struct = np.empty(len(a),dtype=[('a','f8'),('b','f8')])
In [121]: struct['a'] = a
In [122]: struct['b'] = b
In [124]: timeit struct.argsort(order='a')
100 loops, best of 3: 15.8 ms per loop
This indicates that it's potentially faster to create an index array from argsort and then use that to reorder the individual arrays. This is OK except that I expect to be dealing with very large arrays and would like to avoid copying data as much as possible. Is there a more efficient way of doing this that I'm missing?
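To be concrete, the index-array approach I have in mind is something like this (a minimal sketch using a and b from above):
# argsort one of the standalone arrays, then use the resulting index
# array to reorder each column; the fancy indexing copies every array,
# which is what I would like to avoid for very large inputs
idx = a.argsort()
a_sorted = a[idx]
b_sorted = b[idx]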
What's slowing you down is the use of order, not the fact that you have a record array. If you want to sort by a single field, do it like this:
In [12]: %timeit np.argsort(rec['f0'])
1000 loops, best of 3: 829 µs per loop
Once order is used, performance goes south no matter how many fields you want to sort by:
In [16]: %timeit np.argsort(rec, order=['f0'])
10 loops, best of 3: 27.9 ms per loop
In [17]: %timeit np.argsort(rec, order=['f0', 'f1'])
10 loops, best of 3: 28.4 ms per loop
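If the goal is a reordered record array rather than just the index, a minimal sketch (reusing rec from the question) might be:
# argsort the single field, then apply the permutation with fancy
# indexing; this allocates a new, sorted copy of the record array
inds = np.argsort(rec['f0'])
sorted_rec = rec[inds]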
As Jaime said, you can use argsort to sort the record array:
inds = np.argsort(rec['f0'])
And use take to avoid making a copy:
np.take(rec, inds, out=rec)
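Putting both answers together, a self-contained sketch might look like the following (the array size and the final assert are just for illustration). The NumPy docs note that out is buffered when mode='raise' (the default), which should make out=rec safe here:
import numpy as np

# build a record array with two float fields, f0 and f1
a = np.random.rand(10000)
b = np.random.rand(10000)
rec = np.rec.fromarrays([a, b])

# fast path: argsort a single field instead of using order=
inds = np.argsort(rec['f0'])

# reorder the record array via take; out=rec writes the result back
# into the same array (buffered internally with the default mode='raise')
np.take(rec, inds, out=rec)

assert np.all(np.diff(rec['f0']) >= 0)  # rec is now sorted by f0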