 

Indirect-ish sort with Numpy

With numpy, you can do an indirect sort. That is, from an array like

>>> a = np.array([ 8, 10,  5,  2,  3,  1,  6])

And then do an indirect sort like this:

>>> np.argsort(a)
array([5, 3, 4, 2, 6, 0, 1])

This array says something like "the 0-th position of the ordered array should hold a[5], the 1-st position should hold a[3]", and so on. But is there a Numpy-powered way to get a "should-be-here" ordering instead? What do I mean? With argsort, you get the index ordering that sorts the input array, so that a[np.argsort(a)] is an ordered array. What I need is sort of the opposite: for every element of the input array, the position of that element in the ordered array. For the example:

>>> myweirdsort(a)
array([5, 6, 3, 1, 2, 0, 4])

This array says something like "a[0] goes in the 5-th position of the ordered array, a[1] goes in the 6-th position of the ordered array", and so on.

By the way, when I say "Numpy-powered", I mean a vectorized Numpy-ish way to do this. A non-Numpy way would be to iterate through every element, do something like a partition of the array, and then figure out where the element ends up in the partitioned array, but that would take too long.
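For comparison, the slow elementwise approach described above can be sketched like this (myweirdsort is the hypothetical name from the question, not a real NumPy function):

```python
import numpy as np

def myweirdsort(a):
    """Naive O(n^2) version: each element's position in the sorted
    array is the number of elements that sort before it."""
    a = np.asarray(a)
    out = np.empty(a.size, dtype=int)
    for i, x in enumerate(a):
        # strictly smaller elements, plus earlier ties, come first
        out[i] = np.sum(a < x) + np.sum(a[:i] == x)
    return out

print(myweirdsort([8, 10, 5, 2, 3, 1, 6]))  # [5 6 3 1 2 0 4]
```

This loops in Python once per element and scans the whole array each time, which is exactly the quadratic cost the question wants to avoid.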

asked Dec 25 '22 by user4052054

2 Answers

You just need to argsort the argsort again:

>>> a.argsort().argsort()
array([5, 6, 3, 1, 2, 0, 4], dtype=int64)
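Why this works, as a quick sanity-check sketch: the first argsort gives the permutation that sorts a, and argsort-ing that permutation inverts it, so the result tells you where each original element lands:

```python
import numpy as np

a = np.array([8, 10, 5, 2, 3, 1, 6])
idx = a.argsort()       # idx[k] = index in a of the k-th smallest element
ranks = idx.argsort()   # ranks[i] = position of a[i] in the sorted array

# placing each a[i] at position ranks[i] reproduces the sorted array
out = np.empty_like(a)
out[ranks] = a
print(ranks)                            # [5 6 3 1 2 0 4]
print(np.array_equal(out, np.sort(a)))  # True
```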
answered Dec 28 '22 by BrenBarn


While @BrenBarn's solution is perfectly valid, very compact, and a common construct in numpy code, it sorts twice, which always struck me as a little wasteful. It turns out you don't have to do the second sort. The following code is not as terse, but will be faster for large arrays:

>>> idx = a.argsort()
>>> my_weird_sort = np.empty_like(idx)
>>> my_weird_sort[idx] = np.arange(idx.size)

>>> my_weird_sort
array([5, 6, 3, 1, 2, 0, 4])

How much faster? I did some timings, and on my system it is marginally slower for small sizes, starts being faster for arrays of ~100-200 items, and is about 1.4-1.5x faster for arrays of 1,000 to 1,000,000 items.
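The claim is easy to check yourself with a quick benchmark sketch; the absolute numbers and the crossover point are machine-dependent:

```python
import numpy as np
from timeit import timeit

rng = np.random.default_rng(0)
a = rng.random(100_000)

def double_argsort(a):
    return a.argsort().argsort()

def scatter(a):
    # single sort, then scatter the rank positions with fancy indexing
    idx = a.argsort()
    out = np.empty_like(idx)
    out[idx] = np.arange(idx.size)
    return out

# both approaches agree before we time anything
assert np.array_equal(double_argsort(a), scatter(a))
print("double argsort:", timeit(lambda: double_argsort(a), number=20))
print("scatter       :", timeit(lambda: scatter(a), number=20))
```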


For completeness, a similar construct is often used to first sort an array, compute some value for each item of the sorted array, then reorder the result back to the original, unsorted order. As an example, to find out whether each item is the first instance of that value in the array, you could do:

>>> b = np.array([1, 3, 1, 2, 4, 3, 3, 2, 0])
>>> idx = np.argsort(b, kind='mergesort')  # need stable sort
>>> sorted_b = b[idx]
>>> sorted_b
array([0, 1, 1, 2, 2, 3, 3, 3, 4])
>>> sorted_is_first = np.concatenate(([True], sorted_b[1:] != sorted_b[:-1]))
>>> sorted_is_first
array([ True,  True, False,  True, False,  True, False, False,  True], dtype=bool)
>>> is_first = sorted_is_first[idx.argsort()]
>>> is_first
array([ True,  True, False,  True,  True, False, False, False,  True], dtype=bool)

You can also get this without the second sort, using the same scatter trick as above:

>>> is_first = np.empty_like(sorted_is_first)
>>> is_first[idx] = sorted_is_first
>>> is_first
array([ True,  True, False,  True,  True, False, False, False,  True], dtype=bool)

A change similar to this was recently added to np.unique, for the case where return_inverse indices are requested, see here. In that case, the speed up for larger sizes was almost 2x.
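For reference, np.unique's return_inverse produces exactly this kind of "where does each element land" index: each entry maps an element of the input to its position in the array of unique values. A minimal sketch with made-up data:

```python
import numpy as np

b = np.array([5, 7, 5, 6])
uniq, inv = np.unique(b, return_inverse=True)
print(uniq)  # [5 6 7]
print(inv)   # [0 2 0 1]
# indexing the unique values with the inverse reconstructs b
print(np.array_equal(uniq[inv], b))  # True
```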

answered Dec 28 '22 by Jaime