I have a numpy array of floats/ints and want to map its elements into their ranks.
If an array doesn't have duplicates, the problem can be solved by the following code:
In [49]: a1
Out[49]: array([ 0.1, 5.1, 2.1, 3.1, 4.1, 1.1, 6.1, 8.1, 7.1, 9.1])
In [50]: a1.argsort().argsort()
Out[50]: array([0, 5, 2, 3, 4, 1, 6, 8, 7, 9])
Now I want to extend this method to arrays with possible duplicates, so that duplicates are mapped to the same value. For example, I want the array a2
a2 = np.array([0.1, 1.1, 2.1, 3.1, 4.1, 1.1, 6.1, 7.1, 7.1, 1.1])
to be mapped to either
0 1 4 5 6 1 7 8 8 1
or to
0 3 4 5 6 3 7 9 9 3
or to
0 2 4 5 6 2 7 8.5 8.5 2
In the first/second case we map duplicates to the minimum/maximum rank among them; note that plain a2.argsort().argsort() does not do this, because it assigns a distinct rank to each duplicate. The third case is just the average of the first two.
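For illustration (my addition, not part of the original question), here is what plain argsort().argsort() gives for a2, using kind='stable' so that ties keep their original order; each of the three 1.1 values receives a distinct rank:

In [51]: a2 = np.array([0.1, 1.1, 2.1, 3.1, 4.1, 1.1, 6.1, 7.1, 7.1, 1.1])
In [52]: a2.argsort(kind='stable').argsort()
Out[52]: array([0, 1, 4, 5, 6, 2, 7, 8, 9, 3])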
Any suggestions?
EDIT (efficiency requirements)
In the initial description I forgot to mention the time requirements. I am looking for a solution in terms of numpy/scipy functions that avoids "pure python overhead". To make this concrete, consider the solution proposed by Richard, which solves the problem correctly but is quite slow:
def argsortdup(a1):
    sorted = np.sort(a1)
    ranked = []
    for item in a1:
        ranked.append(sorted.searchsorted(item))
    return np.array(ranked)
In [86]: a2 = np.array([ 0.1, 1.1, 2.1, 3.1, 4.1, 1.1, 6.1, 7.1, 7.1, 1.1])
In [87]: %timeit a2.argsort().argsort()
1000000 loops, best of 3: 1.55 us per loop
In [88]: %timeit argsortdup(a2)
10000 loops, best of 3: 25.6 us per loop
In [89]: a = np.arange(0.1, 1000.1)
In [90]: %timeit a.argsort().argsort()
10000 loops, best of 3: 24.5 us per loop
In [91]: %timeit argsortdup(a)
1000 loops, best of 3: 1.14 ms per loop
In [92]: a = np.arange(0.1, 10000.1)
In [93]: %timeit a.argsort().argsort()
1000 loops, best of 3: 303 us per loop
In [94]: %timeit argsortdup(a)
100 loops, best of 3: 11.9 ms per loop
The analysis above makes it clear that argsortdup is roughly 15-50 times slower than a.argsort().argsort() (about 16x on the small array, 39-47x on the larger ones). The main reason is the use of Python loops and lists.
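Note that the Python loop is not actually necessary: searchsorted accepts an array of query values, so Richard's idea can be vectorized in one line. A minimal sketch (the name argsortdup_vec is mine; this is the same np.searchsorted(np.sort(a), a) expression timed in one of the answers below):

import numpy as np

def argsortdup_vec(a1):
    # searchsorted broadcasts over an array of query values, which
    # removes the Python-level loop; the leftmost insertion point
    # maps duplicates to their minimum rank
    return np.sort(a1).searchsorted(a1)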
You can do reasonably well using unique and bincount:
>>> u, v = np.unique(a2, return_inverse=True)
>>> (np.cumsum(np.bincount(v)) - 1)[v]
array([0, 3, 4, 5, 6, 3, 7, 9, 9, 3])
Or, for the minimum rank:
>>> (np.cumsum(np.concatenate(([0], np.bincount(v)))))[v]
array([0, 1, 4, 5, 6, 1, 7, 8, 8, 1])
There's a minor speedup from giving bincount the number of bins to produce:
(np.cumsum(np.bincount(v, minlength=u.size)) - 1)[v]
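If you also need the average-rank variant from the question, one option (my sketch, not part of the original answer) is to average the minimum and maximum ranks computed above:

import numpy as np

a2 = np.array([0.1, 1.1, 2.1, 3.1, 4.1, 1.1, 6.1, 7.1, 7.1, 1.1])
u, v = np.unique(a2, return_inverse=True)
counts = np.bincount(v, minlength=u.size)
# maximum rank within each group of duplicates: 0 3 4 5 6 3 7 9 9 3
max_rank = (np.cumsum(counts) - 1)[v]
# minimum rank within each group of duplicates: 0 1 4 5 6 1 7 8 8 1
min_rank = np.cumsum(np.concatenate(([0], counts)))[v]
# average of the two: 0 2 4 5 6 2 7 8.5 8.5 2
avg_rank = (min_rank + max_rank) / 2.0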
After upgrading to the latest version of scipy, as suggested by @WarrenWeckesser in the comments, scipy.stats.rankdata seems to be faster than both scipy.stats.mstats.rankdata and np.searchsorted, making it the fastest way to do this on larger arrays.
In [1]: import numpy as np
In [2]: from scipy.stats import rankdata as rd
...: from scipy.stats.mstats import rankdata as rd2
...:
In [3]: array = np.arange(0.1, 1000000.1)
In [4]: %timeit np.searchsorted(np.sort(array), array)
1 loops, best of 3: 385 ms per loop
In [5]: %timeit rd(array)
10 loops, best of 3: 109 ms per loop
In [6]: %timeit rd2(array)
1 loops, best of 3: 205 ms per loop
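As a side note, newer versions of scipy.stats.rankdata also accept a method keyword ('average', 'min', 'max', 'dense', 'ordinal'). Since rankdata returns 1-based ranks, subtracting 1 reproduces all three outputs from the question; a minimal sketch, assuming a scipy version with the method argument:

import numpy as np
from scipy.stats import rankdata

a2 = np.array([0.1, 1.1, 2.1, 3.1, 4.1, 1.1, 6.1, 7.1, 7.1, 1.1])
rankdata(a2, method='min') - 1      # 0 1 4 5 6 1 7 8 8 1
rankdata(a2, method='max') - 1      # 0 3 4 5 6 3 7 9 9 3
rankdata(a2, method='average') - 1  # 0 2 4 5 6 2 7 8.5 8.5 2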
Here is a function that returns the output you desire (in the first case):
import numpy as np

def argsortdup(a1):
    sorted_a1 = np.sort(a1)  # sorted copy of the input
    ranked = []
    for item in a1:
        # searchsorted returns the leftmost insertion index, so
        # duplicates all receive their minimum rank
        ranked.append(sorted_a1.searchsorted(item))
    return np.array(ranked)
Basically you sort the array and then look up the index at which each item occurs in the sorted result. Because searchsorted returns the leftmost match, duplicates all receive the index of their first occurrence, i.e. the minimum rank. I tested it with your a2 example by doing
a3 = argsortdup(a2)
Yields
array([0, 1, 4, 5, 6, 1, 7, 8, 8, 1])
"Test with a2":
>>> a2
array([ 0.1, 1.1, 2.1, 3.1, 4.1, 1.1, 6.1, 7.1, 7.1, 1.1])
>>> def argsortdup(a1):
... sorted = sort(a1)
... ranked = []
... for item in a1:
... ranked.append(sorted.searchsorted(item))
... return array(ranked)
...
>>> a3 = argsortdup(a2)
>>> a2
array([ 0.1, 1.1, 2.1, 3.1, 4.1, 1.1, 6.1, 7.1, 7.1, 1.1])
>>> a3
array([0, 1, 4, 5, 6, 1, 7, 8, 8, 1])
>>>