I am trying to use arpgpartition from numpy, but it seems there is something going wrong and I cannot seem to figure it out. Here is what's happening:
These are first 5 elements of the sorted array norms
np.sort(norms)[:5] array([ 53.64759445, 54.91434479, 60.11617279, 64.09630585, 64.75318909], dtype=float32)
But when I use indices_sorted = np.argpartition(norms, 5)[:5]
norms[indices_sorted] array([ 60.11617279, 64.09630585, 53.64759445, 54.91434479, 64.75318909], dtype=float32)
When I think I should get the same result as the sorted array?
It works just fine when I use 3 as the parameter indices_sorted = np.argpartition(norms, 3)[:3]
norms[indices_sorted] array([ 53.64759445, 54.91434479, 60.11617279], dtype=float32)
This isn't making much sense to me, hoping someone can offer some insight?
EDIT: Rephrasing this question as whether argpartition preserves order of the k partitioned elements makes more sense.
argpartition() function is used to create a indirect partitioned copy of input array with its elements rearranged in such a way that the value of the element in k-th position is in the position it would be in a sorted array.
In order to get the indices of N maximum values in a NumPy array, we can use the argsort() function.
Axes are defined for arrays with more than one dimension. A 2-dimensional array has two corresponding axes: the first running vertically downwards across rows (axis 0), and the second running horizontally across columns (axis 1). Many operation can take place along one of these axes.
We need to use list of indices that are to be kept in sorted order instead of feeding the kth param as a scalar. Thus, to maintain the sorted nature across the first 5
elements, instead of np.argpartition(a,5)[:5]
, simply do -
np.argpartition(a,range(5))[:5]
Here's a sample run to make things clear -
In [84]: a = np.random.rand(10) In [85]: a Out[85]: array([ 0.85017222, 0.19406266, 0.7879974 , 0.40444978, 0.46057793, 0.51428578, 0.03419694, 0.47708 , 0.73924536, 0.14437159]) In [86]: a[np.argpartition(a,5)[:5]] Out[86]: array([ 0.19406266, 0.14437159, 0.03419694, 0.40444978, 0.46057793]) In [87]: a[np.argpartition(a,range(5))[:5]] Out[87]: array([ 0.03419694, 0.14437159, 0.19406266, 0.40444978, 0.46057793])
Please note that argpartition
makes sense on performance aspect, if we are looking to get sorted indices for a small subset of elements, let's say k
number of elems which is a small fraction of the total number of elems.
Let's use a bigger dataset and try to get sorted indices for all elems to make the above mentioned point clear -
In [51]: a = np.random.rand(10000)*100 In [52]: %timeit np.argpartition(a,range(a.size-1))[:5] 10 loops, best of 3: 105 ms per loop In [53]: %timeit a.argsort() 1000 loops, best of 3: 893 µs per loop
Thus, to sort all elems, np.argpartition
isn't the way to go.
Now, let's say I want to get sorted indices for only the first 5 elems with that big dataset and also keep the order for those -
In [68]: a = np.random.rand(10000)*100 In [69]: np.argpartition(a,range(5))[:5] Out[69]: array([1647, 942, 2167, 1371, 2571]) In [70]: a.argsort()[:5] Out[70]: array([1647, 942, 2167, 1371, 2571]) In [71]: %timeit np.argpartition(a,range(5))[:5] 10000 loops, best of 3: 112 µs per loop In [72]: %timeit a.argsort()[:5] 1000 loops, best of 3: 888 µs per loop
Very useful here!
Given the task of indirectly sorting a subset (the top k, top meaning first in sort order) there are two builtin solutions: argsort
and argpartition
cf. @Divakar's answer.
If, however, performance is a consideration then it may (depending on the sizes of the data and the subset of interest) be well worth resisting the "lure of the one-liner", investing one more line and applying argsort
on the output of argpartition
:
>>> def top_k_sort(a, k): ... return np.argsort(a)[:k] ... >>> def top_k_argp(a, k): ... return np.argpartition(a, range(k))[:k] ... >>> def top_k_hybrid(a, k): ... b = np.argpartition(a, k)[:k] ... return b[np.argsort(a[b])] >>> k = 100 >>> timeit.timeit('f(a,k)', 'a=rng((100000,))', number = 1000, globals={'f': top_k_sort, 'rng': np.random.random, 'k': k}) 8.348663672804832 >>> timeit.timeit('f(a,k)', 'a=rng((100000,))', number = 1000, globals={'f': top_k_argp, 'rng': np.random.random, 'k': k}) 9.869098862167448 >>> timeit.timeit('f(a,k)', 'a=rng((100000,))', number = 1000, globals={'f': top_k_hybrid, 'rng': np.random.random, 'k': k}) 1.2305558240041137
argsort
is O(n log n), argpartition
with range argument appears to be O(nk) (?), and argpartition
+ argsort
is O(n + k log k)
Therefore in an interesting regime n >> k >> 1 the hybrid method is expected to be fastest
UPDATE: ND version:
import numpy as np from timeit import timeit def top_k_sort(A,k,axis=-1): return A.argsort(axis=axis)[(*axis%A.ndim*(slice(None),),slice(k))] def top_k_partition(A,k,axis=-1): return A.argpartition(range(k),axis=axis)[(*axis%A.ndim*(slice(None),),slice(k))] def top_k_hybrid(A,k,axis=-1): B = A.argpartition(k,axis=axis)[(*axis%A.ndim*(slice(None),),slice(k))] return np.take_along_axis(B,np.take_along_axis(A,B,axis).argsort(axis),axis) A = np.random.random((100,10000)) k = 100 from timeit import timeit for f in globals().copy(): if f.startswith("top_"): print(f, timeit(f"{f}(A,k)",globals=globals(),number=10)*100)
Sample run:
top_k_sort 63.72379460372031 top_k_partition 99.30561298970133 top_k_hybrid 10.714635509066284
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With