Cannot understand numpy argpartition output

I am trying to use argpartition from numpy, but something seems to be going wrong and I cannot figure out what. Here is what's happening:

These are the first 5 elements of the sorted array norms:

    np.sort(norms)[:5]
    array([ 53.64759445,  54.91434479,  60.11617279,  64.09630585,  64.75318909], dtype=float32)

But when I use argpartition with 5 as the parameter:

    indices_sorted = np.argpartition(norms, 5)[:5]
    norms[indices_sorted]
    array([ 60.11617279,  64.09630585,  53.64759445,  54.91434479,  64.75318909], dtype=float32)

I expected to get the same result as the sorted array.

It works just fine when I use 3 as the parameter:

    indices_sorted = np.argpartition(norms, 3)[:3]
    norms[indices_sorted]
    array([ 53.64759445,  54.91434479,  60.11617279], dtype=float32)

This isn't making much sense to me, hoping someone can offer some insight?

EDIT: It makes more sense to rephrase this question as: does argpartition preserve the sorted order of the k partitioned elements?

What does np.argpartition do?

np.argpartition() returns an array of indices that indirectly partitions the input array: the index at position k points to the element that would occupy position k in a sorted array, all smaller elements are indexed before it, and all equal or larger elements are indexed after it. The order within each of these two groups is undefined.
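A minimal sketch of that guarantee, which is also why the question's output for k=5 came back unordered. The array, seed and variable names here are illustrative assumptions, not taken from the question:

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
a = rng.random(20)
k = 5

idx = np.argpartition(a, k)

# The element indexed at position k is the one a full sort would place there.
kth_is_sorted = bool(a[idx[k]] == np.sort(a)[k])

# Everything indexed before position k is <= it, everything from k on is >= it.
left_ok = bool(np.all(a[idx[:k]] <= a[idx[k]]))
right_ok = bool(np.all(a[idx[k:]] >= a[idx[k]]))

# The first k indices select the k smallest values, but only as a set:
# sorting them is needed before comparing with the sorted array.
same_values = np.array_equal(np.sort(a[idx[:k]]), np.sort(a)[:k])
```

If the first k values happen to come out in sorted order, as with k=3 in the question, that is a coincidence rather than something argpartition promises.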

How do I get indices of N maximum values in a numpy array?

In order to get the indices of N maximum values in a NumPy array, we can use the argsort() function.
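As a small illustration of the argsort approach, assuming a made-up six-element array and N = 3:

```python
import numpy as np

a = np.array([4.0, 9.0, 1.0, 7.0, 3.0, 8.0])
n = 3

# argsort returns indices in ascending order of value, so the last n
# indices belong to the n largest values; reverse them to get largest first.
top_n = np.argsort(a)[-n:][::-1]
top_values = a[top_n]
```

Here top_n holds the indices of 9.0, 8.0 and 7.0, in that order.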

What are axis in Numpy?

Axes are defined for arrays with more than one dimension. A 2-dimensional array has two corresponding axes: the first running vertically downwards across rows (axis 0), and the second running horizontally across columns (axis 1). Many operations can take place along one of these axes.
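A short sketch of what "along an axis" means in practice, using a made-up 2x3 array:

```python
import numpy as np

m = np.array([[3, 1, 2],
              [0, 5, 4]])

col_sums = m.sum(axis=0)  # reduce down the rows: one result per column
row_sums = m.sum(axis=1)  # reduce across the columns: one result per row

# Index-returning functions such as argsort accept the same argument;
# here each row is ordered independently of the other.
row_order = np.argsort(m, axis=1)
```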

We need to pass a list of the indices that are to be kept in sorted order instead of feeding the kth param as a scalar. Thus, to keep the first 5 elements in sorted order, instead of np.argpartition(a,5)[:5], simply do -

    np.argpartition(a,range(5))[:5]

Here's a sample run to make things clear -

    In [84]: a = np.random.rand(10)

    In [85]: a
    Out[85]:
    array([ 0.85017222,  0.19406266,  0.7879974 ,  0.40444978,  0.46057793,
            0.51428578,  0.03419694,  0.47708   ,  0.73924536,  0.14437159])

    In [86]: a[np.argpartition(a,5)[:5]]
    Out[86]: array([ 0.19406266,  0.14437159,  0.03419694,  0.40444978,  0.46057793])

    In [87]: a[np.argpartition(a,range(5))[:5]]
    Out[87]: array([ 0.03419694,  0.14437159,  0.19406266,  0.40444978,  0.46057793])

Please note that argpartition only makes sense performance-wise if we are looking to get sorted indices for a small subset of elements, let's say k elems where k is a small fraction of the total number of elems.

Let's use a bigger dataset and try to get sorted indices for all elems to make the above mentioned point clear -

    In [51]: a = np.random.rand(10000)*100

    In [52]: %timeit np.argpartition(a,range(a.size-1))[:5]
    10 loops, best of 3: 105 ms per loop

    In [53]: %timeit a.argsort()
    1000 loops, best of 3: 893 µs per loop

Thus, to sort all elems, np.argpartition isn't the way to go.

Now, let's say I want to get sorted indices for only the first 5 elems with that big dataset and also keep the order for those -

    In [68]: a = np.random.rand(10000)*100

    In [69]: np.argpartition(a,range(5))[:5]
    Out[69]: array([1647,  942, 2167, 1371, 2571])

    In [70]: a.argsort()[:5]
    Out[70]: array([1647,  942, 2167, 1371, 2571])

    In [71]: %timeit np.argpartition(a,range(5))[:5]
    10000 loops, best of 3: 112 µs per loop

    In [72]: %timeit a.argsort()[:5]
    1000 loops, best of 3: 888 µs per loop

Very useful here!

Given the task of indirectly sorting a subset (the top k, top meaning first in sort order), there are two builtin solutions: argsort and argpartition (cf. @Divakar's answer).

If, however, performance is a consideration then it may (depending on the sizes of the data and the subset of interest) be well worth resisting the "lure of the one-liner", investing one more line and applying argsort on the output of argpartition:

    >>> import timeit
    >>> import numpy as np
    >>>
    >>> def top_k_sort(a, k):
    ...     return np.argsort(a)[:k]
    ...
    >>> def top_k_argp(a, k):
    ...     return np.argpartition(a, range(k))[:k]
    ...
    >>> def top_k_hybrid(a, k):
    ...     b = np.argpartition(a, k)[:k]   # k smallest, unordered
    ...     return b[np.argsort(a[b])]      # sort only those k
    ...
    >>> k = 100
    >>> timeit.timeit('f(a,k)', 'a=rng((100000,))', number = 1000, globals={'f': top_k_sort, 'rng': np.random.random, 'k': k})
    8.348663672804832
    >>> timeit.timeit('f(a,k)', 'a=rng((100000,))', number = 1000, globals={'f': top_k_argp, 'rng': np.random.random, 'k': k})
    9.869098862167448
    >>> timeit.timeit('f(a,k)', 'a=rng((100000,))', number = 1000, globals={'f': top_k_hybrid, 'rng': np.random.random, 'k': k})
    1.2305558240041137

argsort is O(n log n), argpartition with range argument appears to be O(nk) (?), and argpartition + argsort is O(n + k log k).

Therefore, in the interesting regime n >> k >> 1, the hybrid method is expected to be fastest.
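Before comparing speed it is worth confirming that the three approaches select the same values. The following is only a sketch of such a check: it redefines the three functions so that it runs on its own, uses np.arange(k) where the answer uses range(k), and the seed and array size are arbitrary assumptions. It compares values rather than indices so that ties cannot cause a spurious mismatch.

```python
import numpy as np

def top_k_sort(a, k):
    return np.argsort(a)[:k]

def top_k_argp(a, k):
    # np.arange(k) asks for positions 0..k-1 to all be in sorted position
    return np.argpartition(a, np.arange(k))[:k]

def top_k_hybrid(a, k):
    b = np.argpartition(a, k)[:k]   # k smallest, in no particular order
    return b[np.argsort(a[b])]      # order just those k

rng = np.random.default_rng(0)      # arbitrary seed
a = rng.random(100_000)
k = 100

ref = a[top_k_sort(a, k)]
agree_argp = np.array_equal(a[top_k_argp(a, k)], ref)
agree_hybrid = np.array_equal(a[top_k_hybrid(a, k)], ref)
```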

UPDATE: ND version:

    import numpy as np
    from timeit import timeit

    def top_k_sort(A,k,axis=-1):
        return A.argsort(axis=axis)[(*axis%A.ndim*(slice(None),),slice(k))]

    def top_k_partition(A,k,axis=-1):
        return A.argpartition(range(k),axis=axis)[(*axis%A.ndim*(slice(None),),slice(k))]

    def top_k_hybrid(A,k,axis=-1):
        B = A.argpartition(k,axis=axis)[(*axis%A.ndim*(slice(None),),slice(k))]
        return np.take_along_axis(B,np.take_along_axis(A,B,axis).argsort(axis),axis)

    A = np.random.random((100,10000))
    k = 100

    for f in globals().copy():
        if f.startswith("top_"):
            print(f, timeit(f"{f}(A,k)",globals=globals(),number=10)*100)

Sample run:

    top_k_sort 63.72379460372031
    top_k_partition 99.30561298970133
    top_k_hybrid 10.714635509066284

Tags:

python

arrays

numpy

rookie

2 Answers

Divakar

Paul Panzer
