
Cannot understand numpy argpartition output

I am trying to use argpartition from numpy, but something seems to be going wrong and I cannot figure out what. Here is what's happening:

These are the first 5 elements of the sorted array norms:

    np.sort(norms)[:5]
    array([ 53.64759445,  54.91434479,  60.11617279,  64.09630585,  64.75318909], dtype=float32)

But when I use indices_sorted = np.argpartition(norms, 5)[:5], I get the same values in a different order:

    norms[indices_sorted]
    array([ 60.11617279,  64.09630585,  53.64759445,  54.91434479,  64.75318909], dtype=float32)

I expected to get the same result as the sorted array.

It works just fine when I use 3 as the parameter, indices_sorted = np.argpartition(norms, 3)[:3]:

    norms[indices_sorted]
    array([ 53.64759445,  54.91434479,  60.11617279], dtype=float32)

This isn't making much sense to me; I am hoping someone can offer some insight.

EDIT: Rephrasing this question as whether argpartition preserves order of the k partitioned elements makes more sense.

rookie asked Feb 12 '17 05:02



2 Answers

To keep the selected elements in sorted order, we need to pass a list of the indices that are to be sorted as the kth parameter instead of a scalar. Thus, to keep the first 5 elements sorted, instead of np.argpartition(a,5)[:5], simply do -

np.argpartition(a,range(5))[:5] 

Here's a sample run to make things clear -

    In [84]: a = np.random.rand(10)

    In [85]: a
    Out[85]:
    array([ 0.85017222,  0.19406266,  0.7879974 ,  0.40444978,  0.46057793,
            0.51428578,  0.03419694,  0.47708   ,  0.73924536,  0.14437159])

    In [86]: a[np.argpartition(a,5)[:5]]
    Out[86]: array([ 0.19406266,  0.14437159,  0.03419694,  0.40444978,  0.46057793])

    In [87]: a[np.argpartition(a,range(5))[:5]]
    Out[87]: array([ 0.03419694,  0.14437159,  0.19406266,  0.40444978,  0.46057793])
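The sample run shows the behaviour; the guarantee behind it can also be checked directly. The following is a minimal sketch, not part of the original answer: the seed, the array size and the variable names are arbitrary choices made for illustration. It checks that a scalar kth only puts position k in its sorted place and splits the rest around it, while a sequence of positions puts each listed position in its sorted place.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
a = rng.random(10)
k = 5

idx = np.argpartition(a, k)
part = a[idx]

# Scalar kth: position k holds the value it would hold in a fully sorted array.
kth_in_place = bool(part[k] == np.sort(a)[k])

# Everything before position k is <= part[k]; everything from k on is >= part[k].
split_ok = bool(np.all(part[:k] <= part[k]) and np.all(part[k:] >= part[k]))

# The first k values are the k smallest as a set; their order is not specified.
same_values = bool(np.array_equal(np.sort(part[:k]), np.sort(a)[:k]))

# Sequence kth: every listed position ends up in its sorted place.
idx_ordered = np.argpartition(a, list(range(k)))[:k]
ordered = bool(np.array_equal(a[idx_ordered], np.sort(a)[:k]))

print(kth_in_place, split_ok, same_values, ordered)
```

So the result for k=3 in the question coming out sorted is a coincidence of the data, not something argpartition promises.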

Please note that argpartition pays off performance-wise only when we want sorted indices for a small subset of elements, say k elements where k is a small fraction of the total number of elements.

Let's use a bigger dataset and try to get sorted indices for all elems to make the above mentioned point clear -

    In [51]: a = np.random.rand(10000)*100

    In [52]: %timeit np.argpartition(a,range(a.size-1))[:5]
    10 loops, best of 3: 105 ms per loop

    In [53]: %timeit a.argsort()
    1000 loops, best of 3: 893 µs per loop

Thus, to sort all elems, np.argpartition isn't the way to go.

Now, let's say I want to get sorted indices for only the first 5 elems with that big dataset and also keep the order for those -

    In [68]: a = np.random.rand(10000)*100

    In [69]: np.argpartition(a,range(5))[:5]
    Out[69]: array([1647,  942, 2167, 1371, 2571])

    In [70]: a.argsort()[:5]
    Out[70]: array([1647,  942, 2167, 1371, 2571])

    In [71]: %timeit np.argpartition(a,range(5))[:5]
    10000 loops, best of 3: 112 µs per loop

    In [72]: %timeit a.argsort()[:5]
    1000 loops, best of 3: 888 µs per loop

Very useful here!

Divakar answered Sep 21 '22 00:09


Given the task of indirectly sorting a subset (the top k, top meaning first in sort order) there are two builtin solutions: argsort and argpartition cf. @Divakar's answer.

If, however, performance is a consideration then it may (depending on the sizes of the data and the subset of interest) be well worth resisting the "lure of the one-liner", investing one more line and applying argsort on the output of argpartition:

    >>> import timeit
    >>> import numpy as np
    >>>
    >>> def top_k_sort(a, k):
    ...     return np.argsort(a)[:k]
    ...
    >>> def top_k_argp(a, k):
    ...     return np.argpartition(a, range(k))[:k]
    ...
    >>> def top_k_hybrid(a, k):
    ...     b = np.argpartition(a, k)[:k]
    ...     return b[np.argsort(a[b])]
    ...
    >>> k = 100
    >>> timeit.timeit('f(a,k)', 'a=rng((100000,))', number = 1000, globals={'f': top_k_sort, 'rng': np.random.random, 'k': k})
    8.348663672804832
    >>> timeit.timeit('f(a,k)', 'a=rng((100000,))', number = 1000, globals={'f': top_k_argp, 'rng': np.random.random, 'k': k})
    9.869098862167448
    >>> timeit.timeit('f(a,k)', 'a=rng((100000,))', number = 1000, globals={'f': top_k_hybrid, 'rng': np.random.random, 'k': k})
    1.2305558240041137

argsort is O(n log n), argpartition with a range argument appears to be O(nk) (?), and argpartition + argsort is O(n + k log k).

Therefore, in the interesting regime n >> k >> 1, the hybrid method is expected to be fastest.
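The timings only matter if the three functions return the same thing, so here is a small sanity check. It is a sketch rather than part of the original answer: the functions are copied from above (with range(k) wrapped in list, which should not change the result), and the test data is a permutation of integers, an assumption made so that all values are distinct and the answer is unique.

```python
import numpy as np

def top_k_sort(a, k):
    # Full indirect sort, then keep the first k indices.
    return np.argsort(a)[:k]

def top_k_argp(a, k):
    # Ask argpartition to put positions 0..k-1 in their sorted places.
    return np.argpartition(a, list(range(k)))[:k]

def top_k_hybrid(a, k):
    # Select the k smallest (unordered), then sort only those k.
    b = np.argpartition(a, k)[:k]
    return b[np.argsort(a[b])]

rng = np.random.default_rng(0)  # arbitrary seed
a = rng.permutation(100_000).astype(np.float64)  # distinct values by construction
k = 100

ref = top_k_sort(a, k)
agree = bool(
    np.array_equal(ref, top_k_argp(a, k))
    and np.array_equal(ref, top_k_hybrid(a, k))
)
print(agree)
```

If the data contains ties, the returned indices may differ between the methods even though the selected values are the same.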

UPDATE: ND version:

    import numpy as np
    from timeit import timeit

    def top_k_sort(A, k, axis=-1):
        # Full argsort along axis, then keep the first k entries along that axis.
        return A.argsort(axis=axis)[(*axis%A.ndim*(slice(None),), slice(k))]

    def top_k_partition(A, k, axis=-1):
        # argpartition with positions 0..k-1 each placed in sorted order.
        return A.argpartition(range(k), axis=axis)[(*axis%A.ndim*(slice(None),), slice(k))]

    def top_k_hybrid(A, k, axis=-1):
        # Select the k smallest along axis (unordered), then sort only those k.
        B = A.argpartition(k, axis=axis)[(*axis%A.ndim*(slice(None),), slice(k))]
        return np.take_along_axis(B, np.take_along_axis(A, B, axis).argsort(axis), axis)

    A = np.random.random((100, 10000))
    k = 100

    for f in globals().copy():
        if f.startswith("top_"):
            # Total seconds for 10 runs, times 100, i.e. milliseconds per call.
            print(f, timeit(f"{f}(A,k)", globals=globals(), number=10)*100)

Sample run:

    top_k_sort 63.72379460372031
    top_k_partition 99.30561298970133
    top_k_hybrid 10.714635509066284
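The index expression in the ND version is compact but hard to read. The following is a sketch of the same hybrid idea written with np.take to slice along the axis; the name top_k_hybrid_nd, the small test array and the seed are made up for illustration, and the data is again a permutation so that values are distinct. It compares the result against a full argsort for several axes.

```python
import numpy as np

def top_k_hybrid_nd(A, k, axis=-1):
    # Indices of the k smallest entries along `axis`, in unspecified order.
    B = np.take(np.argpartition(A, k, axis=axis), np.arange(k), axis=axis)
    # Order those k candidates by the values they point to.
    order = np.argsort(np.take_along_axis(A, B, axis=axis), axis=axis)
    return np.take_along_axis(B, order, axis=axis)

rng = np.random.default_rng(0)  # arbitrary seed
A = rng.permutation(300).reshape(6, 50).astype(np.float64)  # distinct values
k = 4

all_match = True
for axis in (0, 1, -1):
    # Reference: full argsort along the axis, truncated to the first k.
    expected = np.take(np.argsort(A, axis=axis), np.arange(k), axis=axis)
    all_match = all_match and bool(
        np.array_equal(top_k_hybrid_nd(A, k, axis=axis), expected)
    )
print(all_match)
```

Note that k has to be smaller than the length of the chosen axis, since argpartition needs position k to exist.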
Paul Panzer answered Sep 20 '22 00:09