
Numpy argsort vs Scipy.stats rankdata


I've recently used both of these functions, and am looking for input from anyone who can speak to the following:

  • do argsort and rankdata differ fundamentally in their purpose?
  • are there performance advantages with one over the other? (specifically: large vs small array performance differences?)
  • what is the memory overhead associated with importing rankdata?

Thanks in advance.

p.s. I could not create the new tags 'argsort' or 'rankdata'. If anyone with sufficient standing feels they should be added to this question, please do.

asked Mar 23 '18 by Boreal Coder


1 Answer

Do argsort and rankdata differ fundamentally in their purpose?

In my opinion, they do, slightly. argsort gives you the positions the elements would occupy if the data were sorted, while rankdata gives you each element's rank. The difference becomes apparent in the case of ties:

import numpy as np
from scipy import stats

a = np.array([5, 0.3, 0.4, 1, 1, 1, 3, 42])

# Build ranks from argsort: the element at position i of the sorted
# order gets rank i (0-based), so tied values receive distinct ranks.
almost_ranks = np.empty_like(a)
almost_ranks[np.argsort(a)] = np.arange(len(a))
print(almost_ranks)        # 0-based "ranks" via argsort
print(almost_ranks + 1)    # 1-based, to compare with rankdata
print(stats.rankdata(a))   # ties share their average rank

This prints (notice 3. 4. 5. vs. 4. 4. 4. in the tied positions):

[6. 0. 1. 2. 3. 4. 5. 7.]
[7. 1. 2. 3. 4. 5. 6. 8.]
[7. 1. 2. 4. 4. 4. 6. 8.]
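To make the tie handling explicit, scipy.stats.rankdata also takes a `method` parameter that controls how ties are ranked; with `method='ordinal'` ties are broken by order of appearance, reproducing the argsort-based ranks above:

```python
import numpy as np
from scipy import stats

a = np.array([5, 0.3, 0.4, 1, 1, 1, 3, 42])

print(stats.rankdata(a, method='average'))  # default: ties share the mean rank
print(stats.rankdata(a, method='min'))      # ties all get the lowest rank
print(stats.rankdata(a, method='ordinal'))  # distinct ranks, like argsort
```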

Are there performance advantages with one over the other? (specifically: large vs small array performance differences?)

Both algorithms seem to me to have the same complexity, O(N log N). I would expect the numpy implementation to be slightly faster, as it carries a bit less overhead, but you should test this yourself. Checking the code of scipy.stats.rankdata, it seems (at present, in my Python installation) to be calling np.unique among other functions, so I would guess it takes more time in practice.
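A quick way to test this yourself is a timeit comparison along these lines (a minimal sketch; the actual numbers will depend on your machine and your NumPy/SciPy versions):

```python
import timeit
import numpy as np
from scipy import stats

for n in (100, 1_000_000):
    a = np.random.rand(n)
    t_argsort = timeit.timeit(lambda: np.argsort(a), number=20)
    t_rankdata = timeit.timeit(lambda: stats.rankdata(a), number=20)
    print(f"n={n}: argsort {t_argsort:.4f}s, rankdata {t_rankdata:.4f}s")
```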

What is the memory overhead associated with importing rankdata?

Well, importing rankdata imports scipy (if you have not done so already), so the overhead is that of importing scipy.

answered Sep 19 '22 by ntg