Python SciPy Stats percentileofscore

Consider the following Python code:

In [1]: import numpy as np
In [2]: import scipy.stats as stats
In [3]: ar = np.array([0.8389, 0.5176, 0.1867, 0.1953, 0.4153, 0.6036, 0.2497, 0.5188, 0.4723, 0.3963])
In [4]: x = ar[-1]
In [5]: stats.percentileofscore(ar, x, kind='strict')
Out[5]: 30.0
In [6]: stats.percentileofscore(ar, x, kind='rank')
Out[6]: 40.0
In [7]: stats.percentileofscore(ar, x, kind='weak')
Out[7]: 40.0
In [8]: stats.percentileofscore(ar, x, kind='mean')
Out[8]: 35.0

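The kind argument selects how the percentile is defined: 'strict' counts only the values strictly less than x, 'weak' counts the values less than or equal to x, 'mean' averages those two, and 'rank' averages the percentage rankings of all values tied with x.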

Now when I use Excel's PERCENTRANK function with the same data, I get 0.3333. This appears to be correct, as 3 of the other 9 values are less than x=0.3963 and 3/9 = 0.3333.

Can someone explain why I'm getting inconsistent results?

Jason Strimpel asked Nov 30 '25 21:11



1 Answer

When I rewrote this function in scipy.stats, I found many different definitions in use; some of them are included as options of the kind argument.

The basic example is ranking students by a score. In this case the sample includes all students, the one being ranked among them, and percentileofscore gives that student's rank among all students. The main distinction between the definitions is then just how to handle ties.
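A minimal pure-Python sketch of the four kind definitions, as described in the SciPy documentation, may make the tie handling concrete. The function name `percentile_of_score` is made up for illustration; it is not the scipy.stats implementation and skips input validation and NaN handling.

```python
def percentile_of_score(values, score, kind="rank"):
    """Illustrative re-implementation of the four `kind` definitions."""
    n = len(values)
    left = sum(v < score for v in values)    # values strictly below score
    right = sum(v <= score for v in values)  # values below or equal to score
    if kind == "strict":
        return 100.0 * left / n
    if kind == "weak":
        return 100.0 * right / n
    if kind == "mean":
        return 100.0 * (left + right) / (2 * n)
    if kind == "rank":
        # Average rank of the tied scores; the +1 applies only when
        # score actually occurs in values.
        present = 1 if right > left else 0
        return 100.0 * (left + right + present) / (2 * n)
    raise ValueError("kind must be 'rank', 'strict', 'weak' or 'mean'")


tied = [1, 2, 3, 3, 4]
for kind in ("strict", "rank", "weak", "mean"):
    print(kind, percentile_of_score(tied, 3, kind))
# strict 40.0
# rank 70.0
# weak 80.0
# mean 60.0
```

With the tied scores all four definitions disagree, while on the tie-free data from the question 'rank' and 'weak' coincide at 40.0.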

Excel seems to rank a score the way you would rank a student relative to an existing scale, for example the rank of a score on the historical GRE scale. I have no idea whether Excel drops one entry when the score is not in the existing list.
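One reading consistent with the 0.3333 reported in the question is that the score is compared against the other n - 1 entries only: 3 of the remaining 9 values are smaller, and 3/9 = 0.3333. The sketch below works under that assumption. The helper name `percentrank_like` is hypothetical (it is neither an Excel nor a SciPy function), and it only covers a score that is already in the list.

```python
ar = [0.8389, 0.5176, 0.1867, 0.1953, 0.4153,
      0.6036, 0.2497, 0.5188, 0.4723, 0.3963]
x = ar[-1]


def percentrank_like(values, score):
    """Fraction of the other n - 1 values that lie below score.

    Assumption: score is one of the entries in values. Scores that are
    not in the list are not handled here.
    """
    below = sum(v < score for v in values)
    return below / (len(values) - 1)


print(round(percentrank_like(ar, x), 4))  # 0.3333
```

The only difference from kind='strict' is the denominator: n - 1 = 9 here instead of n = 10.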

A similar problem in statistics is the choice of "plotting positions" for quantiles. I couldn't find a good reference on the internet. One general formula is given at http://amsglossary.allenpress.com/glossary/search?id=plotting-position1 and Wikipedia has only a short paragraph: http://en.wikipedia.org/wiki/Q-Q_plot#Plotting_positions

The literature contains a large number of different choices for b (or even for a second parameter a) that correspond to different approximations for different distributions. Several are implemented in scipy.stats.mstats.
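As a rough illustration, the sketch below evaluates the two-parameter form (i - alpha) / (n + 1 - alpha - beta), which is the formula documented for scipy.stats.mstats.plotting_positions. Reading the single parameter b mentioned above as the case alpha = beta = b is an assumption, and the function name `plotting_position` is made up for this example.

```python
def plotting_position(i, n, alpha, beta):
    """Position of the i-th smallest of n observations (i starts at 1)."""
    return (i - alpha) / (n + 1 - alpha - beta)


# x = 0.3963 from the question is the 4th smallest of 10 values.
i, n = 4, 10
choices = [
    ("alpha=0,   beta=0   -> i / (n + 1)", 0, 0),
    ("alpha=0.5, beta=0.5 -> (i - 0.5) / n", 0.5, 0.5),
    ("alpha=1,   beta=1   -> (i - 1) / (n - 1)", 1, 1),
    ("alpha=1,   beta=0   -> (i - 1) / n", 1, 0),
    ("alpha=0,   beta=1   -> i / n", 0, 1),
]
for label, alpha, beta in choices:
    print(label, round(plotting_position(i, n, alpha, beta), 4))
# alpha=0,   beta=0   -> i / (n + 1) 0.3636
# alpha=0.5, beta=0.5 -> (i - 0.5) / n 0.35
# alpha=1,   beta=1   -> (i - 1) / (n - 1) 0.3333
# alpha=1,   beta=0   -> (i - 1) / n 0.3
# alpha=0,   beta=1   -> i / n 0.4
```

For tie-free data like this, the same observation lands at 0.30, 0.35, 0.40 or 0.3333 depending only on the two parameters, which matches the 'strict', 'mean', 'weak' and Excel values from the question.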

I don't think it's a question of which is right. The questions are what you want to use it for, and what the common definition is for your problem or your field.

Josef answered Dec 03 '25 13:12
