I have a dataframe with a column that has numerical values. This column is not well-approximated by a normal distribution. Given another numerical value, not in this column, how can I calculate its percentile in the column? That is, if the value is greater than 80% of the values in the column but less than the other 20%, it would be in the 20th percentile.
To find the percentile of a value relative to an array (or, in your case, a dataframe column), use the SciPy function stats.percentileofscore().
For example, if we have a value x (the other numerical value not in the dataframe) and a reference array arr (the column from the dataframe), we can find the percentile of x by:
from scipy import stats
percentile = stats.percentileofscore(arr, x)
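Since you mention the column is not normally distributed, here is a minimal sketch with a deliberately skewed column. The dataframe, the column name a, and the value of x are all made up for illustration. The function only counts how many values fall below (or at) x, so no distributional assumption is involved.

```python
import pandas as pd
from scipy import stats

# Hypothetical, deliberately skewed column; the values are invented for illustration.
df = pd.DataFrame({"a": [1, 2, 2, 3, 10, 50, 200]})
x = 5  # value to score; it does not appear in the column

# 4 of the 7 values are below x, so the result is 4 / 7 * 100.
percentile = stats.percentileofscore(df["a"], x)
print(round(percentile, 2))  # 57.14
```

Because x does not occur in the column here, every choice of kind gives the same answer; the choice only matters when x ties with values in the column.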
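Note that there is a third parameter to the stats.percentileofscore() function, kind, that has a significant impact on the resulting percentile when x ties with values in the array. You can choose from rank (the default), weak (counts values less than or equal to x), strict (counts only values strictly less than x), and mean (the average of weak and strict). See the docs for more information.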
For an example of the difference:
>>> df
   a
0  1
1  2
2  3
3  4
4  5
>>> stats.percentileofscore(df['a'], 4, kind='rank')
80.0
>>> stats.percentileofscore(df['a'], 4, kind='weak')
80.0
>>> stats.percentileofscore(df['a'], 4, kind='strict')
60.0
>>> stats.percentileofscore(df['a'], 4, kind='mean')
70.0
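To see where the strict, weak, and mean numbers come from, here is a rough sketch that reproduces them by hand with NumPy. It is only meant to illustrate the definitions, not to mirror SciPy's actual implementation, and it leaves out rank because its handling of ties is more involved.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5])
x = 4
n = arr.size

# strict: share of values strictly below x (1, 2, 3 -> 3 of 5)
strict = np.count_nonzero(arr < x) * 100 / n
# weak: share of values less than or equal to x (1, 2, 3, 4 -> 4 of 5)
weak = np.count_nonzero(arr <= x) * 100 / n
# mean: average of the two
mean = (strict + weak) / 2

print(strict, weak, mean)  # 60.0 80.0 70.0
```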
As a final note, if a value is greater than 80% of the other values in the column, it is in the 80th percentile, not the 20th (see the example above for how the kind parameter affects this score somewhat). See this Wikipedia article for more information.