Calculate percentile for every value in a column of dataframe

I am trying to calculate the percentile of every value in column a of a DataFrame x.

Is there a better way to write the following piece of code?

from scipy import stats

x["pcta"] = [stats.percentileofscore(x["a"].values, i)
             for i in x["a"].values]

I would like to see better performance.

asked May 27 '17 00:05 by Praveen Gupta Sanka



1 Answer

It seems like you want Series.rank() with pct=True:

x["pcta"] = x["a"].rank(pct=True)  # in decimal form (0-1); multiply by 100 to match percentileofscore
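As a quick sanity check, the sketch below compares the rank-based result with the original percentileofscore loop. The sample values are made up for illustration (the original x is not shown), and the check assumes the defaults on both sides: method="average" for rank and kind="rank" for percentileofscore.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical sample data with a tie (the two 1s); the original x is not shown.
x = pd.DataFrame({"a": [3, 1, 4, 1, 5]})

# Vectorized: average rank of each value divided by the number of values,
# scaled from a 0-1 fraction to a 0-100 percentile.
x["pcta"] = x["a"].rank(pct=True) * 100

# Reference: the per-value loop from the question, using scipy's defaults.
values = x["a"].to_numpy()
x["pcta_loop"] = [stats.percentileofscore(values, v) for v in values]

print(x)
print("match:", np.allclose(x["pcta"], x["pcta_loop"]))
```

The two columns should agree for data like this, ties included. Missing values are handled differently by the two functions, so the comparison is only meant for columns without NaN.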

Performance:

import scipy.stats as scs

%timeit [scs.percentileofscore(x["a"].values, i) for i in x["a"].values]
1000 loops, best of 3: 877 µs per loop

%timeit x.rank(pct=True)
10000 loops, best of 3: 107 µs per loop
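The timings above use the IPython %timeit magic. Outside IPython, a rough equivalent can be sketched with the standard timeit module. The data here is randomly generated as a stand-in for the original x, and the helper names are hypothetical, so absolute numbers will differ from those quoted above.

```python
import timeit

import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical stand-in data: 1,000 random integers in column "a".
rng = np.random.default_rng(0)
x = pd.DataFrame({"a": rng.integers(0, 100, size=1_000)})


def with_percentileofscore(df):
    """Per-value loop from the question (one scipy call per row)."""
    values = df["a"].to_numpy()
    return [stats.percentileofscore(values, v) for v in values]


def with_rank(df):
    """Single vectorized pandas call, scaled to 0-100."""
    return df["a"].rank(pct=True) * 100


runs = 3
t_loop = timeit.timeit(lambda: with_percentileofscore(x), number=runs)
t_rank = timeit.timeit(lambda: with_rank(x), number=runs)
print(f"percentileofscore loop: {t_loop / runs:.6f} s per call")
print(f"Series.rank:            {t_rank / runs:.6f} s per call")
```

The loop makes one scipy call per row, so the gap between the two approaches would be expected to widen as the column grows.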
answered Oct 06 '22 12:10 by Brad Solomon