I am trying to calculate the percentile for every value in column a of a DataFrame x.
Is there a better way to write the following piece of code?
from scipy import stats

x["pcta"] = [stats.percentileofscore(x["a"].values, i)
             for i in x["a"].values]
I would like to see better performance.
To find percentiles of a numeric column in a DataFrame, or the percentiles of a Series in pandas, the easiest way is to use the pandas quantile() function. You can also use the numpy percentile() function.
To find the percentile of a value relative to an array (or in your case a DataFrame column), use the scipy function stats.percentileofscore(). Note that there is a third parameter to the stats.percentileofscore() function that has a significant impact on the resulting value of the percentile, viz.
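To illustrate how much that parameter matters, here is a minimal sketch. In the SciPy versions I am aware of, the third parameter is named kind and accepts "rank" (the default), "weak", "strict" and "mean". The sample data below is made up for illustration.

```python
from scipy import stats

# Made-up sample with a tie, so the different kinds disagree.
scores = [1, 2, 2, 3, 4]

# Percentile of the value 2 under each tie-handling rule.
results = {
    kind: stats.percentileofscore(scores, 2, kind=kind)
    for kind in ("rank", "weak", "strict", "mean")
}

for kind, pct in results.items():
    print(kind, pct)
```

With ties in the data, "strict" counts only values below the score, "weak" counts values below or equal, and "rank" and "mean" fall in between, so it is worth choosing the kind deliberately.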
Pandas DataFrame quantile() method: the quantile() method calculates the quantile of the values along a given axis. The default is axis=0 (the index axis), which returns one quantile per column. By specifying the column axis (axis='columns'), the quantile() method computes across the columns instead and returns one quantile value for each row.
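A small sketch of the axis behaviour, assuming a made-up two-column DataFrame (the names df, a and b are only for illustration):

```python
import pandas as pd

# Made-up data for illustration.
df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [10, 20, 30, 40]})

# Default axis=0: one quantile per column.
per_column = df.quantile(0.25)

# axis="columns": one quantile per row, computed across a and b.
per_row = df.quantile(0.25, axis="columns")

print(per_column)
print(per_row)
```

Note that quantile() answers "which value sits at this percentile?", which is the inverse of the question asked here ("which percentile does this value sit at?").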
It seems like you want Series.rank():
x['pcta'] = x['a'].rank(pct=True)  # fraction between 0 and 1; multiply by 100 for a percentage
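As a quick sanity check, the sketch below compares the two approaches on made-up data. It assumes the default settings on both sides (kind="rank" for percentileofscore and method="average" for rank), and scales the rank by 100 because percentileofscore reports values from 0 to 100.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Made-up data, including a tie (the value 1 appears twice).
x = pd.DataFrame({"a": [3, 1, 4, 1, 5, 9, 2, 6]})

# Original approach: one percentileofscore call per element.
loop_pct = [stats.percentileofscore(x["a"].values, v) for v in x["a"].values]

# Vectorised approach: percentile rank, rescaled from 0-1 to 0-100.
rank_pct = x["a"].rank(pct=True) * 100

same = np.allclose(loop_pct, rank_pct)
print(same)
```

If you need a different tie-handling rule, the method argument of rank() ("min", "max", and so on) would have to be chosen to match the kind you would otherwise pass to percentileofscore.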
Performance:
import scipy.stats as scs
%timeit [scs.percentileofscore(x["a"].values, i) for i in x["a"].values]
1000 loops, best of 3: 877 µs per loop
%timeit x.rank(pct=True)
10000 loops, best of 3: 107 µs per loop