How to calculate a percentile ranking of a column of data relative to another column using python

Tags: python, pandas, percentile, quantile

I have two columns of data representing the same quantity; one column is from my training data, the other is from my validation data.

I know how to calculate the percentile rankings of the training data efficiently using:

pandas.DataFrame(training_data).rank(pct = True).values

My question is, how can I efficiently get a similar set of percentile rankings of the validation data column relative to the training data column? That is, for each value in the validation data column, how can I find what its percentile ranking would be relative to all the values in the training data column?

I've tried doing this:

import numpy as np
import scipy.stats

def percentrank(input_data, comparison_data):
    rescaled_data = np.zeros(input_data.size)
    for idx, datum in enumerate(input_data):
        # percentileofscore returns a value between 0 and 100
        rescaled_data[idx] = scipy.stats.percentileofscore(comparison_data, datum)
    return rescaled_data / 100

But I'm not sure if this is even correct, and on top of that it's incredibly slow because each call in the for loop rescans the whole comparison array, so a lot of the work is redundant.

Any help would be greatly appreciated!


asked Mar 31 '17 16:03

Doodles

1 Answer

Here's a solution. Sort the training data. Then use searchsorted on the validation data.
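To see what searchsorted gives you before applying it to the real data, here is a tiny illustration on made-up numbers (the array contents and variable names are only for demonstration). For a sorted array it returns, for each query, the index where the query would be inserted, which is the same as a count of how many sorted values fall below it (side='left') or at or below it (side='right').

```python
import numpy as np

# Small, already-sorted "training" array with a tie at 20
sorted_train = np.array([10.0, 20.0, 20.0, 30.0, 40.0])
queries = np.array([5.0, 20.0, 25.0, 45.0])

# side='left': number of training values strictly below each query
idx_left = np.searchsorted(sorted_train, queries, side='left')
# side='right': number of training values less than or equal to each query
idx_right = np.searchsorted(sorted_train, queries, side='right')

print(idx_left)   # [0 1 3 5]
print(idx_right)  # [0 3 3 5]
```

The two variants only differ when a query ties with a training value, so for continuous data the choice rarely matters.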

import pandas as pd
import numpy as np

# Generate Dummy Data
df_train = pd.DataFrame({'Values': 1000*np.random.rand(15712)})

# Sort data
df_train = df_train.sort_values('Values')

# Rank and Rank_Pct are calculated for demo purposes only;
# the ranking of the validation data below does not depend on them
df_train['Rank'] = df_train.Values.rank()
df_train['Rank_Pct'] = df_train.Values.rank(pct=True)

# Demonstrate how Rank Percentile is calculated
# This gives the same value as .rank(pct=True)
pct_increment = 1./len(df_train)
df_train['Rank_Pct_Manual'] = df_train.Rank*pct_increment

df_train.head()

       Values  Rank  Rank_Pct  Rank_Pct_Manual
2724  0.006174   1.0  0.000064         0.000064
3582  0.016264   2.0  0.000127         0.000127
5534  0.095691   3.0  0.000191         0.000191
944   0.141442   4.0  0.000255         0.000255
7566  0.161766   5.0  0.000318         0.000318

Now use searchsorted to get the Rank_Pct of the validation data:

# Generate Dummy Validation Data
df_validation = pd.DataFrame({'Values': 1000*np.random.rand(1000)})

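# Note: searchsorted returns the array index at which each value would be inserted.
# In a sorted list, the rank is the same as the array index + 1
# (so a value larger than every training value gets a Rank_Pct slightly above 1)
df_validation['Rank_Pct'] = (1 + df_train.Values.searchsorted(df_validation.Values)) * pct_increment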

Here are the first few lines of the final df_validation dataframe:

print(df_validation.head())
      Values  Rank_Pct
0  307.378334  0.304290
1  744.247034  0.744208
2  669.223821  0.670825
3  149.797030  0.145621
4  317.742713  0.314218
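If you want this as a reusable function, here is a minimal sketch in plain NumPy. The name percent_rank_relative is made up for this example, and it is a variant rather than the exact formula above: it uses side='right' and no +1, so the result is the fraction of training values less than or equal to each validation value and always stays between 0 and 1. As far as I can tell this corresponds to scipy.stats.percentileofscore with kind='weak', divided by 100. The brute-force comparison at the end is only there as a sanity check.

```python
import numpy as np

def percent_rank_relative(validation, training):
    """Fraction of `training` values <= each `validation` value (hypothetical helper)."""
    sorted_train = np.sort(np.asarray(training, dtype=float))
    # side='right' counts training values less than or equal to each query
    counts = np.searchsorted(sorted_train, np.asarray(validation, dtype=float), side='right')
    return counts / sorted_train.size

# Dummy data of the same sizes as in the example above
rng = np.random.default_rng(0)
training = 1000 * rng.random(15712)
validation = 1000 * rng.random(1000)

fast = percent_rank_relative(validation, training)

# Brute-force check on the same data: compare every pair directly
slow = (training[None, :] <= validation[:, None]).mean(axis=1)
print(np.allclose(fast, slow))  # True
```

Sorting once costs O(n log n) and each lookup is a binary search, so the redundant work from the original for loop goes away.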


answered Nov 15 '22 16:11

B. Shieh
