Map each list value to its corresponding percentile

Tags: python, numpy, scipy, percentile, median

I'd like to create a function that takes a (sorted) list as its argument and outputs a list containing each element's corresponding percentile.

For example, fn([1,2,3,4,17]) returns [0.0, 0.25, 0.50, 0.75, 1.00].

Can anyone please either:

Help me correct my code below? OR
Offer a better alternative than my code for mapping values in a list to their corresponding percentiles?

My current code:

def median(mylist):
    length = len(mylist)
    if not length % 2:
        return (mylist[length / 2] + mylist[length / 2 - 1]) / 2.0
    return mylist[length / 2]

###############################################################################
# PERCENTILE FUNCTION
###############################################################################

def percentile(x):
    """
    Find the corresponding percentile of each value relative to a list of values,
    where x is the list of values.
    Input list should already be sorted!
    """

    # sort the input list
    # list_sorted = x.sort()

    # count the number of elements in the list
    list_elementCount = len(x)

    #obtain set of values from list

    listFromSetFromList = list(set(x))

    # count the number of unique elements in the list
    list_uniqueElementCount = len(set(x))

    # define extreme quantiles
    percentileZero    = min(x)
    percentileHundred = max(x)

    # define median quantile
    mdn = median(x) 

    # create empty list to hold percentiles
    x_percentile = [0.00] * list_elementCount 

    # initialize unique count
    uCount = 0

    for i in range(list_elementCount):
        if x[i] == percentileZero:
            x_percentile[i] = 0.00
        elif x[i] == percentileHundred:
            x_percentile[i] = 1.00
        elif x[i] == mdn:
            x_percentile[i] = 0.50 
        else:
            subList_elementCount = 0
            for j in range(i):
                if x[j] < x[i]:
                    subList_elementCount = subList_elementCount + 1 
            x_percentile[i] = float(subList_elementCount / list_elementCount)
            #x_percentile[i] = float(len(x[x > listFromSetFromList[uCount]]) / list_elementCount)
            if i == 0:
                continue
            else:
                if x[i] == x[i-1]:
                    continue
                else:
                    uCount = uCount + 1
    return x_percentile

Currently, if I submit percentile([1,2,3,4,17]), the list [0.0, 0.0, 0.5, 0.0, 1.0] is returned.

282

asked Sep 13 '12 20:09

Jubbles

8 Answers

I think your example input/output does not correspond to typical ways of calculating percentile. If you calculate the percentile as "proportion of data points strictly less than this value", then the top value should be 0.8 (since 4 of 5 values are less than the largest one). If you calculate it as "proportion of data points less than or equal to this value", then the bottom value should be 0.2 (since 1 of 5 values equals the smallest one). Thus the percentiles would be [0, 0.2, 0.4, 0.6, 0.8] or [0.2, 0.4, 0.6, 0.8, 1]. Your definition seems to be "the number of data points strictly less than this value, considered as a proportion of the number of data points not equal to this value", but in my experience this is not a common definition (see for instance Wikipedia).
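To make the three definitions above concrete, here is a minimal pure-Python sketch run on the list from the question. The function names (strict_fraction, weak_fraction, asker_fraction) are made up for illustration, and the quadratic counting is chosen for clarity rather than speed.

```python
def strict_fraction(data):
    # Proportion of data points strictly less than each value.
    n = len(data)
    return [sum(other < v for other in data) / n for v in data]


def weak_fraction(data):
    # Proportion of data points less than or equal to each value.
    n = len(data)
    return [sum(other <= v for other in data) / n for v in data]


def asker_fraction(data):
    # Points strictly less than the value, as a proportion of the points
    # not equal to it (the definition the question's example implies).
    # Assumes the data holds at least two distinct values; otherwise the
    # denominator is zero.
    result = []
    for v in data:
        less = sum(other < v for other in data)
        not_equal = sum(other != v for other in data)
        result.append(less / not_equal)
    return result


data = [1, 2, 3, 4, 17]
print(strict_fraction(data))  # [0.0, 0.2, 0.4, 0.6, 0.8]
print(weak_fraction(data))    # [0.2, 0.4, 0.6, 0.8, 1.0]
print(asker_fraction(data))   # [0.0, 0.25, 0.5, 0.75, 1.0]
```

Only the third function reproduces the output requested in the question, which is why that output looks unusual next to the two common definitions.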

With the typical percentile definitions, the percentile of a data point is equal to its rank divided by the number of data points. (See for instance this question on Stats SE asking how to do the same thing in R.) Differences in how to compute the percentile amount to differences in how to compute the rank (for instance, how to rank tied values). The scipy.stats.percentileofscore function provides four ways of computing percentiles:

>>> from scipy import stats
>>> x = [1, 1, 2, 2, 17]
>>> [stats.percentileofscore(x, a, 'rank') for a in x]
[30.0, 30.0, 70.0, 70.0, 100.0]
>>> [stats.percentileofscore(x, a, 'weak') for a in x]
[40.0, 40.0, 80.0, 80.0, 100.0]
>>> [stats.percentileofscore(x, a, 'strict') for a in x]
[0.0, 0.0, 40.0, 40.0, 80.0]
>>> [stats.percentileofscore(x, a, 'mean') for a in x]
[20.0, 20.0, 60.0, 60.0, 90.0]

(I used a dataset containing ties to illustrate what happens in such cases.)

The "rank" method assigns tied groups a rank equal to the average of the ranks they would cover (i.e., a three-way tie for 2nd place gets a rank of 3 because it "takes up" ranks 2, 3 and 4). The "weak" method assigns a percentile based on the proportion of data points less than or equal to a given point; "strict" is the same but counts proportion of points strictly less than the given point. The "mean" method is the average of the latter two.
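The four rules described above can be written out by counting how many points are below and equal to a score. This is only a sketch of the definitions, not SciPy's implementation: the function name percentile_of_score is hypothetical, and it assumes the score is itself a member of the data (as it is when mapping a list onto its own percentiles).

```python
def percentile_of_score(data, score, kind="rank"):
    # Assumes `score` occurs in `data`, so `equal` is at least 1.
    n = len(data)
    less = sum(v < score for v in data)
    equal = sum(v == score for v in data)
    if kind == "strict":
        return 100.0 * less / n
    if kind == "weak":
        return 100.0 * (less + equal) / n
    if kind == "mean":
        # Average of the strict and weak results.
        return 100.0 * (2 * less + equal) / (2 * n)
    if kind == "rank":
        # Tied values occupy ranks less+1 .. less+equal; take their average.
        average_rank = less + (equal + 1) / 2
        return 100.0 * average_rank / n
    raise ValueError("unknown kind: %r" % (kind,))


x = [1, 1, 2, 2, 17]
for kind in ("rank", "weak", "strict", "mean"):
    print(kind, [percentile_of_score(x, a, kind) for a in x])
```

On this dataset the sketch gives the same four lists as the percentileofscore session above.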

As Kevin H. Lin noted, calling percentileofscore in a loop is inefficient since it has to recompute the ranks on every pass. However, these percentile calculations can be easily replicated using different ranking methods provided by scipy.stats.rankdata, letting you calculate all the percentiles at once:

>>> from scipy import stats
>>> stats.rankdata(x, "average")/len(x)
array([ 0.3,  0.3,  0.7,  0.7,  1. ])
>>> stats.rankdata(x, 'max')/len(x)
array([ 0.4,  0.4,  0.8,  0.8,  1. ])
>>> (stats.rankdata(x, 'min')-1)/len(x)
array([ 0. ,  0. ,  0.4,  0.4,  0.8])

In the last case the ranks are adjusted down by one to make them start from 0 instead of 1. (I've omitted "mean", but it could easily be obtained by averaging the results of the latter two methods.)

I did some timings. With small data such as that in your example, using rankdata is somewhat slower than Kevin H. Lin's solution (presumably due to the overhead scipy incurs in converting things to numpy arrays under the hood) but faster than calling percentileofscore in a loop as in reptilicus's answer:

In [11]: %timeit [stats.percentileofscore(x, i) for i in x]
1000 loops, best of 3: 414 µs per loop

In [12]: %timeit list_to_percentiles(x)
100000 loops, best of 3: 11.1 µs per loop

In [13]: %timeit stats.rankdata(x, "average")/len(x)
10000 loops, best of 3: 39.3 µs per loop

With a larger dataset (1000 values here), however, the performance advantage of numpy takes effect and using rankdata is about 10 times faster than Kevin's list_to_percentiles:

In [18]: x = np.random.randint(0, 10000, 1000)

In [19]: %timeit [stats.percentileofscore(x, i) for i in x]
1 loops, best of 3: 437 ms per loop

In [20]: %timeit list_to_percentiles(x)
100 loops, best of 3: 1.08 ms per loop

In [21]: %timeit stats.rankdata(x, "average")/len(x)
10000 loops, best of 3: 102 µs per loop

This advantage will only become more pronounced on larger and larger datasets.

182

answered Oct 04 '22 04:10

BrenBarn

I think you want scipy.stats.percentileofscore

Example:

from scipy.stats import percentileofscore

percentileofscore([1, 2, 3, 4], 3)  # 75.0
data = [1, 2, 3, 4, 17]  # the list from the question
percentiles = [percentileofscore(data, i) for i in data]

answered Oct 04 '22 04:10

reptilicus

In terms of complexity, I think reptilicus's answer is not optimal. It takes O(n^2) time.

Here is a solution that takes O(n log n) time.

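def list_to_percentiles(numbers):
    # Pair each value with its original index, then sort the pairs by value.
    # (sorted() is needed in Python 3, where zip() returns an iterator.)
    pairs = sorted(zip(numbers, range(len(numbers))), key=lambda p: p[0])
    result = [0 for i in range(len(numbers))]
    for rank in range(len(numbers)):
        original_index = pairs[rank][1]
        # Rank 0 maps to 0 and the highest rank maps to 100.
        result[original_index] = rank * 100.0 / (len(numbers)-1)
    return result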

I'm not sure, but I think this is the optimal time complexity you can get. The rough reason is that knowing all of the percentiles is essentially equivalent to knowing the sorted order of the list, and a comparison-based sort can't do better than O(n log n).

EDIT: Depending on your definition of "percentile" this may not always give the correct result. See BrenBarn's answer for more explanation and for a better solution which makes use of scipy/numpy.
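For reference, here is one way to stay at O(n log n) while giving tied values the same percentile, without scipy or numpy. It is a sketch under assumptions: the function name percentiles_with_ties is hypothetical, and it returns fractions in [0, 1] using the "strict" and "weak" definitions from BrenBarn's answer rather than the 0 to 100 scaling used above.

```python
from bisect import bisect_left, bisect_right


def percentiles_with_ties(numbers, kind="strict"):
    ordered = sorted(numbers)  # one O(n log n) sort
    n = len(ordered)
    if kind == "strict":
        # bisect_left counts the values strictly less than v.
        return [bisect_left(ordered, v) / n for v in numbers]
    if kind == "weak":
        # bisect_right counts the values less than or equal to v.
        return [bisect_right(ordered, v) / n for v in numbers]
    raise ValueError("unknown kind: %r" % (kind,))


x = [2, 17, 1, 2, 1]  # unsorted input with ties
print(percentiles_with_ties(x, "strict"))  # [0.4, 0.8, 0.0, 0.4, 0.0]
print(percentiles_with_ties(x, "weak"))    # [0.8, 1.0, 0.4, 0.8, 0.4]
```

Each lookup is a binary search, so the n lookups cost O(n log n) in total, the same order as the sort.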

answered Oct 04 '22 04:10

Kevin H. Lin

Pure numpy version of Kevin's solution

As Kevin said, the optimal solution runs in O(n log(n)) time. Here is a fast version of his code in numpy, which takes almost the same time as stats.rankdata:

import numpy
percentiles = numpy.argsort(numpy.argsort(array)) * 100. / (len(array) - 1)

PS. This is one of my favourite tricks in numpy.
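To see why sorting the indices twice yields ranks, here is a pure-Python analogue. The helper argsort is a hypothetical stand-in for numpy.argsort written only for this illustration; like the one-liner, it assumes at least two elements, and in this sketch ties are broken by position.

```python
def argsort(values):
    # Indices that would sort `values`; sorted() is stable, so ties keep
    # their original order.
    return sorted(range(len(values)), key=values.__getitem__)


values = [17, 1, 4, 2, 3]
order = argsort(values)  # order[k] is the index of the k-th smallest value
ranks = argsort(order)   # ranks[i] is the 0-based rank of values[i]
percentiles = [r * 100.0 / (len(values) - 1) for r in ranks]

print(order)        # [1, 3, 4, 2, 0]
print(ranks)        # [4, 0, 3, 1, 2]
print(percentiles)  # [100.0, 0.0, 75.0, 25.0, 50.0]
```

The first argsort is a permutation from rank to index, and argsorting a permutation inverts it, giving index to rank.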

answered Oct 04 '22 05:10

Alleo

This might look oversimplified, but what about this:

def percentile(x):
    pc = float(1)/(len(x)-1)
    return ["%.2f"%(n*pc) for n, i in enumerate(x)]

EDIT:

def percentile(x):
    unique = sorted(set(x))  # sort, because set iteration order is arbitrary
    mapping = {}
    pc = float(1)/(len(unique)-1)
    for n, i in enumerate(unique):
        mapping[i] = "%.2f"%(n*pc)
    return [mapping.get(el) for el in x]

answered Oct 04 '22 05:10

aschmid00

I tried SciPy's percentileofscore, but it turned out to be very slow for one of my tasks, so I simply implemented it this way. It can be modified if a weak ranking is needed.


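import numpy as np

def assign_pct(X):
    mp = {}  # distinct value -> 0-based rank among the distinct values
    X_tmp = np.sort(X)
    pct = []
    cnt = 0
    for v in X_tmp:
        if v in mp:
            continue
        else:
            mp[v] = cnt
            cnt+=1
    for v in X:
        # Divide by (cnt - 1) so the largest value maps to 1.0;
        # this assumes at least two distinct values.
        pct.append(mp[v]/(cnt - 1))
    return pct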

Calling the function

assign_pct([23,4,1,43,1,6])

Output of function

[0.75, 0.25, 0.0, 1.0, 0.0, 0.5]

answered Oct 04 '22 05:10

Abhishek Mungoli

If I understand you correctly, all you want is the percentile each element represents in the array, that is, how much of the array comes before that element. For example, [1, 2, 3, 4, 5] should give [0.0, 0.25, 0.5, 0.75, 1.0].

I believe this code will be enough:

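def percentileListEdited(List):
    # Sort the distinct values so that index order matches value order.
    uniqueList = sorted(set(List))
    increase = 1.0/(len(uniqueList)-1)
    newList = {}
    for index, value in enumerate(uniqueList):
        # Key the mapping by value, since it is looked up by value below.
        newList[value] = 0.0 + increase * index
    return [newList[val] for val in List]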

answered Oct 04 '22 05:10

Mahmoud Aladdin

For me the best solution is to use QuantileTransformer in sklearn.preprocessing.

import numpy as np
from sklearn.preprocessing import QuantileTransformer

# n_quantiles is passed by keyword; it is capped at the number of samples.
fn = lambda input_list : QuantileTransformer(n_quantiles=100).fit_transform(np.array(input_list).reshape([-1,1])).ravel().tolist()
input_raw = [1, 2, 3, 4, 17]
output_perc = fn( input_raw )

print("Input=", input_raw)
print("Output=", np.round(output_perc,2))

Here is the output

Input= [1, 2, 3, 4, 17]
Output= [ 0.    0.25  0.5   0.75  1.  ]

Note: this approach has two salient features:

The raw input data does not need to be sorted.
The raw input data does not need to be a single column.

answered Oct 04 '22 05:10

pitfall
