Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Calculate Percentile using LINQ

All,

Having reviewed StackOverflow and the wider internet, I am still struggling to efficiently calculate Percentiles using LINQ.

Where a percentile is a measure used in statistics indicating the value below which a given percentage of observations in a group of observations fall. The below example attempts to convert a list of values, to an array where each (unique) value is represented with is associated percentile. The min() and max() of the list are necessarily the 0% and 100% of the returned array percentiles.

Using LINQPad, the below code generates the required output a VP[]:

enter image description here

This can be interpreted as: - At 0% the minimum value is 1 - At 100% the maximum value is 3 - At 50% between the minimum and maximum the value is 2

void Main()
{
    var list = new List<double> {1,2,3};
    double denominator = list.Count - 1;   
    var answer = list.Select(x => new VP
        {
            Value = x,
            Percentile = list.Count(y => x > y) / denominator
        })
        //.GroupBy(grp => grp.Value) --> commented out until attempted duplicate solution 
        .ToArray();
    answer.Dump();
}

public struct VP
{
    public double Value;
    public double Percentile;
}

However, this returns an incorrect VP[] when the "list" contains duplicate entries (e.g. 1,2,**2,**3) :

enter image description here

My attempts to group by unique values in the list (by including ".GroupBy(grp => grp.Value)") have failed to yield the desired result (Value =2, & Percentile = 0.666) :

enter image description here

All suggestions are welcome. Including whether this is an efficient approach given the repeated iteration with "list.Count(y => x > y)".

As always, thanks Shannon

like image 366
shansen Avatar asked Nov 10 '22 21:11

shansen


1 Answers

I'm not sure I understand the requirements of this question. When I ran the accepted answer's code I got this result:

original result

But if I change the input to this:

var dataSet = new List<double> { 1, 1, 1, 1, 2, 3, 3, 3, 2 };

...I then get this result:

updated result

With the line "The min() and max() of the list are necessarily the 0% and 100% of the returned array percentiles." it seems to me the OP is asking for the values to be from 0 to 1, but the updated result goes beyond 1.

It also seems wrong to me that the first value should be 0% as I'm not sure what that means in context to the data.

After reading the linked Wikipedia page it seems that the OP is actually trying to do the reverse calculation to computing the percentile value. In fact the article says that the percentile for 0 is undefined. That makes sense because a percentile of 0 would be the empty set of values - and what is the maximum value of an empty set?

The OP seems to be computing the percentile from the values. So, in that sense, and knowing that 0 is undefined, it seems that the most appropriate value to compute is the percentage of values that are equal to or below each distinct value in the set.

Now, if I use the Microsoft's Reactive Framework Team's Interactive Extensions (NuGet "Ix-Main") then I can run this code:

var dataSet = new List<double> { 1, 1, 1, 1, 2, 3, 3, 3, 2 };

var result =
    dataSet
        .GroupBy(x => x)
        .Scan(
            new VP()
            {
                Value = double.MinValue, Proportion = 0.0
            },
            (a, x) =>
                new VP()
                {
                    Value = x.Key,
                    Proportion = a.Proportion + (double)x.Count() / dataSet.Count
                });

I get this result:

result

This tells me that approximately 44% of the values are 1; that approximately 67% of the values are 1 or 2; and 100% of the values are either 1, 2, or 3.

This seems to me to be the most logical computation for the requirements.

like image 107
Enigmativity Avatar answered Nov 14 '22 23:11

Enigmativity