Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Java Apache Commons getPercentile() different result that MS Excel percentile

I have an algorithm that calculates the percentile(85) with Apache Commons of a series of values (12 values), for a later evaluation with a threshold to make a decision. The result is similar to the one given by Excel, but not equal, and sometimes this is critical for my application because with excel the result doesn't pass the threshold and with Apache Commons Math in Java it does, so I get different outputs.

Here it is an example: Internet traffic (Mbps) every 2 hours

32,7076813360000000 41,2580429776000000 45,4453940200000000 48,8044409456000000 46,7462847936000000 49,8028100056000000 54,3719451144000000 41,9708134600000000 29,4371963240000000 22,4667255616000000 20,0388452248000000 28,7807757104000000

After dividing by 1000 Mb (the capacity of the cable) I calculate the percentil(85) of the Occupation:

Excel: 0,049153870117

Apache Commons Math: 0.05003126676104001

I have found that it is possible to change the implementation of the percentile (it does not exist an official one) with setPercentileImpl(), but I couldn't find any example of how to do this, or the Excel algorithm (which is the one I was told to achieve).

Any help about this will be welcomed.

Thank you.

like image 969
Jav_Rock Avatar asked May 10 '11 09:05

Jav_Rock


5 Answers

Class org.apache.commons.math3.stat.descriptive.rank.Percentile already supports Excel style interpolation, you just need to enable it with EstimationType.R_7

public class PercentileExcel extends Percentile {
    public PercentileExcel() throws MathIllegalArgumentException {

    super(50.0,
          EstimationType.R_7, // use excel style interpolation
          NaNStrategy.REMOVED,
          new KthSelector(new MedianOf3PivotingStrategy()));
    }
}
like image 113
pbirnie Avatar answered Oct 21 '22 19:10

pbirnie


The following alternative with no new class needed works in 3.6:

DescriptiveStatistics ds = new DescriptiveStatistics();
Percentile p = new Percentile(50.0).withEstimationType(EstimationType.R_7)
                .withNaNStrategy(NaNStrategy.REMOVED)
                .withKthSelector(new KthSelector(new 
                  MedianOf3PivotingStrategy()))
ds.setPercentileImpl(p);
like image 20
Chris94 Avatar answered Oct 21 '22 18:10

Chris94


The difference is subtle and due to assumptions. It is easiest to explain with the 3 element case. Suppose you have three elements(N=3) a=x[0] < b=x[1] < c=x[2]. Both the Apache and the Excel method say that element b is the 50th percentile (the median). However they differ for a and c.

The Apache method (and the method referenced by the NIST page) say a is the 25th percentile and c is the 75% percentile, because it divides the space up into N+1 blocks, that is, into quarters.

The Excel method says that a is the 0th percentile and c the 100th percentile, as the space is divided into N-1 blocks, that is, in half.

Because of this, if you want the Excel method and you don't want to code it yourself, you could just remove the smallest and the largest element from your array, and call the Apache method - it should give you exactly the same result except at percentiles beyond the end points.

If you want to code it yourself, an easy way is given below. Be aware of these issues:

  • this sorts the array (so changes it)
  • this takes O(N log(N)) time due to the sorting. The Apache method uses a fast selection algorithm so takes O(N) time (google "quickselect" if you want to know more)

Code (not tested or even compiled, but should give you an idea).

// warning - modifies data 
double excelPercentile(double [] data, double percentile) { array
    Arrays.sort(data);
    double index = percentile*(data.length-1);
    int lower = (int)Math.floor(index);
    if(lower<0) { // should never happen, but be defensive
       return data[0];
    }
    if(lower>=data.length-1) { // only in 100 percentile case, but be defensive
       return data[data.length-1);
    }
    double fraction = index-lower;
    // linear interpolation
    double result=data[lower] + fraction*(data[lower+1]-data[lower]);
    return result;
 }
like image 21
Nick Fortescue Avatar answered Oct 21 '22 19:10

Nick Fortescue


The solution was creating a class PercentileExcel which is almost a copy of percentile from commons method except for a small change on how to caculate the position:

pos=(1+p*(n-1))/100;

Then you need to add this line to the code in order to use the new class for percentile:

setPercentileImpl(PercentileExcel);
like image 25
Jav_Rock Avatar answered Oct 21 '22 18:10

Jav_Rock


There is no unique definition of a percentile computed from a data set. See the Wikipedia page for the most common definitions in use.

like image 22
Rob Hyndman Avatar answered Oct 21 '22 19:10

Rob Hyndman