
Reliably retrieve the reverse of the quantile function

I have read other posts (such as here) on getting the "reverse" of quantile -- that is, to get the percentile that corresponds to a certain value in a series of values.

However, the answers don't give me the same value as quantile for the same data series.

I have also read that quantile offers 9 different algorithms (selected with its type argument) for computing sample quantiles.

So my question: is there a reliable way to get the reverse of the quantile function? ecdf does not take a "type" argument so it doesn't seem that one can make sure they are using the same method.

Reproducible example:

# Simple data
x = 0:10
pcntile = 0.5


# Get value corresponding to a percentile using quantile
(pcntile_value <- quantile(x, pcntile))     

# 50%    
# 5               # returns 5 as expected for 50% percentile     



# Get percentile corresponding to a value using ecdf function
(pcntile_rev <- ecdf(x)(5))                


# [1] 0.5454545   #returns 54.54% as the percentile for the value 5


# Not the same answer as quantile produces
dave_in_newengland asked Jun 23 '19 13:06


2 Answers

The answer in the link is really good, but it may help to look at ecdf itself. Just run the following code:

# Simple data
x = 0:10
p0 = 0.5

# Get value corresponding to a percentile using quantile, for each of the 9 types
sapply(1:9, function(i) quantile(x, p0, type = i))
# 50% 50% 50% 50% 50% 50% 50% 50% 50% 
# 5.0 5.0 5.0 4.5 5.0 5.0 5.0 5.0 5.0 

Thus, it is not a question of type. You can step into the function using debug:

# Step through ecdf() to see how it computes the percentile for each value
debug(ecdf)
my_ecdf <- ecdf(x)
undebug(ecdf)  # switch debugging off again once you are done

The crucial part is

rval <- approxfun(vals, cumsum(tabulate(match(x, vals)))/n, 
    method = "constant", yleft = 0, yright = 1, f = 0, ties = "ordered")

After this line has run (while still inside the debugger, where vals and n are defined) you can check

data.frame(x = vals, y = round(cumsum(tabulate(match(x, vals)))/n, 3), stringsAsFactors = FALSE)

and since you divide by n = 11, the result is not surprising: 6 of the 11 values are less than or equal to 5, so ecdf(x)(5) is 6/11, about 0.545. As said, for the theory have a look at the other answer.
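The variables vals and n only exist inside ecdf while it is being debugged. A minimal sketch that rebuilds the same table at top level, assuming the same sort and unique steps that ecdf performs internally (the names xs and steps are chosen here for illustration), might look like this:

```r
x <- 0:10

# Recreate the intermediate objects that ecdf() builds internally
xs   <- sort(x)
n    <- length(xs)
vals <- unique(xs)

# Cumulative share of observations at or below each distinct value
steps <- data.frame(x = vals,
                    y = cumsum(tabulate(match(xs, vals))) / n)

steps[steps$x == 5, ]   # the row for the value 5: y is 6/11
```

Each row's y is a count divided by 11, which is why the value 5 maps to 6/11 and not to 0.5.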

By the way, you can also plot the function

plot(my_ecdf)

Concerning your comment: I think it is not a question of reliability but of how to define the "inverse distribution function" when a true inverse does not exist.

[Three images from the original answer are not reproduced here.]

A good reference for generalized inverses: Paul Embrechts, Marius Hofert: "A note on generalized inverses", Math Meth Oper Res (2013) 77:423–432.
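One common definition of the generalized inverse is F^-(p) = inf{t : F(t) >= p}. As a rough sketch of that definition applied to an ecdf object (gen_inverse is a hypothetical helper name, not a base R function):

```r
x  <- 0:10
Fn <- ecdf(x)

# Hypothetical helper: smallest jump point t with Fn(t) >= p
gen_inverse <- function(Fn, p) {
  jumps <- as.numeric(knots(Fn))   # sorted distinct data values
  vapply(p,
         function(p_i) jumps[which(Fn(jumps) >= p_i)[1]],
         numeric(1))
}

gen_inverse(Fn, 0.5)     # 5, the same value quantile(x, 0.5) returns here
gen_inverse(Fn, Fn(5))   # 5, so the round trip through ecdf is consistent
```

?quantile describes type 1 as the inverse of the empirical distribution function, so quantile(x, p, type = 1) should generally agree with this sketch. The helper returns NA for p greater than 1.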

Christoph answered Sep 30 '22 06:09


ecdf returns exactly what the formula in its documentation says.

x <- 0:10
Fn <- ecdf(x)

Now, the object Fn is an interpolating step function.

str(Fn)
#function (v)  
# - attr(*, "class")= chr [1:3] "ecdf" "stepfun" "function"
# - attr(*, "call")= language ecdf(x)

And it keeps the original x values and the corresponding y values.

environment(Fn)$x
# [1]  0  1  2  3  4  5  6  7  8  9 10

environment(Fn)$y
# [1] 0.09090909 0.18181818 0.27272727 0.36363636 0.45454545 0.54545455
# [7] 0.63636364 0.72727273 0.81818182 0.90909091 1.00000000

The latter are exactly the values produced by the formula given in the documentation. From help('ecdf'):

For observations x= (x1,x2, ... xn), Fn is the fraction of
observations less or equal to t, i.e.,

Fn(t) = #{xi <= t}/n = 1/n sum(i=1,n) Indicator(xi <= t).
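As a quick check, the formula can be evaluated by hand for a single point. This is only an illustrative sketch; t0 and manual are names introduced here:

```r
x  <- 0:10
t0 <- 5

# Fraction of observations less than or equal to t0
manual <- sum(x <= t0) / length(x)

manual                                    # 6 of 11 values, about 0.545
isTRUE(all.equal(manual, ecdf(x)(t0)))    # TRUE
```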

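Because x is already sorted and has no ties, the number of observations less than or equal to the k-th value is simply k, so the formula reduces to (1:n)/n. Instead of 1:length(x) I will use seq_along.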

seq_along(x)/length(x)
# [1] 0.09090909 0.18181818 0.27272727 0.36363636 0.45454545 0.54545455
# [7] 0.63636364 0.72727273 0.81818182 0.90909091 1.00000000
Fn(x)
# [1] 0.09090909 0.18181818 0.27272727 0.36363636 0.45454545 0.54545455
# [7] 0.63636364 0.72727273 0.81818182 0.90909091 1.00000000
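To come back to the original question: for the continuous quantile types (4 to 9), ?quantile describes the sample quantile as a linear interpolation between points (p_k, x_k), where p_k depends on the type. Under that description, one way to invert a given type is to interpolate in the other direction. The following is only a sketch: inverse_quantile is a hypothetical helper, not a base R function, and the constants a and b are assumed to match the plotting positions quantile uses for each type.

```r
# Hypothetical helper: percentile of `value` under quantile(type = 4..9)
inverse_quantile <- function(x, value, type = 7) {
  stopifnot(type %in% 4:9)
  # Assumed plotting-position constants (a, b) for types 4 to 9
  ab <- switch(type - 3,
               c(0, 1), c(0.5, 0.5), c(0, 0),
               c(1, 1), c(1/3, 1/3), c(3/8, 3/8))
  xs <- sort(x)
  n  <- length(xs)
  # Probability at which quantile() lands exactly on the k-th order statistic
  p_k <- (seq_len(n) - ab[1]) / (n + 1 - ab[1] - ab[2])
  # Interpolate value -> probability; tied x values share the mean probability
  approx(xs, p_k, xout = value, ties = mean, rule = 2)$y
}

x <- 0:10
inverse_quantile(x, 5, type = 7)   # 0.5, matching quantile(x, 0.5)
inverse_quantile(x, 5, type = 4)   # 6/11, the same as ecdf(x)(5)
```

With the default type 7 the round trip quantile(x, inverse_quantile(x, v)) gives back v for values inside the data range. When x contains ties the inverse is not unique, and averaging the tied probabilities is just one possible convention. The discontinuous types 1 to 3 are step functions and are not covered by this sketch.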
Rui Barradas answered Sep 30 '22 05:09