Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to find quantiles of an empirical cumulative density function (ECDF)

I am using ecdf() function to calculate empirical cumulative density function (ECDF) from some random samples:

set.seed(0)
X = rnorm(100)
P = ecdf(X)

Now P gives ECDF and we could plot it:

plot(P)
abline(h = 0.6, lty = 3)

ecdf

My question is: how can I find the sample value x, such that P(x) = 0.6, i.e., the 0.6-quantile of ECDF, or the x-coordinate of the intersection point of ECDF and h = 0.6?

like image 423
Yang Yang Avatar asked Jan 06 '23 16:01

Yang Yang


1 Answers

In the following, I will not use ecdf(), as it is easy to obtain empirical cumulative density function (ECDF) ourselves.

First, we sort samples X in ascending order:

X <- sort(X)

ECDF at those samples takes function values:

e_cdf <- 1:length(X) / length(X)

We could then sketch ECDF by:

plot(X, e_cdf, type = "s")
abline(h = 0.6, lty = 3)

enter image description here

Now, we are looking for the first value of X, such that P(X) >= 0.6. This is just:

X[which(e_cdf >= 0.6)[1]]
# [1] 0.2290196

Since our data are sampled from standard normal distribution, the theoretical quantile is

qnorm(0.6)
# [1] 0.2533471

So our result is quite close.


Extension

Since the inverse of CDF is quantile function (for example, the inverse of pnorm() is qnorm()), one may guess the inverse of ECDF as sample quantile, i,e, the inverse ecdf() is quantile(). This is not true!

ECDF is a staircase / step function, and it does not have inverse. If we rotate ECDF around y = x, the resulting curve is not a mathematical function. So sample quantile is has nothing to do with ECDF.

For n sorted samples, sample quantile function is in fact a linear interpolation function of (x, y), with:

  • x-values being seq(0, 1, length = n);
  • y-values being sorted samples.

We can define our own version of sample quantile function by:

my_quantile <- function(x, prob) {
  if (is.unsorted(x)) x <- sort(x)
  n <- length(x)
  approx(seq(0, 1, length = n), x, prob)$y
  }

Let's have a test:

my_quantile(X, 0.6)
# [1] 0.2343171

quantile(X, prob = 0.6, names = FALSE)
# [1] 0.2343171

Note that result is different from what we get from X[which(e_cdf >= 0.6)[1]].

It is for this reason, that I refuse to use quantile() in my answer.

like image 134
Zheyuan Li Avatar answered Jan 21 '23 19:01

Zheyuan Li