Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Compute area under density estimation curve, i.e., probability

I have a density estimate (using density function) for my data learningTime (see figure below), and I need to find probability Pr(learningTime > c), i.e., the the area under density curve from a given number c (the red vertical line) to the end of curve. Any idea?

enter image description here

like image 243
Eric Avatar asked Nov 28 '16 18:11

Eric


People also ask

What is the area under a probability density curve equal to?

A density curve is a graph that shows probability. The area under the curve is equal to 100 percent of all probabilities. As we usually use decimals in probabilities you can also say that the area is equal to 1 (because 100% as a decimal is 1). The above density curve is a graph of how body weights are distributed.

How do you find the probability of a probability density plot?

Therefore, probability is simply the multiplication between probability density values (Y-axis) and tips amount (X-axis). The multiplication is done on each evaluation point and these multiplied values will then be summed up to calculate the final probability.

How do you find q1 and q3 on a density curve?

The quartiles divide the area under the curve into quarters. 1/4 of the area is to the left of Q1 and 3/4 of the are is to the left of Q3. You can roughly locate the median and quartiles of any density curve by eye by dividing the area under the curve into four equal parts.


2 Answers

Computing areas under a density estimation curve is not a difficult job. Here is a reproducible example.

Suppose we have some observed data x that are, for simplicity, normally distributed:

set.seed(0)
x <- rnorm(1000)

We perform a density estimation (with some customization, see ?density):

d <- density.default(x, n = 512, cut = 3)
str(d)
#    List of 7
# $ x        : num [1:512] -3.91 -3.9 -3.88 -3.87 -3.85 ...
# $ y        : num [1:512] 2.23e-05 2.74e-05 3.35e-05 4.07e-05 4.93e-05 ...
# ... truncated ...

We want to compute the area under the curve to the right of x = 1:

plot(d); abline(v = 1, col = 2)

Mathematically this is an numerical integration of the estimated density curve on [1, Inf].

The estimated density curve is stored in discrete format in d$x and d$y:

xx <- d$x  ## 512 evenly spaced points on [min(x) - 3 * d$bw, max(x) + 3 * d$bw]
dx <- xx[2L] - xx[1L]  ## spacing / bin size
yy <- d$y  ## 512 density values for `xx`

There are two methods for the numerical integration.

method 1: Riemann Sum

The area under the estimated density curve is:

C <- sum(yy) * dx  ## sum(yy * dx)
# [1] 1.000976

Since Riemann Sum is only an approximation, this deviates from 1 (total probability) a little bit. We call this C value a "normalizing constant".

Numerical integration on [1, Inf] can be approximated by

p.unscaled <- sum(yy[xx >= 1]) * dx
# [1] 0.1691366

which should be further scaled it by C for a proper probability estimation:

p.scaled <- p.unscaled / C
# [1] 0.1689718

Since the true density of our simulated x is know, we can compare this estimate with the true value:

pnorm(x0, lower.tail = FALSE)
# [1] 0.1586553

which is fairly close.

method 2: trapezoidal rule

We do a linear interpolation of (xx, yy) and apply numerical integration on this linear interpolant.

f <- approxfun(xx, yy)
C <- integrate(f, min(xx), max(xx))$value
p.unscaled <- integrate(f, 1, max(xx))$value
p.scaled <- p.unscaled / C
#[1] 0.1687369

Regarding Robin's answer

The answer is legitimate but probably cheating. OP's question starts with a density estimation but the answer bypasses it altogether. If this is allowed, why not simply do the following?

set.seed(0)
x <- rnorm(1000)
mean(x > 1)
#[1] 0.163
like image 147
Zheyuan Li Avatar answered Sep 20 '22 20:09

Zheyuan Li


The Empirical Cumulative Distribution Function ecdf() in base R makes it very easy. Using 李哲源's example...

#Reproducible sample data 
set.seed(0)
x <- rnorm(1000)

#Create empirical cumulative distribution function from sample data
d_fun <- ecdf (x)

#Assume a value for the "red vertical line"
x0 <- 1

#Area under curve less than, equal to x0
d_fun(x0) 
# [1] 0.837

#Area under curve greater than x0
1 - d_fun(x0)
# [1] 0.163

Regarding 李哲源's response to my answer. Their answer assumes you only have the density estimate curve. My answer assumes you have the original data, which is applicable to the the OP's question since they used density() to get the density estimate curve.

like image 38
Robin Avatar answered Sep 16 '22 20:09

Robin