Compute area under density estimation curve, i.e., probability

Tags: r, probability, density-plot, kernel-density, probability-density

I have a density estimate (using the density function) for my data learningTime (see figure below), and I need to find the probability Pr(learningTime > c), i.e., the area under the density curve from a given number c (the red vertical line) to the end of the curve. Any ideas?

[Figure: density estimate of learningTime, with a red vertical line at c. Image: https://i.stack.imgur.com/9YrBE.jpg]

asked Nov 28 '16 18:11

Eric

2 Answers

Computing areas under a density estimation curve is not a difficult job. Here is a reproducible example.

Suppose we have some observed data x that are, for simplicity, normally distributed:

set.seed(0)
x <- rnorm(1000)

We perform a density estimation (with some customization, see ?density):

d <- density.default(x, n = 512, cut = 3)
str(d)
#    List of 7
# $ x        : num [1:512] -3.91 -3.9 -3.88 -3.87 -3.85 ...
# $ y        : num [1:512] 2.23e-05 2.74e-05 3.35e-05 4.07e-05 4.93e-05 ...
# ... truncated ...

We want to compute the area under the curve to the right of x = 1:

plot(d); abline(v = 1, col = 2)

Mathematically this is a numerical integration of the estimated density curve on [1, Inf].

The estimated density curve is stored in discrete format in d$x and d$y:

xx <- d$x  ## 512 evenly spaced points on [min(x) - 3 * d$bw, max(x) + 3 * d$bw]
dx <- xx[2L] - xx[1L]  ## spacing / bin size
yy <- d$y  ## 512 density values for `xx`
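
The claims in these comments can be checked directly. The following sketch regenerates the same `x` and `d` so that it runs on its own, and confirms that the grid is evenly spaced and extends `cut = 3` bandwidths beyond the sample range (the names `evenly.spaced` and `range.ok` are only illustrative):

```r
set.seed(0)
x <- rnorm(1000)
d <- density(x, n = 512, cut = 3)

xx <- d$x
dx <- xx[2L] - xx[1L]

## every gap between neighbouring grid points equals `dx` (up to rounding)
evenly.spaced <- isTRUE(all.equal(diff(xx), rep(dx, length(xx) - 1L)))

## the grid starts and ends 3 bandwidths beyond the range of the data
expected.range <- range(x) + c(-3, 3) * d$bw
range.ok <- isTRUE(all.equal(range(xx), expected.range))
```

Both flags should be TRUE. The even spacing is what justifies using a single `dx` in the sums that follow.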

There are two methods for the numerical integration.

Method 1: Riemann sum

The area under the estimated density curve is:

C <- sum(yy) * dx  ## sum(yy * dx)
# [1] 1.000976

Since a Riemann sum is only an approximation, this deviates slightly from 1 (the total probability). We call this value C a "normalizing constant".

Numerical integration on [1, Inf] can be approximated by

p.unscaled <- sum(yy[xx >= 1]) * dx
# [1] 0.1691366

which should be further scaled by C to give a proper probability estimate:

p.scaled <- p.unscaled / C
# [1] 0.1689718
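
The steps above can be wrapped into one function. This is a minimal sketch rather than a definitive implementation: `tail_prob` and its argument `cutoff` are hypothetical names (not part of base R), and it assumes the evenly spaced grid that `density()` returns.

```r
## Hypothetical helper: Pr(X > cutoff) from a "density" object,
## using a Riemann sum rescaled by the normalizing constant.
tail_prob <- function(d, cutoff) {
  stopifnot(inherits(d, "density"))
  dx <- d$x[2L] - d$x[1L]          ## grid spacing (assumed constant)
  C <- sum(d$y) * dx               ## total area, close to but not exactly 1
  sum(d$y[d$x >= cutoff]) * dx / C ## rescaled area to the right of `cutoff`
}

set.seed(0)
x <- rnorm(1000)
d <- density(x, n = 512, cut = 3)
p <- tail_prob(d, 1)
```

For the data in the question, the call would be `tail_prob(density(learningTime), c)`, with `c` being the position of the red vertical line.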

Since the true density of our simulated x is known, we can compare this estimate with the true value:

pnorm(1, lower.tail = FALSE)
# [1] 0.1586553

which is fairly close.
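
As an additional cross-check, which is not one of the two methods in this answer, the tail area of a Gaussian kernel density estimate has a closed form: the estimate is an equal-weight mixture of normal densities centred at the observations, with standard deviation equal to the bandwidth. The sketch below assumes the default `kernel = "gaussian"`. Because `density()` evaluates the curve through a binned approximation on a grid, the two figures should agree closely but not exactly.

```r
set.seed(0)
x <- rnorm(1000)
d <- density(x, n = 512, cut = 3)

## exact tail area of the Gaussian KDE: average of normal tail probabilities,
## one per observation, each with sd equal to the bandwidth
p.exact <- mean(pnorm(1, mean = x, sd = d$bw, lower.tail = FALSE))

## rescaled Riemann-sum estimate from the gridded curve, for comparison
## (the grid spacing cancels in the ratio)
p.riemann <- sum(d$y[d$x >= 1]) / sum(d$y)
```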

Method 2: trapezoidal rule

We do a linear interpolation of (xx, yy) and apply numerical integration on this linear interpolant.

f <- approxfun(xx, yy)
C <- integrate(f, min(xx), max(xx))$value
p.unscaled <- integrate(f, 1, max(xx))$value
p.scaled <- p.unscaled / C
#[1] 0.1687369
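
Note that `integrate()` applies adaptive quadrature to the interpolant rather than the trapezoidal rule as such. Because `f` is piecewise linear, the composite trapezoidal rule on the grid integrates it exactly, so the same computation can also be written out by hand. A minimal sketch, in which `trapz` is a hypothetical helper name and the curve value at 1 is obtained with `approx()` because 1 is generally not a grid point:

```r
set.seed(0)
x <- rnorm(1000)
d <- density(x, n = 512, cut = 3)
xx <- d$x
yy <- d$y

## composite trapezoidal rule for points (x, y) with increasing x
trapz <- function(x, y) sum(diff(x) * (head(y, -1L) + tail(y, -1L)) / 2)

## normalizing constant: area under the whole interpolated curve
C <- trapz(xx, yy)

## area to the right of 1: prepend the interpolated point at exactly 1
keep <- xx >= 1
x.tail <- c(1, xx[keep])
y.tail <- c(approx(xx, yy, xout = 1)$y, yy[keep])
p.trapz <- trapz(x.tail, y.tail) / C
```

The result should match the `integrate()` figure above up to the tolerance of the quadrature.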

Regarding Robin's answer

The answer is legitimate but probably cheating. OP's question starts with a density estimation but the answer bypasses it altogether. If this is allowed, why not simply do the following?

set.seed(0)
x <- rnorm(1000)
mean(x > 1)
#[1] 0.163

answered Sep 20 '22 20:09

Zheyuan Li

The Empirical Cumulative Distribution Function ecdf() in base R makes it very easy. Using 李哲源's example...

# Reproducible sample data
set.seed(0)
x <- rnorm(1000)

# Create the empirical cumulative distribution function from the sample data
d_fun <- ecdf(x)

# Assume a value for the "red vertical line"
x0 <- 1

# Proportion of observations less than or equal to x0, i.e. Pr(X <= x0)
d_fun(x0)
# [1] 0.837

# Proportion of observations greater than x0, i.e. Pr(X > x0)
1 - d_fun(x0)
# [1] 0.163

Regarding 李哲源's response to my answer: their answer assumes you only have the density estimate curve. My answer assumes you have the original data, which applies to the OP's question since they used density() to get the density estimate curve.
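
For reference, the ECDF-based tail probability is the same quantity as the sample proportion `mean(x > x0)` shown in the other answer, since `ecdf(x)(x0)` is by definition the fraction of observations less than or equal to `x0`. A quick check on the same simulated data (the variable names `p.ecdf`, `p.prop` and `same` are only illustrative):

```r
set.seed(0)
x <- rnorm(1000)
x0 <- 1

p.ecdf <- 1 - ecdf(x)(x0)  ## tail probability from the empirical CDF
p.prop <- mean(x > x0)     ## proportion of observations above x0

## the two agree up to floating-point rounding
same <- isTRUE(all.equal(p.ecdf, p.prop))
```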

answered Sep 16 '22 20:09

Robin
