Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why is the sum of the area under density curve always greater than 1 (R)?

Tags:

r

I found codes to calculate the sum of the area under a density curve in R. Unfortunately, I don't understand why there is always an extra ~"0.000976" at the area...

nb.data = 500000
y = rnorm(nb.data,10,2)

de = density(y)

require(zoo)
sum(diff(de$x[order(de$x)])*rollmean(de$y[order(de$x)],2))

[1] 1.000976

Why is that so?

It should be equal to 1, right?

like image 748
M. Beausoleil Avatar asked Aug 15 '17 21:08

M. Beausoleil


People also ask

Why is the area under a density curve always 1?

The area under a density curve represents probability. The area under a density curve = 1. These two rules go hand in hand because probability has a range of 0 (impossible) to 1 (certain). Hence, the total area under a density curve, which represents probability, must equal 1.

Do all density curves have area 1 underneath?

A density curve is a curve that is always on or above the horizontal axis, and has area exactly 1 underneath it. When considering a specific data point, there is area to the left and area to the right. A NORMAL curve is one that mimics a symmetric histogram and the mean and median are EQUAL.

What does the area under a density curve mean?

Areas under a density curve represent proportions of the total number of observations. The median is the point with half the observations on either side. So the median of a density curve is the equal-areas point—the point with half the area under the curve to its left and the remaining half of the area to its right.

What are the two properties of a density curve?

Density curves have the following properties: The area under the curve always adds up to 100%. The curve will never dip below the x-axis.


2 Answers

That's calculus. Use higher n (default is 512) for more accurate result

set.seed(42)
de = density(rnorm(500000, 10, 2))
sum(diff(sort(de$x)) * 0.5 * (de$y[-1] + head(de$y, -1)))
#[1] 1.00098

set.seed(42)
de = density(rnorm(500000, 10, 2), n = 1000)
sum(diff(sort(de$x)) * 0.5 * (de$y[-1] + head(de$y, -1)))
#[1] 1.000491

set.seed(42)
de = density(rnorm(500000, 10, 2), n = 10000)
sum(diff(sort(de$x)) * 0.5 * (de$y[-1] + head(de$y, -1)))
#[1] 1.000031

set.seed(42)
de = density(rnorm(500000, 10, 2), n = 100000)
sum(diff(sort(de$x)) * 0.5 * (de$y[-1] + head(de$y, -1)))
#[1] 1.000004

set.seed(42)
de = density(rnorm(500000, 10, 2), n = 1000000)
sum(diff(sort(de$x)) * 0.5 * (de$y[-1] + head(de$y, -1)))
#[1] 1
like image 172
d.b Avatar answered Oct 01 '22 19:10

d.b


This discrepancy is not just due to rounding errors or floating-point arithmetic. You are effectively interpolating linearly between the points computed by density and then computing the area under this approximation to the original function (i.e. you are integrating the curve using the trapzoidal rule), which means that you are overestimating the area in regions of the curve that are concave up and underestimating it in regions that are concave down. Here's an example image from the Wikipedia article demonstrating the systematic error:


Trapezoidal rule illustration

Image by Intégration_num_trapèzes.svg: Scalerderivative work: Cdang (talk) - Intégration_num_trapèzes.svg, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=8541370


Since the normal distribution has more concave up areas (i.e. both tails), the overall estimate is too high. As mentioned in another answer, using a higher resolution (i.e. increasing N) helps to minimize the error. You might also get better results using a different method for numerical integration (e.g. Simpson's rule).

However, there is no numerical integration method that is going to give you an exact answer, and even if there was, the return value of density is only an approximation of the real distribution anyway. (And for real data, the true distribution is unknown.)

If all you want is the satisfaction of seeing a known density function integrating to 1, you can use integrate on the normal density function:

> integrate(dnorm, lower=-Inf, upper=Inf, mean=10, sd=2)
1 with absolute error < 4.9e-06
like image 39
Ryan C. Thompson Avatar answered Oct 01 '22 19:10

Ryan C. Thompson