I found codes to calculate the sum of the area under a density curve in R. Unfortunately, I don't understand why there is always an extra ~"0.000976" at the area...
nb.data = 500000
y = rnorm(nb.data,10,2)
de = density(y)
require(zoo)
sum(diff(de$x[order(de$x)])*rollmean(de$y[order(de$x)],2))
[1] 1.000976
Why is that so?
It should be equal to 1, right?
The area under a density curve represents probability. The area under a density curve = 1. These two rules go hand in hand because probability has a range of 0 (impossible) to 1 (certain). Hence, the total area under a density curve, which represents probability, must equal 1.
A density curve is a curve that is always on or above the horizontal axis, and has area exactly 1 underneath it. When considering a specific data point, there is area to the left and area to the right. A NORMAL curve is one that mimics a symmetric histogram and the mean and median are EQUAL.
Areas under a density curve represent proportions of the total number of observations. The median is the point with half the observations on either side. So the median of a density curve is the equal-areas point—the point with half the area under the curve to its left and the remaining half of the area to its right.
Density curves have the following properties: The area under the curve always adds up to 100%. The curve will never dip below the x-axis.
That's calculus. Use higher n
(default is 512) for more accurate result
set.seed(42)
de = density(rnorm(500000, 10, 2))
sum(diff(sort(de$x)) * 0.5 * (de$y[-1] + head(de$y, -1)))
#[1] 1.00098
set.seed(42)
de = density(rnorm(500000, 10, 2), n = 1000)
sum(diff(sort(de$x)) * 0.5 * (de$y[-1] + head(de$y, -1)))
#[1] 1.000491
set.seed(42)
de = density(rnorm(500000, 10, 2), n = 10000)
sum(diff(sort(de$x)) * 0.5 * (de$y[-1] + head(de$y, -1)))
#[1] 1.000031
set.seed(42)
de = density(rnorm(500000, 10, 2), n = 100000)
sum(diff(sort(de$x)) * 0.5 * (de$y[-1] + head(de$y, -1)))
#[1] 1.000004
set.seed(42)
de = density(rnorm(500000, 10, 2), n = 1000000)
sum(diff(sort(de$x)) * 0.5 * (de$y[-1] + head(de$y, -1)))
#[1] 1
This discrepancy is not just due to rounding errors or floating-point arithmetic. You are effectively interpolating linearly between the points computed by density
and then computing the area under this approximation to the original function (i.e. you are integrating the curve using the trapzoidal rule), which means that you are overestimating the area in regions of the curve that are concave up and underestimating it in regions that are concave down. Here's an example image from the Wikipedia article demonstrating the systematic error:
Image by Intégration_num_trapèzes.svg: Scalerderivative work: Cdang (talk) - Intégration_num_trapèzes.svg, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=8541370
Since the normal distribution has more concave up areas (i.e. both tails), the overall estimate is too high. As mentioned in another answer, using a higher resolution (i.e. increasing N) helps to minimize the error. You might also get better results using a different method for numerical integration (e.g. Simpson's rule).
However, there is no numerical integration method that is going to give you an exact answer, and even if there was, the return value of density
is only an approximation of the real distribution anyway. (And for real data, the true distribution is unknown.)
If all you want is the satisfaction of seeing a known density function integrating to 1, you can use integrate
on the normal density function:
> integrate(dnorm, lower=-Inf, upper=Inf, mean=10, sd=2)
1 with absolute error < 4.9e-06
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With