Why is the sum of the area under density curve always greater than 1 (R)?

Tags:

r

I found code to calculate the area under a density curve in R. Unfortunately, I don't understand why the result always has an extra ~0.000976:

nb.data = 500000
y = rnorm(nb.data,10,2)

de = density(y)

require(zoo)
sum(diff(de$x[order(de$x)])*rollmean(de$y[order(de$x)],2))

[1] 1.000976

Why is that so?

It should be equal to 1, right?

asked Aug 15 '17 21:08

2 Answers

That's calculus. Use a higher n in density (the default is 512) for a more accurate result:

set.seed(42)
de = density(rnorm(500000, 10, 2))
sum(diff(sort(de$x)) * 0.5 * (de$y[-1] + head(de$y, -1)))
#[1] 1.00098

set.seed(42)
de = density(rnorm(500000, 10, 2), n = 1000)
sum(diff(sort(de$x)) * 0.5 * (de$y[-1] + head(de$y, -1)))
#[1] 1.000491

set.seed(42)
de = density(rnorm(500000, 10, 2), n = 10000)
sum(diff(sort(de$x)) * 0.5 * (de$y[-1] + head(de$y, -1)))
#[1] 1.000031

set.seed(42)
de = density(rnorm(500000, 10, 2), n = 100000)
sum(diff(sort(de$x)) * 0.5 * (de$y[-1] + head(de$y, -1)))
#[1] 1.000004

set.seed(42)
de = density(rnorm(500000, 10, 2), n = 1000000)
sum(diff(sort(de$x)) * 0.5 * (de$y[-1] + head(de$y, -1)))
#[1] 1
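The repeated blocks above can also be written as a loop over n. This is only a sketch: `trapz_area` is a hypothetical helper name, and the sample is smaller than in the runs above to keep it fast, so the printed values will not match those runs digit for digit.

```r
# Hypothetical helper: trapezoidal area under a curve given by points (x, y).
trapz_area <- function(x, y) {
  sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)
}

set.seed(42)
y <- rnorm(10000, 10, 2)  # smaller sample than above, to keep the run fast

# Grid sizes to try; density() evaluates the estimate at n points.
ns <- c(512, 2048, 8192, 32768)
areas <- sapply(ns, function(n) {
  de <- density(y, n = n)
  trapz_area(de$x, de$y)
})

# One row per grid size, with the excess over 1.
print(data.frame(n = ns, area = areas, excess = areas - 1))
```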

answered Oct 01 '22 19:10

This discrepancy is not just due to rounding errors or floating-point arithmetic. You are effectively interpolating linearly between the points computed by density and then computing the area under this approximation to the original function (i.e. you are integrating the curve using the trapezoidal rule). This means that you overestimate the area in regions of the curve that are concave up and underestimate it in regions that are concave down. Here's an example image from the Wikipedia article demonstrating the systematic error:

Image by Intégration_num_trapèzes.svg: Scaler, derivative work: Cdang (talk) - Intégration_num_trapèzes.svg, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=8541370
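The sign of the trapezoidal error can be checked on the known normal density, where pnorm gives the exact area. The sketch below assumes a deliberately coarse grid (step 0.5) so the effect is visible; `trapz` is a hypothetical helper, not a base R function.

```r
# Hypothetical helper: trapezoidal rule on a grid of points.
trapz <- function(x, y) sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)

f <- function(x) dnorm(x, mean = 10, sd = 2)

# Within one sd of the mean (8 to 12) the normal density is concave down.
x_mid     <- seq(8, 12, by = 0.5)
trap_mid  <- trapz(x_mid, f(x_mid))
exact_mid <- pnorm(12, 10, 2) - pnorm(8, 10, 2)

# Beyond one sd (here 12 to 16) the density is concave up.
x_tail     <- seq(12, 16, by = 0.5)
trap_tail  <- trapz(x_tail, f(x_tail))
exact_tail <- pnorm(16, 10, 2) - pnorm(12, 10, 2)

cat("middle: trapezoid - exact =", trap_mid - exact_mid, "\n")    # negative
cat("tail:   trapezoid - exact =", trap_tail - exact_tail, "\n")  # positive
```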

Since the normal distribution has more concave-up regions (i.e. both tails), the overall estimate is too high. As mentioned in another answer, using a higher resolution (i.e. increasing n) helps to minimize the error. You might also get better results using a different method for numerical integration (e.g. Simpson's rule).
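As a rough illustration of the Simpson's rule suggestion, here is a minimal sketch of the composite rule compared with the trapezoidal rule on a piece of the normal density. `simpson` and `trapz` are hypothetical helpers written for this example, and the grid is assumed to be equally spaced with an odd number of points.

```r
# Hypothetical helper: trapezoidal rule on a grid of points.
trapz <- function(x, y) sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)

# Hypothetical helper: composite Simpson's rule on equally spaced points.
# Needs an odd number of points (an even number of intervals).
simpson <- function(x, y) {
  n <- length(x)
  stopifnot(n >= 3, n %% 2 == 1)
  h <- (x[n] - x[1]) / (n - 1)
  w4_idx <- seq(2, n - 1, by = 2)                              # weight 4
  w2_idx <- if (n > 3) seq(3, n - 2, by = 2) else integer(0)   # weight 2
  h / 3 * (y[1] + 4 * sum(y[w4_idx]) + 2 * sum(y[w2_idx]) + y[n])
}

# Integrate the N(10, 2) density from the mean to one sd above it,
# using only five points so the difference between the rules is visible.
x <- seq(10, 12, by = 0.5)
y <- dnorm(x, 10, 2)
exact <- pnorm(12, 10, 2) - pnorm(10, 10, 2)

err_trapz   <- abs(trapz(x, y) - exact)
err_simpson <- abs(simpson(x, y) - exact)
cat("trapezoid error:", err_trapz, "\n")
cat("Simpson error:  ", err_simpson, "\n")
```

On this grid the Simpson error comes out well below the trapezoidal one. The same helper could be applied to the x and y returned by density, provided the number of points is odd.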

However, there is no numerical integration method that is going to give you an exact answer, and even if there were, the return value of density is only an approximation of the real distribution anyway. (And for real data, the true distribution is unknown.)

If all you want is the satisfaction of seeing a known density function integrating to 1, you can use integrate on the normal density function:

> integrate(dnorm, lower=-Inf, upper=Inf, mean=10, sd=2)
1 with absolute error < 4.9e-06

answered Oct 01 '22 19:10

Ryan C. Thompson

Asked by M. Beausoleil; answered by d.b and Ryan C. Thompson.