Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Observation in a bivariate Ellipse

Tags:

r

enter image description hereI am trying find the probability that a point lies within an ellipse? For eg if I was plotting the bivariate data (x,y) for 300 datasets in an 95% ellipsoid region, how do I calculate how many times out of 300 will my points fall inside the ellipse?

Heres the code I am using

   library(MASS)
   seed<-1234
   x<-NULL
   k<-1
   Sigma2 <- matrix(c(.72,.57,.57,.46),2,2)
   Sigma2
   rho <- Sigma2[1,2]/sqrt(Sigma2[1,1]*Sigma2[2,2])
   rho
   eta1<-replicate(300,mvrnorm(k, mu=c(-1.59,-2.44), Sigma2))

   library(car)
   dataEllipse(eta1[1,],eta1[2,], levels=c(0.05, 0.95))

Thanks for your help.

like image 954
user1560215 Avatar asked Jul 12 '13 22:07

user1560215


1 Answers

I don't see why people are jumping on the OP. In context, it's clearly a programming question: it's about getting the empirical frequency of data points within a given ellipse, not a theoretical probability. The OP even posted code and a graph showing what they're trying to obtain.

It may be that they don't fully understand the statistical theory behind a 95% ellipse, but they didn't ask about that. Besides, making plots and calculating frequencies like this is an excellent way of coming to grips with the theory.

Anyway, here's some code that answers the narrowly-defined question of how to count the points within an ellipse obtained via a normal distribution (which is what underlies dataEllipse). The idea is to transform your data to the unit circle via principal components, then get the points within a certain radius of the origin.

within.ellipse <- function(x, y, plot.ellipse=TRUE)
{
    if(missing(y) && is.matrix(x) && ncol(x) == 2)
    {
        y <- x[,2]
        x <- x[,1]
    }

    if(plot.ellipse)
        dataEllipse(x, y, levels=0.95)

    d <- scale(prcomp(cbind(x, y), scale.=TRUE)$x)
    rad <- sqrt(2 * qf(.95, 2, nrow(d) - 1))
    mean(sqrt(d[,1]^2 + d[,2]^2) < rad)
}

It was also commented that a 95% data ellipse contains 95% of the data by definition. This is certainly not true, at least for normal-theory ellipses. If your distribution is particularly bad, the coverage frequency may not even converge to the assumed level as the sample size increases. Consider a generalised pareto distribution, for example:

library(evd) # for rgpd

# generalised pareto has no variance for shape > 0.5
z <- sapply(1:1000, function(...) within.ellipse(rgpd(100, shape=5), rgpd(100, shape=5), FALSE))
mean(z)
[[1] 0.97451

z <- sapply(1:1000, function(...) within.ellipse(rgpd(10000, shape=5), rgpd(10000, shape=5), FALSE))
mean(z)
[1] 0.9995808
like image 187
Hong Ooi Avatar answered Oct 23 '22 08:10

Hong Ooi