I want to create a scatter plot of bivariate normal distribution with 95% "exact" confidence ellipse.
library(mvtnorm)
library(ggplot2)
set.seed(1)
n <- 1e3
c95 <- qchisq(.95, df=2)
rho <- 0.8 #correlation
Sigma <- matrix(c(1, rho, rho, 1), 2, 2) # Covariance matrix
I generated 1000 observations from bivariate normal with mean zero and variance =Sigma
x <- rmvnorm(n, mean=c(0, 0), Sigma)
z <- p95 <- rep(NA, n)
for(i in 1:n){
z[i] <- x[i, ] %*% solve(Sigma, x[i, ])
p95[i] <- (z[i] < c95)
}
We can draw the 95% confidence ellipse on the top of scatterplot of the generated data with ease using stat_ellipse
. Resulting figure is completely satisfactory until you note that the several of the red points lie inside the confidence ellipse. I guess that this discrepancy comes from the estimation of some parameters, and disappears as the sample size gets larger.
data <- data.frame(x, z, p95)
p <- ggplot(data, aes(X1, X2)) + geom_point(aes(colour = p95))
p + stat_ellipse(type = "norm")
Is there any way to fine tune stat_ellipse()
so that it depicts the "exact" confidence ellipse as shown in the figure below which was created using "hand-made" ellips
function?
ellips <- function(center = c(0,0), c=c95, rho=-0.8, npoints = 100){
t <- seq(0, 2*pi, len=npoints)
Sigma <- matrix(c(1, rho, rho, 1), 2, 2)
a <- sqrt(c*eigen(Sigma)$values[2])
b <- sqrt(c*eigen(Sigma)$values[1])
x <- center[1] + a*cos(t)
y <- center[2] + b*sin(t)
X <- cbind(x, y)
R <- eigen(Sigma)$vectors
data.frame(X%*%R)
}
dat <- ellips(center=c(0, 0), c=c95, rho, npoints=100)
p + geom_path(data=dat, aes(x=X1, y=X2), colour='blue')
This is not a real answer, but it might help.
By exploring stat_ellipse
with the following commands,
stat_ellipse
ls(ggplot2:::StatEllipse)
ggplot2:::StatEllipse$calculate
ggplot2:::calculate_ellipse
?cov.wt
it seems that cov.wt
is estimating the covariance matrix from the simulated data:
cov.wt(data[, c(1, 2)])$cov
# X1 X2
# X1 1.1120267 0.8593946
# X2 0.8593946 1.0372208
# True covariance matrix:
Sigma
# [,1] [,2]
# [1,] 1.0 0.8
# [2,] 0.8 1.0
You may consider calculating your p95
values using the estimated covariance matrix. Or just stick with your own well-executed ellipse drawing code.
The ellipse code proposed in the original question is wrong. It works when the X1 and X2 variables have a mean of 0 and a standard deviation of 1, but not in the general case.
Here is an alternative implementation, adapted from the stat_ellipse source code. It takes as argument the vector of means, the covariance matrix, the radius (computed with the confidence level for instance) and the number of segments for the shape.
calculate_ellipse <- function(center, shape, radius, segments){
# Adapted from https://github.com/tidyverse/ggplot2/blob/master/R/stat-ellipse.R
chol_decomp <- chol(shape)
angles <- (0:segments) * 2 * pi/segments
unit.circle <- cbind(cos(angles), sin(angles))
ellipse <- t(center + radius * t(unit.circle %*% chol_decomp))
colnames(ellipse) <- c("X1","X2")
as.data.frame(ellipse)
}
Let's compare both implementations:
library(ggplot2)
library(MASS) # mvrnorm function, to sample multivariate normal variables
set.seed(42)
mu = c(10, 20) # vector of means
rho = -0.7 # correlation coefficient
correlation = matrix(c(1, rho, rho, 1), 2) # correlation matrix
std = c(1, 10) # vector of standard deviations
sigma = diag(std) %*% correlation %*% diag(std) # covariance matrix
N = 1000 # number of points
confidence = 0.95 # confidence level for the ellipse
df = data.frame(mvrnorm(n=N, mu=mu, Sigma=sigma))
radius = sqrt(2 * stats::qf(confidence, 2, Inf)) # radius of the ellipse
ellips <- function(center = c(0,0), c=c95, rho=-0.8, npoints = 100){
# Original proposal
t <- seq(0, 2*pi, len=npoints)
Sigma <- matrix(c(1, rho, rho, 1), 2, 2)
a <- sqrt(c*eigen(Sigma)$values[2])
b <- sqrt(c*eigen(Sigma)$values[1])
x <- center[1] + a*cos(t)
y <- center[2] + b*sin(t)
X <- cbind(x, y)
R <- eigen(Sigma)$vectors
data.frame(X%*%R)
}
calculate_ellipse <- function(center, shape, radius, segments){
# Adapted from https://github.com/tidyverse/ggplot2/blob/master/R/stat-ellipse.R
chol_decomp <- chol(shape)
angles <- (0:segments) * 2 * pi/segments
unit.circle <- cbind(cos(angles), sin(angles))
ellipse <- t(center + radius * t(unit.circle %*% chol_decomp))
colnames(ellipse) <- c("X1","X2")
as.data.frame(ellipse)
}
ggplot(df) +
aes(x=X1, y=X2) +
theme_bw() +
geom_point() +
geom_path(aes(color="new implementation"), data=calculate_ellipse(mu, sigma, radius, 100)) +
geom_path(aes(color="original implementation"), data=ellips(mu, confidence, rho, 100))
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With