Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fine tuning stat_ellipse() in ggplot2

Tags:

r

ggplot2

I want to create a scatter plot of bivariate normal distribution with 95% "exact" confidence ellipse.

library(mvtnorm)
library(ggplot2)
set.seed(1)
n <- 1e3
c95 <- qchisq(.95, df=2) 
rho <- 0.8  #correlation 
Sigma <- matrix(c(1, rho, rho, 1), 2, 2) # Covariance matrix

I generated 1000 observations from bivariate normal with mean zero and variance =Sigma

x <- rmvnorm(n, mean=c(0, 0), Sigma)
z  <- p95 <- rep(NA, n)
for(i in 1:n){
  z[i] <- x[i, ] %*% solve(Sigma, x[i, ])
  p95[i] <- (z[i] < c95)
}

We can draw the 95% confidence ellipse on the top of scatterplot of the generated data with ease using stat_ellipse. Resulting figure is completely satisfactory until you note that the several of the red points lie inside the confidence ellipse. I guess that this discrepancy comes from the estimation of some parameters, and disappears as the sample size gets larger.

data <- data.frame(x, z, p95)
p <- ggplot(data, aes(X1, X2)) + geom_point(aes(colour = p95))
p + stat_ellipse(type = "norm")

Fig.1

Is there any way to fine tune stat_ellipse() so that it depicts the "exact" confidence ellipse as shown in the figure below which was created using "hand-made" ellips function? enter image description here

ellips <- function(center = c(0,0), c=c95, rho=-0.8, npoints = 100){
  t <- seq(0, 2*pi, len=npoints)
  Sigma <- matrix(c(1, rho, rho, 1), 2, 2)
  a <- sqrt(c*eigen(Sigma)$values[2])
  b <- sqrt(c*eigen(Sigma)$values[1])
  x <- center[1] + a*cos(t)
  y <- center[2] + b*sin(t)
  X <- cbind(x, y)
  R <- eigen(Sigma)$vectors
  data.frame(X%*%R)
}
dat <- ellips(center=c(0, 0), c=c95, rho, npoints=100)
p + geom_path(data=dat, aes(x=X1, y=X2), colour='blue')
like image 903
Khashaa Avatar asked Dec 09 '14 15:12

Khashaa


Video Answer


2 Answers

This is not a real answer, but it might help.

By exploring stat_ellipse with the following commands,

stat_ellipse
ls(ggplot2:::StatEllipse)
ggplot2:::StatEllipse$calculate
ggplot2:::calculate_ellipse
?cov.wt

it seems that cov.wt is estimating the covariance matrix from the simulated data:

cov.wt(data[, c(1, 2)])$cov
#           X1        X2
# X1 1.1120267 0.8593946
# X2 0.8593946 1.0372208

# True covariance matrix:
Sigma
#      [,1] [,2]
# [1,]  1.0  0.8
# [2,]  0.8  1.0

You may consider calculating your p95 values using the estimated covariance matrix. Or just stick with your own well-executed ellipse drawing code.

like image 143
bdemarest Avatar answered Oct 18 '22 02:10

bdemarest


The ellipse code proposed in the original question is wrong. It works when the X1 and X2 variables have a mean of 0 and a standard deviation of 1, but not in the general case.

Here is an alternative implementation, adapted from the stat_ellipse source code. It takes as argument the vector of means, the covariance matrix, the radius (computed with the confidence level for instance) and the number of segments for the shape.

calculate_ellipse <- function(center, shape, radius, segments){
# Adapted from https://github.com/tidyverse/ggplot2/blob/master/R/stat-ellipse.R
    chol_decomp <- chol(shape)
    angles <- (0:segments) * 2 * pi/segments
    unit.circle <- cbind(cos(angles), sin(angles))
    ellipse <- t(center + radius * t(unit.circle %*% chol_decomp))
    colnames(ellipse) <- c("X1","X2")
    as.data.frame(ellipse)
}

Let's compare both implementations:

library(ggplot2)
library(MASS)  # mvrnorm function, to sample multivariate normal variables
set.seed(42)
mu = c(10, 20)  # vector of means
rho = -0.7      # correlation coefficient
correlation = matrix(c(1, rho, rho, 1), 2)  # correlation matrix
std = c(1, 10)  # vector of standard deviations
sigma =  diag(std) %*% correlation %*% diag(std)  # covariance matrix
N = 1000  # number of points
confidence = 0.95  # confidence level for the ellipse

df = data.frame(mvrnorm(n=N, mu=mu, Sigma=sigma))

radius = sqrt(2 * stats::qf(confidence, 2, Inf))  # radius of the ellipse

ellips <- function(center = c(0,0), c=c95, rho=-0.8, npoints = 100){
# Original proposal
    t <- seq(0, 2*pi, len=npoints)
    Sigma <- matrix(c(1, rho, rho, 1), 2, 2)
    a <- sqrt(c*eigen(Sigma)$values[2])
    b <- sqrt(c*eigen(Sigma)$values[1])
    x <- center[1] + a*cos(t)
    y <- center[2] + b*sin(t)
    X <- cbind(x, y)
    R <- eigen(Sigma)$vectors
    data.frame(X%*%R)
}

calculate_ellipse <- function(center, shape, radius, segments){
# Adapted from https://github.com/tidyverse/ggplot2/blob/master/R/stat-ellipse.R
    chol_decomp <- chol(shape)
    angles <- (0:segments) * 2 * pi/segments
    unit.circle <- cbind(cos(angles), sin(angles))
    ellipse <- t(center + radius * t(unit.circle %*% chol_decomp))
    colnames(ellipse) <- c("X1","X2")
    as.data.frame(ellipse)
}

ggplot(df) +
    aes(x=X1, y=X2) +
    theme_bw() +
    geom_point() +
    geom_path(aes(color="new implementation"), data=calculate_ellipse(mu, sigma, radius, 100)) +
    geom_path(aes(color="original implementation"), data=ellips(mu, confidence, rho, 100))

enter image description here

like image 2
Tom Cornebize Avatar answered Oct 18 '22 02:10

Tom Cornebize